Merging Multiple TXT Files and Counting Word Frequencies in Python

Published: 2020-08-23 04:53:03  Source: 腳本之家  Views: 483  Author: alpha's blog  Category: Development

The requirement: analyze three English articles and find the 10 words that occur most often.

The logic is straightforward: use Python to read several txt files, write their contents into one new txt file, and then count word frequencies in that new file to get the final result.

The code is as follows (tested on Windows 10 with Python 3.7.4):

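# coding=utf-8

import re
import os

# Source folder; every file in it will be merged
sourceFileDir = 'D:\\Python\\txt\\'
filenames = os.listdir(sourceFileDir)

# Open result.txt for writing, creating it if it does not exist.
# The output could use another extension as well, e.g. result.js
file = open('D:\\Python\\result.txt', 'w')

# Loop over the source files
for filename in filenames:
    filepath = os.path.join(sourceFileDir, filename)
    # Copy the file line by line; the with block closes it afterwards
    with open(filepath) as source:
        for line in source:
            file.write(line)
    # Separate files with a newline so words at a file boundary do not run together
    file.write('\n')

# Close the output file
file.close()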


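# Read the merged text and normalize it
def getTxt():
    # Read the same file that the merge step wrote
    with open('D:\\Python\\result.txt') as f:
        txt = f.read()
    txt = txt.lower()
    # Turn typographic apostrophes into plain ones
    txt = txt.replace('\u2019', '\'')
    # Replace punctuation with spaces; the apostrophe is kept for words like "don't"
    for ch in '!"@#$%^&*()+,-./:;<=>?[\\]_`~{|}':
        txt = txt.replace(ch, ' ')
    return txt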

# 1. Get the normalized text
hamletTxt = getTxt()

# 2. Split into a list, keeping only English words (apostrophes are allowed inside words)
txtArr = re.findall(r"[a-zA-Z']+", hamletTxt)

# 3. Loop over all words and count them
# Common low-value words to skip
forbinArr = ['a.', 'the', 'a', 'i']
counts = {}
for word in txtArr:
    if word not in forbinArr:
        counts[word] = counts.get(word, 0) + 1

# 4. Convert the dict to a list for printing, sorted by count in descending order
countsList = list(counts.items())
countsList.sort(key=lambda x: x[1], reverse=True)

# 5. Print the top 10 (or fewer, if the text has fewer distinct words)
for word, count in countsList[:10]:
    print('{0:<10}{1:>5}'.format(word, count))

The result looks like this:

[Figure: screenshot of the program output]
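
The merge step above reads every entry in the folder and relies on hard-coded paths. As a minimal sketch (not the author's code), the same idea can be wrapped in a function using `pathlib`. The name `merge_txt_files`, its parameters, the choice to pick up only `*.txt` files in sorted order, and the UTF-8 default are all assumptions made for illustration.

```python
from pathlib import Path


def merge_txt_files(source_dir, result_path, encoding="utf-8"):
    """Concatenate every .txt file in source_dir into result_path.

    Hypothetical helper: the name, parameters, and UTF-8 default are
    assumptions for this sketch, not part of the original script.
    Returns the names of the files that were merged.
    """
    source_dir = Path(source_dir)
    result_path = Path(result_path)
    merged = []
    with result_path.open("w", encoding=encoding) as out:
        # sorted() makes the merge order deterministic
        for path in sorted(source_dir.glob("*.txt")):
            # Skip the output file if it lives in the source folder
            if path.resolve() == result_path.resolve():
                continue
            out.write(path.read_text(encoding=encoding))
            # Newline between files so boundary words do not run together
            out.write("\n")
            merged.append(path.name)
    return merged
```

Called as `merge_txt_files('D:\\Python\\txt', 'D:\\Python\\result.txt')`, it would produce the same kind of merged file that the script above writes, while ignoring anything in the folder that is not a `.txt` file.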

Another, simpler way to count word frequencies:

# coding=utf-8
from collections import Counter

# words is the list of tokens read from the text
words = ['a', 'b', 'a', 'c', 'v', '4', ',', 'w', 'y', 'y', 'u', 'y', 'r', 't', 'w']
wordCounter = Counter(words)
print(wordCounter.most_common(10))

# output: [('y', 3), ('a', 2), ('w', 2), ('b', 1), ('c', 1), ('v', 1), ('4', 1), (',', 1), ('u', 1), ('r', 1)]
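
To connect `Counter` with real text rather than a hand-written list, the tokenizing and filtering steps can feed it directly. The following is a minimal sketch, assuming a helper named `top_words` and a small stop-word set; both are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def top_words(text, n=10, stop_words=frozenset({"the", "a", "i"})):
    """Return the n most common words in text as (word, count) pairs.

    Hypothetical helper for illustration; the stop-word set mirrors the
    list used earlier in the article.
    """
    # Lowercase, then keep runs of letters and apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    counter = Counter(w for w in words if w not in stop_words)
    return counter.most_common(n)


sample = "The cat saw a dog. The dog saw the cat, and I saw both."
print(top_words(sample, 3))
# output: [('saw', 3), ('cat', 2), ('dog', 2)]
```

Words with equal counts come back in the order they were first seen, which is why `cat` precedes `dog` here.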

That is all for this article. I hope it helps with your learning.
