<dfn id="l2lsd"></dfn>

溫馨提示×

溫馨提示×

您好，登錄后才能下訂單哦！

密碼登錄×

忘記密碼？

登錄注冊(cè)×

獲取短信驗(yàn)證碼

其他方式登錄

點(diǎn)擊登錄注冊(cè) 即表示同意《億速云用戶服務(wù)條款》

用戶登錄×

賬戶密碼登錄

請(qǐng)使用微信掃描上方二維碼

使用幫助

請(qǐng)求超時(shí)！

請(qǐng)點(diǎn)擊重新獲取二維碼

Python爬取十篇新聞統(tǒng)計(jì)TF-IDF

發(fā)布時(shí)間：2020-09-12 01:11:07 來(lái)源：腳本之家閱讀：192 作者：果7 欄目：開發(fā)技術(shù)

統(tǒng)計(jì)十篇新聞TF-IDF

統(tǒng)計(jì)TF-IDF詞頻，每篇文章的 top10 的高頻詞存儲(chǔ)為 json 文件

TF-IDF

TF-IDF（term frequency–inverse document frequency）是一種用于資訊檢索與文本挖掘的常用加權(quán)技術(shù)。TF-IDF是一種統(tǒng)計(jì)方法，用以評(píng)估一字詞對(duì)于一個(gè)文件集或一個(gè)語(yǔ)料庫(kù)中的其中一份文件的重要程度。字詞的重要性隨著它在文件中出現(xiàn)的次數(shù)成正比增加，但同時(shí)會(huì)隨著它在語(yǔ)料庫(kù)中出現(xiàn)的頻率成反比下降。TF-IDF加權(quán)的各種形式常被搜索引擎應(yīng)用，作為文件與用戶查詢之間相關(guān)程度的度量或評(píng)級(jí)。除了TF-IDF以外，互聯(lián)網(wǎng)上的搜索引擎還會(huì)使用基于連結(jié)分析的評(píng)級(jí)方法，以確定文件在搜尋結(jié)果中出現(xiàn)的順序。
假如一篇文件的總詞語(yǔ)數(shù)是100個(gè)，而詞語(yǔ)“母牛”出現(xiàn)了3次，那么“母?！币辉~在該文件中的詞頻就是3/100=0.03。一個(gè)計(jì)算文件頻率（DF）的方法是測(cè)定有多少份文件出現(xiàn)過(guò)“母?！币辉~，然后除以文件集里包含的文件總數(shù)。所以，如果“母?！币辉~在1,000份文件出現(xiàn)過(guò)，而文件總數(shù)是10,000,000份的話，其逆向文件頻率就是log（10,000,000 / 1,000）=4。最后的TF-IDF的分?jǐn)?shù)為0.03 * 4=0.12。 —— [ 維基百科 ]

博主選擇的是chinadaily的十篇新聞.

1.使用http request請(qǐng)求
2.使用Beautiful Soup來(lái)抓取文章標(biāo)題和內(nèi)容
3.統(tǒng)計(jì)TF-IDF
4.保存到j(luò)son文件中

代碼塊

@requires_authorization
#coding=utf-8

import requests
import bs4
import sys
import math
import json
reload(sys)
sys.setdefaultencoding('utf-8')

url_list = ['http://www.chinadaily.com.cn/china/2016-04/20/content_24701635.htm',
      'http://www.chinadaily.com.cn/china/2016-04/20/content_24700746.htm',
      'http://www.chinadaily.com.cn/china/2016-04/20/content_24681482.htm',
      'http://www.chinadaily.com.cn/china/2016-04/19/content_24675530.htm',
      'http://www.chinadaily.com.cn/china/2016-04/19/content_24675455.htm',
      'http://www.chinadaily.com.cn/china/2016-04/19/content_24674074.htm',
      'http://www.chinadaily.com.cn/china/2016-04/19/content_24655536.htm',
      'http://www.chinadaily.com.cn/china/2016-04/18/content_24643685.htm',
      'http://www.chinadaily.com.cn/china/2016-04/18/content_24636917.htm',
      'http://www.chinadaily.com.cn/china/2016-04/15/content_24562198.htm'
      ]

articles_title = []
articles_content = []

for pos,url in enumerate(url_list):
  r = requests.get(url)
  soup1 = bs4.BeautifulSoup(r.text)
  soup2 = bs4.BeautifulSoup(str(soup1.find_all(id="Title_e")))
  articles_title.append(soup2.h2.string)
  mystr = ""
  soup3 = bs4.BeautifulSoup(str(soup1.find_all(id="Content")))
  for x in soup3.find_all("p"):
    mystr = mystr + x.string

  str_p = ""
  contents = []
  for pos,x in enumerate(mystr):
    if x == '.' or x == ',':
      if pos < (len(mystr) - 1) and mystr[pos+1] >= '0' and mystr[pos+1] <= '9':
        str_p = str_p + x
      elif str_p == "":
        continue
      else:
        contents.append(str_p)
        str_p = ""
    elif x == '(' or x == ')' or x == ' ' or x == '"' or x == '[' or x == ']' or x == '-':
      if str_p == "":
        continue
      else:
        contents.append(str_p)
        str_p = ""
    else:
      str_p = str_p + x

  articles_content.append(contents)

Dict_idf = {}
DictList = []

for content in articles_content:
  Dict_tf = {}
  for x in content:
    if not Dict_tf.has_key(x):
      Dict_tf[x] = 1.0
      if not Dict_idf.has_key(x):
        Dict_idf[x] = 1.0
      else:
        Dict_idf[x] += 1.0
    else:
      Dict_tf[x] += 1.0

  for k, v in Dict_tf.items():
    Dict_tf[k] = v / len(content)

  DictList.append(Dict_tf)

for k, v in Dict_idf.items():
  Dict_idf[k] = math.log(float(len(url_list)) / v)

for pos,x in enumerate(DictList):
  for k,v in x.items():
    DictList[pos][k] = v*Dict_idf[k]
  DictList[pos] = sorted(x.iteritems(), key=lambda d: d[1], reverse=True)

"""
[
  [
    article_titile:"XXXX"
    [
      {
        word:"hello"
        value:3.5
      }
      {
        word:"hello"
        value:3.5
      }
      {
        word:"hello"
        value:3.5
      }
      ...
    ]
  ]
]
"""

data = []
for pos in range(10):
  data2=[]
  data2.append("article_titile:")
  data2.append(articles_title[pos])
  data2.append([{"word": k,"value":round(v,4)} for k,v in DictList[pos][:10]])
  data.append(data2)

# Writing JSON data
with open('data.json', 'w') as f:
  json.dump(data, f)

Python爬取十篇新聞統(tǒng)計(jì)TF-IDF

使用json.cn查看數(shù)據(jù):

Python爬取十篇新聞統(tǒng)計(jì)TF-IDF

github地址：https://github.com/mqsee/learngit

以上就是本文的全部?jī)?nèi)容，希望對(duì)大家的學(xué)習(xí)有所幫助，也希望大家多多支持億速云。

向AI問(wèn)一下細(xì)節(jié)

推薦閱讀：

免責(zé)聲明：本站發(fā)布的內(nèi)容（圖片、視頻和文字）以原創(chuàng)、轉(zhuǎn)載和分享為主，文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng)，如果涉及侵權(quán)請(qǐng)聯(lián)系站長(zhǎng)郵箱：is@yisu.com進(jìn)行舉報(bào)，并提供相關(guān)證據(jù)，一經(jīng)查實(shí)，將立刻刪除涉嫌侵權(quán)內(nèi)容。

上一篇新聞：
Java實(shí)現(xiàn)二維碼功能的實(shí)例代碼
下一篇新聞：
IDEA 開發(fā)多項(xiàng)目依賴的方法(圖文)

猜你喜歡

AI
助
手

產(chǎn)品服務(wù)

地區(qū)劃分

專題活動(dòng)

幫助支持

關(guān)于我們

售后咨詢

7*24小時(shí)在線電話：400-100-2938

7*24小時(shí)在線 QQ：800811969

關(guān)注億速云

億速云公眾號(hào)

手機(jī)網(wǎng)站二維碼

<p id="dexje"></p>

^{<p id="dexje"></p>}