How to Scrape Every Episode's Plot Summary of a TV Series with Python

Published: 2022-01-13 09:40:09 Source: Yisu Cloud Views: 402 Author: iii Category: Big Data

Scraping every episode's plot summary of a TV series with Python involves a few points that many readers find unclear. This article walks through a complete example and then explains the two key steps: finding each episode's URL and extracting the plot text.

【Sample code】

# coding=utf-8
# @Author : 鵬哥賊優秀
# @Date : 2019/8/7
from bs4 import BeautifulSoup
import requests
# getheader is the author's own helper module (not shown in this article);
# getheaders() is expected to return a dict of HTTP request headers.
import getheader


# Get the title of each episode and the relative URL (href) of its page.
def get_title():
    url = "https://www.tvsou.com/storys/0d884ba0dd/"
    headers = getheader.getheaders()
    r = requests.get(url, headers=headers)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "lxml")  # the "lxml" parser requires the lxml package
    temps = soup.find("ul", class_="m-l14 clearfix episodes-list teleplay-lists").find_all("li")
    tempurllist = []
    titlelist = []
    for temp in temps:
        tempurl = temp.a.get("href")
        title = temp.a.get("title")
        tempurllist.append(tempurl)
        titlelist.append(title)
    return tempurllist, titlelist


# Download the plot of The Longest Day in Chang'an (長安十二時辰) from episode x onward.
# By default, start from the first episode.
def Changan(episode=1):
    tempurllist_b, titlelist_b = get_title()
    tempurllist = tempurllist_b[(episode - 1):]
    titlelist = titlelist_b[(episode - 1):]
    baseurl = "https://www.tvsou.com"
    for i, tempurl in enumerate(tempurllist):
        print("Downloading episode {0}".format(i + episode))
        url = baseurl + tempurl
        r = requests.get(url, headers=getheader.getheaders())
        r.encoding = "utf-8"
        soup = BeautifulSoup(r.text, "lxml")
        result = soup.find("pre", class_="font-16 color-3 mt-20 pre-content").find_all("p")
        content = []
        for temp in result:
            # Skip <p> tags without a single text node, such as the empty <p></p>.
            if temp.string:
                content.append(temp.string)
        # Append to the output file; write as UTF-8 so Chinese text is saved correctly.
        with open("test.txt", "a", encoding="utf-8") as f:
            f.write(titlelist[i] + "\n")
            f.writelines(content)
            f.write("\n")


if __name__ == "__main__":
    Changan(43)  # download from episode 43 onward
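The script imports `getheader`, a module the article does not show. The following is a minimal sketch of what such a helper might look like, assuming it only needs to return a headers dict with a browser-like `User-Agent`. The file name `getheader.py`, the `USER_AGENTS` list, and the User-Agent strings themselves are illustrative assumptions, not the author's actual code.

```python
# getheader.py: hypothetical stand-in for the helper module the script imports.
import random

# Example browser User-Agent strings (assumed values; any realistic ones would do).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.1 Safari/605.1.15",
]


def getheaders():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


if __name__ == "__main__":
    print(getheaders())
```

Saved next to the main script, this would let `getheader.getheaders()` be passed directly as the `headers` argument of `requests.get`.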

【Result】

[Figure: screenshot of the result]

【Key points】

1. How do we automatically get the URL of each episode?

Start by inspecting the response for the first episode's page. It contains a block of information about every episode, as shown below:

[Figure: screenshot of the episode list in the response]

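In this response, each episode has its own href, and part of the first episode's URL, "https://www.tvsou.com/storys/0d884ba0dd/", matches the href listed for that episode. Checking the second episode's URL confirms the same pattern: its URL also corresponds to its href. So the URL of every episode can be built automatically by reading the href values from this list and appending each one to the site's base address, "https://www.tvsou.com".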
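The idea can be tried offline with a small sketch. The HTML sample below is made up for illustration: only the `<ul>` class names and the first href (inferred from the first episode's URL) come from the article, and the second href is hypothetical. The sketch also swaps in Python's built-in `html.parser` instead of `lxml` and uses `urljoin` rather than string concatenation; the function name `episode_links` is an assumption.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.tvsou.com"

# Made-up sample of the episode list; the second href is hypothetical.
SAMPLE_HTML = """
<ul class="m-l14 clearfix episodes-list teleplay-lists">
  <li><a href="/storys/0d884ba0dd/" title="Episode 1">1</a></li>
  <li><a href="/storys/hypothetical-ep2/" title="Episode 2">2</a></li>
</ul>
"""


def episode_links(html, base_url=BASE_URL):
    """Return (absolute_url, title) pairs for every link in the episode list."""
    soup = BeautifulSoup(html, "html.parser")
    # Matching on one class name still matches a tag that carries several classes.
    episode_list = soup.find("ul", class_="episodes-list")
    if episode_list is None:  # layout changed, or the request was blocked
        return []
    links = []
    for anchor in episode_list.find_all("a", href=True):
        links.append((urljoin(base_url, anchor["href"]), anchor.get("title", "")))
    return links


links = episode_links(SAMPLE_HTML)
for url, title in links:
    print(title, url)
```

Matching on the single class `episodes-list` is a design choice: it is less brittle than the full class string, which only matches if the site keeps the classes in exactly that order.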

2. How do we scrape the plot text of each episode?

Taking the first episode as an example, the response contains the following content.

[Figure: screenshot of the plot content in the response]

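The plot text sits inside the tag whose class is "font-16 color-3 mt-20 pre-content" (a pre element, as the code shows). That tag contains several p tags, each holding one paragraph of the plot, so the text has to be extracted from every p tag. The first p tag is an empty <p></p>, so a non-empty check is also needed before keeping a paragraph.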
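A minimal sketch of this extraction step, run against made-up HTML (only the `<pre>` class names come from the article). It uses `get_text()` instead of the script's `.string`, because `.string` is `None` not only for an empty `<p></p>` but also for a paragraph that contains nested tags. The function name `plot_paragraphs` and the built-in `html.parser` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample of one episode page; the paragraph text is invented.
SAMPLE_HTML = """
<pre class="font-16 color-3 mt-20 pre-content">
<p></p>
<p>First paragraph of the plot.</p>
<p>Second paragraph, <b>with</b> inline markup.</p>
</pre>
"""


def plot_paragraphs(html):
    """Return the non-empty paragraph texts found inside the plot container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("pre", class_="pre-content")
    if container is None:  # container missing: nothing to extract
        return []
    paragraphs = []
    for p in container.find_all("p"):
        text = p.get_text().strip()
        if text:  # skips the empty <p></p>
            paragraphs.append(text)
    return paragraphs


paragraphs = plot_paragraphs(SAMPLE_HTML)
for paragraph in paragraphs:
    print(paragraph)
```

With `.string`, the third paragraph in this sample would be dropped because of the nested `<b>` tag; whether the real pages contain such markup is not stated in the article.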

That covers how to scrape every episode's plot summary of a TV series with Python. For more on related topics, follow the Yisu Cloud industry news channel.
