A Simple Example of Scraping a Web Novel with a Python Crawler

Published: 2020-11-21 13:37:48 | Source: 億速云 (Yisu Cloud) | Views: 266 | Author: 小新 | Category: Programming Languages

This article walks through a simple example of scraping a web novel with a Python crawler. The explanation is fairly detailed and may serve as a useful reference.

The example uses a novel picked at random from 17K小說網 (www.17k.com), titled 《千萬大獎》.

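The script consists of three functions:

1. get_download_url() collects the URLs of all chapters of the novel.

Inspecting the HTML source of the novel's catalog page, http://www.17k.com/list/2819620.html, shows that the table of contents is a set of <a> tags inside an element with class Volume. The function extracts those links into the urls list.

2. get_contents(target) fetches the body text of a given chapter.

Inspecting the page of the first chapter, http://www.17k.com/chapter/2819620/34988369.html, shows that the body text sits in a div with class p and the chapter title in an h2 heading inside the div with class "readAreaBox content". After line breaks and leading whitespace are cleaned up, the function returns the title and the body text. Its argument is one of the URLs produced by the previous function.

3. writer(name, path, text) appends the chapter title and body text to 千萬大獎.txt.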

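In theory, this simple crawler can scrape any novel on the site: only the catalog URL in target needs to change, provided the other book's pages use the same structure.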
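As a rough illustration of what switching to another novel would involve, the sketch below builds a catalog URL from a book ID and resolves chapter links against the site root. It assumes that other books follow the same /list/<id>.html pattern as the single example URL, which the article does not confirm. The helper names catalog_url and chapter_url are hypothetical, and urljoin from the standard library stands in for plain string concatenation.

```python
from urllib.parse import urljoin

SERVER = 'http://www.17k.com'


def catalog_url(book_id):
    # Assumed pattern, inferred from the one example URL in the article
    return f'{SERVER}/list/{book_id}.html'


def chapter_url(href):
    # urljoin resolves root-relative links and leaves absolute URLs untouched
    return urljoin(SERVER, href)


print(catalog_url(2819620))
print(chapter_url('/chapter/2819620/34988369.html'))
```

Compared with server + href, urljoin also copes with links that are already absolute, which may or may not occur on the real catalog page.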

import requests
from bs4 import BeautifulSoup

target = 'http://www.17k.com/list/2819620.html'  # catalog page of the novel
server = 'http://www.17k.com'                    # prefix for the relative chapter links
# Footer line that marks the end of the chapter body; it must match the page text exactly
end_marker = '本書首發來自17K小說網，第一時間看正版內容！'
urls = []


def get_download_url():
    """Collect the URL of every chapter listed on the catalog page."""
    req = requests.get(url=target)
    bf = BeautifulSoup(req.text, 'lxml')
    # Chapter links are the <a> tags inside the first <dl class="Volume">
    volume = bf.find_all('dl', class_='Volume')
    a = volume[0].find_all('a')
    # The original skips the first link (presumably not a chapter link)
    for each in a[1:]:
        urls.append(server + each.get('href'))


def get_contents(target):
    """Return (title, body text) for the chapter page at target."""
    req = requests.get(url=target)
    bf = BeautifulSoup(req.text, 'lxml')
    # The chapter title is the <h2> inside <div class="readAreaBox content">
    box = bf.find_all('div', class_='readAreaBox content')
    title = box[0].find_all('h2')
    title = str(title[0]).replace('<h2>', '').replace('</h2>', '')
    title = title.replace(' ', '').replace('\n', '')
    # The body is the <div class="p">; turn <br/> tags into line breaks
    texts = bf.find_all('div', class_='p')
    texts = str(texts).replace('<br/>', '\n')
    # Cut everything from the footer line onward; find() avoids the
    # ValueError that index() raises when the line is missing
    end = texts.find(end_marker)
    if end != -1:
        texts = texts[:end]
    # Strip the indentation that precedes each paragraph
    texts = texts.replace('                                        　　', '')
    # Drop the leading '[<div class="p">' left over from str() of the result list
    return title, texts[len('[<div class="p">'):]


def writer(name, path, text):
    """Append the chapter title and body text to the file at path."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.write(text)
        f.write('\n')


get_download_url()
for i, url in enumerate(urls, start=1):
    title, content = get_contents(url)
    writer(title, "千萬大獎.txt", content)
    # Print progress as the percentage of chapters written so far
    print(str(int(i / len(urls) * 100)) + "%")
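The string handling in get_contents can be tried out without touching the network. The following is a minimal sketch on a made-up HTML fragment, not the site's real markup; clean_body, END_MARKER, and the sample text are all hypothetical. It uses str.partition, which returns the whole string unchanged when the marker is missing, whereas str.index raises ValueError in that case.

```python
END_MARKER = '[END OF CHAPTER]'  # stand-in for the footer line used in the script


def clean_body(raw_html, marker=END_MARKER):
    # Turn <br/> tags into line breaks, as the script does
    text = raw_html.replace('<br/>', '\n')
    # Keep only what precedes the marker; if it is absent, everything is kept
    text, _, _ = text.partition(marker)
    # Trim whitespace around each line and drop empty lines
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


sample = '  First paragraph.<br/>  Second paragraph.<br/>[END OF CHAPTER]<br/>site footer'
print(clean_body(sample))
```

Stripping each line replaces the exact-match removal of a fixed run of spaces, so the cleanup no longer depends on how many spaces the page happens to use for indentation.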

That is the complete example of scraping a web novel with a simple Python crawler. Thanks for reading; we hope it proves helpful.
