How to scrape the first 20 pages of jokes from a joke site with Python regular expressions

Published: 2021-08-21 14:44:17 Source: Yisu Cloud Views: 216 Author: 小新 Column: Development Technology

This article walks through, step by step, how to use Python regular expressions to scrape the first 20 pages of jokes from a joke site.

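As before, start by inspecting the site's traffic with the Chrome developer tools. The findings are as follows:

Site address: http://www.budejie.com/text

All of the site's data is rendered directly in HTML pages. The default URL is page 1, http://www.budejie.com/text/2 is page 2, and so on.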
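The pagination rule can be sketched as a small helper that builds the URL for each page. The names `BASE_URL` and `build_page_urls` are made up for this illustration; the sketch assumes, as the spider code in this article does, that page 1 is also reachable at `/text/1`.

```python
# URL template taken from the article; {} is filled with the page number.
BASE_URL = "http://www.budejie.com/text/{}"


def build_page_urls(first_page=1, last_page=20):
    """Return the page URLs from first_page to last_page, inclusive."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


if __name__ == "__main__":
    urls = build_page_urls()
    print(urls[0])    # http://www.budejie.com/text/1
    print(urls[-1])   # http://www.budejie.com/text/20
    print(len(urls))  # 20
```

Building the list up front keeps the page range in one place, so changing how many pages are scraped only requires changing the arguments.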

Inspecting where the jokes sit in the page shows that the text of each joke is inside an a tag.


There are still pitfalls. This is the regex I wrote first:

content_list = re.findall(r'<a href="/detail-.*">(.+?)</a>', html_str)

It turned out to match some recommended content as well, because those links use the same /detail- href pattern. I then anchored the regex on the enclosing div, as shown below, and the problem went away. I won't go into regex basics here.

content_list = re.findall(r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>', html_str)
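The difference between the two patterns can be seen on a small sample. The HTML below is invented for illustration and only mimics the structure described above (the `recommend` class name and the link targets are assumptions, not taken from the real site); the two patterns follow the regexes above, reduced to the `href` attribute.

```python
import re

# Made-up HTML that imitates the page structure: two jokes and one recommended link.
SAMPLE_HTML = """
<div class="j-r-list-c-desc">
    <a href="/detail-1001.html">First joke text</a>
</div>
<div class="recommend">
    <a href="/detail-2002.html">Recommended item</a>
</div>
<div class="j-r-list-c-desc">
    <a href="/detail-1003.html">Second joke text</a>
</div>
"""

# First attempt: matches any link whose href starts with /detail-.
LOOSE = r'<a href="/detail-.*">(.+?)</a>'
# Second attempt: the link must directly follow the joke container div.
STRICT = r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>'

loose_matches = re.findall(LOOSE, SAMPLE_HTML)
strict_matches = re.findall(STRICT, SAMPLE_HTML)

if __name__ == "__main__":
    print(loose_matches)   # ['First joke text', 'Recommended item', 'Second joke text']
    print(strict_matches)  # ['First joke text', 'Second joke text']
```

Because `.` does not match a newline by default, the greedy `.*` in the `href` stays on the link's own line, while `\s*` is what lets the pattern cross the line break between the div and the link.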

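The goal now is to scrape the first 20 pages of jokes and save them locally. With the pagination rule and the regex for the content both known, the code can be written directly.

The code is below. The overall approach is the same as in my previous two crawler posts, written in an object-oriented style: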

import requests
import re
import json

class NeihanSpider(object):
    """Neihan jokes from budejie.com: scrape one page of data with a regex."""

    def __init__(self):
        # URL template; {} is the placeholder for the page number
        self.temp_url = 'http://www.budejie.com/text/{}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }

    def pass_url(self, url):  # send the request and return the response body
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()  # decode the bytes as UTF-8

    def get_first_page_content_list(self, html_str):  # extract the jokes from one page
        # (.+?) is a non-greedy match for the joke text
        content_list = re.findall(r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>', html_str)
        return content_list

    def save_content_list(self, content_list):
        # append mode, so each page is added to the end of the same file
        with open('neihan.txt', 'a', encoding='utf-8') as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write('\n')  # one joke per line
            print('Saved one page!')

    def run(self):  # main logic
        for i in range(20):  # only scrape the first 20 pages
            # 1. build the URL
            # 2. send the request and get the response
            html_str = self.pass_url(self.temp_url.format(i + 1))
            # 3. extract the data
            content_list = self.get_first_page_content_list(html_str)
            # 4. save
            self.save_content_list(content_list)


if __name__ == '__main__':
    neihan = NeihanSpider()
    neihan.run()
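Since `save_content_list` writes each joke as a JSON-encoded string on its own line, the file can be read back with `json.loads` line by line. The following is a minimal sketch of that; `load_content_list` and the sample file name are hypothetical names introduced here, and the demo writes its own small sample rather than relying on a real `neihan.txt`.

```python
import json
import os
import tempfile


def load_content_list(path):
    """Read a file holding one JSON-encoded string per line and return a list."""
    contents = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                contents.append(json.loads(line))
    return contents


if __name__ == "__main__":
    # Write a small sample in the same format the spider's save step produces.
    sample = ["first joke", "second joke"]
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "neihan_sample.txt")
        with open(path, "a", encoding="utf-8") as f:
            for content in sample:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write("\n")
        print(load_content_list(path))  # ['first joke', 'second joke']
```

Keeping one JSON value per line means the output can be processed incrementally, and the quoting added by `json.dumps` is removed again on load.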

