How to Implement a Multithreaded Crawler in Python Using Queues

Published: 2020-07-27 17:04:47 | Source: 億速云 | Views: 239 | Author: 小豬 | Category: Development

This article walks through a multithreaded crawler in Python built on queues, for readers who want to learn the technique.

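Note: the example crawls jokes from 糗事百科 (Qiushibaike) using queues and multiple threads. The key calls are Queue.task_done() and Queue.join(): each worker calls task_done() after it finishes an item, and join() blocks the main thread until every item put on the queue has been marked done, so the program does not exit before the work is finished.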
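The interplay between put(), get(), task_done(), and join() is easier to see in isolation. The following is a minimal sketch, not part of the crawler; the names `worker` and `square_all` are hypothetical and chosen only for illustration.

```python
import threading
from queue import Queue


def worker(tasks, results):
    """Consume numbers forever; as a daemon, the thread ends with the program."""
    while True:
        n = tasks.get()          # blocks until an item is available
        results.append(n * n)    # a single list.append is thread-safe in CPython
        tasks.task_done()        # exactly one task_done() per completed get()


def square_all(numbers, worker_count=3):
    tasks = Queue()
    results = []
    for n in numbers:
        tasks.put(n)             # each put() raises the unfinished-task count
    for _ in range(worker_count):
        threading.Thread(target=worker, args=(tasks, results), daemon=True).start()
    tasks.join()                 # returns once every put() has a matching task_done()
    return sorted(results)


if __name__ == "__main__":
    print(square_all(range(5)))  # [0, 1, 4, 9, 16]
```

Because the items are put on the queue before join() is called, join() cannot return early; if a worker skipped task_done(), join() would block forever.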

The code is as follows:

import requests
from lxml import etree
import json
from queue import Queue
import threading

class Qsbk(object):
  def __init__(self):
    self.headers = {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
      "Referer": "https://www.qiushibaike.com/"
    }
    # Create three queues: page URLs, parsed HTML trees, and extracted items
    self.url_queue = Queue()
    self.html_queue = Queue()
    self.content_queue = Queue()

  def get_total_url(self):
    """
    Generate the URL of every page and put it on url_queue.
    """
    url_temp = "https://www.qiushibaike.com/text/page/{}/"
    for i in range(1,13):
      # Put each generated URL (pages 1 through 12) on url_queue
      self.url_queue.put(url_temp.format(i))

  def parse_url(self):
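    """
    Send the request, read the response, and parse the HTML with etree.
    """
    # Loop forever: get() blocks while the queue is empty, and the daemon
    # thread ends when the main thread exits. (Queue.not_empty is an internal
    # Condition object that is always truthy, so it is not an emptiness test.)
    while True: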

      # Take one URL from the queue
      url = self.url_queue.get()
      print("parsing url:",url)
      # Send the request
      response = requests.get(url,headers=self.headers,timeout=10)
      # Decode the response body into an HTML string
      html = response.content.decode()
      # Parse the string into an lxml element tree
      html = etree.HTML(html)
      # Put the parsed tree on html_queue
      self.html_queue.put(html)
      # task_done() tells url_queue that the item fetched by get() has been fully processed
      self.url_queue.task_done()

  def get_content(self):
    """
    Parse each page and extract the fields of interest.
    """
    # get() blocks while html_queue is empty; the daemon thread ends with the main thread
    while True:
      items = list()
      html = self.html_queue.get()
      total_div = html.xpath("//div[@class='col1 old-style-col1']/div")
      for i in total_div:

        author_img = i.xpath(".//a[@rel='nofollow']/img/@src")
        # Assumes src is protocol-relative ("//host/path"), so the scheme needs its colon
        author_img = "https:"+author_img[0] if len(author_img)>0 else None

        author_name = i.xpath(".//a[@rel='nofollow']/img/@alt")
        author_name = author_name[0] if len(author_name)>0 else None

        author_href = i.xpath("./a/@href")
        author_href = "https://www.qiushibaike.com/"+author_href[0] if len(author_href)>0 else None

        author_gender = i.xpath("./div[1]/div/@class")
        author_gender = author_gender[0].split(" ")[-1].replace("Icon","").strip() if len(author_gender)>0 else None

        author_age = i.xpath("./div[1]/div/text()")
        author_age = author_age[0] if len(author_age)>0 else None

        content = i.xpath("./a/div/span/text()")
        content = content[0].strip() if len(content)>0 else None

        content_vote = i.xpath("./div[@class='stats']/span[@class='stats-vote']/i/text()")
        content_vote = content_vote[0] if len(content_vote)>0 else None

        content_comment_numbers = i.xpath("./div[@class='stats']/span[@class='stats-comments']/a/i/text()")
        content_comment_numbers = content_comment_numbers[0] if len(content_comment_numbers)>0 else None

        item = {
          "author_name":author_name,
          "author_age" :author_age,
          "author_gender":author_gender,
          "author_img":author_img,
          "author_href":author_href,
          "content":content,
          "content_vote":content_vote,
          "content_comment_numbers":content_comment_numbers,
        }
        items.append(item)
      self.content_queue.put(items)
      # Each task_done() decrements the queue's count of unfinished tasks
      self.html_queue.task_done()

  def save_items(self):
    """
    Append the extracted items to a file.
    """
    # get() blocks while content_queue is empty; the daemon thread ends with the main thread
    while True:
      items = self.content_queue.get()
      with open("quishibaike.txt",'a',encoding='utf-8') as f:
        for i in items:
          json.dump(i,f,ensure_ascii=False,indent=2)
      self.content_queue.task_done()

  def run(self):
    # One thread that generates the page URLs
    thread_list = list()
    thread_url = threading.Thread(target=self.get_total_url)
    thread_list.append(thread_url)

    # Ten threads that send the network requests
    for i in range(10):
      thread_parse = threading.Thread(target=self.parse_url)
      thread_list.append(thread_parse)

    # One thread that extracts the data
    thread_get_content = threading.Thread(target=self.get_content)
    thread_list.append(thread_get_content)

    # One thread that saves the results
    thread_save = threading.Thread(target=self.save_items)
    thread_list.append(thread_save)


    for t in thread_list:
      # Mark each thread as a daemon so it is killed when the main thread exits
      t.daemon = True
      t.start()

    # Block the main thread until every item in each queue has been marked done
    self.url_queue.join()
    self.html_queue.join()
    self.content_queue.join()


if __name__=="__main__":
  obj = Qsbk()
  obj.run()
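The worker loops in the listing never return on their own; they depend on being daemon threads that are killed when the main thread exits. A common alternative is to push a sentinel value through the queues so that each stage shuts down cleanly. The sketch below illustrates that idea under simplifying assumptions: one thread per stage, a fake "fetch" step in place of a real request, and hypothetical names (`STOP`, `stage`, `run_pipeline`) that do not appear in the article's code.

```python
import threading
from queue import Queue

STOP = object()  # hypothetical sentinel marking the end of the stream


def stage(inbox, outbox, func):
    """Apply func to each item from inbox until STOP arrives, then forward STOP."""
    while True:
        item = inbox.get()
        if item is STOP:
            if outbox is not None:
                outbox.put(STOP)   # tell the next stage to finish as well
            break
        result = func(item)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(pages):
    url_queue, html_queue = Queue(), Queue()
    saved = []
    # Stand-in for the request step: wraps the input instead of calling requests.get
    fetch = threading.Thread(
        target=stage, args=(url_queue, html_queue, lambda u: "<html>%s</html>" % u)
    )
    # Stand-in for the save step: collects results in a list instead of a file
    save = threading.Thread(target=stage, args=(html_queue, None, saved.append))
    fetch.start()
    save.start()
    for page in pages:
        url_queue.put(page)
    url_queue.put(STOP)            # no more work will arrive
    fetch.join()                   # Thread.join(): wait for the thread itself to end
    save.join()
    return saved


if __name__ == "__main__":
    print(run_pipeline(["page1", "page2"]))
```

With several workers reading from the same queue, as in the ten request threads above, one sentinel per worker would be needed so that each of them sees its own stop signal.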

After reading the above, you should have a better understanding of how to implement a multithreaded crawler in Python using queues. To learn more, follow the 億速云 industry news channel.
