A Summary of Python Crawler Performance

Published: 2020-08-03 13:41:49  Source: 億速云  Views: 153  Author: 小豬  Category: Development

This article surveys the performance of common Python crawler patterns. The material is short and approachable; working through it should leave you with a practical feel for each option.

Here we use the task of fetching web pages to build up an understanding of crawler performance step by step.

Suppose we have a list of URLs whose data we need to fetch. The first idea that comes to mind is a loop.

Simple serial loop

This approach is the slowest: the requests run one after another, so the total time is the sum of every individual request.
The code is as follows:

import requests

url_list = [
  'http://www.baidu.com',
  'http://www.pythonsite.com',
  'http://www.cnblogs.com/'
]

for url in url_list:
  result = requests.get(url)
  print(result.text)
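To make that cost concrete, here is a minimal timing sketch of the serial pattern. It uses time.sleep as a stand-in for network latency (an assumption made so the sketch needs no real URLs; fake_fetch and the example hostnames are hypothetical):

```python
import time

def fake_fetch(url, latency=0.1):
  # stand-in for requests.get: just block for `latency` seconds
  time.sleep(latency)
  return url

url_list = ['http://a.example', 'http://b.example', 'http://c.example']

start = time.perf_counter()
for url in url_list:
  fake_fetch(url)
serial_elapsed = time.perf_counter() - start
# the serial loop pays the sum of all latencies: roughly 3 * 0.1s here
print('serial: %.2fs' % serial_elapsed)
```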

Using a thread pool

With a thread pool, the total elapsed time is roughly that of the slowest single request, which is much faster than the serial loop.

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_request(url):
  result = requests.get(url)
  print(result.text)

url_list = [
  'http://www.baidu.com',
  'http://www.bing.com',
  'http://www.cnblogs.com/'
]
pool = ThreadPoolExecutor(10)

for url in url_list:
  #grab a thread from the pool; the thread runs fetch_request
  pool.submit(fetch_request,url)

pool.shutdown(True)
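The same sleep-based sketch shows the thread-pool speedup: the total time collapses to roughly the slowest single task (again with time.sleep as a hypothetical stand-in for the request):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, latency=0.1):
  # stand-in for requests.get
  time.sleep(latency)
  return url

url_list = ['http://a.example', 'http://b.example', 'http://c.example']

start = time.perf_counter()
pool = ThreadPoolExecutor(10)
for url in url_list:
  pool.submit(fake_fetch, url)
pool.shutdown(True)  # block until every submitted task has finished
pooled_elapsed = time.perf_counter() - start
# the three 0.1s sleeps overlap, so this is close to 0.1s, not 0.3s
print('pooled: %.2fs' % pooled_elapsed)
```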

Thread pool + callback function

Here we define a callback function, callback:

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
  response = requests.get(url)

  return response


def callback(future):
  print(future.result().text)


url_list = [
  'http://www.baidu.com',
  'http://www.bing.com',
  'http://www.cnblogs.com/'
]

pool = ThreadPoolExecutor(5)

for url in url_list:
  v = pool.submit(fetch_async,url)
  #register the callback; it fires when the future completes
  v.add_done_callback(callback)

pool.shutdown()
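One detail the example above glosses over: future.result() inside the callback re-raises any exception from the worker, so a defensive callback checks future.exception() first. A minimal sketch of that pattern, reusing the fetch_async/callback names but with a fabricated failure (the URLs and the "fetch" body are for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_async(url):
  # hypothetical fetch: fail for one URL instead of doing real network I/O
  if 'bad' in url:
    raise ValueError('cannot reach %s' % url)
  return 'body of %s' % url

results = []

def callback(future):
  exc = future.exception()  # None if the task succeeded
  if exc is not None:
    results.append('failed: %s' % exc)
  else:
    results.append(future.result())

pool = ThreadPoolExecutor(5)
for url in ['http://ok.example', 'http://bad.example']:
  v = pool.submit(fetch_async, url)
  v.add_done_callback(callback)
pool.shutdown()  # waits, so both callbacks have run by now

print(results)
```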

Using a process pool

A process pool likewise makes the total time that of the slowest request, but processes cost more resources than threads, and since fetching a URL is an I/O-bound operation, the thread pool is the better fit here.

import requests
from concurrent.futures import ProcessPoolExecutor

def fetch_request(url):
  result = requests.get(url)
  print(result.text)

url_list = [
  'http://www.baidu.com',
  'http://www.bing.com',
  'http://www.cnblogs.com/'
]
pool = ProcessPoolExecutor(10)

for url in url_list:
  #grab a worker from the pool; a child process runs fetch_request
  pool.submit(fetch_request,url)

pool.shutdown(True)

Process pool + callback function

This behaves the same as the thread pool + callback version; the difference is that spawning processes wastes more resources than spawning threads.

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
  response = requests.get(url)

  return response


def callback(future):
  print(future.result().text)


url_list = [
  'http://www.baidu.com',
  'http://www.bing.com',
  'http://www.cnblogs.com/'
]

pool = ProcessPoolExecutor(5)

for url in url_list:
  v = pool.submit(fetch_async, url)
  # register the callback
  v.add_done_callback(callback)

pool.shutdown()

Mainstream ways to achieve concurrency on a single thread

  1. asyncio
  2. gevent
  3. Twisted
  4. Tornado
     

Below are example implementations of each of the four:

asyncio example 1:

import asyncio


@asyncio.coroutine #mark the function as a coroutine with this decorator
def func1():
  print('before...func1......')
  # must use yield from here, and it must be asyncio.sleep, not time.sleep
  yield from asyncio.sleep(2)
  print('end...func1......')


tasks = [func1(), func1()]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

Running this prints both "before" lines at once, then after 2 seconds prints both "end" lines.
asyncio itself does not ship a method for sending HTTP requests, but we can build the HTTP request by hand at the yield from point.
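Note that @asyncio.coroutine and yield from-style coroutines were removed in Python 3.11, so example 1 no longer runs on current interpreters. A sketch of the same example in modern async/await syntax (an equivalent rewrite, not part of the original post):

```python
import asyncio


async def func1():
  print('before...func1......')
  # must await asyncio.sleep, not call time.sleep, or the event loop blocks
  await asyncio.sleep(2)
  print('end...func1......')


async def main():
  # run both coroutines concurrently, like asyncio.gather(*tasks) above
  await asyncio.gather(func1(), func1())


asyncio.run(main())
```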

asyncio example 2:

import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
  print("----",host, url)
  reader, writer = yield from asyncio.open_connection(host, 80)

  #build the raw HTTP request
  request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
  request_header_content = bytes(request_header_content, encoding='utf-8')
  #send the request
  writer.write(request_header_content)
  yield from writer.drain()
  text = yield from reader.read()
  print(host, url, text)
  writer.close()

tasks = [
  fetch_async('www.cnblogs.com', '/zhaof/'),
  fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

asyncio + aiohttp example:

import aiohttp
import asyncio


@asyncio.coroutine
def fetch_async(url):
  print(url)
  # note: calling aiohttp.request as a bare coroutine is the old (pre-1.0) aiohttp API;
  # current aiohttp versions make requests through a ClientSession instead
  response = yield from aiohttp.request('GET', url)
  print(url, response)
  response.close()


tasks = [fetch_async('http://baidu.com/'), fetch_async('http://www.chouti.com/')]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()

asyncio + requests example:

import asyncio
import requests


@asyncio.coroutine
def fetch_async(func, *args):
  loop = asyncio.get_event_loop()
  future = loop.run_in_executor(None, func, *args)
  response = yield from future
  print(response.url, response.content)


tasks = [
  fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
  fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
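The trick above is that run_in_executor hands the blocking requests.get call to a thread pool, so the coroutines overlap. A self-contained sketch of the same pattern in modern syntax, with time.sleep standing in for the blocking request (blocking_fetch and the 'u1'..'u3' URLs are hypothetical, chosen so the sketch needs no network):

```python
import asyncio
import time

def blocking_fetch(url):
  # stand-in for a blocking requests.get call
  time.sleep(0.1)
  return url

async def fetch_async(url):
  loop = asyncio.get_running_loop()
  # hand the blocking call to the default thread-pool executor
  return await loop.run_in_executor(None, blocking_fetch, url)

async def main():
  return await asyncio.gather(*(fetch_async(u) for u in ['u1', 'u2', 'u3']))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, '%.2fs' % elapsed)  # the three 0.1s sleeps overlap
```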

gevent + requests example:

import gevent

import requests
from gevent import monkey

monkey.patch_all()


def fetch_async(method, url, req_kwargs):
  print(method, url, req_kwargs)
  response = requests.request(method=method, url=url, **req_kwargs)
  print(response.url, response.content)

# ##### Send requests #####
gevent.joinall([
  gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
  gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
  gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### Send requests (a greenlet pool caps the number of concurrent greenlets) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#   pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#   pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#   pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])

grequests example
This library wraps requests and gevent together:

import grequests


request_list = [
  grequests.get('http://httpbin.org/delay/1', timeout=0.001),
  grequests.get('http://fakedomain/'),
  grequests.get('http://httpbin.org/status/500')
]


# ##### Run and collect the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)


# ##### Run and collect the responses (with exception handling) #####
# def exception_handler(request, exception):
#   print(request, exception)
#   print("Request failed")

# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)

twisted example

#getPage plays the role of the requests module, defer provides the deferred return values, and reactor runs the event loop
#(note: getPage is deprecated in current Twisted releases in favor of twisted.web.client.Agent)
from twisted.web.client import getPage, defer
from twisted.internet import reactor

def all_done(arg):
  reactor.stop()

def callback(contents):
  print(contents)

deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
  deferred = getPage(bytes(url, encoding='utf8'))
  deferred.addCallback(callback)
  deferred_list.append(deferred)
#DeferredList watches the whole batch and fires once every request has finished
dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()

tornado example

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


def handle_response(response):
  """
  處理返回值內(nèi)容(需要維護計數(shù)器,來停止IO循環(huán)),調(diào)用 ioloop.IOLoop.current().stop()
  :param response: 
  :return: 
  """
  if response.error:
    print("Error:", response.error)
  else:
    print(response.body)


def func():
  url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
  ]
  for url in url_list:
    print(url)
    http_client = AsyncHTTPClient()
    http_client.fetch(HTTPRequest(url), handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()

That covers this summary of Python crawler performance. If you picked up some knowledge or skills here, share it so more people can see it.
