In Python crawling code, the crawl frequency is controlled mainly by inserting a delay between requests; setting a browser-like User-Agent in the request headers additionally lowers the chance of being blocked. The suggested steps are as follows.

First, import the required libraries and define request headers that carry a browser User-Agent:
import time
import random
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
Random delays between requests can be implemented with the time.sleep() and random.uniform() functions:

def random_delay():
    time.sleep(random.uniform(1, 3))  # wait between 1 and 3 seconds
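If a stricter bound on the request rate is wanted, a small rate limiter that measures the time since the previous request can replace the blind sleep. This is only a minimal sketch assuming a single-threaded crawler; MIN_INTERVAL and wait_for_next_request() are hypothetical names, not part of the original steps:

MIN_INTERVAL = 2.0        # assumed minimum number of seconds between two requests
_last_request_time = 0.0  # monotonic timestamp of the previous request

def wait_for_next_request():
    # Sleep only as long as needed so that consecutive requests are at least
    # MIN_INTERVAL (plus a little random jitter) apart, counting processing time.
    global _last_request_time
    elapsed = time.monotonic() - _last_request_time
    remaining = MIN_INTERVAL + random.uniform(0, 1) - elapsed
    if remaining > 0:
        time.sleep(remaining)
    _last_request_time = time.monotonic()

Calling wait_for_next_request() right before each request keeps the effective rate at no more than roughly one request per MIN_INTERVAL seconds, even when downloading and parsing themselves take time.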
Use the requests.get() function to send the request, and parse the returned page with the BeautifulSoup library:

def get_page(url):
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Request failed, status code: {response.status_code}")
        return None
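If the server answers with HTTP 429 (Too Many Requests), that is a direct sign the current frequency is too high. As a hedged variation on get_page() above, and not part of the original answer, one can back off and retry a few times, honoring the Retry-After header when it is present:

def get_page_with_backoff(url, max_retries=3):
    # Like get_page(), but backs off when the server signals rate limiting.
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        if response.status_code == 429:
            # Use the server-suggested Retry-After if it is a plain number of
            # seconds, otherwise fall back to exponential backoff: 1, 2, 4, ...
            retry_after = response.headers.get("Retry-After", "")
            wait = float(retry_after) if retry_after.isdigit() else 2 ** attempt
            print(f"Rate limited, waiting {wait} seconds before retrying")
            time.sleep(wait)
        else:
            print(f"Request failed, status code: {response.status_code}")
            return None
    return None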
In the main loop, call the get_page() function to fetch the page content, then parse the page and extract the required information. After each request, call random_delay() to wait before the next one:

def main():
    url = "https://example.com"  # target URL
    while True:
        page_content = get_page(url)
        if page_content:
            soup = BeautifulSoup(page_content, "html.parser")
            # parse the page content and extract the required information
            # ...
        random_delay()  # pause before the next request

if __name__ == "__main__":
    main()
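How long the delay should be is sometimes published by the site itself as a Crawl-delay directive in robots.txt. The following is a small optional sketch, not part of the original steps, using the standard-library urllib.robotparser; site_crawl_delay() is a hypothetical helper name:

from urllib.robotparser import RobotFileParser

def site_crawl_delay(base_url, user_agent="*"):
    # Read robots.txt and return the site's Crawl-delay, or None if not set.
    parser = RobotFileParser()
    parser.set_url(base_url.rstrip("/") + "/robots.txt")
    parser.read()
    return parser.crawl_delay(user_agent)

# For example, prefer the published value and fall back to the random delay:
# delay = site_crawl_delay("https://example.com") or random.uniform(1, 3)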
With the steps above, we can control the crawler's request frequency and reduce the risk of being banned by the target website. Note that in practice the delay interval and the User-Agent may need to be adjusted to the characteristics of the target site.
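For the User-Agent side of that adjustment, one common approach is to sample a header from a small pool on each request. The sketch below is only an illustration: USER_AGENTS and random_headers() are assumed names, and the browser strings are generic examples rather than values taken from the original answer:

USER_AGENTS = [
    # Illustrative desktop browser User-Agent strings; replace with your own pool.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def random_headers():
    # Pick a different User-Agent for each request.
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage: response = requests.get(url, headers=random_headers())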