This article explains what the spider_Worker node is in a Python distributed crawler. The explanation is detailed and should be a useful reference; readers who are interested are encouraged to read it through.
To turn the multithreaded version into a distributed crawler, we mainly rely on the cross-platform BaseManager class from the multiprocessing.managers module. Its job here is to register the two queues, task_queue and result_queue, as callables exposed over the network. The Master node listens on a port and the Worker child nodes connect to it, so different hosts can share and synchronize resources through the registered functions. The Master node is responsible for distributing tasks and collecting results; each Worker pulls tasks from the task queue, runs them, and sends the results back for storage in the database.
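The register/listen/connect handshake described above can be sketched in a single Python 3 script. This is a minimal sketch, not the article's code: the master's server runs in a background thread so both sides fit in one process, port 0 lets the OS pick a free port, and the sample URL is only a placeholder; the queue names and the authkey b'sir' mirror the article's setup.

```python
import queue
import threading
from multiprocessing.managers import BaseManager

task_queue = queue.Queue()
result_queue = queue.Queue()

class MasterManager(BaseManager):
    """Master side: owns and exposes the two queues."""

class WorkerManager(BaseManager):
    """Worker side: looks the queues up by name only."""

# The master registers callables that hand out the shared queues.
MasterManager.register('get_task_queue', callable=lambda: task_queue)
MasterManager.register('get_result_queue', callable=lambda: result_queue)

# The worker registers only the names; the real objects live on the master.
WorkerManager.register('get_task_queue')
WorkerManager.register('get_result_queue')

# Master listens; serve_forever runs in a daemon thread for this demo.
master = MasterManager(address=('127.0.0.1', 0), authkey=b'sir')
server = master.get_server()
threading.Thread(target=server.serve_forever, daemon=True).start()

# Worker connects to the advertised address.
worker = WorkerManager(address=server.address, authkey=b'sir')
worker.connect()
task = worker.get_task_queue()      # proxy to the master's task_queue
result = worker.get_result_queue()  # proxy to the master's result_queue

task_queue.put('https://www.baidu.com/s?wd=test&pn=0')  # master enqueues a task
target = task.get(timeout=5)                            # worker receives it
result.put(('crawled', target))                         # worker reports back
done = result_queue.get(timeout=5)                      # master collects the result
print(done)
```

Running a worker on another host only requires replacing 127.0.0.1 with the master's address; the proxies behave like local Queue objects.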
The spider_Worker node mainly calls its spider() function to process each task; every worker handles tasks the same way, and a child node sends each link it harvests back to the Master as soon as it is found. Note that only one Master file may run at a time, but multiple Worker nodes can run simultaneously, draining the task queue in parallel.
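The "one queue, many workers" behaviour is easy to see in miniature. In this sketch, threads stand in for separate Worker processes and the URLs are hypothetical; each task is taken by exactly one worker, which is what lets several Worker nodes run against the same Master safely.

```python
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()

# Six hypothetical tasks, the kind the Master would enqueue.
for i in range(6):
    tasks.put('https://example.com/page%d' % i)

def worker():
    # Each worker loops until the shared queue is drained.
    while True:
        try:
            url = tasks.get(timeout=1)
        except queue.Empty:
            break
        # Record which worker handled which task.
        results.put((threading.current_thread().name, url))

# Three workers run in parallel against the same queue.
threads = [threading.Thread(target=worker, name='worker-%d' % n) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.qsize())  # all 6 tasks processed exactly once
```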
spider_Master.py
# coding:utf-8
from multiprocessing.managers import BaseManager
from Queue import Queue
import time
import MySQLdb

page = 2
word = 'inurl:login.action'
output = 'test.txt'
page = (page + 1) * 10
host = '127.0.0.1'
port = 500
urls = []

class Master():
    def __init__(self):
        # the server must create the two shared queues; workers do not
        self.task_queue = Queue()
        self.result_queue = Queue()

    def start(self):
        # register get_task_queue on the network, i.e. expose the two queues;
        # the worker side does not pass the callable argument
        BaseManager.register('get_task_queue', callable=lambda: self.task_queue)
        BaseManager.register('get_result_queue', callable=lambda: self.result_queue)
        manager = BaseManager(address=(host, port), authkey='sir')
        # the master calls start() to begin listening; workers call connect()
        manager.start()
        # both master and worker obtain the queues over the network;
        # they must not use the locally created queues directly
        task = manager.get_task_queue()
        result = manager.get_result_queue()
        print 'put task'
        for i in range(0, page, 10):
            target = 'https://www.baidu.com/s?wd=%s&pn=%s' % (word, i)
            print 'put task %s' % target
            task.put(target)
        print 'try get result'
        while True:
            try:
                # use a longer timeout when fetching results
                url = result.get(True, 5)
                print url
                urls.append(url)
            except:
                break
        manager.shutdown()

if __name__ == '__main__':
    start = time.time()
    server = Master()
    server.start()
    print 'crawled %s records in total' % len(urls)
    print time.time() - start
    with open(output, 'a') as f:
        for url in urls:
            f.write(url[1] + '\n')
    conn = MySQLdb.connect('localhost', 'root', 'root', 'Struct', charset='utf8')
    cursor = conn.cursor()
    for record in urls:
        sql = "insert into s045 values('%s','%s','%s')" % (record[0], record[1], str(record[2]))
        cursor.execute(sql)
    conn.commit()
    conn.close()
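One caveat in the master's storage step: it builds the INSERT by string formatting, which breaks on titles containing quotes and is open to SQL injection. Passing parameters to executemany() lets the driver handle quoting. A minimal sketch using the stdlib sqlite3 module as a stand-in for MySQLdb (the table name s045 and the (pn, url, title) tuples mirror the article; with MySQLdb the placeholder is %s instead of ?):

```python
import sqlite3

# In-memory stand-in for the article's MySQL table.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE s045 (pn INTEGER, url TEXT, title TEXT)')

# (pn, url, title) tuples as the workers return them;
# note the apostrophe that would break a string-formatted INSERT.
urls = [(1, 'https://example.com/a', "O'Reilly login"),
        (2, 'https://example.com/b', 'Admin console')]

# Placeholders let the driver quote and escape each value.
cur.executemany('INSERT INTO s045 VALUES (?, ?, ?)', urls)
conn.commit()

cur.execute('SELECT COUNT(*) FROM s045')
count = cur.fetchone()[0]
print(count)  # 2
```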
spider_Worker.py
# coding:utf-8
import re
import time
import requests
from multiprocessing.managers import BaseManager
from bs4 import BeautifulSoup as bs

host = '127.0.0.1'
port = 500

class Worker():
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

    def spider(self, target, result):
        # recover the result-page number from the trailing pn= parameter
        pn = int(target.split('=')[-1]) / 10 + 1
        html = requests.get(target, headers=self.headers)
        soup = bs(html.text, "lxml")
        res = soup.find_all(name="a", attrs={'class': 'c-showurl'})
        for r in res:
            try:
                # follow the redirect link to resolve the real URL
                h = requests.get(r['href'], headers=self.headers, timeout=3)
                if h.status_code == 200:
                    url = h.url
                    time.sleep(1)
                    title = re.findall(r'<title>(.*?)</title>', h.content)[0]
                    title = title.decode('utf-8')
                    print 'send spider url:', url
                    # push each harvested link straight back to the master
                    result.put((pn, url, title))
                else:
                    continue
            except:
                continue

    def start(self):
        # register only the names; no callable argument on the worker side
        BaseManager.register('get_task_queue')
        BaseManager.register('get_result_queue')
        print 'Connect to server %s' % host
        m = BaseManager(address=(host, port), authkey='sir')
        # the worker connects to the master instead of listening
        m.connect()
        task = m.get_task_queue()
        result = m.get_result_queue()
        print 'try get queue'
        while True:
            try:
                target = task.get(True, 1)
                print 'run pages %s' % target
                self.spider(target, result)
            except:
                break

if __name__ == '__main__':
    w = Worker()
    w.start()
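Two small pieces of the worker's logic can be checked in isolation, without any network access: recovering the page number from the pn= parameter and pulling the title out of fetched HTML. The URL and HTML below are canned examples, not real responses:

```python
import re

# The worker derives the result-page number from the trailing pn= value.
target = 'https://www.baidu.com/s?wd=inurl%3Alogin.action&pn=20'
pn = int(target.split('=')[-1]) // 10 + 1
print(pn)  # 3

# Title extraction, as the worker does on each fetched page.
html = '<html><head><title>Login Page</title></head><body></body></html>'
matches = re.findall(r'<title>(.*?)</title>', html)
title = matches[0] if matches else ''
print(title)  # Login Page
```

Note that splitting on '=' and taking the last field only works while pn is the final query parameter, as it is in the URLs the master generates.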