The data flow in Scrapy is controlled by the execution engine. The quoted passages below come from the official Scrapy documentation; I have added commentary based on my own reading (partly educated guesses) to point the way for further development of the GooSeeker open-source crawler:
The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
URL誰(shuí)來(lái)準(zhǔn)備呢?看樣子是Spider自己來(lái)準(zhǔn)備,那么可以猜測(cè)Scrapy架構(gòu)部分(不包括Spider)主要做事件調(diào)度,不管網(wǎng)址的存儲(chǔ)。看起來(lái)類似GooSeeker會(huì)員中心的爬蟲(chóng)羅盤(pán),為目標(biāo)網(wǎng)站準(zhǔn)備一批網(wǎng)址,放在羅盤(pán)中準(zhǔn)備執(zhí)行爬蟲(chóng)調(diào)度操作。所以,這個(gè)開(kāi)源項(xiàng)目的下一個(gè)目標(biāo)是把URL的管理放在一個(gè)集中的調(diào)度庫(kù)里面。
The Engine asks the Scheduler for the next URLs to crawl.
看到這里其實(shí)挺難理解的,要看一些其他文檔才能理解透。接第1點(diǎn),引擎從Spider中把網(wǎng)址拿到以后,封裝成一個(gè)Request,交給了事件循環(huán),會(huì)被Scheduler收來(lái)做調(diào)度管理的,暫且理解成對(duì)Request做排隊(duì)。引擎現(xiàn)在就找Scheduler要接下來(lái)要下載的網(wǎng)頁(yè)地址。
The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).
從調(diào)度器申請(qǐng)任務(wù),把申請(qǐng)到的任務(wù)交給下載器,在下載器和引擎之間有個(gè)下載器中間件,這是作為一個(gè)開(kāi)發(fā)框架的必備亮點(diǎn),開(kāi)發(fā)者可以在這里進(jìn)行一些定制化擴(kuò)展。
Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).
Once the download completes, a Response is produced and passed back to the engine through the Downloader Middleware. Note that Response, like the earlier Request, is capitalized: although I have not yet read the other Scrapy documents, I guess these are event objects internal to the Scrapy framework, and we can further infer that this is an asynchronous, event-driven engine, much like the three-level event loop in the GooSeeker DS DataScraper. For a high-performance, low-overhead engine, this is a must.
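The mirror hook on the response path can be sketched the same way: after download, the Response travels back through the middleware chain before reaching the engine. Scrapy's real signature is `process_response(request, response, spider)`; this dict-based version is only a sketch with an invented middleware name:

```python
class RetryMarkerMiddleware:
    def process_response(self, response):
        # Example customization: tag server errors so a later component
        # could decide to retry them.
        response["needs_retry"] = response["status"] >= 500
        return response

response = {"status": 200, "body": "<html></html>"}
response = RetryMarkerMiddleware().process_response(response)
```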
The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).
再次出現(xiàn)一個(gè)中間件,給開(kāi)發(fā)者足夠的發(fā)揮空間。
The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine.
每個(gè)Spider順序抓取一個(gè)個(gè)網(wǎng)頁(yè),完成一個(gè)就構(gòu)造另一個(gè)Request事件,開(kāi)始另一個(gè)網(wǎng)頁(yè)的抓取。
The Engine passes scraped items and new Requests returned by a spider through Spider Middleware (output direction), and then sends processed items to Item Pipelines and processed Requests to the Scheduler.
The engine acts as the event dispatcher.
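One of the dispatch destinations is the Item Pipeline. A sketch of a pipeline stage: Scrapy pipelines implement `process_item(item, spider)`, while this simplified version rejects items missing a title and normalizes the rest; the class name and cleaning rule are illustrative assumptions:

```python
class CleanTitlePipeline:
    def process_item(self, item):
        # Reject items without a usable title (Scrapy would raise DropItem).
        if not item.get("title"):
            raise ValueError("drop item: no title")
        # Normalize the field before the item moves to the next stage.
        item["title"] = item["title"].strip()
        return item

pipeline = CleanTitlePipeline()
clean = pipeline.process_item({"title": "  Hello  "})
```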
The process repeats (from step 1) until there are no more requests from the Scheduler.
持續(xù)不斷地運(yùn)行。
免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng),如果涉及侵權(quán)請(qǐng)聯(lián)系站長(zhǎng)郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。