Python: Sharing a Scrapy Learning Demo

Published: 2020-08-07 00:53:28  Source: ITPUB Blog  Views: 146  Author: ckxllf  Category: Programming Languages

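Scrapy is a Python framework that is fairly easy to get started with, and I recommend it.

Setting up the development environment

Installing Python

Download: the official Python website

I downloaded version 3.8.0 (my installation directory is D:\python\Python38-32).

After installing, set the environment variable by appending the following to PATH: D:\python\Python38-32;D:\python\Python38-32\Scripts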

Upgrading pip

Run the command:

    > python -m pip install --upgrade pip

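Installing the modules Scrapy depends on

Installing wheel

Open cmd and run:

    > pip install wheel

Installing pywin32

Download: GitHub

Because my Python installation is 32-bit, I chose the win32-py3.8 build. After downloading, double-click the installer to install it.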

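Installing lxml

Run the command:

    > pip install lxml

Installing Twisted

Installing Twisted online with pip kept failing with download timeouts, so I installed it offline from a wheel file instead.

Run the command:

    > pip install Twisted-19.10.0-cp38-cp38-win32.whl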
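The wheel file name above encodes the interpreter it was built for: cp38 stands for CPython 3.8 and win32 for 32-bit Windows. As a minimal sketch, the following checks which tag matches a given interpreter. The helper name `windows_wheel_tag` is made up for illustration, and the tag format is assumed to hold for CPython 3.8 or later on Windows.

```python
import struct
import sys


def python_bitness() -> int:
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


def windows_wheel_tag(version=None, bits=None) -> str:
    """Build the tag portion of a Windows wheel file name, e.g. 'cp38-cp38-win32'.

    Hypothetical helper: assumes CPython 3.8+ on Windows, where the ABI tag
    equals the Python tag.
    """
    major, minor = (version or sys.version_info)[:2]
    bits = bits or python_bitness()
    platform_tag = "win32" if bits == 32 else "win_amd64"
    return f"cp{major}{minor}-cp{major}{minor}-{platform_tag}"


if __name__ == "__main__":
    # Print the bitness and the tag that fits the interpreter running this script.
    print(python_bitness(), windows_wheel_tag())
```

If the printed tag does not appear in the name of the wheel file you downloaded, pip will most likely refuse to install it on this interpreter.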

Installing Scrapy

Run the command:

    > pip install scrapy

That completes the Scrapy environment setup, which is relatively simple.
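To confirm that every package from the steps above is actually installed, a short check along the following lines may help. It is only a sketch: the function name `installed_versions` is made up for illustration, and it relies on `importlib.metadata`, which is part of the standard library from Python 3.8 onward.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The distribution names below are the packages installed in this walkthrough.
    report = installed_versions(["wheel", "pywin32", "lxml", "Twisted", "Scrapy"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```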

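Writing the demo

Preparation

Target website

I chose the Baidu Images home page: http://image.baidu.com/

Analyzing the rules

My first idea was to scrape the images with XPath, using the expression //div[@class="imgrow"]/a/img/@src. While writing the spider, however, I found that the response to http://image.baidu.com/ does not contain the image URLs directly. They are fetched by a later asynchronous request, which returns a JSON string, so the XPath approach does not work.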
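The spider later in this post reads the image links from the `hoverURL` field of each entry under `data`. The extraction step can be tried offline first, as in the sketch below. The sample JSON is a made-up, heavily shortened stand-in for the real response, and skipping entries with an empty `hoverURL` is an added assumption.

```python
import json

# Hypothetical, heavily shortened sample of the JSON the endpoint returns.
SAMPLE_RESPONSE = """
{"data": [
    {"hoverURL": "http://example.com/a.jpg"},
    {"hoverURL": ""},
    {"hoverURL": "http://example.com/b.jpg"},
    {}
]}
"""


def extract_image_urls(text):
    """Return the non-empty hoverURL values found under the 'data' key."""
    payload = json.loads(text)
    urls = []
    for entry in payload.get("data", []):
        # Empty entries and empty URLs are skipped rather than passed on.
        url = entry.get("hoverURL") if entry else None
        if url:
            urls.append(url)
    return urls


if __name__ == "__main__":
    print(extract_image_urls(SAMPLE_RESPONSE))
```

Filtering out empty values matters because a download request built from an empty URL would fail.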

The crawl target therefore becomes the URL of that asynchronous request: http://image.baidu.com/search/acjson?tn=resultjson_com&catename=pcindexhot&ipn=rj&ct=201326592&is=&fp=result&queryWord=&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&word=pcindexhot&face=0&istype=2&qc=&nc=1&fr=&pn=0&rn=30
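The long query string is easier to work with once it is split into parameters. The sketch below rebuilds the URL with selected parameters replaced. The helper name `with_query` is made up for illustration, and reading `pn` as the result offset and `rn` as the number of results per request is an assumption that the original post does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = (
    "http://image.baidu.com/search/acjson?tn=resultjson_com&catename=pcindexhot"
    "&ipn=rj&ct=201326592&is=&fp=result&queryWord=&cl=2&lm=-1&ie=utf-8&oe=utf-8"
    "&adpicid=&st=-1&z=&ic=0&word=pcindexhot&face=0&istype=2&qc=&nc=1&fr=&pn=0&rn=30"
)


def with_query(url, **overrides):
    """Return url with the given query parameters replaced, keeping the rest."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters such as "is=" that have no value.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Assumption: pn is the result offset, so pn=30 would request the next batch.
    print(with_query(BASE_URL, pn=30))
```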

Creating the Scrapy project ImagesRename

Run the command:

    > scrapy startproject ImagesRename

After the command finishes, the project directory structure is generated (shown as a figure in the original post).

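In that structure:

spiders directory: holds the spider files

items.py: defines the containers for the scraped data, which store it much like a Python dictionary

pipelines.py: the core processor, which acts on the scraped content, for example by downloading or saving it

settings.py: the project settings, where you change USER_AGENT, the storage directory, and so on

scrapy.cfg: the project's deployment configuration file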

Writing the item container: items.py

    import scrapy


    class ImagesrenameItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # List of image URLs collected by the spider
        imgurl = scrapy.Field()

Creating the spider file ImgsRename.py

    # -*- coding: utf-8 -*-
    import json

    import scrapy

    from ImagesRename.items import ImagesrenameItem


    # A plain scrapy.Spider is used here: CrawlSpider reserves parse() for its
    # own rule handling, and this spider defines no rules.
    class ImgsRenameSpider(scrapy.Spider):
        name = 'ImgsRename'
        allowed_domains = ['image.baidu.com']
        # http://image.baidu.com/ does not return the image links itself; they come
        # from an asynchronous request, so the start URL must be that request's URL.
        start_urls = [
            'http://image.baidu.com/search/acjson?tn=resultjson_com&catename=pcindexhot&ipn=rj&ct=201326592&is=&fp=result&queryWord=&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&word=pcindexhot&face=0&istype=2&qc=&nc=1&fr=&pn=0&rn=30',
        ]

        def parse(self, response):
            # Instantiate the item
            item = ImagesrenameItem()
            # Parse the JSON string returned by the asynchronous request.
            # Analysis shows the image links are stored under json -> data -> hoverURL.
            payload = json.loads(response.text)
            data = payload["data"]
            img_urls = []
            # Collect the image URLs into a list
            for d in data:
                # Skip empty entries and entries without a usable hoverURL
                if d and d.get("hoverURL"):
                    img_urls.append(d["hoverURL"])
            item['imgurl'] = img_urls
            yield item

Writing the core processor, the image download pipeline: pipelines.py

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline


    class ImagesrenamePipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            # Loop over every image URL in the item
            for image_url in item['imgurl']:
                # Issue a download request for the image
                yield Request(image_url)

Modifying the settings file settings.py

    # -*- coding: utf-8 -*-
    # Scrapy settings for ImagesRename project
    BOT_NAME = 'ImagesRename'

    SPIDER_MODULES = ['ImagesRename.spiders']
    NEWSPIDER_MODULE = 'ImagesRename.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'ImagesRename (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    ITEM_PIPELINES = {
        'ImagesRename.pipelines.ImagesrenamePipeline': 300,
    }

    # Directory where the images are stored; a raw string keeps the backslash literal
    IMAGES_STORE = r'E:\圖片'
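Scrapy's ImagesPipeline depends on the Pillow library, so Pillow needs to be installed (`pip install Pillow`) for the pipeline configured above to work. Beyond the settings already shown, a few optional ones may be worth considering. The fragment below is only a sketch: the setting names are real Scrapy settings, but the values are illustrative assumptions and do not come from the original project.

```python
# Optional additions to settings.py (illustrative values, not from the original post).

# Wait between requests to the same site, in seconds, to reduce load on the server.
DOWNLOAD_DELAY = 1.0

# Skip images that were already downloaded within this many days (default is 90).
IMAGES_EXPIRES = 30

# Drop images smaller than these dimensions, in pixels (defaults are 0, no limit).
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100
```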

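Running the program to download the images

Run the command:

    > scrapy crawl ImgsRename

That completes a simple image-scraping program (the original post showed the result in a figure here).

The downloaded files get names that look random (by default, ImagesPipeline names each file after the SHA-1 hash of its URL). To store them under names in a format of your choosing, override the file_path method of the ImagesPipeline class. That is not covered in detail here.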
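As a rough illustration of the naming logic such an override might use, the sketch below derives a storage path from the file name in the image URL. The helper `readable_image_path` is hypothetical and not part of Scrapy, and both the `full/` prefix and the `.jpg` fallback extension are assumptions.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def readable_image_path(url, prefix="full"):
    """Build a storage path from the URL's own file name.

    Falls back to a SHA-1 hash of the URL when the URL path has no file name.
    Hypothetical helper, not part of Scrapy.
    """
    name = posixpath.basename(urlsplit(url).path)
    if not name:
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".jpg"
    return f"{prefix}/{name}"


if __name__ == "__main__":
    print(readable_image_path("http://example.com/pics/cat.jpg"))
```

Inside ImagesrenamePipeline, an overridden file_path method could return `readable_image_path(request.url)`. The exact signature of file_path differs between Scrapy versions, so check the documentation for the version you have installed. Note that two URLs ending in the same file name would overwrite each other under this scheme.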
