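Python Crawler Basics [22]: Crawling Every App on Coolapk with Scrapy

Published: 2020-07-09 05:34:04  Source: Internet  Views: 350  Author: 學Python派森  Column: Programming Languages

Today's target is Coolapk (酷安), an app store. You could also try crawling it through its mobile app, but I plan to cover app crawling only after the 50th post in this series, so that will have to wait for now.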

The Coolapk home page opens on an advertisement. Click the Apps entry in the header to reach the app list.

Page analysis

Once the pagination URL is found, we can build the address of every list page.
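As a minimal sketch of that idea, the list pages appear to follow the pattern `?p=N`, as in the spider's start URL `https://www.coolapk.com/apk?p=1`. The helper name `build_page_urls` and the value passed as the last page are assumptions made for illustration; the real number of pages has to be read from the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.coolapk.com/apk"  # list page taken from the spider's start URL


def build_page_urls(last_page):
    """Return the list-page URLs for pages 1..last_page (last_page is a hypothetical value)."""
    return [f"{BASE_URL}?{urlencode({'p': page})}" for page in range(1, last_page + 1)]


for url in build_page_urls(3):
    print(url)
```

The spider in this article does not pre-build the list; it follows the "next page" link instead, which avoids having to know the page count in advance.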

Next, locate the data we want to save for later analysis.

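Everything shown above is data we need, so all that remains is to crawl it. This post again uses Scrapy. All of the code appears in the article, so by the end you will have the complete program.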

import scrapy

from apps.items import AppsItem  # import the item class
import re  # import the regular expression module

class AppsSpider(scrapy.Spider):
    name = 'Apps'
    allowed_domains = ['www.coolapk.com']
    start_urls = ['https://www.coolapk.com/apk?p=1']
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 your UA'  # replace with your own User-Agent string
        }
    }

Code walkthrough

custom_settings appears here for the first time. It overrides, for this spider only, the defaults configured in settings.py.

    def parse(self, response):
        list_items = response.css(".app_left_list>a")
        for item in list_items:
            url = item.css("::attr('href')").extract_first()

            url = response.urljoin(url)

            yield scrapy.Request(url, callback=self.parse_url)

        next_page = response.css('.pagination li:nth-child(8) a::attr(href)').extract_first()
        if next_page:  # no link means there is no further page to follow
            url = response.urljoin(next_page)
            yield scrapy.Request(url, callback=self.parse)

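Code walkthrough

response.css parses the page with CSS selectors. For the exact syntax, refer to the code above, paying particular attention to ::attr('href') and ::text.

response.urljoin resolves a (possibly relative) link against the URL of the current response.

next_page holds the link to the next page of results.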
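To make the URL joining concrete: Scrapy documents `response.urljoin` as a wrapper over the standard library's `urllib.parse.urljoin`, with the response URL as the base. The sketch below uses the standard library directly so it runs without Scrapy; the `href` values are hypothetical examples, not links copied from the site.

```python
from urllib.parse import urljoin

page_url = "https://www.coolapk.com/apk?p=1"  # URL of the list page being parsed

# Hypothetical href values, as they might appear in the page's links
detail_url = urljoin(page_url, "/apk/com.example.app")   # root-relative path
next_url = urljoin(page_url, "?p=2")                      # query-only link
absolute_url = urljoin(page_url, "https://www.coolapk.com/apk?p=3")  # already absolute

print(detail_url)
print(next_url)
print(absolute_url)
```

Because absolute links pass through unchanged, it is safe to call the join on every extracted `href` without checking its form first.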

The parse_url function parses the detail page of each app. It relies on three helper methods: self.getinfo(response), self.gettags(response), and self.getappinfo(response). Note also that response.css().re supports regular expressions, so it can extract a fragment from inside a piece of text.

    def parse_url(self, response):
        item = AppsItem()

        item["title"] = response.css(".detail_app_title::text").extract_first()
        info = self.getinfo(response)

        item['volume'] = info[0]
        item['downloads'] = info[1]
        item['follow'] = info[2]
        item['comment'] = info[3]

        item["tags"] = self.gettags(response)
        item['rank_num'] = response.css('.rank_num::text').extract_first()
        # re_first returns None instead of raising IndexError when nothing matches;
        # the Chinese literal has to match the text exactly as it appears on the page
        item['rank_num_users'] = response.css('.apk_rank_p1::text').re_first("共(.*?)个评分")
        item["update_time"], item["rom"], item["developer"] = self.getappinfo(response)

        yield item

The three helper methods are as follows.

    def getinfo(self, response):
        # Captures four fields in order: volume, downloads, follow, comment.
        # The Chinese literals have to match the text exactly as it appears on the page.
        info = response.css(".apk_topba_message::text").re(r"\s+(.*?)\s+/\s+(.*?)下载\s+/\s+(.*?)人关注\s+/\s+(.*?)个评论.*?")
        return info

    def gettags(self, response):
        tags = response.css(".apk_left_span2")
        tags = [item.css('::text').extract_first() for item in tags]

        return tags

    def getappinfo(self, response):
        #app_info = response.css(".apk_left_title_info::text").re("[\s\S]+更新时间：(.*?)")
        body_text = response.text  # body_as_unicode() is deprecated in favor of .text

        # Non-greedy capture stops at the first "<" after each label
        update = re.findall(r"更新时间：(.*?)<", body_text)[0]
        rom = re.findall(r"支持ROM：(.*?)<", body_text)[0]
        developer = re.findall(r"开发者名称：(.*?)<", body_text)[0]
        return update, rom, developer
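The regular expressions in these helpers are easier to follow when tried on a small string. The sketch below applies the same patterns with the standard `re` module; the sample text and HTML are invented for illustration and only assume the "value / downloads / followers / comments" layout implied by the pattern. The function names `parse_info` and `parse_update_time` are hypothetical.

```python
import re

# Same pattern as in getinfo; the sample string below is invented for illustration.
INFO_PATTERN = re.compile(
    r"\s+(.*?)\s+/\s+(.*?)下载\s+/\s+(.*?)人关注\s+/\s+(.*?)个评论"
)
# Same idea as in getappinfo: capture everything up to the next "<".
UPDATE_PATTERN = re.compile(r"更新时间：(.*?)<")


def parse_info(text):
    """Return (volume, downloads, follow, comment), or None if the text does not match."""
    match = INFO_PATTERN.search(text)
    return match.groups() if match else None


def parse_update_time(html):
    """Return the text after the update-time label, or None if the label is absent."""
    match = UPDATE_PATTERN.search(html)
    return match.group(1) if match else None


sample_text = "\n 12.5M / 100万下载 / 5000人关注 / 300个评论 \n"
sample_html = "<p>更新时间：2天前</p>"

print(parse_info(sample_text))
print(parse_update_time(sample_html))
```

Returning None on a failed match, rather than indexing into an empty list, keeps one oddly formatted page from stopping the whole crawl.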

Saving the data

I am not going to provide the item definition used to carry the data here. You can work it out from my code, haha.

import pymongo

class AppsPipeline(object):

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get("MONGO_URL"),
            mongo_db=crawler.settings.get("MONGO_DB")
        )

    def open_spider(self, spider):
        try:
            self.client = pymongo.MongoClient(self.mongo_url)
            self.db = self.client[self.mongo_db]

        except Exception as e:
            print(e)

    def process_item(self, item, spider):
        # Use the item class name (AppsItem) as the collection name
        name = item.__class__.__name__

        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

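Code walkthrough

open_spider opens the MongoDB connection when the spider starts.

process_item stores each item.

close_spider closes the connection when the spider finishes.

Pay particular attention to from_crawler: it is a class method that reads the configuration from settings.py when the pipeline is created.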

SPIDER_MODULES = ['apps.spiders']
NEWSPIDER_MODULE = 'apps.spiders'
MONGO_URL = '127.0.0.1'
MONGO_DB = 'KuAn'
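To see how from_crawler picks these values up, the sketch below imitates the call with stand-ins so that it runs without Scrapy or MongoDB. `DemoPipeline` and `fake_crawler` are hypothetical names; the only assumption about the real crawler is that its settings object offers a `.get(name)` lookup, which a plain dict also provides.

```python
from types import SimpleNamespace


class DemoPipeline:
    """Stand-in for AppsPipeline that only stores its configuration."""

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy passes the running crawler; its settings expose .get(name)
        return cls(
            mongo_url=crawler.settings.get("MONGO_URL"),
            mongo_db=crawler.settings.get("MONGO_DB"),
        )


# A plain dict stands in for Scrapy's settings object; both provide .get()
fake_crawler = SimpleNamespace(settings={"MONGO_URL": "127.0.0.1", "MONGO_DB": "KuAn"})

pipeline = DemoPipeline.from_crawler(fake_crawler)
print(pipeline.mongo_url, pipeline.mongo_db)
```

In a real project the pipeline also has to be enabled through the ITEM_PIPELINES setting. The dotted path depends on where the class lives, for example apps.pipelines.AppsPipeline if it sits in apps/pipelines.py, which is an assumption here.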

Getting the data

Adjust the crawl speed and the concurrency.

DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 8
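As a rough back-of-the-envelope check on what a 3-second delay means, the sketch below estimates a lower bound on crawl time. It assumes the delay is applied between consecutive requests to the same domain, so throughput is about one request per delay regardless of the concurrency value; it ignores response time and the randomization Scrapy applies to the delay by default. The function name and the request count are invented for illustration.

```python
def estimate_crawl_seconds(num_requests, download_delay):
    """Rough lower bound: one request per download_delay seconds to a single domain."""
    if num_requests <= 0:
        return 0.0
    # The first request is sent immediately; each later one waits for the delay
    return (num_requests - 1) * download_delay


# 10_000 requests is an invented figure, not a measured count for this site
seconds = estimate_crawl_seconds(10_000, 3)
print(f"about {seconds / 3600:.1f} hours")
```

If the estimate is too long for your purposes, lowering the delay is the lever that matters, but a polite delay reduces the load placed on the site.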

Run the code and, after a bit of effort, the data comes in!

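When I find the time I will write up a data analysis of Coolapk. If you want the source code, work through the article from start to finish and type it out yourself.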
