Python Crawler Primer [16]: Scraping Lianjia Rental Data

Published: 2020-07-26 16:43:08  Source: web  Views: 456  Author: 學Python派森  Category: Programming Languages

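1. Introduction

As a developer active in the Beijing-Tianjin-Hebei region, I like to look at data about Shijiazhuang, that "international metropolis", whenever I have some spare time. This post scrapes the rental listings on Lianjia (鏈家網); the scraped data can serve as material for data analysis in later posts.
The URL we need to crawl is: https://sjz.lianjia.com/zufang/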

2. Analyzing the URLs

First, decide which data we actually need.


As the screenshot shows, the ××× boxes mark the data we need.

Next, work out the pagination pattern:

https://sjz.lianjia.com/zufang/pg1/
https://sjz.lianjia.com/zufang/pg2/
https://sjz.lianjia.com/zufang/pg3/
https://sjz.lianjia.com/zufang/pg4/
https://sjz.lianjia.com/zufang/pg5/
... 
https://sjz.lianjia.com/zufang/pg80/
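
The pattern above can be turned into a list of URLs with a few lines of Python. This is a minimal sketch: the names BASE_URL and build_page_urls are made up for illustration, and the default last page of 80 is simply taken from the last URL listed above, so adjust it to however many pages the site actually serves.

```python
BASE_URL = "https://sjz.lianjia.com/zufang/pg{}/"


def build_page_urls(first_page=1, last_page=80):
    """Return the listing URLs for pages first_page..last_page (inclusive)."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls()
print(urls[0])    # first page URL
print(urls[-1])   # last page URL
print(len(urls))  # number of pages
```

Because range() excludes its upper bound, the helper adds 1 so that last_page itself is included.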

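3. Parsing the pages

With the page URLs known, the links can be assembled quickly. We use the lxml module to parse the page source and extract the data we want.

This post also uses a module that is new to this series, fake_useragent, which returns a random UA (User-Agent) string. It is simple to use, and tutorials are easy to find with a quick search on Baidu.

Here it is only used to obtain one random UA: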

self._ua = UserAgent()
self._headers = {"User-Agent": self._ua.random}  # pick one random UA
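
The snippet above picks a UA once, so every request then carries the same string. If you would rather rotate the UA per request, or cannot install fake_useragent, a small hand-maintained pool behaves similarly. The sketch below uses only the standard library; FALLBACK_USER_AGENTS and random_headers are hypothetical names, and the UA strings are just examples to replace with whatever you want to present.

```python
import random

# Hypothetical, hand-picked pool of UA strings (examples only).
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
]


def random_headers(pool=FALLBACK_USER_AGENTS):
    """Build a headers dict with a UA drawn at random from the pool."""
    return {"User-Agent": random.choice(pool)}


headers = random_headers()
print(headers["User-Agent"])
```

Calling random_headers() before each request gives each request its own draw from the pool.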

Because the page URLs can be generated up front, the crawl is written with coroutines, and the CSV file is written with the pandas module.

from fake_useragent import UserAgent
from lxml import etree
import asyncio
import aiohttp
import pandas as pd

class LianjiaSpider(object):

    def __init__(self):
        self._ua = UserAgent()
        self._headers = {"User-Agent": self._ua.random}  # one random UA, reused for every request
        self._data = list()  # one dict per listing

    async def get(self, url):
        """Fetch a page and return its text, or None on a non-200 status or an error."""
        async with aiohttp.ClientSession() as session:
            try:
                timeout = aiohttp.ClientTimeout(total=3)  # give up after 3 seconds
                async with session.get(url, headers=self._headers, timeout=timeout) as resp:
                    if resp.status == 200:
                        result = await resp.text()
                        return result
            except Exception as e:
                print(e.args)

    async def parse_html(self):
        for page in range(1, 77):  # pages 1 through 76
            url = "https://sjz.lianjia.com/zufang/pg{}/".format(page)
            print("Crawling {}".format(url))
            html = await self.get(url)   # fetch the page source
            if html is None:   # the request failed, so skip this page
                continue
            html = etree.HTML(html)  # parse the page
            self.parse_page(html)   # extract the data we want

            print("Saving data....")
            # Rewrite the whole CSV after every page, so partial results are kept
            data = pd.DataFrame(self._data)
            data.to_csv("鏈家網租房數據.csv", encoding='utf_8_sig')   # write the file

    def run(self):
        asyncio.run(self.parse_html())

if __name__ == '__main__':
    spider = LianjiaSpider()
    spider.run()
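
A note on encoding='utf_8_sig' in the to_csv call: that codec writes a UTF-8 byte order mark at the start of the file, which is what lets Excel recognize the Chinese text as UTF-8. The sketch below shows the effect using only the standard library rather than pandas; records_to_csv_bytes and the sample record are hypothetical.

```python
import csv
import io


def records_to_csv_bytes(records, fieldnames):
    """Serialize a list of dicts to CSV and encode it with a UTF-8 BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue().encode("utf_8_sig")


# Hypothetical sample record, only to exercise the function.
sample = [{"region": "示例小區", "price": "1500"}]
payload = records_to_csv_bytes(sample, ["region", "price"])
print(payload[:3])  # the three BOM bytes written by utf_8_sig
```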

The LianjiaSpider class is still missing its page-parsing method, so let's fill it in:

    def remove_space(self, texts):
        # Not shown in the original post; assumed minimal version that joins
        # the strings of an XPath text() result and strips surrounding whitespace.
        return "".join(texts).strip()

    def parse_page(self, html):
        # Each listing on the page sits in its own info-panel div
        info_panel = html.xpath("//div[@class='info-panel']")
        for info in info_panel:
            region = self.remove_space(info.xpath(".//span[@class='region']/text()"))
            zone = self.remove_space(info.xpath(".//span[@class='zone']/span/text()"))
            meters = self.remove_space(info.xpath(".//span[@class='meters']/text()"))
            where = self.remove_space(info.xpath(".//div[@class='where']/span[4]/text()"))

            con = info.xpath(".//div[@class='con']/text()")
            floor = con[0]  # floor
            style = con[1]  # style

            agent = info.xpath(".//div[@class='con']/a/text()")[0]

            has = info.xpath(".//div[@class='left agency']//text()")

            price = info.xpath(".//div[@class='price']/span/text()")[0]
            price_pre = info.xpath(".//div[@class='price-pre']/text()")[0]
            look_num = info.xpath(".//div[@class='square']//span[@class='num']/text()")[0]

            one_data = {
                "region": region,
                "zone": zone,
                "meters": meters,
                "where": where,
                "louceng": floor,
                "type": style,
                "xiaoshou": agent,
                "has": has,
                "price": price,
                "price_pre": price_pre,
                "num": look_num
            }
            self._data.append(one_data)  # collect the record
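
Several fields above are read with [0], which raises IndexError whenever a listing lacks that element. One way to make the extraction more forgiving is a small helper that returns a default instead. This is only a sketch: first_text is a hypothetical name, not part of the original spider, and unlike a plain [0] it also skips whitespace-only strings.

```python
def first_text(nodes, default=""):
    """Return the first non-blank string in an XPath result list, stripped.

    Falls back to default when the list is empty or holds only whitespace.
    """
    for node in nodes:
        text = str(node).strip()
        if text:
            return text
    return default


print(first_text(["  1500 ", "元/月"]))  # first non-blank entry, stripped
print(first_text([], default="N/A"))     # empty result falls back to the default
```

With it, a line such as price = first_text(info.xpath(".//div[@class='price']/span/text()")) would no longer stop the crawl on a listing with no price.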

Before long, most of the data has been scraped.
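
One thing to keep in mind: parse_html awaits each page before requesting the next, so the pages are still fetched one after another. If you want several requests in flight at once, asyncio.gather combined with a semaphore is one common approach. The sketch below is illustrative only: fetch_all, fake_fetch and max_concurrency are hypothetical names, and fake_fetch merely sleeps in place of a real request, so nothing touches the network.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL, with at most max_concurrency at a time.

    Results are returned in the same order as urls.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:  # wait for a free slot before fetching
            return await fetch(url)

    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


async def fake_fetch(url):
    # Stand-in for the spider's get(): sleeps instead of making a request.
    await asyncio.sleep(0.01)
    return "<html>{}</html>".format(url)


demo_urls = ["https://sjz.lianjia.com/zufang/pg{}/".format(p) for p in range(1, 4)]
pages = asyncio.run(fetch_all(demo_urls, fake_fetch, max_concurrency=2))
print(len(pages))
```

Keeping max_concurrency small is gentler on the site and makes it less likely that the crawler gets blocked.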
