Scraping Beike (ke.com) rental listings with Python

Published: 2020-07-05 16:13:56 Source: Internet Views: 766 Author: Rainbowhhy Category: Programming Languages

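I have been planning to move recently and have been browsing listings on various sites until my eyes glazed over. I wondered whether the basic details could be gathered in one place to make them easier to look through, so I used Python to scrape them into an Excel file, which is much easier to search.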

1. Extracting information with XPath from lxml

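XPath is a language for locating information in XML documents; it can be used to traverse the elements and attributes of an XML document. Regular expressions (the re module) can do much the same job, but for structured markup XPath has clear advantages: (1) it can search XML; (2) it also works on HTML; (3) it navigates by element and attribute.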
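To make the element-and-attribute navigation concrete, here is a minimal sketch using lxml. The markup in SAMPLE_HTML, the class names "listing" and "title", and the helper extract_listings are all invented for illustration; they are not taken from the real site.

```python
from lxml import etree

# Hypothetical markup, invented for illustration; it is not copied from the real site.
SAMPLE_HTML = """
<div class="listing">
  <p class="title"><a href="/zufang/1.html"> Sample Community A </a></p>
  <span><em>5000</em></span>
</div>
<div class="listing">
  <p class="title"><a href="/zufang/2.html"> Sample Community B </a></p>
  <span><em>6500</em></span>
</div>
"""


def extract_listings(html):
    """Return a list of (title, link, price) tuples from the sample markup."""
    tree = etree.HTML(html)
    rows = []
    # Select every <div> whose class attribute is exactly "listing".
    for node in tree.xpath('//div[@class="listing"]'):
        # Paths starting with "./" are evaluated relative to the current node.
        title = node.xpath('./p[@class="title"]/a/text()')[0].strip()
        link = node.xpath('./p[@class="title"]/a/@href')[0]
        price = node.xpath('./span/em/text()')[0]
        rows.append((title, link, price))
    return rows


if __name__ == "__main__":
    for row in extract_listings(SAMPLE_HTML):
        print(row)
```

Note that xpath() always returns a list, which is why the sketch indexes with [0]; on real pages it is safer to check that the list is non-empty first.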

2. Saving the data to Excel with the xlsxwriter module

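xlsxwriter is a library for writing Excel files. It lets us generate Excel workbooks quickly, in bulk, and automatically. It can write data, draw charts, and handle most common Excel tasks. Its drawback is that it can only create new files and cannot modify existing ones; if a new file has the same name as an existing file, the existing file is overwritten.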
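As a rough illustration of the create-only workflow, the sketch below writes a bold header row and a few data rows to a fresh workbook. The helper name write_rows, the demo file name, and the sample values are assumptions made up for this example.

```python
import os
import tempfile

import xlsxwriter


def write_rows(path, headers, rows):
    """Create a new workbook at path with a bold header row followed by rows."""
    # xlsxwriter always starts from an empty workbook; an existing file
    # at the same path is replaced when the workbook is closed.
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    bold = workbook.add_format({'bold': True})
    worksheet.write_row(0, 0, headers, bold)
    for row_index, row in enumerate(rows, start=1):
        worksheet.write_row(row_index, 0, row)
    # Nothing is written to disk until close() is called.
    workbook.close()


if __name__ == "__main__":
    # Hypothetical file name and values, used only for this demo.
    target = os.path.join(tempfile.gettempdir(), "xlsxwriter_demo.xlsx")
    write_rows(target, ["Community", "Price"], [["Sample Community A", "5000"]])
    print(os.path.exists(target))
```

Because existing files cannot be edited, a scraper that runs repeatedly would need a different file name per run (for example, one that includes the date) to keep earlier results.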

3. Scraping approach

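By observation, the Beike rental listings span 100 pages in total. We fetch the HTML of each page, extract the required fields into a dictionary, collect the dictionaries from all pages, and finally write the collected data to Excel.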
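The flow described above can be sketched as a small pipeline. The names page_urls and collect, and the stand-in fetch and parse functions, are hypothetical and exist only so the sketch runs without network access; the URL pattern is the one used for the Beijing listings in this article.

```python
def page_urls(total_pages=100):
    """Yield the listing URL for each page, from 1 to total_pages inclusive."""
    for page in range(1, total_pages + 1):
        yield "https://bj.zu.ke.com/zufang/pg{}/#contentList".format(page)


def collect(fetch, parse, total_pages=100):
    """Fetch every page, parse it into dicts, and return all dicts in one list."""
    records = []
    for url in page_urls(total_pages):
        records.extend(parse(fetch(url)))
    return records


if __name__ == "__main__":
    # Stand-ins for the real download and parsing steps (assumptions for the demo).
    def fake_fetch(url):
        return "<html>{}</html>".format(url)

    def fake_parse(html):
        return [{"source": html}]

    records = collect(fake_fetch, fake_parse, total_pages=3)
    print(len(records))
```

Passing the fetch and parse steps in as functions keeps the page loop testable on its own; the final step would hand the returned list to the Excel writer.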

4. Scraper source code

# @Author: Rainbowhhy
# @Date  : 19-6-25 6:35 PM


import requests
import time
from lxml import etree
import xlsxwriter


def get_html(page):
    """Fetch the HTML source of one listing page."""
    url = "https://bj.zu.ke.com/zufang/pg{}/#contentList".format(page)
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    # A timeout keeps one stalled request from hanging the whole crawl.
    response = requests.get(url, headers=headers, timeout=10).text
    return response


def parse_html(htmlcode, data):
    """Parse one page of HTML and append one dict per listing to data."""
    content = etree.HTML(htmlcode)
    results = content.xpath('//div[@class="content__article"]/div[1]/div')
    for result in results:
        title = result.xpath('./div[1]/p[@class="content__list--item--title twoline"]/a/text()')
        community = title[0].replace('\n', '').strip().split()[0]
        address = "-".join(result.xpath('./div/p[@class="content__list--item--des"]/a/text()'))
        # The brand (listing source) paragraph is missing on some listings.
        brand = result.xpath('./div/p[@class="content__list--item--brand oneline"]/text()')
        landlord = brand[0].replace('\n', '').strip() if brand else ""
        postime = result.xpath('./div/p[@class="content__list--item--time oneline"]/text()')[0]
        introduction = ",".join(result.xpath('./div/p[@class="content__list--item--bottom oneline"]/i/text()'))
        price = result.xpath('./div/span/em/text()')[0]
        description = "".join(result.xpath('./div/p[2]/text()')).replace('\n', '').replace('-', '').strip().split()
        # First token is the area, last token is the layout; with 3 to 6
        # tokens, everything in between is joined as the orientation.
        area = description[0]
        orientation = "".join(description[1:-1]) if 3 <= len(description) <= 6 else ""
        pattern = description[-1]
        # The floor is the second <span> text, when present.
        spans = result.xpath('./div/p[2]/span/text()')
        floor = "".join(spans[1].split()) if len(spans) > 1 else ""
        date_time = time.strftime("%Y-%m-%d", time.localtime())
        # Store the fields in a dict
        data_dict = {
            "community": community,
            "address": address,
            "landlord": landlord,
            "postime": postime,
            "introduction": introduction,
            "price": '¥' + price,
            "area": area,
            "orientation": orientation,
            "pattern": pattern,
            "floor": floor,
            "date_time": date_time
        }

        data.append(data_dict)


def excel_storage(records):
    """Write the list of listing dicts to an Excel file."""
    workbook = xlsxwriter.Workbook('./beikeHouse.xlsx')
    worksheet = workbook.add_worksheet()
    # Bold format for the header row
    bold_format = workbook.add_format({'bold': True})
    # (dict key, column header) pairs, in column order A to K
    columns = [
        ('community', 'Community name'),
        ('address', 'Address'),
        ('landlord', 'Listing source'),
        ('postime', 'Posted'),
        ('introduction', 'Notes'),
        ('price', 'Price'),
        ('area', 'Area'),
        ('orientation', 'Orientation'),
        ('pattern', 'Layout'),
        ('floor', 'Floor'),
        ('date_time', 'Date viewed'),
    ]
    for col, (_, header) in enumerate(columns):
        worksheet.write(0, col, header, bold_format)

    # Data rows start on the second row (index 1)
    for row, item in enumerate(records, start=1):
        for col, (key, _) in enumerate(columns):
            worksheet.write_string(row, col, item[key])
    workbook.close()


def main():
    all_datas = []
    # The site has 100 pages in total, so loop over pages 1 to 100 inclusive
    for page in range(1, 101):
        html = get_html(page)
        parse_html(html, all_datas)
    excel_storage(all_datas)


if __name__ == '__main__':
    main()

5. Screenshot of the results

[Screenshot: Scraping Beike rental listings with Python]
