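I have been getting ready to move recently, and browsing rental listings across various sites left me overwhelmed. I wondered whether the basic information could be gathered in one place to make it easier to look through, so I used Python to scrape it into an Excel file, where it is easy to search.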
1. Extracting information with XPath from lxml
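XPath is a language for locating information in XML documents; it can be used to traverse the elements and attributes of an XML document. Regular expressions (the re module) can do the same job with broadly similar results, but for scraping web pages XPath is usually the better fit, because it queries the parsed document tree rather than raw text. Its advantages are: (1) it can locate information in XML; (2) it also works on HTML; (3) it navigates by elements and attributes.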
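As a rough illustration of navigating by elements and attributes, the sketch below runs a few XPath queries with lxml against a small hand-written HTML fragment. The markup, the class names `listing`, `item` and `title`, and the function name `extract_listings` are all made up for this example; they are not the real structure of any site.

```python
from lxml import etree

# Hypothetical markup for illustration only; not the real site HTML.
SAMPLE_HTML = """
<div class="listing">
  <div class="item">
    <p class="title"><a href="/zufang/1.html"> Sunny Garden 2BR </a></p>
    <span><em>5200</em></span>
  </div>
  <div class="item">
    <p class="title"><a href="/zufang/2.html"> River View 1BR </a></p>
    <span><em>3800</em></span>
  </div>
</div>
"""


def extract_listings(html):
    """Return one dict (title, link, price) per item block in the HTML."""
    tree = etree.HTML(html)
    listings = []
    # Select every item block, then query relative to each block with "./".
    for item in tree.xpath('//div[@class="listing"]/div[@class="item"]'):
        title = item.xpath('./p[@class="title"]/a/text()')[0].strip()
        link = item.xpath('./p[@class="title"]/a/@href')[0]  # attribute access
        price = item.xpath('./span/em/text()')[0]
        listings.append({"title": title, "link": link, "price": price})
    return listings


if __name__ == "__main__":
    for row in extract_listings(SAMPLE_HTML):
        print(row)
```

Selecting the repeating container first and then using relative paths inside it keeps each record's fields together, which is harder to guarantee with a regular expression over the raw text.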
2. Saving the information to Excel with the xlsxwriter module
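xlsxwriter is a library for producing Excel files that lets us automate Excel work quickly and in bulk. It can write data, draw charts, and handle most common Excel tasks. Its drawback is that it can only create new files and cannot modify existing ones; if a new file is created with the same name as an existing file, the existing file is overwritten.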
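A minimal sketch of the write-only workflow described above: create a workbook, write a bold header row, write the data rows, and close the file. The helper name `write_rows`, the sample data and the output file name are assumptions for illustration.

```python
import os
import tempfile

import xlsxwriter


def write_rows(path, headers, rows):
    """Create a new workbook at `path`, replacing any file already there."""
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    bold = workbook.add_format({"bold": True})
    # Row 0 holds the bold header; data starts on row 1.
    worksheet.write_row(0, 0, headers, bold)
    for row_index, row in enumerate(rows, start=1):
        worksheet.write_row(row_index, 0, row)
    # Nothing is written to disk until close() is called.
    workbook.close()


if __name__ == "__main__":
    out = os.path.join(tempfile.gettempdir(), "xlsxwriter_demo.xlsx")
    write_rows(out, ["Community", "Price"],
               [["Sunny Garden", "¥5200"], ["River View", "¥3800"]])
    print(out)
```

Because an existing file cannot be opened for editing, a script that needs to add rows later has to regenerate the whole workbook from its data.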
3. Scraping approach
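The rental listings on Beike (ke.com) turn out to span 100 pages in total. We can fetch the HTML of each page in turn, extract the fields we need into a dictionary, pool the records from all pages, and finally write the collected dictionaries to Excel.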
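The page-by-page loop can be sketched without any network access as follows. The URL pattern is the one used in the source code of this article; `page_url`, `collect_all`, and the `fetch` and `parse` callables are hypothetical names introduced only for this sketch, with trivial stand-ins in place of real downloading and parsing.

```python
def page_url(page):
    """Build the listing URL for one page number."""
    return "https://bj.zu.ke.com/zufang/pg{}/#contentList".format(page)


def collect_all(fetch, parse, total_pages=100):
    """Fetch every page, parse it, and pool the records in one list."""
    records = []
    # range() excludes its upper bound, so add 1 to reach the last page.
    for page in range(1, total_pages + 1):
        html = fetch(page_url(page))
        records.extend(parse(html))
    return records


if __name__ == "__main__":
    # Stand-ins so the sketch runs offline; a real run would download
    # the page in fetch() and apply XPath queries in parse().
    fake_fetch = lambda url: "<html>{}</html>".format(url)
    fake_parse = lambda html: [{"source": html}]
    rows = collect_all(fake_fetch, fake_parse, total_pages=3)
    print(len(rows))
```

Passing the fetch and parse steps in as functions keeps the loop itself easy to test before pointing it at the live site.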
4. Crawler source code
# @Author: Rainbowhhy
# @Date  : 19-6-25 6:35 PM
import time

import requests
import xlsxwriter
from lxml import etree


def get_html(page):
    """Fetch the HTML source of one listing page."""
    url = "https://bj.zu.ke.com/zufang/pg{}/#contentList".format(page)
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    response = requests.get(url, headers=headers).text
    return response


def parse_html(htmlcode, data):
    """Parse one page of HTML and append a dict per listing to data."""
    content = etree.HTML(htmlcode)
    results = content.xpath('//div[@class="content__article"]/div[1]/div')
    for result in results:
        # Community name: first whitespace-separated token of the title.
        title = result.xpath('./div[1]/p[@class="content__list--item--title twoline"]/a/text()')[0]
        community = title.replace('\n', '').strip().split()[0]
        address = "-".join(result.xpath('./div/p[@class="content__list--item--des"]/a/text()'))
        # The landlord/brand line is missing on some listings.
        brand = result.xpath('./div/p[@class="content__list--item--brand oneline"]/text()')
        landlord = brand[0].replace('\n', '').strip() if len(brand) > 0 else ""
        postime = result.xpath('./div/p[@class="content__list--item--time oneline"]/text()')[0]
        introduction = ",".join(result.xpath('./div/p[@class="content__list--item--bottom oneline"]/i/text()'))
        price = result.xpath('./div/span/em/text()')[0]
        # The description line holds the area first, the layout last,
        # and the orientation (one to four tokens) in between.
        description = "".join(result.xpath('./div/p[2]/text()')).replace('\n', '').replace('-', '').strip().split()
        area = description[0]
        if 3 <= len(description) <= 6:
            orientation = "".join(description[1:-1])
        else:
            orientation = ""
        pattern = description[-1]
        spans = result.xpath('./div/p[2]/span/text()')
        floor = "".join(spans[1].replace('\n', '').strip().split()).strip() if len(spans) > 1 else ""
        date_time = time.strftime("%Y-%m-%d", time.localtime())
        # Store the record in a dictionary.
        data_dict = {
            "community": community,
            "address": address,
            "landlord": landlord,
            "postime": postime,
            "introduction": introduction,
            "price": '¥' + price,
            "area": area,
            "orientation": orientation,
            "pattern": pattern,
            "floor": floor,
            "date_time": date_time
        }
        data.append(data_dict)


def excel_storage(response):
    """Write the list of listing dictionaries to Excel."""
    workbook = xlsxwriter.Workbook('./beikeHouse.xlsx')
    worksheet = workbook.add_worksheet()
    # Bold format for the header row.
    bold_format = workbook.add_format({'bold': True})
    # (dictionary key, column header) pairs, in column order A to K.
    columns = [
        ('community', 'Community'),
        ('address', 'Address'),
        ('landlord', 'Listing source'),
        ('postime', 'Posted'),
        ('introduction', 'Rental notes'),
        ('price', 'Price'),
        ('area', 'Area'),
        ('orientation', 'Orientation'),
        ('pattern', 'Layout'),
        ('floor', 'Floor'),
        ('date_time', 'Date viewed'),
    ]
    for col, (_, header) in enumerate(columns):
        worksheet.write(0, col, header, bold_format)
    # Data starts on the second row.
    for row, item in enumerate(response, start=1):
        for col, (key, _) in enumerate(columns):
            worksheet.write_string(row, col, item[key])
    workbook.close()


def main():
    all_datas = []
    # The site has 100 pages in total; range() excludes its upper bound.
    for page in range(1, 101):
        html = get_html(page)
        parse_html(html, all_datas)
    excel_storage(all_datas)


if __name__ == '__main__':
    main()
5. Screenshot of the results