This article shares how to crawl second-hand housing data from 我爱我家 (5i5j.com) with a Python crawler. It is a practical exercise, so it is shared here as a reference; follow along below.
First, run the following code to reproduce the problem:
# -*-coding:utf-8-*-
import re
import requests
from bs4 import BeautifulSoup

cookie = 'PHPSESSID=aivms4ufg15sbrj0qgboo3c6gj; HMF_CI=4d8ff20092e9832daed8fe5eb0475663812603504e007aca93e6630c00b84dc207; _ga=GA1.2.556271139.1620784679; gr_user_id=4c878c8f-406b-46a0-86ee-a9baf2267477; _dx_uzZo5y=68b673b0aaec1f296c34e36c9e9d378bdb2050ab4638a066872a36f781c888efa97af3b5; smidV2=20210512095758ff7656962db3adf41fa8fdc8ddc02ecb00bac57209becfaa0; yfx_c_g_u_id_10000001=_ck21051209583410015104784406594; __TD_deviceId=41HK9PMCSF7GOT8G; zufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E8%A1%97%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; ershoufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fershoufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; zufang_BROWSES=501465046,501446051,90241951,90178388,90056278,90187979,501390110,90164392,90168076,501472221,501434480,501480593,501438374,501456072,90194547,90223523,501476326,90245144; historyCity=["\u5317\u4eac"]; _gid=GA1.2.23153704.1621410645; Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1620784715,1621410646; _Jo0OQK=4958FA78A5CC420C425C480565EB46670E81832D8173C5B3CFE61303A51DE43E320422D6C7A15892C5B8B66971ED1B97A7334F0B591B193EBECAAB0E446D805316B26107A0B847CA53375B268E06EC955BB75B268E06EC955BB9D992FB153179892GJ1Z1OA==; ershoufang_BROWSES=501129552; domain=bj; 8fcfcf2bd7c58141_gr_session_id=61676ce2-ea23-4f77-8165-12edcc9ed902; 8fcfcf2bd7c58141_gr_session_id_61676ce2-ea23-4f77-8165-12edcc9ed902=true; yfx_f_l_v_t_10000001=f_t_1620784714003__r_t_1621471673953__v_t_1621474304616__r_c_2; Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=1621475617'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36',
    # requests requires header values to be latin-1 encodable; this round-trip
    # turns the cookie's non-ASCII characters into latin-1-safe ones
    'Cookie': cookie.encode("utf-8").decode("latin1")
}


def run():
    base_url = 'https://bj.5i5j.com/ershoufang/xichengqu/n%d/'
    for page in range(1, 11):
        url = base_url % page
        print(url)
        html = requests.get(url, headers=headers).text
        soup = BeautifulSoup(html, 'lxml')
        try:
            for li in soup.find('div', class_='list-con-box').find('ul', class_='pList').find_all('li'):
                title = li.find('h4', class_='listTit').get_text()  # listing title
                # print(title)
        except Exception as e:
            print(e)
            print(html)
            break


if __name__ == '__main__':
    run()
Running it, you will find that when the script fetches https://bj.5i5j.com/ershoufang/xichengqu/n1/ (the failing page number may vary), it raises: 'NoneType' object has no attribute 'find'. Inspecting the html that gets printed shows that its entire content is:

<HTML><HEAD><script>window.location.href="https://bj.5i5j.com/ershoufang/xichengqu/n1/?wscckey=0f36b400da92f41d_1621823822";</script></HEAD><BODY>

Yet the very same link shows data when opened in a browser; the browser is simply redirected, and the post-redirect URL is exactly the href value in the html above. It is therefore reasonable to infer that for some page URLs, 我爱我家 does not return the data directly, but instead returns a response that carries the correct link; extracting that link with a regular expression lets us fetch the data properly.
The complete code below handles this in three steps:

1. Check whether the current html actually contains listing data.
2. If it does not, extract the correct link with a regular expression.
3. Re-fetch the html from that link.

The key snippet:
if '<HTML><HEAD><script>window.location.href=' in html:
    url = re.search(r'.*?href="(.+)".*?', html).group(1)
    html = requests.get(url, headers=headers).text
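If you prefer to keep the redirect handling out of the crawl loop, it can also be folded into a small helper. The sketch below is only an assumption-laden variant of the snippet above, not the article's final code: fetch and max_hops are names invented here, headers is the dict defined earlier, and the hop limit merely guards against the unlikely case of the site redirecting forever:

def fetch(url, max_hops=3):
    """Fetch a page, transparently following the site's JS-based redirect."""
    for _ in range(max_hops):  # guard against an endless redirect loop
        html = requests.get(url, headers=headers).text
        if '<HTML><HEAD><script>window.location.href=' not in html:
            return html
        # The response is only a JS redirect; extract the real URL and retry
        url = re.search(r'window\.location\.href="([^"]+)"', html).group(1)
    return html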
The complete code:

# -*-coding:utf-8-*-
import os
import re
import requests
import csv
from bs4 import BeautifulSoup

folder_path = os.path.split(os.path.abspath(__file__))[0] + os.sep  # directory containing this script

cookie = 'PHPSESSID=aivms4ufg15sbrj0qgboo3c6gj; HMF_CI=4d8ff20092e9832daed8fe5eb0475663812603504e007aca93e6630c00b84dc207; _ga=GA1.2.556271139.1620784679; gr_user_id=4c878c8f-406b-46a0-86ee-a9baf2267477; _dx_uzZo5y=68b673b0aaec1f296c34e36c9e9d378bdb2050ab4638a066872a36f781c888efa97af3b5; smidV2=20210512095758ff7656962db3adf41fa8fdc8ddc02ecb00bac57209becfaa0; yfx_c_g_u_id_10000001=_ck21051209583410015104784406594; __TD_deviceId=41HK9PMCSF7GOT8G; zufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E8%A1%97%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; ershoufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fershoufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; zufang_BROWSES=501465046,501446051,90241951,90178388,90056278,90187979,501390110,90164392,90168076,501472221,501434480,501480593,501438374,501456072,90194547,90223523,501476326,90245144; historyCity=["\u5317\u4eac"]; _gid=GA1.2.23153704.1621410645; Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1620784715,1621410646; _Jo0OQK=4958FA78A5CC420C425C480565EB46670E81832D8173C5B3CFE61303A51DE43E320422D6C7A15892C5B8B66971ED1B97A7334F0B591B193EBECAAB0E446D805316B26107A0B847CA53375B268E06EC955BB75B268E06EC955BB9D992FB153179892GJ1Z1OA==; ershoufang_BROWSES=501129552; domain=bj; 8fcfcf2bd7c58141_gr_session_id=61676ce2-ea23-4f77-8165-12edcc9ed902; 8fcfcf2bd7c58141_gr_session_id_61676ce2-ea23-4f77-8165-12edcc9ed902=true; yfx_f_l_v_t_10000001=f_t_1620784714003__r_t_1621471673953__v_t_1621474304616__r_c_2; Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=1621475617'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36',
    # requests requires header values to be latin-1 encodable; this round-trip
    # turns the cookie's non-ASCII characters into latin-1-safe ones
    'Cookie': cookie.encode("utf-8").decode("latin1")
}


def get_page(url):
    """Fetch the raw page html."""
    html = requests.get(url, headers=headers).text
    return html


def extract_info(html):
    """Parse the page and extract the fields of each listing."""
    host = 'https://bj.5i5j.com'
    soup = BeautifulSoup(html, 'lxml')
    data = []
    for li in soup.find('div', class_='list-con-box').find('ul', class_='pList').find_all('li'):
        try:
            title = li.find('h4', class_='listTit').get_text()  # listing title
            url = host + li.find('h4', class_='listTit').a['href']  # listing link
            info_li = li.find('div', class_='listX')  # core info of every listing lives here
            p1 = info_li.find_all('p')[0].get_text()  # first paragraph
            info1 = [i.strip() for i in p1.split(' · ')]  # layout, area, orientation, floor, decoration, build year
            house_type, area, direction, floor, decoration, build_year = info1
            p2 = info_li.find_all('p')[1].get_text()  # second paragraph
            info2 = [i.replace(' ', '') for i in p2.split('·')]  # residence, ring road, transport
            if len(info2) == 2:
                residence, ring = info2
                transport = ''  # some listings carry no transport info
            elif len(info2) == 3:
                residence, ring, transport = info2
            else:
                residence, ring, transport = ['', '', '']
            p3 = info_li.find_all('p')[2].get_text()  # third paragraph
            info3 = [i.replace(' ', '') for i in p3.split('·')]  # watchers, viewings, release date
            try:
                watch, arrive, release_year = info3
            except Exception as e:
                print(info3, 'failed to parse watcher/viewing/release info')
                watch, arrive, release_year = ['', '', '']
            total_price = li.find('p', class_='redC').get_text().strip()  # total price
            univalence = li.find('div', class_='jia').find_all('p')[1].get_text().replace('单价', '')  # unit price
            else_info = li.find('div', class_='listTag').get_text()
            data.append([title, url, house_type, area, direction, floor, decoration, residence, ring,
                         transport, total_price, univalence, build_year, release_year, watch, arrive, else_info])
        except Exception as e:
            print('extract_info: ', e)
    return data


def crawl():
    esf_url = 'https://bj.5i5j.com/ershoufang/'  # listing index page
    # 17 header fields, matching the 17 columns produced by extract_info
    fields = ['名称', '链接', '户型', '面积', '朝向', '楼层', '装修', '小区', '环', '交通情况',
              '总价', '单价', '建成时间', '发布时间', '关注', '带看', '其他信息']
    os.makedirs(folder_path + 'data', exist_ok=True)  # make sure the output directory exists
    f = open(folder_path + 'data' + os.sep + '北京二手房-我爱我家.csv', 'w', newline='', encoding='gb18030')
    writer = csv.writer(f, delimiter=',')  # comma-separated
    writer.writerow(fields)
    page = 1
    regex = re.compile(r'.*?href="(.+)".*?')
    while True:
        url = esf_url + 'n%s/' % page  # build the page URL
        if page == 1:
            url = esf_url
        html = get_page(url)
        # Some page URLs do not return data directly; detect that case, pull the
        # correct link out of the returned html, and fetch again
        if '<HTML><HEAD><script>window.location.href=' in html:
            url = regex.search(html).group(1)
            html = requests.get(url, headers=headers).text
        print(url)
        data = extract_info(html)
        if not data:
            break  # an empty page means we have run past the last one
        writer.writerows(data)
        page += 1
    f.close()


if __name__ == '__main__':
    crawl()  # start the crawler
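As a quick sanity check after a run, you can count the rows the crawler wrote. A hypothetical snippet (the csv_path below assumes the output location used by crawl() and is not part of the original code):

import csv
import os

# Count the data rows crawl() wrote, minus the header row
csv_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data', '北京二手房-我爱我家.csv')
with open(csv_path, encoding='gb18030', newline='') as f:
    row_count = sum(1 for _ in csv.reader(f)) - 1
print('listings captured:', row_count)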
As of May 23, 2021, the crawler had collected 62,943 records, essentially all of the second-hand housing listings for Beijing on the 我爱我家 site.
Thanks for reading! That is everything to share on how to crawl 我爱我家 second-hand housing data with a Python crawler. Hopefully the above is of some help; if you found the article worthwhile, feel free to share it so more people can see it.