How to Get Chengdu Rental Listings with Python

Published: 2021-10-28 18:15:56 | Source: 億速云 | Views: 149 | Author: 柒染 | Category: Programming Languages

This article walks through how to collect Chengdu rental listings with Python. It is shared here as a reference and should leave you with a working understanding of the techniques involved.

The first step is gathering the data. Here we collect listings from two sites: Ganji (趕集網) and Ziroom (自如網).

1. Ganji data collection


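I. Fetching the current page

The page structure is regular, so fetching the page and parsing it with XPath is enough, and each field is easy to extract. We loop over every listing div, pull out its fields one by one, and append them to a list that is returned at the end.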

import re

import requests
from lxml import etree


def get_this_page_gj(url, tmp):
    html = etree.HTML(requests.get(url).text)
    # One div per listing on the page
    divs = html.xpath('//div[@class="f-list-item ershoufang-list"]')
    for div in divs:
        title = div.xpath('./dl/dd[@class="dd-item title"]/a/text()')[0]
        house_url = div.xpath('./dl/dd[@class="dd-item title"]/a/@href')[0]
        size = "、".join(div.xpath('./dl/dd[@class="dd-item size"]/span/text()'))
        # Query the current div; divs[0] would repeat the first listing's address on every row
        address = '-'.join([
            data.strip()
            for data in div.xpath('./dl/dd[@class="dd-item address"][1]//a//text()')
            if data.strip() != ''
        ])
        agent_string = div.xpath('./dl/dd[@class="dd-item address"][2]/span/span/text()')[0]
        agent = re.sub(' ', '', agent_string)
        price = div.xpath('./dl/dd[@class="dd-item info"]/div[@class="price"]/span[@class="num"]/text()')[0]
        tmp.append([
            title, size, price, address, agent, house_url
        ])
    return tmp

II. Building the URLs

Request the index page, read the total number of pages, build each page URL following the site's pattern, and call the per-page function on it. Every page URL starts with http://cd.ganji.com/zufang/pn, followed by the page number.

def house_gj(headers):
    index_url = 'http://cd.ganji.com/zufang/'
    # get_html is a helper that this article does not show; it is expected to return the page source
    html = etree.HTML(get_html(index_url, headers))
    # The second-to-last pager link holds the total page count
    total = html.xpath('//div[@class="pageBox"]/a[position() = last() -1]/span/text()')[0]
    result = []
    for num in range(1, int(total) + 1):
        result += get_this_page_gj('http://cd.ganji.com/zufang/pn{}'.format(num), [])
        print('Finished reading page {} / Ganji'.format(num))
    return result
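The URL pattern can be checked on its own before any request is sent. The following is only an illustrative sketch: page_urls is a hypothetical helper that does not appear in the original code, and the page count of 3 is made up.

```python
def page_urls(prefix, total):
    """Return the listing-page URLs for pages 1..total (inclusive)."""
    return ['{}{}'.format(prefix, num) for num in range(1, int(total) + 1)]


# The total is scraped from the pager as text, hence the int() conversion.
urls = page_urls('http://cd.ganji.com/zufang/pn', '3')
for url in urls:
    print(url)
```

Printing the list first makes it easy to confirm that the first and last pages are included before starting a long crawl.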

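2. Ziroom data collection

Ziroom is laid out much like Ganji, so the same approach applies: we scrape the basic fields plus the listing URL. The difference is the price, which is harder to get. It is not shown as text but rendered from an image plus an offset.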


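I. Extracting the price

Each digit is drawn from an image: the offset set in the style attribute selects the digit from a source image. The source image also differs from page to page, which makes this awkward to handle.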


Looking closely, the digits in the source image are evenly spaced; you can edit the offset value in the page to see the pattern. Adjacent digits are 21.4px apart and the offset is measured from the left edge of the source image, so the offset identifies the digit: digit index = |offset / 21.4|. Depending on the page image and content there can be a tiny error, so round the result to an integer and use it to index into the list of digits read from the source image. We use tesseract to read the image.

......
......
price_strings = div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')
offset_list = []
for data in price_strings:
    offset_list.append(re.findall(r'position: (.*?)px', data)[0])
style_string = html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]
pic = "http:" + re.findall(r'background-image: url\((.*?)\);.*?', style_string)[0]
price = get_price_zr(pic, offset_list)

import pytesseract
import requests
from PIL import Image


def get_price_zr(pic_url, offset_list):
    '''
    index holds the position of every price digit; once the image has been
    parsed, the digit at each position is looked up
    '''
    index, price = [], []
    with open('pic.png', 'wb') as f:
        f.write(requests.get(pic_url).content)
    code_list = list(pytesseract.image_to_string(Image.open('pic.png')))
    for data in offset_list:
        # float() instead of eval(): the offset is scraped text and should not be executed.
        # round() instead of truncation, so a tiny floating-point error cannot drop the index by one.
        index.append(round(abs(float(data)) / 21.4))
    for data in index:
        price.append(code_list[data])
    return "".join(price)

pic_url is the address of the page's source image. It is downloaded and read with pytesseract, and the function returns the string of digits found at each index (the price). offset_list is the list of offsets scraped for each digit.
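The offset-to-digit mapping can be checked without downloading or OCR-ing anything. The sketch below is a worked example under stated assumptions: offsets_to_price is a hypothetical helper, and both the digit string and the offsets are made-up sample values standing in for the OCR result and the scraped style attributes. It rounds the quotient, following the rounding step described above.

```python
DIGIT_WIDTH = 21.4  # px between adjacent digits, as measured in the article


def offsets_to_price(digits, offsets, width=DIGIT_WIDTH):
    """Map each CSS offset to a position in `digits` and join the results."""
    return ''.join(digits[round(abs(float(off)) / width)] for off in offsets)


# Hypothetical OCR result for one page's source image.
digits = '8652039147'
# Hypothetical offsets, in the text form scraped from the style attributes.
offsets = ['-171.2', '-21.4', '-0', '-107']

print(offsets_to_price(digits, offsets))  # prints 4683
```

Here 171.2 / 21.4 = 8, 21.4 / 21.4 = 1, 0 / 21.4 = 0 and 107 / 21.4 = 5, so the positions 8, 1, 0, 5 in the digit string give 4, 6, 8, 3.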

II. Fetching the current page

As with Ganji, we write a function that fetches one page of data and then call it with each page's URL. Worth noting here are the XPath contains() function and the regular expression that extracts the source image link.

def get_this_page_zr(url, tmp):
    html = etree.HTML(requests.get(url).text)
    divs = html.xpath('//div[@class="item"]')
    for div in divs:
        if div.xpath('./div[@class="info-box"]/h6/a/text()'):
            title = div.xpath('./div[@class="info-box"]/h6/a/text()')[0]
        else:
            continue
        link = 'http:' + div.xpath('./div[@class="info-box"]/h6/a/@href')[0]
        location = div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[@class="location"]/text()')[0]
        area = div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[contains(text(), "㎡")]/text()')[0]
        price_strings = div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')
        offset_list = []
        for data in price_strings:
            offset_list.append(re.findall(r'position: (.*?)px', data)[0])
        style_string = html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]
        pic = "http:" + re.findall(r'background-image: url\((.*?)\);.*?', style_string)[0]
        price = get_price_zr(pic, offset_list)
        tag = '、'.join(div.xpath('./div[@class="info-box"]//div[@class="tag"]/span/text()'))
        tmp.append([
            title, tag, price, area, location, link
        ])
    return tmp
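To see what the two regular expressions capture, they can be run against a single style string. This is a minimal sketch: the style value below is made up to match the shape the patterns expect (a protocol-relative image URL followed by a position offset) and is not copied from Ziroom.

```python
import re

# Made-up style attribute in the shape the regular expressions expect.
style = 'background-image: url(//example.com/price-sprite.png);background-position: -171.2px'

# The lazy group stops at the first "px", leaving just the numeric offset.
offset = re.findall(r'position: (.*?)px', style)[0]
# The lazy group stops at the first ");", leaving the protocol-relative URL.
pic = 'http:' + re.findall(r'background-image: url\((.*?)\);.*?', style)[0]

print(offset)
print(pic)
```

The captured URL has no scheme, which is why the code prepends "http:" before downloading the image.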

III. Building the URLs

The approach is the same as for Ganji. The thing to note is the XPath predicate position()=last()-1, which selects the second-to-last link in the pager.

def house_zr(headers):
    index_url = 'http://cd.ziroom.com/z/'
    # get_html is a helper that this article does not show; it is expected to return the page source
    html = etree.HTML(get_html(index_url, headers))
    total = html.xpath('//div[@class="Z_pages"]/a[position()=last()-1]/text()')[0]
    result = []
    for num in range(1, int(total) + 1):
        result += get_this_page_zr('http://cd.ziroom.com/z/p{}/'.format(num), [])
        print('Finished reading page {} / Ziroom'.format(num))
    return result
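The second-to-last selection can be tried on a small fragment. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml, with the abbreviated predicate a[last()-1] that ElementTree supports. The pager markup is made up, and the assumption that the final link is a "next" button is only illustrative.

```python
import xml.etree.ElementTree as ET

# Made-up pager markup; the real page's structure may differ.
PAGER = (
    '<div class="Z_pages">'
    '<a>1</a><a>2</a><a>3</a><a>50</a><a>next</a>'
    '</div>'
)

root = ET.fromstring('<body>' + PAGER + '</body>')
# a[last()-1] picks the second-to-last <a>, skipping the trailing "next" link.
total = int(root.find("./div[@class='Z_pages']/a[last()-1]").text)

print(total)  # prints 50
```

If the pager ended with the highest page number instead, a[last()] would be the right predicate, so it is worth inspecting the page before relying on either form.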

That covers how to collect Chengdu rental listings with Python. Hopefully it is useful; if you found it helpful, feel free to share it.
