How to Scrape a Job Site with Python: the requests and BeautifulSoup Libraries

Published: 2021-10-21 17:04:45 | Source: Yisu Cloud | Views: 142 | Author: iii | Category: Development

This article explains how to scrape web pages with Python using the requests and BeautifulSoup libraries. The explanations are kept short and clear so that they are easy to follow. Let's work through them step by step.

Contents

I. The requests library

1. Introduction to requests
2. Installing requests
3. Fetching page data with requests
4. Summary of some requests attributes

II. The BeautifulSoup library

1. Introduction to BeautifulSoup
2. Installing BeautifulSoup
3. Parsing and extracting fetched data with BeautifulSoup
4. BeautifulSoup methods for extracting data
I. The requests library

1. Introduction to requests

requests is a third-party library for sending HTTP requests. It lets you send HTTP/1.1 requests without manually adding query strings to URLs or form-encoding POST data. Keep-alive and HTTP connection pooling are fully automatic, powered by urllib3, which requests is built on. In short, the library makes it easy to send a request to a website, fetch the page data, and read the response body and status code that the server returns.

Chinese documentation for requests: https://requests.kennethreitz.org/zh_CN/latest/
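As a minimal sketch of the query-string and form-encoding behaviour described above, the snippet below builds two requests without sending them, so it runs offline. The example.com URLs and the parameter names are illustrative assumptions, not part of the original article.

```python
import requests

# Build (but do not send) a GET request; requests encodes params into the URL.
get_req = requests.Request(
    "GET",
    "https://example.com/search",   # hypothetical URL, for illustration only
    params={"q": "python", "page": 2},
).prepare()
print(get_req.url)

# Build a POST request; a dict passed as data is form-encoded automatically.
post_req = requests.Request(
    "POST",
    "https://example.com/login",    # hypothetical URL, for illustration only
    data={"user": "alice", "lang": "py"},
).prepare()
print(post_req.body)
print(post_req.headers["Content-Type"])
```

In everyday use you would pass the same params or data arguments straight to requests.get() or requests.post(); preparing the request by hand is only a convenient way to inspect what would be sent.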

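2. Installing requests

requests is not part of the Python standard library, although some Python distributions bundle it. If it is not installed, run the following command on the command line:

pip install requests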

3. Fetching page data with requests

First, import the module:

import requests

Then send a request to the site you want data from. The examples below use the QQ Music home page:

res = requests.get('https://y.qq.com/')  # send a GET request
print(res)  # prints <Response [200]>

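The 200 in the output is the HTTP response status code. The table below lists what each class of status code means:

Status code	Meaning
1xx	Informational: the request was received and processing continues
2xx	Success
3xx	Redirection
4xx	Client error
5xx	Server error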
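The classes in the table can be sketched as a small helper. This uses only the standard library; the function name status_class and the sample codes are illustrative choices, not something the original article defines.

```python
from http import HTTPStatus

# Map the first digit of a status code to the classes listed in the table.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the class of an HTTP status code, or 'unknown' if out of range."""
    return STATUS_CLASSES.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code.
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

With a real response, status_class(res.status_code) gives the same classification. requests also provides res.raise_for_status(), which raises an exception for 4xx and 5xx responses.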

To get the HTML source of the QQ Music home page:

res = requests.get('https://y.qq.com/')  # send a GET request
print(res.text)  # res.text is the page source as a string

4. Summary of some requests attributes

Attribute	Meaning
res.status_code	HTTP status code
res.text	Response body decoded to text (str)
res.content	Response body as raw bytes
res.encoding	Encoding used to decode the response body into res.text
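To make the difference between the text and bytes forms concrete, here is a small sketch that collects the four attributes into a dictionary. The helper summarize and the stand-in object fake_res are assumptions introduced so the example runs without network access; a real call would pass the result of requests.get(url) instead.

```python
from types import SimpleNamespace


def summarize(res) -> dict:
    """Collect the four attributes from the table into a small dict."""
    return {
        "status": res.status_code,
        "encoding": res.encoding,
        "chars": len(res.text),     # length of the decoded string
        "bytes": len(res.content),  # length of the raw byte payload
    }


# Stand-in for a real response object (an assumption, so the sketch runs offline).
html = "<h3>歌單推薦</h3>"
fake_res = SimpleNamespace(
    status_code=200,
    encoding="utf-8",
    text=html,
    content=html.encode("utf-8"),
)

info = summarize(fake_res)
print(info)
```

The byte count is larger than the character count because each Chinese character takes several bytes in UTF-8, which is why res.content and res.text are not interchangeable.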

Now that we know how to fetch a page's source, let's look at how to extract data from it with BeautifulSoup.

II. The BeautifulSoup library

1. Introduction to BeautifulSoup

BeautifulSoup is a third-party Python library for parsing HTML. With it, we can extract specific data from a page's source based on the HTML tags that contain it. BeautifulSoup is usually used together with requests. If you are not familiar with HTML, it is worth looking up the most common tags first.

2. Installing BeautifulSoup

Likewise, if the library is not installed, run the following command on the command line:

pip install beautifulsoup4

3. Parsing and extracting fetched data with BeautifulSoup

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent makes the request less likely to be blocked by anti-scraping checks
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)
# First argument: the HTML text; second argument: the parser ('html.parser' is Python's built-in HTML parser)
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)  # prints the source of the QQ Music home page

Partial output:

[Screenshot: partial output of print(soup)]

The output shows that the page source has been parsed into a BeautifulSoup object. You might ask: res.text already contains the page source, so why convert it into a BeautifulSoup object?
Let's first compare their types with type():

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent makes the request less likely to be blocked by anti-scraping checks
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)
soup = BeautifulSoup(res.text, 'html.parser')  # parse with Python's built-in HTML parser
print(type(res.text))
print(type(soup))

Output:

[Screenshot: output of the two type() calls]

We can see that res.text is a string, while soup is a BeautifulSoup object. Compared with a plain string, a BeautifulSoup object offers many more methods for quickly extracting the data we need. That is why the extra step is worthwhile.
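The contrast can be sketched on a small inline page, so no network request is needed. The HTML below is a made-up sample for illustration; only the icon_txt element mirrors the QQ Music markup discussed in this article.

```python
from bs4 import BeautifulSoup

# Made-up sample page (an assumption for illustration).
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h3 class="index__tit"><i class="icon_txt">歌單推薦</i></h3>
    <a href="/playlist/1">First</a>
    <a href="/playlist/2">Second</a>
  </body>
</html>
"""

# As a plain string, only text-level operations such as substring search are available.
print(type(html).__name__, "icon_txt" in html)

# As a BeautifulSoup object, the document becomes a tree that can be navigated.
soup = BeautifulSoup(html, "html.parser")
print(type(soup).__name__)
print(soup.title.string)                        # text of the <title> tag
print([a["href"] for a in soup.find_all("a")])  # every link target
```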

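4. BeautifulSoup methods for extracting data

Start with the two most commonly used methods:

Method	Purpose
find()	Returns the first element that matches the criteria
find_all()	Returns all elements that match the criteria

The arguments passed to these two methods are the filter criteria. So what arguments can we pass?

We will try both methods on the following fragment taken from the source of the QQ Music home page: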

<div class="index__hd">
    <h3 class="index__tit"><i class="icon_txt">歌單推薦</i></h3>
</div>
<!-- 切換 -->
<div class="mod_index_tab" data-stat="y_new.index.playlist">
    <a href="javascript:;" rel="external nofollow" class="index_tab__item index_tab__item--current js_tag" data-index="0" data-type="recomPlaylist" data-id="1">為你推薦</a>
    <a href="javascript:;" rel="external nofollow" class="index_tab__item js_tag" data-type="playlist" data-id="3056">網絡歌曲</a>
    <a href="javascript:;" rel="external nofollow" class="index_tab__item js_tag" data-type="playlist" data-id="3256">綜藝</a>
    <a href="javascript:;" rel="external nofollow" class="index_tab__item js_tag" data-type="playlist" data-id="59">經典</a>
    <a href="javascript:;" rel="external nofollow" class="index_tab__item js_tag" data-type="playlist" data-id="3317">官方歌單</a>
    <a href="javascript:;" rel="external nofollow" class="index_tab__item js_tag" data-type="playlist" data-id="71">情歌</a>
</div>

The find() method

To get the 歌單推薦 ("Playlist recommendations") heading, first identify the HTML tag that holds it: it sits in an i tag with class="icon_txt". We can then extract it like this:

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent makes the request less likely to be blocked by anti-scraping checks
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)
soup = BeautifulSoup(res.text, 'html.parser')  # parse with Python's built-in HTML parser
print(soup.find('i', class_='icon_txt'))  # find the first i tag with class "icon_txt"

Because class is a reserved keyword in Python (it defines classes), BeautifulSoup uses class_ to refer to the HTML class attribute.

Output:

[Screenshot: the matching i tag]
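As a hedged aside, the class_ keyword is one of several equivalent ways to express the same filter. The sketch below runs on an inline copy of the heading markup rather than the live page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<h3 class="index__tit"><i class="icon_txt">歌單推薦</i></h3>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("i", class_="icon_txt")          # class_ keyword argument
by_attrs = soup.find("i", attrs={"class": "icon_txt"})  # attrs dictionary
by_css = soup.select_one("i.icon_txt")                  # CSS selector

print(by_keyword)
print(by_keyword == by_attrs == by_css)

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("i", class_="no_such_class")
print(missing is None)
```

Checking for None matters in real scrapers: if the page layout changes, calling .text on a missing result raises an AttributeError.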

The find_all() method

To extract all the category tabs under 歌單推薦, use find_all().

As before, we see that these tabs are a tags with class="index_tab__item js_tag". To avoid matching other tags in the page source that share the same class, we add a second condition, data-type="playlist". How is that done?

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent makes the request less likely to be blocked by anti-scraping checks
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)
soup = BeautifulSoup(res.text, 'html.parser')  # parse with Python's built-in HTML parser
print(soup.find('i', class_='icon_txt'))
# Filter on two attributes at once by passing a dictionary as attrs
items = soup.find_all('a', attrs={"class": "index_tab__item js_tag", "data-type": "playlist"})
print(items)  # list of matching a tags

The approach is to pass a dictionary through the attrs parameter (the second parameter) that lists the attributes to filter on.

Output:

[Screenshot: list of matching a tags]
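How the class filter behaves is easy to check offline. The sketch below uses a shortened copy of the fragment shown earlier (href and rel attributes omitted, only four tabs kept), so the counts apply to this sample rather than to the live page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the tab markup (an abbreviation made for this example).
html = """
<div class="mod_index_tab">
  <a class="index_tab__item index_tab__item--current js_tag" data-type="recomPlaylist" data-id="1">為你推薦</a>
  <a class="index_tab__item js_tag" data-type="playlist" data-id="3056">網絡歌曲</a>
  <a class="index_tab__item js_tag" data-type="playlist" data-id="3256">綜藝</a>
  <a class="index_tab__item js_tag" data-type="playlist" data-id="59">經典</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word class string is compared with the exact value of the class
# attribute, so the first tab (which carries an extra class) is excluded.
exact = soup.find_all("a", attrs={"class": "index_tab__item js_tag", "data-type": "playlist"})
print(len(exact))

# A single class name matches any tag that has that class among its classes.
any_tab = soup.find_all("a", class_="index_tab__item")
print(len(any_tab))

# limit caps the number of results returned.
first_two = soup.find_all("a", class_="js_tag", limit=2)
print([a.text for a in first_two])
```

Because the multi-word form depends on the exact order of class names in the page, matching on a single class plus another attribute such as data-type is often the more robust choice.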

From these two examples we can see that find() returns a Tag object and find_all() returns a list of Tag objects. We usually do not need the whole tag, only its text or its href (link) attribute. The code below extracts the text:

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent makes the request less likely to be blocked by anti-scraping checks
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)
soup = BeautifulSoup(res.text, 'html.parser')  # parse with Python's built-in HTML parser
tag1 = soup.find('i', class_='icon_txt')
print(tag1.text)  # text inside the i tag
items = soup.find_all('a', attrs={"class": "index_tab__item js_tag", "data-type": "playlist"})
for i in items:  # iterate over the list of Tag objects
    tag2 = i.text
    print(tag2)

Output:

[Screenshot: the extracted heading and tab text]

This gives us the text of each tab. To extract the value of an attribute from a tag instead, index the Tag object with the attribute name, as in tag['attribute'].
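A minimal sketch of attribute access, run on a single a tag copied from the fragment above so that it works offline:

```python
from bs4 import BeautifulSoup

html = '<a href="javascript:;" class="index_tab__item js_tag" data-type="playlist" data-id="3056">網絡歌曲</a>'
tag = BeautifulSoup(html, "html.parser").find("a")

print(tag["href"])       # index the Tag with the attribute name
print(tag["data-id"])
print(tag["class"])      # class is multi-valued, so it comes back as a list
print(tag.get("title"))  # .get() returns None for a missing attribute
print(tag.attrs)         # all attributes as a dictionary
```

Indexing with a missing attribute name raises a KeyError, so .get() is the safer choice when an attribute may be absent.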

Thank you for reading. That concludes this walkthrough of scraping with Python's requests and BeautifulSoup libraries. You should now have a better feel for how the two fit together; the details are best confirmed by trying them out in practice.
