How to use the BeautifulSoup library in Python

Published: 2021-02-19 15:31:56  Source: Yisu Cloud (億速云)  Views: 213  Author: Leah  Category: Development

Today we will talk about how to use the BeautifulSoup library in Python. Many people may not be familiar with it, so the editor has summarized the following; hopefully you will get something out of this article.

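1. Introduction to the BeautifulSoup library

The BeautifulSoup library, nicknamed "靚湯" ("beautiful soup") among Chinese-speaking Python users, is an HTML/XML parser like lxml; its main job is to parse HTML/XML and extract data from it. BeautifulSoup supports the HTML parser in the Python standard library (html.parser) as well as several third-party parsers such as lxml and html5lib. If no parser is specified, BeautifulSoup picks the best one that is installed: lxml if available, then html5lib, and otherwise Python's built-in html.parser. The lxml parser is faster and more capable, and using it through BeautifulSoup combines lxml's speed with BeautifulSoup's API.

Note that Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You normally do not need to think about encodings; only when the document does not declare one and Beautiful Soup cannot detect it do you need to state the original encoding.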
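The encoding behaviour described above can be sketched as follows. The sample bytes and the choice of GBK are assumptions made for illustration; from_encoding and original_encoding are part of the BeautifulSoup API.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a GBK-encoded byte string, as an older Chinese page might serve.
raw = "<p>中文</p>".encode("gbk")

# from_encoding states the original encoding when it cannot be detected reliably.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # a Unicode string, whatever the input encoding was
utf8_bytes = soup.encode("utf-8")  # serialise the document back to UTF-8 bytes

print(text)
print(soup.original_encoding)      # the encoding BeautifulSoup used to decode the input
print(utf8_bytes)
```

Passing bytes rather than an already decoded str matters here: from_encoding only applies when BeautifulSoup has to do the decoding itself.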

Install the BeautifulSoup4 library with pip

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ beautifulsoup4 # install from the Tsinghua University mirror

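2. Main parsers of the BeautifulSoup library

In the code below, html.parser is a parser for HTML pages. Beautiful Soup also supports other parsers for different kinds of documents

from bs4 import BeautifulSoup
demo = '<html><body><p>Hello</p></body></html>'  # markup to parse; BeautifulSoup does not fetch URLs
soup = BeautifulSoup(demo, 'html.parser')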

Parser	Usage	Requirement
Python's built-in HTML parser	BeautifulSoup(demo, 'html.parser')	install bs4 (the parser itself ships with Python)
lxml's HTML parser	BeautifulSoup(demo, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(demo, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(demo, 'html5lib')	pip install html5lib
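Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is only a sketch: the helper name make_soup and the preference order are hypothetical, while FeatureNotFound is the exception BeautifulSoup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: every parser repairs them, each in its own way.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.string)
```

Naming the parser explicitly, as the table does, keeps results reproducible across machines, since different parsers can build different trees from invalid markup.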

3. Basic usage of BeautifulSoup

As a simple web page, take the following part of the source of the Baidu search page

<!DOCTYPE html>
<html>
<head>
 <meta content="text/html;charset=utf-8" http-equiv="content-type" />
 <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
 <meta content="always" name="referrer" />
 <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
 <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
 <div >
 <div >
 <div >
  <div >
  <a href="http://news.baidu.com" rel="external nofollow" name="tj_trnews">新聞 </a>
  <a href="https://www.hao123.com" rel="external nofollow" name="tj_trhao123">hao123 </a>
  <a href="http://map.baidu.com" rel="external nofollow" name="tj_trmap">地圖 </a>
  <a href="http://v.baidu.com" rel="external nofollow" name="tj_trvideo">視頻 </a>
  <a href="http://tieba.baidu.com" rel="external nofollow" name="tj_trtieba">貼吧 </a>
  <a href="//www.baidu.com/more/" rel="external nofollow" name="tj_briicon">更多產品 </a>
  </div>
 </div>
 </div>
 </div>
</body>
</html>

Combining the requests library with BeautifulSoup's HTML parser, the page is parsed as follows

import requests
from bs4 import BeautifulSoup

# load the page source with the Requests library
r = requests.get('https://www.baidu.com')
r.raise_for_status()  # raise an exception if the request failed
r.encoding = r.apparent_encoding
demo = r.text

# parse the source with BeautifulSoup
soup = BeautifulSoup(demo, 'html.parser')  # use the HTML parser

print(soup.prettify())   # print the page in prettified form
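To follow along without network access, the same steps can be run on markup held in a string. The snippet below is a sketch that uses a shortened copy of the sample page; the shortening is an assumption made to keep the example small.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample page, held in a string so no request is needed.
demo = """<!DOCTYPE html>
<html>
<head>
 <title>百度一下，你就知道</title>
</head>
<body link="#0000cc">
 <div>
  <a href="http://news.baidu.com" name="tj_trnews">新聞</a>
  <a href="https://www.hao123.com" name="tj_trhao123">hao123</a>
  <a href="http://map.baidu.com" name="tj_trmap">地圖</a>
 </div>
</body>
</html>"""

soup = BeautifulSoup(demo, "html.parser")

title = soup.title.string                          # text of the <title> tag
links = [a["href"] for a in soup.find_all("a")]    # href of every <a> tag

print(title)
print(links)
```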


4. Basic elements of the BeautifulSoup classes

BeautifulSoup4 turns a complex HTML document into a tree in which every node is a Python object. For a tag such as the one below, these objects fall into four kinds

<p > ... </p>

Tag
NavigableString
Comment
BeautifulSoup

Basic element	Description
Tag	A tag, the basic unit of information, opened with <> and closed with </>; access: soup.a or soup.p (returns the first a or p tag, including its contents)
Name	The tag's name; the name of <p>...</p> is 'p'; access: <tag>.name
Attributes	The tag's attributes, organized as a dictionary; access: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>; access: <tag>.string
Comment	The comment part of the text inside a tag; a special type of NavigableString
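The four kinds of object can be seen side by side in a small sketch. The one-line document is a made-up example; the class names come from bs4 itself.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document that contains all four kinds of object.
soup = BeautifulSoup('<p class="title"><b>text</b><!-- note --></p>', "html.parser")

tag = soup.p               # Tag
name = tag.name            # the tag's name
attrs = tag.attrs          # the tag's attributes as a dictionary
text = soup.b.string       # NavigableString
comment = tag.contents[1]  # Comment (second child of <p>)

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Note that class is a multi-valued attribute in HTML, so its value in attrs is a list rather than a plain string.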

4.1 Tag

A tag is the most basic unit of information in HTML. It is used as follows

from bs4 import BeautifulSoup
html = demo  # page source fetched in section 3; BeautifulSoup parses markup, not a URL
bs = BeautifulSoup(html, "html.parser")

print(bs.title)  # the whole title tag
print(bs.head)  # the whole head tag
print(bs.a)  # the first a tag
print(type(bs.a))  # its type, bs4.element.Tag

The most important attributes of a Tag are name and attrs, used as follows

print(bs.name)  # the BeautifulSoup object itself has the special name '[document]'
print(bs.head.name)  # for any other tag, the value is the tag's own name
print(bs.a.attrs)  # all attributes of the a tag, as a dictionary
print(bs.a['class'])  # raises KeyError if the attribute is missing; bs.a.get('class') returns None instead
bs.a['class'] = "newClass"  # attributes can be modified
print(bs.a)
del bs.a['class']  # and deleted
print(bs.a)
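The difference between indexing and get() is easiest to see on a small, self-contained tag. The markup below is a made-up example, not taken from the Baidu page.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute and no id.
soup = BeautifulSoup('<a class="mnav nav" href="http://news.baidu.com">news</a>', "html.parser")
a = soup.a

classes = a["class"]    # multi-valued attribute, returned as a list
href = a["href"]        # single-valued attribute, returned as a string
missing = a.get("id")   # get() returns None for a missing attribute

try:
    a["title"]          # indexing a missing attribute raises KeyError
    key_error = False
except KeyError:
    key_error = True

a["id"] = "first"       # add or modify an attribute
del a["class"]          # delete an attribute
print(a)
```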

4.2 NavigableString

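A tag's .string attribute returns the text inside the tag as a NavigableString

from bs4 import BeautifulSoup
html = demo  # page source fetched in section 3; BeautifulSoup parses markup, not a URL
bs = BeautifulSoup(html, "html.parser")
print(bs.title.string)
print(type(bs.title.string))  # <class 'bs4.element.NavigableString'>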

4.3 Comment

A Comment object is a special type of NavigableString. Printing it shows the comment text without the comment markers

from bs4 import BeautifulSoup
html = '<a href="http://news.baidu.com" rel="external nofollow" name="tj_trnews"><!--新聞--></a>'
bs = BeautifulSoup(html, "html.parser")
print(bs.a)
# prints the whole tag, including the <!--新聞--> comment
print(bs.a.string) 		# 新聞
print(type(bs.a.string)) # <class 'bs4.element.Comment'>
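Since .string returns a Comment without its markers, comment text can be mistaken for visible text. One way to guard against that, sketched here on a made-up document, is an explicit isinstance check.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical document: one paragraph holds only a comment, the other real text.
soup = BeautifulSoup("<p><!--hidden note--></p><p>visible text</p>", "html.parser")

visible = []
for p in soup.find_all("p"):
    s = p.string
    # Comment is a subclass of NavigableString, so test for Comment explicitly.
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))

print(visible)
```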

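5. Traversing HTML content with bs4

An HTML document is a tree of nested tags, and this tree is the basic structure of every HTML page.

On this structure there are three basic kinds of traversal

Downward traversal
Upward traversal
Sideways traversal

Each of the three starts from the current node and walks the nodes below it, above it, or at the same level.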

5.1 Downward traversal

Downward traversal uses three attributes

contents
children
descendants

Attribute	Description
.contents	List of child nodes: all direct children stored in a list
.children	Iterator over child nodes, for looping over the direct children
.descendants	Iterator over all descendant nodes, for looping over children, grandchildren, and so on

Example

soup = BeautifulSoup(demo, 'html.parser')

# loop over the direct children
for child in soup.body.children:
    print(child)

# loop over all descendants
for child in soup.body.descendants:
    print(child)

# the children as a list
print(soup.head.contents)
print(soup.head.contents[1])  # index the list to get a single child
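The difference between the three attributes shows up clearly when the nodes are counted. This sketch uses a small made-up fragment rather than the Baidu page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a div with two paragraphs, one of which nests a <b> tag.
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

contents = div.contents              # a list of the direct children
children = list(div.children)        # the same nodes, produced by an iterator
descendants = list(div.descendants)  # every node below div, in document order

child_names = [child.name for child in children]

print(child_names)
for node in descendants:
    print(repr(node))
```

Text nodes count as descendants too, which is why descendants is longer than the number of tags inside the div.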

5.2 Upward traversal

Upward traversal uses two attributes

parent
parents

Attribute	Description
.parent	The node's parent tag
.parents	Iterator (a generator) over the node's ancestors, for looping over them

Example

soup = BeautifulSoup(demo, 'html.parser')

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)  # the last ancestor is the BeautifulSoup object, whose name is '[document]'
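A minimal sketch of upward traversal on a made-up document shows the order in which ancestors are produced, ending with the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a link nested three levels deep.
soup = BeautifulSoup("<html><body><div><a href='#'>link</a></div></body></html>", "html.parser")

direct_parent = soup.a.parent.name                      # the immediate parent only
ancestors = [parent.name for parent in soup.a.parents]  # nearest ancestor first

print(direct_parent)
print(ancestors)
```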

5.3 Sideways traversal

Sideways traversal uses four attributes

next_sibling
previous_sibling
next_siblings
previous_siblings

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order

A sibling is not necessarily a tag: text between tags, including whitespace, is a NavigableString sibling. An example of sideways traversal

for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over the following siblings

for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over the preceding siblings
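Because text between tags is also a sibling, .next_sibling often returns a string rather than the next tag. The sketch below, on a made-up fragment, contrasts it with find_next_sibling(), which skips to the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links separated by a comma and a space.
soup = BeautifulSoup('<p><a id="first">one</a>, <a id="second">two</a></p>', "html.parser")
first = soup.a

nxt = first.next_sibling                 # the text node ', ', not the second link
next_tag = first.find_next_sibling("a")  # the next sibling that is an <a> tag
following = list(first.next_siblings)    # all following siblings, text included
before = first.previous_sibling          # None: nothing precedes the first link

print(repr(nxt))
print(next_tag)
print(following)
```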

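5.4 Other traversal helpers

Attribute	Description
.strings	If a Tag contains several strings, i.e. its descendants hold text, this iterator yields each of them
.stripped_strings	Same usage as .strings, but strips surrounding whitespace and skips whitespace-only strings
.has_attr(name)	Method that tests whether the Tag has the given attribute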
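The helpers in the table can be compared on a fragment that contains indentation whitespace. The markup is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with newlines and padding around the text.
soup = BeautifulSoup("<div>\n <p> Hello </p>\n <p id='w'>World</p>\n</div>", "html.parser")

raw = list(soup.div.strings)             # every string, whitespace included
clean = list(soup.div.stripped_strings)  # stripped, whitespace-only strings skipped
has_id = [p.has_attr("id") for p in soup.find_all("p")]

print(raw)
print(clean)
print(has_id)
```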

6. Searching the document tree

The soup.find_all(name, attrs, recursive, string, **kwargs) method returns a list holding the search results

name: filter on the tag name
attrs: filter on tag attribute values; specific attributes can be named
recursive: whether to search all descendants; defaults to True
string: filter on the text between the tags
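The four parameters can be tried one at a time on a small tree. This is a sketch on a made-up fragment; the ids and class names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <p> directly under the div, one nested deeper.
soup = BeautifulSoup(
    '<div id="top"><p>a</p><section><p class="inner">b</p></section></div>',
    "html.parser",
)

all_p = soup.div.find_all("p")                      # name: all descendants called p
direct_p = soup.div.find_all("p", recursive=False)  # recursive=False: direct children only
by_attr = soup.find_all(attrs={"class": "inner"})   # attrs: filter on attribute values
by_string = soup.find_all(string="b")               # string: returns the matching strings

print(all_p)
print(direct_p)
print(by_attr)
print(by_string)
```

With only string given, the result is a list of NavigableString objects rather than tags.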

6.1 The name parameter

If a string is given, only tags whose name matches the string exactly are found, as follows

a_list = bs.find_all("a")
print(a_list)  # a list of all a tags

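If a regular expression is given, BeautifulSoup4 matches tag names with the pattern's search() method, so the pattern below matches every tag whose name contains the letter "a"

from bs4 import BeautifulSoup
import re

html = demo  # page source fetched in section 3; BeautifulSoup parses markup, not a URL
bs = BeautifulSoup(html, "html.parser")
t_list = bs.find_all(re.compile("a"))
for item in t_list:
    print(item)  # print each match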
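Because the pattern is applied with search() rather than a full match, an unanchored pattern can match more tags than expected. The sketch below, on a made-up document, compares an unanchored and an anchored pattern.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document with several tag names that contain the letter "a".
soup = BeautifulSoup(
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="#">x</a><span>y</span></body></html>',
    "html.parser",
)

# Unanchored: matches any tag name that contains "a".
contains_a = [tag.name for tag in soup.find_all(re.compile("a"))]

# Anchored: matches only the tag named exactly "a".
exactly_a = [tag.name for tag in soup.find_all(re.compile("^a$"))]

print(contains_a)
print(exactly_a)
```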

If a list is given, BeautifulSoup4 returns every node that matches any element of the list, as follows

t_list = bs.find_all(["meta", "link"])
for item in t_list:
    print(item)

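If a function is given, it is called on each tag, and the tags for which it returns True are matched

from bs4 import BeautifulSoup

html = demo  # page source fetched in section 3; BeautifulSoup parses markup, not a URL
bs = BeautifulSoup(html, "html.parser")
def name_is_exists(tag):
    return tag.has_attr("name")  # keep tags that have a name attribute
t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)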

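6.2 The attrs parameter

Most attributes can be searched with a keyword argument, but not all of them: an HTML data-* attribute such as data-foo is not a valid Python keyword name. Pass such attributes in a dictionary through attrs instead

t_list = bs.find_all(attrs={"data-foo": "value"})  # bs.find_all(data-foo="value") would be a SyntaxError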
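The keyword form and the attrs dictionary can be compared on a made-up fragment. class_ (with a trailing underscore) is the keyword BeautifulSoup provides because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a data-* attribute, a class and an id.
soup = BeautifulSoup(
    '<div data-foo="value" class="box big" id="main">x</div><div class="box">y</div>',
    "html.parser",
)

by_id = soup.find_all(id="main")                       # ordinary keyword argument
by_class = soup.find_all(class_="box")                 # class needs the class_ spelling
by_data = soup.find_all(attrs={"data-foo": "value"})   # data-* needs the attrs dictionary

print(by_id)
print(by_class)
print(by_data)
```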

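6.3 The string parameter

The string parameter searches the text content of the document. Like name, it accepts a string, a regular expression, or a list (older versions of Beautiful Soup call this argument text)

from bs4 import BeautifulSoup
import re

html = demo  # page source fetched in section 3; BeautifulSoup parses markup, not a URL
bs = BeautifulSoup(html, "html.parser")
t_list = bs.find_all(attrs={"data-foo": "value"})  # attribute search, as in section 6.2
for item in t_list:
    print(item)
t_list = bs.find_all(string="hao123")  # exact string
for item in t_list:
    print(item)
t_list = bs.find_all(string=["hao123", "地圖", "貼吧"])  # any string in the list
for item in t_list:
    print(item)
t_list = bs.find_all(string=re.compile(r"\d"))  # strings containing a digit
for item in t_list:
    print(item)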

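find_all() is often used together with regular expressions (import re), as shown below

soup.find_all(string=re.compile('python'))  # search for strings containing 'python'

# or compile the pattern first and pass it in
pattern = re.compile('python')  # matches the text 'python'
soup.find_all(string=pattern)  # same search with the precompiled pattern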
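A runnable version of the regular-expression search, on made-up text, also shows two details that are easy to miss: matching is case-sensitive unless a flag is passed, and combining a tag name with string returns tags instead of strings.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment with the word in two different capitalisations.
soup = BeautifulSoup("<p>Python is fun</p><p>I like python</p><p>Java</p>", "html.parser")

exact_case = soup.find_all(string=re.compile("python"))                # case-sensitive
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))   # case-insensitive
tags = soup.find_all("p", string=re.compile("python", re.IGNORECASE))  # returns <p> tags

print(exact_case)
print(any_case)
print(tags)
```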

6.4 Common find() methods
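A few methods of the find family are sketched below on a made-up list. All of them exist in the BeautifulSoup API; which ones this section originally listed is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a list with two items.
soup = BeautifulSoup('<ul><li class="x">one</li><li>two</li></ul>', "html.parser")

first = soup.find("li")                   # first match only, as a Tag
nothing = soup.find("table")              # None when nothing matches
sibling = first.find_next_sibling("li")   # next sibling tag that matches
parent = first.find_parent("ul")          # nearest ancestor that matches
limited = soup.find_all("li", limit=1)    # find_all stopped after one result

print(first, nothing, sibling, parent.name, limited)
```

find() returns a single Tag or None, whereas find_all() always returns a list, so the result of find() should be checked before its attributes are accessed.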


After reading the above, do you have a better understanding of how to use the BeautifulSoup library in Python? If you would like to learn more, follow the Yisu Cloud (億速云) industry news channel. Thank you for your support.
