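This article covers how to use BeautifulSoup4 for web scraping in Python. Many readers may not be familiar with it, so the key points are summarized below.
Web Scraping: The BeautifulSoup4 Parser
BeautifulSoup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's HTML and XML parsers.
Compared with regular expressions, it is simpler to use.
Example:
First, import the bs4 library.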
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Pretty-print the contents of the soup object
    print(soup.prettify())
Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ; and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
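The introduction mentions CSS selectors and the standard library parser, but the example above uses neither. The following is a minimal sketch of both, assuming a shortened version of the sample document (the trimmed HTML and the variable names are illustrative, not from the original article). It uses "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption made for this sketch)
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

# "html.parser" is the parser bundled with the Python standard library
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
sister_ids = [a["id"] for a in soup.select("a.sister")]

# select_one() returns only the first match, or None if nothing matches
lacie_href = soup.select_one("#link2")["href"]

print(sister_ids)
print(lacie_href)
```

Different parsers can build slightly different trees from the same markup, especially when the markup is invalid, so it is usually worth naming the parser explicitly as done here.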
The four kinds of objects
BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
(1)Tag
(2)NavigableString
(3)BeautifulSoup
(4)Comment
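To see the four kinds side by side, one option is to walk a small tree and count the class of each node. This is only a sketch: the one-line document and the `kinds` counter are made up for illustration, and "html.parser" is used so that no extra wrapper tags are added.

```python
from collections import Counter

from bs4 import BeautifulSoup

# A tiny document containing a tag, a nested tag, a text node and a comment
html = '<p class="title"><b>The Dormouse\'s story</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

# soup.descendants yields every node below the root, depth first
kinds = Counter(type(node).__name__ for node in soup.descendants)

# The root object itself is the fourth kind
kinds[type(soup).__name__] += 1

for kind, count in sorted(kinds.items()):
    print(kind, count)
```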
1.Tag
A Tag is, put simply, one of the tags in an HTML document, for example:

    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

HTML tags such as title, head, a, and p above, together with the content they enclose, are Tags. Let's try using BeautifulSoup to get them:
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Print the title tag
    print(soup.title)

    # Print the head tag
    print(soup.head)

    # Print the a tag
    print(soup.a)

    # Print the p tag
    print(soup.p)

    # Print the type of soup.p
    print(type(soup.p))
Output:

    <title>The Dormouse's story</title>
    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <class 'bs4.element.Tag'>
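Writing soup followed by a tag name is an easy way to get these tags; the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. Querying all matching tags is covered later.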
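As a brief preview of querying every matching tag, the sketch below contrasts `soup.a` with `find_all("a")`. The shortened HTML and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption made for this sketch)
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first <a> in the document
first_link = soup.a

# find_all() returns every matching tag, in document order
all_links = soup.find_all("a")
hrefs = [a["href"] for a in all_links]

print(first_link["id"])
print(len(all_links))
print(hrefs)
```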
A Tag has two important attributes: name and attrs.
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # The soup object is a special case: its name is [document]
    print(soup.name)

    # For any other tag, name is the name of the tag itself
    print(soup.head.name)

    # Print all attributes of the p tag; the result is a dict
    print(soup.p.attrs)

    # Print the class attribute of the p tag
    print(soup.p['class'])

    # The get method also works: pass in the attribute name (equivalent to the line above)
    print(soup.p.get('class'))

    print(soup.p)

    # Modify an attribute
    soup.p['class'] = "newClass"
    print(soup.p)

    # Delete an attribute
    del soup.p['class']
    print(soup.p)
Output:

    [document]
    head
    {'class': ['title'], 'name': 'dromouse'}
    ['title']
    ['title']
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
    <p name="dromouse"><b>The Dormouse's story</b></p>
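Two details of the output above are easy to miss: class comes back as a list because it is a multi-valued attribute, and square-bracket access and get() differ when an attribute is absent. The sketch below illustrates both on a one-line document invented for this example (the extra class value "big" is an assumption).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

# class may hold several values, so it is returned as a list
classes = p["class"]

# Ordinary attributes are returned as plain strings
name_attr = p["name"]

# get() returns None (or a supplied default) when the attribute is absent
missing = p.get("id")
fallback = p.get("id", "no-id")

# Square-bracket access raises KeyError instead
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

# has_attr() tests for presence without reading the value
has_class = p.has_attr("class")

print(classes, name_attr, missing, fallback, raised, has_class)
```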
2.NavigableString
Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Print the text content of the p tag
    print(soup.p.string)

    # Print the type of soup.p.string
    print(type(soup.p.string))
Output:

    The Dormouse's story
    <class 'bs4.element.NavigableString'>
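The first p tag works with .string because it contains a single chain of children ending in one text node. When a tag has several children, .string is None, and get_text() or stripped_strings are the usual alternatives. A minimal sketch on a made-up two-paragraph document:

```python
from bs4 import BeautifulSoup

# Two paragraphs: one with a single nested string, one with mixed children
html = (
    "<p><b>The Dormouse's story</b></p>"
    '<p class="story">Once <a>Lacie</a> and <a>Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# A single child chain: .string reaches through <b> to the text
title_text = title_p.string

# Several children: .string cannot pick one, so it is None
story_string = story_p.string

# get_text() concatenates all text found below the tag
story_text = story_p.get_text()

# stripped_strings yields each text piece with surrounding whitespace removed
story_parts = list(story_p.stripped_strings)

print(title_text)
print(story_string)
print(story_text)
print(story_parts)
```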
3.BeautifulSoup
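A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a Tag object: it is a special kind of Tag, and we can read its type, name, and attributes.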
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Type of the name attribute
    print(type(soup.name))

    # Name
    print(soup.name)

    # Attributes
    print(soup.attrs)
Output:

    <class 'str'>
    [document]
    {}
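The claim that a BeautifulSoup object is a special Tag can be checked directly with isinstance. The sketch below uses a minimal document made up for this purpose; note that it inspects the type of the soup object itself, whereas the example above prints the type of its name attribute.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

# The class of the root object
soup_type = type(soup).__name__

# BeautifulSoup derives from Tag, so Tag methods and navigation work on it
is_tag = isinstance(soup, Tag)

# Tag-style navigation starting from the document root
title_text = soup.title.string

print(soup_type, is_tag, soup.name, soup.attrs, title_text)
```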
4.Comment
A Comment object is a special type of NavigableString; when it is output, the comment markers are not included.
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Print the first a tag, its string, and the type of that string
    print(soup.a)
    print(soup.a.string)
    print(type(soup.a.string))
Output:

    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
     Elsie 
    <class 'bs4.element.Comment'>
The content of the a tag is actually a comment, but when we output it with .string, the comment markers have already been stripped.
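Because the markers are stripped, a comment can be mistaken for ordinary text. One way to guard against this is to check the type before using the string. A minimal sketch, with a two-link document and a `visible` list invented for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    text = a.string
    # Comment is a subclass of NavigableString, so test for it explicitly
    if isinstance(text, Comment):
        continue
    visible.append(str(text))

print(visible)
```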