Using BeautifulSoup4 in Python Crawlers

Published: 2020-09-24 09:29:54  Source: Yisu Cloud  Views: 138  Author: Leah  Category: Programming Languages

This article covers how to use BeautifulSoup4 in a Python crawler. Many readers may be unfamiliar with it, so the main points are summarized below.

Crawlers: the BeautifulSoup4 parser

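BeautifulSoup makes parsing HTML fairly simple. Its API is user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's HTML and XML parsers.

Compared with regular expressions, it is simpler to use.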
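As a minimal sketch of the CSS selector and standard-library parser support mentioned above, the snippet below uses a shortened version of the sample HTML and the "html.parser" backend. Both choices are assumptions made for illustration; the examples in the rest of this article use lxml.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption made for this sketch).
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

# "html.parser" is the parser from the Python standard library,
# so no extra package is needed beyond bs4 itself.
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
links = soup.select("p.story a.sister")
hrefs = [a["href"] for a in links]
print(hrefs)

# select_one() returns the first match, or None if nothing matches.
first = soup.select_one("#link2")
print(first.get_text())
```

Doing the same extraction with a regular expression would require a pattern that copes with attribute order and quoting, which is the kind of fragility the selector API avoids.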

Example:

First, import the bs4 library:

#!/usr/bin/python3
# -*- coding:utf-8 -*- 
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Pretty-print the contents of the soup object
print(soup.prettify())

Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

The four kinds of objects

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:

(1) Tag

(2) NavigableString

(3) BeautifulSoup

(4) Comment

1. Tag

Put simply, a Tag is an individual tag in the HTML, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The HTML tags above (title, head, a, p, and so on), together with the content they contain, are Tags. Now try using BeautifulSoup to get them:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Print the title tag
print(soup.title)

# Print the head tag
print(soup.head)

# Print the first a tag
print(soup.a)

# Print the first p tag
print(soup.p)

# Print the type of soup.p
print(type(soup.p))

Output

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>

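We can easily get these tags with soup plus the tag name (for example soup.title); the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. To get every matching tag, use a method such as find_all().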
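The difference between the first match and all matches can be sketched as follows. The shortened HTML and the "html.parser" backend are assumptions made so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption made for this sketch).
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first matching tag.
first_link = soup.a
print(first_link["id"])

# find_all() returns every matching tag in document order.
all_links = soup.find_all("a")
ids = [a["id"] for a in all_links]
print(ids)
```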

A Tag has two important attributes: name and attrs.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
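# The soup object is a special case: its name is [document]
print(soup.name)

# For any other tag, name is the name of the tag itself
print(soup.head.name)

# Print all attributes of the p tag; attrs is a dictionary
print(soup.p.attrs)

# Print the class attribute of the p tag
print(soup.p['class'])
# The get method also works; it gives the same result here,
# but returns None instead of raising KeyError when the attribute is missing
print(soup.p.get('class'))

print(soup.p)

# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)

# Delete an attribute
del soup.p['class']
print(soup.p)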

Output

[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
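The output above shows class as a list while name is a plain string. A small sketch of that behavior, and of how the two access styles differ when an attribute is missing, might look like this. The one-line HTML snippet and the "html.parser" backend are assumptions chosen for brevity.

```python
from bs4 import BeautifulSoup

# A single tag with two classes (an assumption made for this sketch).
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

# class is a multi-valued attribute, so it comes back as a list.
classes = p["class"]
# Most other attributes come back as a plain string.
name = p["name"]

# get() returns None (or a supplied default) for a missing attribute.
missing = p.get("id")
fallback = p.get("id", "no-id")

# Subscript access raises KeyError for a missing attribute.
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

# has_attr() tests for presence without reading the value.
has_name = p.has_attr("name")

print(classes, name, missing, fallback, raised, has_name)
```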

2. NavigableString

Now that we have the tag, how do we get the text inside it? Simply use .string, for example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Print the text content of the first p tag
print(soup.p.string)

# Print the type of soup.p.string
print(type(soup.p.string))

Output

The Dormouse's story
<class 'bs4.element.NavigableString'>
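One caveat worth illustrating: .string only yields a value when the tag has a single child (or a single chain of single children, as with the p and b tags above). The sketch below, which assumes a shortened paragraph and the "html.parser" backend, shows .string returning None for a tag with mixed content, with get_text() and stripped_strings as alternatives.

```python
from bs4 import BeautifulSoup

# A paragraph mixing text and child tags (an assumption made for this sketch).
html = '<p class="story">Once upon a time <a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# The p tag has several children, so .string cannot pick one and returns None.
print(p.string)

# get_text() concatenates all the text inside the tag.
text = p.get_text()
print(text)

# stripped_strings yields each text fragment with surrounding whitespace removed.
parts = list(p.stripped_strings)
print(parts)

# A tag with exactly one text child still works with .string.
print(soup.a.string)
```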

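3. BeautifulSoup

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a special Tag object. The example below prints the type of its name, the name itself, and its attributes.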

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Type of the name attribute
print(type(soup.name))

# Name
print(soup.name)

# Attributes
print(soup.attrs)

Output

<class 'str'>
[document]
{}
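The listing above prints the type of soup.name rather than of soup itself. As a small complementary sketch (with a tiny document and the "html.parser" backend as assumptions), the relationship between the BeautifulSoup object and Tag can be checked directly:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Tiny sample document (an assumption made for this sketch).
soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

# The BeautifulSoup class is a subclass of Tag, which is why Tag methods work on it.
is_tag = isinstance(soup, Tag)
type_name = type(soup).__name__

# Tag-style searching is available on the document object.
title_text = soup.find("title").string

print(is_tag, type_name, soup.name, soup.attrs, title_text)
```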

4. Comment

A Comment object is a special type of NavigableString; its output does not include the comment delimiters.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
print(soup.a)
 
print(soup.a.string)
 
print(type(soup.a.string))

Output

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie
<class 'bs4.element.Comment'>

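The content of the a tag is actually a comment, but when we print it with .string, the comment delimiters have already been stripped. Code that does not check the type may therefore treat a comment as ordinary text.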
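One way to avoid mistaking a comment for ordinary text is to check the type before using the value. The sketch below assumes a two-link snippet and the "html.parser" backend; visible_text is a hypothetical helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# One link holds a comment, the other holds real text (sample is an assumption).
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Hypothetical helper: return tag.string as str, or None for a comment or no string."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = [visible_text(a) for a in soup.find_all("a")]
print(results)
```

Because Comment is a subclass of NavigableString, an isinstance check against Comment is needed; checking for NavigableString alone would accept both.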

