An Introduction to BeautifulSoup and What It Can Do

Published: 2021-06-25 13:41:14  Source: Yisu Cloud  Views: 320  Author: chen  Category: Big Data

This article walks through the core features of BeautifulSoup: building a soup object, working with Tag objects, traversing the tree, searching with find/find_all and select, and a quick reference for CSS selectors.

1. Building a BeautifulSoup object

1.1 Building from a string

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
</p>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

1.2 Loading from a file

from bs4 import BeautifulSoup

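# The "lxml" parser needs the third-party lxml package; 'html.parser' works without it.
# encoding="utf-8" assumes the file is UTF-8; adjust it to match the actual file.
with open(r"F:\tmp\etree.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "lxml")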

print(soup.prettify())

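2. Tag objects

2.1 string, strings, stripped_strings

If a node's only child is a text node, that text can be accessed directly through string.

If the node contains more than a single text node, string is None.

In that case the text can be read through strings or stripped_strings, both of which are generators; stripped_strings also trims surrounding whitespace and skips whitespace-only strings.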
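A minimal sketch of the difference, using a made-up HTML fragment (the markup and the variable names first and second are illustrative, not from the original article):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> holds a single text node,
# the second mixes text with a nested <b> tag.
html = "<div><p>Hello</p><p>  big <b>world</b> </p></div>"
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")

print(first.string)                    # Hello
print(second.string)                   # None
print(list(second.strings))            # ['  big ', 'world', ' ']
print(list(second.stripped_strings))   # ['big', 'world']
```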

2.2 get_text()

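Returns only the text of a node and its descendants, joined into a single string.

soup.get_text()
# Text from different nodes can be joined with a separator, here |
soup.get_text("|")
# Leading and trailing whitespace can be stripped from each piece of text
soup.get_text("|", strip=True)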
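To make the effect of the separator and strip arguments visible, here is a small sketch on a made-up fragment (the markup is an assumption for illustration); repr is used so the whitespace shows up in the output:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with deliberate extra spaces around the text.
soup = BeautifulSoup("<p> Hello <b> world </b></p>", "html.parser")

print(repr(soup.get_text()))                 # ' Hello  world '
print(repr(soup.get_text("|")))              # ' Hello | world '
print(repr(soup.get_text("|", strip=True)))  # 'Hello|world'
```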

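2.3 Attributes

tag.attrs is a dictionary, and a value can be read with subscript syntax such as tag['id']. Subscript access raises KeyError when the attribute is missing, so tag.get('id') is often the safer choice: it returns None if the id attribute does not exist.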
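A short sketch of both access styles, on a hypothetical a tag (the attribute values are invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<a id="link1" class="btn primary" href="/home">Home</a>', "html.parser")
tag = soup.a

print(tag["id"])         # link1
print(tag.get("title"))  # None, because there is no title attribute
print(tag["class"])      # ['btn', 'primary'] (class is multi-valued, so it is a list)

try:
    tag["title"]
except KeyError:
    print("subscript access raised KeyError")
```

Note that class comes back as a list rather than a string, since an element can carry several classes.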

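3. contents, children, and descendants

All three give access to the nodes beneath a node, with these differences:

contents is a list
children is an iterator

contents and children cover only direct children. descendants is also iterated lazily, but it walks every descendant of the node, not just the direct children.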
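The following sketch contrasts the three on a small invented fragment (markup and names are assumptions for illustration). Text nodes have a name of None, which is why the last line filters on d.name:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <div> has two direct children, <p> and <span>.
soup = BeautifulSoup("<div><p>one <b>two</b></p><span>three</span></div>", "html.parser")
div = soup.div

print([c.name for c in div.contents])   # ['p', 'span']
print([c.name for c in div.children])   # ['p', 'span']

# descendants also yields nested tags and text nodes.
print(len(list(div.descendants)))                   # 6
print([d.name for d in div.descendants if d.name])  # ['p', 'b', 'span']
```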

3.1 parent, parents

parent: the parent node
parents: a generator over all ancestors, from the parent up to the document itself
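As a quick illustration on an invented document (the markup is an assumption), parents walks upward until it reaches the BeautifulSoup object, whose name is '[document]':

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup.
soup = BeautifulSoup("<html><body><div><p><b>text</b></p></div></body></html>", "html.parser")
b = soup.b

print(b.parent.name)                # p
print([p.name for p in b.parents])  # ['p', 'div', 'body', 'html', '[document]']
```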

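3.2 next_sibling, previous_sibling

next_sibling: the next sibling node
previous_sibling: the previous sibling node

Whitespace between tags is itself a text node, so in pretty-printed HTML the nearest sibling is often a string rather than a tag.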

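3.3 next_element, previous_element

next_element: the node parsed immediately after this one
previous_element: the node parsed immediately before this one

The difference between next_element and next_sibling is:

next_sibling continues from the current tag's end tag, so it skips everything inside the tag
next_element continues from the current tag's start tag, so for a tag with content it is that tag's first child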
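The distinction is easier to see on a concrete fragment. This sketch uses invented markup without whitespace between tags, so that no whitespace text nodes get in the way:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <a> and <b> are siblings inside <div>.
soup = BeautifulSoup("<div><a>first</a><b>second</b></div>", "html.parser")
a = soup.a
b = soup.b

print(a.next_sibling)       # <b>second</b>  (skips over the content of <a>)
print(a.next_element)       # first          (the text node inside <a>)
print(b.previous_sibling)   # <a>first</a>
print(b.previous_element)   # first          (the last node parsed before <b>)
```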

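4. find, find_all

4.1 Methods

find_parent: find the nearest matching ancestor
find_parents: find all matching ancestors, walking upward recursively
find_next_siblings: find all matching siblings after this node
find_next_sibling: find the first matching sibling after this node
find_all_next: find all matching nodes after this node
find_next: find the first matching node after this node
find_all_previous: find all matching nodes before this node
find_previous: find the first matching node before this node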
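A few of these methods in action, as a sketch on made-up markup (the fragment and the id value box are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a heading and two paragraphs in a div, plus one paragraph outside.
html = "<div id='box'><h2>Title</h2><p>one</p><p>two</p></div><p>three</p>"
soup = BeautifulSoup(html, "html.parser")
h2 = soup.h2

print(h2.find_parent("div")["id"])                    # box
print(h2.find_next_sibling("p").text)                 # one
print([p.text for p in h2.find_next_siblings("p")])   # ['one', 'two']
print([p.text for p in h2.find_all_next("p")])        # ['one', 'two', 'three']

last_p = soup.find_all("p")[-1]
print(last_p.find_previous("h2").text)                # Title
```

The sibling methods stay inside the same parent, while find_all_next and find_previous follow document order and can cross into other parts of the tree.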

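4.2 Tag name

# Find all p nodes
soup.find_all('p')
# Find title nodes among direct children only (no recursion)
soup.find_all("title", recursive=False)
# Find p nodes and span nodes
soup.find_all(["p", "span"])
# Find the first a node; find_all with limit=1 returns a one-element list,
# while find returns the node itself (or None if nothing matches)
soup.find_all("a", limit=1)
soup.find('a')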

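4.3 Attributes

# Find nodes whose id is id1
soup.find_all(id='id1')
# Find nodes whose name attribute is tim. The keyword name is already taken by
# find_all's first parameter (the tag name), so the attribute has to go through attrs.
soup.find_all(attrs={"name": "tim"})
# Find p nodes whose class is clazz
soup.find_all("p", "clazz")
soup.find_all("p", class_="clazz")
# A string with several classes matches only that exact class attribute value
soup.find_all("p", class_="body strikeout")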
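Two details of attribute matching are easy to trip over: the name keyword refers to the tag name rather than an attribute called name, and a multi-class string is compared against the exact class value. The sketch below shows both on invented markup (the fragment is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <input> with a name attribute and two <p> tags
# that carry the same two classes in a different order.
html = ('<input name="tim"/>'
        '<p class="body strikeout">x</p>'
        '<p class="strikeout body">y</p>')
soup = BeautifulSoup(html, "html.parser")

# name= is the tag name, so this looks for <tim> tags and finds none.
print(soup.find_all(name="tim"))                        # []
# Passing the attribute through attrs finds the <input>.
print(len(soup.find_all(attrs={"name": "tim"})))        # 1

# A single class matches any tag that has it among its classes.
print(len(soup.find_all("p", class_="body")))           # 2
# A multi-class string only matches the exact attribute value.
print(len(soup.find_all("p", class_="body strikeout"))) # 1
```

When class order should not matter, a CSS selector such as soup.select("p.body.strikeout") matches both paragraphs.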

4.4 Regular expressions

import re
# Find nodes whose class starts with p
soup.find_all(class_=re.compile("^p"))

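4.5 Functions

# Find nodes that have a class attribute but no id attribute.
# The filter function must be defined before it is passed to find_all.
def hasClassNoId(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(hasClassNoId)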
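For a runnable version of the same idea, here is a sketch on three invented paragraphs (the markup and the function name has_class_no_id are illustrative):

```python
from bs4 import BeautifulSoup

def has_class_no_id(tag):
    # find_all calls this once per tag and keeps the tags for which it returns True.
    return tag.has_attr("class") and not tag.has_attr("id")

# Hypothetical fragment: only the first <p> has a class and no id.
html = '<p class="a">1</p><p class="b" id="x">2</p><p>3</p>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(has_class_no_id)
print([t.text for t in matches])   # ['1']
```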

4.6 Text

soup.find_all(string="tim")
soup.find_all(string=["alice", "tim", "allen"])
soup.find_all(string=re.compile("tim"))

def onlyTextTag(s):
    return (s == s.parent.string)

# Find text nodes that are the sole content of their parent node
soup.find_all(string=onlyTextTag)
# Find a nodes whose text is tim
soup.find_all("a", string="tim")
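One thing worth noticing is what these calls return: with only string given, the results are text nodes, whereas adding a tag name returns the tags. A small sketch on invented markup (the fragment is an assumption for illustration):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; note the trailing space in "tim " inside <p>.
html = "<a>tim</a><a>alice</a><p>tim <b>allen</b></p>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="tim")               # exact match on the whole string
by_regex = soup.find_all(string=re.compile("tim"))  # regex search inside each string
tags = soup.find_all("a", string="tim")           # <a> tags whose text is "tim"

print(exact)      # ['tim']
print(by_regex)   # ['tim', 'tim ']
print(tags)       # [<a>tim</a>]
```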

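5. select

5.1 Methods

Compared with find, select has far fewer methods, just two: select and select_one. The difference is that select_one returns only the first element that matches.

The substance of select lies in the selectors, so the examples below focus on the commonly used ones. Readers who are not familiar with CSS selectors may want to read the CSS selector section at the end of this article first.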
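The difference in return types can be sketched as follows, on a made-up list (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two list items.
soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

items = soup.select("li")        # always a list, possibly empty
first = soup.select_one("li")    # a single tag, or None if nothing matches
missing = soup.select_one("table")

print([li.text for li in items])  # ['a', 'b']
print(first.text)                 # a
print(missing)                    # None
```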

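5.2 Selecting by tag

# Select title nodes
soup.select("title")
# Select all a nodes under the body node
soup.select("body a")
# Select title nodes under a head node under the html node
soup.select("html head title")

Selecting by tag is straightforward: list the tag names level by level, separated by spaces. A space matches descendants at any depth, not only direct children.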

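5.3 id and class selectors

# Select nodes with the class article
soup.select(".article")
# Select the a node whose id is id1
soup.select("a#id1")
# Select the node whose id is id1
soup.select("#id1")
# Select the nodes whose id is id1 or id2
soup.select("#id1,#id2")

id and class selectors are also simple: a class selector starts with . and an id selector starts with #.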

5.4 Attribute selectors

# Select a nodes that have an href attribute
soup.select('a[href]')
# Select a nodes whose href is exactly http://mycollege.vip/tim
soup.select('a[href="http://mycollege.vip/tim"]')
# Select a nodes whose href starts with http://mycollege.vip/
soup.select('a[href^="http://mycollege.vip/"]')
# Select a nodes whose href ends with png
soup.select('a[href$="png"]')
# Select a nodes whose href contains the substring china
soup.select('a[href*="china"]')
# Select a nodes whose href contains china as a whole whitespace-separated word
soup.select("a[href~=china]")
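The *= and ~= operators are easy to confuse, so here is a sketch that runs the operators side by side. The markup is invented for illustration, and the ~= comparison is done on a title attribute (an assumption of this example), since a whitespace-separated word list is more natural there than in a URL:

```python
from bs4 import BeautifulSoup

# Hypothetical links for illustration.
html = """
<a href="http://mycollege.vip/tim" title="tim china">tim</a>
<a href="http://mycollege.vip/logo.png">logo</a>
<a href="/news/china.html" title="chinaware">news</a>
<a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    # Helper for this example only: the text of every match, in document order.
    return [a.text for a in soup.select(selector)]

print(texts('a[href]'))                          # ['tim', 'logo', 'news']
print(texts('a[href^="http://mycollege.vip/"]')) # ['tim', 'logo']
print(texts('a[href$="png"]'))                   # ['logo']
print(texts('a[href*="china"]'))                 # ['news']
print(texts('a[title*="china"]'))                # ['tim', 'news']  substring match
print(texts('a[title~=china]'))                  # ['tim']          whole-word match
```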

5.5 Other selectors

# p nodes whose parent is a div node
soup.select("div > p")
# p nodes immediately preceded by a sibling div node
soup.select("div + p")
# ul nodes that come after a p node and share its parent
soup.select("p~ul")
# p nodes that are the third p among their parent's children
soup.select("p:nth-of-type(3)")

6. Example

Finally, a small example shows BeautifulSoup in use.

from bs4 import BeautifulSoup

text = '''
<li class="subject-item">
    <div class="pic">
      <a class="nbg" href="https://mycollege.vip/subject/25862578/">
        <img class="" src="https://mycollege.vip/s27264181.jpg" width="90">
      </a>
    </div>
    <div class="info">
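      <h3 class=""><a href="https://mycollege.vip/subject/25862578/" title="Miracles of the Namiya General Store">Miracles of the Namiya General Store</a></h3>
      <div class="pub">[Japan] Keigo Higashino / Li Yingchun / Nanhai Publishing Company / 2014-5 / 39.50 yuan</div>
      <div class="star clearfix">
        <span class="allstar45"></span>
        <span class="rating_nums">8.5</span>
        <span class="pl">
            (537322 ratings)
        </span>
      </div>
      <p>What modern people have lost in their hearts, this general store can help you find again: on a quiet street stands a general store where, if you write down your troubles and drop them through the mail slot in the shutter,
      an answer will be waiting in the milk crate behind the shop the next day. Because her boyfriend is suffering from a terminal... </p>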
    </div>
</li>
'''

soup = BeautifulSoup(text, 'lxml')

print(soup.select_one("a.nbg").get("href"))
print(soup.find("img").get("src"))
title = soup.select_one("h3 a")
print(title.get("href"))
print(title.get("title"))

print(soup.find("div", class_="pub").string)
print(soup.find("span", class_="rating_nums").string)
print(soup.find("span", class_="pl").string.strip())
print(soup.find("p").string)

This is quite simple, and with a good grasp of CSS selectors even fairly complex structures become easy to handle.

7. CSS selectors

7.1 Common selectors

Selector	Example	Description
.class	.intro	Selects all nodes with class="intro"
#id	#firstname	Selects the node with id="firstname"
*	*	Selects all nodes
element	p	Selects all p nodes
element,element	div,p	Selects all div nodes and all p nodes
element element	div p	Selects all p nodes inside div nodes
element>element	div>p	Selects all p nodes whose parent is a div node
element+element	div+p	Selects each p node that immediately follows a div node with the same parent
element~element	p~ul	Selects each ul node that shares a parent with a p node and comes after it
[attribute^=value]	a[src^="https"]	Selects each a node whose src attribute value starts with "https"
[attribute$=value]	a[src$=".png"]	Selects each a node whose src attribute value ends with ".png"
[attribute*=value]	a[src*="abc"]	Selects each a node whose src attribute value contains the substring "abc"
[attribute]	[target]	Selects all nodes that have a target attribute
[attribute=value]	[target=_blank]	Selects all nodes with target="_blank"
[attribute~=value]	[title~=china]	Selects all nodes whose title attribute contains the word "china"
[attribute|=value]	[lang|=zh]	Selects all nodes whose lang attribute value is exactly "zh" or starts with "zh-"

div p also matches grandchildren and deeper descendants, while div > p selects only direct children.

The element~element selector is a little harder to grasp; consider the following example:

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <style>
    p~ul {
        background: red;
    }
    </style>
</head>

<body>
    <div>
        <ul>
            <li>ul-li1</li>
            <li>ul-li1</li>
            <li>ul-li1</li>
        </ul>
        <p>p tag</p>
        <ul>
            <li>ul-li2</li>
            <li>ul-li2</li>
            <li>ul-li2</li>
        </ul>
        <h3>h3 tag</h3>
        <ul>
            <li>ul-li3</li>
            <li>ul-li3</li>
            <li>ul-li3</li>
        </ul>
    </div>
</body>

</html>

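With this stylesheet, the second and third ul blocks get the red background: both come after the p and share its parent div. The first ul comes before the p, so it is not matched.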
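The same selector can be tried from BeautifulSoup. The sketch below uses a trimmed copy of the markup, with id attributes added purely so the matches can be printed (they are an assumption of this example and are not in the original page):

```python
from bs4 import BeautifulSoup

# Trimmed version of the page body; the ids are added for illustration only.
html = """
<div>
  <ul id="ul1"><li>ul-li1</li></ul>
  <p>p tag</p>
  <ul id="ul2"><li>ul-li2</li></ul>
  <h3>h3 tag</h3>
  <ul id="ul3"><li>ul-li3</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

general = [ul["id"] for ul in soup.select("p~ul")]   # any later sibling
adjacent = [ul["id"] for ul in soup.select("p+ul")]  # only the very next sibling

print(general)    # ['ul2', 'ul3']
print(adjacent)   # ['ul2']
```

For comparison, p+ul matches only the ul that directly follows the p, which is why ul3 drops out of the second result.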

7.2 Positional selectors

Selector	Example	Description
:first-of-type	p:first-of-type	Selects each p node that is the first p among its parent's children
:last-of-type	p:last-of-type	Selects each p node that is the last p among its parent's children
:only-of-type	p:only-of-type	Selects each p node that is the only p among its parent's children
:only-child	p:only-child	Selects each p node that is the only child of its parent
:nth-child(n)	p:nth-child(2)	Selects each p node that is the second child of its parent
:nth-last-child(n)	p:nth-last-child(2)	Selects each p node that is the second child of its parent, counting from the last child
:nth-of-type(n)	p:nth-of-type(2)	Selects each p node that is the second p among its parent's children
:nth-last-of-type(n)	p:nth-last-of-type(2)	Selects each p node that is the second-to-last p among its parent's children
:last-child	p:last-child	Selects each p node that is the last child of its parent

Note the difference between tag:nth-child(n) and tag:nth-of-type(n): nth-child counts all sibling elements regardless of type, whereas nth-of-type counts only siblings with the same tag.

This is a little convoluted; the following example may help.

<!DOCTYPE html>
<html>
<head>
    <title>nth</title>
     <style>
        #wrap p:nth-of-type(3) {
            background: red;
        }
 
        #wrap p:nth-child(3) {
            background: yellow;
        }
    </style>
</head>
<body>
    <div id="wrap">
        <p>1-1p</p>
        <div>2-1div</div>
        <p>3-2p</p>
        <p>4-3p</p>
        <p>5-4p</p>
    </div>
</body>
</html>

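Here p:nth-of-type(3) matches 4-3p, the third p inside #wrap, while p:nth-child(3) matches 3-2p, which is the third child of #wrap and happens to be a p.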
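The two selectors can also be checked from BeautifulSoup. This sketch reuses the #wrap markup from the page above, collapsed onto fewer lines:

```python
from bs4 import BeautifulSoup

# Same structure as the #wrap block above: p, div, p, p, p.
html = ('<div id="wrap"><p>1-1p</p><div>2-1div</div>'
        '<p>3-2p</p><p>4-3p</p><p>5-4p</p></div>')
soup = BeautifulSoup(html, "html.parser")

of_type = soup.select_one("#wrap p:nth-of-type(3)")   # third <p>, other tags ignored
nth_child = soup.select_one("#wrap p:nth-child(3)")   # third child, which must be a <p>
second_child = soup.select("#wrap p:nth-child(2)")    # second child is a <div>, so no match

print(of_type.text)     # 4-3p
print(nth_child.text)   # 3-2p
print(second_child)     # []
```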

7.3 Other selectors

Selector	Example	Description
:not(selector)	:not(p)	Selects every node that is not a p node
:empty	p:empty	Selects each p node that has no children
::selection	::selection	Selects the part of the document selected by the user
:focus	input:focus	Selects the input node that has focus
:root	:root	Selects the document's root node
:enabled	input:enabled	Selects each enabled input node
:disabled	input:disabled	Selects each disabled input node
:checked	input:checked	Selects each checked input node
:link	a:link	Selects all unvisited links
:visited	a:visited	Selects all visited links
:active	a:active	Selects the active link
:hover	a:hover	Selects the link the mouse pointer is over
:first-letter	p:first-letter	Selects the first letter of each p node
:first-line	p:first-line	Selects the first line of each p node
:first-child	p:first-child	Selects each p node that is the first child of its parent
:before	p:before	Inserts content before the content of each p node
:after	p:after	Inserts content after the content of each p node
:lang(language)	p:lang(it)	Selects each p node whose lang attribute value starts with "it"
