How to Get the Content Under a Specific Tag with BeautifulSoup

Published: 2020-12-07 14:33:03 Source: 億速云 Views: 254 Author: Leah Category: Development

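How do you get the content under a specific tag with BeautifulSoup? Many people without prior experience get stuck on this, so this article summarizes where the difficulty comes from and how to solve it.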

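First use find_all() to locate the tags that hold the content you need; if a single find_all() call is not enough, use two or more. Then iterate over the find_all() results and call get_text() or get('href') to obtain the text or the link, appending each value to its own list. The drawback is that the text ends up in one list and the links in another: the relationship between the values is lost, and every additional field you want to collect requires yet another empty list.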

Iterating over all tags:

soup.find_all('a')

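This finds every <a> tag in the page and returns the results as a list. The matched tags can then be processed further, for example by looping over them with for and extracting values such as links or text.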

# Create the empty lists
links = []
txts = []
tags = soup.find_all('a')
for tag in tags:
  links.append(tag.get('href'))  # None when the tag has no href attribute
  txts.append(tag.text)          # or: txts.append(tag.get_text())
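The drawback described earlier (text and links living in unrelated lists) can be avoided by storing one record per tag. The following is a minimal sketch of that idea, not the article's original method; the sample HTML and the names `records`, `text`, and `href` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
html = """
<ul>
  <li><a href="/a.html">First</a></li>
  <li><a href="/b.html">Second</a></li>
  <li><a>No link</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# One dict per <a> tag keeps the text and the link together.
records = []
for tag in soup.find_all("a"):
    records.append({
        "text": tag.get_text(strip=True),
        "href": tag.get("href"),  # None when the attribute is missing
    })

for record in records:
    print(record)
```

Adding another field later only means adding another key to the dict, rather than another parallel list.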

Getting the value of an HTML attribute:

atr = []
tags = soup.find_all('a')
for tag in tags:
  if tag.p is not None:             # skip <a> tags that contain no <p>
    atr.append(tag.p.get('class'))  # class value of the <p> under this <a>
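As a small self-contained illustration of attribute access, the sketch below uses a made-up HTML snippet. It shows that `tag.get()` returns `None` for a missing attribute, that `class` comes back as a list because it is a multi-valued attribute, and that `tag.p` is `None` when there is no `<p>` descendant.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> wraps a <p>, the second does not.
html = '<a href="/x" id="first"><p class="note big">hello</p></a><a href="/y">plain</a>'
soup = BeautifulSoup(html, "html.parser")

class_names = []
for tag in soup.find_all("a"):
    child = tag.p                 # first <p> descendant, or None
    if child is not None:
        class_names.append(child.get("class"))  # a list such as ['note', 'big']

first = soup.find("a")
print(first["href"])         # dictionary-style access; raises KeyError if missing
print(first.get("class"))    # .get() returns None instead of raising
print(class_names)
```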

Examples of find_all() usage:

The examples come from the Chinese edition of the Beautiful Soup documentation.
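The examples that follow use a `soup` object without showing how it was built. A plausible setup, assuming the "three sisters" sample document from the Beautiful Soup documentation and the built-in `html.parser` (the parser choice is an assumption, not something the article states), looks like this:

```python
from bs4 import BeautifulSoup

# Sample document as used in the Beautiful Soup documentation.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)
print(len(soup.find_all("a")))
```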

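1. Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The following example finds all <b> tags in the document: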

soup.find_all('b')
# [<b>The Dormouse's story</b>]

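2. Regular expressions

If you pass in a compiled regular expression, Beautiful Soup filters tag names with the pattern's search() method, so the pattern may match anywhere in the name unless you anchor it. The following example finds all tags whose names start with "b", which means both <body> and <b> are found: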

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b

The following code finds all tags whose names contain "t":

for tag in soup.find_all(re.compile("t")):
  print(tag.name)
# html
# title
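To make the effect of anchoring concrete, here is a small sketch on a made-up document. It compares an unanchored pattern, a pattern anchored at the start, and a pattern anchored at both ends; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document containing tag names that share letters.
html = "<html><head><title>T</title></head><body><b>bold</b><table></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "^b" matches names that start with b.
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# "t" matches names that contain a t anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# "^b$" matches only the name "b" itself.
exactly_b = [tag.name for tag in soup.find_all(re.compile(r"^b$"))]

print(starts_with_b)
print(contains_t)
print(exactly_b)
```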

3. Lists

If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and all <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

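4. Functions (a custom function passed to find_all())

If none of the other filters fit, you can define a function that takes an element as its only argument. The function should return True if the element matches and False otherwise.
The following function checks the current element and returns True if it has a class attribute but no id attribute: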

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')

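Passing this function to find_all() returns only the <p> tags. The <a> tags are not returned because they also define "id", and <html> and <head> are not returned because they do not define "class".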
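A runnable sketch of this filter, using a small made-up document rather than the documentation's sample, might look like the following. The names `matches`, `names`, and `classes` are introduced here only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document: two <p> tags with class only, one <a> with class and id.
html = """
<html><head><title>Demo</title></head>
<body>
<p class="title">Heading</p>
<p class="story">Text with <a class="sister" id="link1" href="#">a link</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

# find_all() calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(has_class_but_no_id)
names = [tag.name for tag in matches]
classes = [tag["class"] for tag in matches]

print(names)
print(classes)
```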
The following code finds all tags that are surrounded by string objects:

from bs4 import NavigableString
def surrounded_by_strings(tag):
  return (isinstance(tag.next_element, NavigableString)
      and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
  print(tag.name)
# p
# a
# a
# a
# p

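5. Searching by CSS class

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. Starting with Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ argument: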

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Or:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
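One detail worth illustrating is that a tag can carry several CSS classes, and `class_` matches if any one of them equals the value you pass. The sketch below uses made-up markup to show this, and also that the `attrs` form gives the same result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has two classes, the last has none.
html = '<p class="body strikeout">old</p><p class="body">new</p><p>plain</p>'
soup = BeautifulSoup(html, "html.parser")

# Matches the first <p> only, even though it also has the class "body".
strikeout = soup.find_all("p", class_="strikeout")

# Matches both <p> tags that list "body" among their classes.
body = soup.find_all("p", class_="body")

# The attrs dictionary is an equivalent spelling of the same search.
body_via_attrs = soup.find_all("p", attrs={"class": "body"})

print([tag.get_text() for tag in strikeout])
print([tag.get_text() for tag in body])
```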

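6. Searching with the text argument

The text argument searches the strings in the document rather than the tags. Like the name argument, text accepts a string, a regular expression, a list, or True. (Beautiful Soup 4.4.0 and later also accept this argument under the name string.) For example: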

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
  """Return True if this string is the only child of its parent tag."""
  return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

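Although text is meant for searching strings, it can be combined with other arguments to filter tags: Beautiful Soup then returns the tags whose .string matches the text value. The following code finds the <a> tags whose .string is "Elsie":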

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
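Because the comparison is made against a tag's `.string`, a tag that contains more than one child has no single `.string` and is not matched. The sketch below illustrates this on made-up markup. It assumes Beautiful Soup 4.4.0 or later and therefore uses the `string` argument, the newer name for `text`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the third <a> has two children, so its .string is None.
html = (
    '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a> '
    '<a id="link3"><b>Tillie</b> Smith</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact comparison against each <a> tag's .string.
exact = soup.find_all("a", string="Elsie")

# A regular expression is applied to .string as well.
partial = soup.find_all("a", string=re.compile("ie$"))

print([tag["id"] for tag in exact])
print([tag["id"] for tag in partial])
print(soup.find("a", id="link3").string)
```

When the text is spread over nested tags, a custom function that inspects `tag.get_text()` is one possible alternative.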

7. Searching only the direct children of a tag

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. If you only want to search the tag's direct children, pass recursive=False.

A simple document:

<html>
 <head>
 <title>
  The Dormouse's story
 </title>
 </head>
...

Search results with and without the recursive argument:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
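The same comparison can be run end to end with a self-contained sketch. The document below is a shortened, made-up variant of the one above, and `find_all(True)` is used to list every direct child tag.

```python
from bs4 import BeautifulSoup

# Hypothetical document: <title> is a grandchild of <html>, not a direct child.
html = "<html><head><title>The Dormouse's story</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: all descendants of <html> are searched.
all_titles = soup.html.find_all("title")

# recursive=False: only the direct children of <html> are considered.
direct_titles = soup.html.find_all("title", recursive=False)

# The direct children of <html> are <head> and <body>.
direct_children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

# <title> is a direct child of <head>, so it is found from there.
head_titles = soup.head.find_all("title", recursive=False)

print(len(all_titles), direct_titles, direct_children, len(head_titles))
```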
