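Python web crawlers commonly rely on four parsing helpers: re regular expressions, etree xpath, scrapy xpath, and BeautifulSoup. (etree xpath and scrapy xpath differ considerably in usage, so they are not grouped together.) This post introduces a little-known BeautifulSoup pitfall; see the examples:
Example 1 (the text looks unusual because it is Khmer, please bear with it):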
from bs4 import BeautifulSoup

content = """
<html>
<body>
<div class="td-post-content td-pb-padding-side">
<p>
<img alt="" class="alignnone size-full wp-image-122426"
data-recalc-dims="1" height="352"
src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching.jpg?resize=630%2C352&amp;ssl=1"
width="630"/>
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122427"
data-recalc-dims="1" height="473"
src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
</p>
<p>
????????????????? ?????????????????????????????
?????????????????????????????????????????????????????????????????
????????????????????????? ??????????????????????????
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122427"
data-recalc-dims="1" height="473"
src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122428"
data-recalc-dims="1" height="473"
src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching2.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
<br/>
<em>
<br/>
??????
</em>
??????????????????????? ???????????? ??????????????????
?????????????????????????????????????
</p>
</div>
</body>
</html>
"""
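# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(content, "html.parser")
img_lst = []
inner_src_list = soup.find_all('img', src=True)
for i, src in enumerate(inner_src_list):
    # Attempt to strip the entity by hand; it turns out there is nothing to replace.
    url = src["src"].replace("&amp;ssl", "&ssl")
    print(url)
print(soup.prettify())
# content = soup.prettify()  # the printed src values are the same
img_tags = soup.find_all('img')
for img in img_tags:
    print(img['src'])
Console output: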
![](https://cache.yisu.com/upload/information/20200310/57/120424.jpg?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
![](https://cache.yisu.com/upload/information/20200310/57/120431.jpg?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
![](https://cache.yisu.com/upload/information/20200310/57/120432.jpg?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
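How can this be: where did the 'amp;' characters in the text go?
Explanation: when BeautifulSoup parses the page, it decodes the HTML entity '&amp;' into '&', so the src value it returns no longer contains 'amp;'. (Web-page parsing does not always match what intuition suggests.) (This is not specific to bs; etree xpath and scrapy xpath behave the same way.)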
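A minimal sketch of this behaviour, assuming bs4 is installed and using the standard-library "html.parser" backend. The one-tag `html` string and its example.com URL are made up for illustration and are not taken from the page above. It suggests that the attribute value held in the parsed tree is already entity-decoded, while serializing the tree escapes the ampersand again:

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document; the URL is illustrative only.
html = '<img src="https://example.com/a.jpg?resize=630%2C352&amp;ssl=1"/>'

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# The parser decodes entities while building the tree,
# so the stored attribute value contains a bare '&'.
src = img["src"]
print(src)

# Serializing escapes '&' again, which is why markup printed
# via str() or prettify() still shows '&amp;'.
serialized = str(img)
print(serialized)
```

This is why a manual replace of "&amp;ssl" on the extracted value has nothing to do: the decoded URL is already the one a browser would request.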
Example 2:
Same text as above.
soup = BeautifulSoup(content, "html.parser")
img_lst = []
inner_src_list = soup.find_all('img', src=True)  # compare with the call below
for i, src in enumerate(inner_src_list):
    url = src["src"].replace("&amp;ssl", "&ssl")
    print(url)
inner_src_list = soup.find_all('img', attr={'src': True})  # compare: the keyword here is attr
for i, src in enumerate(inner_src_list):
    url = src["src"].replace("&amp;ssl", "&ssl")
    print(url)
The output is not reproduced here; the observed behaviour is that the first print works normally, while the second print outputs nothing. Why?
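Explanation: the first find_all treats src=True as "img tags that have a src attribute". In the second call the keyword is spelled attr rather than attrs. find_all has no attr parameter, so BeautifulSoup takes it as a filter on an HTML attribute literally named attr, which no img tag has, and nothing matches. Written as attrs={'src': True}, the second call returns the same tags as the first.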
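The difference can be checked with a small sketch, assuming bs4 is installed. The two-tag `html` string is a made-up stand-in for the article's content. It compares the keyword form `src=True`, the dictionary form passed through the `attrs` parameter, and a call that uses the keyword `attr`, which is not a `find_all` parameter and so ends up as a filter on an HTML attribute named "attr":

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in document: one <img> with src, one without.
html = '<p><img src="a.jpg"/><img alt="no source"/></p>'
soup = BeautifulSoup(html, "html.parser")

# Keyword filter: any <img> that has a src attribute.
by_keyword = soup.find_all("img", src=True)

# The same filter expressed through the `attrs` dictionary.
by_attrs = soup.find_all("img", attrs={"src": True})

# `attr` is not a find_all parameter, so it is treated as a filter on an
# HTML attribute literally named "attr", which no tag here has.
by_typo = soup.find_all("img", attr={"src": True})

print(len(by_keyword), len(by_attrs), len(by_typo))
```

Because unknown keywords are accepted silently as attribute filters, a misspelling like this raises no error; an empty result list is the only symptom.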
If anything above is incorrect, corrections are welcome. Thank you!