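An Analysis of Basic Usage of the Python lxml Module

Published: 2020-09-18 13:47:53 | Source: 腳本之家 | Views: 204 | Author: Dylan HU | Category: Development

This article walks through the basic usage of the Python lxml module with worked examples, shared here for reference.

1 Installing lxml

Install with: pip install lxml

2 Using lxml

2.1 Getting started with the lxml module

Import the etree library from lxml (if your editor shows no autocompletion hint for the import, that does not mean it is unusable)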

from lxml import etree

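Use etree.HTML to convert a string into an Element object. It accepts both bytes and str input. The Element object has an xpath method, which returns a list of results.

html = etree.HTML(text)
ret_list = html.xpath("xpath string")

To convert an Element object back into a string, call etree.tostring(element), which returns a bytes result.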
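The input and output types described above can be checked with a minimal sketch. The `snippet` markup below is a made-up sample, not part of the original article.

```python
from lxml import etree

snippet = "<p>hello</p>"  # hypothetical sample markup

# etree.HTML accepts both str and bytes input
from_str = etree.HTML(snippet)
from_bytes = etree.HTML(snippet.encode("utf-8"))

# tostring returns bytes by default
raw = etree.tostring(from_str)
print(type(raw))

# decode the bytes yourself, or request text directly with encoding="unicode"
as_text = etree.tostring(from_bytes, encoding="unicode")
print(as_text)
```

Passing encoding="unicode" is simply an alternative to calling .decode() on the bytes result.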

Suppose we have the following HTML string and want to work with it:

<div> <ul> 
<li class="item-1"><a href="link1.html">first item</a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> # note: the closing </li> tag is missing here 
</ul> </div>

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html">first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
# Parse the string into an Element tree
html = etree.HTML(text)
print(type(html))
# tostring returns bytes, so decode it before printing
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)

Output:

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
        </li></ul> </div> </body></html>

As you can see, lxml does fill in the missing tag. Keep in mind, though, that lxml is written by people: a page may be malformed, or lxml may have a bug, so an XPath written against the raw response for a URL can still fail to extract the data. When that happens, use etree.tostring to see what etree actually turned the HTML into, and write the extraction against that converted HTML string.
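A minimal sketch of that debugging step follows, assuming a small made-up fragment (`broken`) whose <li> is never closed. It serializes the parsed tree to see the repaired structure, then runs XPath against that tree.

```python
from lxml import etree

# Hypothetical malformed fragment: the <li> is never closed
broken = "<ul><li class='item-0'><a href='link5.html'>fifth item</a></ul>"
root = etree.HTML(broken)

# Serialize the parsed tree to inspect what the parser actually built
repaired = etree.tostring(root, encoding="unicode", pretty_print=True)
print(repaired)

# Write the XPath against the repaired structure, not the raw source
titles = root.xpath("//ul/li/a/text()")
print(titles)
```

Printing the serialized tree first makes it clear which wrapper elements (such as html and body) and closing tags the parser added before you write any XPath.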

2.2 Further practice with lxml

Let's keep going. Suppose every li tag with class item-1 is one news item. How do we turn each news item into a dictionary?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html">first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
# Get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
# Assemble the dictionaries by pairing the two lists by position
for i, href in enumerate(href_list):
    item = {}
    item["href"] = href
    item["title"] = title_list[i]
    print(item)

Output:

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

Now suppose that one of the news items has no href. What happens?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''

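Running the same extraction code on this text gives:

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

Every href is now paired with the wrong title, which is not what we want. Section 2.3 shows how to solve this problem.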
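The cause of the mismatch can be seen by comparing the lengths of the two result lists. The sketch below uses a shortened version of the sample markup (an assumption made for brevity); the missing href simply produces no entry, so the lists no longer line up by position.

```python
from lxml import etree

# Shortened sample: the first news item has no href attribute
text = """<ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>"""

html = etree.HTML(text)
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")

# A missing attribute yields no result at all, not a placeholder
print(len(href_list), len(title_list))
if len(href_list) != len(title_list):
    print("lists differ in length; pairing by position is unsafe")
```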

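2.3 Advanced usage of the lxml module

So far, selecting an attribute or text returned strings. But what is returned when we select a node?

The result is Element objects, which can themselves call the xpath method. This lets us structure the extraction as: first group by some tag, then extract the data within each group.

For example: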

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
# Select the li nodes themselves rather than their attributes or text
li_list = html.xpath("//li[@class='item-1']")
print(li_list)

Result:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

The result is a list of Element objects, and each of them can call the xpath method in turn.
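When calling xpath on one of these Element objects, the leading `./` matters. The sketch below, using a small made-up document, contrasts a relative path with a path starting with `//`, which still searches the whole document even when called on a child element.

```python
from lxml import etree

# Hypothetical two-item document
doc = etree.HTML("""
<ul>
  <li class="item-1"><a href="a.html">A</a></li>
  <li class="item-1"><a href="b.html">B</a></li>
</ul>
""")

first_li = doc.xpath("//li[@class='item-1']")[0]

# "./" is relative to first_li, so only its own <a> is found
relative = first_li.xpath("./a/@href")
# "//" starts from the document root, so every <a> is found
absolute = first_li.xpath("//a/@href")

print(relative)
print(absolute)
```

This is why per-group extraction should use paths that begin with `./`.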

First group by the li tag, then extract the data within each group:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
# Group by the li tag
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
# Extract the data within each group
for li in li_list:
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/text()")
    item = {}
    # Fall back to None when the attribute or text is missing
    item["href"] = hrefs[0] if hrefs else None
    item["title"] = titles[0] if titles else None
    print(item)

Result:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

In the code above, extraction needs a check because some pages may not contain the data. A conditional (ternary) expression handles that case.
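If the same check is needed for many fields, it can be wrapped in a small helper. `first_or_none` below is a hypothetical name introduced for illustration, not an lxml function, and the markup is a shortened version of the sample.

```python
from lxml import etree


def first_or_none(node, path):
    """Return the first XPath result for `path` under `node`, or None if empty."""
    results = node.xpath(path)  # evaluate the expression only once
    return results[0] if results else None


# Shortened sample: the first news item has no href attribute
text = """<ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>"""

items = [
    {"href": first_or_none(li, "./a/@href"), "title": first_or_none(li, "./a/text()")}
    for li in etree.HTML(text).xpath("//li[@class='item-1']")
]
print(items)
```

The helper keeps the same conditional expression but avoids repeating it for every field.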


I hope this article is helpful for your Python programming.
