
How to clean and normalize HTML documents with lxml

小億
2024-05-14 13:23:16
Category: Programming languages

The steps for cleaning and normalizing an HTML document with the lxml library are as follows:

  1. Import the lxml library:
from lxml import etree
  2. Read the HTML document:
html = """
<html>
<head>
<title>Example</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is an example HTML document.</p>
</body>
</html>
"""

# Parse the HTML string into an lxml element tree (the HTML parser repairs malformed markup as it goes)
tree = etree.HTML(html)
  3. Clean the HTML document:
# Serialize the tree back to an HTML string with tostring; the output reflects the repaired tree, not the raw input
clean_html = etree.tostring(tree, pretty_print=True, method="html").decode('utf-8')
  4. Normalize the HTML document:
# Pass method="xml" to tostring to serialize the same tree as well-formed XML (for example, <br> becomes <br/>)
normalized_html = etree.tostring(tree, pretty_print=True, method="xml").decode('utf-8')
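
The example document above is already well formed, so the round trip changes little. As a rough sketch of what parsing and serializing do to messier input, the snippet below feeds a deliberately broken fragment through the same calls. The `broken` string and the variable names are illustrative assumptions, not part of the original steps.

```python
from lxml import etree

# Deliberately malformed input (hypothetical): unclosed <p> tags, a bare <br>,
# and no <html>/<body> wrapper.
broken = "<p>First paragraph<br><p>Second paragraph"

# etree.HTML uses a recovering HTML parser, so it still builds a complete tree.
tree = etree.HTML(broken)

# method="html" writes void elements the HTML way (<br>).
as_html = etree.tostring(tree, method="html").decode("utf-8")

# method="xml" writes the same tree as well-formed XML (<br/>).
as_xml = etree.tostring(tree, method="xml").decode("utf-8")

print(as_html)
print(as_xml)
```

In both outputs the paragraphs are closed and wrapped in `<html>` and `<body>`; the two serializations differ mainly in how empty elements are written.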

With these steps you can clean and normalize an HTML document using lxml.
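
If "cleaning" should also mean dropping unwanted markup, one possible approach is to edit the tree before serializing it. The sketch below assumes that scripts, styles, comments, and the `style` and `onload` attributes are what should go; that choice and the sample document are assumptions made for illustration. It uses `etree.strip_elements` and `etree.strip_attributes` from lxml.

```python
from lxml import etree

# Hypothetical input containing markup we assume should be removed.
html = """
<html><head><style>p {color: red}</style></head>
<body onload="track()">
<!-- internal note -->
<p style="color: red">Hello</p>
<script>alert("x")</script>
</body></html>
"""

tree = etree.HTML(html)

# Delete <script> and <style> elements together with their content.
# with_tail=False keeps any text that follows the removed element.
etree.strip_elements(tree, "script", "style", with_tail=False)

# Delete comment nodes in the same way.
etree.strip_elements(tree, etree.Comment, with_tail=False)

# Remove the named attributes from every element in the tree.
etree.strip_attributes(tree, "style", "onload")

cleaned = etree.tostring(tree, pretty_print=True, method="html").decode("utf-8")
print(cleaned)
```

This is not a sanitizer for untrusted input; it only removes the specific tags and attributes listed.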
