
How to Use Web Scraping to Scrape HTML Web Pages

Published: 2022-03-09 15:10:00 Source: yisu.com Reads: 158 Author: iii Category: web development

This article explains how to use web scraping to scrape HTML web pages. The explanations are simple, clear, and easy to follow; work through them step by step to study the topic.

  There are three common ways to obtain data:

  -Scraping HTML web pages

  -Downloading data files directly, e.g. csv, txt, or pdf files

  -Accessing data through an application programming interface (API), e.g. a movie database or Twitter
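To give a feel for the third option, here is a minimal sketch of consuming an API response. The URL in the comment and the JSON payload are made up for illustration; a real API (such as Twitter's) would also require authentication, which is omitted here. The example parses a canned payload so it runs offline.

```python
import json

def parse_movie(payload):
    # Decode a JSON string, as typically returned by a movie-database API,
    # into a Python dictionary.
    return json.loads(payload)

# A real call might look like:
#   requests.get("https://api.example.com/movies/1").json()
# (placeholder URL, not a real endpoint)
movie = parse_movie('{"title": "Inception", "year": 2010}')
print(movie["title"], movie["year"])
```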

  If you choose web scraping, you will of course need to understand the basic structure of an HTML page. For reference:

  The basic structure of HTML

  HTML tags: head, body, p, a, form, table, and so on.

  Tags can have attributes. For example, the a tag has an href attribute that holds the target of the link.

  class and id are special attributes that HTML uses to control the style of each element through Cascading Style Sheets (CSS). An id is a unique identifier for an element, while a class is used to group elements for styling.

  An element can be associated with multiple classes. The classes are separated by spaces, e.g. <h3 class="city main">London</h3>

  In an example from W3SCHOOLS, the city class rule sets three style properties and the main class rule sets one; the London heading uses both classes, city and main, so both rules apply to it when the page is rendered.
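A tag with multiple classes can be inspected directly in BeautifulSoup; because class is a multi-valued attribute, it comes back as a list. A minimal sketch using the London example above:

```python
from bs4 import BeautifulSoup

# The <h3> element carries two classes, "city" and "main".
html = '<h3 class="city main">London</h3>'
soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# Multi-valued attributes such as class are returned as a list
print(h3["class"])    # ['city', 'main']
print(h3.get_text())  # London
```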

  Tags can also be referred to by their position relative to one another.

  child: a child is a tag inside another tag, e.g. the two p tags are children of the div tag.

  parent: a parent is a tag that another tag is inside, e.g. the html tag is the parent of the body tag.

  siblings: siblings are tags that share the same parent tag, e.g. in the html example, the head and body tags are siblings because both are inside html, and the two p tags are siblings because both are inside body.
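The parent, child, and sibling relationships above map directly onto BeautifulSoup's navigation API. A small sketch on a made-up document:

```python
from bs4 import BeautifulSoup

# A small document mirroring the relationships described above:
# a div with two p children, inside body, inside html.
html = ("<html><head><title>Demo</title></head>"
        "<body><div><p>first</p><p>second</p></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")

# parent: the div tag directly encloses both p tags
print(first_p.parent.name)  # div

# sibling: the second p shares the same parent as the first
print(first_p.find_next_sibling("p").get_text())  # second

# children: iterate over the direct children of body
body = soup.find("body")
print([child.name for child in body.children])  # ['div']
```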

  Scraping a web page in four steps:

  Step 1: install the modules

  Install requests and beautifulsoup4, which are used to fetch and parse page content.

  Install modules such as requests, BeautifulSoup4, scrapy, or selenium.

  requests lets you send HTTP/1.1 requests from Python. To install it, open a terminal (Mac) or the Anaconda Command Prompt (Windows) and run: pip install requests

  BeautifulSoup is a web page parsing library. To install it, use: pip install beautifulsoup4

  Step 2: use the installed packages to read the page's source code

  Step 3: browse the page source to find the location of the information you need

  Browsers expose the page source in slightly different ways; a few are listed below, and each browser's help pages have the details.

  Firefox: right-click on the web page and select "View Page Source". Safari: see the browser's instructions for viewing the page source. Internet Explorer: see the browser's instructions for viewing the page source.

  Step 4: start reading the page

  Parsing libraries to choose from:

  -BeautifulSoup: simple to use; supports CSS selectors but not XPath

  -scrapy: supports both CSS selectors and XPath

  -Selenium: can scrape dynamic pages (for example, pages that keep loading as you scroll)

  -lxml, and others

  Key concepts in BeautifulSoup:

  -Tag: an XML or HTML tag

  -Name: every tag has a name

  -Attributes: a tag may have any number of attributes. A tag's attributes are exposed as a dictionary in the form {attribute1_name: attribute1_value, attribute2_name: attribute2_value, ...}. If an attribute has multiple values, the value is stored as a list.

  -NavigableString: the text within a tag
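The BeautifulSoup concepts listed above can be seen in one short sketch; the p tag below is made up for illustration:

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p class="intro lead" id="p1">Hello</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Name: every tag has a name
print(p.name)   # p

# Attributes: exposed as a dictionary; the multi-valued
# class attribute is stored as a list
print(p.attrs)  # {'class': ['intro', 'lead'], 'id': 'p1'}

# NavigableString: the text within a tag
text = p.string
print(isinstance(text, NavigableString))  # True
print(text)                               # Hello
```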

  Now for the code:

  #Import requests and beautifulsoup packages

  from IPython.core.interactiveshell import InteractiveShell
  InteractiveShell.ast_node_interactivity="all"

  # import requests package
  import requests

  # import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
  from bs4 import BeautifulSoup

  # Get web page content

  # send a get request to the web page
  # (the tutorial linked to a simple example page;
  # substitute the URL of the page you want to scrape)
  page=requests.get("https://www.example.com")

  # status_code 200 indicates success;
  # any other status code indicates a problem
  if page.status_code==200:
      # content property gives the content returned in bytes
      print(page.content)  # text in bytes
      print(page.text)     # text in unicode

  #Parse web page content

  # Process the returned content using beautifulsoup module

  # initiate a beautifulsoup object using the html source and Python's html.parser
  soup=BeautifulSoup(page.content, 'html.parser')

  # soup object stands for the **root**
  # node of the html document tree
  print("Soup object:")

  # print soup object nicely
  print(soup.prettify())

  # soup.children returns an iterator of all children nodes
  print("\nsoup children nodes:")
  soup_children=soup.children
  print(soup_children)

  # convert to list
  soup_children=list(soup.children)
  print("\nlist of children of root:")
  print(len(soup_children))

  # html is the only child of the root node
  html=soup_children[0]
  html

  # Get head and body tag
  html_children=list(html.children)
  print("how many children under html? ", len(html_children))

  for idx, child in enumerate(html_children):
      print("Child {} is: {}\n".format(idx, child))

  # head is the second child of html
  head=html_children[1]

  # extract all text inside head
  print("\nhead text:")
  print(head.get_text())

  # body is the fourth child of html
  body=html_children[3]

  # Get details of a tag

  # get the first p tag in the div of body
  div=list(body.children)[1]
  p=list(div.children)[1]
  p

  # get the details of p tag
  # first, get the data type of p
  print("\ndata type:")
  print(type(p))

  # get tag name (property of p object)
  print("\ntag name: ")
  print(p.name)

  # a tag object with attributes has a dictionary
  # use <tag>.attrs to get the dictionary
  # each attribute name of the tag is a key

  # get all attributes
  p.attrs

  # get "class" attribute
  print("\ntag class: ")
  print(p["class"])

  # how to determine if 'id' is an attribute of p?

  # get text of p tag
  p.get_text()
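The listing above ends with an open question: how to determine whether 'id' is an attribute of p. Since a tag's attrs is a plain dictionary, a membership test answers it. The sketch below uses a made-up HTML snippet, and also shows the CSS selectors that step four mentions BeautifulSoup supports:

```python
from bs4 import BeautifulSoup

html = ('<div id="main">'
        '<p class="first para">one</p>'
        '<p class="para">two</p>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
# .attrs is a dictionary, so a membership test answers
# "does this tag have an id attribute?"
print("id" in p.attrs)     # False
print("class" in p.attrs)  # True

# CSS selectors via select(): all p tags with class "para"
# inside the div with id "main"
paras = soup.select("div#main p.para")
print([t.get_text() for t in paras])  # ['one', 'two']
```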

Thank you for reading. That concludes "How to Use Web Scraping to Scrape HTML Web Pages". After working through this article you should have a deeper understanding of the topic, though the details still need to be verified in your own practice. This is yisu.com; the editor will publish more articles on related topics, so stay tuned!
