What is a fast and effective way to retrieve web page data in web development?

Published: 2022-01-07 11:49:52 | Source: Yisu Cloud | Author: iii | Category: Big Data

This article explains a fast and effective way to retrieve web page data in web development. It works through two web scraping problems, then shows a “lazier” alternative to each.

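Web scraping problem 1

A web scraper is trying to find Facebook's current stock price. The code is as follows:

    import requests
    from bs4 import BeautifulSoup

    def parsePrice():
        # Download the quote page and parse its HTML.
        r = requests.get("https://finance.yahoo.com/quote/FB?p=FB")
        soup = BeautifulSoup(r.text, "lxml")
        # The price is the first <span> inside the <div> with these CSS classes.
        price = soup.find('div', {'class': 'My(6px) Pos(r) smartphone_Mt(6px)'}).find('span').text
        print(f'the current price: {price}')

    parsePrice()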

The code prints:

the current price: 216.08

A simple scraping solution like this is easy to write, but it is not “lazy” enough yet. Let's look at the next one.

Web scraping problem 2

The scraper is trying to find a stock's enterprise value and number of shares short from the Statistics tab. The question was really about retrieving nested dictionary values that may or may not exist, but along the way the scraper seems to have found a better way to retrieve the data.

    import requests, re, json, pprint

    p = re.compile(r'root.App.main = (.*);')
    tickers = ['AGL.AX']
    results = {}

    with requests.Session() as s:
        for ticker in tickers:
            r = s.get('https://finance.yahoo.com/quote/{}/key-statistics?p={}'.format(ticker, ticker))
            # The captured group is the JSON object assigned to root.App.main.
            data = json.loads(p.findall(r.text)[0])
            key_stats = data['context']['dispatcher']['stores']['QuoteSummaryStore']
            print(key_stats)
            res = {
                'Enterprise Value': key_stats['defaultKeyStatistics']['enterpriseValue']['fmt'],
                # sharesShort may be an empty dict, so fall back to 'N/A'.
                'Shares_Short': key_stats['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A')
            }
            results[ticker] = res

    print(results)

Look at line 3: the scraper was able to find the data inside a JavaScript variable:

root.App.main = {.... };

From there, the data can be retrieved simply by accessing the appropriate nested keys in the dictionary. However, there is an even “lazier” way.
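To make the embedded-variable technique concrete without a network request, here is a minimal sketch that runs against a made-up page. `SAMPLE_HTML`, `payload`, and `extract_embedded_json` are hypothetical names introduced for illustration; the real page is far larger and its layout may have changed since the article was written.

```python
import json
import re

# Hypothetical, trimmed-down stand-in for the data the real page embeds.
payload = {"context": {"dispatcher": {"stores": {"QuoteSummaryStore": {
    "defaultKeyStatistics": {"enterpriseValue": {"fmt": "13.68B"}, "sharesShort": {}}
}}}}}

# Hypothetical page source containing the JavaScript assignment.
SAMPLE_HTML = (
    "<html><body><script>\n"
    "root.App.main = " + json.dumps(payload) + ";\n"
    "</script></body></html>"
)

# Same idea as the pattern in the article, with the dots escaped.
PATTERN = re.compile(r"root\.App\.main\s*=\s*(.*);")


def extract_embedded_json(html):
    """Return the object assigned to root.App.main, or None if it is absent."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_embedded_json(SAMPLE_HTML)
stats = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["defaultKeyStatistics"]
print(stats["enterpriseValue"]["fmt"])             # 13.68B
print(stats["sharesShort"].get("longFmt", "N/A"))  # N/A
```

Returning `None` when the pattern is not found lets the caller distinguish "the page layout changed" from "the value is missing", which an `IndexError` from `findall(...)[0]` would not.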

“Lazy” solution 1

    import requests

    r = requests.get("https://query2.finance.yahoo.com/v10/finance/quoteSummary/FB?modules=price")
    data = r.json()
    print(data)
    print(f"the current price: {data['quoteSummary']['result'][0]['price']['regularMarketPrice']['raw']}")

Look at the URL on line 3. The output is as follows:

    {
        'quoteSummary': {
            'error': None,
            'result': [{
                'price': {
                    'averageDailyVolume10Day': {},
                    'averageDailyVolume3Month': {},
                    'circulatingSupply': {},
                    'currency': 'USD',
                    'currencySymbol': '$',
                    'exchange': 'NMS',
                    'exchangeDataDelayedBy': 0,
                    'exchangeName': 'NasdaqGS',
                    'fromCurrency': None,
                    'lastMarket': None,
                    'longName': 'Facebook, Inc.',
                    'marketCap': {'fmt': '698.42B', 'longFmt': '698,423,836,672.00', 'raw': 698423836672},
                    'marketState': 'REGULAR',
                    'maxAge': 1,
                    'openInterest': {},
                    'postMarketChange': {},
                    'postMarketPrice': {},
                    'preMarketChange': {'fmt': '-0.90', 'raw': -0.899994},
                    'preMarketChangePercent': {'fmt': '-0.37%', 'raw': -0.00368096},
                    'preMarketPrice': {'fmt': '243.60', 'raw': 243.6},
                    'preMarketSource': 'FREE_REALTIME',
                    'preMarketTime': 1594387780,
                    'priceHint': {'fmt': '2', 'longFmt': '2', 'raw': 2},
                    'quoteSourceName': 'Nasdaq Real Time Price',
                    'quoteType': 'EQUITY',
                    'regularMarketChange': {'fmt': '0.30', 'raw': 0.30160522},
                    'regularMarketChangePercent': {'fmt': '0.12%', 'raw': 0.0012335592},
                    'regularMarketDayHigh': {'fmt': '245.49', 'raw': 245.49},
                    'regularMarketDayLow': {'fmt': '239.32', 'raw': 239.32},
                    'regularMarketOpen': {'fmt': '243.68', 'raw': 243.685},
                    'regularMarketPreviousClose': {'fmt': '244.50', 'raw': 244.5},
                    'regularMarketPrice': {'fmt': '244.80', 'raw': 244.8016},
                    'regularMarketSource': 'FREE_REALTIME',
                    'regularMarketTime': 1594410026,
                    'regularMarketVolume': {'fmt': '19.46M', 'longFmt': '19,456,621.00', 'raw': 19456621},
                    'shortName': 'Facebook, Inc.',
                    'strikePrice': {},
                    'symbol': 'FB',
                    'toCurrency': None,
                    'underlyingSymbol': None,
                    'volume24Hr': {},
                    'volumeAllCurrencies': {}
                }
            }]
        }
    }
    the current price: 241.63

“Lazy” solution 2

    import requests

    r = requests.get("https://query2.finance.yahoo.com/v10/finance/quoteSummary/AGL.AX?modules=defaultKeyStatistics")
    data = r.json()
    print(data)
    print({
        'AGL.AX': {
            'Enterprise Value': data['quoteSummary']['result'][0]['defaultKeyStatistics']['enterpriseValue']['fmt'],
            'Shares Short': data['quoteSummary']['result'][0]['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A')
        }
    })

Again, look at the URL on line 3. The output is as follows:

    {
        'quoteSummary': {
            'result': [{
                'defaultKeyStatistics': {
                    'maxAge': 1,
                    'priceHint': {'raw': 2, 'fmt': '2', 'longFmt': '2'},
                    'enterpriseValue': {'raw': 13677747200, 'fmt': '13.68B', 'longFmt': '13,677,747,200'},
                    'forwardPE': {},
                    'profitMargins': {'raw': 0.07095, 'fmt': '7.10%'},
                    'floatShares': {'raw': 637754149, 'fmt': '637.75M', 'longFmt': '637,754,149'},
                    'sharesOutstanding': {'raw': 639003008, 'fmt': '639M', 'longFmt': '639,003,008'},
                    'sharesShort': {},
                    'sharesShortPriorMonth': {},
                    'sharesShortPreviousMonthDate': {},
                    'dateShortInterest': {},
                    'sharesPercentSharesOut': {},
                    'heldPercentInsiders': {'raw': 0.0025499999, 'fmt': '0.25%'},
                    'heldPercentInstitutions': {'raw': 0.31033, 'fmt': '31.03%'},
                    'shortRatio': {},
                    'shortPercentOfFloat': {},
                    'beta': {'raw': 0.365116, 'fmt': '0.37'},
                    'morningStarOverallRating': {},
                    'morningStarRiskRating': {},
                    'category': None,
                    'bookValue': {'raw': 12.551, 'fmt': '12.55'},
                    'priceToBook': {'raw': 1.3457094, 'fmt': '1.35'},
                    'annualReportExpenseRatio': {},
                    'ytdReturn': {},
                    'beta3Year': {},
                    'totalAssets': {},
                    'yield': {},
                    'fundFamily': None,
                    'fundInceptionDate': {},
                    'legalType': None,
                    'threeYearAverageReturn': {},
                    'fiveYearAverageReturn': {},
                    'priceToSalesTrailing12Months': {},
                    'lastFiscalYearEnd': {'raw': 1561852800, 'fmt': '2019-06-30'},
                    'nextFiscalYearEnd': {'raw': 1625011200, 'fmt': '2021-06-30'},
                    'mostRecentQuarter': {'raw': 1577750400, 'fmt': '2019-12-31'},
                    'earningsQuarterlyGrowth': {'raw': 0.114, 'fmt': '11.40%'},
                    'revenueQuarterlyGrowth': {},
                    'netIncomeToCommon': {'raw': 938000000, 'fmt': '938M', 'longFmt': '938,000,000'},
                    'trailingEps': {'raw': 1.434, 'fmt': '1.43'},
                    'forwardEps': {},
                    'pegRatio': {},
                    'lastSplitFactor': None,
                    'lastSplitDate': {},
                    'enterpriseToRevenue': {'raw': 1.035, 'fmt': '1.03'},
                    'enterpriseToEbitda': {'raw': 6.701, 'fmt': '6.70'},
                    '52WeekChange': {'raw': -0.17621362, 'fmt': '-17.62%'},
                    'SandP52WeekChange': {'raw': 0.045882702, 'fmt': '4.59%'},
                    'lastDividendValue': {},
                    'lastCapGain': {},
                    'annualHoldingsTurnover': {}
                }
            }],
            'error': None
        }
    }
    {'AGL.AX': {'Enterprise Value': '13.73B', 'Shares Short': 'N/A'}}
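In the response above, `sharesShort` is an empty dictionary, which is why the code falls back to 'N/A' with `.get()`. The underlying question of reading nested values that may or may not exist can be handled more generally. The following is only a sketch: `dig` is a hypothetical helper name, and `data` is a hand-trimmed copy of the response shown above, not a live result.

```python
def dig(obj, *keys, default="N/A"):
    """Follow keys/indexes through nested dicts and lists.

    Returns `default` as soon as any step is missing or not subscriptable.
    """
    current = obj
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-trimmed sample of the response printed above.
data = {"quoteSummary": {"result": [{"defaultKeyStatistics": {
    "enterpriseValue": {"raw": 13677747200, "fmt": "13.68B"},
    "sharesShort": {},
}}], "error": None}}

stats_path = ("quoteSummary", "result", 0, "defaultKeyStatistics")
print(dig(data, *stats_path, "enterpriseValue", "fmt"))  # 13.68B
print(dig(data, *stats_path, "sharesShort", "longFmt"))  # N/A
```

Catching `TypeError` as well as `KeyError` and `IndexError` covers the case where an intermediate value is `None`, as several fields in the response are.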

The “lazy” solutions simply change the request from the front-end URL to a kind of unofficial API endpoint that returns JSON data. This approach is simpler and exposes more data. But what about its speed? The code is as follows:

    import timeit
    import requests
    from bs4 import BeautifulSoup
    import json
    import re

    repeat = 5
    number = 5

    def web_scrape_1():
        r = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
        soup = BeautifulSoup(r.text, "lxml")
        price = soup.find('div', {'class': 'My(6px) Pos(r) smartphone_Mt(6px)'}).find('span').text
        return f'the current price: {price}'

    def lazy_1():
        r = requests.get('https://query2.finance.yahoo.com/v10/finance/quoteSummary/FB?modules=price')
        data = r.json()
        return f"the current price: {data['quoteSummary']['result'][0]['price']['regularMarketPrice']['raw']}"

    def web_scrape_2():
        p = re.compile(r'root.App.main = (.*);')
        ticker = 'AGL.AX'
        results = {}
        with requests.Session() as s:
            r = s.get('https://finance.yahoo.com/quote/{}/key-statistics?p={}'.format(ticker, ticker))
            data = json.loads(p.findall(r.text)[0])
            key_stats = data['context']['dispatcher']['stores']['QuoteSummaryStore']
            res = {
                'Enterprise Value': key_stats['defaultKeyStatistics']['enterpriseValue']['fmt'],
                'Shares Short': key_stats['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A')
            }
            results[ticker] = res
        return results

    def lazy_2():
        r = requests.get('https://query2.finance.yahoo.com/v10/finance/quoteSummary/AGL.AX?modules=defaultKeyStatistics')
        data = r.json()
        return {
            'AGL.AX': {
                'Enterprise Value': data['quoteSummary']['result'][0]['defaultKeyStatistics']['enterpriseValue']['fmt'],
                'Shares Short': data['quoteSummary']['result'][0]['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A')
            }
        }

    # Each timeit.repeat() call returns `repeat` totals, each covering `number` calls.
    web_scraping_1_times = timeit.repeat(
        'web_scrape_1()',
        setup='import requests; from bs4 import BeautifulSoup',
        globals=globals(),
        repeat=repeat,
        number=number)
    print(f'web scraping #1 min time is {min(web_scraping_1_times) / number}')

    lazy_1_times = timeit.repeat(
        'lazy_1()',
        setup='import requests',
        globals=globals(),
        repeat=repeat,
        number=number
    )
    print(f'lazy #1 min time is {min(lazy_1_times) / number}')

    web_scraping_2_times = timeit.repeat(
        'web_scrape_2()',
        setup='import requests, re, json',
        globals=globals(),
        repeat=repeat,
        number=number)
    print(f'web scraping #2 min time is {min(web_scraping_2_times) / number}')

    lazy_2_times = timeit.repeat(
        'lazy_2()',
        setup='import requests',
        globals=globals(),
        repeat=repeat,
        number=number
    )
    print(f'lazy #2 min time is {min(lazy_2_times) / number}')

    web scraping #1 min time is 0.5678426799999997
    lazy #1 min time is 0.11238783999999953
    web scraping #2 min time is 0.3731000199999997
    lazy #2 min time is 0.0864451399999993

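The “lazy” alternatives are roughly 4 to 5 times faster than their web scraping counterparts (about 0.57 s versus 0.11 s per call for the first pair, and 0.37 s versus 0.09 s for the second).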
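The figures above are computed as `min(times) / number`. A minimal offline sketch of that calculation follows; `build_squares` is a hypothetical stand-in for the network call, and it is passed to `timeit.repeat` as a callable instead of a statement string.

```python
import timeit


def build_squares():
    # Hypothetical stand-in workload; the article times HTTP requests instead.
    return [i * i for i in range(1000)]


repeat = 5  # how many timing samples to collect
number = 5  # how many calls each sample covers

# Returns a list of `repeat` floats; each is the total time for `number` calls.
times = timeit.repeat(build_squares, repeat=repeat, number=number)

# Take the fastest sample, then divide by `number` to get seconds per call.
per_call = min(times) / number
print(f"min time per call: {per_call:.6f} s")
```

The minimum is used because slower samples mostly reflect interference from other processes (or, for real requests, network jitter) and not the cost of the code itself.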

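The “lazy” process

Think about the two problems above. In the original solutions, we try to retrieve the data after it has been loaded into the page. The “lazy” solutions go straight to the data source and ignore the front-end page entirely. This is an important distinction, and a good approach whenever you try to extract data from a website.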
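The two endpoint URLs used earlier differ only in the ticker and the `modules` value. The sketch below builds them from those two parts. `quote_summary_url` is a hypothetical helper, and the URL pattern is inferred only from the two URLs in this article; the endpoint is unofficial and may change or stop responding.

```python
from urllib.parse import quote, urlencode

# Base path taken from the two URLs used in this article.
BASE = "https://query2.finance.yahoo.com/v10/finance/quoteSummary"


def quote_summary_url(ticker, module):
    """Build the JSON endpoint URL for one ticker and one module."""
    query = urlencode({"modules": module})
    return f"{BASE}/{quote(ticker, safe='')}?{query}"


print(quote_summary_url("FB", "price"))
print(quote_summary_url("AGL.AX", "defaultKeyStatistics"))
```

Keeping URL construction in one function means a change to the endpoint only has to be fixed in one place.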

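Step 1: Inspect XHR requests

An XHR (XMLHttpRequest) object is an API available to web browser scripting languages such as JavaScript. It sends HTTP or HTTPS requests to a web server and loads the server's response data back into the script. Essentially, XHR lets the client retrieve data from a URL without refreshing the whole page.

I will use Chrome for the following demonstration, but other browsers have similar features.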

· Open Chrome's developer console. To do so in Google Chrome, open the Chrome menu in the upper-right corner of the browser window and select More Tools > Developer Tools. You can also use the shortcut Option + ⌘ + J (on macOS) or Shift + CTRL + J (on Windows/Linux).

· Select the “Network” tab.