
Example Code for Scraping All Ctrip Flight Tickets with Python

Published: 2020-10-07 10:19:22  Source: 腳本之家  Views: 587  Author: KK-Huang  Category: Development

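Open the Ctrip website and search for a flight, for example Guangzhou to Chengdu.

The URL is then: http://flights.ctrip.com/booking/CAN-CTU-day-1.html?DDate1=2018-06-15

Here CAN stands for Guangzhou and CTU for Chengdu, and the date "2018-06-15" is self-explanatory. An ordinary crawler only needs to substitute these three values to walk through every route and date. On closer inspection, though, the page loads all of its data from a separate link that returns JSON:

http://flights.ctrip.com/domesticsearch/search/SearchFirstRouteFlights?DCity1=CAN&ACity1=CTU&SearchType=S&DDate1=2018-06-15 (remaining parameters omitted……)

The same city codes and date appear in this link. It returns a JSON document holding the data for the current page, and its "fis" entry contains the flight information.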
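A minimal sketch of how such a link can be assembled for any route and date. The helper name `build_search_url` is an illustrative assumption; only the four parameters visible in the link above are included, and the omitted trailing parameters are left out rather than guessed. The endpoint dates from 2018 and may no longer respond.

```python
from urllib.parse import urlencode

# Endpoint copied from the JSON link shown above.
BASE_URL = "http://flights.ctrip.com/domesticsearch/search/SearchFirstRouteFlights"


def build_search_url(from_code, to_code, date):
    """Return the JSON search URL for one route on one day (hypothetical helper)."""
    params = [
        ("DCity1", from_code),   # departure city code, e.g. CAN
        ("ACity1", to_code),     # arrival city code, e.g. CTU
        ("SearchType", "S"),
        ("DDate1", date),        # YYYY-MM-DD
    ]
    # urlencode keeps the order of the list and escapes unsafe characters
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("CAN", "CTU", "2018-06-15"))
```

Building the query with `urlencode` instead of `%` formatting keeps the parameters readable and escapes any value that is not URL-safe.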


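Each request only needs a different pair of city codes and a date. I compiled the city codes by hand. Note that several codes (DAX, SWA, LUM, JHG, YTY) appear twice under different city names; in a Python dict literal the later entry silently replaces the earlier one.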

city={"YIE":"阿爾山","AKU":"阿克蘇","RHT":"阿拉善右旗","AXF":"阿拉善左旗","AAT":"阿勒泰","NGQ":"阿里","MFM":"澳門" 
,"AQG":"安慶","AVA":"安順","AOG":"鞍山","RLK":"巴彥淖爾","AEB":"百色","BAV":"包頭","BSD":"保山","BHY":"北海","BJS":"北京" 
,"DBC":"白城","NBS":"白山","BFJ":"畢節","BPL":"博樂","CKG":"重慶","BPX":"昌都","CGD":"常德","CZX":"常州" 
,"CHG":"朝陽","CTU":"成都","JUH":"池州","CIF":"赤峰","SWA":"潮州","CGQ":"長春","CSX":"長沙","CIH":"長治","CDE":"承德" 
,"CWJ":"滄源","DAX":"達州","DLU":"大理","DLC":"大連","DQA":"大慶","DAT":"大同","DDG":"丹東","DCY":"稻城","DOY":"東營" 
,"DNH":"敦煌","DAX":"達縣","LUM":"德宏","EJN":"額濟納旗","DSN":"鄂爾多斯","ENH":"恩施","ERL":"二連浩特","FUO":"佛山" 
,"FOC":"福州","FYJ":"撫遠","FUG":"阜陽","KOW":"贛州","GOQ":"格爾木","GYU":"固原","GYS":"廣元","CAN":"廣州","KWE":"貴陽" 
,"KWL":"桂林","HRB":"哈爾濱","HMI":"哈密","HAK":"海口","HLD":"海拉爾","HDG":"邯鄲","HZG":"漢中","HGH":"杭州","HFE":"合肥" 
,"HTN":"和田","HEK":"黑河","HET":"呼和浩特","HIA":"淮安","HJJ":"懷化","TXN":"黃山","HUZ":"惠州","JXA":"雞西","TNA":"濟南" 
,"JNG":"濟寧","JGD":"加格達奇","JMU":"佳木斯","JGN":"嘉峪關","SWA":"揭陽","JIC":"金昌","KNH":"金門","JNZ":"錦州" 
,"CYI":"嘉義","JHG":"景洪","JSJ":"建三江","JJN":"晉江","JGS":"井岡山","JDZ":"景德鎮","JIU":"九江","JZH":"九寨溝","KHG":"喀什" 
,"KJH":"凱里","KGT":"康定","KRY":"克拉瑪依","KCA":"庫車","KRL":"庫爾勒","KMG":"昆明","LXA":"拉薩","LHW":"蘭州","HZH":"黎平" 
,"LJG":"麗江","LLB":"荔波","LYG":"連云港","LPF":"六盤水","LFQ":"臨汾","LZY":"林芝","LNJ":"臨滄","LYI":"臨沂","LZH":"柳州" 
,"LZO":"瀘州","LYA":"洛陽","LLV":"呂梁","JMJ":"瀾滄","LCX":"龍巖","NZH":"滿洲里","LUM":"芒市","MXZ":"梅州","MIG":"綿陽" 
,"OHE":"漠河","MDG":"牡丹江","MFK":"馬祖" ,"KHN":"南昌","NAO":"南充","NKG":"南京","NNG":"南寧","NTG":"南通","NNY":"南陽" 
,"NGB":"寧波","NLH":"寧蒗","PZI":"攀枝花","SYM":"普洱","NDG":"齊齊哈爾","JIQ":"黔江","IQM":"且末","BPE":"秦皇島","TAO":"青島" 
,"IQN":"慶陽","JUZ":"衢州","RKZ":"日喀則","RIZ":"日照","SYX":"三亞","XMN":"廈門","SHA":"上海","SZX":"深圳","HPG":"神農架" 
,"SHE":"沈陽","SJW":"石家莊","TCG":"塔城","HYN":"臺州","TYN":"太原","YTY":"泰州","TVS":"唐山","TCZ":"騰沖","TSN":"天津" 
,"THQ":"天水","TGO":"通遼","TEN":"銅仁","TLQ":"吐魯番","WXN":"萬州","WEH":"威海","WEF":"濰坊","WNZ":"溫州","WNH":"文山" 
,"WUA":"烏海","HLH":"烏蘭浩特","URC":"烏魯木齊","WUX":"無錫","WUZ":"梧州","WUH":"武漢","WUS":"武夷山","SIA":"西安","XIC":"西昌" 
,"XNN":"西寧","JHG":"西雙版納","XIL":"錫林浩特","DIG":"香格里拉(迪慶)","XFN":"襄陽","ACX":"興義","XUZ":"徐州","HKG":"香港" 
,"YNT":"煙臺","ENY":"延安","YNJ":"延吉","YNZ":"鹽城","YTY":"揚州","LDS":"伊春","YIN":"伊寧","YBP":"宜賓","YIH":"宜昌" 
,"YIC":"宜春","YIW":"義烏","INC":"銀川","LLF":"永州","UYN":"榆林","YUS":"玉樹","YCU":"運城","ZHA":"湛江","DYG":"張家界" 
,"ZQZ":"張家口","YZY":"張掖","ZAT":"昭通","CGO":"鄭州","ZHY":"中衛","HSN":"舟山","ZUH":"珠海","WMT":"遵義(茅臺)","ZYI":"遵義(新舟)"} 
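Because a dict literal silently keeps only the last value for a repeated key, it can be worth checking a hand-compiled list for duplicates before using it. A small sketch, using a handful of entries copied from the list above as `(code, name)` pairs; the helper name `find_duplicate_codes` is an assumption.

```python
from collections import Counter

# A few (code, city) pairs copied from the list above, kept as tuples so that
# repeated codes stay visible (a dict would already have merged them).
pairs = [
    ("DAX", "達州"), ("DAX", "達縣"),
    ("SWA", "潮州"), ("SWA", "揭陽"),
    ("CAN", "廣州"), ("CTU", "成都"),
]


def find_duplicate_codes(pairs):
    """Return the codes that occur more than once, in sorted order."""
    counts = Counter(code for code, _ in pairs)
    return sorted(code for code, n in counts.items() if n > 1)


print(find_duplicate_codes(pairs))   # ['DAX', 'SWA']
print(dict(pairs)["DAX"])            # the later entry wins: 達縣
```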

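To avoid HTTP 429 responses caused by frequent requests, the script keeps a list of User-Agent strings and picks one at random for each request. Requesting too often can still trigger a CAPTCHA, so it also pauses for 10 seconds after finishing each departure city.

First create a table to store the data; SQL Server is used here: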
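The throttling idea can be sketched as follows. The names `random_headers` and `polite_get`, the shortened User-Agent strings, and the retry policy (wait, then retry on HTTP 429) are illustrative assumptions, not part of the original script; the `opener` and `sleep` arguments exist only so the function can be exercised without touching the network.

```python
import random
import time
import urllib.error
import urllib.request

# Placeholder User-Agent strings; a real list would hold full browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) Chrome/66.0",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Firefox/16.0",
]


def random_headers():
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_get(url, retries=3, wait=10,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url; on HTTP 429 pause `wait` seconds and retry up to `retries` times."""
    for attempt in range(retries + 1):
        req = urllib.request.Request(url, headers=random_headers())
        try:
            return opener(req, timeout=30).read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 429, or a 429 on the last attempt
            if err.code != 429 or attempt == retries:
                raise
            sleep(wait)
```

A fixed pause is the simplest policy; a CAPTCHA page is a normal 200 response and would have to be detected separately from the response body.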

CREATE TABLE KKFlight( 
  ID int IDENTITY(1,1),  -- auto-increment ID 
  ItinerarDate  date,      -- itinerary date 
  Airline     varchar(100),  -- airline name 
  AirlineCode   varchar(100),  -- airline code 
  FlightNumber  varchar(20),  -- flight number 
  FlightNumberS  varchar(20),  -- codeshare flight number (actual operating flight) 
  Aircraft    varchar(50),  -- aircraft model 
  AircraftSize  char(2),    -- aircraft size (L large; M medium; S small) 
  AirportTax   decimal(10,2), -- airport construction fee 
  FuelOilTax   decimal(10,2), -- fuel surcharge 
  FromCity    varchar(50),  -- departure city 
  FromCityCode  varchar(10),  -- departure city code 
  FromAirport   varchar(50),  -- departure airport 
  FromTerminal  varchar(20),  -- departure terminal 
  FromDateTime  datetime,    -- departure time 
  ToCity     varchar(50),  -- arrival city 
  ToCityCode   varchar(10),  -- arrival city code 
  ToAirport    varchar(50),  -- arrival airport 
  ToTerminal   varchar(20),  -- arrival terminal 
  ToDateTime   datetime,    -- arrival time 
  DurationHour  int,      -- duration (hours part) 
  DurationMinute int,      -- duration (minutes part) 
  Duration    varchar(20),  -- duration (string) 
  Currency    varchar(10),  -- currency 
  TicketPrices  decimal(10,2), -- ticket price 
  Discount    decimal(4,2),  -- discount applied 
  PunctualityRate decimal(4,2),  -- on-time rate 
  AircraftCabin  char(1),    -- cabin class (F first; C business; Y economy) 
  InsertDate   datetime default(getdate()) -- row insertion time 
) 
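To experiment with the table layout without a SQL Server instance, a cut-down version can be tried in SQLite, which ships with Python. This is a stand-in rather than the author's setup: SQLite uses `INTEGER PRIMARY KEY AUTOINCREMENT` and `CURRENT_TIMESTAMP` where SQL Server uses `IDENTITY(1,1)` and `getdate()`, only a few columns are kept, and the sample row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("""
    CREATE TABLE KKFlight(
        ID            INTEGER PRIMARY KEY AUTOINCREMENT,
        ItinerarDate  TEXT,
        FlightNumber  TEXT,
        FromCityCode  TEXT,
        ToCityCode    TEXT,
        TicketPrices  REAL,
        InsertDate    TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Made-up sample row; "?" placeholders avoid assembling SQL text by hand.
conn.execute(
    "INSERT INTO KKFlight(ItinerarDate, FlightNumber, FromCityCode, ToCityCode, TicketPrices)"
    " VALUES (?, ?, ?, ?, ?)",
    ("2018-06-16", "XX0000", "CAN", "CTU", 500.0),
)
conn.commit()

# ID and InsertDate are filled in by the database, as in the SQL Server table.
row = conn.execute("SELECT ID, FlightNumber, InsertDate FROM KKFlight").fetchone()
print(row)
```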

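Because the crawl covers every city, cities are not restricted; only the date range is, that is, the first and last day to crawl. The full script follows: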

#-*- coding: utf-8 -*- 
# python 3.5.0 
import json 
import time 
import random 
import datetime 
import sqlalchemy 
import urllib.request 
import pandas as pd 
from operator import itemgetter 
from dateutil.parser import parse 
class FLIGHT(object): 
  def __init__(self): 
    self.Airline = {} # airline code -> airline name 
    self.engine = sqlalchemy.create_engine("mssql+pymssql://kk:kk@HZC/Myspider") 
    self.url = '' 
    self.headers = {} 
    self.city={"AAT":"阿勒泰","ACX":"興義","AEB":"百色","AKU":"阿克蘇","AOG":"鞍山","AQG":"安慶","AVA":"安順","AXF":"阿拉善左旗","BAV":"包頭","BFJ":"畢節","BHY":"北海" 
    ,"BJS":"北京","BPE":"秦皇島","BPL":"博樂","BPX":"昌都","BSD":"保山","CAN":"廣州","CDE":"承德","CGD":"常德","CGO":"鄭州","CGQ":"長春","CHG":"朝陽","CIF":"赤峰" 
    ,"CIH":"長治","CKG":"重慶","CSX":"長沙","CTU":"成都","CWJ":"滄源","CYI":"嘉義","CZX":"常州","DAT":"大同","DAX":"達縣","DBC":"白城","DCY":"稻城","DDG":"丹東" 
    ,"DIG":"香格里拉(迪慶)","DLC":"大連","DLU":"大理","DNH":"敦煌","DOY":"東營","DQA":"大慶","DSN":"鄂爾多斯","DYG":"張家界","EJN":"額濟納旗","ENH":"恩施" 
    ,"ENY":"延安","ERL":"二連浩特","FOC":"福州","FUG":"阜陽","FUO":"佛山","FYJ":"撫遠","GOQ":"格爾木","GYS":"廣元","GYU":"固原","HAK":"海口","HDG":"邯鄲" 
    ,"HEK":"黑河","HET":"呼和浩特","HFE":"合肥","HGH":"杭州","HIA":"淮安","HJJ":"懷化","HKG":"香港","HLD":"海拉爾","HLH":"烏蘭浩特","HMI":"哈密","HPG":"神農架" 
    ,"HRB":"哈爾濱","HSN":"舟山","HTN":"和田","HUZ":"惠州","HYN":"臺州","HZG":"漢中","HZH":"黎平","INC":"銀川","IQM":"且末","IQN":"慶陽","JDZ":"景德鎮" 
    ,"JGD":"加格達奇","JGN":"嘉峪關","JGS":"井岡山","JHG":"西雙版納","JIC":"金昌","JIQ":"黔江","JIU":"九江","JJN":"晉江","JMJ":"瀾滄","JMU":"佳木斯","JNG":"濟寧" 
    ,"JNZ":"錦州","JSJ":"建三江","JUH":"池州","JUZ":"衢州","JXA":"雞西","JZH":"九寨溝","KCA":"庫車","KGT":"康定","KHG":"喀什","KHN":"南昌","KJH":"凱里","KMG":"昆明" 
    ,"KNH":"金門","KOW":"贛州","KRL":"庫爾勒","KRY":"克拉瑪依","KWE":"貴陽","KWL":"桂林","LCX":"龍巖","LDS":"伊春","LFQ":"臨汾","LHW":"蘭州","LJG":"麗江","LLB":"荔波" 
    ,"LLF":"永州","LLV":"呂梁","LNJ":"臨滄","LPF":"六盤水","LUM":"芒市","LXA":"拉薩","LYA":"洛陽","LYG":"連云港","LYI":"臨沂","LZH":"柳州","LZO":"瀘州" 
    ,"LZY":"林芝","MDG":"牡丹江","MFK":"馬祖","MFM":"澳門","MIG":"綿陽","MXZ":"梅州","NAO":"南充","NBS":"白山","NDG":"齊齊哈爾","NGB":"寧波","NGQ":"阿里" 
    ,"NKG":"南京","NLH":"寧蒗","NNG":"南寧","NNY":"南陽","NTG":"南通","NZH":"滿洲里","OHE":"漠河","PZI":"攀枝花","RHT":"阿拉善右旗","RIZ":"日照","RKZ":"日喀則" 
    ,"RLK":"巴彥淖爾","SHA":"上海","SHE":"沈陽","SIA":"西安","SJW":"石家莊","SWA":"揭陽","SYM":"普洱","SYX":"三亞","SZX":"深圳","TAO":"青島","TCG":"塔城","TCZ":"騰沖" 
    ,"TEN":"銅仁","TGO":"通遼","THQ":"天水","TLQ":"吐魯番","TNA":"濟南","TSN":"天津","TVS":"唐山","TXN":"黃山","TYN":"太原","URC":"烏魯木齊","UYN":"榆林","WEF":"濰坊" 
    ,"WEH":"威海","WMT":"遵義(茅臺)","WNH":"文山","WNZ":"溫州","WUA":"烏海","WUH":"武漢","WUS":"武夷山","WUX":"無錫","WUZ":"梧州","WXN":"萬州","XFN":"襄陽","XIC":"西昌" 
    ,"XIL":"錫林浩特","XMN":"廈門","XNN":"西寧","XUZ":"徐州","YBP":"宜賓","YCU":"運城","YIC":"宜春","YIE":"阿爾山","YIH":"宜昌","YIN":"伊寧","YIW":"義烏","YNJ":"延吉" 
    ,"YNT":"煙臺","YNZ":"鹽城","YTY":"揚州","YUS":"玉樹","YZY":"張掖","ZAT":"昭通","ZHA":"湛江","ZHY":"中衛","ZQZ":"張家口","ZUH":"珠海","ZYI":"遵義(新舟)"} 
    # Left out of the table above: {"KJI":"布爾津"} 
    # Pool of User-Agent strings; one is chosen at random per request 
    self.UserAgent = [ 
      "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36", 
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7", 
      "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0", 
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10", 
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)", 
      "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)", 
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36", 
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17", 
      "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre", 
      "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0", 
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 
    ] 
  # Walk every date between the two dates (inclusive) and every ordered city pair 
  def set_url_headers(self,startdate,enddate): 
    startDate=datetime.datetime.strptime(startdate,'%Y-%m-%d') 
    endDate=datetime.datetime.strptime(enddate,'%Y-%m-%d') 
    while startDate<=endDate: 
      today = startDate.strftime('%Y-%m-%d') 
      for fromcode, fromcity in sorted(self.city.items(), key=itemgetter(0)): 
        for tocode, tocity in sorted(self.city.items(), key=itemgetter(0)): 
          if fromcode != tocode: 
            self.url = 'http://flights.ctrip.com/domesticsearch/search/SearchFirstRouteFlights?DCity1=%s&ACity1=%s&SearchType=S&DDate1=%s&IsNearAirportRecommond=0&LogToken=027e478a47494975ad74857b18283e12&rk=4.381066884522498182534&CK=9FC7881E8F373585C0E5F89152BC143D&r=0.24149333708195565406316' % (fromcode,tocode,today) 
            self.headers = { 
              "Host": "flights.ctrip.com", 
              "User-Agent": random.choice(self.UserAgent), 
              "Referer": "https://flights.ctrip.com/booking/%s-%s-day-1.html?DDate1=%s" % (fromcode,tocode,today), 
              "Connection": "keep-alive", 
            } 
            print("%s : %s(%s) ==> %s(%s) " % (today,fromcity,fromcode,tocity,tocode)) 
            self.get_parse_json_data(today) 
        time.sleep(10) 
      startDate+=datetime.timedelta(days=1) 
  # Fetch the JSON data for one page 
  def get_one_page_json_data(self): 
    req = urllib.request.Request(self.url,headers=self.headers) 
    body = urllib.request.urlopen(req,timeout=30).read().decode('gbk') 
    jsonData = json.loads(body.strip("'<>() ").replace('\'', '\"')) 
    return jsonData 
  # Fetch one page, parse it, and save the rows to the database 
  def get_parse_json_data(self,today): 
    jsonData = self.get_one_page_json_data() 
    df = pd.DataFrame(columns=['ItinerarDate','Airline','AirlineCode','FlightNumber','FlightNumberS','Aircraft','AircraftSize'  
    ,'AirportTax','FuelOilTax','FromCity','FromCityCode','FromAirport','FromTerminal','FromDateTime','ToCity','ToCityCode','ToAirport' 
    ,'ToTerminal','ToDateTime','DurationHour','DurationMinute','Duration','Currency','TicketPrices','Discount','PunctualityRate','AircraftCabin'])  
    if bool(jsonData["fis"]): 
      # Collect airline codes and names from this response 
      company = jsonData["als"] 
      for k in company.keys(): 
        if k not in self.Airline: 
          self.Airline[k]=company[k] 
      index = 0 
      # NOTE: dt/at are used as departure/arrival times, while the a*-prefixed city 
      # and airport keys fill the From* columns; check this against a live response. 
      for data in jsonData["fis"]: 
        alc = data["alc"].strip() # airline code 
        elapsed = (parse(data["at"])-parse(data["dt"])).seconds # flight time in seconds (whole days ignored, as in the original) 
        df.loc[index,'ItinerarDate'] = today # itinerary date 
        df.loc[index,'Airline'] = self.Airline.get(alc) # airline name, None if the code is unknown 
        df.loc[index,'AirlineCode'] = alc # airline code 
        df.loc[index,'FlightNumber'] = data["fn"] # flight number 
        df.loc[index,'FlightNumberS'] = data["sdft"] # codeshare flight number (actual operating flight) 
        df.loc[index,'Aircraft'] = data["cf"]["c"] # aircraft model 
        df.loc[index,'AircraftSize'] = data["cf"]["s"] # aircraft size (L large; M medium; S small) 
        df.loc[index,'AirportTax'] = data["tax"] # airport construction fee 
        df.loc[index,'FuelOilTax'] = data["of"] # fuel surcharge 
        df.loc[index,'FromCity'] = data["acn"] # departure city 
        df.loc[index,'FromCityCode'] = data["acc"] # departure city code 
        df.loc[index,'FromAirport'] = data["apbn"] # departure airport 
        df.loc[index,'FromTerminal'] = data["asmsn"] # departure terminal 
        df.loc[index,'FromDateTime'] = data["dt"] # departure time 
        df.loc[index,'ToCity'] = data["dcn"] # arrival city 
        df.loc[index,'ToCityCode'] = data["dcc"] # arrival city code 
        df.loc[index,'ToAirport'] = data["dpbn"] # arrival airport 
        df.loc[index,'ToTerminal'] = data["dsmsn"] # arrival terminal 
        df.loc[index,'ToDateTime'] = data["at"] # arrival time 
        df.loc[index,'DurationHour'] = elapsed // 3600 # duration (hours part) 
        df.loc[index,'DurationMinute'] = elapsed % 3600 // 60 # duration (minutes part) 
        df.loc[index,'Duration'] = '%dh%dm' % (elapsed // 3600, elapsed % 3600 // 60) # duration (string) 
        df.loc[index,'Currency'] = None # currency (not collected) 
        df.loc[index,'TicketPrices'] = data["lp"] # ticket price 
        df.loc[index,'Discount'] = None # discount applied (not collected) 
        df.loc[index,'PunctualityRate'] = None # on-time rate (not collected) 
        df.loc[index,'AircraftCabin'] = None # cabin class (not collected) 
        index = index + 1 
      df.to_sql("KKFlight", self.engine, index=False, if_exists='append')  
      print("done!~") 
if __name__ == "__main__": 
  fly = FLIGHT() 
  fly.set_url_headers('2018-06-16','2018-06-16') 
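
The nested loops in `set_url_headers` visit every date in the range and every ordered pair of distinct cities. The same traversal can be written as a generator, which makes it easy to count or test the workload before any request is sent. A sketch under that reading of the script; `iter_tasks` and `split_duration` are hypothetical helpers, not part of the code above.

```python
import datetime
from itertools import permutations


def iter_tasks(start, end, codes):
    """Yield (date, from_code, to_code) for each day in [start, end] and each ordered pair."""
    day = datetime.datetime.strptime(start, "%Y-%m-%d").date()
    last = datetime.datetime.strptime(end, "%Y-%m-%d").date()
    while day <= last:
        # permutations(..., 2) yields ordered pairs and never pairs a code with itself
        for from_code, to_code in permutations(sorted(codes), 2):
            yield day.isoformat(), from_code, to_code
        day += datetime.timedelta(days=1)


def split_duration(departure, arrival):
    """Return (hours, minutes) between two datetime objects."""
    minutes = int((arrival - departure).total_seconds()) // 60
    return divmod(minutes, 60)


tasks = list(iter_tasks("2018-06-16", "2018-06-17", ["CTU", "CAN", "BJS"]))
print(len(tasks))   # 2 days * 3 * 2 ordered pairs = 12
print(tasks[0])     # ('2018-06-16', 'BJS', 'CAN')
```

With roughly 200 codes in the city table, a single day already means about 200 × 199 ≈ 40,000 requests, so it may be worth narrowing the city list before a full run.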

