Chinese New Year is almost upon us, so what is everyone busy with? At year's end the whole office is scrambling for train tickets and stocking up on holiday goods, and with the festive mood in the air everybody is itching to go home; who can still focus on work? Itching to go home = low output = ten bugs per line of code = boredom. So I remembered the Python I had studied for a while. I'm fond of movies myself, and manually clicking into each movie's detail page and then downloading them one by one is a pain, so why not write a tool in Python that downloads movies automatically? See, just thinking about it cured the boredom. Back before all these streaming memberships, whenever I wanted a movie I went hunting for it on a certain "Movie Heaven" site, which had most of what I wanted. That settles it: let's crawl it!
Back when I played with Python I crawled quite a few sites, all of it at the office (Python has nothing to do with the company's business; it was purely tinkering for fun). My ops colleague came over every day saying: what are you crawling now? Go read the news, somebody just got arrested for scraping again! If anything happens, it's on you! Good grief, that scared me out of it for a while. This post crawls resources from a certain "Heaven" site (the code below makes clear which one). Will I get in trouble? As a pure technical exercise for personal practice, with no commercial use, it should be fine, right? My hands tremble slightly as I write this...
Alright then, in for a penny, in for a pound. First, the finished result:
As shown above, the downloader has a GUI (impressive, right?). Enter a root URL and a minimum movie rating, and it crawls movies automatically. Building this tool calls for the following knowledge:
That's about it. The technical details aren't many either: requests + BeautifulSoup, the re module for regular expressions, Python data types, Python threads, and persistence libraries such as dbm and pickle. That covers the scope of this tool. Of course Python is object-oriented, and programming sense carries across languages; that isn't built in a day and can't really be put into words. Wherever you spot a gap in your own knowledge, go read up on it yourself, because I'm going to paste the code straight away.
A couple more words on learning Python. When I studied Python scraping I read the blog of @工匠若水 (https://blog.csdn.net/yanbober). His Python articles are genuinely well written and very helpful for anyone with programming experience but no Python background; you can get a small project going quickly. Alright, time for code:
```python
import url_manager
import html_parser
import html_download
import persist_util
from tkinter import *
from threading import Thread
import os


class SpiderMain(object):
    def __init__(self):
        self.mUrlManager = url_manager.UrlManager()
        self.mHtmlParser = html_parser.HtmlParser()
        self.mHtmlDownload = html_download.HtmlDownload()
        self.mPersist = persist_util.PersistUtil()

    # Load the history of downloaded links
    def load_history(self):
        history_download_links = self.mPersist.load_history_links()
        if history_download_links is not None and len(history_download_links) > 0:
            for download_link in history_download_links:
                self.mUrlManager.add_download_url(download_link)
                d_log('loaded history link: ' + download_link)

    # Save the history of downloaded links
    def save_history(self):
        history_download_links = self.mUrlManager.get_download_url()
        if history_download_links is not None and len(history_download_links) > 0:
            self.mPersist.save_history_links(history_download_links)

    def craw_movie_links(self, root_url, score=8):
        count = 0
        self.mUrlManager.add_url(root_url)
        while self.mUrlManager.has_continue():
            try:
                count = count + 1
                url = self.mUrlManager.get_url()
                d_log('craw %d : %s' % (count, url))
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
                    'Referer': url
                }
                content = self.mHtmlDownload.down_html(url, retry_count=3, headers=headers)
                if content is not None:
                    doc = content.decode('gb2312', 'ignore')
                    movie_urls, next_link = self.mHtmlParser.parser_movie_link(doc)
                    if movie_urls is not None and len(movie_urls) > 0:
                        for movie_url in movie_urls:
                            d_log('movie info url: ' + movie_url)
                            content = self.mHtmlDownload.down_html(movie_url, retry_count=3, headers=headers)
                            if content is not None:
                                doc = content.decode('gb2312', 'ignore')
                                movie_name, movie_score, movie_xunlei_links = self.mHtmlParser.parser_movie_info(doc, score=score)
                                if movie_xunlei_links is not None and len(movie_xunlei_links) > 0:
                                    for xunlei_link in movie_xunlei_links:
                                        # Check whether this movie was already downloaded
                                        is_download = self.mUrlManager.has_download(xunlei_link)
                                        if not is_download:
                                            # Hand new movies over to Thunder's download list
                                            d_log('start downloading ' + movie_name + ', link: ' + xunlei_link)
                                            self.mUrlManager.add_download_url(xunlei_link)
                                            os.system(r'"D:\迅雷\Thunder\Program\Thunder.exe" {url}'.format(url=xunlei_link))
                                            # Persist after every movie, so even an abnormal exit
                                            # never causes the same movie to be downloaded twice
                                            self.save_history()
                    if next_link is not None:
                        d_log('next link: ' + next_link)
                        self.mUrlManager.add_url(next_link)
            except Exception as e:
                d_log('error: ' + str(e))


def runner(rootLink=None, scoreLimit=None):
    if rootLink is None:
        return
    spider = SpiderMain()
    spider.load_history()
    if scoreLimit is None:
        spider.craw_movie_links(rootLink)
    else:
        spider.craw_movie_links(rootLink, score=float(scoreLimit))
    spider.save_history()


# rootLink = 'https://www.dytt8.net/html/gndy/dyzz/index.html'
# rootLink = 'https://www.dytt8.net/html/gndy/dyzz/list_23_207.html'
def start(rootLink, scoreLimit):
    loop_thread = Thread(target=runner, args=(rootLink, scoreLimit,), name='LOOP THREAD')
    # loop_thread.setDaemon(True)
    loop_thread.start()
    # loop_thread.join()  # don't block the main thread, or the GUI will freeze
    btn_start.configure(state='disabled')


# Append to the GUI log area, with auto-scroll
def d_log(log):
    s = log + '\n'
    txt.insert(END, s)
    txt.see(END)


if __name__ == '__main__':
    rootGUI = Tk()
    rootGUI.title('XX movie auto-downloader')

    # Window background colour
    black_background = '#000000'
    rootGUI.configure(background=black_background)

    # Screen width and height, used to centre the window
    screen_w, screen_h = rootGUI.maxsize()
    window_x = (screen_w - 640) / 2
    window_y = (screen_h - 480) / 2
    window_xy = '640x480+%d+%d' % (window_x, window_y)
    rootGUI.geometry(window_xy)

    lable_link = Label(rootGUI, text='Root URL: ',
                       bg='black', fg='red',
                       font=('宋體', 12), relief=FLAT)
    lable_link.place(x=20, y=20)
    lable_link_width = lable_link.winfo_reqwidth()
    lable_link_height = lable_link.winfo_reqheight()

    input_link = Entry(rootGUI)
    input_link.place(x=20 + lable_link_width, y=20, relwidth=0.5)

    lable_score = Label(rootGUI, text='Minimum rating: ',
                        bg='black', fg='red',
                        font=('宋體', 12), relief=FLAT)
    lable_score.place(x=20, y=20 + lable_link_height + 10)

    input_score = Entry(rootGUI)
    input_score.place(x=20 + lable_link_width, y=20 + lable_link_height + 10, relwidth=0.3)

    btn_start = Button(rootGUI, text='Start',
                       command=lambda: start(input_link.get(), input_score.get()))
    btn_start.place(relx=0.4, rely=0.2, relwidth=0.1, relheight=0.1)

    txt = Text(rootGUI)
    txt.place(rely=0.4, relwidth=1, relheight=0.5)

    rootGUI.mainloop()
```
spider_main.py is the entry point: a crude tkinter UI where you enter the root URL and the minimum movie rating. The "root URL" is the entry page of one of the site's movie categories; the home page lists categories such as Latest Movies, Japanese & Korean, European & American, the 2019 featured section, and so on. I use the 2019 featured section as the example (https://www.dytt8.net/html/gndy/dyzz/index.html), though any other category URL works too. The rating is just a filter: learn to say no to bad movies, they waste your time and your facial expressions. You can require a score of at least 8, or at least 9, and so on. It must be a number, mind you; type garbage in and the program crashes, a detail I was too lazy to handle.
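Since the rating field isn't validated, one minimal guard would be to wrap the conversion and fall back to a default. This is only a sketch; `parse_score` is a hypothetical helper, not part of the original code:

```python
def parse_score(raw, default=8.0):
    # Convert the GUI's rating input to a float; fall back to a
    # default instead of crashing on empty or non-numeric input.
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default

parse_score('8.5')   # valid input passes through
parse_score('abc')   # garbage falls back to the default
```

Hooking this into `runner` (in place of the bare `float(scoreLimit)`) would make the tool survive sloppy input.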
```python
'''
URL manager: keeps track of the crawled movie links, both newly parsed
links and links that have already been downloaded, and guarantees that
the same link is only downloaded once.
'''


class UrlManager(object):
    def __init__(self):
        self.urls = set()
        self.used_urls = set()
        self.download_urls = set()

    def add_url(self, url):
        if url is None:
            return
        if url not in self.urls and url not in self.used_urls:
            self.urls.add(url)

    def add_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_url(url)

    def has_continue(self):
        return len(self.urls) > 0

    def get_url(self):
        url = self.urls.pop()
        self.used_urls.add(url)
        return url

    def get_download_url(self):
        return self.download_urls

    def has_download(self, url):
        return url in self.download_urls

    def add_download_url(self, url):
        if url is None:
            return
        if url not in self.download_urls:
            self.download_urls.add(url)
```
url_manager.py; the docstring says it all. I've put fairly detailed comments on the key parts of every .py file.
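The dedup logic boils down to three sets: pending links, links already crawled, and links already sent to Thunder. A condensed sketch of how the first two interact (`MiniUrlManager` is an illustrative cut-down version, not the class above):

```python
class MiniUrlManager:
    def __init__(self):
        self.urls = set()       # pending, not yet crawled
        self.used_urls = set()  # already crawled

    def add_url(self, url):
        # A link enters the pending set only if it is neither
        # pending already nor crawled already.
        if url and url not in self.urls and url not in self.used_urls:
            self.urls.add(url)

    def get_url(self):
        # Moving a link from pending to used marks it as crawled.
        url = self.urls.pop()
        self.used_urls.add(url)
        return url

m = MiniUrlManager()
m.add_url('https://example.com/a')
m.add_url('https://example.com/a')  # duplicate, silently ignored
u = m.get_url()
m.add_url(u)                        # already crawled, silently ignored
```

The third set, `download_urls`, plays the same role for Thunder links and is the one that gets persisted between runs.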
```python
import requests
from requests import Timeout

'''
HtmlDownload: fetches the whole HTML page behind a URL; html_parser.py
then extracts the valuable information from it.
'''


class HtmlDownload(object):
    def __init__(self):
        self.request_session = requests.session()

    def down_html(self, url, retry_count=3, headers=None, proxies=None, data=None):
        if headers:
            self.request_session.headers.update(headers)
        try:
            if data:
                content = self.request_session.post(url, data=data, proxies=proxies)
                print('result code: ' + str(content.status_code) + ', link: ' + url)
                if content.status_code == 200:
                    return content.content
            else:
                content = self.request_session.get(url, proxies=proxies)
                print('result code: ' + str(content.status_code) + ', link: ' + url)
                if content.status_code == 200:
                    return content.content
        except (ConnectionError, Timeout) as e:
            print('HtmlDownload ConnectionError or Timeout: ' + str(e))
            if retry_count > 0:
                return self.down_html(url, retry_count - 1, headers, proxies, data)
            return None
        except Exception as e:
            print('HtmlDownload Exception: ' + str(e))
```
html_download.py simply pulls the whole static page down with requests.
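The retry behaviour of `down_html` can also be written iteratively, and exercised without any network by passing a stand-in fetch function. Everything below (`with_retry`, `flaky_fetch`) is made up for the demo and uses built-in exceptions rather than requests' `Timeout`:

```python
def with_retry(fetch, url, retry_count=3):
    # Call fetch(url), retrying on connection-style errors;
    # return None once the retries are exhausted.
    for _attempt in range(retry_count + 1):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            continue
    return None

# Simulate a flaky server that fails twice before succeeding.
calls = {'n': 0}

def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated network error')
    return b'<html>ok</html>'

result = with_retry(flaky_fetch, 'http://example.com')
```

The important detail either way is to propagate the retried result back to the caller rather than discarding it.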
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import urllib.parse
import base64

'''
HTML page parser
'''


class HtmlParser(object):
    # Parse a movie-list page and collect the links to the detail pages
    def parser_movie_link(self, content):
        try:
            urls = set()
            next_link = None
            doc = BeautifulSoup(content, 'lxml')
            div_content = doc.find('div', class_='co_content8')
            if div_content is not None:
                tables = div_content.find_all('table')
                if tables is not None and len(tables) > 0:
                    for table in tables:
                        link = table.find('a', class_='ulink')
                        if link is not None:
                            print('movie name: ' + link.text)
                            movie_link = urljoin('https://www.dytt8.net', link.get('href'))
                            print('movie link ' + movie_link)
                            urls.add(movie_link)
                next = div_content.find('a', text=re.compile(r".*?下一頁.*?"))
                if next is not None:
                    next_link = urljoin('https://www.dytt8.net/html/gndy/dyzz/', next.get('href'))
                    print('movie next link ' + next_link)
            return urls, next_link
        except Exception as e:
            print('error while parsing movie links: ' + str(e))

    # Parse a movie detail page and extract the movie's information
    def parser_movie_info(self, content, score=8):
        try:
            movie_name = None           # movie title
            movie_score = 0             # movie rating
            movie_xunlei_links = set()  # Thunder download links (there may be several)
            doc = BeautifulSoup(content, 'lxml')
            movie_name = doc.find('title').text.replace('迅雷下載_電影天堂', '')
            div_zoom = doc.find('div', id='Zoom')
            if div_zoom is not None:
                # Extract the rating
                span_txt = div_zoom.text
                txt_list = span_txt.split('◎')
                if txt_list is not None and len(txt_list) > 0:
                    for tl in txt_list:
                        if 'IMDB' in tl or 'IMDb' in tl or 'imdb' in tl or 'IMdb' in tl:
                            txt_score = tl.split('/')[0]
                            print(txt_score)
                            movie_score = re.findall(r"\d+\.?\d*", txt_score)
                            if movie_score is None or len(movie_score) <= 0:
                                movie_score = 1
                            else:
                                movie_score = movie_score[0]
                            print(movie_name + ' IMDB score: ' + str(movie_score))
                            if float(movie_score) < score:
                                print('score below ' + str(score) + ', skipping')
                                return movie_name, movie_score, movie_xunlei_links
                txt_a = div_zoom.find_all('a', href=re.compile(r".*?ftp:.*?"))
                if txt_a is not None:
                    # Collect the Thunder download links
                    for alink in txt_a:
                        xunlei_link = alink.get('href')
                        '''
                        This converted the link into Thunder's dedicated format;
                        it later turned out Thunder recognises the raw link too.
                        xunlei_link = urllib.parse.quote(xunlei_link)
                        xunlei_link = xunlei_link.replace('%3A', ':')
                        xunlei_link = xunlei_link.replace('%40', '@')
                        xunlei_link = xunlei_link.replace('%5B', '[')
                        xunlei_link = xunlei_link.replace('%5D', ']')
                        xunlei_link = 'AA' + xunlei_link + 'ZZ'
                        xunlei_link = base64.b64encode(xunlei_link.encode('gbk'))
                        xunlei_link = 'thunder://' + str(xunlei_link, encoding='gbk')
                        '''
                        print(xunlei_link)
                        movie_xunlei_links.add(xunlei_link)
            return movie_name, movie_score, movie_xunlei_links
        except Exception as e:
            print('error while parsing the movie detail page: ' + str(e))
```
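The commented-out conversion builds Thunder's `thunder://` scheme by wrapping the raw link in `AA`...`ZZ`, base64-encoding it, and prepending the scheme. A standalone, runnable version of just that transform (`to_thunder` is an illustrative name, and the link is made up):

```python
import base64

def to_thunder(link, encoding='gbk'):
    # Wrap the raw link in the AA...ZZ markers Thunder expects,
    # base64-encode the result, and prepend the thunder:// scheme.
    wrapped = ('AA' + link + 'ZZ').encode(encoding)
    return 'thunder://' + base64.b64encode(wrapped).decode('ascii')

thunder_link = to_thunder('ftp://example.com/movie.mkv')
```

Decoding the base64 payload of the result gives back the `AA`-prefixed, `ZZ`-suffixed original link, which is how the transform can be verified.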
html_parser.py parses the downloaded HTML with bs4 and extracts what we need according to the page's structure. This is the most important part of any scraper; the whole point of writing one is to pull out the data that is useful to us.
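The rating extraction in `parser_movie_info` splits the `◎` IMDb field on `/` and pulls the first number out with a regex, defaulting to 1 when no number is found. A condensed, runnable version of just that step (`extract_score` is an illustrative wrapper and the sample strings are made up):

```python
import re

def extract_score(field):
    # Take everything before the first '/', then grab the first
    # number in it; default to 1.0 when no number is present.
    matches = re.findall(r"\d+\.?\d*", field.split('/')[0])
    return float(matches[0]) if matches else 1.0

extract_score('IMDb rating 7.6/10 from 12,345 users')  # the numeric rating
extract_score('IMDb rating N/A')                       # no number, default
```

Splitting on `/` first matters: without it, the regex would happily match the `10` in `7.6/10` or the vote count further along the line.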
```python
import dbm
import pickle
import os

'''
Data persistence helpers
'''


class PersistUtil(object):
    def save_data(self, name='No Name', urls=None):
        if urls is None or len(urls) <= 0:
            return
        history_db = dbm.open('downloader_history', 'c')
        try:
            history_db[name] = str(urls)
        finally:
            history_db.close()

    def get_data(self):
        history_links = set()
        try:
            history_db = dbm.open('downloader_history', 'r')
            for key in history_db.keys():
                history_links.add(str(history_db[key], 'gbk'))
        except Exception as e:
            print('failed to iterate the dbm data: ' + str(e))
        return history_links

    # Save the download history with pickle
    def save_history_links(self, urls):
        if urls is None or len(urls) <= 0:
            return
        with open('DownloaderHistory', 'wb') as pickle_file:
            pickle.dump(urls, pickle_file)

    # Load the download history saved with pickle
    def load_history_links(self):
        if os.path.exists('DownloaderHistory'):
            with open('DownloaderHistory', 'rb') as pickle_file:
                return pickle.load(pickle_file)
        return None
```
persist_util.py, the data-persistence helpers.
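The pickle half of `PersistUtil` is a plain dump/load round-trip of the set of downloaded links. The sketch below does the same thing, but into a temporary directory so it never clobbers a real `DownloaderHistory` file (the links themselves are made up):

```python
import os
import pickle
import tempfile

# The same round-trip PersistUtil performs, isolated in a temp dir.
history = {'ftp://example.com/movie-a.mkv', 'ftp://example.com/movie-b.mkv'}
path = os.path.join(tempfile.mkdtemp(), 'DownloaderHistory')

with open(path, 'wb') as f:
    pickle.dump(history, f)           # save_history_links
with open(path, 'rb') as f:
    restored = pickle.load(f)         # load_history_links
```

Because a `set` pickles and unpickles losslessly, the restored object compares equal to the original, which is exactly what the crash-safe "persist after every movie" strategy relies on.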
That completes the code. A word about Thunder: I installed the latest Thunder X, and you must enable the one-click download option in Thunder's settings exactly as in the screenshot below, otherwise every new task pops up a confirmation dialog. Also, about the line that hands a resource to Thunder, os.system(r'"D:\迅雷\Thunder\Program\Thunder.exe" {url}'.format(url=xunlei_link)): you must point it at the actual Thunder.exe inside the install directory. Don't use the path from the desktop shortcut (right-click the Thunder icon, Properties, Target; for Thunder X that shows the shortcut's path, which won't work), or the program won't be found.
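As an aside, passing the executable and the link as an argv list to subprocess sidesteps the quoting that os.system needs for a path containing spaces or non-ASCII characters. This is only a sketch, not the author's code: `push_to_thunder` is a made-up name, the path is the author's install location, and the `launcher` parameter exists purely so the call can be exercised without Thunder installed:

```python
import subprocess

# Illustrative only: adjust to your own Thunder install path.
THUNDER_EXE = r'D:\迅雷\Thunder\Program\Thunder.exe'

def push_to_thunder(link, launcher=subprocess.Popen):
    # Building argv as a list means no manual quoting of the exe path.
    cmd = [THUNDER_EXE, link]
    launcher(cmd)
    return cmd

# Exercise the function with a stand-in launcher that just records the call.
launched = []
cmd = push_to_thunder('ftp://example.com/movie.mkv', launcher=launched.append)
```

In the real tool you would call `push_to_thunder(xunlei_link)` and let the default `subprocess.Popen` start Thunder asynchronously, much as os.system does.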
At this point you should have movies crawling in nicely. Of course there is plenty of room for optimisation: the threading part, the persistence part, and so on. Beginners can use this to practise, then study the structure of some other static site, rewrite the HTML-parsing part, and crawl other sites too, say the ones hosting movies full of dangerous moves... Though you'd best watch fewer of those: read more books, and if you do indulge occasionally, wash up afterwards and keep things hygienic.
That's all for this article. I hope it helps with your studies, and please keep supporting Yisu Cloud (億速云).