Python Crawler in Practice: Handling Fetch Exceptions

Published: 2020-11-30 10:09:55 | Source: 億速云 | Views: 189 | Author: 小新 | Category: Programming Languages

This article shows how to handle fetch exceptions in a Python crawler. I found the approach practical, so I am sharing it here for reference.

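While crawling, an account may suddenly get banned, or a request may fail because of network problems. How should each case be handled? For the former, we need to check whether the content returned by each request matches expectations: whether the response URL is normal, and whether the response content is a 404 page or a prompt asking you to verify a phone number. For the latter, a simple retry strategy is enough. The code that handles both cases is as follows: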

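import os
import sys
import time

import requests

# crawler (a logger), Cookies, Urls, login, headers, max_retries, time_out,
# interal, excp_interal, timeout_decorator, get_login_info, freeze_account and
# the is_403 / is_404 / is_complete helpers come from the surrounding project
# and are not shown in this excerpt.


@timeout_decorator
def get_page(url, user_verify=True, need_login=True):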
    """
    :param url: URL to fetch
    :param user_verify: whether the page may show a captcha (ajax links never do; requests for weibo posts or user info may). False when fetching the ajax link for reposts
    :param need_login: whether the page requires login; skipping login where possible eases the load on the accounts
    :return: the response data; an empty string on 404, 403, or any other exception
    """
    crawler.info('Fetching url {url}'.format(url=url))
    count = 0
 
    while count < max_retries:
 
        if need_login:
            # Use different cookies on every retry; if there is only one account, reusing them is allowed
            name_cookies = Cookies.fetch_cookies()

            if name_cookies is None:
                crawler.warning('No cookies in the cookie pool; checking for usable accounts')
                rs = get_login_info()

                # Log in with accounts in a normal state; if none are usable, stop the celery workers
                if len(rs) == 0:
                    crawler.error('No usable accounts; please check account health')
                    # Kill all celery-related processes
                    if 'win32' in sys.platform:
                        os.popen('taskkill /F /IM "celery*"')
                    else:
                        os.popen('pkill -f "celery"')
                    return ''
                else:
                    crawler.info('Re-acquiring cookies...')
                    login.excute_login_task()
                    time.sleep(10)
                    # name_cookies is still None here, so count this attempt and fetch cookies again
                    count += 1
                    continue
 
        try:
            if need_login:
                resp = requests.get(url, headers=headers, cookies=name_cookies[1], timeout=time_out, verify=False)
 
                if "$CONFIG['islogin'] = '0'" in resp.text:
                    crawler.warning('Account {} is in an abnormal state'.format(name_cookies[0]))
                    freeze_account(name_cookies[0], 0)
                    Cookies.delete_cookies(name_cookies[0])
                    continue
            else:
                resp = requests.get(url, headers=headers, timeout=time_out, verify=False)
 
            page = resp.text
            if page:
                page = page.encode('utf-8', 'ignore').decode('utf-8')
            else:
                # Count an empty response as a failed attempt so the loop cannot spin forever
                count += 1
                continue
 
            # Sleep after every fetch to lower the risk of the account being banned
            time.sleep(interal)
 
            if user_verify:
                if 'unfreeze' in resp.url or 'accessdeny' in resp.url or 'userblock' in resp.url or is_403(page):
                    crawler.warning('Account {} has been frozen'.format(name_cookies[0]))
                    freeze_account(name_cookies[0], 0)
                    Cookies.delete_cookies(name_cookies[0])
                    count += 1
                    continue
 
                if 'verifybmobile' in resp.url:
                    crawler.warning('Account {} is locked and must be unlocked by phone'.format(name_cookies[0]))
                    freeze_account(name_cookies[0], -1)
                    Cookies.delete_cookies(name_cookies[0])
                    continue
 
                if not is_complete(page):
                    count += 1
                    continue
 
                if is_404(page):
                    crawler.warning('The page at url {url} does not exist'.format(url=url))
                    return ''
 
        except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectionError, AttributeError) as e:
            crawler.warning('Exception while fetching {}: {}'.format(url, e))
            count += 1
            time.sleep(excp_interal)
 
        else:
            Urls.store_crawl_url(url, 1)
            return page
 
    crawler.warning('Max retries reached for {}; look up this url in the redis failure queue and check the cause'.format(url))
    Urls.store_crawl_url(url, 0)
    return ''

Treat the code above as pseudocode; the point is to see how exceptions are handled during fetching. Pasting the complete user-crawling code here is not practical because there is too much of it.
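
To make the two ideas easier to see in isolation (validate every response, and retry a bounded number of times), here is a minimal, self-contained sketch. It is not the project's actual code. The URL fragments are taken from the listing above, while `classify_response`, `fetch_with_retry`, the injected `fetch` callable, and the status names are hypothetical names introduced for illustration. With `requests` you would pass its exception classes (for example `requests.exceptions.ReadTimeout`) through `retry_on` instead of the built-in ones used here.

```python
import time

# URL fragments taken from the article's listing; the status names are illustrative.
FROZEN_MARKERS = ('unfreeze', 'accessdeny', 'userblock')
PHONE_MARKER = 'verifybmobile'


def classify_response(final_url, text):
    """Map a response to 'ok', 'empty', 'frozen', or 'needs_phone'."""
    if PHONE_MARKER in final_url:
        return 'needs_phone'
    if any(marker in final_url for marker in FROZEN_MARKERS):
        return 'frozen'
    if not text:
        return 'empty'
    return 'ok'


def fetch_with_retry(url, fetch, max_retries=3, wait=0.0,
                     retry_on=(ConnectionError, TimeoutError)):
    """Call fetch(url) at most max_retries times; return the page text or ''.

    fetch is a hypothetical callable that returns (final_url, text).
    """
    for _ in range(max_retries):
        try:
            final_url, text = fetch(url)
        except retry_on:
            # Network-level failure: wait, then try again.
            time.sleep(wait)
            continue
        if classify_response(final_url, text) == 'ok':
            return text
        # Any other status counts as a failed attempt. A real crawler would
        # also rotate cookies or freeze the account at this point.
        time.sleep(wait)
    return ''


# Usage: a fake fetcher that fails once, gets redirected once, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError('simulated network failure')
    if len(calls) == 2:
        return ('https://example.com/unfreeze', 'blocked')
    return (url, '<html>ok</html>')


page = fetch_with_retry('https://example.com/u/1', flaky_fetch, max_retries=3)
```

Because every failed attempt consumes one iteration, the loop always terminates, and the caller only has to handle two outcomes: page text or an empty string.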

Thanks for reading! That covers how to handle fetch exceptions in a Python crawler; I hope it is of some help.
