How to Extract the Main Content of a Web Page with Python

Published: 2022-01-18 15:28:21 Source: Yisu Cloud (億速云) Views: 283 Author: iii Category: Programming Languages

This article explains how to extract the main content of a web page with Python. The material is detailed yet easy to follow, and the steps are simple to put into practice. Let's take a look.

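A typical news web page consists of several distinct regions:

[Figure: regions of a news web page]

The news elements we want to extract are contained in:

Title region
Metadata region (publish time, etc.)
Image region (if the images should be extracted as well)
Body text region

The text in the navigation bar and related-links regions does not belong to the news item.

The title, publish time, and body text are generally extracted from the HTML we crawled. If we only had to handle news pages from a single website, extracting these three items would be simple: three regular expressions would do the job. However, our crawler fetches pages from hundreds or thousands of websites. Writing regular expressions for that many page formats would be exhausting, and whenever a page is even slightly redesigned the expressions may stop working, so maintaining them would be just as exhausting.

Such a labor-intensive approach clearly will not work, so we need to look for a better algorithm.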

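1. Title extraction

The title almost always appears in the HTML <title> tag, but with extra information attached, such as the channel name and the site name.

The title also appears in the "title region" of the page.

So which of these two places is easier to extract the title from?

The "title region" of a page has no obvious marker, and its HTML varies enormously from site to site, so this region is not easy to extract.

That leaves the <title> tag. The tag itself is easy to extract, whether with a regular expression or with lxml. The hard part is removing the channel name, site name, and similar information.

Let's first look at what kind of extra information appears in <title> tags:

Shanghai uses "smart" technology to energize city traffic, making roads safer, more orderly, and smoother_Pujiang Headlines_Pengpai News-The Paper
"Shanghai and Hong Kong University Alliance" founded today at Fudan University_Education_Xinmin.cn
Sanya senior who kicked a driver, causing the bus to lose control and hit a wall, sentenced to 3 years_Society
Foreign Ministry: China and US diplomatic and security dialogue to be held in the US on the 9th
CIIE: China's actions draw global attention, China's sense of responsibility wins worldwide praise_Nanfang Guanlan_Southcn.com
Capital market welcomes major reform: what is the deeper meaning of setting up the STAR Market?-Xinhuanet

Looking at these titles, it is easy to see that the news headline is joined to the channel name and site name by separator characters. We can therefore split the title on these separators and take the longest part as the news headline.

This idea is easy to implement, so no code is given here. It is left to readers as an exercise.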
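
One possible way to approach this exercise is sketched below. It is only an illustration under stated assumptions: the separator set is guessed from the sample titles above, and the name longest_title_part is made up for this example. Splitting on a bare hyphen also breaks hyphenated words, so a real implementation may need a stricter rule.

```python
import re

# Separators that commonly join a headline with channel and site names.
# This set is an assumption based on the sample titles, not a complete list.
SEPARATORS = re.compile(r"\s*(?:::|[_|\-–—])\s*")


def longest_title_part(raw_title: str) -> str:
    """Split a <title> string on separators and return the longest piece."""
    parts = [p.strip() for p in SEPARATORS.split(raw_title) if p.strip()]
    if not parts:
        return raw_title.strip()
    # The headline is assumed to be longer than any channel or site name.
    return max(parts, key=len)


print(longest_title_part("Capital market reform explained-Xinhuanet"))
# Capital market reform explained
```

A title with no separator at all is returned unchanged, which matches the fourth sample title above.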

2. Publish time extraction

The publish time is the time at which the page went live on its website. It usually appears below the headline, in the metadata region. In the HTML, this region has no special feature we could use to locate it, and across a large number of site layouts locating it is practically impossible. We need a different approach.
As with the title, let's first look at how some sites write their publish time:

CCTV.com 2018年11月06日 22:22
Time: 2018-11-07 14:27:00
2018-11-07 11:20:37 Source: Xinhuanet
Source: China Daily 2018-11-07 08:06:39
2018年11月07日 07:39:19
2018-11-06 09:58 Source: The Paper

These publish times all share one feature: each is a string representing a time, made up of year, month, day, hour, minute, and second and nothing more. By listing regular expressions for the different ways of writing a time (there are only a handful), we can match against the page text and extract the publish time.

This idea is also easy to implement, but there are many details: the patterns should cover as many formats as possible, so writing a good publish-time extraction function is not that easy. Readers are encouraged to try writing their own. This is left as another exercise.
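
As a starting point for this exercise, here is a minimal sketch. It assumes only two date styles (the ones visible in the samples above), and the names DATE_PATTERNS and extract_publish_time are hypothetical. A production version would need more patterns and some way to tell the publish time apart from other dates on the page.

```python
import re
from datetime import datetime
from typing import Optional

# Two illustrative patterns covering the sample strings above.
# This list is an assumption and is far from exhaustive.
DATE_PATTERNS = [
    # 2018-11-07 14:27:00, 2018/11/07 14:27, 2018-11-07
    re.compile(
        r"(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})"
        r"(?:\s+(\d{1,2}):(\d{2})(?::(\d{2}))?)?"
    ),
    # 2018年11月06日 22:22
    re.compile(
        r"(\d{4})年(\d{1,2})月(\d{1,2})日"
        r"(?:\s*(\d{1,2}):(\d{2})(?::(\d{2}))?)?"
    ),
]


def extract_publish_time(text: str) -> Optional[datetime]:
    """Return the earliest date-like string in text as a datetime, or None."""
    best = None
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        # Prefer the match that appears first in the text.
        if best is None or match.start() < best.start():
            best = match
    if best is None:
        return None
    # Missing time fields (hour, minute, second) default to 0.
    year, month, day, hour, minute, second = (
        int(g) if g else 0 for g in best.groups()
    )
    try:
        return datetime(year, month, day, hour, minute, second)
    except ValueError:
        # Digits matched the pattern but do not form a valid date.
        return None


print(extract_publish_time("Time: 2018-11-07 14:27:00"))
```

Taking the first match in the text is a design choice that relies on the publish time usually sitting near the top of the page, right below the headline.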

3. Body text extraction

The body (including news images) is the main part of a news page. Visually it occupies the center of the page and is the main text region of the news content. There are many ways to extract the body, some complex and some simple. The method introduced here is a simple and fast one that comes out of the author's years of practical experience and thinking. Let's call it the "node text density method".

As we know, the HTML of a web page is a tree made up of tags, and each tag is a node of the tree. By traversing every node of this tree and finding the node with the most text, we find the node that holds the body. Let's implement this idea in code.
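
The core idea can be sketched in a few lines. This is a simplified illustration, not the author's code: it uses the standard library's html.parser rather than lxml so that it runs without extra packages, and the names DensityParser and densest_node are made up for this example. Each element is scored by the text held directly by its child elements, with <p> children counted three times.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class DensityParser(HTMLParser):
    """Score each element by the text held directly by its child elements."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements, innermost last
        self.scored = []   # (score, tag, attrs) for every closed element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(
            {"tag": tag, "attrs": dict(attrs), "own": 0, "score": 0})

    def handle_data(self, data):
        # Credit text to the innermost open element, ignoring script/style.
        if self.stack and self.stack[-1]["tag"] not in SKIP_TAGS:
            self.stack[-1]["own"] += len(data.strip())

    def handle_endtag(self, tag):
        if not any(el["tag"] == tag for el in self.stack):
            return  # stray closing tag
        while self.stack:
            el = self.stack.pop()
            self._close(el)
            if el["tag"] == tag:
                break

    def _close(self, el):
        self.scored.append((el["score"], el["tag"], el["attrs"]))
        if self.stack:
            # A child's own text counts toward its parent's score.
            weight = 3 if el["tag"] == "p" else 1
            self.stack[-1]["score"] += el["own"] * weight

    def close(self):
        super().close()
        while self.stack:  # flush elements that were never closed
            self._close(self.stack.pop())


def densest_node(html):
    """Return (tag, attrs, score) of the highest-scoring element, or None."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    if not parser.scored:
        return None
    score, tag, attrs = max(parser.scored, key=lambda item: item[0])
    return tag, attrs, score
```

On a page with a navigation block, an article block made of several paragraphs, and a footer, the article block should receive the highest score because its children hold most of the text.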

3.1 Source code

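#!/usr/bin/env python3
# File: maincontent.py
# Author: veelion
import re
import traceback

import cchardet
import lxml.etree
import lxml.html
from lxml.html import HtmlComment

# Keywords that often appear in class/id attributes and hint at whether a
# node is likely to hold body text (positive) or not (negative).
REGEXES = {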
    'okMaybeItsACandidateRe': re.compile(
        'and|article|artical|body|column|main|shadow', re.I),
    'positiveRe': re.compile(
        ('article|arti|body|content|entry|hentry|main|page|'
         'artical|zoom|arti|context|message|editor|'
         'pagination|post|txt|text|blog|story'), re.I),
    'negativeRe': re.compile(
        ('copyright|combx|comment|com-|contact|foot|footer|footnote|decl|copy|'
         'notice|'
         'masthead|media|meta|outbrain|promo|related|scroll|link|pagebottom|bottom|'
         'other|shoutbox|sidebar|sponsor|shopping|tags|tool|widget'), re.I),
}
class MainContent:
    def __init__(self,):
        self.non_content_tag = set([
            'head',
            'meta',
            'script',
            'style',
            'object', 'embed',
            'iframe',
            'marquee',
            'select',
        ])
        self.title = ''
        self.p_space = re.compile(r'\s')
        self.p_html = re.compile(r'<html|</html>', re.IGNORECASE|re.DOTALL)
        # Chinese markers: "end of body", "below body", "related reading", "disclaimer"
        self.p_content_stop = re.compile(r'正文.*结束|正文下|相关阅读|声明')
        self.p_clean_tree = re.compile(r'author|post-add|copyright')
    def get_title(self, doc):
        title = ''
        title_el = doc.xpath('//title')
        if title_el:
            title = title_el[0].text_content().strip()
        if len(title) < 7:
            tt = doc.xpath('//meta[@name="title"]')
            if tt:
                title = tt[0].get('content', '')
        if len(title) < 7:
            tt = doc.xpath('//*[contains(@id, "title") or contains(@class, "title")]')
            if not tt:
                tt =  doc.xpath('//*[contains(@id, "font01") or contains(@class, "font01")]')
            for t in tt:
                ti = t.text_content().strip()
                if ti in title and len(ti)*2 > len(title):
                    title = ti
                    break
                if len(ti) > 20: continue
                if len(ti) > len(title) or len(ti) > 7:
                    title = ti
        return title
    def shorten_title(self, title):
        # Try separators in order and keep the part before the first one found.
        separators = [' - ', '–', '—', '-', '|', '::']
        for s in separators:
            if s not in title:
                continue
            tts = title.split(s)
            if len(tts) < 2:
                continue
            title = tts[0]
            break
        return title
    def calc_node_weight(self, node):
        weight = 1
        attr = '%s %s %s' % (
            node.get('class', ''),
            node.get('id', ''),
            node.get('style', '')
        )
        if attr:
            mm = REGEXES['negativeRe'].findall(attr)
            weight -= 2 * len(mm)
            mm = REGEXES['positiveRe'].findall(attr)
            weight += 4 * len(mm)
        if node.tag in ['div', 'p', 'table']:
            weight += 2
        return weight
    def get_main_block(self, url, html, short_title=True):
        ''' return (title, etree_of_main_content_block)
        '''
        if isinstance(html, bytes):
            encoding = cchardet.detect(html)['encoding']
            if encoding is None:
                return None, None
            html = html.decode(encoding, 'ignore')
        try:
            doc = lxml.html.fromstring(html)
            doc.make_links_absolute(base_url=url)
        except Exception:
            traceback.print_exc()
            return None, None
        self.title = self.get_title(doc)
        if short_title:
            self.title = self.shorten_title(self.title)
        body = doc.xpath('//body')
        if not body:
            return self.title, None
        candidates = []
        nodes = body[0].getchildren()
        while nodes:
            node = nodes.pop(0)
            children = node.getchildren()
            tlen = 0
            for child in children:
                if isinstance(child, HtmlComment):
                    continue
                if child.tag in self.non_content_tag:
                    continue
                if child.tag == 'a':
                    continue
                if child.tag == 'textarea':
                    # FIXME: this tag is only part of content?
                    continue
                attr = '%s%s%s' % (child.get('class', ''),
                                   child.get('id', ''),
                                   child.get('style', ''))
                if 'display' in attr and 'none' in attr:
                    continue
                nodes.append(child)
                if child.tag == 'p':
                    weight = 3
                else:
                    weight = 1
                text = '' if not child.text else child.text.strip()
                tail = '' if not child.tail else child.tail.strip()
                tlen += (len(text) + len(tail)) * weight
            if tlen < 10:
                continue
            weight = self.calc_node_weight(node)
            candidates.append((node, tlen*weight))
        if not candidates:
            return self.title, None
        candidates.sort(key=lambda a: a[1], reverse=True)
        good = candidates[0][0]
        if good.tag in ['p', 'pre', 'code', 'blockquote']:
            # Climb at most five levels to reach an enclosing <div>.
            for _ in range(5):
                parent = good.getparent()
                if parent is None:
                    break
                good = parent
                if good.tag == 'div':
                    break
        good = self.clean_etree(good, url)
        return self.title, good
    def clean_etree(self, tree, url=''):
        to_drop = []
        drop_left = False
        for node in tree.iterdescendants():
            if drop_left:
                to_drop.append(node)
                continue
            if isinstance(node, HtmlComment):
                to_drop.append(node)
                if self.p_content_stop.search(node.text or ''):
                    drop_left = True
                continue
            if node.tag in self.non_content_tag:
                to_drop.append(node)
                continue
            attr = '%s %s' % (
                node.get('class', ''),
                node.get('id', '')
            )
            if self.p_clean_tree.search(attr):
                to_drop.append(node)
                continue
            aa = node.xpath('.//a')
            if aa:
                text_node = len(self.p_space.sub('', node.text_content()))
                text_aa = 0
                for a in aa:
                    alen = len(self.p_space.sub('', a.text_content()))
                    if alen > 5:
                        text_aa += alen
                if text_aa > text_node * 0.4:
                    to_drop.append(node)
        for node in to_drop:
            try:
                node.drop_tree()
            except Exception:
                # Best effort: skip nodes that cannot be dropped.
                pass
        return tree
    def get_text(self, doc):
        lxml.etree.strip_elements(doc, 'script')
        lxml.etree.strip_elements(doc, 'style')
        for ch in doc.iterdescendants():
            if not isinstance(ch.tag, str):
                continue
            if ch.tag in ['div', 'h2', 'h3', 'h4', 'p', 'br', 'table', 'tr', 'dl']:
                if not ch.tail:
                    ch.tail = '\n'
                else:
                    ch.tail = '\n' + ch.tail.strip() + '\n'
            if ch.tag in ['th', 'td']:
                if not ch.text:
                    ch.text = '  '
                else:
                    ch.text += '  '
            # if ch.tail:
            #     ch.tail = ch.tail.strip()
        lines = doc.text_content().split('\n')
        content = []
        for l in lines:
            l = l.strip()
            if not l:
                continue
            content.append(l)
        return '\n'.join(content)
    def extract(self, url, html):
        '''return (title, content)
        '''
        title, node = self.get_main_block(url, html)
        if node is None:
            print('\tno main block got !!!!!', url)
            return title, ''
        content = self.get_text(node)
        return title, content

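3.2 Code walkthrough

As with the news crawler, we implement the whole algorithm as a class: MainContent.

First, a global variable, REGEXES, is defined. It collects keywords that often appear in the class and id attributes of tags and that indicate whether a tag is likely to be body text or not. We use these keywords to compute a weight for each tag node, which is what the method calc_node_weight() does.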
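
To make the arithmetic of this weighting concrete, here is a small standalone sketch. It follows the same rule as calc_node_weight() (start at 1, subtract 2 per negative keyword hit, add 4 per positive hit, add 2 for div, p, and table), but the keyword lists are shortened and the function name attr_weight is made up for this example. It works on plain strings instead of lxml nodes.

```python
import re

# Shortened keyword lists for illustration only; the full lists are in REGEXES.
POSITIVE = re.compile(r"article|body|content|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|foot|related|sidebar|sponsor", re.I)


def attr_weight(tag: str, class_attr: str = "", id_attr: str = "") -> int:
    """Weight a node from its tag name and its class/id attribute strings."""
    attr = f"{class_attr} {id_attr}"
    weight = 1
    weight -= 2 * len(NEGATIVE.findall(attr))   # each negative hit costs 2
    weight += 4 * len(POSITIVE.findall(attr))   # each positive hit adds 4
    if tag in ("div", "p", "table"):
        weight += 2                              # typical containers of body text
    return weight


print(attr_weight("div", "article-content", "main"))  # 15
print(attr_weight("div", "sidebar related"))          # -1
```

Because the text length of a candidate is multiplied by this weight, a node with a negative weight ends up with a negative score and sinks to the bottom of the candidate list.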

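The initializer of MainContent first defines self.non_content_tag, a set of tags that never contain body text. Nodes with these tags are simply ignored.

Title extraction is implemented in get_title(). It first reads the content of the <title> tag, then tries to find a title in <meta>, then looks in <body> for nodes whose id or class contains "title" (in the code, these fallbacks are only tried when the title found so far is shorter than 7 characters). Finally it compares the candidate titles obtained from these places to arrive at the title. The comparison rules are:

If a candidate title found in <meta> or <body> is contained in the <title> tag text, it is a clean title (without channel name or site name).
If a candidate title is too long, it is ignored.
The <title> tag is the primary source of the title.

Titles taken from the <title> tag need cleaning. A simple method is implemented for this: shorten_title().

In this implementation, we use lxml.html to turn the page's HTML into a tree and, starting from the body node, traverse every node, looking at the length of the text it contains directly (excluding child nodes), and pick the node with the longest text. This process is implemented in get_main_block(). Readers may want to study some of its details closely.

One such detail is the clean_etree() function. The node returned by get_main_block() may contain links to related news. These links carry a large number of news headlines, and if they are not removed they pollute the news content with noise (headlines and summaries of related news).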
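
The link-removal rule can be illustrated with a standalone sketch. It applies the same test as the source (drop a node when the text inside links longer than 5 characters exceeds 40% of the node's text, whitespace excluded), but it uses the standard library's html.parser on an HTML fragment instead of lxml nodes. The names LinkTextCounter and is_link_heavy are hypothetical.

```python
import re
from html.parser import HTMLParser

_SPACE = re.compile(r"\s")


class LinkTextCounter(HTMLParser):
    """Count non-space characters overall and inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.total = 0      # non-space characters in the whole fragment
        self.in_links = 0   # non-space characters inside long link texts
        self._depth = 0     # greater than 0 while inside an <a>
        self._current = 0   # characters seen in the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Short links (5 characters or fewer) are not counted.
                if self._current > 5:
                    self.in_links += self._current
                self._current = 0

    def handle_data(self, data):
        n = len(_SPACE.sub("", data))
        self.total += n
        if self._depth:
            self._current += n


def is_link_heavy(fragment: str, threshold: float = 0.4) -> bool:
    """Return True when link text dominates the fragment."""
    counter = LinkTextCounter()
    counter.feed(fragment)
    counter.close()
    return counter.in_links > counter.total * threshold
```

A list of related headlines consists almost entirely of link text and is flagged, while a paragraph that merely cites one source stays well under the threshold.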

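Another detail is the get_text() function. When extracting text from the main block, we do not call text_content() directly. Instead we apply some formatting, such as adding a newline character (\n) after certain tags and adding spaces between table cells. After this processing, the resulting text layout is closer to how the original page looks.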
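
The effect of this formatting step can be sketched without lxml. The example below is an assumption-laden illustration rather than the author's method: it uses the standard library's html.parser, and the names TextFormatter and html_to_text are made up. It adds a newline after block-level tags and at <br>, adds two spaces after table cells, skips script and style content, and then drops empty lines.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "h2", "h3", "h4", "p", "table", "tr", "dl"}
CELL_TAGS = {"th", "td"}
SKIP_TAGS = {"script", "style"}


class TextFormatter(HTMLParser):
    """Collect text, inserting line breaks and cell spacing as it goes."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")   # end of a block starts a new line
        elif tag in CELL_TAGS:
            self.parts.append("  ")   # separate table cells with spaces

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Convert an HTML fragment to text with one non-empty line per block."""
    formatter = TextFormatter()
    formatter.feed(fragment)
    formatter.close()
    lines = (line.strip() for line in "".join(formatter.parts).split("\n"))
    return "\n".join(line for line in lines if line)


sample = ("<div><p>First line.</p><p>Second<br>line.</p>"
          "<table><tr><td>a</td><td>b</td></tr></table>"
          "<script>var x;</script></div>")
print(html_to_text(sample))
```

Without the inserted newlines, the two paragraphs and the table row would run together into a single line of text.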

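Crawler takeaways

1. The cchardet module
A module for quickly detecting the encoding of text.

2. The lxml.html module
A module that parses HTML into a structure and lets you query pages with XPath. It is efficient and easy to use, and an essential tool for writing crawlers.

3. The complexity of content extraction
The body extraction algorithm implemented here can, for the most part, correctly handle more than 90% of news pages.
However, just as no two web pages are alike, no extraction algorithm works once and for all. When you use this algorithm at scale you will run into unusual pages, and at that point you need to refine the class to handle them.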

That concludes this article on how to extract the main content of a web page with Python. Thank you for reading.
