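What Are the pdfminer.six and pdfplumber Modules in Python

Published: 2020-10-28 09:31:48  Source: 億速云 (Yisu Cloud)  Views: 377  Author: 小新  Category: Programming Languages

This article explains what the pdfminer.six and pdfplumber modules in Python are and how to use them. The editor found it practical and is sharing it here for reference; hopefully you will take something useful away from it.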

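pdfminer.six

PDFMiner has a fairly steep learning curve: you need some understanding of the PDF document structure model. It is best suited to building custom tools for complex content processing.

PDFMiner is rarely used directly in everyday work, so only basic document content handling is shown here: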

import pathlib
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBox, LTFigure, LTImage
from pdfminer.converter import PDFPageAggregator

# Sample data directory, two levels above the current working directory
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情對中國連鎖餐飲行業的影響調研報告-中國連鎖經營協會.pdf')

with open(f_path, 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    # The aggregator collects the layout objects of each processed page
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        for x in layout:
            # Text box object
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())
            # Image object
            elif isinstance(x, LTImage):
                print('Found an image')
            # Figure object
            elif isinstance(x, LTFigure):
                print('Found a figure object')

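Although pdfminer has a steep learning curve, it is what you end up falling back on in complex cases. Among current open-source modules, its PDF support is probably the most complete.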
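
The loop in the listing above inspects only the top-level objects on each page, whereas image objects are typically nested inside figure containers. A minimal sketch of a recursive walk over a layout tree follows. It relies on duck typing: anything that can be iterated is treated as a container, which is how pdfminer.six layout containers such as LTFigure behave. The helper name iter_layout and the stand-in classes are hypothetical and exist only for this illustration.

```python
def iter_layout(obj):
    """Yield obj and, recursively, every object nested inside it."""
    yield obj
    # Strings are iterable too, but they are never layout containers
    if isinstance(obj, (str, bytes)):
        return
    try:
        children = iter(obj)
    except TypeError:
        return  # leaf object, nothing nested inside
    for child in children:
        yield from iter_layout(child)


# Hypothetical stand-ins for layout objects, used only for this demo
class FakeImage:
    def __init__(self, name):
        self.name = name


class FakeFigure(list):
    """A container whose children are reached by iterating over it."""


page = FakeFigure([FakeImage("top"), FakeFigure([FakeImage("nested")])])
found = [o.name for o in iter_layout(page) if isinstance(o, FakeImage)]
print(found)
```

With real pdfminer.six objects, the same walk could be applied to the layout returned by device.get_result() and filtered with isinstance(o, LTImage).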

The pdfplumber module described next is built on pdfminer.six and lowers the barrier to entry.

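pdfplumber

Compared with pdfminer.six, pdfplumber provides a more convenient interface for extracting PDF content.

It covers operations that come up often in everyday work, such as:

Extracting PDF content and saving it to a txt file
Extracting tables from a PDF to Excel
Extracting images from a PDF
Extracting charts from a PDF

Extracting PDF content and saving it to a txt file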

import pathlib
import pdfplumber

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情對中國連鎖餐飲行業的影響調研報告-中國連鎖經營協會.pdf')
out_path = path.joinpath('002pdf_out.txt')

# Open the output in 'w' mode so a re-run overwrites instead of appending,
# and write UTF-8 explicitly so Chinese text is saved the same way everywhere
with pdfplumber.open(f_path) as pdf, open(out_path, 'w', encoding='utf-8') as txt:
    for page in pdf.pages:
        # A page without extractable text may give None; write '' instead
        textdata = page.extract_text() or ''
        txt.write(textdata + '\n')
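
Depending on the pdfplumber version and the page, extract_text() can return None or an empty string for a page with no text layer, such as a scanned page. A minimal sketch of assembling the output defensively, assuming the per-page results have already been collected in a list. The helper name join_page_texts and the blank-line separator are choices made for this illustration.

```python
def join_page_texts(page_texts, separator="\n\n"):
    """Join per-page text, skipping pages that produced no text."""
    cleaned = []
    for text in page_texts:
        if not text:  # None or '' means nothing was extracted
            continue
        cleaned.append(text.strip())
    return separator.join(cleaned)


# Simulated results of page.extract_text() for three pages
pages = ["First page text\n", None, "  Third page text"]
print(join_page_texts(pages))
```

The returned string can then be written to the txt file in a single call.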

Extracting tables from a PDF to Excel

import pathlib
import pdfplumber
from openpyxl import Workbook

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情對中國連鎖餐飲行業的影響調研報告-中國連鎖經營協會.pdf')
out_path = path.joinpath('002pdf_excel.xlsx')

wb = Workbook()
sheet = wb.active
with pdfplumber.open(f_path) as pdf:
    # Zero-based indexes 19 to 21, i.e. pages 20 to 22 of the report
    for i in range(19, 22):
        page = pdf.pages[i]
        # extract_table() returns None when no table is found on the page
        table = page.extract_table() or []
        for row in table:
            sheet.append(row)
wb.save(out_path)

The code above uses openpyxl to create the Excel file; openpyxl was covered in a separate earlier article.
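
Rows returned by extract_table() are lists of cell strings in which empty cells appear as None, and cell text can contain line breaks. When a table spans several pages, its header row may also repeat on each page. A minimal sketch of tidying such rows before passing them to sheet.append() follows. The helper names and sample data are hypothetical, and treating any row identical to the first row as a repeated header is an assumption about the report layout.

```python
def clean_row(row):
    """Replace None with '' and collapse whitespace inside each cell."""
    return [" ".join(cell.split()) if cell else "" for cell in row]


def merge_page_tables(tables):
    """Concatenate tables from consecutive pages, keeping one header row."""
    merged = []
    for table in tables:
        # A page without a table contributes None, treated as no rows
        for row in (clean_row(r) for r in table or []):
            # Skip a header row repeated at the top of a later page
            if merged and row == merged[0]:
                continue
            merged.append(row)
    return merged


# Made-up data shaped like the output of page.extract_table()
page_1 = [["City", "Stores"], ["Beijing", "12"], ["Hong\nKong", None]]
page_2 = [["City", "Stores"], ["Shenzhen", "7"]]
merged = merge_page_tables([page_1, None, page_2])
print(merged)
```

Each row of the merged result is a plain list of strings, so it can be appended to the worksheet as-is.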

Extracting images from a PDF

import pathlib
import pdfplumber

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-疫情影響下的中國社區趨勢研究-艾瑞.pdf')
out_path = path.joinpath('002pdf_images.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[10]
    # Render the whole page and save it as a PNG
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Write out the data stream of each image embedded in the page
    for i, img in enumerate(page.images):
        # Raw stream bytes; they are not guaranteed to be PNG-encoded
        data = img['stream'].get_data()
        img_path = path.joinpath(f'002pdf_images_{i}.png')
        with open(img_path, 'wb') as fimg_out:
            fimg_out.write(data)

Image handling above relies on PIL (Pillow), which pdfplumber uses when rendering page images.
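
The bytes returned by img['stream'].get_data() are not necessarily PNG data: depending on how the image is stored in the PDF, they may be a complete JPEG file or raw pixel data. A minimal sketch that picks a file extension from well-known file signatures follows. The helper name guess_image_extension and the '.bin' fallback are assumptions made for this illustration.

```python
# Leading bytes ("magic numbers") of common image file formats
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
]


def guess_image_extension(data, default=".bin"):
    """Return a file extension based on the first bytes of data."""
    for magic, extension in SIGNATURES:
        if data.startswith(magic):
            return extension
    return default  # unknown format or raw pixel data


print(guess_image_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 8))
print(guess_image_extension(b"\x00\x01\x02"))
```

Data that falls through to the default is probably raw pixels and would need to be rebuilt into an image (for example with Pillow) using the width, height and color information from the PDF.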

Extracting charts from a PDF

Charts differ from images: they are graphics generated from data, such as histograms and pie charts.

import pathlib
import pdfplumber

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情對中國連鎖餐飲行業的影響調研報告-中國連鎖經營協會.pdf')
out_path = path.joinpath('002pdf_figures.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[7]
    # Render the whole page and save it as a PNG
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Crop each figure region out of the page and save it separately
    figures = page.figures
    for i, fig in enumerate(figures):
        crop = page.crop((fig['x0'], fig['top'], fig['x1'], fig['bottom']))
        img_crop = crop.to_image()
        fig_path = path.joinpath(f'002pdf_figures_{i}.png')
        img_crop.save(fig_path, format='PNG')
    # Outline words, images and figures on the page image for inspection
    im.draw_rects(page.extract_words(), stroke='yellow')
    im.draw_rects(page.images, stroke='blue')
    im.draw_rects(page.figures)
im  # show in notebook
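
A figure's bounding box can extend slightly beyond the page, and cropping with such a box may then fail or give an unexpected result. A minimal sketch of clamping a box to the page bounds before calling page.crop() follows. Boxes use pdfplumber's (x0, top, x1, bottom) order; the helper name clamp_bbox is hypothetical.

```python
def clamp_bbox(bbox, page_bbox):
    """Clip bbox to page_bbox; return None if nothing is left."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clipped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    # A box with no width or height lies outside the page
    if clipped[0] >= clipped[2] or clipped[1] >= clipped[3]:
        return None
    return clipped


page_box = (0, 0, 595, 842)  # roughly an A4 page measured in points
print(clamp_bbox((-5, 100, 300, 900), page_box))
print(clamp_bbox((600, 0, 700, 50), page_box))
```

In the listing above, the clamped box would be computed from the figure's coordinates and page.bbox, and figures for which the helper returns None would be skipped.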

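It is also worth noting that the PDF specification was driven by Adobe.

Normally there is no need to consult the specification, but in more complex scenarios, especially when no module supports the task directly, there is no choice but to dig through the documentation. The document is public; search for the keyword: pdf_reference_1-7.pdf.

That concludes this introduction to the pdfminer.six and pdfplumber modules in Python. Hopefully the content above is helpful; if you found the article useful, feel free to share it so that more people can see it.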
