How to parse PDF files in Java

Published: 2021-06-18 15:27:59 Source: Yisu Cloud (亿速云) Views: 707 Author: Leah Category: Big Data

This article looks at how to parse PDF files in Java. It covers three libraries, with the dependencies and sample code for each.

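Option 1: PDFBox, an open-source library from the Apache Software Foundation

API: https://pdfbox.apache.org/

Features: free and powerful, but Chinese text may come out garbled and the extracted layout is somewhat messy, less polished than the output of the Chinese-made library in Option 2.

It can also modify a PDF against a given template, adding or deleting content, so it covers far more than text extraction.

1 Dependencies required by PDFBox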

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.15</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/fontbox -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>fontbox</artifactId>
            <version>2.0.15</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/jempbox -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>jempbox</artifactId>
            <version>1.8.16</version>
        </dependency>
        <!-- Note: the PDF-writing methods in the code below use iText 5 classes
             (Document, PdfWriter, Chapter, BaseFont ...), which come from the
             com.itextpdf:itextpdf artifact. It is not listed in the original article. -->

2 Code

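import com.itextpdf.text.BaseColor;
import com.itextpdf.text.Chapter;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.List;
import com.itextpdf.text.ListItem;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.Section;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfWriter;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;

/**
 * Utility class for writing PDFs (with iText 5) and reading them (with PDFBox).
 * @CreateTime 2011-4-14 02:44:11 PM
 */
public class PDFUtil {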

    // public static final String CHARACTOR_FONT_CH_FILE = "SIMFANG.TTF";  // FangSong Regular
    public static final String CHARACTOR_FONT_CH_FILE = "SIMHEI.TTF";  // SimHei Regular

    public static final Rectangle PAGE_SIZE = PageSize.A4;
    public static final float MARGIN_LEFT = 50;
    public static final float MARGIN_RIGHT = 50;
    public static final float MARGIN_TOP = 50;
    public static final float MARGIN_BOTTOM = 50;
    public static final float SPACING = 20;


    private Document document = null;

    /**
     * Creates the target document that exported data is written to.
     * @param fileName path of the file to write
     */
    public void createDocument(String fileName) {
        File file = new File(fileName);
        document = new Document(PAGE_SIZE, MARGIN_LEFT, MARGIN_RIGHT, MARGIN_TOP, MARGIN_BOTTOM);
        try {
            // Bind a writer to the document so added content ends up in the file
            PdfWriter.getInstance(document, new FileOutputStream(file));
        } catch (FileNotFoundException | DocumentException e) {
            e.printStackTrace();
        }
        // Open the document so content can be added
        document.open();
    }

    /**
     * Writes a chapter to the PDF document.
     * @param chapter the chapter to add
     */
    public void writeChapterToDoc(Chapter chapter) {
        try {
            if(document != null) {
                if(!document.isOpen()) document.open();
                document.add(chapter);
            }
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }

    /**
     * Creates a chapter for the PDF document.
     * @param title chapter title
     * @param chapterNum chapter number
     * @param alignment 0 for left alignment, 1 for centered
     * @param numberDepth 1 to number the chapter (1. Chapter One; 1.1 Section One ...), 0 for no numbering
     * @param font font used for the title
     * @return the new Chapter
     */
    public static Chapter createChapter(String title, int chapterNum, int alignment, int numberDepth, Font font) {
        Paragraph chapterTitle = new Paragraph(title, font);
        chapterTitle.setAlignment(alignment);
        Chapter chapter = new Chapter(chapterTitle, chapterNum);
        chapter.setNumberDepth(numberDepth);
        return chapter;
    }

    /**
     * Creates a section under the given chapter.
     * @param chapter the parent chapter
     * @param title section title
     * @param font font used for the title
     * @param numberDepth 1 to number the section (1. Chapter One; 1.1 Section One ...), 0 for no numbering
     * @return the section appended to the chapter, or null if chapter is null
     */
    public static Section createSection(Chapter chapter, String title, Font font, int numberDepth) {
        Section section = null;
        if(chapter != null) {
            Paragraph sectionTitle = new Paragraph(title, font);
            sectionTitle.setSpacingBefore(SPACING);
            section = chapter.addSection(sectionTitle);
            section.setNumberDepth(numberDepth);
        }
        return section;
    }

    /**
     * Creates a piece of content to add to the PDF document.
     * @param text the content
     * @param font font for the content
     * @return a phrase holding the text in the given font
     */
    public static Phrase createPhrase(String text, Font font) {
        return new Paragraph(text, font);
    }

    /**
     * Creates a list.
     * @param numbered true to create a numbered list
     * @param lettered true to number items with letters, false to use digits
     * @param symbolIndent indentation of the list symbol
     * @return the new list
     */
    public static List createList(boolean numbered, boolean lettered, float symbolIndent) {
        List list = new List(numbered, lettered, symbolIndent);
        return list;
    }

    /**
     * Creates an item for a list.
     * @param content text of the list item
     * @param font font for the item
     * @return the new list item
     */
    public static ListItem createListItem(String content, Font font) {
        ListItem listItem = new ListItem(content, font);
        return listItem;
    }

    /**
     * Creates a font.
     * @param fontname font name
     * @param size font size
     * @param style font style
     * @param color font color
     * @return Font
     */
    public static Font createFont(String fontname, float size, int style, BaseColor color) {
        Font font =  FontFactory.getFont(fontname, size, style, color);
        return font;
    }

    /**
     * Returns a font that supports Chinese. The font file is the one named by
     * CHARACTOR_FONT_CH_FILE (currently SimHei; the original comment said FangSong).
     * @param size font size
     * @param style font style
     * @param color font color
     * @return the font
     */
    public static Font createCHineseFont(float size, int style, BaseColor color) {
        BaseFont bfChinese = null;
        try {
            bfChinese = BaseFont.createFont(CHARACTOR_FONT_CH_FILE, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        } catch (DocumentException | IOException e) {
            e.printStackTrace();
        }
        return new Font(bfChinese, size, style, color);
    }

    /**
     * Closes the PDF document once all content has been written.
     */
    public void closeDocument() {
        if(document != null) {
            document.close();
        }
    }


    /**
     * Reads a PDF file and extracts its text, using the open-source PDFBox project.
     * @param fileName path of the PDF to read
     */
    public static void readPDF(String fileName) {
        File file = new File(fileName);
        try {
            // Create a PDF parser over the file, opened read-only
            // (RandomAccessFile here is org.apache.pdfbox.io.RandomAccessFile)
            PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
            // Parse the PDF file
            parser.parse();
            // Get the parsed document; try-with-resources closes it and the writer
            try (PDDocument pdfDocument = parser.getPDDocument();
                 FileWriter fileWriter = new FileWriter(new File("pdf.txt"))) {
                // Create a text stripper and extract the text from the document
                PDFTextStripper stripper = new PDFTextStripper();
                String result = stripper.getText(pdfDocument);
                fileWriter.write(result);
                System.out.println("Text content of the PDF file:");
                System.out.println(result);
            }
        } catch (Exception e) {
            System.out.println("Failed to read PDF file " + file.getAbsolutePath() + ": " + e);
            e.printStackTrace();
        }
    }


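    /**
     * Tests PDF creation.
     * @param args unused
     */
    public static void main(String[] args) {
        // Backslashes must be escaped in Java string literals.
        // The parent folder has to exist before running this.
        String fileName = "C:\\Users\\tizzy\\Desktop\\测试.pdf";
        PDFUtil pdfUtil = new PDFUtil();
        pdfUtil.createDocument(fileName);
        // The original never defined "chapter"; this sample chapter is illustrative
        Font font = createFont(FontFactory.HELVETICA, 12, Font.NORMAL, BaseColor.BLACK);
        Chapter chapter = createChapter("Chapter One", 1, 0, 1, font);
        chapter.add(createPhrase("Hello PDF", font));
        pdfUtil.writeChapterToDoc(chapter);
        pdfUtil.closeDocument();
    }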
}
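
Since the text that PDFBox extracts tends to have a messy layout, it can help to tidy it before saving or printing it. The following is a minimal sketch using only the JDK. The class name `ExtractedTextCleaner` and its rules (trim trailing whitespace, collapse runs of blank lines) are assumptions chosen for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: tidies text returned by a PDF text extractor.
public class ExtractedTextCleaner {

    /** Trims trailing whitespace on each line and collapses runs of blank lines into one. */
    public static String normalize(String raw) {
        // \R matches any line break, including \r\n as a single unit
        String[] lines = raw.split("\\R", -1);
        List<String> out = new ArrayList<>();
        boolean previousBlank = true; // start as "blank" so leading blank lines are dropped
        for (String line : lines) {
            String trimmed = stripTrailing(line);
            boolean blank = trimmed.isEmpty();
            if (blank && previousBlank) {
                continue; // skip repeated blank lines
            }
            out.add(trimmed);
            previousBlank = blank;
        }
        // Drop a single trailing blank line, if any
        if (!out.isEmpty() && out.get(out.size() - 1).isEmpty()) {
            out.remove(out.size() - 1);
        }
        return String.join("\n", out);
    }

    // Removes whitespace at the end of a line (kept manual so it also works on Java 8)
    private static String stripTrailing(String s) {
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }
}
```

In `readPDF`, the call would look like `String cleaned = ExtractedTextCleaner.normalize(result);` before the text is written out.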

Option 2: Spire.PDF, a Chinese-made library

It comes in two editions.

1 Free edition
https://www.e-iceblue.cn/Downloads/Free-Spire-PDF-JAVA.html

2 Commercial edition
https://www.e-iceblue.cn/Introduce/Spire-PDF-JAVA.html

API
http://e-iceblue.cn/licensing/install-spirepdf-for-java-from-maven-repository.html

1 Repository and dependency

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.pdf.free</artifactId>
    <version>2.2.2</version>
</dependency>

2 Code

	// Needs: com.spire.pdf.PdfDocument, com.spire.pdf.PdfPageBase, java.io.FileWriter, java.io.IOException
	public static void main(String[] args) {
		// Create a PdfDocument instance
		PdfDocument doc = new PdfDocument();
		// Load the PDF file
		doc.loadFromFile("D:\\JAVA核心知识点整理.pdf");
		StringBuilder sb = new StringBuilder();
		int totalPageCount = doc.getPages().getCount();
		// Iterate over the pages and collect their text (page indices start at 0)
		for (int i = 0; i < totalPageCount; i++) {
			PdfPageBase page = doc.getPages().get(i);
			System.out.println("pageNo:" + i);
			sb.append(page.extractText(true));
		}
		// Write the text to a file; try-with-resources closes the writer,
		// which the original code never did
		try (FileWriter writer = new FileWriter("ExtractText.txt")) {
			writer.write(sb.toString());
		} catch (IOException e) {
			e.printStackTrace();
		}
		doc.close();
	}
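
One possible cause of garbled Chinese in the output file is that `FileWriter` encodes with the platform default charset on older JDKs. The sketch below writes and reads the text explicitly as UTF-8 using only the JDK. The class name `Utf8TextFile` is hypothetical, and whether this removes the garbling depends on where it actually originates, since the extraction step itself can also be the source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: stores extracted text with an explicit UTF-8 encoding.
public class Utf8TextFile {

    /** Writes the text to the target file as UTF-8, replacing any existing content. */
    public static void save(Path target, String text) {
        try {
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back, decoding it as UTF-8. */
    public static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this helper, the write step above would become `Utf8TextFile.save(Paths.get("ExtractText.txt"), sb.toString());`.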

Option 3: parsing PDFs with Apache Tika

API: https://tika.apache.org/

Its Chinese support is not very good, and the extracted layout is similar to PDFBox's.

1 Dependency

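        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
        <!-- Note: tika-core contains only the core API; the PDF parser itself is in the
             separate tika-parsers artifact, which the original article does not list. -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.20</version>
        </dependency>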

2 Code

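// Note: PdfReader and PdfReaderContentParser are iText 5 classes, not Tika classes,
// so this snippet needs the com.itextpdf:itextpdf dependency rather than tika-core.
// Needs: com.itextpdf.text.pdf.PdfReader, com.itextpdf.text.pdf.parser.*, java.io.IOException
public static String getPdfFileText(String fileName) throws IOException {
    PdfReader reader = new PdfReader(fileName);
    try {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        StringBuilder buff = new StringBuilder();
        // iText page numbers start at 1
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            TextExtractionStrategy strategy =
                    parser.processContent(i, new SimpleTextExtractionStrategy());
            buff.append(strategy.getResultantText());
        }
        return buff.toString();
    } finally {
        // Release the file handle held by the reader
        reader.close();
    }
}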
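
Whichever of the three libraries is used, it can be useful to confirm that the input really is a PDF before handing it to a parser. A PDF file starts with the header `%PDF-` followed by a version number. The sketch below checks for that header with the JDK only. The class name `PdfSignature` is hypothetical, and the check is deliberately strict: it requires the header at offset 0, whereas some readers tolerate a few leading bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether a byte buffer starts with the PDF header.
public class PdfSignature {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes of a file begin with "%PDF-". */
    public static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first few bytes of the file, for example with a `FileInputStream`, pass them to `looksLikePdf`, and skip parsing when it returns false.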

That covers how to parse PDF files in Java. Some of these techniques are likely to come up in day-to-day work, and hopefully this article serves as a useful starting point.
