Reading the body tag and its contents from an HTML file with Java

Published: 2020-11-09 15:31:58  Source: 億速云  Views: 1504  Author: Leah  Category: Development Technology

How can you read the body tag and its contents from an HTML file in Java? This article walks through the problem and one solution in detail, in the hope of helping anyone facing the same task find a simpler approach.

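The code here retrieves every tag and all content inside the body of an HTML file. It reads the file line by line and collects everything between the line containing <body and the line containing </body, so it assumes each of those tags sits on its own line.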
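
To make the goal concrete, here is a minimal sketch of the same idea applied to a string that is already in memory. It is not the article's implementation: the names BodyExtractor and extractBody are hypothetical, and it assumes the whole file fits in a String. A regular expression is only an approximation of HTML parsing, so treat it as an illustration rather than a robust parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, introduced only for illustration.
class BodyExtractor {

  // <body ...> with optional attributes, then the shortest run of text up to </body>.
  // CASE_INSENSITIVE accepts <BODY>; DOTALL lets '.' match line breaks.
  private static final Pattern BODY = Pattern.compile(
      "<body\\b[^>]*>(.*?)</body\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  /** Returns the markup between the body tags, or an empty string if none is found. */
  static String extractBody(String html) {
    Matcher m = BODY.matcher(html);
    return m.find() ? m.group(1) : "";
  }
}
```

Unlike the line-based method, this variant also keeps content that shares a line with the opening or closing body tag.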

package com.lmt.service.file;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.springframework.stereotype.Component;

@Component
public class ParseFile {

  /**
   * Parses an HTML file and returns the lines between the <body> and </body> lines.
   * @param file the HTML file to read
   * @return the body content, or whatever was read so far if an I/O error occurs
   */
  public String readHtml(File file) {
    StringBuilder body = new StringBuilder();
    // try-with-resources closes the reader even if an exception is thrown.
    // UTF-8 is an assumption; use the encoding your files are actually saved in.
    try (BufferedReader htmlReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {

      String line;
      boolean found = false;
      // Skip ahead to the line containing the opening <body tag
      // (it may be preceded by whitespace)
      while (!found && (line = htmlReader.readLine()) != null) {
        if (line.toLowerCase().contains("<body")) {
          found = true;
        }
      }

      found = false;
      while (!found && (line = htmlReader.readLine()) != null) {
        if (line.toLowerCase().contains("</body")) {
          found = true;
        } else {
          // If the line contains an image, convert its relative path to an absolute one
          if (line.toLowerCase().contains("src")) {

            // Location the images are served from
            String directory = "D:/test";
            // The prefix is concatenated directly with the relative path,
            // so append a slash if the directory does not end with one
            if (!directory.endsWith("/")) {
              directory = directory + "/";
            }

            // A line may contain several <img> elements, so split and handle each piece
            String[] splitLines = line.split("(?i)<img\\s+"); // <img followed by one or more whitespace characters
            // An indexed loop is required: assigning to a for-each variable would not update the array
            for (int i = 0; i < splitLines.length; i++) {
              String lower = splitLines[i].toLowerCase();
              // Only handles src as the first attribute; 5 is the length of src=" (or src=')
              if (lower.startsWith("src=\"") || lower.startsWith("src='")) {
                splitLines[i] = splitLines[i].substring(0, 5) + directory + splitLines[i].substring(5);
              }
            }

            // Re-join the pieces, restoring the "<img " removed by split
            // (nothing is appended after the last piece)
            line = String.join("<img ", splitLines);
          }

          body.append(line).append("\n");
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    return body.toString();
  }
  
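  /**
   * Extracts the image file name from a line of HTML.
   * @param htmlLine a fragment of HTML containing an <img> element
   * @return the file name, or an empty string if no quoted src value is found
   */
  public static String extractFilename(String htmlLine) {
    int srcIndex = htmlLine.toLowerCase().indexOf("src=");
    if (srcIndex == -1) { // no image present, return an empty string
      return "";
    } else {
      String htmlSrc = htmlLine.substring(srcIndex + 4);
      char splitChar = '\"'; // double quote by default, but it may also be a single quote
      if (!htmlSrc.isEmpty() && htmlSrc.charAt(0) == '\'') {
        splitChar = '\'';
      }
      if (htmlSrc.isEmpty() || htmlSrc.charAt(0) != splitChar) {
        return ""; // the value is not quoted, so it cannot be split reliably
      }
      String[] firstSplit = htmlSrc.split(String.valueOf(splitChar));
      if (firstSplit.length < 2) { // empty or unterminated src value
        return "";
      }
      String path = firstSplit[1]; // element 0 is the empty string before the opening quote
      String[] secondSplit = path.split("[/\\\\]"); // split on forward slash or backslash
      return secondSplit.length == 0 ? "" : secondSplit[secondSplit.length - 1];
    }
  }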
  
}
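
The readHtml method above only rewrites a src attribute when it is the first attribute after <img. As a hedged alternative, the sketch below uses a regular expression to prefix the src of every <img> tag regardless of attribute order. The class name ImgSrcPrefixer, the method prefixImageSources, and its directory parameter are hypothetical names introduced for illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, introduced only for illustration.
class ImgSrcPrefixer {

  // group(1): the <img tag up to and including the opening quote of src
  // group(2): the quote character itself (single or double)
  // group(3): the original path inside the quotes
  private static final Pattern IMG_SRC = Pattern.compile(
      "(<img\\b[^>]*?\\ssrc\\s*=\\s*(['\"]))([^'\"]*)", Pattern.CASE_INSENSITIVE);

  /** Prepends directory to the quoted src value of every <img> tag in html. */
  static String prefixImageSources(String html, String directory) {
    // Make sure the prefix and the relative path are joined by exactly one separator.
    String prefix = directory.endsWith("/") ? directory : directory + "/";
    Matcher m = IMG_SRC.matcher(html);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String replacement = m.group(1) + prefix + m.group(3);
      // quoteReplacement keeps '$' and '\' in paths from being read as group references.
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

For example, passing <img alt="x" src="a.png"> with the directory D:/test should produce <img alt="x" src="D:/test/a.png">. Unquoted src values are left untouched by this sketch.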

Supplementary note: StandardEngine[Catalina].StandardHost[localhost].StandardContext[]

Cause: the jar was not imported correctly.

1. Add the jar in the build path.

2. If it is not added here, the jar will not be included when the project is compiled.

3. If it still does not work, delete the user jar and import it again.
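
One way to confirm that a jar actually reached the classpath after following the steps above is to try loading one of its classes. The following is a minimal sketch under that assumption; ClasspathCheck and isLoadable are hypothetical names, and the class name you pass in must be one you know lives in the jar in question.

```java
// Hypothetical diagnostic helper, introduced only for illustration.
class ClasspathCheck {

  /** Returns true if the named class can be found by this class's class loader. */
  static boolean isLoadable(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initializers.
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // Not found, or found but one of its own dependencies is missing.
      return false;
    }
  }
}
```

Calling isLoadable with a fully qualified class name from the jar and getting false suggests the jar is still missing at runtime. Inside a servlet container the web application may use a different class loader, so the result is only a hint.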

That concludes the answer on reading the body tag and its contents from an HTML file with Java. Hopefully the above is of some help. If you still have open questions, you can follow the 億速云 industry news channel to learn more.
