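How to Fix Garbled Text When Reading HDFS File Data with Java

Published: 2020-11-17 14:42:22 Source: 億速云 Views: 273 Author: Leah Category: Development

How do you fix garbled text when reading HDFS file data with Java? This article works through the problem and its solution in detail, in the hope of helping others who hit the same issue find a simpler way out.

Pitfalls: garbled text when reading HDFS files with the Java API

I wanted to write an endpoint that reads part of a file on HDFS for preview. After implementing it from blog posts found online, I noticed that the text sometimes came back garbled. For example, when reading a CSV file whose strings are separated by commas:

An English string such as aaa displays correctly
A Chinese string such as “你好” displays correctly
A mixed Chinese-English string such as “aaa你好” comes out garbled

I went through many blog posts, and the fix was almost always "decode with charset xxx". Skeptical, I tried them one by one, and sure enough none of them worked.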

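Approach

HDFS stores files as raw bytes and keeps no record of their character encoding, and each local file may well use a different encoding. Uploading a local file simply ships its encoded bytes to the file system for storage. So when reading file data back, faced with byte streams from different files in different encodings, no single fixed charset can decode them all correctly.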
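To make the failure mode concrete, here is a minimal sketch using only the JDK. The class name MojibakeDemo and the choice of ISO-8859-1 as the "wrong" charset are assumptions for illustration, not taken from the original setup. It shows that the stored bytes are fine and that only the decoding step decides whether the text is readable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same bytes decoded with the right and the wrong charset.
final class MojibakeDemo {
    // Decode raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "aaa" followed by the two Chinese characters U+4F60 U+597D,
        // written as escapes so the source file encoding does not matter.
        String original = "aaa\u4f60\u597d";
        byte[] stored = original.getBytes(StandardCharsets.UTF_8); // what the file system keeps

        String right = decode(stored, StandardCharsets.UTF_8);
        String wrong = decode(stored, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + right.equals(original));
        System.out.println("ISO-8859-1 decode intact: " + wrong.equals(original));
    }
}
```

The bytes never change between the two calls; the second decode is garbled only because the reader guessed the wrong charset.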

So there are really two options:

Fix the charset used for encoding and decoding on HDFS. Say I pick UTF-8: every file is transcoded on upload, that is, the bytes of each file are converted to UTF-8 before being stored. Reading then works by decoding with UTF-8. The transcoding step still has plenty of problems of its own, though, and is hard to implement.
Decode dynamically. Pick the charset that matches each file's encoding when decoding. This leaves the file's original bytes untouched and rarely produces garbled text.
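The first option can be sketched roughly as follows, using only the JDK. TranscodeOnUpload and toUtf8 are hypothetical names. The sketch assumes the caller already knows the source charset, which is exactly the part that is hard in practice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: normalize file bytes to UTF-8 before they are uploaded.
final class TranscodeOnUpload {
    // Decode with the (assumed known) source charset, then re-encode as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset sourceCharset) {
        return new String(raw, sourceCharset).getBytes(StandardCharsets.UTF_8);
    }
}
```

If sourceCharset is wrong, the damage is written into the stored file permanently, which is one reason to prefer decoding dynamically at read time.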

I went with dynamic decoding. The hard part is deciding which charset to decode with. The material below gave me the solution.
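Before reaching for a detection library, it may help to see the simplest form the decision can take. The sketch below is not the method used in this article. It is a plain JDK alternative that tries a caller-supplied list of candidate charsets in order and keeps the first one that decodes the bytes without errors. StrictDecodeProbe and firstValidCharset are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: pick a charset by strict trial decoding.
final class StrictDecodeProbe {
    // Returns the first candidate that decodes every byte without
    // malformed or unmappable input, or empty if none does.
    static Optional<Charset> firstValidCharset(byte[] bytes, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                return Optional.of(candidate);
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

This only proves that the bytes are valid in a charset, not that the charset is the right one. Single-byte charsets such as ISO-8859-1 accept any input, so they belong at the end of the list, if at all. Statistical detectors such as the ones below exist to cover that gap.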

Detecting the encoding of text (a byte stream) in Java

Requirement:

Detect the encoding of a given file or byte stream.

Implementation:

Based on jchardet

<dependency>
	<groupId>net.sourceforge.jchardet</groupId>
	<artifactId>jchardet</artifactId>
	<version>1.0</version>
</dependency>

The code is as follows:

import java.io.BufferedInputStream;
import java.io.InputStream;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public class DetectorUtils {
	private DetectorUtils() {
	}

	// Callback that jchardet invokes once it is confident about a charset.
	static class ChineseCharsetDetectionObserver implements
			nsICharsetDetectionObserver {
		private boolean found = false;
		private String result;

		@Override
		public void Notify(String charset) {
			found = true;
			result = charset;
		}

		public ChineseCharsetDetectionObserver(boolean found, String result) {
			super();
			this.found = found;
			this.result = result;
		}

		public boolean isFound() {
			return found;
		}

		public String getResult() {
			return result;
		}

	}

	// Returns the detected charset, or the list of probable charsets when the
	// detector is not certain. The input stream is closed before returning.
	public static String[] detectChineseCharset(InputStream in)
			throws Exception {
		String[] prob = null;
		// Initial observer state; overwritten as soon as Notify is called.
		boolean found = false;
		String result = "UTF-8";
		int lang = nsPSMDetector.CHINESE;
		nsDetector det = new nsDetector(lang);
		ChineseCharsetDetectionObserver detectionObserver = new ChineseCharsetDetectionObserver(
				found, result);
		det.Init(detectionObserver);
		boolean isAscii = true;
		// try-with-resources closes the buffered stream and the wrapped stream.
		try (BufferedInputStream imp = new BufferedInputStream(in)) {
			byte[] buf = new byte[1024];
			int len;
			while ((len = imp.read(buf, 0, buf.length)) != -1) {
				// Keep checking for pure ASCII until a non-ASCII byte shows up.
				if (isAscii)
					isAscii = det.isAscii(buf, len);
				if (!isAscii) {
					// DoIt returns true once the detector has made up its mind.
					if (det.DoIt(buf, len, false))
						break;
				}
			}
		}

		det.DataEnd();
		boolean isFound = detectionObserver.isFound();
		if (isAscii) {
			prob = new String[] { "ASCII" };
		} else if (isFound) {
			prob = new String[] { detectionObserver.getResult() };
		} else {
			prob = det.getProbableCharsets();
		}
		return prob;
	}
}

Test:

		String file = "C:/3737001.xml";
		String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
		for (String charset : probableSet) {
			System.out.println(charset);
		}

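Packages for detecting the encoding of a byte stream already exist: jchardet (used above) and juniversalchardet, which was hosted on Google Code and provides the UniversalDetector class. Both are Java ports of Mozilla's charset detector. So the plan is clear: read some of the file's bytes, detect the encoding with the tool, then decode accordingly.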
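The "read some of the file's bytes" step can be sketched with the JDK alone. PrefixReader, readPrefix and the maxBytes parameter are hypothetical names, and the sketch works on any InputStream, not specifically an HDFS stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: read only the head of a stream for detection or preview.
final class PrefixReader {
    // Reads at most maxBytes from the stream; fewer if the stream ends first.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the limit
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

A fixed-size prefix can end in the middle of a multi-byte character, so the last decoded character of a preview may be a replacement character. Dropping the final, possibly incomplete line is one way to hide that.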

Concrete solution code

pom

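<!-- UniversalDetector, used by guessEncoding below, comes from juniversalchardet, not jchardet -->
<dependency>
	<groupId>com.googlecode.juniversalchardet</groupId>
	<artifactId>juniversalchardet</artifactId>
	<version>1.0.3</version>
</dependency>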

Logic for reading part of a file from HDFS for preview

 // Fetch part of a file's data for preview
 public List<String> getFileDataWithLimitLines(String filePath, Integer limit) {
  FSDataInputStream fileStream = openFile(filePath);
  return readFileWithLimit(fileStream, limit);
 }

 // Open the file as a data stream; returns null if the file cannot be opened
 private FSDataInputStream openFile(String filePath) {
  FSDataInputStream fileStream = null;
  try {
   fileStream = fs.open(new Path(getHdfsPath(filePath)));
  } catch (IOException e) {
   logger.error("fail to open file:{}", filePath, e);
  }
  return fileStream;
 }
 
 // Read at most limit lines of file data
 private List<String> readFileWithLimit(FSDataInputStream fileStream, Integer limit) {
  byte[] bytes = readByteStream(fileStream);
  String data = decodeByteStream(bytes);
  if (data == null) {
   return null;
  }

  // Accept both CRLF and LF line endings
  List<String> rows = Arrays.asList(data.split("\\r?\\n"));
  return rows.stream().filter(StringUtils::isNotEmpty)
    .limit(limit)
    .collect(Collectors.toList());
 }

 // Read the raw bytes from the file stream and close it afterwards
 private byte[] readByteStream(FSDataInputStream fileStream) {
  if (fileStream == null) {
   return null; // openFile failed and has already logged the error
  }
  byte[] bytes = new byte[1024*30];
  int len;
  ByteArrayOutputStream stream = new ByteArrayOutputStream();
  try (FSDataInputStream in = fileStream) {
   while ((len = in.read(bytes)) != -1) {
    stream.write(bytes, 0, len);
   }
  } catch (IOException e) {
   logger.error("read file bytes stream failed.", e);
   return null;
  }
  return stream.toByteArray();
 }

 // Decode the byte stream
 private String decodeByteStream(byte[] bytes) {
  if (bytes == null) {
   return null;
  }

  String encoding = guessEncoding(bytes);
  String data = null;
  try {
   data = new String(bytes, encoding);
  } catch (Exception e) {
   logger.error("decode byte stream failed.", e);
  }
  return data;
 }

 // Guess the encoding with juniversalchardet's UniversalDetector; fall back to UTF-8
 private String guessEncoding(byte[] bytes) {
  UniversalDetector detector = new UniversalDetector(null);
  detector.handleData(bytes, 0, bytes.length);
  detector.dataEnd();
  String encoding = detector.getDetectedCharset();
  detector.reset();

  if (StringUtils.isEmpty(encoding)) {
   encoding = "UTF-8";
  }
  return encoding;
 }
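As an alternative to splitting the decoded string by hand, the line handling can be left to the JDK. The sketch below is a hedged variant, not the author's code: PreviewLines and firstLines are hypothetical names, and the charset is assumed to have been detected already. BufferedReader treats \r\n, \n and \r all as line terminators.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: turn already-read bytes into at most `limit` non-empty lines.
final class PreviewLines {
    static List<String> firstLines(byte[] bytes, Charset charset, int limit) {
        // The reader wraps an in-memory array, so leaving it unclosed leaks nothing.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset));
        return reader.lines()
                .filter(line -> !line.isEmpty())
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

With a detected charset name in hand, a call might look like PreviewLines.firstLines(bytes, Charset.forName(encoding), limit).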

That concludes the answer to how to fix garbled text when reading HDFS file data with Java. I hope the above is of some help.
