This article walks through the setup flow for ingesting data into a Hive data warehouse with Flume. The steps are laid out simply, so follow along to study the full pipeline end to end.
Real-time stream ingestion into the warehouse is standard at most large companies. Flume has supported the taildir source since release 1.7, and it is widely used because of the following characteristics (a sample of the position file it maintains appears after the configuration in 1.1):
- Matches file names in a directory with a regular expression
- As soon as data is written to a watched file, Flume ships it to the configured sink
- High reliability: it does not lose data
- It leaves the tracked files untouched, neither renaming nor deleting them
- It does not support Windows and cannot read binary files; it reads text files line by line
This article uses open-source Flume as the example, showing how the stream lands in HDFS; an ODS-layer external table is then built on top of the landing directory.
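For reference, here is a minimal sketch of what that ODS external table could look like, assuming the raw lines are kept as a single string column and partitioned by day. The database, table, and column names (ods.ods_start_log, line, dt) and the sample date are illustrative, not from the original setup:

CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_start_log (
  line string
)
PARTITIONED BY (dt string)
LOCATION '/user/data/logs/start/logs/start/';

-- Register a day's directory as a partition after the files land:
ALTER TABLE ods.ods_start_log ADD IF NOT EXISTS PARTITION (dt='2020-08-06')
LOCATION '/user/data/logs/start/logs/start/2020-08-06/';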
1.1 taildir source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/hoult/servers/conf/startlog_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/hoult/servers/logs/start/.*log
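The position file is where taildir records how far it has read in each tracked file, which is what makes it safe across agent restarts. Its contents are a JSON array of inode, byte offset, and path per file; a sample with illustrative values:

[{"inode":2496001,"pos":12345,"file":"/opt/hoult/servers/logs/start/start.log"}]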
1.2 hdfs sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/start/logs/start/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = startlog.
# Roll files by size only: 32 MB (33554432 = 32 * 1024 * 1024); count- and time-based rolling are disabled
a1.sinks.k1.hdfs.rollSize = 33554432
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of events flushed to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 100
# Use local time to resolve the %Y-%m-%d escapes in the path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
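With these settings, each day's events land under a dated directory, and the sink names each output file from the startlog. prefix plus a counter; while a file is still open it carries a .tmp suffix by default. Roughly, the layout looks like this (the <counter> placeholder stands in for the epoch-based value Flume generates):

/user/data/logs/start/logs/start/2020-08-06/
    startlog.<counter>          # closed file, rolled at ~32 MB
    startlog.<counter>.tmp      # file currently being written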
1.3 Agent configuration
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# taildir source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/hoult/servers/conf/startlog_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/hoult/servers/logs/start/.*log

# memory channel (transactionCapacity must be >= the sink's batchSize)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 2000

# hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/start/logs/start/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = startlog.
# Roll files by size only (32 MB)
a1.sinks.k1.hdfs.rollSize = 33554432
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of events flushed to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 1000
# Use local time to resolve the %Y-%m-%d escapes in the path
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
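One caveat: a memory channel drops any in-flight events if the agent process dies, so taildir's "no data loss" property only holds end to end with a durable channel. When that matters, a file channel is the usual swap; a minimal sketch, with illustrative local paths:

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/hoult/servers/flume/checkpoint
a1.channels.c1.dataDirs = /opt/hoult/servers/flume/data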
Save the configuration as /opt/hoult/servers/conf/flume-log2hdfs.conf.
1.4 Startup
flume-ng agent --conf-file /opt/hoult/servers/conf/flume-log2hdfs.conf -name a1 -Dflume.root.logger=INFO,console

Add the following JVM options to $FLUME_HOME/conf/flume-env.sh first, otherwise the agent errors out on startup:

export JAVA_OPTS="-Xms4000m -Xmx4000m -Dcom.sun.management.jmxremote"

# For flume-env.sh to take effect, the conf directory must also be passed on the command line:
flume-ng agent --conf /opt/hoult/servers/flume-1.9.0/conf --conf-file /opt/hoult/servers/conf/flume-log2hdfs.conf -name a1 -Dflume.root.logger=INFO,console
1.5 Using a custom interceptor so the Flume agent stamps events with the timestamp from the log instead of local time
Test it with a netcat source → logger sink:
# a1 is the agent name; the source, channel, and sink are r1, c1, and k1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source
a1.sources.r1.type = netcat
a1.sources.r1.bind = linux121
a1.sources.r1.port = 9999
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.hoult.flume.CustomerInterceptor$Builder

# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# sink
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The interceptor's core code is as follows:
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Charsets;
import com.google.common.base.Strings;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CustomerInterceptor implements Interceptor {
    private static final DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMMdd");

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        // Read the event body
        String eventBody = new String(event.getBody(), Charsets.UTF_8);
        // Read the event headers
        Map<String, String> headerMap = event.getHeaders();
        final String[] bodyArr = eventBody.split("\\s+");
        try {
            String jsonStr = bodyArr[6];
            if (Strings.isNullOrEmpty(jsonStr)) {
                return null;
            }
            // Parse the string into a JSON object
            JSONObject jsonObject = JSON.parseObject(jsonStr);
            String timestampStr = jsonObject.getString("time");
            // Convert the epoch-millis timestamp into a date string (format: yyyyMMdd)
            long timeStamp = Long.valueOf(timestampStr);
            String date = formatter.format(
                    LocalDateTime.ofInstant(Instant.ofEpochMilli(timeStamp), ZoneId.systemDefault()));
            headerMap.put("logtime", date);
            event.setHeaders(headerMap);
        } catch (Exception e) {
            headerMap.put("logtime", "unknown");
            event.setHeaders(headerMap);
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>();
        for (Event event : events) {
            Event outEvent = intercept(event);
            if (outEvent != null) {
                out.add(outEvent);
            }
        }
        return out;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new CustomerInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
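A quick way to sanity-check the interceptor outside Flume is to hand it a synthetic event. The snippet below is a hypothetical smoke test, not part of the original article; it assumes the same log layout the interceptor does, namely that the JSON payload is the seventh whitespace-separated field (per bodyArr[6]):

import com.google.common.base.Charsets;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class CustomerInterceptorDemo {
    public static void main(String[] args) {
        // Six filler fields, then the JSON payload carrying the epoch-millis "time"
        String line = "f1 f2 f3 f4 f5 f6 {\"time\":1596687840000}";
        Event event = EventBuilder.withBody(line, Charsets.UTF_8);

        Event out = new CustomerInterceptor().intercept(event);
        // Prints the yyyyMMdd date derived from "time" (e.g. 20200806, zone-dependent)
        System.out.println(out.getHeaders().get("logtime"));
    }
}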
Start it:
flume-ng agent --conf /opt/hoult/servers/flume-1.9.0/conf --conf-file /opt/hoult/servers/conf/flume-test.conf -name a1 -Dflume.root.logger=INFO,console

## Test
telnet linux121 9999
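The logger sink above only shows that the logtime header gets set. To actually consume it in the HDFS sink from 1.2, the header can drive the output path through Flume's %{header} escape, replacing the local-time bucketing; a sketch (note the header is formatted yyyyMMdd, so the day directories change shape accordingly):

a1.sinks.k1.hdfs.path = /user/data/logs/start/logs/start/%{logtime}/
a1.sinks.k1.hdfs.useLocalTimeStamp = false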
That wraps up the walkthrough of ingesting data into a Hive warehouse with Flume. The steps above should give you a working picture of the flow; the details are best confirmed by running it yourself.