Problems Encountered When Integrating Spark Streaming with Kafka, and Their Solutions

Published: 2021-12-07 11:30:05  Source: Yisu Cloud  Views: 137  Author: 柒染  Category: Big Data


Preface

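At work I have recently been building a log analysis platform on Spark Streaming + Kafka. Kafka was chosen mainly for its high performance on large data volumes, which makes it an excellent fit for log-style applications. Spark Streaming was chosen as the stream processing framework mainly because it is built on Spark core, so later batch processing can be served by the same stack. It can also deliver data to Elasticsearch in near real time, which makes near-real-time lookup of system logs possible.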

Implementation

Spark Streaming can read data from Kafka in two ways: the Receiver-based approach and the Direct approach.

I. The Receiver-based approach

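This approach uses a Receiver to fetch data. The Receiver is implemented with Kafka's high-level Consumer API. The data the receiver fetches from Kafka is stored in the memory of the Spark executors, and the jobs launched by Spark Streaming then process that data. The code is as follows: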

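    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    SparkConf sparkConf = new SparkConf().setAppName("log-etl").setMaster("local[4]");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

    // Map of topic name -> number of consumer threads for that topic
    int numThreads = 4;
    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    topicMap.put("group-45", numThreads);

    // Arguments: JavaStreamingContext, ZooKeeper connection string, group id, topic map
    JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
        jssc, "172.16.206.27:2181,172.16.206.28:2181,172.16.206.29:2181", "1", topicMap);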

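At first the system ran normally and no problems showed up. However, when the Spark Streaming program was restarted after a failure, it reprocessed data that had already been processed. The receiver-based approach uses Kafka's high-level API to store consumed offsets in ZooKeeper, which is the traditional way of consuming Kafka data. Combined with the write-ahead log (WAL), it can guarantee zero data loss, but it cannot guarantee that each record is processed exactly once: a record may be processed twice, because Spark and ZooKeeper can get out of sync. The official documentation no longer recommends this integration (see http://spark.apache.org/docs/latest/streaming-kafka-integration.html ). Below we use the second approach, which the official site recommends: KafkaUtils.createDirectStream().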
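The duplicate processing described above comes from the gap between handling a batch and recording its offset. The following is a minimal, self-contained sketch of that failure mode, not Spark or Kafka code. The class `ReceiverReplayDemo`, the in-memory list standing in for a topic partition, and the `committedOffset` field standing in for the offset kept in ZooKeeper are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: one partition, with the committed offset kept separately
// from the processing step (as with offsets stored in ZooKeeper).
final class ReceiverReplayDemo {
    private final List<String> partition;                      // stands in for a Kafka partition
    private int committedOffset = 0;                           // stands in for the offset in ZooKeeper
    private final List<String> processed = new ArrayList<>();  // everything the job has handled

    ReceiverReplayDemo(List<String> partition) {
        this.partition = partition;
    }

    // Process up to batchSize messages starting at the committed offset.
    // If crashBeforeCommit is true, the work is done but the offset is not advanced.
    void runBatch(int batchSize, boolean crashBeforeCommit) {
        int until = Math.min(committedOffset + batchSize, partition.size());
        processed.addAll(partition.subList(committedOffset, until));
        if (!crashBeforeCommit) {
            committedOffset = until;
        }
    }

    List<String> processed() {
        return processed;
    }

    public static void main(String[] args) {
        ReceiverReplayDemo job = new ReceiverReplayDemo(Arrays.asList("a", "b", "c", "d"));
        job.runBatch(2, true);   // handles a, b, then "crashes" before committing
        job.runBatch(2, false);  // after the restart: reads a, b again
        job.runBatch(2, false);  // then c, d
        System.out.println(job.processed()); // [a, b, a, b, c, d]
    }
}
```

Nothing is lost in this model, but `a` and `b` are handled twice, which is the at-least-once behavior the text describes.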

II. The Direct approach

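This newer, receiver-less direct approach was introduced in Spark 1.3 to provide a more robust mechanism. Instead of using a Receiver to receive data, it periodically queries Kafka for the latest offset of each topic+partition and uses that to define the offset range of each batch. When the job that processes the data starts, it uses Kafka's simple consumer API to read the specified offset range from Kafka.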

The code is as follows:

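    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    // `brokers` and `topics` are comma-separated strings defined elsewhere
    SparkConf sparkConf = new SparkConf().setAppName("log-etl");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

    HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(",")));
    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", brokers);

    // Create direct kafka stream with brokers and topics
    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet
    );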
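Independently of the Spark API, the per-batch offset range described for the direct approach can be illustrated in plain Java. This is only a sketch: `OffsetRangePlanner`, `Range`, and `plan` are hypothetical names, the string keys stand in for topic+partition pairs, and the maps stand in for offsets the driver would obtain from Kafka. Spark's real `OffsetRange` type is not used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OffsetRangePlanner {
    // Half-open range [from, until) for one partition.
    static final class Range {
        final long from;
        final long until;

        Range(long from, long until) {
            this.from = from;
            this.until = until;
        }

        long count() {
            return until - from;
        }

        @Override
        public String toString() {
            return "[" + from + ", " + until + ")";
        }
    }

    // For each partition, the next batch starts where the previous one ended
    // and ends at the latest offset reported for that partition.
    static Map<String, Range> plan(Map<String, Long> lastUntil, Map<String, Long> latest) {
        Map<String, Range> ranges = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : latest.entrySet()) {
            long from = lastUntil.getOrDefault(e.getKey(), 0L);
            ranges.put(e.getKey(), new Range(from, Math.max(from, e.getValue())));
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, Long> lastUntil = new LinkedHashMap<>();
        lastUntil.put("group-45-0", 100L);
        lastUntil.put("group-45-1", 40L);

        Map<String, Long> latest = new LinkedHashMap<>();
        latest.put("group-45-0", 130L); // 30 new messages
        latest.put("group-45-1", 40L);  // nothing new

        System.out.println(plan(lastUntil, latest));
        // {group-45-0=[100, 130), group-45-1=[40, 40)}
    }
}
```

Each entry corresponds to one Kafka partition, which matches the one-to-one relationship between Kafka partitions and RDD partitions in the direct approach.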

The direct approach has the following advantages:

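1. Simplified parallel reads: to read multiple partitions, there is no need to create multiple input DStreams and union them. Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.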

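2. Exactly-once semantics: with the receiver-based approach, offsets are exchanged between Spark and ZooKeeper, which can easily leave the two inconsistent. The direct approach does not rely on ZooKeeper for offsets; Spark Streaming tracks them itself.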

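3. Efficiency: with a receiver, guaranteeing no data loss requires enabling the WAL, which means the data is effectively stored twice: once in Kafka's own replicas and once more in the WAL. The direct approach needs no such extra copy.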

III. Message loss with the Direct approach

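This approach looks perfect, but it still has a problem. When the business requires restarting the Spark Streaming program, business logs keep flowing into Kafka. After the job restarts it can only start consuming from the latest offset, so the messages written during the restart are lost. The offsets in Kafka are shown in the figure below (Kafka Manager is used to monitor the messages in the queue in real time):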

[Figure: offsets of the Kafka topic as displayed in Kafka Manager]

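After stopping the intake of business logs and then restarting the Spark program, we found that the job did not consume the data that had previously been written to Kafka. This is because the messages do not go through ZooKeeper, so the topic offsets are never saved. With no saved offset, the direct stream starts from the latest offset by default.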
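The size of the gap can be worked out on paper. In the sketch below, which is an illustration and not Spark code, `StartOffsetDemo`, `startOffset`, and `skipped` are hypothetical names. The rule it encodes is the one described in this section: resume from a saved offset if there is one, otherwise start at the latest offset.

```java
import java.util.OptionalLong;

final class StartOffsetDemo {
    // Where a restarted job begins reading one partition.
    static long startOffset(OptionalLong savedOffset, long latestOffset) {
        return savedOffset.isPresent() ? savedOffset.getAsLong() : latestOffset;
    }

    // Messages written while the job was down that the restarted job never reads.
    static long skipped(long offsetAtShutdown, OptionalLong savedOffset, long latestOffset) {
        return Math.max(0L, startOffset(savedOffset, latestOffset) - offsetAtShutdown);
    }

    public static void main(String[] args) {
        long atShutdown = 1_000L; // the job stopped after consuming up to offset 1000
        long latest = 1_250L;     // 250 messages arrived during the restart

        // No saved offset: the job jumps to the latest offset and skips the gap.
        System.out.println(skipped(atShutdown, OptionalLong.empty(), latest));        // 250
        // Saved offset available: the job resumes where it stopped.
        System.out.println(skipped(atShutdown, OptionalLong.of(atShutdown), latest)); // 0
    }
}
```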

IV. Handling the message loss

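There are generally two ways to handle this problem. The first is to let Spark Streaming save the offsets through the Spark checkpoint mechanism. The second is to implement the offset-saving logic yourself in the program. I prefer the second, because it is controllable and keeps all the initiative in your own hands.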

First, the overall flow:

[Figure: overall flow diagram]

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.atomic.AtomicReference;

    import kafka.common.TopicAndPartition;
    import kafka.message.MessageAndMetadata;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.HasOffsetRanges;
    import org.apache.spark.streaming.kafka.KafkaCluster;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import org.apache.spark.streaming.kafka.OffsetRange;
    import scala.Predef;
    import scala.Tuple2;
    import scala.collection.JavaConversions;

    // The class name and the field declarations are assumed; the original fragment omitted them.
    public class LogEtl {

        private static final Map<String, String> kafkaParam = new HashMap<String, String>();
        private static KafkaCluster kafkaCluster;
        private static scala.collection.immutable.Set<String> immutableTopics;
        private static Broadcast<Map<String, String>> kafkaParamBroadcast;

        public static void main(String[] args) throws Exception {
            SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("log-etl");

            Set<String> topicSet = new HashSet<String>();
            topicSet.add("group-45");
            kafkaParam.put("metadata.broker.list", "172.16.206.17:9092,172.16.206.31:9092,172.16.206.32:9092");
            kafkaParam.put("group.id", "simple1");
            final String groupId = kafkaParam.get("group.id");

            // Transform the Java Map into a Scala immutable.Map
            scala.collection.mutable.Map<String, String> testMap = JavaConversions.mapAsScalaMap(kafkaParam);
            scala.collection.immutable.Map<String, String> scalaKafkaParam =
                    testMap.toMap(new Predef.$less$colon$less<Tuple2<String, String>, Tuple2<String, String>>() {
                        public Tuple2<String, String> apply(Tuple2<String, String> v1) {
                            return v1;
                        }
                    });

            // Init KafkaCluster
            kafkaCluster = new KafkaCluster(scalaKafkaParam);

            scala.collection.mutable.Set<String> mutableTopics = JavaConversions.asScalaSet(topicSet);
            immutableTopics = mutableTopics.toSet();
            scala.collection.immutable.Set<TopicAndPartition> topicAndPartitionSet2 =
                    kafkaCluster.getPartitions(immutableTopics).right().get();

            // Offsets used to initialize the Kafka direct stream
            Map<TopicAndPartition, Long> consumerOffsetsLong = new HashMap<TopicAndPartition, Long>();
            Set<TopicAndPartition> topicAndPartitionSet1 =
                    JavaConversions.setAsJavaSet((scala.collection.immutable.Set) topicAndPartitionSet2);

            if (kafkaCluster.getConsumerOffsets(groupId, topicAndPartitionSet2).isLeft()) {
                // No saved offsets (first time this group consumes): default every partition to 0
                System.out.println(kafkaCluster.getConsumerOffsets(groupId, topicAndPartitionSet2).left().get());
                for (TopicAndPartition topicAndPartition : topicAndPartitionSet1) {
                    consumerOffsetsLong.put(topicAndPartition, 0L);
                }
            } else {
                // Offsets already exist: use the saved offsets
                scala.collection.immutable.Map<TopicAndPartition, Object> consumerOffsetsTemp =
                        kafkaCluster.getConsumerOffsets(groupId, topicAndPartitionSet2).right().get();
                Map<TopicAndPartition, Object> consumerOffsets =
                        JavaConversions.mapAsJavaMap((scala.collection.immutable.Map) consumerOffsetsTemp);
                for (TopicAndPartition topicAndPartition : topicAndPartitionSet1) {
                    Long offset = (Long) consumerOffsets.get(topicAndPartition);
                    consumerOffsetsLong.put(topicAndPartition, offset);
                }
            }

            JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5000));
            kafkaParamBroadcast = jssc.sparkContext().broadcast(kafkaParam);

            // Create the direct stream, starting from the offsets resolved above
            JavaInputDStream<String> message = KafkaUtils.createDirectStream(
                    jssc,
                    String.class,
                    String.class,
                    StringDecoder.class,
                    StringDecoder.class,
                    String.class,
                    kafkaParam,
                    consumerOffsetsLong,
                    new Function<MessageAndMetadata<String, String>, String>() {
                        public String call(MessageAndMetadata<String, String> v1) throws Exception {
                            System.out.println("received data <<===" + v1.message());
                            return v1.message();
                        }
                    }
            );

            // Capture the offset range of each RDD partition and keep it in offsetRanges
            final AtomicReference<OffsetRange[]> offsetRanges = new AtomicReference<OffsetRange[]>();

            JavaDStream<String> javaDStream = message.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
                public JavaRDD<String> call(JavaRDD<String> rdd) throws Exception {
                    OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                    offsetRanges.set(offsets);
                    return rdd;
                }
            });

            // Output
            javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
                public Void call(JavaRDD<String> rdd) throws Exception {
                    if (rdd.isEmpty()) return null;

                    List<String> list = rdd.collect();
                    for (String s : list) {
                        System.out.println("data===" + s);
                    }

                    for (OffsetRange o : offsetRanges.get()) {
                        // Wrap the topic/partition -> offset mapping in a Java Map
                        TopicAndPartition topicAndPartition = new TopicAndPartition(o.topic(), o.partition());
                        Map<TopicAndPartition, Object> topicAndPartitionObjectMap = new HashMap<TopicAndPartition, Object>();
                        topicAndPartitionObjectMap.put(topicAndPartition, o.untilOffset());

                        // Convert the Java Map into a Scala immutable.Map
                        scala.collection.mutable.Map<TopicAndPartition, Object> testMap =
                                JavaConversions.mapAsScalaMap(topicAndPartitionObjectMap);
                        scala.collection.immutable.Map<TopicAndPartition, Object> scalatopicAndPartitionObjectMap =
                                testMap.toMap(new Predef.$less$colon$less<Tuple2<TopicAndPartition, Object>, Tuple2<TopicAndPartition, Object>>() {
                                    public Tuple2<TopicAndPartition, Object> apply(Tuple2<TopicAndPartition, Object> v1) {
                                        return v1;
                                    }
                                });

                        // Write the updated offset back through kafkaCluster
                        kafkaCluster.setConsumerOffsets(kafkaParamBroadcast.getValue().get("group.id"), scalatopicAndPartitionObjectMap);

                        System.out.println("offset range ====> " + o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + o.untilOffset());
                    }
                    return null;
                }
            });

            jssc.start();
            jssc.awaitTermination();
        }
    }

In most cases this approach is enough to solve the data loss problem.
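The load, process, save cycle used by the self-managed offset approach can also be exercised without a cluster. The sketch below is a self-contained model under stated assumptions: `OffsetStore` is a hypothetical in-memory stand-in for the externally stored consumer offsets, `ManagedOffsetJob` is an invented name, and a Java list stands in for a single Kafka partition. It saves the end offset only after a batch has been handled, so a restarted job resumes from the saved position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for externally stored consumer offsets, keyed by group id.
final class OffsetStore {
    private final Map<String, Long> offsets = new HashMap<>();

    // Returns 0 when the group has never saved an offset (first run).
    long load(String groupId) {
        return offsets.getOrDefault(groupId, 0L);
    }

    void save(String groupId, long untilOffset) {
        offsets.put(groupId, untilOffset);
    }
}

final class ManagedOffsetJob {
    private final String groupId;
    private final OffsetStore store;
    private final List<String> output;

    ManagedOffsetJob(String groupId, OffsetStore store, List<String> output) {
        this.groupId = groupId;
        this.store = store;
        this.output = output;
    }

    // Consume everything from the saved offset up to the current end of the log,
    // then record the new end offset.
    void runBatch(List<String> log) {
        int from = (int) store.load(groupId);
        int until = log.size();
        output.addAll(log.subList(from, until));
        store.save(groupId, until);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>(Arrays.asList("m0", "m1"));
        OffsetStore store = new OffsetStore();
        List<String> output = new ArrayList<>();

        new ManagedOffsetJob("simple1", store, output).runBatch(log); // first run
        log.add("m2");                                                // written while the job is down
        log.add("m3");
        new ManagedOffsetJob("simple1", store, output).runBatch(log); // restarted job

        System.out.println(output); // [m0, m1, m2, m3]
    }
}
```

The messages written during the restart are picked up by the second job instance because it begins at the saved offset rather than at the end of the log.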
