Many readers find it hard to piece together how a Spark job like this is implemented end to end, so this article lays the whole exercise out step by step. The content is detailed and the steps are clear, so it should serve as a useful reference.
Send the data in sample.log to Kafka, process it with Spark Streaming, and change each record to the following format:

commandid | houseid | gathertime | srcip | destip | srcport | destport | domainname | proxytype | proxyip | proxytype | title | content | url | logid

then send the result to another Kafka topic.

Requirements:
1. sample.log => read the file and send its records to a Kafka topic
2. Consume that topic (the 0.10 API does not manage offsets itself) and transform the record format
3. Send the processed data to another Kafka topic

Approach:
1. Use the Redis utility class from the course to manage offsets
2. Read the log data and send it to topic1 (mytopic1)
3. Consume the topic, change the field delimiter to a vertical bar, and send the result to topic2 (mytopic2); a short sketch of the target layout follows.
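For reference, a tiny Scala sketch (not part of the course code) of what the target layout looks like: the 15 fields listed above joined with a vertical bar.

// Sketch of the target record layout: the 15 field names joined with "|"
object TargetFormatSketch {
  def main(args: Array[String]): Unit = {
    val fields = Seq("commandid", "houseid", "gathertime", "srcip", "destip", "srcport", "destport",
      "domainname", "proxytype", "proxyip", "proxytype", "title", "content", "url", "logid")
    println(fields.mkString("|"))
    // commandid|houseid|gathertime|srcip|destip|srcport|destport|domainname|proxytype|proxyip|proxytype|title|content|url|logid
  }
}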
1. OffsetsWithRedisUtils
package home.one

import java.util

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

import scala.collection.mutable

object OffsetsWithRedisUtils {
  // Redis connection parameters
  private val redisHost = "linux123"
  private val redisPort = 6379

  // Configure the Jedis connection pool
  private val config = new JedisPoolConfig
  // Maximum number of idle connections
  config.setMaxIdle(5)
  // Maximum number of total connections
  config.setMaxTotal(10)

  private val pool = new JedisPool(config, redisHost, redisPort, 10000)
  private def getRedisConnection: Jedis = pool.getResource

  private val topicPrefix = "kafka:topic"

  // Key format: kafka:topic:TopicName:groupid
  private def getKey(topic: String, groupid: String) = s"$topicPrefix:$topic:$groupid"

  // Read offsets from Redis by key
  def getOffsetsFromRedis(topics: Array[String], groupId: String): Map[TopicPartition, Long] = {
    val jedis: Jedis = getRedisConnection
    val offsets: Array[mutable.Map[TopicPartition, Long]] = topics.map { topic =>
      val key = getKey(topic, groupId)

      import scala.collection.JavaConverters._
      // Convert the Java map returned by Redis to a Scala map; the stored format is {key: [{partition, offset}]}
      jedis.hgetAll(key)
        .asScala
        .map { case (partition, offset) =>
          new TopicPartition(topic, partition.toInt) -> offset.toLong
        }
    }
    // Return the connection to the pool
    jedis.close()

    offsets.flatten.toMap
  }

  // Save offsets to Redis
  def saveOffsetsToRedis(offsets: Array[OffsetRange], groupId: String): Unit = {
    // Get a connection
    val jedis: Jedis = getRedisConnection

    // Organize the data as (topic, (partition, untilOffset))
    offsets.map { range => (range.topic, (range.partition.toString, range.untilOffset.toString)) }
      .groupBy(_._1)
      .foreach { case (topic, buffer) =>
        val key: String = getKey(topic, groupId)

        import scala.collection.JavaConverters._
        // Convert the Scala map to a Java map before writing it to Redis
        val maps: util.Map[String, String] = buffer.map(_._2).toMap.asJava

        // Save the data
        jedis.hmset(key, maps)
      }

    jedis.close()
  }
}
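A minimal round-trip sketch of how this utility is used (my own example, not the course code). It assumes Redis is reachable at linux123:6379 and uses a made-up group id, demo-group, so it does not touch the state of the real streaming job. The offsets land in a Redis hash keyed kafka:topic:mytopic1:demo-group with one field per partition.

package home.one

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

object OffsetsRoundTripDemo {
  def main(args: Array[String]): Unit = {
    val groupId = "demo-group"   // hypothetical group id, only for this demo

    // OffsetRange(topic, partition, fromOffset, untilOffset); only untilOffset is persisted
    val ranges = Array(OffsetRange("mytopic1", 0, 0L, 42L))
    OffsetsWithRedisUtils.saveOffsetsToRedis(ranges, groupId)

    // Read the offsets back, exactly as the streaming job does on every restart
    val restored: Map[TopicPartition, Long] =
      OffsetsWithRedisUtils.getOffsetsFromRedis(Array("mytopic1"), groupId)
    restored.foreach { case (tp, offset) => println(s"${tp.topic}-${tp.partition} -> $offset") }  // mytopic1-0 -> 42
  }
}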
2. KafkaProducer
package home.one

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object KafkaProducer {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read the sample.log file
    val lines: RDD[String] = sc.textFile("data/sample.log")

    // Kafka producer parameters
    val prop = new Properties()
    // Kafka broker address
    prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "linux121:9092")
    // Key and value serializers
    prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])

    // Send the data to mytopic1
    lines.foreachPartition { iter =>
      // Create one KafkaProducer per partition (KafkaProducer is not serializable,
      // so it cannot be built on the driver and shipped to executors)
      val producer = new KafkaProducer[String, String](prop)

      iter.foreach { line =>
        // Wrap each line in a record
        val record = new ProducerRecord[String, String]("mytopic1", line)
        // Send it
        producer.send(record)
      }
      producer.close()
    }
  }
}
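To check that the lines actually reached mytopic1, a plain Kafka consumer is enough. This is a sketch of my own (the group id verify-group is made up); it uses the poll(long) overload, which exists in every kafka-clients version before 3.0 even though newer clients deprecate it.

package home.one

import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

import scala.collection.JavaConverters._

object VerifyMytopic1 {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "linux121:9092")
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "verify-group")                // made-up group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")           // read from the beginning

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("mytopic1"))

    // Poll once and print whatever arrived within 5 seconds
    val records = consumer.poll(5000L)
    records.asScala.foreach(r => println(r.value()))

    consumer.close()
  }
}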
3. HomeOne
package home.one

import java.util.Properties

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HomeOne {
  val log = Logger.getLogger(this.getClass)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName).setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Topic(s) to consume
    val topics: Array[String] = Array("mytopic1")
    val groupid = "mygroup1"

    // Kafka consumer parameters
    val kafkaParams: Map[String, Object] = getKafkaConsumerParameters(groupid)
    // Read the starting offsets from Redis
    val fromOffsets = OffsetsWithRedisUtils.getOffsetsFromRedis(topics, groupid)

    // Create the DStream
    val dstream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      // Read from Kafka, starting at the offsets stored in Redis
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, fromOffsets)
    )

    // Transform each batch and send the result to another topic
    dstream.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        // Capture the offset ranges consumed in this batch
        val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

        // Process the data and send it to topic2
        rdd.foreachPartition(process)

        // Save the offsets to Redis
        OffsetsWithRedisUtils.saveOffsetsToRedis(offsetRanges, groupid)
      }
    }

    // Start the job
    ssc.start()
    // Keep it running
    ssc.awaitTermination()
  }

  // Send processed records to topic2
  def process(iter: Iterator[ConsumerRecord[String, String]]): Unit = {
    iter.map(record => parse(record.value))
      .filter(_.nonEmpty)
      .foreach(line => sendMsg2Topic(line, "mytopic2"))
  }

  // Send a single message with a Kafka producer
  def sendMsg2Topic(msg: String, topic: String): Unit = {
    val producer = new KafkaProducer[String, String](getKafkaProducerParameters())
    val record = new ProducerRecord[String, String](topic, msg)
    producer.send(record)
    // Close the producer so connections are not leaked
    producer.close()
  }

  // Change the record format: comma-separated fields become pipe-separated
  def parse(text: String): String = {
    try {
      val arr = text.replace("<<<!>>>", "").split(",")
      if (arr.length != 15) return ""
      arr.mkString("|")
    } catch {
      case e: Exception =>
        log.error("Failed to parse record!", e)
        ""
    }
  }

  // Kafka consumer configuration
  def getKafkaConsumerParameters(groupid: String): Map[String, Object] = {
    Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "linux121:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> groupid,
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean)
    )
  }

  // Kafka producer configuration
  def getKafkaProducerParameters(): Properties = {
    val prop = new Properties()
    prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "linux121:9092")
    prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    prop
  }
}
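sendMsg2Topic above keeps the original structure and simply closes the producer after each message. A cheaper variant (my own sketch, not the original course code) builds one producer per partition and reuses it for every record; it could be wired in by calling rdd.foreachPartition(ProcessWithSharedProducer.process) instead of rdd.foreachPartition(process).

package home.one

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Alternative partition handler: one producer per partition instead of one per record,
// which avoids repeatedly opening and closing connections to Kafka.
object ProcessWithSharedProducer {
  def process(iter: Iterator[ConsumerRecord[String, String]]): Unit = {
    val producer = new KafkaProducer[String, String](HomeOne.getKafkaProducerParameters())
    try {
      iter.map(record => HomeOne.parse(record.value))
        .filter(_.nonEmpty)
        .foreach(line => producer.send(new ProducerRecord[String, String]("mytopic2", line)))
    } finally {
      producer.close()
    }
  }
}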
The second exercise uses GraphX. Assume the airport data is as follows:

1, "SFO"
2, "ORD"
3, "DFW"

The routes between airports and their distances are:

1, 2, 1800
2, 3, 800
3, 1, 1400

Use GraphX to:
1. list all vertices
2. list all edges
3. list all triplets
4. count the vertices
5. count the edges
6. find how many routes are longer than 1000, and which ones
7. sort all routes by distance in descending order and print the result
Code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object TwoHome {
  def main(args: Array[String]): Unit = {
    // Initialization
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Initialize the data
    val vertexArray: Array[(Long, String)] = Array((1L, "SFO"), (2L, "ORD"), (3L, "DFW"))
    val edgeArray: Array[Edge[Int]] = Array(
      Edge(1L, 2L, 1800),
      Edge(2L, 3L, 800),
      Edge(3L, 1L, 1400)
    )

    // Build the vertex RDD and edge RDD
    val vertexRDD: RDD[(VertexId, String)] = sc.makeRDD(vertexArray)
    val edgeRDD: RDD[Edge[Int]] = sc.makeRDD(edgeArray)

    // Build the graph
    val graph: Graph[String, Int] = Graph(vertexRDD, edgeRDD)

    // All vertices
    println("All vertices:")
    graph.vertices.foreach(println)

    // All edges
    println("All edges:")
    graph.edges.foreach(println)

    // All triplets
    println("All triplets:")
    graph.triplets.foreach(println)

    // Vertex count
    val vertexCnt = graph.vertices.count()
    println(s"Total vertex count: $vertexCnt")

    // Edge count
    val edgeCnt = graph.edges.count()
    println(s"Total edge count: $edgeCnt")

    // Routes with distance greater than 1000
    println("Routes with distance greater than 1000:")
    graph.edges.filter(_.attr > 1000).foreach(println)

    // All routes sorted by distance in descending order
    println("All routes sorted by distance (descending):")
    graph.edges.sortBy(-_.attr).collect().foreach(println)
  }
}
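As a small follow-up (not part of the original exercise), the same counts are available from GraphX's built-in numVertices and numEdges, and triplets make it easy to print the routes longer than 1000 with airport names instead of raw vertex ids. These lines could be appended at the end of main above, since they reuse the graph built there:

    // Equivalent counts via GraphX helpers
    println(s"numVertices = ${graph.numVertices}, numEdges = ${graph.numEdges}")

    // Routes longer than 1000, shown with airport names instead of vertex ids
    graph.triplets
      .filter(_.attr > 1000)
      .map(t => s"${t.srcAttr} -> ${t.dstAttr}: ${t.attr}")
      .collect()
      .foreach(println)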
Run result