溫馨提示×

溫馨提示×

您好，登錄后才能下訂單哦！

密碼登錄×

忘記密碼？

登錄注冊×

獲取短信驗證碼

其他方式登錄

點擊登錄注冊即表示同意《億速云用戶服務(wù)條款》

用戶登錄×

賬戶密碼登錄

請使用微信掃描上方二維碼

使用幫助

請求超時！

請點擊重新獲取二維碼

Spark sql流式處理的示例分析

發(fā)布時間：2021-12-16 11:20:00 來源：億速云閱讀：349 作者：小新欄目：云計算

小編給大家分享一下Spark sql流式處理的示例分析，相信大部分人都還不怎么了解，因此分享這篇文章給大家參考一下，希望大家閱讀完這篇文章后大有收獲，下面讓我們一起去了解一下吧！

Spark sql支持流式處理，流式處理有Source，Sink。Source定義了流的源頭，Sink定義了流的目的地，流的執(zhí)行是從Sink開始觸發(fā)的。

Dataset的writeStream定義了流的目的地并觸發(fā)流的真正執(zhí)行，所以分析就從writeStream開始。

writeStream = new DataStreamWriter[T](this)

DataStreamWriter

DataStreamWriter的作用是將入?yún)⒌膁ataset寫入到外部存儲，比如kafka，database，txt等。

主要觸發(fā)方法是start方法，返回一個StreamingQuery對象，代碼：

 def start(): StreamingQuery = {    
    if (source == "memory") {
      assertNotPartitioned("memory")
      val (sink, resultDf) = trigger match {
        case _: ContinuousTrigger =>
          val s = new MemorySinkV2()
          val r = Dataset.ofRows(df.sparkSession, new MemoryPlanV2(s, df.schema.toAttributes))
          (s, r)
        case _ =>
          val s = new MemorySink(df.schema, outputMode)
          val r = Dataset.ofRows(df.sparkSession, new MemoryPlan(s))
          (s, r)
      }
      val chkpointLoc = extraOptions.get("checkpointLocation")
      val recoverFromChkpoint = outputMode == OutputMode.Complete()
      val query = df.sparkSession.sessionState.streamingQueryManager.startQuery(
        extraOptions.get("queryName"),
        chkpointLoc,
        df,
        extraOptions.toMap,
        sink,
        outputMode,
        useTempCheckpointLocation = true,
        recoverFromCheckpointLocation = recoverFromChkpoint,
        trigger = trigger)
      resultDf.createOrReplaceTempView(query.name)
      query
    } else if (source == "foreach") {
      assertNotPartitioned("foreach")
      val sink = new ForeachSink[T](foreachWriter)(ds.exprEnc)
      df.sparkSession.sessionState.streamingQueryManager.startQuery(
        extraOptions.get("queryName"),
        extraOptions.get("checkpointLocation"),
        df,
        extraOptions.toMap,
        sink,
        outputMode,
        useTempCheckpointLocation = true,
        trigger = trigger)
    } else {
      val ds = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
      val disabledSources = df.sparkSession.sqlContext.conf.disabledV2StreamingWriters.split(",")
      val sink = ds.newInstance() match {
        case w: StreamWriteSupport if !disabledSources.contains(w.getClass.getCanonicalName) => w
        case _ =>
          val ds = DataSource(
            df.sparkSession,
            className = source,
            options = extraOptions.toMap,
            partitionColumns = normalizedParCols.getOrElse(Nil))
          ds.createSink(outputMode)
      }

      df.sparkSession.sessionState.streamingQueryManager.startQuery(
        extraOptions.get("queryName"),
        extraOptions.get("checkpointLocation"),
        df,
        extraOptions.toMap,
        sink,
        outputMode,
        useTempCheckpointLocation = source == "console",
        recoverFromCheckpointLocation = true,
        trigger = trigger)
    }
  }

我們這里看最后一個條件分支的代碼，ds是對應的DataSource，sink有時候就是ds。最后通過streamingQueryManager的startQuery啟動流的計算，返回計算中的StreamingQuery對象。

streamingQueryManager的startQuery方法里主要調(diào)用createQuery方法創(chuàng)建StreamingQueryWrapper對象，這是個私有方法：

private def createQuery(
      userSpecifiedName: Option[String],
      userSpecifiedCheckpointLocation: Option[String],
      df: DataFrame,
      extraOptions: Map[String, String],
      sink: BaseStreamingSink,
      outputMode: OutputMode,
      useTempCheckpointLocation: Boolean,
      recoverFromCheckpointLocation: Boolean,
      trigger: Trigger,
      triggerClock: Clock): StreamingQueryWrapper = {
    var deleteCheckpointOnStop = false
    val checkpointLocation = userSpecifiedCheckpointLocation.map { userSpecified =>
      new Path(userSpecified).toUri.toString
    }.orElse {
      df.sparkSession.sessionState.conf.checkpointLocation.map { location =>
        new Path(location, userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString
      }
    }.getOrElse {
      if (useTempCheckpointLocation) {
        // Delete the temp checkpoint when a query is being stopped without errors.
        deleteCheckpointOnStop = true
        Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
      } else {
        throw new AnalysisException(
          "checkpointLocation must be specified either " +
            """through option("checkpointLocation", ...) or """ +
            s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", ...)""")
      }
    }

    // If offsets have already been created, we trying to resume a query.
    if (!recoverFromCheckpointLocation) {
      val checkpointPath = new Path(checkpointLocation, "offsets")
      val fs = checkpointPath.getFileSystem(df.sparkSession.sessionState.newHadoopConf())
      if (fs.exists(checkpointPath)) {
        throw new AnalysisException(
          s"This query does not support recovering from checkpoint location. " +
            s"Delete $checkpointPath to start over.")
      }
    }

    val analyzedPlan = df.queryExecution.analyzed
    df.queryExecution.assertAnalyzed()

    if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
      UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
    }

    if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
      logWarning(s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
          "is not supported in streaming DataFrames/Datasets and will be disabled.")
    }

    (sink, trigger) match {
      case (v2Sink: StreamWriteSupport, trigger: ContinuousTrigger) =>
        UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
        new StreamingQueryWrapper(new ContinuousExecution(
          sparkSession,
          userSpecifiedName.orNull,
          checkpointLocation,
          analyzedPlan,
          v2Sink,
          trigger,
          triggerClock,
          outputMode,
          extraOptions,
          deleteCheckpointOnStop))
      case _ =>
        new StreamingQueryWrapper(new MicroBatchExecution(
          sparkSession,
          userSpecifiedName.orNull,
          checkpointLocation,
          analyzedPlan,
          sink,
          trigger,
          triggerClock,
          outputMode,
          extraOptions,
          deleteCheckpointOnStop))
    }
  }

它根據(jù)是否連續(xù)流操作還是微批處理操作分成ContinuousExecution和MicroBatchExecution，他們都是StreamExecution的子類，StreamExecution是流處理的抽象類。稍后會分析StreamExecution的類結(jié)構(gòu)。

ContinuousExecution和MicroBatchExecution兩者的代碼結(jié)構(gòu)和功能其實是很類似的，我們先拿ContinuousExecution舉例吧。

ContinuousExecution

首先ContinuousExecution是沒有結(jié)束的，是沒有結(jié)束的流，當暫時流沒有數(shù)據(jù)時，ContinuousExecution會阻塞線程等待新數(shù)據(jù)的到來，這是通過awaitEpoch方法來控制的。

其實，commit方法在每條數(shù)據(jù)處理完后被觸發(fā)，commit方法將當前處理完成的偏移量（offset）寫到commitLog中。

再看logicalPlan，在ContinuousExecution中入?yún)⒌倪壿嬘媱澥荢treamingRelationV2類型，會被轉(zhuǎn)換成ContinuousExecutionRelation類型的LogicalPlan：

analyzedPlan.transform {
case r @ StreamingRelationV2(
source: ContinuousReadSupport, _, extraReaderOptions, output, _) =>
toExecutionRelationMap.getOrElseUpdate(r, {
ContinuousExecutionRelation(source, extraReaderOptions, output)(sparkSession)
})

}

還有addOffset方法，在每次讀取完offset之后會將當前的讀取offset寫入到offsetLog中，以便下次恢復時知道從哪里開始。addOffset和commit兩個方法一起保證了Exactly-once語義的執(zhí)行。

以上是“Spark sql流式處理的示例分析”這篇文章的所有內(nèi)容，感謝各位的閱讀！相信大家都有了一定的了解，希望分享的內(nèi)容對大家有所幫助，如果還想學習更多知識，歡迎關(guān)注億速云行業(yè)資訊頻道！

向AI問一下細節(jié)

推薦閱讀：

免責聲明：本站發(fā)布的內(nèi)容（圖片、視頻和文字）以原創(chuàng)、轉(zhuǎn)載和分享為主，文章觀點不代表本網(wǎng)站立場，如果涉及侵權(quán)請聯(lián)系站長郵箱：is@yisu.com進行舉報，并提供相關(guān)證據(jù)，一經(jīng)查實，將立刻刪除涉嫌侵權(quán)內(nèi)容。

上一篇新聞：
Serverless基本概念該怎么入門
下一篇新聞：
Linux?sftp命令的用法是怎樣的

猜你喜歡

AI
助
手

產(chǎn)品服務(wù)

地區(qū)劃分

專題活動

幫助支持

關(guān)于我們

售后咨詢

7*24小時在線電話：400-100-2938

7*24小時在線 QQ：800811969

關(guān)注億速云

億速云公眾號

手機網(wǎng)站二維碼

<strong id="fxdam"><listing id="fxdam"></listing></strong>

<samp id="fxdam"><listing id="fxdam"><dl id="fxdam"></dl></listing></samp>

<video id="fxdam"><th id="fxdam"></th></video>

<td id="fxdam"><listing id="fxdam"><kbd id="fxdam"></kbd></listing></td>