How to Use DataStreamReader and DataStreamWriter

Published: 2021-12-30 10:05:44 | Source: Yisu Cloud | Author: iii | Category: Cloud Computing

This article looks at how DataStreamReader and DataStreamWriter are used, a topic that many people find confusing in day-to-day work. It collects the relevant material into a simple walkthrough of the source code that should help clear up that confusion.

Reading and writing a stream start with DataStreamReader and DataStreamWriter.
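Before the source walkthrough, here is a minimal sketch of how a DataStreamReader is typically obtained on the user side. It assumes the spark-sql library is on the classpath and uses the built-in "rate" test source; the object name ReadStreamSketch, the method name rateStream, and the local master setting are illustrative choices, not something taken from the article.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadStreamSketch {
  // spark.readStream returns a DataStreamReader; load() builds the streaming DataFrame.
  def rateStream(spark: SparkSession): DataFrame =
    spark.readStream
      .format("rate")                // built-in source that generates rows for testing
      .option("rowsPerSecond", "5")
      .load()

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for experimenting.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("read-stream-sketch")
      .getOrCreate()

    val df = rateStream(spark)
    println(s"isStreaming = ${df.isStreaming}")
    df.printSchema()

    spark.stop()
  }
}
```

Nothing is executed at this point: load() only builds a logical plan, which is exactly what the source code below shows.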

DataStreamReader

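DataStreamReader is the entry point for creating a stream reader, and its key method is load. This code matters a great deal, so the whole method is shown first and then analyzed step by step.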

def load(): DataFrame = {
  val ds = DataSource.lookupDataSource(source, sparkSession.sqlContext.conf).
    getConstructor().newInstance()

  val v1DataSource = DataSource(
    sparkSession,
    userSpecifiedSchema = userSpecifiedSchema,
    className = source,
    options = extraOptions.toMap)
  val v1Relation = ds match {
    case _: StreamSourceProvider => Some(StreamingRelation(v1DataSource))
    case _ => None
  }
  ds match {
    case provider: TableProvider =>
      val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
        source = provider, conf = sparkSession.sessionState.conf)
      val options = sessionOptions ++ extraOptions
      val dsOptions = new CaseInsensitiveStringMap(options.asJava)
      val table = userSpecifiedSchema match {
        case Some(schema) => provider.getTable(dsOptions, schema)
        case _ => provider.getTable(dsOptions)
      }
      import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
      table match {
        case _: SupportsRead if table.supportsAny(MICRO_BATCH_READ, CONTINUOUS_READ) =>
          Dataset.ofRows(
            sparkSession,
            StreamingRelationV2(
              provider, source, table, dsOptions, table.schema.toAttributes, v1Relation)(
              sparkSession))
        // fallback to v1
        // TODO (SPARK-27483): we should move this fallback logic to an analyzer rule.
        case _ => Dataset.ofRows(sparkSession, StreamingRelation(v1DataSource))
      }
    case _ =>
      // Code path for data source v1.
      Dataset.ofRows(sparkSession, StreamingRelation(v1DataSource))
  }
}

There are many branches here; the important thing is to tell the V1 and V2 paths apart.
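The branching in load() can be condensed into a small model. The sketch below is a deliberately simplified stand-in, not Spark code: the traits Provider, V1StreamProvider and V2TableProvider and the Capability objects are hypothetical names that only mimic the roles of StreamSourceProvider, TableProvider and the table capabilities, and the result is just the name of the relation that would be built.

```scala
// Hypothetical stand-ins for the capabilities a V2 table can advertise.
sealed trait Capability
case object MicroBatchRead extends Capability
case object ContinuousRead extends Capability

// Hypothetical stand-ins for the provider interfaces (NOT the real Spark types).
trait Provider
trait V1StreamProvider extends Provider      // plays the role of StreamSourceProvider
trait V2TableProvider extends Provider {     // plays the role of TableProvider
  def capabilities: Set[Capability]
}

object LoadDispatchSketch {
  // Returns the name of the logical relation that load() would construct.
  def relationFor(provider: Provider): String = provider match {
    // V2 path: taken only when the table supports some kind of streaming read.
    case v2: V2TableProvider
        if v2.capabilities.exists(c => c == MicroBatchRead || c == ContinuousRead) =>
      "StreamingRelationV2"
    // Fallback: V2 providers without streaming reads, and all V1 providers.
    case _ =>
      "StreamingRelation"
  }
}
```

The point to notice is that a V2 provider does not automatically get the V2 relation: without a streaming read capability it falls back to the V1 relation, just like a plain V1 source.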

V1 uses the logical relation StreamingRelation, while V2 uses StreamingRelationV2. So which physical plans do they map to?

The physical planning for them is defined in SparkStrategies.scala:

/**
 * This strategy is just for explaining `Dataset/DataFrame` created by `spark.readStream`.
 * It won't affect the execution, because `StreamingRelation` will be replaced with
 * `StreamingExecutionRelation` in `StreamingQueryManager` and `StreamingExecutionRelation` will
 * be replaced with the real relation using the `Source` in `StreamExecution`.
 */
object StreamingRelationStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case s: StreamingRelation =>
      StreamingRelationExec(s.sourceName, s.output) :: Nil
    case s: StreamingExecutionRelation =>
      StreamingRelationExec(s.toString, s.output) :: Nil
    case s: StreamingRelationV2 =>
      StreamingRelationExec(s.sourceName, s.output) :: Nil
    case _ => Nil
  }
}

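All three cases map to StreamingRelationExec, whose code implements essentially nothing. As the comment above says, StreamingRelationExec exists only so that the plan can be explained; it is not the plan that really executes.

For now, keep the related classes ContinuousExecution and MicroBatchExecution in mind. It is not obvious how execution reaches them, so let us work backwards, starting with their parent class and then looking at ContinuousExecution.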

StreamExecution

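StreamExecution is an abstract class. Its abstract method runActivatedStream carries out the actual stream processing, and each subclass overrides it.

The runStream method wraps runActivatedStream and adds extra handling such as event notifications; that is all we need to know here.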
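The relationship between runStream and runActivatedStream is a classic template method. The following is a minimal sketch of that pattern only, assuming nothing about Spark internals beyond what is described above: ExecutionSketch, MicroBatchSketch, run, runActivated and the events buffer are all hypothetical names used for illustration.

```scala
import scala.collection.mutable.ListBuffer

abstract class ExecutionSketch {
  // Records what happened, standing in for listener events and logging.
  val events: ListBuffer[String] = ListBuffer.empty[String]

  // Supplied by subclasses: the actual processing (the role of runActivatedStream).
  protected def runActivated(): Unit

  // Fixed wrapper around the subclass logic (the role of runStream).
  final def run(): Unit = {
    events += "started"
    try {
      runActivated()
    } catch {
      case e: Exception => events += s"failed: ${e.getMessage}"
    } finally {
      events += "terminated"
    }
  }
}

class MicroBatchSketch extends ExecutionSketch {
  protected def runActivated(): Unit = {
    events += "ran micro-batches"
  }
}
```

The base class owns the lifecycle bookkeeping, so each concrete execution only has to describe how it processes data.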

StreamingQueryManager

Next, a look at what StreamingQueryManager is for. According to its comments, it manages all StreamingQuery instances.

private def createQuery(...): StreamingQueryWrapper = {
  (sink, trigger) match {
    case (table: SupportsWrite, trigger: ContinuousTrigger) =>
      new StreamingQueryWrapper(new ContinuousExecution(
        sparkSession,
        userSpecifiedName.orNull,
        checkpointLocation,
        analyzedPlan,
        table,
        trigger,
        triggerClock,
        outputMode,
        extraOptions,
        deleteCheckpointOnStop))
    case _ =>
      if (operationCheckEnabled) {
        UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
      }
      new StreamingQueryWrapper(new MicroBatchExecution(
        sparkSession,
        userSpecifiedName.orNull,
        checkpointLocation,
        analyzedPlan,
        sink,
        trigger,
        triggerClock,
        outputMode,
        extraOptions,
        deleteCheckpointOnStop))
  }
}

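For a continuous stream, it returns:

new StreamingQueryWrapper(new ContinuousExecution(...))

StreamingQueryWrapper simply wraps a StreamingQuery so that it becomes serializable; otherwise it is no different from the StreamingQuery itself. For a continuous stream, what it wraps is a ContinuousExecution.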
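The choice made in createQuery comes down to one pattern match on the pair (sink, trigger). Here is a simplified model of that decision; it is a sketch under stated assumptions, with TriggerSketch, SinkSketch, WritableTableSketch and CreateQuerySketch being hypothetical stand-ins rather than Spark types, and with the result reduced to the name of the execution class.

```scala
// Hypothetical stand-ins for the trigger kinds.
sealed trait TriggerSketch
case object ContinuousTriggerSketch extends TriggerSketch
case object ProcessingTimeSketch extends TriggerSketch

// Hypothetical stand-ins for sinks; WritableTableSketch plays the role of a
// sink that is a table supporting writes (SupportsWrite in the code above).
trait SinkSketch
trait WritableTableSketch extends SinkSketch

object CreateQuerySketch {
  // Returns the name of the execution class that would be instantiated.
  def engineFor(sink: SinkSketch, trigger: TriggerSketch): String =
    (sink, trigger) match {
      // Continuous execution needs BOTH a writable table and a continuous trigger.
      case (_: WritableTableSketch, ContinuousTriggerSketch) => "ContinuousExecution"
      // Every other combination runs as micro-batches.
      case _ => "MicroBatchExecution"
    }
}
```

In this model, as in the code above, a continuous trigger alone is not enough: if the sink does not match the first case, the query falls through to the micro-batch branch.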

ContinuousExecution

Judging by its name, ContinuousExecution is what executes a continuous stream. It extends the abstract class StreamExecution, and its main code is simply an override of the runActivatedStream method.

override protected def runActivatedStream(sparkSessionForStream: SparkSession): Unit = {
  val stateUpdate = new UnaryOperator[State] {
    override def apply(s: State) = s match {
      // If we ended the query to reconfigure, reset the state to active.
      case RECONFIGURING => ACTIVE
      case _ => s
    }
  }

  do {
    runContinuous(sparkSessionForStream)
  } while (state.updateAndGet(stateUpdate) == ACTIVE)

  stopSources()
}

The real execution logic lives in the private method runContinuous. It is not expanded here; knowing the main flow is enough.
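The loop in runActivatedStream relies on AtomicReference.updateAndGet to flip a "reconfiguring" state back to "active" and decide in one atomic step whether to run again. The sketch below reproduces only that mechanism with standard JDK classes; RunLoopSketch, its State values and the body callback are hypothetical names, and a plain while loop is used in place of do/while so the code compiles on both Scala 2 and Scala 3.

```scala
import java.util.concurrent.atomic.AtomicReference
import java.util.function.UnaryOperator

object RunLoopSketch {
  sealed trait State
  case object Active extends State
  case object Reconfiguring extends State
  case object Terminated extends State

  // Runs `body` at least once and repeats while the updated state is Active.
  // Returns how many times `body` was executed.
  def runLoop(state: AtomicReference[State])(body: Int => Unit): Int = {
    val stateUpdate = new UnaryOperator[State] {
      override def apply(s: State): State = s match {
        case Reconfiguring => Active // a reconfiguration means "run again"
        case other => other          // any other state is left untouched
      }
    }

    var runs = 0
    var keepRunning = true
    while (keepRunning) {
      runs += 1
      body(runs)
      // Atomically apply the update and inspect the resulting state.
      keepRunning = state.updateAndGet(stateUpdate) == Active
    }
    runs
  }
}
```

If the body leaves the state as Reconfiguring, the loop resets it to Active and runs once more; if the body sets it to Terminated, the loop exits.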

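The next step is to find where the logical plan is handed over to a ContinuousExecution.

A search of the code base leads to StreamingQueryManager.scala. That is the answer: ContinuousExecution is created in the StreamingQueryManager code shown above.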

DataStreamWriter

DataStreamWriter is where a streaming computation is actually triggered to start running.
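From the user's side, that trigger point is the start() call at the end of a writeStream chain. The following is a minimal end-to-end sketch, assuming the spark-sql library is on the classpath and using the built-in "rate" source and "console" sink; the object name WriteStreamSketch, the query name, the trigger interval and the ten-second run time are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object WriteStreamSketch {
  // Builds a streaming DataFrame and starts a query that prints it to the console.
  def startConsoleQuery(spark: SparkSession): StreamingQuery = {
    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    df.writeStream                                  // returns a DataStreamWriter
      .format("console")
      .outputMode("append")
      .queryName("rate_to_console")
      .trigger(Trigger.ProcessingTime("2 seconds"))
      .start()                                      // the query begins running here
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("write-stream-sketch")
      .getOrCreate()

    val query = startConsoleQuery(spark)
    query.awaitTermination(10000L) // let it run for roughly ten seconds
    query.stop()
    spark.stop()
  }
}
```

Everything before start() only describes the query; start() is what hands it to StreamingQueryManager.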

Its start() method returns a StreamingQuery. The key fragment of the method is:

df.sparkSession.sessionState.streamingQueryManager.startQuery(
  extraOptions.get("queryName"),
  extraOptions.get("checkpointLocation"),
  df,
  extraOptions.toMap,
  sink,
  outputMode,
  useTempCheckpointLocation = source == "console" || source == "noop",
  recoverFromCheckpointLocation = true,
  trigger = trigger)

Following this call leads into StreamingQueryManager and its startQuery method.

startQuery proceeds in three steps:

1. Call createQuery to obtain the StreamingQuery.

val query = createQuery(
  userSpecifiedName,
  userSpecifiedCheckpointLocation,
  df,
  extraOptions,
  sink,
  outputMode,
  useTempCheckpointLocation,
  recoverFromCheckpointLocation,
  trigger,
  triggerClock)

Here query is a StreamingQueryWrapper, created by code along these lines:

new StreamingQueryWrapper(new ContinuousExecution(...))

2. Start the query created in the previous step.

try {
  query.streamingQuery.start()
} catch {
  // ... (error handling omitted in this excerpt)
}

This calls straight into the start method of StreamExecution, the parent class of the wrapped query. It is defined as:

def start(): Unit = {
  logInfo(s"Starting $prettyIdString. Use $resolvedCheckpointRoot to store the query checkpoint.")
  queryExecutionThread.setDaemon(true)
  queryExecutionThread.start()
  startLatch.await()  // Wait until thread started and QueryStart event has been posted
}

The queryExecutionThread thread is in turn defined as:

val queryExecutionThread: QueryExecutionThread =
  new QueryExecutionThread(s"stream execution thread for $prettyIdString") {
    override def run(): Unit = {
      sparkSession.sparkContext.setCallSite(callSite)
      runStream()
    }
  }

So, in the end, the private method runStream is started on that thread.
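The start sequence is a common concurrency idiom: launch a daemon thread, then block the caller on a latch until the thread signals that startup has happened. The sketch below shows just that idiom with standard JDK classes. ThreadedQuerySketch, hasStarted and queryName are hypothetical names, and the assumption that the worker counts the latch down early in its run is a simplification of what runStream does.

```scala
import java.util.concurrent.CountDownLatch

class ThreadedQuerySketch(queryName: String) {
  private val startLatch = new CountDownLatch(1)

  // Set by the worker thread once it is running.
  @volatile var hasStarted: Boolean = false

  private val worker = new Thread(s"stream execution thread for $queryName") {
    override def run(): Unit = {
      hasStarted = true
      startLatch.countDown() // tell the caller of start() that startup has happened
      // A real query would keep processing data here.
    }
  }

  def start(): Unit = {
    worker.setDaemon(true) // do not keep the JVM alive just for this thread
    worker.start()
    startLatch.await()     // block until the worker has signalled startup
  }
}
```

Because start() waits on the latch, the caller gets control back only after the worker thread is genuinely running, which is why the returned query can be treated as already started.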

3. Return the query.

Finally, query is returned. Note that by this point it has already been started by the code above.

That concludes this look at how DataStreamReader and DataStreamWriter are used. Theory sticks best when paired with practice, so go and try it out.
