How Spark Delta Reads Data

Published: 2021-12-16 16:14:02 | Source: 亿速云 (Yisu Cloud) | Author: 小新 | Category: Big Data

This article walks through how Spark reads data from a Delta table by following the relevant source code.

Analysis

The construction of Spark's Delta data source starts at DataSource.lookupDataSourceV2 and then flows into loadV1Source, where dataSource.createRelation is called to build the data source's Relation. From there we go straight to createRelation in DeltaDataSource:

override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val maybePath = parameters.getOrElse("path", {
      throw DeltaErrors.pathNotSpecifiedException
    })

    // Log any invalid options that are being passed in
    DeltaOptions.verifyOptions(CaseInsensitiveMap(parameters))

    val timeTravelByParams = DeltaDataSource.getTimeTravelVersion(parameters)
    DeltaTableV2(
      sqlContext.sparkSession,
      new Path(maybePath),
      timeTravelOpt = timeTravelByParams).toBaseRelation
  }

DeltaOptions.verifyOptions validates the options that were passed in. The valid option keys are:

val validOptionKeys : Set[String] = Set(
    REPLACE_WHERE_OPTION,
    MERGE_SCHEMA_OPTION,
    EXCLUDE_REGEX_OPTION,
    OVERWRITE_SCHEMA_OPTION,
    USER_METADATA_OPTION,
    MAX_FILES_PER_TRIGGER_OPTION,
    IGNORE_FILE_DELETION_OPTION,
    IGNORE_CHANGES_OPTION,
    IGNORE_DELETES_OPTION,
    OPTIMIZE_WRITE_OPTION,
    DATA_CHANGE_OPTION,
    "queryName",
    "checkpointLocation",
    "path",
    "timestampAsOf",
    "versionAsOf"
  )
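To illustrate the kind of check verifyOptions performs, here is a minimal sketch in plain Scala. It is not Delta's implementation: the object name OptionCheckSketch and the helper unknownKeys are hypothetical, and only the literal keys visible in the list above are included, because the string values behind the named constants are not shown in this article. The sketch assumes, based on the "Log any invalid options" comment and the CaseInsensitiveMap wrapper in createRelation, that unknown keys are reported rather than rejected and that keys are compared case-insensitively.

```scala
import java.util.Locale

object OptionCheckSketch {
  // Only the literal keys shown in the article; the named constants are left out
  // because their string values are not visible here.
  val validKeys: Set[String] =
    Set("queryName", "checkpointLocation", "path", "timestampAsOf", "versionAsOf")
      .map(_.toLowerCase(Locale.ROOT))

  /** Returns the option keys that are not recognized, compared case-insensitively. */
  def unknownKeys(options: Map[String, String]): Set[String] =
    options.keySet.filterNot(k => validKeys.contains(k.toLowerCase(Locale.ROOT)))

  def main(args: Array[String]): Unit = {
    val opts = Map("PATH" -> "/tmp/events", "versionasof" -> "3", "typoOption" -> "x")
    // Assumption: report unknown options instead of failing the read.
    unknownKeys(opts).foreach(k => println(s"Option '$k' is not recognized"))
  }
}
```

Reporting instead of throwing keeps a read from failing just because a caller passed an option that only some other data source understands.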

DeltaDataSource.getTimeTravelVersion reads the timestampAsOf or versionAsOf option and returns the requested time travel version, if one was specified.
The toBaseRelation method of DeltaTableV2 is then called directly:

def toBaseRelation: BaseRelation = {
    if (deltaLog.snapshot.version == -1) {
      val id = catalogTable.map(ct => DeltaTableIdentifier(table = Some(ct.identifier)))
        .getOrElse(DeltaTableIdentifier(path = Some(path.toString)))
      throw DeltaErrors.notADeltaTableException(id)
    }
    val partitionPredicates = DeltaDataSource.verifyAndCreatePartitionFilters(
      path.toString, deltaLog.snapshot, partitionFilters)

    // TODO(burak): We should pass in the snapshot here
    deltaLog.createRelation(partitionPredicates, timeTravelSpec)
  }

If the table is partitioned, DeltaDataSource.verifyAndCreatePartitionFilters creates the partitionPredicates.
For timeTravelSpec, the user-specified timeTravelByParams takes precedence; otherwise DeltaDataSource.parsePathIdentifier picks the version encoded in the path, in a format such as /some/path/partition=1@v1234 or /some/path/partition=1@yyyyMMddHHmmssSSS.
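The precedence rule and the two path suffix formats can be sketched in plain Scala as follows. This is an illustration only, not the code of parsePathIdentifier: the object PathTimeTravelSketch, the types ByVersion and ByTimestamp, and both regular expressions are assumptions derived solely from the two example paths above. The timestamp is kept as a raw 17-digit string rather than parsed.

```scala
object PathTimeTravelSketch {
  sealed trait TimeTravel
  final case class ByVersion(version: Long) extends TimeTravel
  // The yyyyMMddHHmmssSSS digits, kept as a raw string in this sketch.
  final case class ByTimestamp(raw: String) extends TimeTravel

  // Hypothetical patterns: "@v<digits>" or "@<17 digits>" at the end of the path.
  private val VersionSuffix = """(.*)@[vV](\d+)""".r
  private val TimestampSuffix = """(.*)@(\d{17})""".r

  /** Splits a path into (path without suffix, optional time travel request). */
  def parse(path: String): (String, Option[TimeTravel]) = path match {
    case VersionSuffix(base, v)    => (base, Some(ByVersion(v.toLong)))
    case TimestampSuffix(base, ts) => (base, Some(ByTimestamp(ts)))
    case _                         => (path, None)
  }

  /** A request given through reader options wins over one encoded in the path. */
  def resolve(fromOptions: Option[TimeTravel], path: String): Option[TimeTravel] =
    fromOptions.orElse(parse(path)._2)
}
```

With these definitions, PathTimeTravelSketch.parse("/some/path/partition=1@v1234") yields the base path /some/path/partition=1 together with ByVersion(1234), and resolve only consults the path when no option was supplied.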

deltaLog.createRelation is then called directly:

def createRelation(
   partitionFilters: Seq[Expression] = Nil,
   timeTravel: Option[DeltaTimeTravelSpec] = None): BaseRelation = {

 val versionToUse = timeTravel.map { tt =>
   val (version, accessType) = DeltaTableUtils.resolveTimeTravelVersion(
     spark.sessionState.conf, this, tt)
   val source = tt.creationSource.getOrElse("unknown")
   recordDeltaEvent(this, s"delta.timeTravel.$source", data = Map(
     "tableVersion" -> snapshot.version,
     "queriedVersion" -> version,
     "accessType" -> accessType
   ))
   version
 }

 /** Used to link the files present in the table into the query planner. */
 val snapshotToUse = versionToUse.map(getSnapshotAt(_)).getOrElse(snapshot)
 val fileIndex = TahoeLogFileIndex(
   spark, this, dataPath, snapshotToUse.metadata.schema, partitionFilters, versionToUse)

 new HadoopFsRelation(
   fileIndex,
   partitionSchema = snapshotToUse.metadata.partitionSchema,
   dataSchema = snapshotToUse.metadata.schema,
   bucketSpec = None,
   snapshotToUse.fileFormat,
   snapshotToUse.metadata.format.options)(spark) with InsertableRelation {
   def insert(data: DataFrame, overwrite: Boolean): Unit = {
     val mode = if (overwrite) SaveMode.Overwrite else SaveMode.Append
     WriteIntoDelta(
       deltaLog = DeltaLog.this,
       mode = mode,
       new DeltaOptions(Map.empty[String, String], spark.sessionState.conf),
       partitionColumns = Seq.empty,
       configuration = Map.empty,
       data = data).run(spark)
   }
 }
}

This method does two things. First, it obtains the snapshot for the requested version. Second, it builds a TahoeLogFileIndex. Because the relation being built is a HadoopFsRelation, the part to focus on is the inputFiles method of TahoeLogFileIndex:

override def inputFiles: Array[String] = {
  getSnapshot(stalenessAcceptable = false).filesForScan(
    projection = Nil, partitionFilters).files.map(f => absolutePath(f.path).toString).toArray
}

This method calls the snapshot's filesForScan method:

def filesForScan(projection: Seq[Attribute], filters: Seq[Expression]): DeltaScan = {
  implicit val enc = SingleAction.addFileEncoder

  val partitionFilters = filters.flatMap { filter =>
    DeltaTableUtils.splitMetadataAndDataPredicates(filter, metadata.partitionColumns, spark)._1
  }

  val files = DeltaLog.filterFileList(
    metadata.partitionSchema,
    allFiles.toDF(),
    partitionFilters).as[AddFile].collect()

  DeltaScan(version = version, files, null, null, null)(null, null, null, null)
}
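The pruning that filesForScan performs, followed by the path resolution in inputFiles, can be approximated without Spark. The sketch below is a simplified model under stated assumptions: AddFile here is a hypothetical two-field stand-in for Delta's action, partition filters are modelled as plain Scala predicates over the partition values rather than Catalyst expressions, and absolute paths are formed by simple string concatenation.

```scala
object PartitionPruningSketch {
  // Simplified stand-in for Delta's AddFile: a relative path and its partition values.
  final case class AddFile(path: String, partitionValues: Map[String, String])

  /** Keeps only the files whose partition values satisfy every filter. */
  def filesForScan(
      allFiles: Seq[AddFile],
      partitionFilters: Seq[Map[String, String] => Boolean]): Seq[AddFile] =
    allFiles.filter(f => partitionFilters.forall(p => p(f.partitionValues)))

  /** Mirrors the shape of inputFiles: resolve each surviving file against the table root. */
  def inputFiles(tableRoot: String, files: Seq[AddFile]): Array[String] =
    files.map(f => s"$tableRoot/${f.path}").toArray
}
```

Because the filters only look at partition values recorded in the log, files can be discarded before any data file is opened.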

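From the analysis in the previous articles, we know that the Delta log records AddFile and RemoveFile actions. So how is the data actually read? Everything goes through the allFiles method.
Let's take a closer look at allFiles: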

def allFiles: Dataset[AddFile] = {
 val implicits = spark.implicits
 import implicits._
 state.where("add IS NOT NULL").select($"add".as[AddFile])
 }
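The filter in allFiles relies on each log entry being a wrapper in which only one action field is populated. A plain-collection analogue is sketched below. It is a simplification and not Delta's SingleAction: the wrapper here has only two fields and uses Option instead of nullable columns, and all names are local to the sketch.

```scala
object SingleActionSketch {
  final case class AddFile(path: String)
  final case class RemoveFile(path: String)

  // Simplified wrapper: exactly one field is expected to be set per entry.
  final case class SingleAction(
      add: Option[AddFile] = None,
      remove: Option[RemoveFile] = None)

  /** Analogue of state.where("add IS NOT NULL").select("add") on an in-memory Seq. */
  def allFiles(state: Seq[SingleAction]): Seq[AddFile] = state.flatMap(_.add)
}
```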

This calls the state method, which in turn goes through stateReconstruction:

private lazy val cachedState =
 cacheDS(stateReconstruction, s"Delta Table State #$version - $redactedPath")

 /** The current set of actions in this [[Snapshot]]. */
 def state: Dataset[SingleAction] = cachedState.getDS

stateReconstruction is used when writing a checkpoint, and it is used here as well. Its main job is to rebuild the file state by merging AddFile and RemoveFile actions:

private def stateReconstruction: Dataset[SingleAction] = {
 ...
 loadActions.mapPartitions { actions =>
     val hdpConf = hadoopConf.value.value
     actions.flatMap {
       _.unwrap match {
         case add: AddFile => Some(add.copy(path = canonicalizePath(add.path, hdpConf)).wrap)
         case rm: RemoveFile => Some(rm.copy(path = canonicalizePath(rm.path, hdpConf)).wrap)
         case other if other == null => None
         case other => Some(other.wrap)
       }
     }
    }
   ...
   .mapPartitions { iter =>
     val state = new InMemoryLogReplay(time)
     state.append(0, iter.map(_.unwrap))
     state.checkpoint.map(_.wrap)
   }
  }

The key lies in the append and checkpoint methods of InMemoryLogReplay, which are where the file state is merged. The append method looks like this:

def append(version: Long, actions: Iterator[Action]): Unit = {
  assert(currentVersion == -1 || version == currentVersion + 1,
    s"Attempted to replay version $version, but state is at $currentVersion")
  currentVersion = version
  actions.foreach {
    case a: SetTransaction =>
      transactions(a.appId) = a
    case a: Metadata =>
      currentMetaData = a
    case a: Protocol =>
      currentProtocolVersion = a
    case add: AddFile =>
      activeFiles(add.pathAsUri) = add.copy(dataChange = false)
      // Remove the tombstone to make sure we only output one `FileAction`.
      tombstones.remove(add.pathAsUri)
    case remove: RemoveFile =>
      activeFiles.remove(remove.pathAsUri)
      tombstones(remove.pathAsUri) = remove.copy(dataChange = false)
    case ci: CommitInfo => // do nothing
    case null => // Some crazy future feature. Ignore
  }
}

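The important parts are the handling of case add: AddFile and case remove: RemoveFile, together with the checkpoint method. Between them, each path ends up with a single state: an AddFile replaces any earlier tombstone for the same path, and a RemoveFile drops the path from the active files.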
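The merge rule can be exercised in isolation with a small stand-alone model. The sketch below is not InMemoryLogReplay itself: it keys files by a plain String path instead of a URI, handles only the two file actions, and the names LogReplaySketch, Replay, liveFiles and tombstonePaths are made up for this example.

```scala
import scala.collection.mutable

object LogReplaySketch {
  sealed trait FileAction { def path: String }
  final case class AddFile(path: String, size: Long) extends FileAction
  final case class RemoveFile(path: String, deletionTimestamp: Long) extends FileAction

  final class Replay {
    private var currentVersion: Long = -1L
    private val activeFiles = mutable.HashMap.empty[String, AddFile]
    private val tombstones = mutable.HashMap.empty[String, RemoveFile]

    /** Applies the actions of one commit; versions must be replayed in order. */
    def append(version: Long, actions: Iterator[FileAction]): Unit = {
      require(currentVersion == -1L || version == currentVersion + 1,
        s"Attempted to replay version $version, but state is at $currentVersion")
      currentVersion = version
      actions.foreach {
        case add: AddFile =>
          activeFiles(add.path) = add
          tombstones.remove(add.path) // a re-added path is no longer deleted
        case rm: RemoveFile =>
          activeFiles.remove(rm.path)
          tombstones(rm.path) = rm
      }
    }

    /** Files a reader should scan: every path whose latest action is an AddFile. */
    def liveFiles: Seq[AddFile] = activeFiles.values.toSeq.sortBy(_.path)

    /** Paths whose latest action is a RemoveFile. */
    def tombstonePaths: Seq[String] = tombstones.keys.toSeq.sorted
  }
}
```

Replaying a commit that adds a.parquet and b.parquet, followed by a commit that removes a.parquet and adds c.parquet, leaves b.parquet and c.parquet as the live files, which is exactly the set a reader needs.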

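Once the state has been reconstructed, filesForScan calls collect and returns a DeltaScan, from which the paths of the files to be processed are obtained.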

Finally, the TahoeLogFileIndex is passed into HadoopFsRelation to produce the BaseRelation that is returned.

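Note: the overall flow by which Spark reads the Delta format is the same as for other data formats. The main difference is that, before any data is read, the file state is merged once in memory, so that only the files whose final state is AddFile need to be read.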
