Big Data Development: Is join a Wide or a Narrow Dependency? A Look at the cogroup Implementation

Published: 2021-12-18 14:05:34 Source: 億速云 Views: 140 Author: 柒染 Category: Big Data

Is a Spark join a wide dependency or a narrow one? Many newcomers to big data development are unsure. This article works through the answer by reading how join is implemented on top of cogroup.

The sections below look at how join is implemented through cogroup, from the source code's point of view.

1. Analyze the following code

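import org.apache.spark.rdd.RDD
// HashPartitioner is used by the partitionBy calls below, so it must be imported too
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}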
object JoinDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]") 
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    
    val random = scala.util.Random
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1) 
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2) 
    println(rdd3.dependencies)
    val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    println(rdd4.dependencies)
    sc.stop() 
  }
}

Consider the code above: what does it print, is each of the two joins a wide or a narrow dependency, and why?

2. Checking the run in the Spark UI

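On the relationship between stage boundaries and wide or narrow dependencies: from section 2.1.3, "How to tell wide and narrow dependencies apart", we know that a stage boundary corresponds to a wide dependency. The stage graphs for rdd3 and rdd4 therefore show which join is wide. For rdd3, the join starts a new stage, so rdd3 is produced through a wide dependency. rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3))) has a separate graph, and in it no new stage is created after the partitionBy calls, so that join is a narrow dependency.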

3. The source implementation of join

The conclusion above was read off the UI graphs. Now let's see how join is implemented in the source (based on Spark 2.4.5).

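Start with the entry method. withScope can be thought of as a decorator whose purpose is to let the Spark UI display more information. Every method that creates an RDD is wrapped in it, and RDDOperationScope records the operation history and relationships of the RDDs, which is what makes that display possible.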

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    join(other, defaultPartitioner(self, other))
  }
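The decorator idea can be illustrated with a toy by-name wrapper in plain Scala. This is only a sketch of the pattern, not Spark's implementation: ScopeSketch, history, and the depth counter are hypothetical names, and Spark's real withScope tracks scopes through RDDOperationScope rather than a list like this.

```scala
import scala.collection.mutable.ArrayBuffer

object ScopeSketch {
  // Hypothetical log of (scope name, nesting depth); not part of Spark.
  val history: ArrayBuffer[(String, Int)] = ArrayBuffer.empty
  private var depth = 0

  // Records the scope, then runs `body` inside it.
  // `body` is passed by name, so it is evaluated only after the scope is recorded.
  def withScope[T](name: String)(body: => T): T = {
    history += ((name, depth))
    depth += 1
    try body finally depth -= 1
  }

  // Two "operators" wrapped the same way the RDD methods are wrapped.
  def join(a: Seq[Int], b: Seq[Int]): Seq[Int] = withScope("join") {
    cogroup(a, b)
  }

  def cogroup(a: Seq[Int], b: Seq[Int]): Seq[Int] = withScope("cogroup") {
    a.intersect(b)
  }
}
```

Calling ScopeSketch.join once leaves two entries in history, one for join and one for the nested cogroup, without either method body having to know about the bookkeeping.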

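Next, look at defaultPartitioner. It chooses between an existing partitioner on the input RDDs and the default: the existing partitioner with the most partitions is reused when it is eligible or has more partitions than the default number; otherwise a new HashPartitioner with the default number of partitions is returned.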

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    // Keep only the RDDs that already have a partitioner set
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    
    // If any RDD has a partitioner, pick the one with the most partitions
    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }
 
    // If spark.default.parallelism is set, use it; otherwise use the largest partition count among the inputs
    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }
 
    // If the existing max partitioner is an eligible one, or its partitions number is larger
    // than the default number of partitions, use the existing partitioner.
    // In other words: check that an input RDD has a partitioner and that it is eligible,
    // or that its number of partitions is larger than the default number of partitions.
    // If so, reuse that existing partitioner; otherwise create a HashPartitioner with the default number.
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }

  private def isEligiblePartitioner(
     hasMaxPartitioner: RDD[_],
     rdds: Seq[RDD[_]]): Boolean = {
    val maxPartitions = rdds.map(_.partitions.length).max
    log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
  }
}
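To make the selection rule easier to follow, here is a minimal plain-Scala model of the same decision. It is a sketch under assumptions: RddInfo, Choice, ReuseExisting, and NewHash are hypothetical stand-ins that keep only the partition counts defaultPartitioner looks at, and defaultParallelism models whether spark.default.parallelism is set.

```scala
import scala.math.log10

object DefaultPartitionerSketch {
  // Hypothetical stand-in for an RDD: its partition count and whether it has a partitioner.
  final case class RddInfo(numPartitions: Int, hasPartitioner: Boolean)

  sealed trait Choice
  final case class ReuseExisting(numPartitions: Int) extends Choice // reuse an input's partitioner
  final case class NewHash(numPartitions: Int) extends Choice       // new HashPartitioner(n)

  // Same test as isEligiblePartitioner: the candidate is eligible when the largest
  // input partition count is less than 10x the candidate's partition count.
  def isEligible(candidate: RddInfo, rdds: Seq[RddInfo]): Boolean = {
    val maxPartitions = rdds.map(_.numPartitions).max
    log10(maxPartitions) - log10(candidate.numPartitions) < 1
  }

  def choose(rdds: Seq[RddInfo], defaultParallelism: Option[Int]): Choice = {
    val partitioned = rdds.filter(r => r.hasPartitioner && r.numPartitions > 0)
    val defaultNum = defaultParallelism.getOrElse(rdds.map(_.numPartitions).max)
    if (partitioned.isEmpty) {
      NewHash(defaultNum)
    } else {
      val candidate = partitioned.maxBy(_.numPartitions)
      if (isEligible(candidate, rdds) || defaultNum < candidate.numPartitions)
        ReuseExisting(candidate.numPartitions)
      else
        NewHash(defaultNum)
    }
  }
}
```

With two inputs that have no partitioner, as with rdd1 and rdd2 in the demo, the model returns NewHash. With both inputs already partitioned into 3 partitions, as in the rdd4 case, it returns ReuseExisting(3).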

Next, step into the overloaded join method. It calls cogroup, which constructs new CoGroupedRDD[K](Seq(self, other), partitioner).

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  // partitioner is the one selected by the comparison above; what matters most is its number of partitions
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
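The way join is derived from cogroup can be reproduced on ordinary local collections. The following is a sketch only: LocalCogroupJoin and its methods are hypothetical helpers that work on Seq rather than RDD, so there is no partitioning or shuffle involved, just the grouping and the per-key cross product.

```scala
object LocalCogroupJoin {
  // Group both sides by key; every key seen on either side gets an entry,
  // with an empty Seq for the side on which the key is missing.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val r = right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
    }.toMap
  }

  // Same shape as the flatMapValues step in join: a cross product per key.
  // A key missing on one side yields no pairs, which gives inner-join semantics.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
}
```

For example, joining Seq((0, "user1"), (1, "user2"), (0, "user3")) with Seq((0, "BJ"), (0, "CD"), (2, "GZ")) yields four pairs for key 0 and nothing for keys 1 and 2.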


  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
    join(other, new HashPartitioner(numPartitions))
  }

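Finally, look at CoGroupedRDD, which is where the choice between a wide and a narrow dependency is made. For each parent RDD, if its partitioner equals the partitioner selected above, the dependency is narrow (OneToOneDependency); otherwise it is wide (ShuffleDependency).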

  override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
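The branch above can be modeled in a few lines of plain Scala. This is a sketch with hypothetical stand-ins: HashPart, Dep, OneToOne, and Shuffle are not Spark classes, and HashPart assumes that two hash partitioners are equal when their partition counts are equal.

```scala
object CoGroupDependencySketch {
  // Hypothetical stand-in for a hash partitioner; equality is by partition count.
  final case class HashPart(numPartitions: Int)

  sealed trait Dep
  case object OneToOne extends Dep // narrow: no shuffle needed for this parent
  case object Shuffle extends Dep  // wide: this parent must be repartitioned

  // One decision per parent, mirroring `rdd.partitioner == Some(part)`.
  def decide(parentPartitioner: Option[HashPart], part: HashPart): Dep =
    if (parentPartitioner == Some(part)) OneToOne else Shuffle

  def dependencies(parents: Seq[Option[HashPart]], part: HashPart): Seq[Dep] =
    parents.map(p => decide(p, part))
}
```

Note that the decision is made per parent, so a join can be narrow on one side and wide on the other when only one input already uses the chosen partitioner.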

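join lets you specify the number of partitions. If the RDDs on both sides of the join are already partitioned the same way and into the same number of partitions as the join's partitioner, no shuffle occurs and the dependency is narrow. Otherwise a shuffle occurs and the dependency is wide. The partitioner is what captures both the partitioning scheme and the number of partitions.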
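Why a shared partitioner removes the need for a shuffle can be seen by partitioning two local collections with the same rule. The sketch below assumes the usual hash partitioning scheme, a non-negative hashCode modulo the number of partitions; HashPartitionSketch and its methods are hypothetical helpers, and null keys are not handled.

```scala
object HashPartitionSketch {
  // Non-negative (hashCode mod numPartitions); null keys are not handled here.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Split a keyed collection into partitions indexed by partitionOf.
  def partition[K, V](data: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
}
```

When both sides are partitioned with the same rule and the same partition count, every key lands at the same partition index on both sides, so partition i of the join only needs partition i of each parent. That is the one-to-one case.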
