What Is Distinct Count Used For?

Published: 2022-01-15 10:33:07 | Source: Yisu Cloud (亿速云) | Author: iii | Category: Big Data

Distinct Count is a common operation that many people do not fully understand. This article summarizes, step by step, how to compute it in Hive, Pig, and Spark.

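Big data is an IT industry term for data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. Such data sets are high-volume, fast-growing, and diverse information assets that need new processing models to deliver stronger decision-making, insight discovery, and process optimization.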

Hive

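In big-data settings, an important item in many reports is the UV (Unique Visitor) statistic, that is, the number of distinct users in a given period. For example, to see how users are distributed across apps within one week, the HiveQL query is: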

select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app;
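To make the semantics concrete, here is a minimal Python sketch of what an exact per-app distinct count computes. The function name `exact_uv` and the sample rows are hypothetical and only serve as illustration; this is not how Hive executes the query.

```python
from collections import defaultdict

def exact_uv(rows):
    """Return {app: number of distinct uids} for an iterable of (app, uid) pairs."""
    seen = defaultdict(set)
    for app, uid in rows:
        seen[app].add(uid)  # a set keeps each uid only once per app
    return {app: len(uids) for app, uids in seen.items()}

# Hypothetical sample log rows: (app, uid)
rows = [("mail", "u1"), ("mail", "u1"), ("mail", "u2"), ("maps", "u1")]
print(exact_uv(rows))  # {'mail': 2, 'maps': 1}
```

Note that the exact approach has to remember every distinct uid per app, so its memory use grows with the cardinality.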

Pig

Similarly, in Pig:

-- distinct count over all users
define DISTINCT_COUNT(A, a) returns dist {
    B = foreach $A generate $a;
    unique_B = distinct B;
    C = group unique_B all;
    $dist = foreach C generate SIZE(unique_B);
};
A = load '/path/to/data' using PigStorage() as (app, uid);
B = DISTINCT_COUNT(A, uid);

-- distinct count per app
A = load '/path/to/data' using PigStorage() as (app, uid);
B = distinct A;
C = group B by app;
D = foreach C generate group as app, COUNT($1) as uv;
-- alternative, suitable for small-cardinality scenarios
D = foreach C generate group as app, SIZE($1) as uv;

DataFu provides Pig with a cardinality-estimation UDF, datafu.pig.stats.HyperLogLogPlusPlus. It uses the HyperLogLog++ algorithm and trades exactness for speed, returning an approximate distinct count:

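-- the DataFu jar must be registered before this UDF can be defined
define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
A = load '/path/to/data' using PigStorage() as (app, uid);
B = group A by app;
-- $1 is the bag of A's tuples collected for each app
C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;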
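To show the idea behind such estimators, below is a minimal Python sketch of plain HyperLogLog. It is not HyperLogLog++ and not DataFu's implementation. The class name, the choice of SHA-1 as the hash, the default precision `p=12`, and the helper `approx_uv` are all assumptions made for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Plain HyperLogLog with 2**p registers (illustration only, assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the item to a 64-bit integer.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                 # first p bits select a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)       # small-range correction (linear counting)
        return raw

def approx_uv(rows, p=12):
    """Approximate {app: distinct uid count} for (app, uid) pairs, one sketch per app."""
    sketches = {}
    for app, uid in rows:
        if app not in sketches:
            sketches[app] = HyperLogLog(p)
        sketches[app].add(uid)
    return {app: round(s.estimate()) for app, s in sketches.items()}
```

Each sketch uses a fixed number of registers regardless of how many uids it sees, which is why this family of algorithms scales to very large cardinalities at the cost of a small relative error.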

Spark

In Spark, after loading the data, a chain of RDD transformations (map, distinct, reduceByKey) performs the distinct count:

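// map to (app, uid) pairs, drop duplicates, then count one per remaining pair
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map { line => (line._1, 1) }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .mapValues { _ => 1 }
  .reduceByKey(_ + _)

// or: countByValue() is an action that returns a local Map to the driver
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map(_._1)
  .countByValue()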

Spark also provides an API for approximate distinct counts:

// the argument is the relative accuracy (relativeSD); smaller values need more space
rdd.map { row => (row.app, row.uid) }
  .countApproxDistinctByKey(0.001)

According to the API documentation, the implementation is based on the HyperLogLog algorithm:

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
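As a rough guide to what a relative accuracy such as 0.001 costs, the sketch below uses the textbook HyperLogLog bound, where the standard error is about 1.04 / sqrt(m) for m = 2^p registers. The function name `registers_for` is hypothetical, and Spark's internal mapping from relativeSD to a precision may differ from this calculation.

```python
import math

def registers_for(relative_sd):
    """Smallest p and m = 2**p such that 1.04 / sqrt(m) <= relative_sd."""
    needed = (1.04 / relative_sd) ** 2
    p = math.ceil(math.log2(needed))
    return p, 1 << p

for sd in (0.05, 0.01, 0.001):
    p, m = registers_for(sd)
    print(f"relativeSD={sd}: p={p}, registers={m}")
```

Under this bound, a target of 0.001 needs about two million registers per key, whereas 0.05 needs only 512, so very tight accuracy settings can be expensive when there are many keys.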

Alternatively, convert the RDD, once it has a schema, into a DataFrame, call registerTempTable, and then run a SQL query:

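import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // needed for rdd.toDF()

val df = rdd.toDF()
// Spark 1.x API; since Spark 2.0 registerTempTable is deprecated in favor of createOrReplaceTempView
df.registerTempTable("app_table")

val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")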

