This article looks at what Distinct Count is used for and how to compute it in Hive, Pig, and Spark.
Big data is an IT industry term for data sets that cannot be captured, managed, and processed with conventional software tools within a reasonable time. It refers to high-volume, fast-growing, and diverse information assets that need new processing models to deliver stronger decision-making, insight discovery, and process optimization.
Hive
In big data scenarios, an important report item is the UV (Unique Visitor) count, that is, the number of distinct users within a given time period. For example, to see how users are distributed across apps over one week, the HiveQL in Hive is:
    select app, count(distinct uid) as uv
    from log_table
    where week_cal = '2016-03-27'
    group by app
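The intent of the query, one distinct-user count per app for a given week, can be sketched in plain Python. This is only an illustration of the semantics, not of how Hive executes the query; the function name `exact_uv` and the sample rows are made up for the example.

```python
from collections import defaultdict


def exact_uv(rows, week):
    """rows: iterable of (app, uid, week_cal) tuples.

    Returns {app: number of distinct uids} for the given week.
    """
    users_by_app = defaultdict(set)
    for app, uid, week_cal in rows:
        if week_cal == week:                 # where week_cal = ...
            users_by_app[app].add(uid)       # a set keeps each uid once
    # group by app, count(distinct uid)
    return {app: len(uids) for app, uids in users_by_app.items()}


# Hypothetical sample data.
rows = [
    ("mail", "u1", "2016-03-27"),
    ("mail", "u1", "2016-03-27"),   # repeat visit, counted once
    ("mail", "u2", "2016-03-27"),
    ("maps", "u1", "2016-03-27"),
    ("maps", "u3", "2016-03-20"),   # different week, filtered out
]
print(exact_uv(rows, "2016-03-27"))  # {'mail': 2, 'maps': 1}
```

Keeping one set per app means memory grows with the number of distinct users, which is what motivates the approximate methods discussed later.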
Pig
Similarly, in Pig:
    -- all users
    define DISTINCT_COUNT(A, a) returns dist {
        B = foreach $A generate $a;
        unique_B = distinct B;
        C = group unique_B all;
        $dist = foreach C generate SIZE(unique_B);
    }
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = DISTINCT_COUNT(A, uid);

    -- users per app
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = distinct A;
    C = group B by app;
    D = foreach C generate group as app, COUNT($1) as uv;
    -- suitable for small cardinality scenarios
    D = foreach C generate group as app, SIZE($1) as uv;
DataFu provides Pig with a cardinality-estimation UDF, datafu.pig.stats.HyperLogLogPlusPlus. It uses the HyperLogLog++ algorithm to compute an approximate Distinct Count faster:
    define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = group A by app;
    C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
Spark
In Spark, once the data is loaded, Distinct Count can be computed through a chain of RDD transformations (map, distinct, reduceByKey):
    // deduplicate (app, uid) pairs, then count the remaining pairs per app
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map { line => (line._1, 1) }
      .reduceByKey(_ + _)

    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .mapValues { _ => 1 }
      .reduceByKey(_ + _)

    // or (countByValue returns a local Map to the driver)
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map(_._1)
      .countByValue()
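The same sequence of steps can be followed on a small in-memory list. The sketch below is plain Python, not Spark code, and only mirrors the logic of the third variant: deduplicate the (app, uid) pairs, keep the key, and count occurrences per key. The function name `uv_by_app` and the sample rows are hypothetical.

```python
from collections import Counter


def uv_by_app(rows):
    """rows: iterable of (app, uid) pairs; returns a Counter of distinct uids per app."""
    pairs = set(rows)                          # map + distinct: one entry per (app, uid)
    return Counter(app for app, _ in pairs)    # map(_._1) + countByValue


# Hypothetical sample data.
rows = [("mail", "u1"), ("mail", "u1"), ("mail", "u2"), ("maps", "u1")]
result = uv_by_app(rows)
print(dict(result))
```

Because duplicates are removed before counting, each remaining pair contributes exactly one to its app's total, which is why a plain per-key sum gives the distinct count.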
Spark also provides an API for approximate Distinct Count:
    rdd.map { row => (row.app, row.uid) }
      .countApproxDistinctByKey(0.001)
The implementation is based on the HyperLogLog algorithm:
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
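To give a feel for how HyperLogLog-style estimation works, here is a minimal sketch of the basic algorithm in plain Python. It is an illustration under stated assumptions, not the streamlib, Spark, or DataFu implementation: it omits the HyperLogLog++ refinements (sparse representation, empirical bias correction), and the SHA-1 hash, the default precision `p=12`, and the names `HyperLogLog` and `approx_uv` are choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Basic HyperLogLog estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant for the raw estimate (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash from the first 8 bytes of SHA-1 (an arbitrary choice here).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: use linear counting while registers are still empty.
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)
        return raw


def approx_uv(rows, p=12):
    """rows: iterable of (app, uid) pairs; returns {app: estimated distinct uids}."""
    sketches = {}
    for app, uid in rows:
        sketches.setdefault(app, HyperLogLog(p)).add(uid)
    return {app: sketch.count() for app, sketch in sketches.items()}


# Hypothetical data: 50,000 distinct users, each visiting the app twice.
rows = [("mail", f"user-{i}") for i in range(50000)] * 2
print(approx_uv(rows))
```

Each sketch uses a fixed number of registers regardless of how many users it sees, and duplicates never change a register, so repeat visits do not inflate the estimate. The standard error of basic HyperLogLog is roughly 1.04 / sqrt(m), about 1.6% for the 4096 registers used here; a tighter accuracy target requires more registers.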
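Alternatively, convert an RDD that carries a schema into a DataFrame, call registerTempTable, and then run the SQL statement:
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // needed for rdd.toDF()

    val df = rdd.toDF()
    df.registerTempTable("app_table")
    val appUsers = sqlContext.sql(
      "select app, count(distinct uid) as uv from app_table group by app")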