How to Compute count in a Distributed Database

Published: 2021-11-10 16:30:10  Source: Yisu Cloud (億速云)  Views: 188  Author: iii  Category: Relational Databases

This article explains how to compute count in a distributed database, a question that often causes confusion in day-to-day work. It walks through a simple, practical approach with worked examples.

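Background

In a distributed database, computing count(distinct xxx) requires the following steps on the distinct column:

1. Deduplicate the values on each node.

2. Redistribute the deduplicated data. (If there are very many distinct values, this step is time-consuming.)

3. Deduplicate again.

4. Run count(xxx) on each node.

5. Sum the counts from all nodes.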
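The five steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, not how Greenplum or Citus implement it: the function name `distributed_count_distinct` and the list-of-lists model of nodes are assumptions made for the example, and Python's built-in `hash` stands in for the database's distribution hash.

```python
def distributed_count_distinct(segments):
    """Exact count(distinct) over `segments`, one list of values per node."""
    n = len(segments)
    if n == 0:
        return 0

    # Step 1: every node deduplicates its own rows.
    local_distinct = [set(values) for values in segments]

    # Step 2: redistribute by hash so equal values meet on the same node.
    # This is the expensive step when there are many distinct values.
    redistributed = [[] for _ in range(n)]
    for values in local_distinct:
        for v in values:
            redistributed[hash(v) % n].append(v)

    # Step 3: deduplicate again, since the same value may arrive
    # from several source nodes.
    final_distinct = [set(values) for values in redistributed]

    # Step 4: each node counts its own distinct values.
    local_counts = [len(s) for s in final_distinct]

    # Step 5: sum the per-node counts.
    return sum(local_counts)


# Three hypothetical nodes holding overlapping values.
segments = [[1, 2, 2, 3], [3, 4, 4], [1, 5, 5, 5]]
result = distributed_count_distinct(segments)
print(result)
```

Because step 2 sends every distinct value to exactly one node, the per-node counts in step 4 never overlap, which is why a plain sum in step 5 is correct.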

For example, here is a Greenplum execution plan:

postgres=# explain analyze select count(distinct c_acctbal) from customer;  
                                                                             QUERY PLAN                                                                               
--------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=182242.41..182242.42 rows=1 width=8)  
   Rows out:  1 rows with 0.006 ms to first row, 69 ms to end, start offset by 23 ms.  
   ->  Gather Motion 16:1  (slice2; segments: 16)  (cost=53392.85..173982.82 rows=660767 width=8)  
         Rows out:  818834 rows at destination with 3.416 ms to first row, 447 ms to end, start offset by 23 ms.  
         ->  HashAggregate  (cost=53392.85..61652.43 rows=41298 width=8)  
               Group By: customer.c_acctbal  
               Rows out:  Avg 51177.1 rows x 16 workers.  Max 51362 rows (seg3) with 0.004 ms to first row, 33 ms to end, start offset by 25 ms.  
               ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=30266.00..43481.34 rows=41298 width=8)  
                     Hash Key: customer.c_acctbal  
                     Rows out:  Avg 89865.6 rows x 16 workers at destination.  Max 90305 rows (seg3) with 18 ms to first row, 120 ms to end, start offset by 25 ms.  
                     ->  HashAggregate  (cost=30266.00..30266.00 rows=41298 width=8)  
                           Group By: customer.c_acctbal  
                           Rows out:  Avg 89865.6 rows x 16 workers.  Max 89929 rows (seg2) with 0.007 ms to first row, 33 ms to end, start offset by 26 ms.  
                           ->  Append-only Columnar Scan on customer  (cost=0.00..22766.00 rows=93750 width=8)  
                                 Rows out:  Avg 93750.0 rows x 16 workers.  Max 93751 rows (seg4) with 20 ms to first row, 30 ms to end, start offset by 26 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 387K bytes.  
   (slice1)    Executor memory: 6527K bytes avg x 16 workers, 6527K bytes max (seg0).  
   (slice2)    Executor memory: 371K bytes avg x 16 workers, 371K bytes max (seg0).  
 Statement statistics:  
   Memory used: 1280000K bytes  
 Optimizer status: legacy query optimizer  
 Total runtime: 723.143 ms  
(23 rows)

Here is a Citus example:

postgres=# explain analyze select count(distinct bid) from pgbench_accounts ;  
                                                                            QUERY PLAN                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=31.748..31.749 rows=1 loops=1)  
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=31.382..31.510 rows=1280 loops=1)  
         Task Count: 128  
         Tasks Shown: One of 128  
         ->  Task  
               Node: host=172.24.211.224 port=1921 dbname=postgres  
               ->  HashAggregate  (cost=231.85..231.95 rows=10 width=4) (actual time=3.700..3.702 rows=10 loops=1)  
                     Group Key: bid  
                     ->  Seq Scan on pgbench_accounts_106812 pgbench_accounts  (cost=0.00..212.48 rows=7748 width=4) (actual time=0.017..2.180 rows=7748 loops=1)  
                   Planning time: 0.445 ms  
                   Execution time: 3.781 ms  
 Planning time: 1.399 ms  
 Execution time: 32.159 ms  
(13 rows)

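For scenarios where an estimate is acceptable, that is, where an exact distinct count is not required, PostgreSQL offers an extension named hll that can estimate the number of distinct elements.

Combined with hll, Citus can run count(distinct xxx) very fast, and it stays fast even when the number of distinct values is very large.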

SET citus.count_distinct_error_rate to 0.005;  
  
0.005 is the error rate, i.e. the allowed distortion of the estimate.
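To see why an estimate avoids the redistribution cost, here is a toy HyperLogLog sketch in Python. It is a minimal illustration of the general technique, not the postgresql-hll implementation: the class name, the SHA-1 based hash, and the default `log2m=14` are all assumptions chosen for the example. The key property is that each worker can build a small fixed-size sketch and the coordinator only merges sketches, so no row data has to be shipped.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch; not the postgresql-hll implementation."""

    def __init__(self, log2m=14):
        self.p = log2m                 # number of index bits
        self.m = 1 << log2m            # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the first 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


# Four hypothetical shards with overlapping values; their union is range(200000).
shards = [range(i * 40000, i * 40000 + 80000) for i in range(4)]
merged = HyperLogLog()
for values in shards:
    sketch = HyperLogLog()          # built on the "worker"
    for v in values:
        sketch.add(v)
    merged.merge(sketch)            # combined on the "coordinator"

estimate = merged.count()
print(round(estimate), "estimated vs", 200000, "exact")
```

Merging is lossless with respect to the sketch: the merged registers are identical to those of a single sketch fed every value, so overlapping values across shards are not double counted.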

Using hll to accelerate Citus count(distinct xxx): an example

Deployment

1. On all nodes (coordinator and worker nodes), install hll

yum install -y gcc-c++  
  
cd ~/  
  
git clone https://github.com/citusdata/postgresql-hll  
  
cd postgresql-hll  
  
. /var/lib/pgsql/.bash_profile   
  
USE_PGXS=1 make  
USE_PGXS=1 make install

2. On all nodes (coordinator and worker nodes), create the extension in every database that needs HLL

su - postgres -c "psql -d postgres -c 'create extension hll;'"  
  
su - postgres -c "psql -d newdb -c 'create extension hll;'"

Usage example

1. Create a test table with 128 shards

create table test (id int primary key, a int, b int, c int);  
  
set citus.shard_count =128;   
  
select create_distributed_table('test', 'id');

2. Load 1 billion rows of test data: column a has 10 distinct values, column b has 100, and column c has 1 million

insert into test select id, random()*9, random()*99, random()*999999 from generate_series(1,1000000000) t(id);

3. (On the coordinator) set the parameter globally or for the current session to specify the error rate; the smaller the value, the smaller the distortion

SET citus.count_distinct_error_rate to 0.005;  
  
newdb=# explain select count(distinct bid) from pgbench_accounts group by bid;  
                                                          QUERY PLAN                                                             
-------------------------------------------------------------------------------------------------------------------------------  
 HashAggregate  (cost=0.00..0.00 rows=0 width=0)  
   Group Key: remote_scan.worker_column_2  
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0)  
         Task Count: 128  
         Tasks Shown: One of 128  
         ->  Task  
               Node: host=172.24.211.224 port=8001 dbname=newdb  
               ->  GroupAggregate  (cost=97272.79..105102.29 rows=1000 width=36)  
                     Group Key: bid  
                     ->  Sort  (cost=97272.79..99227.04 rows=781700 width=4)  
                           Sort Key: bid  
                           ->  Seq Scan on pgbench_accounts_102008 pgbench_accounts  (cost=0.00..20759.00 rows=781700 width=4)  
(12 rows)

4. Compare with and without HLL (few distinct values: HLL gives no speedup, because there is no bottleneck to begin with)

4.1 Without hll

newdb=# set citus.count_distinct_error_rate to 0;  
newdb=# select count(distinct bid) from pgbench_accounts;  
 count   
-------  
  1000  
(1 row)  
  
Time: 423.364 ms  
  
postgres=# set citus.count_distinct_error_rate to 0;  
postgres=# select count(distinct a) from test;  
 count   
-------  
    10  
(1 row)  
  
Time: 2392.709 ms (00:02.393)

4.2 With hll

newdb=# set citus.count_distinct_error_rate to 0.005;  
newdb=# select count(distinct bid) from pgbench_accounts;  
 count   
-------  
  1000  
(1 row)  
  
Time: 444.287 ms  
  
postgres=# set citus.count_distinct_error_rate to 0.005;  
postgres=# select count(distinct a) from test;  
 count   
-------  
    10  
(1 row)  
  
Time: 2375.473 ms (00:02.375)

5. Compare with and without HLL (many distinct values: HLL gives a significant speedup)

5.1 Without hll

postgres=# set citus.count_distinct_error_rate to 0;  
postgres=# select count(distinct (a,c)) from test;  
  count   
----------
 10000000
(1 row)
Time: 5826241.205 ms (01:37:06.241)

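With 128 nodes, each sending up to 1 billion / 128 rows to the coordinator, the slowness is understandable. On the other hand, the coordinator could deduplicate while it receives data (PostgreSQL 11 added capabilities such as parallel gather and merge sort that the Citus coordinator could borrow); there is no need to wait until all data has arrived before deduplicating.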
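The idea of deduplicating while receiving can be sketched as follows. This is a conceptual illustration only and says nothing about Citus internals: `shard_rows` is a hypothetical generator standing in for rows arriving from a shard, and `streaming_count_distinct` is a made-up name. The point is that the coordinator's working set is bounded by the number of distinct values rather than by the total number of rows received.

```python
from itertools import zip_longest


def shard_rows(shard_id, rows_per_shard):
    """Hypothetical stream of (a, c) rows arriving from one shard."""
    for i in range(rows_per_shard):
        yield (i % 10, (shard_id * 7 + i) % 1000)


def streaming_count_distinct(shard_streams):
    """Deduplicate rows as they arrive instead of buffering all of them."""
    seen = set()
    end = object()                      # marks an exhausted stream
    # Interleave the streams to mimic rows arriving from all shards at once.
    for batch in zip_longest(*shard_streams, fillvalue=end):
        for row in batch:
            if row is not end:
                seen.add(row)           # duplicates are dropped immediately
    return len(seen)


streams = [shard_rows(s, 5000) for s in range(8)]
streamed = streaming_count_distinct(streams)

# Reference: collect every row first, then deduplicate.
all_rows = [row for s in range(8) for row in shard_rows(s, 5000)]
buffered = len(set(all_rows))
print(streamed, buffered)
```

Both approaches return the same count; the streaming variant never holds the 40,000 raw rows in memory at once.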

5.2 With hll

postgres=# set citus.count_distinct_error_rate to 0.005;  
postgres=# select count(distinct (a,c)) from test;  
  count    
---------  
 9999995  
(1 row)  
  
Time: 4468.749 ms (00:04.469)

6. Performance comparison with different precision settings

newdb=# set citus.count_distinct_error_rate to 0.1;  
newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;  
  count     
----------  
 94778491  
(1 row)  
Time: 545.301 ms  
  
newdb=# set citus.count_distinct_error_rate to 0.01;  
newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;  
   count     
-----------  
 100293937  
(1 row)  
Time: 554.333 ms  
  
-- recommended setting: 0.005  
  
newdb=# set citus.count_distinct_error_rate to 0.005;  
newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;  
   count     
-----------  
 100136086  
(1 row)  
Time: 1053.070 ms (00:01.053)  
  
newdb=# set citus.count_distinct_error_rate to 0.001;  
newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;  
   count     
-----------  
 100422107  
(1 row)  
Time: 9287.934 ms (00:09.288)
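One way to reason about why a smaller error rate costs more time is the standard HyperLogLog relation between the number of registers m and the typical relative error, roughly 1.04 / sqrt(m). The sketch below inverts that relation to get the register count needed for each setting used above. Whether Citus maps citus.count_distinct_error_rate to registers in exactly this way is an assumption here, and `registers_for_error_rate` is a hypothetical helper, not a Citus or hll function.

```python
import math


def registers_for_error_rate(error_rate):
    """Smallest power-of-two register count with 1.04 / sqrt(m) <= error_rate."""
    needed = (1.04 / error_rate) ** 2
    log2m = math.ceil(math.log2(needed))
    return log2m, 1 << log2m


for rate in (0.1, 0.01, 0.005, 0.001):
    log2m, m = registers_for_error_rate(rate)
    print(f"error_rate={rate}: log2m={log2m}, registers={m}")
```

Under this assumption, going from 0.005 to 0.001 raises the register count from 65,536 to 2,097,152, a 32-fold increase in sketch size per task, which would be consistent with the jump in run time between those two settings.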

