怎么在postgreSQL中對記錄數(shù)進行計算

發(fā)布時間：2020-12-28 14:55:12 來源：億速云閱讀：534 作者：Leah 欄目：開發(fā)技術

怎么在postgreSQL中對記錄數(shù)進行計算？相信很多沒有經(jīng)驗的人對此束手無策，為此本文總結了問題出現(xiàn)的原因和解決方法，通過這篇文章希望你能解決這個問題。

一般方法

select count(1) from table_name;

全量掃描一遍表，記錄越多，查詢速度越慢

新法

PostgreSQL 還真提供了一個這樣的途徑，那就是系統(tǒng)表 pg_class，這個系統(tǒng)表里頭，存儲著每個表的統(tǒng)計信息，其中 reltuples 就是對應的表的統(tǒng)計行,統(tǒng)計行的數(shù)據(jù)是pg有個獨立進程，定期掃描不同的表，收集這些表的統(tǒng)計信息，保存在系統(tǒng)表里頭。

方法如下：

select 
 reltuples::int as total 
from 
 pg_class 
where 
 relname = 'table_name' 
 and relnamespace = (select oid from pg_namespace where nspname = 'schema');

新方法不是通用的，如果要求特精確還是使用select count(1),如果是類似分頁的，且分頁數(shù)量超過比較多的，也不是要求特別精準的，這就是一個好方法！

count(1) over 計算記錄數(shù)

select count(1) over(), * from table_name;

補充

count 是最常用的聚集函數(shù)之一，看似簡單，其實還是有坑的，如：

1、count(*)：返回結果集的行數(shù)，是null也統(tǒng)計

2、count(1)：和count(*)基本沒區(qū)別，pg92之前都是掃描全表的，pg92之后增加了index only scan一般會變成掃主鍵索引，如果沒有主鍵或者是表的列很多的情況下，count(1)快一些，因為不會考慮表的全部字段

3、count(field)：返回數(shù)據(jù)表中指定字段值不等于null的行數(shù)

拓展：理解 PostgreSQL 的 count 函數(shù)的行為

關于 count 函數(shù)的使用一直存在爭議，尤其是在 MySQL 中，作為流行度越來越高的 PostgreSQL 是否也有類似的問題呢，我們通過實踐來理解一下 PostgreSQL 中 count 函數(shù)的行為。

構建測試數(shù)據(jù)庫

創(chuàng)建測試數(shù)據(jù)庫，并創(chuàng)建測試表。測試表中有自增 ID、創(chuàng)建時間、內容三個字段，自增 ID 字段是主鍵。

create database performance_test;
create table test_tbl (id serial primary key, created_at timestamp, content varchar(512));

生成測試數(shù)據(jù)

使用 generate_series 函數(shù)生成自增 ID，使用 now() 函數(shù)生成 created_at 列，對于 content 列，使用了 repeat(md5(random()::text), 10) 生成 10 個 32 位長度的 md5 字符串。使用下列語句，插入 1000w 條記錄用于測試。

performance_test=# insert into test_tbl select generate_series(1,10000000),now(),repeat(md5(random()::text),10); INSERT 0 10000000 Time: 212184.223 ms (03:32.184)

由 count 語句引發(fā)的思考

默認情況下 PostgreSQL 不開啟 SQL 執(zhí)行時間的顯示，所以需要手動開啟一下，方便后面的測試對比。

\timing on

count(*) 和 count(1) 的性能區(qū)別是經(jīng)常被討論的問題，分別使用 count(*) 和 count(1) 執(zhí)行一次查詢。

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 115090.380 ms (01:55.090)
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 738.502 ms

可以看到兩次查詢的速度差別非常大，count(1) 真的有這么大的性能提升？接下來再次運行查詢語句。

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 657.831 ms
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 682.157 ms

可以看到第一次查詢時候會非常的慢，后面三次速度非?？觳⑶視r間相近，這里就有兩個問題出現(xiàn)了：

為什么第一次查詢速度這么慢？

count(*) 和 count(1) 到底存不存在性能差別？

查詢緩存

使用 explain 語句重新執(zhí)行查詢語句

explain (analyze,buffers,verbose) select count(*) from test_tbl;

可以看到如下輸出：

Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=882.569..882.570 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=96 read=476095
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=882.492..884.170 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=96 read=476095
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=881.014..881.014 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=96 read=476095
        Worker 0: actual time=880.319..880.319 rows=1 loops=1
         Buffers: shared hit=34 read=158206
        Worker 1: actual time=880.369..880.369 rows=1 loops=1
         Buffers: shared hit=29 read=156424
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=0.029..662.165 rows=3333333 loops=3)
           Buffers: shared hit=96 read=476095
           Worker 0: actual time=0.026..661.807 rows=3323029 loops=1
            Buffers: shared hit=34 read=158206
           Worker 1: actual time=0.030..660.197 rows=3285513 loops=1
            Buffers: shared hit=29 read=156424
 Planning time: 0.043 ms
 Execution time: 884.207 ms

注意里面的 shared hit，表示命中了內存中緩存的數(shù)據(jù)，這就可以解釋為什么后面的查詢會比第一次快很多。接下來去掉緩存，并重啟 PostgreSQL。

service postgresql stop
echo 1 > /proc/sys/vm/drop_caches
service postgresql start

重新執(zhí)行 SQL 語句，速度慢了很多。

 Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=50604.564..50604.564 rows=1 loops=1)
  Output: count(*)
  Buffers: shared read=476191
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=50604.508..50606.141 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared read=476191
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=50591.550..50591.551 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared read=476191
        Worker 0: actual time=50585.182..50585.182 rows=1 loops=1
         Buffers: shared read=158122
        Worker 1: actual time=50585.181..50585.181 rows=1 loops=1
         Buffers: shared read=161123
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=92.491..50369.691 rows=3333333 loops=3)
           Buffers: shared read=476191
           Worker 0: actual time=122.170..50362.271 rows=3320562 loops=1
            Buffers: shared read=158122
           Worker 1: actual time=14.020..50359.733 rows=3383583 loops=1
            Buffers: shared read=161123
 Planning time: 11.537 ms
 Execution time: 50606.215 ms

shared read 表示沒有命中緩存，通過這個現(xiàn)象可以推斷出，上一小節(jié)的四次查詢中，第一次查詢沒有命中緩存，剩下三次查詢都命中了緩存。

count(1) 和 count(*) 的區(qū)別

接下來探究 count(1) 和 count(*) 的區(qū)別是什么，繼續(xù)思考最開始的四次查詢，第一次查詢使用了 count(*)，第二次查詢使用了 count(1) ，卻依然命中了緩存，不正是說明 count(1) 和 count(*) 是一樣的嗎？

事實上，PostgreSQL 官方對于 is there a difference performance-wise between select count(1) and select count(*)? 問題的回復也證實了這一點：

Nope. In fact, the latter is converted to the former during parsing.[2]

既然 count(1) 在性能上沒有比 count(*) 更好，那么使用 count(*) 就是更好的選擇。

sequence scan 和 index scan

接下來測試一下，在不同數(shù)據(jù)量大小的情況下 count(*) 的速度，將查詢語句寫在 count.sql 文件中，使用 pgbench 進行測試。

pgbench -c 5 -t 20 performance_test -r -f count.sql

分別測試 200w - 1000w 數(shù)據(jù)量下的 count 語句耗時

數(shù)據(jù)大小	count耗時(ms)
200w	738.758
300w	1035.846
400w	1426.183
500w	1799.866
600w	2117.247
700w	2514.691
800w	2526.441
900w	2568.240
1000w	2650.434

繪制成耗時曲線

曲線的趨勢在 600w - 700w 數(shù)據(jù)量之間出現(xiàn)了轉折，200w - 600w 是線性增長，600w 之后 count 的耗時就基本相同了。使用 explain 語句分別查看 600w 和 700w 數(shù)據(jù)時的 count 語句執(zhí)行。

700w：

Finalize Aggregate (cost=502185.93..502185.94 rows=1 width=8) (actual time=894.361..894.361 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=16344 read=352463
  -> Gather (cost=502185.72..502185.93 rows=2 width=8) (actual time=894.232..899.763 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=16344 read=352463
     -> Partial Aggregate (cost=501185.72..501185.73 rows=1 width=8) (actual time=889.371..889.371 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=16344 read=352463
        Worker 0: actual time=887.112..887.112 rows=1 loops=1
         Buffers: shared hit=5459 read=118070
        Worker 1: actual time=887.120..887.120 rows=1 loops=1
         Buffers: shared hit=5601 read=117051
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..493863.32 rows=2928960 width=0) (actual time=0.112..736.376 rows=2333333 loops=3)
           Index Cond: (test_tbl.id < 7000000)
           Heap Fetches: 2328492
           Buffers: shared hit=16344 read=352463
           Worker 0: actual time=0.107..737.180 rows=2344479 loops=1
            Buffers: shared hit=5459 read=118070
           Worker 1: actual time=0.133..737.960 rows=2327028 loops=1
            Buffers: shared hit=5601 read=117051
 Planning time: 0.165 ms
 Execution time: 899.857 ms

600w：

Finalize Aggregate (cost=429990.94..429990.95 rows=1 width=8) (actual time=765.575..765.575 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=13999 read=302112
  -> Gather (cost=429990.72..429990.93 rows=2 width=8) (actual time=765.557..770.889 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=13999 read=302112
     -> Partial Aggregate (cost=428990.72..428990.73 rows=1 width=8) (actual time=763.821..763.821 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=13999 read=302112
        Worker 0: actual time=762.742..762.742 rows=1 loops=1
         Buffers: shared hit=4638 read=98875
        Worker 1: actual time=763.308..763.308 rows=1 loops=1
         Buffers: shared hit=4696 read=101570
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..422723.16 rows=2507026 width=0) (actual time=0.053..632.199 rows=2000000 loops=3)
           Index Cond: (test_tbl.id < 6000000)
           Heap Fetches: 2018490
           Buffers: shared hit=13999 read=302112
           Worker 0: actual time=0.059..633.156 rows=1964483 loops=1
            Buffers: shared hit=4638 read=98875
           Worker 1: actual time=0.038..634.271 rows=2017026 loops=1
            Buffers: shared hit=4696 read=101570
 Planning time: 0.055 ms
 Execution time: 770.921 ms

根據(jù)以上現(xiàn)象推斷，PostgreSQL 似乎在 count 的數(shù)據(jù)量小于數(shù)據(jù)表長度的某一比例時，才使用 index scan，通過查看官方 wiki 也可以看到相關描述：

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.[3]

根據(jù) Stackoverflow 上的回答，count 語句查詢的數(shù)量大于表大小的 3/4 時候就會用使用全表掃描代替索引掃描[4]。

結論

不要用 count(1) 或 count(列名) 代替 count(*)

count 本身是非常耗時的

count 可能是 index scan 也可能是 sequence scan，取決于 count 數(shù)量占表大小的比例

看完上述內容，你們掌握怎么在postgreSQL中對記錄數(shù)進行計算的方法了嗎？如果還想學到更多技能或想了解更多相關內容，歡迎關注億速云行業(yè)資訊頻道，感謝各位的閱讀！

向AI問一下細節(jié)

怎么在postgreSQL中對記錄數(shù)進行計算

新法

方法如下：

構建測試數(shù)據(jù)庫

生成測試數(shù)據(jù)

查詢緩存

count(1) 和 count(*) 的區(qū)別

sequence scan 和 index scan

結論

猜你喜歡

最新資訊

相關推薦

相關標簽