On PostgreSQL-based databases, if you rely on autovacuum to gather statistics, partitioned tables have a gap: statistics are collected for each partition, but the global statistics on the partitioned (parent) table are missing. Queries that span multiple partitions, or that join the partitioned table with other tables, can then get a bad execution plan from the wrong estimates.
Some table-level statistics are easy to derive: the total row count is simply the sum over all partitions. But the number of distinct values (NDV) per column cannot be derived that way, because there is no way to know how many values are shared between partitions. The true column NDV lies somewhere between the largest per-partition distinct count and the sum of the distinct counts of all partitions.
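That bound can be computed from per-partition distinct counts alone. A minimal SQL sketch (the three values are the per-partition distinct counts of column y from the test data below):

```sql
-- Lower bound: all partitions could share the same values -> max(ndv).
-- Upper bound: partitions could share no values at all    -> sum(ndv).
select max(ndv) as ndv_lower_bound, sum(ndv) as ndv_upper_bound
from (values (236480), (262800), (720)) as per_partition(ndv);
-- ndv_lower_bound = 262800, ndv_upper_bound = 500000
-- (the true NDV of y happens to be 500000, since the partitions do not overlap)
```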
I happened to see Franck Pachot write about this PostgreSQL behavior, and I verified that the same problem exists on Kingbase and HighgoDB. Below are brief notes from the test.
Building the test table
create table history (
year int,
num serial,
x int,
y int,
primary key (year, num)
) partition by list (year)
;
create table history_2024 partition of history for values in (2024);
create table history_2025 partition of history for values in (2025);
create table history_2026 partition of history for values in (2026);
create table history_2027 partition of history for values in (2027);
insert into history select
extract(year from ( date '2026-01-02' - interval '1 minute' * num ))::int as year
,num -- NDV ≈ rows (unique key, density ≈ 1 / rows)
,(num % 2) as x -- NDV = 2 per partition (very high density: ~50% per value)
,(num / 2) as y -- NDV ≈ rows / 2 per partition (moderate density: ~2 rows per distinct value)
from generate_series(1,1e6) num
;
The actual data distribution:
select count(*), year
, count(distinct x) as "distinct x"
, min(x) as "min x"
, max(x) as "max x"
, count(*)::float / nullif(count(distinct x), 0) as density_x
, count(distinct y) as "distinct y"
, min(y) as "min y"
, max(y) as "max y"
, count(*)::float / nullif(count(distinct y), 0) as density_y
from history group by grouping sets ((),year)
;
count | year | distinct x | min x | max x | density_x | distinct y | min y | max y | density_y
---------+------+------------+-------+-------+-----------+------------+--------+--------+-----------
472960 | 2024 | 2 | 0 | 1 | 236480 | 236480 | 263521 | 500000 | 2
525600 | 2025 | 2 | 0 | 1 | 262800 | 262800 | 721 | 263520 | 2
1440 | 2026 | 2 | 0 | 1 | 720 | 720 | 1 | 720 | 2
1000000 | | 2 | 0 | 1 | 500000 | 500000 | 1 | 500000 | 2
(4 rows)
Right after the insert, Kingbase may briefly (about 3 s) show no statistics at all; on HighgoDB I noticed autovacuum firing more frequently.
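To see whether autovacuum has analyzed the relations yet, the standard `pg_stat_user_tables` view records the last (auto)analyze times; a quick check (assuming these forks keep the stock view):

```sql
-- last_autoanalyze stays NULL until autovacuum's analyze has run;
-- n_mod_since_analyze shows how far the stats have drifted since then.
select relname, last_analyze, last_autoanalyze, n_mod_since_analyze
from pg_stat_user_tables
where relname like 'history%'
order by relname;
```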
select relname, relpages, reltuples, relkind, relhassubclass, relispopulated, relispartition
from pg_class where relname like 'history%' order by relkind desc, relname
;
relname | relpages | reltuples | relkind | relhassubclass | relispopulated | relispartition
-------------------+----------+-----------+---------+----------------+----------------+----------------
history_2024 | 0 | -1 | r | f | t | t
history_2025 | 0 | -1 | r | f | t | t
history_2026 | 0 | -1 | r | f | t | t
history_2027 | 0 | -1 | r | f | t | t
history | 0 | -1 | p | t | t | f
history_2024_pkey | 1 | 0 | i | f | t | t
history_2025_pkey | 1 | 0 | i | f | t | t
history_2026_pkey | 1 | 0 | i | f | t | t
history_2027_pkey | 1 | 0 | i | f | t | t
history_num_seq | 1 | 1 | S | f | t | f
history_pkey | 0 | 0 | I | t | t | f
(11 rows)
select tablename, attname, n_distinct, null_frac
from pg_stats where tablename like 'history%'
;
tablename | attname | n_distinct | null_frac
-----------+---------+------------+-----------
(0 rows)
After a while, autovacuum kicked in and collected partition-level statistics:
kingbase=# select relname, relpages, reltuples, relkind, relhassubclass, relispopulated, relispartition
kingbase-# from pg_class where relname like 'history%' order by relkind desc, relname
kingbase-# ;
relname | relpages | reltuples | relkind | relhassubclass | relispopulated | relispartition
-------------------+----------+-----------+---------+----------------+----------------+----------------
history_2024 | 2571 | 472960 | r | f | t | t
history_2025 | 2857 | 525600 | r | f | t | t
history_2026 | 8 | 1440 | r | f | t | t
history_2027 | 0 | -1 | r | f | t | t
history | 0 | -1 | p | t | t | f
history_2024_pkey | 1307 | 472960 | i | f | t | t
history_2025_pkey | 1451 | 525600 | i | f | t | t
history_2026_pkey | 6 | 1440 | i | f | t | t
history_2027_pkey | 1 | 0 | i | f | t | t
history_num_seq | 1 | 1 | S | f | t | f
history_pkey | 0 | 0 | I | t | t | f
kingbase=# select tablename, attname, n_distinct, null_frac
kingbase-# from pg_stats where tablename like 'history%'
kingbase-# ;
tablename | attname | n_distinct | null_frac
--------------+---------+-------------+-----------
history_2024 | year | 1 | 0
history_2024 | num | -1 | 0
history_2024 | x | 2 | 0
history_2024 | y | -0.5069985 | 0
history_2025 | year | 1 | 0
history_2025 | num | -1 | 0
history_2025 | x | 2 | 0
history_2025 | y | -0.49855024 | 0
history_2026 | year | 1 | 0
history_2026 | num | -1 | 0
history_2026 | x | 2 | 0
history_2026 | y | -0.5 | 0
(12 rows)
Note: there are still no global statistics; column statistics, including the distinct-value counts, exist only at the partition level.
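In stock PostgreSQL (and presumably these forks), `pg_stats` exposes an `inherited` flag; rows with inherited = true are the parent's global statistics, and their absence confirms that only partition-level statistics exist:

```sql
-- inherited = true  -> global statistics for the whole partition hierarchy
-- inherited = false -> statistics of one relation (partition) only
select tablename, attname, inherited, n_distinct, null_frac
from pg_stats
where tablename like 'history%'
order by tablename, attname;
```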
Estimation for queries spanning partitions
explain (analyze, buffers off, summary off) select distinct x from history
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=22978.20..22980.20 rows=200 width=4) (actual time=313.883..313.885 rows=2 loops=1)
Group Key: history_2024.x
-> Append (cost=0.00..20473.60 rows=1001840 width=4) (actual time=0.013..178.975 rows=1000000 loops=1)
-> Seq Scan on history_2024 (cost=0.00..7300.60 rows=472960 width=4) (actual time=0.012..56.130 rows=472960 loops=1)
-> Seq Scan on history_2025 (cost=0.00..8113.00 rows=525600 width=4) (actual time=0.012..62.377 rows=525600 loops=1)
-> Seq Scan on history_2026 (cost=0.00..22.40 rows=1440 width=4) (actual time=0.012..0.179 rows=1440 loops=1)
-> Seq Scan on history_2027 (cost=0.00..28.40 rows=1840 width=4) (actual time=0.001..0.002 rows=0 loops=1)
(7 rows)
explain (analyze, buffers off, summary off) select distinct y from history
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=22978.20..22980.20 rows=200 width=4) (actual time=449.150..552.624 rows=500000 loops=1)
Group Key: history_2024.y
-> Append (cost=0.00..20473.60 rows=1001840 width=4) (actual time=0.011..191.240 rows=1000000 loops=1)
-> Seq Scan on history_2024 (cost=0.00..7300.60 rows=472960 width=4) (actual time=0.010..60.868 rows=472960 loops=1)
-> Seq Scan on history_2025 (cost=0.00..8113.00 rows=525600 width=4) (actual time=0.015..68.103 rows=525600 loops=1)
-> Seq Scan on history_2026 (cost=0.00..22.40 rows=1440 width=4) (actual time=0.014..0.195 rows=1440 loops=1)
-> Seq Scan on history_2027 (cost=0.00..28.40 rows=1840 width=4) (actual time=0.003..0.003 rows=0 loops=1)
(7 rows)
Note: the estimated rows is 200, yet column x actually has only 2 distinct values; column y is likewise estimated at 200 against an actual 500000. With no statistics available, PostgreSQL falls back to a hard-coded default; the source contains `#define DEFAULT_NUM_DISTINCT 200`.
Estimation when querying a single partition
kingbase=# SELECT * FROM pg_partition_tree('history');
relid | parentrelid | isleaf | level
--------------+-------------+--------+-------
history | | f | 0
history_2024 | history | t | 1
history_2025 | history | t | 1
history_2026 | history | t | 1
history_2027 | history | t | 1
(5 rows)
kingbase=# explain (analyze, buffers off, summary off) select distinct x from history_2024;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=8483.00..8483.02 rows=2 width=4) (actual time=114.068..114.069 rows=2 loops=1)
Group Key: x
-> Seq Scan on history_2024 (cost=0.00..7300.60 rows=472960 width=4) (actual time=0.012..37.173 rows=472960 loops=1)
(3 rows)
Note: the single-partition estimate is accurate.
When a query joins multiple tables across many partitions, this default of rows=200 can produce a bad execution plan. Autovacuum's auto-analyze collects statistics for each partition, but never for the partitioned parent. To get global statistics on the parent, you must run ANALYZE manually, and by default ANALYZE recurses into all partitions. (PostgreSQL also lets you pin a column's inherited distinct estimate with the n_distinct_inherited option, and recent versions add ANALYZE ONLY to analyze just the parent without recursing; the domestic databases, generally based on PG 12 or 14, do not yet have ANALYZE ONLY.)
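On versions that support them, those two options look like this (a sketch; `ANALYZE ONLY` requires a recent PostgreSQL release, and `n_distinct_inherited` replaces sampling with a fixed estimate for that column):

```sql
-- Analyze only the partitioned parent, without recursing into partitions
-- (not available on forks based on PG 12/14):
ANALYZE ONLY history;

-- Alternatively, hard-code the inherited distinct estimate for a column,
-- so the planner no longer depends on sampled global statistics for it:
ALTER TABLE history ALTER COLUMN x SET (n_distinct_inherited = 2);
ANALYZE history;
```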
kingbase=# ANALYZE history
kingbase-# ;
ANALYZE
kingbase=# select relname, relpages, reltuples, relkind, relhassubclass, relispopulated, relispartition
kingbase-# from pg_class where relname like 'history%' order by relkind desc, relname
kingbase-# ;
relname | relpages | reltuples | relkind | relhassubclass | relispopulated | relispartition
-------------------+----------+-----------+---------+----------------+----------------+----------------
history_2024 | 2571 | 472960 | r | f | t | t
history_2025 | 2857 | 525600 | r | f | t | t
history_2026 | 8 | 1440 | r | f | t | t
history_2027 | 0 | 0 | r | f | t | t
history | 5436 | 1e+06 | p | t | t | f
history_2024_pkey | 1307 | 472960 | i | f | t | t
history_2025_pkey | 1451 | 525600 | i | f | t | t
history_2026_pkey | 6 | 1440 | i | f | t | t
history_2027_pkey | 1 | 0 | i | f | t | t
history_num_seq | 1 | 1 | S | f | t | f
history_pkey | 2765 | 1e+06 | I | t | t | f
(11 rows)
select tablename, attname, n_distinct, null_frac
from pg_stats where tablename like 'history%'
;
tablename | attname | n_distinct | null_frac
--------------+---------+-------------+-----------
history | year | 3 | 0
history | num | -1 | 0
history | x | 2 | 0
history | y | -0.476323 | 0
history_2024 | year | 1 | 0
history_2024 | num | -1 | 0
history_2024 | x | 2 | 0
history_2024 | y | -0.49474797 | 0
history_2025 | year | 1 | 0
history_2025 | num | -1 | 0
history_2025 | x | 2 | 0
history_2026 | year | 1 | 0
history_2026 | x | 2 | 0
history_2025 | y | -0.49974886 | 0
history_2026 | num | -1 | 0
history_2026 | y | -0.5 | 0
(16 rows)
Run the same queries again:
kingbase=# explain (analyze, buffers off, summary off) select distinct x from history;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=22936.01..22936.03 rows=2 width=4) (actual time=313.243..313.245 rows=2 loops=1)
Group Key: history_2024.x
-> Append (cost=0.00..20436.01 rows=1000001 width=4) (actual time=0.033..177.602 rows=1000000 loops=1)
-> Seq Scan on history_2024 (cost=0.00..7300.60 rows=472960 width=4) (actual time=0.032..55.660 rows=472960 loops=1)
-> Seq Scan on history_2025 (cost=0.00..8113.00 rows=525600 width=4) (actual time=0.017..61.183 rows=525600 loops=1)
-> Seq Scan on history_2026 (cost=0.00..22.40 rows=1440 width=4) (actual time=0.011..0.180 rows=1440 loops=1)
-> Seq Scan on history_2027 (cost=0.00..0.00 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=1)
(7 rows)
kingbase=# explain (analyze, buffers off, summary off) select distinct y from history;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=133768.45..138768.46 rows=476323 width=4) (actual time=320.937..529.077 rows=500000 loops=1)
-> Sort (cost=133768.45..136268.46 rows=1000001 width=4) (actual time=320.935..417.324 rows=1000000 loops=1)
Sort Key: history_2024.y
Sort Method: external merge Disk: 13800kB
-> Append (cost=0.00..20436.01 rows=1000001 width=4) (actual time=0.011..180.121 rows=1000000 loops=1)
-> Seq Scan on history_2024 (cost=0.00..7300.60 rows=472960 width=4) (actual time=0.010..56.977 rows=472960 loops=1)
-> Seq Scan on history_2025 (cost=0.00..8113.00 rows=525600 width=4) (actual time=0.013..62.874 rows=525600 loops=1)
-> Seq Scan on history_2026 (cost=0.00..22.40 rows=1440 width=4) (actual time=0.012..0.181 rows=1440 loops=1)
-> Seq Scan on history_2027 (cost=0.00..0.00 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1)
(9 rows)
Note: the optimizer now produces row estimates that are close to reality.
Autovacuum does collect partition-level statistics once its trigger thresholds are reached, but the parent table itself holds no rows and therefore never gets global statistics, so those statistics must be maintained manually.
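To find partitioned parents that still lack global statistics (candidates for a manual ANALYZE), the `stainherit` flag in `pg_statistic` can be checked directly; a sketch assuming the stock catalog layout:

```sql
-- Partitioned parents with no inherited (global) column statistics at all:
select c.relname
from pg_class c
where c.relkind = 'p'                    -- partitioned tables
  and not exists (
    select 1 from pg_statistic s
    where s.starelid = c.oid
      and s.stainherit                   -- inherited = global statistics
  );
```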
— over —