not in 和join view 在达梦和崖山数据库表现

前面我测试了oracle迁移到postgresql的两个性能问题,not in时不能转换为Antijoin, 和join view的谓词不是关连列时存在无法谓词推进,下面测试一下在另外2款集中式数据库达梦和崖山的表现。

达梦

构建测试数据

Server[LOCALHOST:5236]:mode is normal, state is open
login used time : 7.366(ms)
disql V8
SQL>
SQL> create table tobj1 as select * from dba_objects;
executed successfully
used time: 66.824(ms). Execute id is 60105.
SQL> create table tobj2 as select * from dba_objects;
executed successfully
used time: 63.160(ms). Execute id is 60106.

SQL> insert into tobj2 select * from tobj2;
affect rows 1015

used time: 6.149(ms). Execute id is 60107.
SQL> /
affect rows 2030

used time: 9.378(ms). Execute id is 60108.
SQL> /
affect rows 4060
...
SQL>  select count(*) from tobj2;

LINEID     COUNT(*)
---------- --------------------
1          2078720

Not in场景

SQL> select count(*) from tobj1 where object_id not in(select object_id from tobj2 where object_id is not null) ;

LINEID     COUNT(*)
---------- --------------------
1          0

used time: 439.006(ms). Execute id is 60120.
SQL> explain select count(*) from tobj1 where object_id not in(select object_id from tobj2 where object_id is not null) ;

1   #NSET2: [384, 1, 30]
2     #PRJT2: [384, 1, 30]; exp_num(1), is_atom(FALSE)
3       #AAGR2: [384, 1, 30]; grp_num(0), sfun_num(1), distinct_flag[0]; slave_empty(0)
4         #HASH LEFT SEMI JOIN2: [384, 1, 30];  (ANTI),KEY_NUM(1);  KEY(TOBJ1.OBJECT_ID=DMTEMPVIEW_889214295.colname) KEY_NULL_EQU(0)
5           #CSCN2: [1, 1013, 30]; INDEX33555549(TOBJ1)
6           #PRJT2: [250, 1974784, 30]; exp_num(1), is_atom(FALSE)
7             #SLCT2: [250, 1974784, 30]; NOT(TOBJ2.OBJECT_ID IS NULL)
8               #CSCN2: [250, 2078720, 30]; INDEX33555550(TOBJ2)

used time: 1.432(ms). Execute id is 0.
SQL>  select count(*) from tobj1 where not exists(select 1 from tobj2 where tobj1.object_id=tobj2.object_id);

LINEID     COUNT(*)
---------- --------------------
1          0

used time: 382.377(ms). Execute id is 60121.
SQL> explain select count(*) from tobj1 where not exists(select 1 from tobj2 where tobj1.object_id=tobj2.object_id);

1   #NSET2: [371, 1, 42]
2     #PRJT2: [371, 1, 42]; exp_num(1), is_atom(FALSE)
3       #AAGR2: [371, 1, 42]; grp_num(0), sfun_num(1), distinct_flag[0]; slave_empty(0)
4         #HASH LEFT SEMI JOIN2: [371, 1, 42];  (ANTI),KEY_NUM(1);  KEY(TOBJ1.OBJECT_ID=TOBJ2.OBJECT_ID) KEY_NULL_EQU(0)
5           #CSCN2: [1, 1013, 42]; INDEX33555549(TOBJ1)
6           #CSCN2: [230, 2078720, 30]; INDEX33555550(TOBJ2)

Note: 在达梦中not in 和not exists一样都使用了 ANTI JOIN,而非逐行执行子查询. 便是注意使用not in时子查询是先被完整的物化再与外层匹配。 而not exists使用了更优的反连接(ANTI JOIN)优化,并且避免了子查询全物化。虽然两个计划最终都是 HASH ANTI JOIN,但 NOT EXISTS 版本省去了 PRJT2 和 SLCT2 两层操作,减少了数据复制和内存开销。

对比项NOT IN 版本NOT EXISTS 版本
子查询处理先全扫描 + 过滤 NULL,生成中间结果(第6-8行有 PRJT2 + SLCT2直接扫描,无额外投影层
中间结果集需要物化投影后的 197 万行无需物化,直接参与 HASH 连接
执行计划层级3 层(子查询投影 → 过滤 → 反连接)2 层(扫描 → 反连接)
性能差(大量内存/临时表消耗)(流式处理)
建议: 在达梦中,判断不存在性时优先使用 NOT EXISTS 而非 NOT IN,即使 NOT IN 写了 IS NOT NULL

JOIN VIEW 场景

SQL> create view vobj2 as select object_id,max(object_name) object_name,count(*) cnt from tobj2 group by object_id;
executed successfully
used time: 17.226(ms). Execute id is 60122.
SQL> EXPLAIN select * from tobj1 left join vobj2 on tobj1.object_id=vobj2.object_id where tobj1.object_id=40;

1   #NSET2: [128510, 13160, 675]
2     #PRJT2: [128510, 13160, 675]; exp_num(18), is_atom(FALSE)
3       #NEST LOOP LEFT JOIN2: [128510, 13160, 675];  partition_keys_num(0) ret_null(0)
4         #SLCT2: [1, 25, 597]; TOBJ1.OBJECT_ID = var1
5           #CSCN2: [1, 1013, 597]; INDEX33555549(TOBJ1)
6         #SLCT2: [404, 519, 78]; VOBJ2.OBJECT_ID = var2
7           #PRJT2: [404, 20787, 78]; exp_num(3), is_atom(FALSE)
8             #HAGR2: [404, 20787, 78]; grp_num(1), sfun_num(2), distinct_flag[0,0]; slave_empty(0) keys(TOBJ2.OBJECT_ID)
9               #CSCN2: [251, 2078720, 78]; INDEX33555550(TOBJ2)

used time: 1.844(ms). Execute id is 0.
SQL> create index idx_tobj1_oid on tobj1(object_id);
executed successfully
used time: 449.781(ms). Execute id is 60123.
SQL> create index idx_tobj2_oid on tobj2(object_id);
executed successfully
used time: 00:00:06.532. Execute id is 60124.
SQL> EXPLAIN select * from tobj1 left join vobj2 on tobj1.object_id=vobj2.object_id where tobj1.object_id=40;

1   #NSET2: [128507, 13160, 675]
2     #PRJT2: [128507, 13160, 675]; exp_num(18), is_atom(FALSE)
3       #NEST LOOP LEFT JOIN2: [128507, 13160, 675];  partition_keys_num(0) ret_null(0)
4         #BLKUP2: [1, 25, 597]; IDX_TOBJ1_OID(TOBJ1)
5           #SSEK2: [1, 25, 597]; scan_type(ASC), IDX_TOBJ1_OID(TOBJ1), scan_range[exp_cast(40),exp_cast(40)]
6         #SLCT2: [404, 519, 78]; VOBJ2.OBJECT_ID = var1
7           #PRJT2: [404, 20787, 78]; exp_num(3), is_atom(FALSE)
8             #HAGR2: [404, 20787, 78]; grp_num(1), sfun_num(2), distinct_flag[0,0]; slave_empty(0) keys(TOBJ2.OBJECT_ID)
9               #CSCN2: [251, 2078720, 78]; INDEX33555550(TOBJ2)

used time: 2.541(ms). Execute id is 0.

使用Join 条件谓词时,在达梦8时发生了视图谓词推进(View Predicate Pushdown)

  • 第 6 行:#SLCT2: [404, 519, 78]; VOBJ2.OBJECT_ID = var1
    • 这说明在扫描视图 VOBJ2 时,已经应用了一个过滤条件 VOBJ2.OBJECT_ID = var1
    • 这里的 var1 是从左表 TOBJ1 传来的绑定变量值(具体为 40)。

吐槽,执行计划关键字真难看。

SQL> EXPLAIN select * from tobj1 left join vobj2 on tobj1.object_id=vobj2.object_id where tobj1.object_name='tobj';

1   #NSET2: [406, 75, 675]
2     #PRJT2: [406, 75, 675]; exp_num(18), is_atom(FALSE)
3       #HASH LEFT JOIN2: [406, 75, 675]; key_num(1), partition_keys_num(0), ret_null(0), mix(0) KEY(TOBJ1.OBJECT_ID=VOBJ2.OBJECT_ID)
4         #SLCT2: [1, 25, 597]; TOBJ1.OBJECT_NAME = 'tobj'
5           #CSCN2: [1, 1013, 597]; INDEX33555549(TOBJ1)
6         #PRJT2: [404, 20787, 78]; exp_num(3), is_atom(FALSE)
7           #HAGR2: [404, 20787, 78]; grp_num(1), sfun_num(2), distinct_flag[0,0]; slave_empty(0) keys(TOBJ2.OBJECT_ID)
8             #CSCN2: [251, 2078720, 78]; INDEX33555550(TOBJ2)

used time: 2.093(ms). Execute id is 0.

NOte: 在使用 非join 列谓词时,并没有发生视图谓词推进

  1. 第4-5行: 从 TOBJ1 中筛选 object_name='tobj' 的记录(扫描索引 INDEX33555549)。索引组织表机构,使用了内部的索引。
  2. 第6-8行:独立、完整地执行视图 vobj2
    • 全表扫描 TOBJ2(预估 2,078,720 行,见第8行)
    • 进行哈希聚合 HAGR2(按 OBJECT_ID 分组)
    • 生成完整的视图结果集
特征使用join列谓词使用 非join列谓词
连接方式NEST LOOP LEFT JOINHASH LEFT JOIN
视图侧过滤#SLCT2: VOBJ2.OBJECT_ID = var1 (有参数传递)无此条件
视图扫描虽然有CSCN2,但外层传入了绑定变量视图独立全扫,无外部条件

崖山数据库

构建测试数据

Connected to:
YashanDB Server Enterprise Edition Release 23.5.1.100 x86_64 - Linux

SQL> 
SQL> create table tobj1 as select * from dba_objects;

Succeed.

SQL> select count(*) from tobj1;

             COUNT(*)
---------------------
                 2333

1 row fetched.

SQL> create table tobj2 as select * from dba_objects;

Succeed.

SQL> insert into tobj2 select * from tobj2;

2334 rows affected.

SQL> /

4668 rows affected.

...
SQL> select count(*) from tobj2;

             COUNT(*)
---------------------
              1195008

not in 场景

SQL> explain select count(*) from tobj1 where object_id not in(select object_id from tobj2 where object_id is not null) ;

PLAN_DESCRIPTION
----------------------------------------------------------------
SQL hash value: 72379594
Optimizer: ADOPT_C

+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
| Id | Operation type                 | Name                 | Owner      | Rows     | Cost(%CPU)  | Partition info                 |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
|  0 | SELECT STATEMENT               |                      |            |          |             |                                |
|  1 |  AGGREGATE                     |                      |            |         1|      280( 0)|                                |
|* 2 |   HASH JOIN ANTI NA            |                      |            |     99900|      279( 0)|                                |
|  3 |    TABLE ACCESS FULL           | TOBJ1                | SYS        |    100000|      136( 0)|                                |
|  4 |    HASH GROUP                  |                      |            |      1000|      141( 0)|                                |
|* 5 |     TABLE ACCESS FULL          | TOBJ2                | SYS        |     96000|      136( 0)|                                |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+

Operation Information (identified by operation id):
---------------------------------------------------

   2 - Predicate : access("TOBJ1"."OBJECT_ID" = "TOBJ2"."OBJECT_ID")
   4 - Group Expression: ("TOBJ2"."OBJECT_ID")
   5 - Predicate : filter("TOBJ2"."OBJECT_ID" IS NOT NULL)

SQL> explain select count(*) from tobj1 where not exists(select 1 from tobj2 where tobj1.object_id=tobj2.object_id);

PLAN_DESCRIPTION
----------------------------------------------------------------
SQL hash value: 1231899452
Optimizer: ADOPT_C

+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
| Id | Operation type                 | Name                 | Owner      | Rows     | Cost(%CPU)  | Partition info                 |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
|  0 | SELECT STATEMENT               |                      |            |          |             |                                |
|  1 |  AGGREGATE                     |                      |            |         1|      280( 0)|                                |
|* 2 |   HASH JOIN ANTI               |                      |            |     99900|      279( 0)|                                |
|  3 |    TABLE ACCESS FULL           | TOBJ1                | SYS        |    100000|      136( 0)|                                |
|  4 |    HASH GROUP                  |                      |            |      1000|      141( 0)|                                |
|  5 |     TABLE ACCESS FULL          | TOBJ2                | SYS        |    100000|      136( 0)|                                |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+

Operation Information (identified by operation id):
---------------------------------------------------

   2 - Predicate : access("TOBJ1"."OBJECT_ID" = "TOBJ2"."OBJECT_ID")
   4 - Group Expression: ("TOBJ2"."OBJECT_ID")

19 rows fetched.

Note: 这执行计划看的很熟悉,和oracle一样使用了hash join anti join。

JOIN VIEW 场景

SQL> create view vobj2 as select object_id,max(object_name) object_name,count(*) cnt from tobj2 group by object_id;

SQL> create index idx_tobj1_oid on tobj1(object_id);

Succeed.

SQL> create index idx_tobj2_oid on tobj2(object_id);

Succeed.

SQL> explain select * from tobj1 left join vobj2 on tobj1.object_id=vobj2.object_id where tobj1.object_id=40;

PLAN_DESCRIPTION
----------------------------------------------------------------
SQL hash value: 3480004066
Optimizer: ADOPT_C

+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
| Id | Operation type                 | Name                 | Owner      | Rows     | Cost(%CPU)  | Partition info                 |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
|  0 | SELECT STATEMENT               |                      |            |          |             |                                |
|* 1 |  HASH JOIN LEFT OUTER          |                      |            |       101|        1( 0)|                                |
|  2 |   TABLE ACCESS BY INDEX ROWID  | TOBJ1                | SYS        |       101|        1( 0)|                                |
|* 3 |    INDEX RANGE SCAN            | IDX_TOBJ1_OID        | SYS        |       101|        1( 0)|                                |
|  4 |   VIEW                         |                      |            |       101|        1( 0)|                                |
|  5 |    GROUP                       |                      |            |       101|        1( 0)|                                |
|  6 |     TABLE ACCESS BY INDEX ROWID| TOBJ2                | SYS        |       101|        1( 0)|                                |
|* 7 |      INDEX RANGE SCAN          | IDX_TOBJ2_OID        | SYS        |       101|        1( 0)|                                |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+

Operation Information (identified by operation id):
---------------------------------------------------

   1 - Predicate : access("TOBJ1"."OBJECT_ID" = "TOBJ2"."OBJECT_ID")
   3 - Predicate : access("TOBJ1"."OBJECT_ID" = 40)
   5 - Group Expression: (40)
   7 - Predicate : access("TOBJ2"."OBJECT_ID" = 40)

23 rows fetched.

Note: 在使用join 列谓词时,可以做到view的谓词推进,#7能看到。

SQL> explain select * from tobj1 left join vobj2 on tobj1.object_id=vobj2.object_id where tobj1.object_name='tobj';

PLAN_DESCRIPTION
----------------------------------------------------------------
SQL hash value: 1184278721
Optimizer: ADOPT_C

+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
| Id | Operation type                 | Name                 | Owner      | Rows     | Cost(%CPU)  | Partition info                 |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+
|  0 | SELECT STATEMENT               |                      |            |          |             |                                |
|* 1 |  HASH JOIN RIGHT OUTER         |                      |            |       101|      293( 0)|                                |
|  2 |   JOIN FILTER USE              |                      |            |      1000|      150( 0)|                                |
|  3 |    VIEW                        |                      |            |      1000|      150( 0)|                                |
|  4 |     HASH GROUP                 |                      |            |      1000|      150( 0)|                                |
|* 5 |      TABLE ACCESS FULL         | TOBJ2                | SYS        |    100000|      136( 0)|                                |
|* 6 |   JOIN FILTER CREATE           |                      |            |       101|      141( 0)|                                |
|* 7 |    TABLE ACCESS FULL           | TOBJ1                | SYS        |       101|      141( 0)|                                |
+----+--------------------------------+----------------------+------------+----------+-------------+--------------------------------+

Operation Information (identified by operation id):
---------------------------------------------------

   1 - Predicate : access("TOBJ2"."OBJECT_ID" = "TOBJ1"."OBJECT_ID")
   4 - Group Expression: ("TOBJ2"."OBJECT_ID")
   5 - Predicate : RUNTIME FILTER(RUNTIME USE(0): "TOBJ2"."OBJECT_ID")
   6 - Predicate : RUNTIME FILTER(RUNTIME CREATE(0): "TOBJ1"."OBJECT_ID")
   7 - Predicate : filter("TOBJ1"."OBJECT_NAME" = 'tobj')

Note: 崖山的效果更好,对于非Join列谓词时,崖山数据库在这里使用了运行时过滤(Runtime Filter)优化,本质上实现了动态谓词推进,避免了视图的完全物化。

执行顺序(从下往上):
1. 第7行:扫描 TOBJ1,过滤 object_name='tobj' → 预估 101 行
2. 第6行:JOIN FILTER CREATE - 用这 101 行的 OBJECT_ID 动态创建一个过滤布隆过滤器
3. 第2行:JOIN FILTER USE - 将过滤器下推到 TOBJ2 的扫描中
4. 第5行:扫描 TOBJ2 时,用过滤器提前过滤掉不符合条件的行
5. 第4行:对过滤后的结果做 GROUP BY
6. 第1行:HASH JOIN 完成最终连接

小结

对于在postgresql中的这2个问题,首先对于not in 这2款数据库都可以转换为anti join, 只是达梦not in时有一层全雾化的过程,还是建议用not exists代替not in. 其次对于join view 在使用join列谓词时都可以推进。而使用非join列谓词时,达梦未能推进,而yashan使用了一种运行时过滤技术,动态推进。

Leave a Comment