在关系数据库中两表关联在oracle中使用IN( subquery)的语法很常见, 但kevin发现在MySQL中subquery使用MAX聚合参数时,会导致主查询Full scan而无法使用索引范围扫描, 当遇到大表时可能性能下降明显 ,测试发现有的库使用子查询做为驱动表,有的是使用filter从主查询过滤。下面简单测试。
先看oracle
QL> CREATE TABLE employee1 (id int,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Table created.
SQL> insert into employee1 select rownum,'anbob'||level,'weejar','xxxxxxxxxxx','yyyyyyy' from dual connect by rownum<=100000; 100000 rows created. SQL> create table employee2 as select * from employee1;
Table created.
SQL> create index idx_emp_id1 on employee1(id);
Index created.
SQL> create index idx_emp_id2 on employee2(id);
Index created.
SQL> create index idx_emp_lname1 on employee1(lastname);
Index created.
SQL> create index idx_emp_lname2 on employee2(lastname);
Index created.
SQL> explain plan for select * from employee2 where id in(select max(id) from employee1 where lastname='anbob1');
Explained.
SQL> @x2
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 616252811
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 43 | 4 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED | EMPLOYEE2 | 1 | 43 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_EMP_ID2 | 1 | | 1 (0)| 00:00:01 |
| 3 | SORT AGGREGATE | | 1 | 16 | | |
| 4 | TABLE ACCESS BY INDEX ROWID BATCHED| EMPLOYEE1 | 1 | 16 | 2 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | IDX_EMP_LNAME1 | 1 | | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ID"= (SELECT MAX("ID") FROM "EMPLOYEE1" "EMPLOYEE1" WHERE "LASTNAME"='anbob1'))
5 - access("LASTNAME"='anbob1')
Note:
执行计划可阅读性非常棒,不解释。
再看Postgresql
-- 表结构与构建数据省略
weejar=# explain select id from employee2 where id in(select max(id) from employee1);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.76..10.31 rows=1 width=4)
-> Result (cost=1.62..1.63 rows=1 width=4)
InitPlan 1 (returns $0)
-> Limit (cost=0.14..1.62 rows=1 width=4)
-> Index Only Scan Backward using idx_emp_1 on employee1 (cost=0.14..44.66 rows=30 width=4)
Index Cond: (id IS NOT NULL)
-> Index Only Scan using employee2_pkey on employee2 (cost=0.14..8.15 rows=1 width=4)
Index Cond: (id = ($0))
(8 行记录)
Note:
Postgresql也和oracle一样符合预期。
MySQL社区版v8.0
[root@oel7db1 ~]# mysql -uroot -pwww.anbob.com -h127.0.0.1 -P3306 anbob
mysql: [Warning] Using a password on the command line interface can be insecure.
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.20-commercial MySQL Enterprise Server - Commercial
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MYSQL_root@127.0.0.1 [anbob]> use anbob;
Database changed
MYSQL_root@127.0.0.1 [anbob]> CREATE TABLE employee1 (id int,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.17 sec)
MYSQL_root@127.0.0.1 [anbob]> CREATE TABLE employee2 (id int primary key,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.09 sec)
MYSQL_root@127.0.0.1 [anbob]> create index idx_emp_1 on employee1(id);
Query OK, 0 rows affected (0.09 sec)
Records: 0 Duplicates: 0 Warnings: 0
MYSQL_root@127.0.0.1 [anbob]> DELIMITER $$
MYSQL_root@127.0.0.1 [anbob]> CREATE PROCEDURE BulkInsert()
-> BEGIN
-> DECLARE i INT DEFAULT 1;
-> truncate table employee1;
-> truncate table employee2;
-> WHILE (i <= 20000)
DO -> INSERT INTO employee1 (id, FirstName, Address) VALUES(i, CONCAT("user","-",i), CONCAT("address","-",i));
-> INSERT INTO employee2 (id,FirstName, Address) VALUES(i,CONCAT("user","-",i), CONCAT("address","-",i));
-> SET i = i+1;
-> END WHILE;
-> END
-> $$
Query OK, 0 rows affected (0.12 sec)
MYSQL_root@127.0.0.1 [anbob]> DELIMITER ;
MYSQL_root@127.0.0.1 [anbob]> call BulkInsert();
-- 吐槽, MYSQL这存储过程调用真是慢到怀疑人生。
MYSQL_root@127.0.0.1 [anbob]> explain select id from employee1 where id in(select max(id) from employee2) ;
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
| 1 | PRIMARY | employee1 | NULL | index | NULL | idx_emp_name1 | 1023 | NULL | 20160 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)
MYSQL_root@127.0.0.1 [anbob]> explain format=tree select id from employee1 where id in(select max(id) from employee2) \G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (employee1.id,(select #2)) (cost=2111.85 rows=20160)
-> Index scan on employee1 using idx_emp_name1 (cost=2111.85 rows=20160)
-> Select #2 (subquery in condition; dependent)
-> Filter: ((employee1.id) = (max(employee2.id)))
-> Rows fetched before execution
1 row in set (0.00 sec)
MYSQL_root@127.0.0.1 [anbob]> create index idx_emp_name1 on employee1(firstname);
MYSQL_root@127.0.0.1 [anbob]> create index idx_emp_name2 on employee2(firstname);
MYSQL_root@127.0.0.1 [anbob]> explain select id from employee1 where id in(select max(id) from employee2 where firstname='user-1');
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
| 1 | PRIMARY | employee1 | NULL | index | NULL | idx_emp_name1 | 1023 | NULL | 20160 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)
-------------------------+
-> Filter: (employee1.id,(select #2)) (cost=2111.85 rows=20160)
-> Index scan on employee1 using idx_emp_name1 (cost=2111.85 rows=20160)
-> Select #2 (subquery in condition; dependent)
-> Filter: ((employee1.id) = (max(employee2.id)))
-> Rows fetched before execution
+---------------------------------------------------------
MYSQL_root@127.0.0.1 [anbob]> explain select * from employee1 where id in(select max(id) from employee2 where firstname IN('user-1','user-2')) ;
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+--------------------------+
| 1 | PRIMARY | employee1 | NULL | ALL | NULL | NULL | NULL | NULL | 20160 | 100.00 | Using where |
| 2 | SUBQUERY | employee2 | NULL | range | idx_emp_name2 | idx_emp_name2 | 1023 | NULL | 2 | 100.00 | Using where; Using index |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
MYSQL_root@127.0.0.1 [anbob]> explain select * from employee1 where id in(select id from employee2 where firstname IN('user-1','user-2')) ;
+----+-------------+-----------+------------+--------+-----------------------+---------------+---------+--------------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+--------+-----------------------+---------------+---------+--------------------+------+----------+--------------------------+
| 1 | SIMPLE | employee2 | NULL | range | PRIMARY,idx_emp_name2 | idx_emp_name2 | 1023 | NULL | 2 | 100.00 | Using where; Using index |
| 1 | SIMPLE | employee1 | NULL | eq_ref | PRIMARY | PRIMARY | 4 | anbob.employee2.id | 1 | 100.00 | NULL |
+----+-------------+-----------+------------+--------+-----------------------+---------------+---------+--------------------+------+----------+--------------------------+
2 rows in set, 1 warning (0.01 sec)
-- 如果没有聚合函数MAX ,是用子查询做为驱动表。同时emp1 表也使用上了索引.
MYSQL_root@127.0.0.1 [anbob]> explain analyze select * from employee1
where id in(select max(id) from employee2 where firstname IN('user-1','user-2')) \G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (employee1.id,employee1.id in (select #2)) (cost=2111.85 rows=20160) (actual time=0.086..65.882 rows=1 loops=1)
-> Table scan on employee1 (cost=2111.85 rows=20160) (actual time=0.036..26.958 rows=20000 loops=1)
-> Select #2 (subquery in condition; run only once)
-> Filter: ((employee1.id = ``.`max(id)`)) (actual time=0.001..0.001 rows=0 loops=20001)
-> Limit: 1 row(s) (actual time=0.001..0.001 rows=0 loops=20001)
-> Index lookup on using (max(id)=employee1.id) (actual time=0.001..0.001 rows=0 loops=20001)
-> Materialize with deduplication (actual time=0.001..0.001 rows=0 loops=20001)
-> Aggregate: max(employee2.id) (actual time=0.032..0.032 rows=1 loops=1)
-> Filter: (employee2.FirstName in ('user-1','user-2')) (cost=1.50 rows=2) (actual time=0.019..0.028 rows=2 loops=1)
-> Index range scan on employee2 using idx_emp_name2 (cost=1.50 rows=2) (actual time=0.017..0.026 rows=2 loops=1)
1 row in set (0.07 sec)
MYSQL_root@127.0.0.1 [anbob]> explain analyze select * from employee1
where id in(select id from employee2 where firstname IN('user-1','user-2')) \G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join (cost=3.68 rows=2) (actual time=0.028..0.040 rows=2 loops=1)
-> Filter: (employee2.FirstName in ('user-1','user-2')) (cost=1.50 rows=2) (actual time=0.018..0.026 rows=2 loops=1)
-> Index range scan on employee2 using idx_emp_name2 (cost=1.50 rows=2) (actual time=0.016..0.024 rows=2 loops=1)
-> Single-row index lookup on employee1 using PRIMARY (id=employee2.id) (cost=1.04 rows=1) (actual time=0.006..0.006 rows=1 loops=2)
1 row in set (0.01 sec)
Note:
MySQL中子查询使用了MAX聚合函数,导致主查询全表扫,如果表较大时,性能会变化明显。
从MySQL 8.0.21开始再可以支持子查询做为驱动表。 SET SESSION optimizer_switch=”subquery_to_derived=ON”;
MYSQL_root@127.0.0.1 [anbob]> show variables like '%optimizer_switch%'\G;
*************************** 1. row ***************************
Variable_name: optimizer_switch
Value: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,use_invisible_indexes=off,skip_scan=on,hash_join=on
1 row in set (0.00 sec)
而之前只能改写SQL使用join
MYSQL_root@127.0.0.1 [anbob]> explain
-> select * from
-> employee1 e1
-> join employee2 e2 on e2.id=e1.id
-> where e2.firstname IN('user-1','user-2');
+----+-------------+-------+------------+--------+-----------------------+---------------+---------+-------------+------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+-----------------------+---------------+---------+-------------+------+----------+-----------------------+
| 1 | SIMPLE | e2 | NULL | range | PRIMARY,idx_emp_name2 | idx_emp_name2 | 1023 | NULL | 2 | 100.00 | Using index condition |
| 1 | SIMPLE | e1 | NULL | eq_ref | PRIMARY | PRIMARY | 4 | anbob.e2.id | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+-----------------------+---------------+---------+-------------+------+----------+-----------------------+
2 rows in set, 1 warning (0.00 sec)
Note:
在MYSQL中还是老老实时的单表查询吧,优化器有点弱,相比oracle差的太远。
看一下OceanBase for MySQL模式.
-- 构建数据同MYSQL完全一样。
obclient [test]> explain select id from employee1 where id in(select max(id) from employee2) \G
*************************** 1. row ***************************
Query Plan: ==========================================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
----------------------------------------------------------
|0 |NESTED-LOOP JOIN | |199 |169 |
|1 | SUBPLAN SCAN |VIEW1 |1 |46 |
|2 | SCALAR GROUP BY| |1 |46 |
|3 | SUBPLAN SCAN |VIEW2 |1 |46 |
|4 | TABLE SCAN |employee2(Reverse) |1 |46 |
|5 | TABLE SCAN |employee1(idx_emp_1)|199 |77 |
==========================================================
Outputs & filters:
-------------------------------------
0 - output([employee1.id]), filter(nil),
conds(nil), nl_params_([VIEW1.max(id)])
1 - output([VIEW1.max(id)]), filter(nil),
access([VIEW1.max(id)])
2 - output([T_FUN_MAX(VIEW2.id)]), filter(nil),
group(nil), agg_func([T_FUN_MAX(VIEW2.id)])
3 - output([VIEW2.id]), filter(nil),
access([VIEW2.id])
4 - output([employee2.id]), filter(nil),
access([employee2.id]), partitions(p0),
limit(1), offset(nil)
5 - output([employee1.id]), filter(nil),
access([employee1.id]), partitions(p0)
obclient [test]> explain select * from employee2 where id in(select max(id) from employee1 where firstname='user-1') \G
*************************** 1. row ***************************
Query Plan: ==============================================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
--------------------------------------------------------------
|0 |NESTED-LOOP JOIN | |1 |114 |
|1 | SUBPLAN SCAN |VIEW1 |1 |92 |
|2 | SCALAR GROUP BY| |1 |92 |
|3 | SUBPLAN SCAN |VIEW2 |1 |92 |
|4 | LIMIT | |1 |92 |
|5 | TOP-N SORT | |1 |92 |
|6 | TABLE SCAN |employee1(idx_emp_name1)|1 |92 |
|7 | TABLE GET |employee2 |1 |22 |
==============================================================
Outputs & filters:
-------------------------------------
0 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
conds(nil), nl_params_([VIEW1.max(id)])
1 - output([VIEW1.max(id)]), filter(nil),
access([VIEW1.max(id)])
2 - output([T_FUN_MAX(VIEW2.id)]), filter(nil),
group(nil), agg_func([T_FUN_MAX(VIEW2.id)])
3 - output([VIEW2.id]), filter(nil),
access([VIEW2.id])
4 - output([employee1.id]), filter(nil), limit(1), offset(nil)
5 - output([employee1.id]), filter(nil), sort_keys([employee1.id, DESC]), topn(1)
6 - output([employee1.id]), filter([(T_OP_IS_NOT, employee1.id, NULL, 0)]),
access([employee1.id]), partitions(p0)
7 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
access([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), partitions(p0)
obclient [test]> explain select * from employee2 where id in(select id from employee1 where firstname='user-1') \G
*************************** 1. row ***************************
Query Plan: =============================================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
-------------------------------------------------------------
|0 |NESTED-LOOP JOIN| |1 |114 |
|1 | SUBPLAN SCAN |VIEW2 |1 |92 |
|2 | MERGE DISTINCT| |1 |92 |
|3 | SORT | |1 |92 |
|4 | TABLE SCAN |employee1(idx_emp_name1)|1 |92 |
|5 | TABLE GET |employee2 |1 |22 |
=============================================================
Outputs & filters:
-------------------------------------
0 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
conds(nil), nl_params_([VIEW2.VIEW1.id])
1 - output([VIEW2.VIEW1.id]), filter(nil),
access([VIEW2.VIEW1.id])
2 - output([employee1.id]), filter(nil),
distinct([employee1.id])
3 - output([employee1.id]), filter(nil), sort_keys([employee1.id, ASC])
4 - output([employee1.id]), filter(nil),
access([employee1.id]), partitions(p0)
5 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
access([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), partitions(p0)
Note:
在OB for mysql中也是支持子查询做为驱动表的。在for oracle租户执行计划是一样的。
GreatDB万里开源数据库
GreatDB Cluster[test]> explain format=tree select id from employee1 where id in(select max(id) from employee2 where firstname='user-1'); +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (employee1.id,(select #2)) (cost=14.99 rows=111) -> Index scan on employee1 using idx_emp_name1 {[datanode3:datanode3_1]: SELECT `id`, `_hidden_pk_` FROM `test`.`employee1`;} (cost=14.99 rows=111) -> Select #2 (subquery in condition; dependent) -> Filter: ((employee1.id) = (max(employee2.id))) -> Rows fetched before execution (cost=0.00..0.00 rows=1) | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) GreatDB Cluster[test]> explain select id from employee1 where id in(select max(id) from employee2 where firstname='user-1'); +----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------+ | 1 | PRIMARY | employee1 | NULL | index | NULL | idx_emp_name1 | 1023 | NULL | 111 | 100.00 | Using where; Using index | | 2 | DEPENDENT SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | +----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------+ 2 rows in set, 1 warning (0.01 sec) GreatDB Cluster[test]> explain select * from employee1 where id in(select max(id) from employee2 where firstname IN('user-1','user-2')) ; +----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+---------------------------------------------------------------------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+---------------------------------------------------------------------------------------------+ | 1 | PRIMARY | employee1 | NULL | ALL | NULL | NULL | NULL | NULL | 111 | 100.00 | Using where | | 2 | SUBQUERY | employee2 | NULL | range | idx_emp_name2 | idx_emp_name2 | 1023 | NULL | 20 | 100.00 | Using pushed condition (`test`.`employee2`.`FirstName` in ('user-1','user-2')); Using index | +----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+---------------------------------------------------------------------------------------------+ 2 rows in set, 1 warning (0.00 sec) GreatDB Cluster[test]>
Note:
万里数据库并没有把子查询做为驱动表,仿佛把subquery的值以filter的形式传给了主查询employee1 ,并且使用的索引,与oracle的执行路径接近,而不是像mysql改写后或pg一样子查询驱动的方式。
References
https://www.modb.pro/db/1730045837189931008