HAVING expression default Selectivity

Usually aggregates are one of the last steps executed before the final result set is returned to the client. GROUP BY operations an additional filter on the aggregates can be applied using the HAVING clause.This filter operations can be seen from the execution plan。CBO generate the correct execution plan is based on COST，Two important and related concepts inside CBO is selectivity and cardinality. The cardinality means that how many rows should be returned by CBO after evaluating the predicates.In some cases the query optimizer might can not accurately estimate cardinality (e.g. group by having or like with bind variables cause etc..),therefore the Cardinality estimates are based on default seletivity of build-in oracle coredefault seletivity(5%).when a GROUP BY operation might be somewhere in the middle of the execution plan operation, for example it might be part of a view that cannot be merged (or was hinted not to be merged using the NO_MERGE hint), optimizer simply assumes 5% selectivity per unknown range predicate resulting,as it will influence the choice of other related execution steps like join orders and methods .

Here is an example

anbob@OEM12C>create table testhaving
  2  as select rownum id,mod(rownum-1,100) c1, mod(rownum-1,100) c2
  3  from dual
  4  connect by level <=1e5;

Table created.

--创建一个10W记录的表，c1 ,c2 列值在0-99之间

anbob@OEM12C>create unique index uidx_testhaving_id on testhaving(id);
Index created.

anbob@OEM12C>exec dbms_stats.gather_table_stats(user,'testhaving',cascade=>true);
PL/SQL procedure successfully completed.

anbob@OEM12C>explain plan for 
  2  select id,max(c1) mc1,max(c2) mc2
  3  from testhaving
  4  where id>50000 group by id;
Explained.

--这里我查询50%的数据

anbob@OEM12C>select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 1305797301
---------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                    | 49999 |   488K|   217   (1)| 00:00:02 |
|   1 |  SORT GROUP BY NOSORT        |                    | 49999 |   488K|   217   (1)| 00:00:02 |
|   2 |   TABLE ACCESS BY INDEX ROWID| TESTHAVING         | 49999 |   488K|   217   (1)| 00:00:02 |
|*  3 |    INDEX RANGE SCAN          | UIDX_TESTHAVING_ID | 49999 |       |   106   (1)| 00:00:01 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("ID">50000)

--可以看到只是group by 估算的cardinality是大致正确的

anbob@OEM12C>explain plan for select id,max(c1) mc1,max(c2) mc2
  2  from testhaving
  3  where id>50000
  4  group by id
  5  having max(c1)>=0;
Explained.

--加一个having 条件，其实并不会过滤任何数据

anbob@OEM12C>select * from table(dbms_xplan.display);
----------------------------------------------------------------------------------
| Id  | Operation           | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |            |  2500 | 25000 |    88   (8)| 00:00:01 |
|*  1 |  FILTER             |            |       |       |            |          |
|   2 |   HASH GROUP BY     |            |  2500 | 25000 |    88   (8)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| TESTHAVING | 49999 |   488K|    84   (4)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX("C1")>=0)
   3 - filter("ID">50000)

--加1 having条件，可以看到已按5%seletivity，估算的cardinality 50000*5%=2500

anbob@OEM12C>explain plan for select id,max(c1) mc1,max(c2) mc2
  2  from testhaving
  3  where id>50000
  4  group by id
  5  having max(c1)>=0 and max(c2)<101
  6  ;
Explained.

anbob@OEM12C>select * from table(dbms_xplan.display);

----------------------------------------------------------------------------------
| Id  | Operation           | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |            |   125 |  1250 |    88   (8)| 00:00:01 |
|*  1 |  FILTER             |            |       |       |            |          |
|   2 |   HASH GROUP BY     |            |   125 |  1250 |    88   (8)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| TESTHAVING | 49999 |   488K|    84   (4)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX("C1")>=0 AND MAX("C2")<101)
   3 - filter("ID">50000)

tip:
加入第2个having 条件，估算的cardinality 50000*5%*5%=125，如果这只是最后操作并不影响，但是如果还要和其它表关连就会影响，如小比例数据集可能会先考虑NL join,反之则hash join会更好。当group by 是一个 view or 子查询是，它的过滤条件执行计划有可能会push到view 内部，隐式的形成类似having.看下面的例子

anbob@OEM12C>select  /*+no_merge(temp)*/ * from (
  2  select t1.*, t2.id  t2_id
  3  from    (          
  4            select id,max(c1) mc1,max(c2) mc2
  5            from testhaving   
  6            group by id
  7          ) t1
  8        , testhaving  t2
  9  where
 10          mc1>=0
 11  and     mc2<=101
 12  and     t1.id > 50000
 13  and     t1.id = t2.id
 14  )temp where rownum>1;
 
 
-----------------------------------------------------------------------------------------------
| Id  | Operation                | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                    |   125 |  6500 |    88   (8)| 00:00:01 |
|   1 |  COUNT                   |                    |       |       |            |          |
|*  2 |   FILTER                 |                    |       |       |            |          |
|   3 |    VIEW                  |                    |   125 |  6500 |    88   (8)| 00:00:01 |
|   4 |     NESTED LOOPS         |                    |   125 |  4500 |    88   (8)| 00:00:01 |
|   5 |      VIEW                |                    |   125 |  3875 |    88   (8)| 00:00:01 |
|*  6 |       FILTER             |                    |       |       |            |          |
|   7 |        HASH GROUP BY     |                    |   125 |  1250 |    88   (8)| 00:00:01 |
|*  8 |         TABLE ACCESS FULL| TESTHAVING         | 49999 |   488K|    84   (4)| 00:00:01 |
|*  9 |      INDEX UNIQUE SCAN   | UIDX_TESTHAVING_ID |     1 |     5 |     0   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(ROWNUM>1)
   6 - filter(MAX("C1")>=0 AND MAX("C2")<=101)
   8 - filter("ID">50000)
   9 - access("T1"."ID"="T2"."ID")
       filter("T2"."ID">50000)

TIP:
6 – filter(MAX(“C1”)>=0 AND MAX(“C2”)<=101) 8 - filter("ID">50000) 在view的内部执行，因为估算view只有125行所以用了NL JOIN,但是并未过滤数据实为50000行。而且是TFS全表扫描，因为每行的长度较小，因此增加一列，增加TFS的代价从而使用INDEX.

       
anbob@OEM12C>alter table testhaving add remark1 varchar2(200);

Table altered.

anbob@OEM12C>update testhaving set remark1=rpad('x',100);

100000 rows updated.

anbob@OEM12C>exec dbms_stats.gather_table_stats(user,'testhaving',cascade=>true);


anbob@OEM12C>select /*+no_merge(temp)*/ * from (
  2  select /*+ gather_plan_statistics anbob */ t1.*, t2.id  t2_id,t2.remark1
  3  from    (      
  4            select  id,max(c1) mc1,max(c2) mc2
  5            from testhaving   
  6            group by id
  7          ) t1
  8        , testhaving  t2
  9  where
 10          mc1>=0
 11  and     mc2<=101
 12  and     t1.id > 50000
 13  and     t1.id = t2.id
 14  )temp where rownum>1;

no rows selected

TIP:
我们增加一个rownum >1 其实不会返回任何数据，为了减少 gather_plan_statistics 时查询输出。

anbob@OEM12C>select sql_id,child_number,substr(sql_text,1,40)
  2  from v$sql where instr(sql_text,'anbob')>0;

SQL_ID                CHILD_NUMBER SUBSTR(SQL_TEXT,1,40)
------------- -------------------- --------------------------------------------------------------------------------
7fdcyff1pjfxw                    0 select /*+no_merge(temp)*/ * from ( sele
f1v9hj3qd7hhh                    0 select sql_id,child_number,substr(sql_te


anbob@OEM12C>select * from table(dbms_xplan.display_cursor('7fdcyff1pjfxw', 0, 'RUNSTATS_LAST'));

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                    |      1 |        |      0 |00:00:00.70 |     240K|
|   1 |  COUNT                             |                    |      1 |        |      0 |00:00:00.70 |     240K|
|*  2 |   FILTER                           |                    |      1 |        |      0 |00:00:00.70 |     240K|
|   3 |    VIEW                            |                    |      1 |    123 |  50000 |00:00:00.71 |     240K|
|   4 |     NESTED LOOPS                   |                    |      1 |    123 |  50000 |00:00:00.71 |     240K|
|   5 |      VIEW                          |                    |      1 |    123 |  50000 |00:00:00.26 |   93816 |
|*  6 |       FILTER                       |                    |      1 |        |  50000 |00:00:00.26 |   93816 |
|   7 |        HASH GROUP BY               |                    |      1 |    123 |  50000 |00:00:00.21 |   93816 |
|   8 |         TABLE ACCESS BY INDEX ROWID| TESTHAVING         |      1 |  49100 |  50000 |00:00:00.15 |   93816 |
|*  9 |          INDEX RANGE SCAN          | UIDX_TESTHAVING_ID |      1 |  50009 |  50000 |00:00:00.01 |     106 |
|  10 |      TABLE ACCESS BY INDEX ROWID   | TESTHAVING         |  50000 |      1 |  50000 |00:00:00.40 |     146K|
|* 11 |       INDEX UNIQUE SCAN            | UIDX_TESTHAVING_ID |  50000 |      1 |  50000 |00:00:00.14 |   50002 |
-------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(ROWNUM>1)
   6 - filter((MAX("C1")>=0 AND MAX("C2")<=101))
   9 - access("ID">50000)
  11 - access("T1"."ID"="T2"."ID")
       filter("T2"."ID">50000)

TIP:
可以看到增加一列后使用了索引，但因为错误估算cardinality还是NL JOIN。可以使用CARDINALITY 、OPT_ESTIMATE Hint 来提示cardinality。

CARDINALITY hint (9i+): Undocumented
OPT_ESTIMATE hint (10g+): Undocumented

anbob@OEM12C>select /*+no_merge(temp)*/ * from (
  2  select /*+ gather_plan_statistics anbob2 */ t1.*, t2.id  t2_id,t2.remark1
  3  from    (      
  4            select /*+ CARDINALITY( 50000 )*/  id,max(c1) mc1,max(c2) mc2
  5            from testhaving   
  6            group by id
  7          ) t1
  8        , testhaving  t2
  9  where
 10          mc1>=0
 11  and     mc2<=101
 12  and     t1.id > 50000
 13  and     t1.id = t2.id
 14  )temp where rownum>1;



------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                    |      1 |        |      0 |00:00:00.47 |     187K|
|   1 |  COUNT                            |                    |      1 |        |      0 |00:00:00.47 |     187K|
|*  2 |   FILTER                          |                    |      1 |        |      0 |00:00:00.47 |     187K|
|   3 |    VIEW                           |                    |      1 |  50000 |  50000 |00:00:00.48 |     187K|
|*  4 |     FILTER                        |                    |      1 |        |  50000 |00:00:00.48 |     187K|
|   5 |      HASH GROUP BY                |                    |      1 |  50000 |  50000 |00:00:00.43 |     187K|
|*  6 |       HASH JOIN                   |                    |      1 |  49100 |  50000 |00:00:00.42 |     187K|
|   7 |        TABLE ACCESS BY INDEX ROWID| TESTHAVING         |      1 |  49100 |  50000 |00:00:00.15 |   93816 |
|*  8 |         INDEX RANGE SCAN          | UIDX_TESTHAVING_ID |      1 |  50009 |  50000 |00:00:00.01 |     106 |
|   9 |        TABLE ACCESS BY INDEX ROWID| TESTHAVING         |      1 |  49100 |  50000 |00:00:00.20 |   93816 |
|* 10 |         INDEX RANGE SCAN          | UIDX_TESTHAVING_ID |      1 |  50009 |  50000 |00:00:00.01 |     106 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(ROWNUM>1)
   4 - filter((MAX("C1")>=0 AND MAX("C2")<=101))
   6 - access("ID"="T2"."ID")
   8 - access("ID">50000)
  10 - access("T2"."ID">50000)

TIP:
使用CARDINALITY hint后，NL JOIN 改为了HASH JOIN，但是 filter((MAX(“C1”)>=0 AND MAX(“C2”)<=101)) 操作到到了T1 T2 HASH JOIN 操作以后。但是执行时间有所提升，一致读也减少了。下面是用USE_HASH Hint提示后的结果


anbob@OEM12C>select /*+no_merge(temp)*/ * from (
  2  select /*+ gather_plan_statistics anbob5 USE_HASH(T1 T2) */ t1.*, t2.id  t2_id,t2.remark1
  3  from    (      
  4            select /*+*/  id,max(c1) mc1,max(c2) mc2
  5            from testhaving   
  6            group by id
  7          ) t1
  8        , testhaving  t2
  9  where
 10          mc1>=0
 11  and     mc2<=101
 12  and     t1.id > 50000
 13  and     t1.id = t2.id
 14  )temp where rownum>1;
 
 ——————————————————————————————————————-
| Id  | Operation                          | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
——————————————————————————————————————-
|   0 | SELECT STATEMENT                   |                    |      1 |        |      0 |00:00:00.41 |     187K|
|   1 |  COUNT                             |                    |      1 |        |      0 |00:00:00.41 |     187K|
|*  2 |   FILTER                           |                    |      1 |        |      0 |00:00:00.41 |     187K|
|   3 |    VIEW                            |                    |      1 |    123 |  50000 |00:00:00.46 |     187K|
|*  4 |     HASH JOIN                      |                    |      1 |    123 |  50000 |00:00:00.41 |     187K|
|   5 |      VIEW                          |                    |      1 |    123 |  50000 |00:00:00.21 |   93816 |
|*  6 |       FILTER                       |                    |      1 |        |  50000 |00:00:00.16 |   93816 |
|   7 |        HASH GROUP BY               |                    |      1 |    123 |  50000 |00:00:00.16 |   93816 |
|   8 |         TABLE ACCESS BY INDEX ROWID| TESTHAVING         |      1 |  49100 |  50000 |00:00:00.15 |   93816 |
|*  9 |          INDEX RANGE SCAN          | UIDX_TESTHAVING_ID |      1 |  50009 |  50000 |00:00:00.01 |     106 |
|  10 |      TABLE ACCESS BY INDEX ROWID   | TESTHAVING         |      1 |  49100 |  50000 |00:00:00.15 |   93816 |
|* 11 |       INDEX RANGE SCAN             | UIDX_TESTHAVING_ID |      1 |  50009 |  50000 |00:00:00.01 |     106 |
——————————————————————————————————————-

Predicate Information (identified by operation id):
—————————————————

   2 – filter(ROWNUM>1)
   4 – access(“T1″.”ID”=”T2″.”ID”)
   6 – filter((MAX(“C1”)>=0 AND MAX(“C2”)<=101))
   9 - access("ID">50000)
  11 – access(“T2”.”ID”>50000)
  
  

anbob@OEM12C>SELECT /*+no_merge(temp)*/
  2          *
  3      FROM (SELECT /*+ gather_plan_statistics anbob7 */
  4                  t1.*, t2.id t2_id, t2.remark1
  5              FROM (SELECT id,
  6                           mc1,
  7                           mc2,
  8                           ROWNUM rn
  9                      FROM (  SELECT /*+*/
 10                                    id, MAX (c1) mc1, MAX (c2) mc2
 11                                FROM testhaving
 12                            GROUP BY id)) t1,
 13                   testhaving t2
 14             WHERE mc1 >= 0 AND mc2 <= 101 AND t1.id > 50000 AND t1.id = t2.id) temp
 15     WHERE ROWNUM > 1;

——————————————————————————————————
| Id  | Operation                       | Name               | Starts | E-Rows | A-Rows |   A-Time   |
——————————————————————————————————
|   0 | SELECT STATEMENT                |                    |      1 |        |      0 |00:00:00.42 |
|   1 |  COUNT                          |                    |      1 |        |      0 |00:00:00.42 |
|*  2 |   FILTER                        |                    |      1 |        |      0 |00:00:00.42 |
|   3 |    VIEW                         |                    |      1 |  49100 |  50000 |00:00:00.46 |
|*  4 |     HASH JOIN                   |                    |      1 |  49100 |  50000 |00:00:00.46 |
|*  5 |      VIEW                       |                    |      1 |  98182 |  50000 |00:00:00.30 |
|   6 |       COUNT                     |                    |      1 |        |    100K|00:00:00.20 |
|   7 |        VIEW                     |                    |      1 |  98182 |    100K|00:00:00.10 |
|   8 |         HASH GROUP BY           |                    |      1 |  98182 |    100K|00:00:00.10 |
|   9 |          TABLE ACCESS FULL      | TESTHAVING         |      1 |  98182 |    100K|00:00:00.01 |
|  10 |      TABLE ACCESS BY INDEX ROWID| TESTHAVING         |      1 |  49100 |  50000 |00:00:00.15 |
|* 11 |       INDEX RANGE SCAN          | UIDX_TESTHAVING_ID |      1 |  50009 |  50000 |00:00:00.01 |
——————————————————————————————————

Predicate Information (identified by operation id):
—————————————————

   2 – filter(ROWNUM>1)
   4 – access(“T1″.”ID”=”T2″.”ID”)
   5 – filter((“MC1″>=0 AND “MC2”<=101 AND "T1"."ID">50000))
  11 – access(“T2”.”ID”>50000)

TIP:
在子查询外层增加了rownum伪列，就做到了防止mc1 >= 0 AND mc2 <= 101 AND t1.id > 50000 push（推进）到view内部，带来的负作用就是group by 是所有的数据10W，而实际只要5W记录，而且因为rownum在如果并行处理时会影响性能

Summary:

用到having 时注意因为查询优化器使用了默认的5% selectivity而产生了不准确的CARDINALITY ，从而影响了整个执行计划。如果group by 聚合操作是最后一步可以不考虑。

extend read
like 5% selectivity http://www.anbob.com/archives/1701.html