过年刚回来就忙了起来,各种问题最近一直在处理优化几个应用的SQL, 其中开班第一天,就有套库的负载突然高的有个应用差点把库拖垮,下面简单的记录一下案例分析过程。
因为过年原因把好多月初该做的周期计算的任务推在开班第一天,应用的也着急,因为按以前的经验较快出的结果数据库一直没有跑完,他们就不停的加进程认为多那样总得时间更快,殊不知数据库系统资源是有限的,起再多算不过来也只会回到队列进程使系统更慢,这是一套32c的数据库主机当我统计是发现已起了约700个进程都是ACTIVE的。 这些活动的会话进程wait event基本都是“latch: cache buffers chains” 和” read by other session”, “latch: cache buffers chains” 以前也记录过几篇笔记,当一个进程试取得cache buffer hash chain时的一种latch, 查找一个数据块的DBA当前是不是在Buffer Cache中,CBC latch竞争的原因很多,通常也可以理解为一种热块, 对于CBC 通常都是从session wait中找到child latch address 然后再去x$bh的hladdr字段找到相应的obj,1个BH handle可以会关连多个obj, 再参考TCH 列确认比较hot的对象。大多数CBC的OLTP系统多数应该注意一下sql 的执行计划中使用了NL join的方式; 对于read by other session 也是一种hot block事件,当一个进程想把data block读进BH时,另一个进程快开始做这事情了,但还没读完。
SQL> @asess USERNAME SID EVENT MACHINE PROGRAM STATUS LAST_CALL_ET HASH_VALUE SQL_ID WAI_SECINW BS SQLTEXT CH# SEQ# ROW_WAIT_OBJ# ---------- ---------- -------------------- ---------- -------------- -------- ------------ ---------- --------------- ---------- ---------- ------------------------------ ---- ---------- ------------- ANBOB_P1 907 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3758 3133220035 ac2a4daxc2963 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 49467 3462360 ANBOB_P1 3426 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3758 658665007 f9bu1ywmn4vjg 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 57994 3462360 ANBOB_P1 7961 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3758 455527273 bwpfag8dkdkv9 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 42487 3462360 ANBOB_P1 7107 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3771 3924654184 grq56avnyuy38 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 51638 3462360 ANBOB_P1 1717 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3748 1602056967 1ub6u7tgruws7 457:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 5290 3462360 ANBOB_P1 36 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3750 183002219 dg676ns5fht3b 0:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 52221 3462360 ANBOB_P1 2297 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3760 2376432804 3fmp9fq6uaz54 0:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 58741 3462360 ANBOB_P1 6821 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3888 1323001963 a6pf3r57dqu3b 0:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 6526 3462360 ... 681 rows selected.
Note:
看到SQL id虽不同但是文本貌似相同,随后我们确认了应用未使用绑定变量, 只是SQL中的几个字面量不一样, 找一条SQL我们手动执行一下,查看执行计划和效率(确保手动执行计划和应用的相同)。
SQL> SELECT SUBSID,
2 VALTYPE,
3 TARGET_ID,
4 TARGET_VAL
5 FROM ANBOB_P1.CH_NGSETTLE_SUBS_TARGET PARTITION (PART_310_201601) B
6 WHERE EXISTS
7 (SELECT N.OID
8 FROM ANBOB_P1.RECEPTION PARTITION (PART_310_201601) N
9 WHERE B.SUBSID = N.OID AND N.RECORGID = 'xxxdf.xxx.sfdsfsdf')
10 AND CARRYINGTYPE = 'ceReception'
11 ORDER BY SUBSID;
59 rows selected.
Elapsed: 00:04:44.50
Execution Plan
----------------------------------------------------------
Plan hash value: 1746076586
--------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 112 | 1 (0)| 00:00:01 | | |
| 1 | NESTED LOOPS SEMI | | 1 | 112 | 1 (0)| 00:00:01 | | |
| 2 | PARTITION RANGE SINGLE | | 1 | 81 | 0 (0)| 00:00:01 | 12 | 12 |
|* 3 | TABLE ACCESS BY LOCAL INDEX ROWID| CH_NGSETTLE_SUBS_TARGET | 1 | 81 | 0 (0)| 00:00:01 | 12 | 12 |
| 4 | INDEX FULL SCAN | IDX_NGSUBS_TARTGE_SUBSID | 1 | | 0 (0)| 00:00:01 | 12 | 12 |
| 5 | PARTITION RANGE SINGLE | | 1 | 31 | 1 (0)| 00:00:01 | 18 | 18 |
|* 6 | TABLE ACCESS BY LOCAL INDEX ROWID| RECEPTION | 1 | 31 | 1 (0)| 00:00:01 | 18 | 18 |
|* 7 | INDEX SKIP SCAN | IDX_RECEPTION_OID | 1 | | 1 (0)| 00:00:01 | 18 | 18 |
--------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("CARRYINGTYPE"='ceReception')
6 - filter("N"."RECORGID"='xxxdf.xxx.sfdsfsdf')
7 - access("B"."SUBSID"="N"."OID")
filter("B"."SUBSID"="N"."OID")
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
44058611 consistent gets
1933 physical reads
0 redo size
1870 bytes sent via SQL*Net to client
520 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
59 rows processed
Note:
这就是一个简单的查询, 两个表使用了分区名的方式限制了一个数据周期,1个表使用了IFS而另一个使用了ISS 这样的访问路径,和使用了NL JOIN 的连接方式,这就是我们上面提到过的CBC很多情况与NL join相关。但是看cost非常小,猜测是统计信息不准,用时00:04:44.50, 而且之返回59行的记录,产生了44058611 一致读, 这个SQL有很大的优化空间。
1,不建议使用分区名的方试限制分区范围
比如对分区名过分依赖而不是分区条件,就像我曾经写的那篇”数据去哪了”, 分区名只是指定了表的裁剪区,强烈建议使用分区列做为where查询的谓词条件去过滤数据,因为那样让CBO更好的估算分区表裁剪和对应的分区索引裁剪扫描范围。
SELECT SUBSID,
VALTYPE,
TARGET_ID,
TARGET_VAL
FROM ANBOB_P1.CH_NGSETTLE_SUBS_TARGET B
WHERE b.region = 310
AND b.CYCLE = 201601
AND EXISTS
(SELECT N.OID
FROM ANBOB_P1.RECEPTION N
WHERE n.region = 310
AND n.CYCLE = 201601
AND B.SUBSID = N.OID
AND N.RECORGID = 'xxxdf.xxx.sfdsfsdf')
AND CARRYINGTYPE = 'ceReception'
ORDER BY SUBSID;
Elapsed: 00:01:45.62
Execution Plan
----------------------------------------------------------
Plan hash value: 1743099239
-------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 164 | 0 (0)| 00:00:01 | | |
| 1 | NESTED LOOPS SEMI | | 1 | 164 | 0 (0)| 00:00:01 | | |
| 2 | PARTITION RANGE SINGLE | | 1 | 107 | 0 (0)| 00:00:01 | 12 | 12 |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID| CH_NGSETTLE_SUBS_TARGET | 1 | 107 | 0 (0)| 00:00:01 | 12 | 12 |
|* 4 | INDEX RANGE SCAN | IDX_SUBS_TARGET1 | 1 | | 0 (0)| 00:00:01 | 12 | 12 |
| 5 | PARTITION RANGE SINGLE | | 1 | 57 | 0 (0)| 00:00:01 | 18 | 18 |
|* 6 | TABLE ACCESS BY LOCAL INDEX ROWID| RECEPTION | 1 | 57 | 0 (0)| 00:00:01 | 18 | 18 |
|* 7 | INDEX RANGE SCAN | IDX_RECEPTION_OID | 1 | | 0 (0)| 00:00:01 | 18 | 18 |
-------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("B"."REGION"=310 AND "B"."CYCLE"=201601 AND "CARRYINGTYPE"='ceReception')
6 - filter("N"."RECORGID"='xxxdf.xxx.sfdsfsdf')
7 - access("N"."REGION"=310 AND "N"."CYCLE"=201601 AND "B"."SUBSID"="N"."OID")
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
1255979 consistent gets
2895 physical reads
0 redo size
1870 bytes sent via SQL*Net to client
520 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
59 rows processed
Note:
注意我们把上面的SQL 把分区表改成了where 条件,用时立即从4m44s 降到1m:45s, 一致读从44058611降到了1255979,索引访问方式都变成了IRS且有一个表使用了新的索引(不是新建的,只是列更合适,关于这个列字段略),并且扫描的分区范围也是意料内的单分区,当然到这还没结束,这两个表都是较大的表,因为统计信息的错误,低估了表上的数据,才使用了NESTED LOOPS SEMI 这种连接方式,我们再把两个表的统计信息收集一下(可以只收那两个表分区及索引分区)。
2,统计信息收集 略, 再来看收集以后的执行计划。
SQL> SELECT SUBSID,
VALTYPE,
TARGET_ID,
TARGET_VAL
FROM ANBOB_P1.CH_NGSETTLE_SUBS_TARGET B
WHERE b.region = 310
AND b.CYCLE = 201601 and EXISTS
(SELECT N.OID
FROM ANBOB_P1.RECEPTION N
WHERE B.SUBSID = N.OID AND N.RECORGID = 'xxxdf.xxx.sfdsfsdf' and n.region = 310
AND n.CYCLE = 201601 )
AND CARRYINGTYPE = 'ceReception'
ORDER BY SUBSID ;
59 rows selected.
Elapsed: 00:00:04.12
Execution Plan
----------------------------------------------------------
Plan hash value: 3997962624
--------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12154 | 902K| 417 (1)| 00:00:02 | | |
| 1 | MERGE JOIN SEMI | | 12154 | 902K| 417 (1)| 00:00:02 | | |
| 2 | PARTITION RANGE SINGLE | | 39M| 1558M| 0 (0)| 00:00:01 | 12 | 12 |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID | CH_NGSETTLE_SUBS_TARGET | 39M| 1558M| 0 (0)| 00:00:01 | 12 | 12 |
|* 4 | INDEX RANGE SCAN | IDX_SUBS_TARGET1 | 867K| | 0 (0)| 00:00:01 | 12 | 12 |
|* 5 | SORT UNIQUE | | 1633 | 57155 | 417 (1)| 00:00:02 | | |
| 6 | TABLE ACCESS BY GLOBAL INDEX ROWID| RECEPTION | 1633 | 57155 | 416 (1)| 00:00:02 | 18 | 18 |
|* 7 | INDEX RANGE SCAN | IDX_RECEPTION_RECORGID | 7421 | | 24 (0)| 00:00:01 | | |
--------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("B"."REGION"=310 AND "B"."CYCLE"=201601 AND "CARRYINGTYPE"='ceReception')
5 - access("B"."SUBSID"="N"."OID")
filter("B"."SUBSID"="N"."OID")
7 - access("N"."RECORGID"='xxxdf.xxx.sfdsfsdf' AND "N"."CYCLE"=201601 AND "N"."REGION"=310)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
650583 consistent gets
0 physical reads
0 redo size
1870 bytes sent via SQL*Net to client
520 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
59 rows processed
Note:
用时已从最开始的4m44s到上面的1m45s再到现在的仅4秒,一致读也从最初的4000多万降到了65万,”NESTED LOOPS SEMI”换成了”MERGE JOIN SEMI”,数据库恢复了昔日的平静。
Summary:
优化的SQL前提是良好的SQL书写格式,不要用分区名,不要用分区名,不要用分区名
对于分区的统计信息比如以月份的的可以copy上月的统计到本月,也最好不要拆完分区立即收集,因为当时数据还没有,num_rows,blks 都是0,还不如动态采样呢