首页 » ORACLE, ORACLE [C]系列 » Troubleshooting DB load high wait ‘ON CPU’ by New ASH in 12c R2

Troubleshooting DB load high wait ‘ON CPU’ by New ASH in 12c R2

前在前面

2021年1月6日,石家庄被按下了暂停键,街道变得宁静,新冠肺炎疫情反扑,周围气氛变的紧张, “空城”、“管控”、“隔离”、“确诊”等词让人不安, 他日的新闻中的场景,今日我们正在经历, 我在国际庄,距离藁城最近,5号晚封城后开始多渠道筹备粮食,居家办公,等核算检测结果,每天刷看确诊人员活动轨迹是否有与自己重合,最近2天附近又有新增诊被接走,希望庄里的小伙伴做好防护,感谢医务人员和所有抗疫人员的付出,再坚持一下,寒冷的“冬天”总会过去,一起加油。

远程办公第二天就2起故障,有1个故障感觉比较有意思,特分享出来, 为了是表明猜测充分,原谅我这里附上较多的输出。

背景

6日下午17:20接到告警CPU使用率高, 环境 Oracle 12cR2  3-nodes RAC on Linux. 登录系统查看是实例1活动会话接近200. 这是一个<90 cpu的主机. Session status 全是on cpu.  当然从我们的监控上很直观定位2条select SQL 占用了50% 左右的db time.  首先可能怀疑是SQL 执行量增长或执行计划改变等效率变差了, 这是一个1类系统,需要快速恢复应用,客户要求先kill部分会话, 查看了sql执行计划并未改变, 可是在KILL 部分会话后还在负载逐渐增长,没有多长时间,17:29 主机ping不通了,并且没有重启。

应用还好有配置TAF自动failover到了其它节点,并且运行良好,就当时的负载能把主机压死确实不多见,难道X86就如此脆弱?  下一步如何分析? 我们的AWR 周期是30分钟,AWR中关于SQL stat的信息还没收集。 当然事后会说为什么不收systemdump , 为什么不收hanganalyze, 为什么不手动收AWR, 为什么不把v$ash 物化下来, 可往往故障就是这么猝不及防,如何根据当前的信息进行分析?

如果此时有套监控采集信息或许能够提供部分信息, 如分析sysstat中sql executes是不是增长了?奇怪这该时间段我们的监控sysstat数据也没有采集到.

当时KILL 部分会话后收集

sampling SID all with interval 5 seconds, taking 1 snapshots...
-- Session Snapper v4.14 BETA - by Tanel Poder ( http://blog.tanelpoder.com ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)

----------------------------------------------------------------------------------------------------
Active% | INST | SQL_ID          | SQL_CHILD | EVENT                               | WAIT_CLASS
----------------------------------------------------------------------------------------------------
  3500% |    1 | 56twj7s93jaz4   | 0         | ON CPU                              | ON CPU
  1700% |    1 | dp8fnkzqdt3km   | 0         | ON CPU                              | ON CPU
  1300% |    1 | 05qszn0ufs4ff   | 0         | ON CPU                              | ON CPU
  1200% |    1 | 1b9zsamawq6mh   | 0         | ON CPU                              | ON CPU
  1100% |    1 | a7kcm21nbngvx   | 0         | ON CPU                              | ON CPU
  1100% |    1 | 2v7njuhdw0pgm   | 1         | ON CPU                              | ON CPU
  1000% |    1 | 13nf2mwh3xmsh   | 1         | ON CPU                              | ON CPU
   900% |    1 |                 | 0         | ON CPU                              | ON CPU
   700% |    1 | 60t0pum7f1pbm   | 1         | ON CPU                              | ON CPU
   600% |    1 | 60t0pum7f1pbm   | 2         | ON CPU                              | ON CPU

--  End of ASH snap 1, end=2021-01-06 17:24:43, seconds=9, samples_taken=1

可以确认当时全时ON CPU, 从上图看负载是从16:58左右开始, 分析这类问题DASH是常用的分析工具,之前总结过ASH相关的BLOG《Know more about Oracle ASH》,前不久还有同事问我ASH是何时从memory刷到disk中的, 1,默认AWR的快照间隔;2,如果ASH buffer 已满时MMNL进程都会负责这事。 如果17:00 到 17:30的AWR 没有形成,那17:00前那2分钟的DASH(dba_hist_active_sess_history)可以做为分析的入口。在统计top SQL 历史快照时发现实例1其实16:30 – 17:00 AWR snapshot也没形成。问题时间段跨越的2个AWR snapshot都没生成,那DASH 真没有数据了吗? 11G前是这样的,当12c不是的。

12c ASH 从memroy刷到disk形为改变

检查DASH数据是否存在?

SQL> select to_char(sample_time,'yyyymmdd hh24:mi'),count(*) --not * 10
from dba_hist_active_sess_history 
where sample_time >to_date('2021-01-06 17','yyyy-mm-dd hh24') and sample_time <to_date('2021-01-06 18','yyyy-mm-dd hh24') 
and instance_number=1 
group by to_char(sample_time,'yyyymmdd hh24:mi') order by 1;

TO_CHAR(SAMPLE   COUNT(*)
-------------- ----------
20210106 17:00        503
20210106 17:01        273
20210106 17:02        572
20210106 17:03        601
20210106 17:04        607
20210106 17:05        293
20210106 17:06        571
20210106 17:07        559
20210106 17:08        553
20210106 17:09        286
20210106 17:10        510
20210106 17:11        458
20210106 17:12        443
20210106 17:13        434
20210106 17:14        455
20210106 17:15        710
20210106 17:16        457
20210106 17:17        433
20210106 17:18        481
20210106 17:19        503
20210106 17:20        259
20210106 17:21        527
20210106 17:22        530
20210106 17:23        529
20210106 17:24        209

25 rows selected.

Note:
实例1 crash(17:28)之前的ASH数据基本上都在,这也正是oracle宣传的,ASH及时在系统负载很快时也可以很好的工作。 检查了db alert log没有提示ASH buffer 不足的问题, 那ASH是什么刷新频率刷到DISK上的呢?

SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1;

MAX(SAMPLE_TIME)                                                            SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 10.54.32.234 PM                                                   20210114 22:58:23

SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1;

MAX(SAMPLE_TIME)                                                            SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 10.54.32.234 PM                                                   20210114 22:59:52

SQL> r
  1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1

MAX(SAMPLE_TIME)                                                            SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 10.59.54.686 PM                                                   20210114 23:00:56

SQL> r
  1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1

MAX(SAMPLE_TIME)                                                            SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 11.04.26.622 PM                                                   20210114 23:07:58

Note:
非AWR snapshot flush时间,DASH也在逐渐的更新最新数据,基本上可以判断,当前的ASH是大概5分钟更新到DASH(dba_hist_active_sess_history)中, 那继续查找一下相关的ASH参数。

SQL> select
  2        n.indx
  3      , to_char(n.indx, 'XXXX') i_hex
  4      , n.ksppinm pd_name
  5      , c.ksppstvl pd_value
  6      , n.ksppdesc pd_descr
  7     from sys.x$ksppi n, sys.x$ksppcv c
  8     where n.indx=c.indx
  9     and  
 10        lower(n.ksppinm) || ' ' || lower(n.ksppdesc) like lower('%\_ash%')
 11  escape '\'   ;

      INDX I_HEX NAME                                 VALUE       DESCRIPTION
---------- ----- ------------------------------------ ----------- ----------------------------------------------------------------------
      4546  11C2 _ash_sampling_interval               1000        Time interval between two successive Active Session samples in
                                                                  millisecs

      4547  11C3 _ash_size                            1048618     To set the size of the in-memory Active Session History buffers
      4548  11C4 _ash_enable                          TRUE        To enable or disable Active Session sampling and flushing
      4549  11C5 _ash_disk_write_enable               TRUE        To enable or disable Active Session History flushing
      4550  11C6 _ash_disk_filter_ratio               10          Ratio of the number of in-memory samples to the number of samples
                                                                  actually written to disk

      4551  11C7 _ash_eflush_trigger                  66          The percentage above which if the in-memory ASH is full the emergency
                                                                  flusher will be triggered

      4552  11C8 _ash_sample_all                      FALSE       To enable or disable sampling every connected session including ones
                                                                  waiting for idle waits

      4553  11C9 _ash_dummy_test_param                0           Oracle internal dummy ASH parameter used ONLY for testing!
      4554  11CA _ash_min_mmnl_dump                   90          Minimum Time interval passed to consider MMNL Dump
      4555  11CB _ash_compression_enable              TRUE        To enable or disable string compression in ASH
      4556  11CC _ash_progressive_flush_interval      300         ASH Progressive Flush interval in secs

11 rows selected.

Note:
_ash_progressive_flush_interval 这个隐藏参数值300秒,描述也说明了它是控制ASH渐进式刷新频率秒数,该参数11G中并不存在,与之相关的TIPs几乎没有, 不过我们可以判断从12c除了上面提到的2种ASH flush形为,还有第3种,每300秒渐进式ASH也会flush到disk. 不得不感叹ORACLE一直在悄悄改善的更加完美, 这么实用的功能没有得到广泛的宣传。

从DASH 分析的SQL效率

先从现有的dba_hist_sqlstat分析TOP SQL的执行情况
脚本sql_hist.sql

                                  Summary Execution Statistics Over Time of SQL_ID:dp8fnkzqdt3km
                                                                                          Avg                 Avg
Snapshot                                              Avg LIO         Avg PIO      CPU (secs)      Elapsed (secs)
Beg Time     INSTANCE_NUMBER        Execs            Per Exec        Per Exec        Per Exec            Per Exec
------------ --------------- ------------ ------------------- --------------- --------------- -------------------
06-JAN 14:00               1          824          362,726.48            3.82            1.56                1.57
06-JAN 14:00               2          841                0.00            6.04            1.58                1.59
06-JAN 14:30               1        1,084          358,591.06            0.80            1.52                1.53
06-JAN 14:30               2          725          239,373.03           10.00            1.35                1.36
06-JAN 15:00               1        1,303          335,387.17            0.36            1.56                1.56
06-JAN 15:00               2          903          254,055.09            2.25            1.37                1.38
06-JAN 15:30               2          850          315,445.81           18.58            1.63                1.66
06-JAN 15:30               1        1,238          335,970.25            1.11            1.50                1.50
06-JAN 16:00               1        1,048          339,324.59            1.46            1.54                1.55
06-JAN 16:00               2          765          310,440.36            1.66            1.43                1.44
06-JAN 16:30               2          703          273,389.33           21.77            1.48                1.50
06-JAN 17:00               2          595          238,034.54           67.25            1.33                1.46
06-JAN 17:30               2          450          313,110.14          215.90            1.55                1.72
06-JAN 18:00               2          408          312,342.91           59.53            1.50                1.58
06-JAN 18:30               1           64          305,746.14        3,045.39            2.78                4.15
06-JAN 18:30               2          152          395,698.88          224.32            1.94                2.06
06-JAN 19:00               1           75          378,465.49          249.27            2.08                2.16
06-JAN 19:00               2          122          346,031.39           33.33            1.88                1.90
06-JAN 19:30               2           31          237,640.52           30.10            1.59                1.61
06-JAN 19:30               1           77          193,193.10           37.27            1.49                1.51
06-JAN 20:00               1           88          317,937.30          347.97            1.67                1.76
06-JAN 20:00               2           26          253,282.27            6.77            1.41                1.42
06-JAN 20:30               1           20          430,854.10            1.65            2.42                2.43
06-JAN 20:30               2           18          344,005.61           75.72            2.42                2.47
06-JAN 21:00               1            5          306,552.00           25.60            1.53                1.56
06-JAN 21:00               2           13          342,130.31            2.77            2.31                2.32
06-JAN 21:30               1           35            6,480.89            2.60            0.06                0.06
07-JAN 07:30               1           12          396,023.42        2,950.17            2.50                3.96
07-JAN 07:30               2           17        2,123,654.59        3,804.82            6.46                8.35
07-JAN 08:00               1          144          225,576.83          489.54            1.35                1.76

Note:
当然没有问题时间段的实例1 sql的执行情况,但是可以判断基本上每次执行20-40万的逻辑读,耗费时间为1-2秒,另外每天7:00和实例1刚启动时因为第一次物理读,单次执行也是在4-8秒之间。

看一下具体的负载趋势

SQL> create table dbmt.dash0106 tablespace users 
as select * from dba_hist_active_sess_history 
where sample_time >to_date('2021-01-06 16','yyyy-mm-dd hh24') and sample_time <to_date('2021-01-06 18','yyyy-mm-dd hh24') ; Table created. SQL>  select * from (
  2      select etime,nvl(event,'on cpu') events,sql_id, dbtime, cnt,first_time,end_time,
  3     round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime  desc) rn
  4   from (
  5  select to_char(SAMPLE_TIME,'yyyymmdd hh24:mi') etime,event,sql_id,count(*)*10 dbtime,count(*) cnt,
  6  to_char(min(SAMPLE_TIME),'hh24:mi:ss') first_time,to_char(max(SAMPLE_TIME),'hh24:mi:ss') end_time
  7   from dbmt.dash0106
  8  --where sample_time between to_date('2015-4-1 16:00','yyyy-mm-dd hh24:mi') and to_date('2015-4-1 17:00','yyyy-mm-dd hh24:mi')
  9   where INSTANCE_NUMBER=1
 10   group by to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),event,sql_id
 11  )
 12  ) where rn<=5;

ETIME          EVENTS     SQL_ID            DBTIME        CNT FIRST_TI END_TIME        PCT         RN
-------------- ---------- ------------- ---------- ---------- -------- -------- ---------- ----------

20210106 16:30 on cpu     d506gkxjgw7xr         40          4 16:30:04 16:30:54       7.41          1
               on cpu     4vk65sy477mxm         40          4 16:30:24 16:30:54       7.41          2
               on cpu     1wcsvyq7xshqz         40          4 16:30:24 16:30:54       7.41          3
               on cpu                           40          4 16:30:04 16:30:44       7.41          4
               on cpu     60dw4cw904b83         30          3 16:30:14 16:30:34       5.56          5

20210106 16:31 on cpu     56twj7s93jaz4        120         12 16:31:24 16:31:24      15.19          1
               on cpu     b15j01vphnb60        100         10 16:31:14 16:31:34      12.66          2
               on cpu     dp8fnkzqdt3km        100         10 16:31:04 16:31:54      12.66          3
               db file se fmtaf5hf9tm7s         60          6 16:31:04 16:31:54       7.59          4
               quential r
               ead
               on cpu     gm9rz10b0kb50         60          6 16:31:14 16:31:44       7.59          5

20210106 16:32 on cpu     dp8fnkzqdt3km         50          5 16:32:15 16:32:45       9.43          1
               on cpu     4vk65sy477mxm         40          4 16:32:25 16:32:55       7.55          2
               on cpu     gvv9dcnsc1jf9         40          4 16:32:25 16:32:55       7.55          3
               on cpu     d506gkxjgw7xr         40          4 16:32:04 16:32:45       7.55          4
               on cpu     0hpuufdajtzch         30          3 16:32:04 16:32:25       5.66          5

20210106 16:33 on cpu     b15j01vphnb60         60          6 16:33:05 16:33:55      16.22          1
               on cpu     dp8fnkzqdt3km         40          4 16:33:15 16:33:45      10.81          2
               on cpu     4vk65sy477mxm         40          4 16:33:25 16:33:55      10.81          3
               db file se fmtaf5hf9tm7s         30          3 16:33:05 16:33:55       8.11          4
               quential r
               ead
               gc cr gran fmtaf5hf9tm7s         20          2 16:33:15 16:33:25       5.41          5
               t 2-way
...
...
...

20210106 16:54 on cpu     d506gkxjgw7xr         70          7 16:54:05 16:54:55      15.22          1
               on cpu                           40          4 16:54:25 16:54:45        8.7          2
               gc cr gran fmtaf5hf9tm7s         30          3 16:54:05 16:54:45       6.52          3
               t 2-way
               on cpu     1wcsvyq7xshqz         30          3 16:54:05 16:54:25       6.52          4
               on cpu     gvv9dcnsc1jf9         30          3 16:54:25 16:54:55       6.52          5

20210106 16:55 on cpu     d506gkxjgw7xr         70          7 16:55:05 16:55:45      12.07          1
               on cpu                           50          5 16:55:05 16:55:45       8.62          2
               db file se fmtaf5hf9tm7s         50          5 16:55:05 16:55:55       8.62          3
               quential r
               ead
               on cpu     gm9rz10b0kb50         40          4 16:55:05 16:55:55        6.9          4
               on cpu     gvv9dcnsc1jf9         40          4 16:55:05 16:55:55        6.9          5

20210106 16:56 on cpu     dp8fnkzqdt3km        100         10 16:56:05 16:56:56      17.24          1
               on cpu     c8t8f3rps66d5         50          5 16:56:05 16:56:46       8.62          2
               on cpu     gvv9dcnsc1jf9         40          4 16:56:16 16:56:46        6.9          3
               on cpu     d506gkxjgw7xr         40          4 16:56:05 16:56:56        6.9          4
               gc cr gran fmtaf5hf9tm7s         30          3 16:56:16 16:56:56       5.17          5
               t 2-way

20210106 16:57 on cpu     56twj7s93jaz4        920         92 16:57:23 16:57:44      27.71          1   
               on cpu     dp8fnkzqdt3km        350         35 16:57:08 16:57:44      10.54          2
               on cpu                          310         31 16:57:08 16:57:44       9.34          3
               on cpu     d506gkxjgw7xr        210         21 16:57:08 16:57:44       6.33          4
               on cpu     gvv9dcnsc1jf9        160         16 16:57:08 16:57:44       4.82          5

20210106 16:58 on cpu     56twj7s93jaz4       1970        197 16:58:08 16:58:57      36.41          1
               on cpu     dp8fnkzqdt3km       1100        110 16:58:08 16:58:57      20.33          2
               on cpu                          390         39 16:58:08 16:58:57       7.21          3
               on cpu     d506gkxjgw7xr        250         25 16:58:08 16:58:57       4.62          4
               on cpu     gvv9dcnsc1jf9        250         25 16:58:08 16:58:57       4.62          5

20210106 16:59 on cpu     56twj7s93jaz4       1370        137 16:59:25 16:59:53      32.39          1
               on cpu     dp8fnkzqdt3km       1220        122 16:59:25 16:59:53      28.84          2
               on cpu                          240         24 16:59:25 16:59:53       5.67          3
               on cpu     d506gkxjgw7xr        160         16 16:59:25 16:59:53       3.78          4
               on cpu     05qszn0ufs4ff        140         14 16:59:25 16:59:53       3.31          5

20210106 17:00 on cpu     dp8fnkzqdt3km       1650        165 17:00:25 17:00:56       32.8          1
               on cpu     56twj7s93jaz4       1360        136 17:00:25 17:00:56      27.04          2
               on cpu                          270         27 17:00:25 17:00:56       5.37          3
               on cpu     d500bxxqf6dbs        160         16 17:00:25 17:00:56       3.18          4
               on cpu     05qszn0ufs4ff        150         15 17:00:25 17:00:56       2.98          5

20210106 17:01 on cpu     dp8fnkzqdt3km        900         90 17:01:30 17:01:30      32.97          1
               on cpu     56twj7s93jaz4        620         62 17:01:30 17:01:30      22.71          2
               on cpu     60t0pum7f1pbm        160         16 17:01:30 17:01:30       5.86          3
               on cpu     2v7njuhdw0pgm        110         11 17:01:30 17:01:30       4.03          4
               on cpu                          100         10 17:01:30 17:01:30       3.66          5

20210106 17:02 on cpu     dp8fnkzqdt3km       1890        189 17:02:03 17:02:37      33.04          1
               on cpu     56twj7s93jaz4        970         97 17:02:03 17:02:37      16.96          2
               on cpu     60t0pum7f1pbm        510         51 17:02:03 17:02:37       8.92          3
               on cpu     2v7njuhdw0pgm        460         46 17:02:03 17:02:37       8.04          4
               on cpu     05qszn0ufs4ff        190         19 17:02:03 17:02:37       3.32          5

20210106 17:03 on cpu     dp8fnkzqdt3km       2090        209 17:03:12 17:03:47      34.78          1
               on cpu     56twj7s93jaz4        730         73 17:03:12 17:03:47      12.15          2
               on cpu     2v7njuhdw0pgm        660         66 17:03:12 17:03:47      10.98          3
               on cpu     60t0pum7f1pbm        470         47 17:03:12 17:03:47       7.82          4
               on cpu                          320         32 17:03:12 17:03:47       5.32          5

20210106 17:04 on cpu     dp8fnkzqdt3km       2100        210 17:04:19 17:04:58       34.6          1
               on cpu     2v7njuhdw0pgm        550         55 17:04:19 17:04:58       9.06          2
               on cpu     56twj7s93jaz4        550         55 17:04:19 17:04:58       9.06          3
               on cpu     60t0pum7f1pbm        360         36 17:04:19 17:04:58       5.93          4
               on cpu                          260         26 17:04:19 17:04:58       4.28          5

20210106 17:05 on cpu     dp8fnkzqdt3km        990         99 17:05:33 17:05:33      33.79          1
               on cpu     2v7njuhdw0pgm        240         24 17:05:33 17:05:33       8.19          2
               on cpu     56twj7s93jaz4        230         23 17:05:33 17:05:33       7.85          3
               on cpu     60t0pum7f1pbm        190         19 17:05:33 17:05:33       6.48          4
               on cpu     a7kcm21nbngvx        180         18 17:05:33 17:05:33       6.14          5

20210106 17:06 on cpu     dp8fnkzqdt3km       1980        198 17:06:05 17:06:40      34.68          1
               on cpu     2v7njuhdw0pgm        380         38 17:06:05 17:06:40       6.65          2
               on cpu     56twj7s93jaz4        380         38 17:06:05 17:06:40       6.65          3
               on cpu     a7kcm21nbngvx        380         38 17:06:05 17:06:40       6.65          4
               on cpu     60t0pum7f1pbm        320         32 17:06:05 17:06:40        5.6          5

20210106 17:07 on cpu     dp8fnkzqdt3km       1870        187 17:07:14 17:07:47      33.45          1
               on cpu     2v7njuhdw0pgm        400         40 17:07:14 17:07:47       7.16          2
               on cpu     a7kcm21nbngvx        320         32 17:07:14 17:07:47       5.72          3
               on cpu     f018b0x00auxp        310         31 17:07:14 17:07:47       5.55          4
               on cpu     60t0pum7f1pbm        300         30 17:07:14 17:07:47       5.37          5

20210106 17:08 on cpu     dp8fnkzqdt3km       1820        182 17:08:18 17:08:51      32.91          1
               on cpu     2v7njuhdw0pgm        420         42 17:08:18 17:08:51       7.59          2
               on cpu     f018b0x00auxp        400         40 17:08:18 17:08:51       7.23          3
               on cpu     13nf2mwh3xmsh        390         39 17:08:18 17:08:51       7.05          4
               on cpu     60t0pum7f1pbm        330         33 17:08:18 17:08:51       5.97          5

20210106 17:09 on cpu     dp8fnkzqdt3km        910         91 17:09:26 17:09:26      31.82          1
               on cpu     2v7njuhdw0pgm        310         31 17:09:26 17:09:26      10.84          2
               on cpu     f018b0x00auxp        220         22 17:09:26 17:09:26       7.69          3
               on cpu     60t0pum7f1pbm        180         18 17:09:26 17:09:26       6.29          4
               on cpu     13nf2mwh3xmsh        150         15 17:09:26 17:09:26       5.24          5

20210106 17:10 on cpu     dp8fnkzqdt3km       1830        183 17:10:02 17:10:35      35.88          1
               on cpu     2v7njuhdw0pgm        530         53 17:10:02 17:10:35      10.39          2
               on cpu     13nf2mwh3xmsh        290         29 17:10:02 17:10:35       5.69          3
               on cpu     f018b0x00auxp        280         28 17:10:02 17:10:35       5.49          4
               on cpu                          270         27 17:10:02 17:10:35       5.29          5

...

20210106 17:23 on cpu     dp8fnkzqdt3km       1390        139 17:23:09 17:23:40      26.28          1
               on cpu     56twj7s93jaz4        770         77 17:23:09 17:23:40      14.56          2
               on cpu     60t0pum7f1pbm        330         33 17:23:09 17:23:40       6.24          3
               on cpu     05qszn0ufs4ff        260         26 17:23:09 17:23:40       4.91          4
               on cpu                          250         25 17:23:09 17:23:40       4.73          5

20210106 17:24 on cpu     56twj7s93jaz4        390         39 17:24:07 17:24:07      18.66          1
               on cpu                          160         16 17:24:07 17:24:07       7.66          2
               on cpu     60t0pum7f1pbm        140         14 17:24:07 17:24:07        6.7          3
               on cpu     05qszn0ufs4ff        130         13 17:24:07 17:24:07       6.22          4
               on cpu     a7kcm21nbngvx        110         11 17:24:07 17:24:07       5.26          5


425 rows selected.

Note:
从16:57开始的, DBtime 呈上升趋势,并且wait event: ON CPU, 也能看到TOP  SQL。

正常时间段DASH中TOP sql

SQL> select session_id,to_char(sample_time,'yyyymmdd hh24:mi:ss') etime,event,sql_exec_id,SQL_EXEC_START,session_state,TIME_WAITED,IN_SQL_EXECUTION,TM_DELTA_CPU_TIME 
from dbmt.dash0106
 where sample_time >to_date('2021-01-06 16:45','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 16:54','yyyy-mm-dd hh24:mi') 
 and sql_id='dp8fnkzqdt3km' order by 1,2;

SESSION_ID ETIME             EVENT      SQL_EXEC_ID SQL_EXEC_START    SESSION TIME_WAITED I TM_DELTA_CPU_TIME
---------- ----------------- ---------- ----------- ----------------- ------- ----------- - -----------------
        12 20210106 16:49:22               20337054 20210106 16:49:20 ON CPU            0 Y           5824114
        12 20210106 16:49:53               20337078 20210106 16:49:51 ON CPU            0 Y           2731150
       467 20210106 16:45:06               41876133 20210106 16:45:05 ON CPU            0 Y          12821930
       921 20210106 16:52:14               20337167 20210106 16:52:13 ON CPU            0 Y           9083661
      1229 20210106 16:48:48               41876201 20210106 16:48:45 ON CPU            0 Y          28963774
      1532 20210106 16:45:10               20336897 20210106 16:45:06 ON CPU            0 Y          12105818
      1532 20210106 16:51:43               20337139 20210106 16:51:43 ON CPU            0 Y           9087716
      1535 20210106 16:47:52               20337014 20210106 16:47:51 ON CPU            0 Y           6957901
      1535 20210106 16:48:02               20337023 20210106 16:48:01 ON CPU            0 Y           6100204
      1537 20210106 16:47:11               20336989 20210106 16:47:11 ON CPU            0 Y          10426243
      1559 20210106 16:48:12               20337027 20210106 16:48:11 ON CPU            0 Y          20086346
      1559 20210106 16:48:22               20337032 20210106 16:48:17 ON CPU            0 Y           5404968
      1559 20210106 16:50:13               20337087 20210106 16:50:12 ON CPU            0 Y           5066286
      2292 20210106 16:49:02               20337039 20210106 16:49:00 ON CPU            0 Y           6094836
      2435 20210106 16:46:16               41876161 20210106 16:46:15 ON CPU            0 Y          16064789
      2741 20210106 16:49:02               20337040 20210106 16:49:01 ON CPU            0 Y           5135847
      2746 20210106 16:45:31               20336918 20210106 16:45:30 ON CPU            0 Y            559010
      2746 20210106 16:46:21               20336951 20210106 16:46:20 ON CPU            0 Y            442342

TIP:
主要是看同一session 同一SQL的执行情况,通过session_id(session_serial#)和sql_exec_id来确认,sql_exec_id在同一SESSION执行sql会累增,sql_exec_id不变再看SQL_EXEC_START SQL开始执行的时间。 可见正常时间SQL持续运行的时间都是<2秒的。

问题时间段的DASH中TOP sql

SQL> select session_id,--SESSION_SERIAL#, (used to uniquely identify a session's objects)
  to_char(sample_time,'yyyymmdd hh24:mi:ss') etime,event,sql_exec_id,SQL_EXEC_START,session_state,TIME_WAITED,IN_SQL_EXECUTION,TM_DELTA_CPU_TIME 
  from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:58','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 17:10','yyyy-mm-dd hh24:mi') 
  and sql_id='dp8fnkzqdt3km' order by 1,2


SESSION_ID ETIME             EVENT      SQL_EXEC_ID SQL_EXEC_START    SESSION TIME_WAITED I TM_DELTA_CPU_TIME
---------- ----------------- ---------- ----------- ----------------- ------- ----------- - -----------------
         9 20210106 17:02:03               20337526 20210106 17:02:01 ON CPU            0 Y           3050637
         9 20210106 17:02:37               20337526 20210106 17:02:01 ON CPU            0 Y           8411155
         9 20210106 17:03:12               20337526 20210106 17:02:01 ON CPU            0 Y           7732513
         9 20210106 17:03:47               20337526 20210106 17:02:01 ON CPU            0 Y           8855886
         9 20210106 17:04:19               20337526 20210106 17:02:01 ON CPU            0 Y           6742950
         9 20210106 17:04:58               20337526 20210106 17:02:01 ON CPU            0 Y           8861440
         9 20210106 17:05:33               20337526 20210106 17:02:01 ON CPU            0 Y           7035218
         9 20210106 17:06:05               20337526 20210106 17:02:01 ON CPU            0 Y           7703591
         9 20210106 17:06:40               20337526 20210106 17:02:01 ON CPU            0 Y           8466669
        12 20210106 16:58:08               20337396 20210106 16:57:38 ON CPU            0 Y           7636384
        12 20210106 16:58:31               20337396 20210106 16:57:38 ON CPU            0 Y           6840769
        12 20210106 16:58:57               20337396 20210106 16:57:38 ON CPU            0 Y           6953856
        12 20210106 16:59:25               20337396 20210106 16:57:38 ON CPU            0 Y           6760293
        12 20210106 16:59:53               20337396 20210106 16:57:38 ON CPU            0 Y           6476894
        12 20210106 17:00:25               20337396 20210106 16:57:38 ON CPU            0 Y           6974848
        12 20210106 17:00:56               20337396 20210106 16:57:38 ON CPU            0 Y           5540165
        12 20210106 17:01:30               20337396 20210106 16:57:38 ON CPU            0 Y           6374697
        12 20210106 17:02:03               20337396 20210106 16:57:38 ON CPU            0 Y           6177705
        12 20210106 17:02:37               20337396 20210106 16:57:38 ON CPU            0 Y           5990692
        12 20210106 17:03:12               20337396 20210106 16:57:38 ON CPU            0 Y           6111413
        12 20210106 17:03:47               20337396 20210106 16:57:38 ON CPU            0 Y           6873905
        12 20210106 17:04:19               20337396 20210106 16:57:38 ON CPU            0 Y           5680029
        12 20210106 17:04:58               20337396 20210106 16:57:38 ON CPU            0 Y           6661646
        12 20210106 17:05:33               20337396 20210106 16:57:38 ON CPU            0 Y           5920078
        12 20210106 17:06:05               20337396 20210106 16:57:38 ON CPU            0 Y           6712145
        12 20210106 17:06:40               20337396 20210106 16:57:38 ON CPU            0 Y           6765375
        12 20210106 17:07:14               20337396 20210106 16:57:38 ON CPU            0 Y           5585508
        12 20210106 17:07:47               20337396 20210106 16:57:38 ON CPU            0 Y           6002971
        12 20210106 17:08:18               20337396 20210106 16:57:38 ON CPU            0 Y           6159024
        12 20210106 17:08:51               20337396 20210106 16:57:38 ON CPU            0 Y           6594718
        12 20210106 17:09:26               20337396 20210106 16:57:38 ON CPU            0 Y           6444770
        40 20210106 17:03:47               20337544 20210106 17:03:13 ON CPU            0 Y           8750964
        40 20210106 17:04:19               20337544 20210106 17:03:13 ON CPU            0 Y           5033203
        40 20210106 17:04:58               20337544 20210106 17:03:13 ON CPU            0 Y           6892512
        40 20210106 17:05:33               20337544 20210106 17:03:13 ON CPU            0 Y           6081303
        40 20210106 17:06:05               20337544 20210106 17:03:13 ON CPU            0 Y           6165369
        40 20210106 17:06:40               20337544 20210106 17:03:13 ON CPU            0 Y           6525653
        40 20210106 17:07:14               20337544 20210106 17:03:13 ON CPU            0 Y           5790222
        40 20210106 17:07:47               20337544 20210106 17:03:13 ON CPU            0 Y           6261831
        40 20210106 17:08:18               20337544 20210106 17:03:13 ON CPU            0 Y           5841272
        40 20210106 17:08:51               20337544 20210106 17:03:13 ON CPU            0 Y           6215711
        40 20210106 17:09:26               20337544 20210106 17:03:13 ON CPU            0 Y           6987163
        44 20210106 17:00:56               20337496 20210106 17:00:34 ON CPU            0 Y           7026321
        44 20210106 17:01:30               20337496 20210106 17:00:34 ON CPU            0 Y           8437444
        44 20210106 17:02:03               20337496 20210106 17:00:34 ON CPU            0 Y           8608695
        44 20210106 17:02:37               20337496 20210106 17:00:34 ON CPU            0 Y           8561660
        44 20210106 17:03:12               20337496 20210106 17:00:34 ON CPU            0 Y           8350169
       181 20210106 16:58:08               20337392 20210106 16:57:35 ON CPU            0 Y          10671584
...

Note:
这个时段的sql个别会话单次执行时间4分钟,10分钟以上。且这时间段没有I/O 类(物理读)wait event 全是ON CPU,业务也确认该时间没有太多数据变化, 从业务failover到其它节点后的执行效率也基本上是2秒内和之前一样,也能判断不是数据变化问题。

当然2个时间段的SQL 执行计划也一致

SQL> select to_char(sample_time,'yyyymmdd hh24:mi') etime,sql_id,SQL_PLAN_HASH_VALUE,count(*) from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:40','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 17:10','yyyy-mm-dd hh24:mi') and sql_id in('dp8fnkzqdt3km','56twj7s93jaz4') group by to_char(sample_time,'yyyymmdd hh24:mi'),sql_id,SQL_PLAN_HASH_VALUE order by 2,1;

ETIME          SQL_ID        SQL_PLAN_HASH_VALUE   COUNT(*)
-------------- ------------- ------------------- ----------
20210106 16:40 dp8fnkzqdt3km          2900077901         11
20210106 16:41 dp8fnkzqdt3km          2900077901          4
20210106 16:42 dp8fnkzqdt3km          2900077901          7
20210106 16:43 dp8fnkzqdt3km          2900077901          9
...
20210106 16:51 dp8fnkzqdt3km          2900077901          3
20210106 16:52 dp8fnkzqdt3km          2900077901          6
20210106 16:53 dp8fnkzqdt3km          2900077901          9
20210106 16:54 dp8fnkzqdt3km          2900077901          6
20210106 16:55 dp8fnkzqdt3km          2900077901          6
20210106 16:56 dp8fnkzqdt3km          2900077901         11
20210106 16:57 dp8fnkzqdt3km          2900077901         47
20210106 16:58 dp8fnkzqdt3km          2900077901        111
20210106 16:59 dp8fnkzqdt3km          2900077901        123
20210106 17:00 dp8fnkzqdt3km          2900077901        168
20210106 17:01 dp8fnkzqdt3km          2900077901         95
20210106 17:02 dp8fnkzqdt3km          2900077901        193
20210106 17:03 dp8fnkzqdt3km          2900077901        213
...
20210106 17:08 dp8fnkzqdt3km          2900077901        186
20210106 17:09 dp8fnkzqdt3km          2900077901         95

其它SQL也存在这种现象。执行计划一致,数据无变化,短时间内单次执行时间从秒级放大分钟级,并且session state 也是在ON CPU, 这段时间在buffer gets做运算,所以怀疑当时buffer gets出现系统调用问题。

主机什么原因CRASH

主机上的messages log服务问题没有生成日志,但是硬件工程师确认是因为物理内存损坏了,时间点也基本和数据库负载突增吻合, 所以有理由怀疑是在主机CRASH前,内存异常,导致数据库同一类SQL的内存内数据buffer get异常,单次SQL执行时间变长,最终导致负载的积压。

本次数据库负载异常或故障突然CRASH,而AWR snapshot没有形成时,在12c后中的ASH每5分钟逐渐式flush disk,已不会刷新太频繁而增加系统负载,也不会等到AWR SNAPSHOT时间大粒度间隔而突然重启而ASH数据缺失无法分析。本次就是利用DASH中SQL两个时间段的SQL执行持续时间判断SQL变慢而导致的业务积压,而非SQL执行量增加,或执行计划变化。

— ENJOY —

打赏

,

对不起,这篇文章暂时关闭评论。