
Diagnosing a Database Performance Problem Caused by a Storage Link Issue: ORA-32701 and "krsv_proc_kill: Killing 1 processes (Process by index)" in the alert log

A few days ago one of our databases ran into a performance problem. Although the root cause turned out not to be in the database, I am recording the case in the hope that it saves you time if you hit the same issue. The environment is an 11.2.0.3 two-node RAC on HP-UX with EMC storage. Starting around 16:40 the middleware occasionally reported momentary business backlogs; during a backlog most database sessions were running a simple INSERT into the same table. The batch job was then stopped and the database showed no further obvious backlog. Since this was reported to me on a weekend evening and the batch was not restarted, the only remaining symptom was an error in the middleware log about sessions being killed. To find out why the sessions were killed, a colleague sent me the following excerpt from the DB alert log:

# DB ALERT

Sun Jul 19 16:32:11 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc  (incident=512083):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_dia0_2471_i512083.trc
Sun Jul 19 16:32:13 2015
Sweep [inc][512083]: completed
Sun Jul 19 16:32:13 2015
Sweep [inc2][512083]: completed
System State dumped to trace file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_m000_29275_i512083_a.trc
DIA0 terminating blocker (ospid: 4073 sid: 7829 ser#: 4119) of hang with ID = 223
    requested by master DIA0 process on instance 1
    Hang Resolution Reason: Automatic hang resolution was performed to free a
    significant number of affected sessions.
   by terminating the process
DIA0 successfully terminated process ospid:4073.
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc  (incident=512084):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512084/weejar2_dia0_2471_i512084.trc
Sun Jul 19 16:32:34 2015
Sweep [inc][512084]: completed
Sun Jul 19 16:32:35 2015
DIA0 terminating blocker (ospid: 16104 sid: 2287 ser#: 58237) of hang with ID = 225
    requested by master DIA0 process on instance 1
    Hang Resolution Reason: Automatic hang resolution was performed to free a
    significant number of affected sessions.
    by terminating session sid: 2287 ospid: 16104
DIA0 successfully terminated session sid:2287 ospid:16104 with status 31.
Sun Jul 19 16:33:13 2015
Sweep [inc2][512084]: completed
Sun Jul 19 16:33:50 2015
WARN: ARC2: Terminating pid 4793 hung on an I/O operation
Sun Jul 19 16:34:13 2015
krsv_proc_kill: Killing 1 processes (Process by index)
ARC2: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_1      #<<<<<<<<<<<<<< 
ARC2: Detected ARCH process failure
ARC2: STARTING ARCH PROCESSES
Sun Jul 19 16:34:15 2015
ARC0 started with pid=34, OS id=3953 
Sun Jul 19 16:34:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc  (incident=512085):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512085/weejar2_dia0_2471_i512085.trc
ARC0: Archival started
ARC2: STARTING ARCH PROCESSES COMPLETE
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance weejar2 - Archival Error
ORA-16014: log 11 sequence# 12803 not archived, no available destinations
ORA-00312: online log 11 thread 2: '/dev/anbob_oravg02/ranbob_redo11'
Sun Jul 19 16:34:17 2015
Sweep [inc][512085]: completed
...
... truncated here
...
DISTRIB TRAN 00000028.716D7979313000000000000000000000000000000000000000000000000000000006D2A955A7E4B002ECC089
  is local tran 414.10.43815 (hex=19e.0a.ab27))
  delete pending collecting tran, scn=14464580264119 (hex=d27.cc2b20b7)
DISTRIB TRAN 00000028.716D7979313000000000000000000000000000000000000000000000000000000006D2A955A7E4B002ECC08E
  is local tran 125.19.296884 (hex=7d.13.487b4))
  delete pending collecting tran, scn=14464580264148 (hex=d27.cc2b20d4)
Sun Jul 19 17:32:17 2015
Thread 2 advanced to log sequence 12808 (LGWR switch)
  Current log# 10 seq# 12808 mem# 0: /dev/anbob_oravg02/ranbob_redo10
Sun Jul 19 17:32:25 2015
Archived Log entry 22618 added for thread 2 sequence 12807 ID 0x17418ab3 dest 1:
Sun Jul 19 17:37:00 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_16512.trc  (incident=512088):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512088/weejar2_dia0_16512_i512088.trc
Sun Jul 19 17:37:02 2015
Sweep [inc][512088]: completed
Sun Jul 19 17:37:02 2015
Sweep [inc2][512088]: completed
System State dumped to trace file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512088/weejar2_m000_10298_i512088_a.trc
DIA0 terminating blocker (ospid: 29778 sid: 9592 ser#: 20163) of hang with ID = 374
    requested by master DIA0 process on instance 1
    Hang Resolution Reason: Automatic hang resolution was performed to free a
    significant number of affected sessions.
    by terminating session sid: 9592 ospid: 29778
DIA0 successfully terminated session sid:9592 ospid:29778 with status 31.
Sun Jul 19 17:48:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_ora_7290.trc  (incident=516793):
ORA-00600: internal error code, arguments: [qerltcUserIterGet_1], [56], [56], [], [], [], [], [], [], [], [], []
ORA-24761: transaction rolled back
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_516793/weejar2_ora_7290_i516793.trc
Sun Jul 19 17:48:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_ora_5983.trc  (incident=525073):
ORA-00600: internal error code, arguments: [qerltcUserIterGet_1], [56], [56], [], [], [], [], [], [], [], [], []
ORA-24761: transaction rolled back

NOTE:
Note that the alert log does show processes hung with ORA-32701 starting at that time, and from the excerpt below we can infer that there was heavy or abnormal I/O load at the moment. Because the message appeared only once in the log, it was not investigated further at that point:
“WARN: ARC2: Terminating pid 4793 hung on an I/O operation
Sun Jul 19 16:34:13 2015
krsv_proc_kill: Killing 1 processes (Process by index)
ARC2: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_1
ARC2: Detected ARCH process failure”

What stood out is that the DIA0 process did terminate processes. DIA0 is the background process of the Hang Manager (HM), a new feature in 11g: it periodically collects process-hang information across the RAC and can resolve some hangs on its own, with its behavior controlled by hidden parameters; see the official documentation for more on HM. The later distributed transaction rollbacks, as well as the ORA-600 [qerltcUserIterGet_1] and ORA-24761 errors, were also knock-on effects of the hung sessions being killed by HM.
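If HM's automatic kills are undesirable while a root cause is still being investigated, its resolution behavior can be narrowed via hidden parameters. A hedged sketch only (hidden parameters must not be changed without Oracle Support's guidance, and the value OFF here is an assumption from support notes, not from this case):

```sql
-- Hedged sketch: restrict HM so it stops terminating processes itself.
-- Hidden parameter; change only under Oracle Support guidance.
alter system set "_hang_resolution_scope"=OFF scope=both sid='*';
```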

# hang trace file

Dump file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_dia0_2471_i512083.trc
..

*** 2015-07-19 16:32:11.494
Resolvable Hangs in the System
                      Root       Chain Total               Hang
   Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
     ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
  ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
    223 HANG RSLNPEND    2  7829     2    19   HIGH  LOCAL Terminate Process
  Hang Resolution Reason: Automatic hang resolution was performed to free a
    significant number of affected sessions.

      inst# SessId  Ser#     OSPID PrcNm Event
      ----- ------ ----- --------- ----- -----
          2    300  2745      5568    FG read by other session
          2   7829  4119      4073    FG db file sequential read

Dumping process info of pid[2767.4073] (sid:7829, ser#:4119)
    requested by master DIA0 process on instance 1.


*** 2015-07-19 16:32:11.494
Process diagnostic dump for oracle@qdanbob2, OS id=4073,
pid: 2767, proc_ser: 21, sid: 7829, sess_ser: 4119
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 16:32:11 ]
    NOTE: scheduling delay has not been sampled for 0.224630 secs
  0.000657 secs from [ 16:32:07 - 16:32:12 ], 5 sec avg
  0.000455 secs from [ 16:31:12 - 16:32:12 ], 1 min avg
  0.000284 secs from [ 16:27:11 - 16:32:12 ], 5 min avg

*** 2015-07-19 16:32:11.907
loadavg : 0.17 0.16 0.17
Swapinfo :
        Avail = 504818.16Mb Used = 170655.53Mb
        Swap free = 334112.25Mb Kernel rsvd = 17269.27Mb
        Free Mem  = 241739.08Mb
  F S      UID   PID  PPID  C PRI NI             ADDR   SZ            WCHAN    STIME TTY       TIME COMD
3401 S     grid  4073     1  0 178 20 e000000feb8d5680 97832 e00000103c925097  Jul 10  ?        13:13 oracleweejar2 (LOCAL=NO)
Short stack dump:
ksedsts()+544<-ksdxfstk()+48<-ksdxcb()+3216<-sspuser()+688<-<-_read_sys()+48<-_read()+224<-$cold_skgfqio()+864<-ksfd_skgfqio()+400<-ksfd_io()+1168<-ksfdread()+336<-kcfrbd1()+1328<-kcbzib()+3040<-kcbgcur()+10400<-ktbgcur()+192<-ktspfpblk()+720<-ktspfsrch()+944<-ktspscan_bmb()+608<-ktspgsp_main()+1520<-kdtgsp()+1248<-kdtgsph()+1440<-kdtgrs()+560<-kdtInsRow()+1584

SQL> show parameter LOG_ARCHIVE_DEST_1

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_1                   string      location=/anbob2_arch

oracle@qdanbob2:/home/oracle> mount  
...
/anbob2_arch on /dev/anbob_arch/fslvanbob_arch2 ioerror=mwdisable,largefiles,delaylog,dev=40120002 on Fri Jun 19 00:48:11 2015
/anbob1_arch on qdanbob1:/anbob1_arch noac,forcedirectio,rsize=32768,wsize=32768,NFSv3,dev=4000003 on Fri Jun 19 00:58:59 2015

SQL>@p hang
NAME                                     VALUE
---------------------------------------- ----------------------------------------
_ksv_pool_hang_kill_to                   0
_hang_analysis_num_call_stacks           3
_local_hang_analysis_interval_secs       3
_hang_msg_checksum_enabled               TRUE
_hang_detection_enabled                  TRUE       #<<<<<<<<<<< HM enable
_hang_detection_interval                 32
_hang_resolution_scope                   PROCESS    #<<<<<<<<<<<< means HM will itself terminate hung (non-background) processes
_hang_resolution_policy                  HIGH
_hang_resolution_confidence_promotion    FALSE
_hang_resolution_global_hang_confidence_ FALSE
...
37 rows selected.

Note:
HM is enabled on this RAC, and HM may kill some non-core processes on its own. The hang trace shows that the root blocker was stuck on an I/O wait, which is what caused the process to hang, and the SQL it was executing was the user's INSERT. That identifies the culprit behind the killed sessions. Also note that the archive destination that reported the error in the db alert log above is a local storage path (not the NFS one).

The next morning around 09:00 the business backlog started again, and it was still the same INSERT as the day before. I checked the SQL's execution volume over the past few days:

                                              Per-Plan Execution Statistics Over Time
                                                                                         Avg                 Avg
      Plan Snapshot                          Avg LIO             Avg PIO          CPU (secs)      Elapsed (secs)
Hash Value Time            Execs            Per Exec            Per Exec            Per Exec            Per Exec
---------- ------------ -------- ------------------- ------------------- ------------------- -------------------
           18-JUL 07:00   81,954               27.60                0.68                0.00                0.01
           18-JUL 08:00  164,767               27.44                0.52                0.00                0.00
           18-JUL 09:00  148,988               27.89                0.57                0.00                0.00
           18-JUL 10:00  152,462               27.89                0.55                0.00                0.00
           18-JUL 11:00  133,345               27.51                0.57                0.00                0.00
           18-JUL 12:00  144,135               27.71                0.57                0.00                0.00
           18-JUL 13:00  126,773               27.66                0.64                0.00                0.00
           18-JUL 14:00  134,799               27.53                0.53                0.00                0.00
           18-JUL 15:00  125,364               27.89                0.57                0.00                0.00
           18-JUL 16:00  129,116               27.79                0.56                0.00                0.00
           18-JUL 17:00  131,873               27.86                0.58                0.00                0.00
           18-JUL 18:00  125,614               27.64                0.55                0.00                0.00
           18-JUL 19:00  126,923               27.77                0.56                0.00                0.00
           18-JUL 20:00  123,668               27.86                0.60                0.00                0.00
           18-JUL 21:00  112,380               27.58                0.52                0.00                0.00
           18-JUL 22:00   92,602               27.59                0.43                0.00                0.00
           18-JUL 23:00   53,483               27.53                0.43                0.00                0.00
           19-JUL 06:00   39,651               27.83                0.69                0.00                0.01
           19-JUL 07:00   75,884               27.77                0.69                0.00                0.01
           19-JUL 08:00  172,000               27.51                0.54                0.00                0.00
           19-JUL 09:00  148,931               27.67                0.60                0.00                0.00
           19-JUL 10:00  153,295               27.64                0.54                0.00                0.00
           19-JUL 11:00  125,487               27.73                0.58                0.00                0.00
           19-JUL 12:00  136,311               27.84                0.59                0.00                0.00
           19-JUL 13:00  114,741               27.52                0.63                0.00                0.00
           19-JUL 14:00  104,638               27.64                0.55                0.00                0.00
           19-JUL 15:00  134,834               27.65                0.57                0.00                0.00
           19-JUL 16:00  124,629               27.72                0.57                0.00                0.39
           19-JUL 17:01  134,321               28.35                0.62                0.00                0.39
           19-JUL 18:00  125,256               27.75                0.56                0.00                0.14
           19-JUL 19:00  126,852               28.61                0.61                0.00                0.32
           19-JUL 20:00  133,179               28.30                0.62                0.00                0.13
           19-JUL 21:00  109,673               28.19                0.59                0.00                0.21
           19-JUL 22:00   84,624               27.73                0.49                0.00                0.07
           19-JUL 23:00   47,534               27.51                0.48                0.00                0.05
           20-JUL 00:00   20,786               28.88                0.60                0.00                1.87
           20-JUL 01:00   20,900               27.69                0.72                0.00                0.50
           20-JUL 02:00    8,399               28.67                0.77                0.00                1.34
           20-JUL 03:00    7,174               28.26                0.75                0.00                0.35
           20-JUL 04:00    7,798               38.87                0.72                0.00                6.86
           20-JUL 05:00   16,600               30.29                0.74                0.00                1.59
           20-JUL 06:00   41,540               28.78                0.72                0.00                1.46
           20-JUL 07:00   77,297               28.04                0.72                0.00                0.72
           20-JUL 08:00  495,309               30.34                0.53                0.00                0.85     #<<<<<<<<<<<<<<<<<<
**********              -------- ------------------- ------------------- ------------------- -------------------
avg                                            26.42                0.50                0.00                0.06
sum                     ########

Note:
From AWR (DBA_HIST_SQLSTAT) we can confirm that between 08:00 and 09:00 the execution count was roughly 300,000 higher than before, and that from the previous day's 16:00 snapshot onward the average elapsed time per execution had grown to just under one second. I asked the storage engineer to verify that the storage was healthy, and the business side confirmed they were running the batch again. At the time I judged it to be a quantitative change turning qualitative and had them stop the batch again. In hindsight the batch merely masked the real cause; it was precisely this heavy INSERT-and-commit workload that made the problem visible.
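The per-hour figures above came from an AWR report script; a minimal sketch of the equivalent query against DBA_HIST_SQLSTAT directly ('&sql_id' is a placeholder to fill in):

```sql
-- Hedged sketch: hourly execution statistics for one SQL from AWR.
-- Delta columns are per-snapshot; elapsed_time_delta is in microseconds.
select to_char(sn.end_interval_time,'DD-MON HH24:MI') snap_time,
       s.executions_delta execs,
       round(s.buffer_gets_delta  / nullif(s.executions_delta,0), 2) avg_lio,
       round(s.disk_reads_delta   / nullif(s.executions_delta,0), 2) avg_pio,
       round(s.elapsed_time_delta / nullif(s.executions_delta,0) / 1e6, 2) avg_ela_secs
  from dba_hist_sqlstat s, dba_hist_snapshot sn
 where s.snap_id         = sn.snap_id
   and s.instance_number = sn.instance_number
   and s.sql_id          = '&sql_id'
 order by sn.end_interval_time;
```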

However, after the batch was stopped shortly after 09:00, another short business backlog appeared at 10:00.

SQL> create table ash0720 tablespace users as select * from  v$active_session_history ;
Table created.

SQL>  select * from (
  2      select etime,nvl(event,'on cpu') events, dbtime, round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime  desc) rn
  3   from (
  4  select substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13)||'0' etime,event,count(*) dbtime
  5   from ash0720 
  6  --where sample_time between to_date('2015-3-18 10:00','yyyy-mm-dd hh24:mi') and to_date('2015-3-18 11:00','yyyy-mm-dd hh24:mi')
  7   group by substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13),event
  8  )
  9  ) where rn<=10;  

ETIME          EVENTS                                             DBTIME        PCT         RN
-------------- ---------------------------------------------- ---------- ---------- ----------
20150720 10:20 log file sync                                       19782         33          1
20150720 10:20 gc buffer busy release                              12219      20.39          2
20150720 10:20 buffer busy waits                                    9621      16.05          3
20150720 10:20 enq: TX - index contention                           7034      11.74          4
20150720 10:20 enq: SQ - contention                                 6303      10.52          5
20150720 10:20 read by other session                                1153       1.92          6
20150720 10:20 db file sequential read                              1061       1.77          7
20150720 10:20 inactive transaction branch                           971       1.62          8
20150720 10:20 enq: TX - row lock contention                         832       1.39          9
20150720 10:20 on cpu                                                219        .37         10
20150720 10:30 enq: TX - index contention                          93536      43.91          1  #<<<<<<<<<<<<<<<
20150720 10:30 db file sequential read                             47243      22.18          2
20150720 10:30 read by other session                               21236       9.97          3
20150720 10:30 log file sync                                       12610       5.92          4
20150720 10:30 gc buffer busy release                               7793       3.66          5
20150720 10:30 buffer busy waits                                    6061       2.85          6
20150720 10:30 enq: SQ - contention                                 5360       2.52          7
20150720 10:30 on cpu                                               5332        2.5          8
20150720 10:30 enq: TX - row lock contention                        4271          2          9
20150720 10:30 inactive transaction branch                          2698       1.27         10
20150720 10:40 read by other session                               26429      40.58          1
20150720 10:40 db file sequential read                             24469      37.57          2
20150720 10:40 on cpu                                               5291       8.12          3
20150720 10:40 db file parallel write                               2136       3.28          4
20150720 10:40 SQL*Net message from dblink                          1162       1.78          5
20150720 10:40 inactive transaction branch                          1097       1.68          6
20150720 10:40 SQL*Net more data from dblink                        1054       1.62          7
20150720 10:40 inactive session                                      698       1.07          8
20150720 10:40 enq: TX - row lock contention                         540        .83          9
20150720 10:40 db file scattered read                                472        .72         10
20150720 10:50 db file sequential read                             24371       50.1          1
20150720 10:50 read by other session                               11212      23.05          2
20150720 10:50 on cpu                                               5499       11.3          3
20150720 10:50 db file parallel write                               1817       3.74          4
20150720 10:50 enq: TX - index contention                           1596       3.28          5
20150720 10:50 SQL*Net more data from dblink                        1191       2.45          6
20150720 10:50 SQL*Net message from dblink                           976       2.01          7
20150720 10:50 inactive transaction branch                           276        .57          8
20150720 10:50 enq: TX - row lock contention                         246        .51          9
20150720 10:50 inactive session                                      196         .4         10
...

Note:
Load pressure was high between 10:30 and 10:40 on 2015-07-20.

  SQL> select * from (
  2    SELECT
  3    h.event "Wait Event",
  4    SUM(h.wait_time + h.time_waited)/1000000 "Total Wait Time"
  5    FROM
  6    ash0720 h,
  7    v$event_name e
  8    WHERE
  9   sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
 10    AND h.event_id = e.event_id
 11    AND e.wait_class <>'Idle'
 12    GROUP BY h.event
 13    ORDER BY 2 DESC)
 14    where rownum <10;

Wait Event                                                       Total Wait Time
---------------------------------------------------------------- ---------------
enq: TX - index contention                                            78846.1543
log file sync                                                         56877.8177
db file sequential read                                               43827.3883
read by other session                                                 23515.1048
enq: SQ - contention                                                  10085.8243
gc buffer busy release                                                9988.80736
buffer busy waits                                                     6469.06895
enq: TX - row lock contention                                         4129.69438
inactive transaction branch                                           2714.48481

9 rows selected.

Note:
The top events still look INSERT-related: enq: SQ - contention, enq: TX - index contention, log file sync, buffer busy waits, plus two clearly I/O-bound events. 'gc buffer busy release' is usually associated with slow log file writes or with network problems. Of course, this is still a subjective judgment based on experience; we need data to analyze the problem.

Next, let's trace the wait chain starting from 'enq: TX - index contention'.

SQL>   select sql_id,count(*)
  2     from ash0720 
  3    where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
  4        and event='enq: TX - index contention' group by sql_id order by 2 desc;

SQL_ID                 COUNT(*)
-------------------- ----------
b1kh15rh5d2h9             33113
d71tkyh0b2q1y             21026
8nhv049t5zzh6             12854
2p8uq2t4b5v6r              8554
29rwby2p0367t              2786
...

SQL> select  sql_id,blocking_session,count(*)
  2        from ash0720 
  3       where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
  4           and event='enq: TX - index contention'
  5           and sql_id in ('b1kh15rh5d2h9','d71tkyh0b2q1y','8nhv049t5zzh6')
  6           group by sql_id,blocking_session
  7           order by 1,3 desc;

SQL_ID               BLOCKING_SESSION   COUNT(*)
-------------------- ---------------- ----------
8nhv049t5zzh6                    7881      12736
8nhv049t5zzh6                   10634         62
8nhv049t5zzh6                                 53
8nhv049t5zzh6                    3026          3
b1kh15rh5d2h9                     532      18733
b1kh15rh5d2h9                    8317       8996
b1kh15rh5d2h9                    8801       5148
b1kh15rh5d2h9                    9340        224
b1kh15rh5d2h9                                  7
b1kh15rh5d2h9                    8621          3
b1kh15rh5d2h9                    6628          1
b1kh15rh5d2h9                    6817          1
d71tkyh0b2q1y                    6538      15560
d71tkyh0b2q1y                   11553       5372
d71tkyh0b2q1y                    1019         85
d71tkyh0b2q1y                                  9

16 rows selected.


select  session_id,blocking_session,count(*) cnt
        from ash0720 
       where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
           and event='enq: TX - index contention'
           and sql_id in ('b1kh15rh5d2h9','d71tkyh0b2q1y','8nhv049t5zzh6')
           and session_id in (7881,532,8317,6538,11553) group by session_id,blocking_session
           

SESSION_ID BLOCKING_SESSION        CNT
---------- ---------------- ----------
       532             8801         32
      8317              532          5
     11553             6538         53
     
select  session_id,blocking_session,count(*) cnt
        from ash0720 
       where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
           and session_id in (8801,6538) group by session_id,blocking_session
           
SESSION_ID BLOCKING_SESSION        CNT
---------- ---------------- ----------
      6538                         233
      6538             6276         32
      8801             6276         32
      
select  session_id,blocking_session,count(*) cnt
        from ash0720 
       where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
           and session_id in (6276,6538) group by session_id,blocking_session
           

SESSION_ID BLOCKING_SESSION        CNT
---------- ---------------- ----------
      6276                         234   
      6538                         233
      6538             6276         32

This confirms that the root of the blocking chain is session 6276. So what process was session 6276 at the time of the problem?
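The chain above was walked step by step; the same traversal can be sketched with a hierarchical query over the ASH copy (sample-based, so the chain is approximate and NOCYCLE guards against sampling loops):

```sql
-- Hedged sketch: follow blocking_session from the waiters on the
-- contended event up toward the root blocker in one pass.
select level lvl, session_id, blocking_session, event
  from (select distinct session_id, blocking_session, event
          from ash0720
         where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi')
                               and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi'))
 start with event = 'enq: TX - index contention'
connect by nocycle session_id = prior blocking_session;
```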

SQL> l
  1  select  machine,program,event,wait_time ,time_waited,sql_id,SESSION_STATE,CURRENT_OBJ#
  2          from ash0720
  3  where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
  4* and session_id in (6276)
SQL> /

MACHINE      PROGRAM                          EVENT                      WAIT_TIME TIME_WAITED SQL_ID        SESSION CURRENT_OBJ#
---------- -------------------------------- ------------------------- ---------- ----------- ------------- ------- ------------
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        1487               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        1367               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        4800               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        2006               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        1198               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        2812               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        3939               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        5072               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        6973               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        6783               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        1225               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        3826               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        3154               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)                                             5070           0               ON CPU            -1
qdanbob2     oracle@qdanbob2 (LGWR)                                            21949           0               ON CPU            -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        9892               WAITING           -1

qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0       28703               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0      436821               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        3153               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        1487               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0      205371               WAITING           -1

qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        5429               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0        5180               WAITING           -1
qdanbob2     oracle@qdanbob2 (LGWR)             log file parallel write            0   134327524               WAITING           -1
...
234 rows selected.

Note:
Truncated here because the full output is long. The root blocker is the LGWR background process waiting on 'log file parallel write', with very large wait times. 'log file parallel write' is posted after a transaction commit, while LGWR waits for the disk write of redo to be acknowledged. Recalling the archive-log write I/O error from the previous evening, I suspected slow I/O. I had notified the storage team in the morning to check the storage; since I did not have the root password until noon, after logging in as root I confirmed storage-layer errors in the OS log:
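Before going to the OS log, LGWR's write latency can also be quantified from the event histogram; a minimal sketch (on a healthy system most waits fall into the low-millisecond buckets, and a long tail into hundreds of milliseconds or seconds points at the I/O path):

```sql
-- Hedged sketch: latency distribution of LGWR's redo write event.
select inst_id, wait_time_milli, wait_count
  from gv$event_histogram
 where event = 'log file parallel write'
 order by inst_id, wait_time_milli;
```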

#OS log /var/adm/message

Jul 19 01:05:01 qdanbob2 vmunix:
Jul 19 01:05:03 qdanbob2  above message repeats 6 times
Jul 19 01:05:03 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064320150719010503, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064420150719010503, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064520150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064620150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064720150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 05:28:35 qdanbob2 sshd[29499]: error: PAM: No account present for user for illegal user aqzhdg from 133.96.102.206
Jul 19 05:28:37 qdanbob2 sshd[29499]: error: PAM: No account present for user for illegal user aqzhdg from 133.96.102.206
Jul 19 16:25:15 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:15 qdanbob2 vmunix:    The diagnostic logging facility has started receiving excessive
Jul 19 16:25:15 qdanbob2 vmunix:    errors.  The error entries will be lost until the cause of
Jul 19 16:25:15 qdanbob2 vmunix:    the excessive error logging is corrected.
Jul 19 16:25:15 qdanbob2 vmunix:    Use oserrlogd(1M) man page for further details.
Jul 19 16:25:20 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:20 qdanbob2 vmunix:    The diagnostic logging facility is no longer receiving excessive
Jul 19 16:25:20 qdanbob2 vmunix:    errors .  9  error entries were lost.
Jul 19 16:25:23 qdanbob2 vmunix:    The diagnostic logging facility has started receiving excessive
Jul 19 16:25:23 qdanbob2 vmunix:    errors.  The error entries will be lost until the cause of
Jul 19 16:25:23 qdanbob2 vmunix:    the excessive error logging is corrected.
Jul 19 16:25:23 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:23 qdanbob2 vmunix:    Use oserrlogd(1M) man page for further details.
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: Path lunpath413 to 000290102087 is dead.Killing bus 6 to Symmetrix     000290102087 port 8aA.Path lunpath320 to 000290102087 is dead.Path lunpath327 to 0
00290102087 is dead.Path lunpath325 to 000290102087 is dead.Path lunpath326 to 000290102087 is dead.Pat
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: h lunpath338 to 000290102087 is dead.Path lunpath337 to 000290102087 is dead.Path lunpath330 to 000290102087 is dead.Path lunpath345 to 000290102087 is d
ead.Path lunpath332 to 000290102087 is dead.Path lunpath351 to 000290102087 is dead.Path lunpath339 to
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: 000290102087 is dead.Path lunpath329 to 000290102087 is dead.Path lunpath323 to 000290102087 is dead.Path lunpath356 to 000290102087 is dead.Path lunpath
328 to 000290102087 is dead.Path lunpath331 to 000290102087 is dead.Path lunpath321 to 000290102087 is
...

407 to 000290102087 is dead.Path lunpath402 to 000290102087 is dead.Path lunpath311 to 000290102087 is
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: dead.Path lunpath410 to 000290102087 is dead.Path lunpath408 to 000290102087 is dead.Path lunpath400 to 000290102087 is dead.Path lunpath409 to 000290102
087 is dead.Path lunpath310 to 000290102087 is dead.Path lunpath419 to 000290102087 is dead.Path lunpat
...

Jul 20 14:24:50 qdanbob2 vmunix: class : disk, instance 2004
Jul 20 14:24:50 qdanbob2 vmunix: All available lunpaths of a LUN (dev=0xd000120)
Jul 20 14:24:50 qdanbob2 vmunix: have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix: condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix: 1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:50 qdanbob2 vmunix: class : disk, instance 2005
Jul 20 14:24:50 qdanbob2 vmunix: All available lunpaths of a LUN (dev=0xcb250200)
Jul 20 14:24:50 qdanbob2 vmunix: have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix: condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix: 1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:50 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 20 14:24:50 qdanbob2 vmunix:    The diagnostic logging facility has started receiving excessive
Jul 20 14:24:50 qdanbob2 vmunix:    errors.  The error entries will be lost until the cause of
Jul 20 14:24:50 qdanbob2 vmunix:    the excessive error logging is corrected.
Jul 20 14:24:50 qdanbob2 vmunix:    Use oserrlogd(1M) man page for further details.
Jul 20 14:24:50 qdanbob2 vmunix: class : disk, instance 2006
Jul 20 14:24:50 qdanbob2 vmunix: All available lunpaths of a LUN (dev=0xcb250300)
Jul 20 14:24:50 qdanbob2 vmunix: have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix: condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix: 1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:51 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 20 14:25:01 qdanbob2 vmunix: class : tgtpath, instance 6
Jul 20 14:25:01 qdanbob2 vmunix: Target path (class=tgtpath, instance=6) has gone online.  The target path h/w path is 41/0/2/2/0/0/0.0x5006048452a6d1c7
Jul 20 14:25:01 qdanbob2 vmunix:
Jul 20 14:25:01 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 20 14:25:01 qdanbob2 vmunix:    The diagnostic logging facility is no longer receiving excessive
Jul 20 14:25:01 qdanbob2 vmunix:    errors .  2  error entries were lost.
Jul 20 14:28:42 qdanbob2 vmunix: emcp:Mpx:Info:
Jul 20 14:30:37 qdanbob2 vmunix: emcp:Mpx:Info: h374 to 000290102087 is dead.Path lunpath248 to 000290102087 is dead.Path lunpath369 to 000290102087 is dead.Path lunpath279 to 000290102087 is alive.Path
 lunpath279 to 000290102087 is dead.Path lunpath409 to 000290102087 is alive.Path lunpath369 to 000290
Jul 20 14:34:06 qdanbob2 vmunix: emcp:Mpx:Info:
Jul 20 14:35:11 qdanbob2 vmunix: class : lunpath, instance 374
Jul 20 14:35:11 qdanbob2 vmunix: lun path (class = lunpath, instance = 374) belonging to LUN (default minor = 0xec) has gone offline.  The lunpath hwpath is 41/0/2/2/0/0/0.0x5006048452a6d1c7.0x411200000
0000000
Jul 20 14:35:11 qdanbob2 vmunix:
Jul 20 14:35:50 qdanbob2 vmunix: class : lunpath, instance 374
Jul 20 14:35:50 qdanbob2 vmunix: lun path (class = lunpath, instance = 374) belonging to LUN (default minor = 0xec) has come online
Jul 20 14:35:50 qdanbob2 vmunix:
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error: 102087 is alive.Path lunpath243 to 000290102087 is alive.Path lunpath243 to 000290102087 is dead.Killing bus 6 to Symmetrix     000290102087 port 8aA.Pat
h lunpath409 to 000290102087 is dead.Path lunpath369 to 000290102087 is dead.Path lunpath288 to 0002901
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error:

This confirmed that the storage had been having problems since after 16:20 the previous day, with instability appearing intermittently. At this point the storage team also confirmed that one of the storage paths was faulty.

qdanbob2[/var/adm]#powermt display
Symmetrix logical device count=202
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters ---------  ------ I/O Paths -----  ------ Stats ------
###  HW Path                       Summary   Total   Dead  IO/Sec Q-IOs Errors
==============================================================================
   0 41/0/0/2/0/0/0                optimal     194      0       -     2      0
   6 41/0/2/2/0/0/0                degraded     16     12       -     0     10  <<<<<
   8 44/0/0/2/0/0/0                optimal     194      0       -     3      0
  42 44/0/2/2/0/0/0                optimal     194      0       -     1      0

After disabling storage path #6, the database returned to normal and I/O latency dropped noticeably, so the root cause was finally pinned on the storage. Because the problem was intermittent and AWR snapshots are only taken at one-hour granularity, I did not look at AWR at first; later I found that the I/O average wait times in the AWR Top Events section were indeed quite high, and the physical write + physical read IOPS figures can also serve as a reference.

qdanbob2[/var/adm]#powermt display 
Symmetrix logical device count=202
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters ---------  ------ I/O Paths -----  ------ Stats ------
###  HW Path                       Summary   Total   Dead  IO/Sec Q-IOs Errors
==============================================================================
   0 41/0/0/2/0/0/0                optimal     194      0       -     2      0
   6 41/0/2/2/0/0/0                failed       16     16       -     0      0   <<<<<<<<
   8 44/0/0/2/0/0/0                optimal     194      0       -     2      0
  42 44/0/2/2/0/0/0                optimal     194      0       -     1      0        

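The `powermt display` HBA summary above can also be scanned mechanically. Below is a minimal shell/awk sketch that flags any adapter whose Summary column is not `optimal`; the sample lines are embedded here for illustration, but in practice you would pipe the live `powermt display` output through the same filter.

```shell
#!/bin/sh
# Flag any HBA whose Summary column is not "optimal" in `powermt display` output.
# Sample HBA table embedded for illustration; normally: powermt display | awk ...
powermt_output='   0 41/0/0/2/0/0/0                optimal     194      0       -     2      0
   6 41/0/2/2/0/0/0                degraded     16     12       -     0     10
   8 44/0/0/2/0/0/0                optimal     194      0       -     3      0
  42 44/0/2/2/0/0/0                optimal     194      0       -     1      0'

# Columns: $1=HBA#, $2=HW Path, $3=Summary, $4=Total paths, $5=Dead paths
echo "$powermt_output" | awk '$3 != "optimal" {
    print "HBA " $1 " (" $2 ") is " $3 " with " $5 " dead paths"
}'
# prints: HBA 6 (41/0/2/2/0/0/0) is degraded with 12 dead paths
```

A filter like this can be dropped into a cron job so a degraded path is noticed before it starts to hurt database I/O.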
Summary:
The whole picture is now clear: starting from about 16:30 the previous day, the storage path began failing, intermittently. When the batch job was bulk-inserting rows with a commit per row, the storage problem made the database's I/O-related wait events very prominent, e.g. log file sync, index block splits, gc buffer busy, read by other session, etc. After disabling the one bad path out of the four storage paths, everything returned to normal. As a follow-up, monitoring of the storage layer needs to be strengthened.
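For the AWR check mentioned above, the hourly Top Events numbers can also be trended directly from the AWR history views instead of reading one report at a time. The following is a query sketch (assuming the Diagnostics Pack license for the `dba_hist_*` views) that computes the average "log file sync" wait per snapshot interval, which would have shown the latency spike between snapshots:

```sql
-- Sketch: per-snapshot average "log file sync" wait in ms (11.2 dba_hist views).
-- Assumes AWR/Diagnostics Pack is licensed; instance_number = 2 is this case's node.
SELECT s.snap_id,
       s.end_interval_time,
       ROUND((e.time_waited_micro
              - LAG(e.time_waited_micro) OVER (ORDER BY s.snap_id)) / 1000
             / NULLIF(e.total_waits
              - LAG(e.total_waits) OVER (ORDER BY s.snap_id), 0), 2) AS avg_wait_ms
FROM   dba_hist_system_event e
JOIN   dba_hist_snapshot     s
       ON s.snap_id = e.snap_id
      AND s.instance_number = e.instance_number
WHERE  e.event_name      = 'log file sync'
AND    e.instance_number = 2
ORDER  BY s.snap_id;
```

Because the cumulative counters are differenced with `LAG`, each row shows the average latency for that one-hour interval rather than since instance startup, so an intermittent storage problem stands out as an isolated spike.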
