How to release still “killed“ status session in v$session? (释放killed的session) (三)
How to release still “killed“ status session in v$session?(释放killed的session) (一)
How to release still “killed“ status session in v$session? (释放killed的session) (二)
昨天数据库有3个session记录在v$session一直是KILLED状态,且已经好几个小时,当前的等待事件为”db file sequential read”, 其中有一个session是DML SQL,所以最后在事务结束前的行级锁一直未能释放,影响了操作相同记录的业务, 从数据库尝试kill [immeidate]没报错但是无法清理,手动weekup PMON检查PMON trace无报错, 通过操作系统层KILL -9 无报错仍无法清理,环境是11.2.0.3 2-NODES RAC ON AIX, 记录一下分析思路。
SQL>@ase USERNAME SID SEQ# EVENT MACHINE MODULE STATUS LAST_CALL_ET SQL_ID WAI_SECINW ---------- ------ ---------- -------------------- ---------- ----------- -------- ------------ --------------- ---------- ... ANBOB 3510 28110 gc buffer busy acqui kdbxxxx SQL*xxxxxxx ACTIVE 3458 3dpbdtsg82768 0:0 ANBOB 6024 24828 enq: TX - row lock c kycxxxx evenxxxxxxx ACTIVE 9761 drdanx3hhzwc6 0:9760 ANBOB 6121 1224 read by other sessio kycxxxx evenxxxxxxx ACTIVE 9767 drdanx3hhzwc6 0:9762 ANBOB 7180 1768 read by other sessio qmzxxxx AccSxxxxxxx ACTIVE 29170 bazvmr4hmyprb 0:28806 ANBOB 5276 19993 db file sequential r kycxxxx evenxxxxxxx KILLED 29183 drdanx3hhzwc6 0:29183 ANBOB 2284 38595 db file sequential r kycxxxx evenxxxxxxx KILLED 29183 3yv2fann81bq7 0:29183 ANBOB 6959 26745 db file sequential r qmzxxxx AccSxxxxxxx KILLED 29184 cntmwhgmv5wuw 0:29183 ... [oracle@weejar1:/home/oracle]$ kill -9 39518840 [oracle@weejar1:/home/oracle]$ ps -ef|grep 39518840 oracle 44892446 59769246 0 15:33:54 pts/9 0:00 grep 39518840 grid 39518840 1 0 Jun 07 - 199:30 oracleanbob1 (LOCAL=NO) [oracle@weejar1:/home/oracle]$ truss -p pid -- hang procestack pid -- hang oradebug setorapid xx oradebug errorstack -- hang oracebug short_stack -- hang SQL> oradebug setmypid; Statement processed. SQL> oradebug unlimit; Statement processed. SQL> oradebug dump systemstate 266; Statement processed. SQL> oradebug tracefile_name /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc [oracle@weejar1:/home/oracle]$ grep -n 39518840 /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc 5891344: O/S info: user: grid, term: UNKNOWN, ospid: 39518840 5891345: OSD pid info: Unix process pid: 39518840, image: oracle@weejar1 5891348: Short stack dump: ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 39518840, image: oracle@weejar1'; prior command execution time exceeds 30000 ms 5899983: owner: 700000db18e19c0 - ospid: 39518840 5900005: pid : 39518840 [oracle@weejar1:/home/oracle]$ sed -n 5891330,5891364p /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc (process) Oracle pid:2474, ser:178, calls cur/top: 0x700000d43d97f58/0x700000be07a5dd0 flags : (0x0) - flags2: (0x0), flags3: (0x10) intr error: 24761, call error: 0, sess error: 0, txn error 0 intr queue: 24761 2147483676 2147483676 2147483676 2147483676 2147483676 2147483676 ksudlp FALSE at location: 0 (post info) last post received: 0 0 178 last post received-location: kcl2.h LINE:4340 ID:kclcget last process to post me: 700000dc96f2df8 1 6 last post sent: 0 0 159 last post sent-location: kcb2.h LINE:4139 ID:kcbzww last process posted by me: 700000db9866cc8 93 0 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 0x700000dd9a86230 O/S info: user: grid, term: UNKNOWN, ospid: 39518840 OSD pid info: Unix process pid: 39518840, image: oracle@weejar1 *** 2017-06-13 17:08:33.300 Short stack dump: ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 39518840, image: oracle@weejar1'; prior command execution time exceeds 30000 ms ---------------------------------------- SO: 0x700000d7912cdc8, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0 (FOB) 700000d7912cdc8 flags=2050 fib=700000da3a2c340 incno=0 pending i/o cnt=0 fname=/dev/rzwb_lv15_104 fno=377 lblksz=16384 fsiz=983039 ---------------------------------------- SO: 0x700000d7b9b2ea0, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0 (FOB) 700000d7b9b2ea0 flags=2050 fib=700000da39b38b0 incno=0 pending i/o cnt=0 fname=/dev/rzwb_lv15_106 fno=61 lblksz=16384 fsiz=983039 ---------------------------------------- SO: 0x700000d7c53d2b8, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0 (FOB) 700000d7c53d2b8 flags=2050 fib=700000da39dfd60 incno=0 pending i/o cnt=0
Note:
尝试trace进程调用均hang. 下一步使用PS检查进程的STAT列,如果进程状态为D,并且如strace,truss, tuss等工具无输出时,我们是无法kill或恢复进程,表未进程在wait内核调用,KILL 命令无权终止,只能重启操作系统恢复。通常数据库出现该问题时,如果系统资源充足时可能是I/O请求挂起引起, 会一直等待下去。
$ ps aux | grep ” D”
SQL> @sw "select sid from v$session where status='KILLED'" SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 ------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- ----------------- 6959 WAITING db file sequential read 26745 39163 file#= 621 block#= 266395 blocks= 1 2284 WAITING db file sequential read 38595 39163 file#= 655 block#= 569108 blocks= 1 5276 WAITING db file sequential read 19993 39163 file#= 764 block#= 1948676 blocks= 1 SQL> @sw "select sid from v$session where status='KILLED'" SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 ------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- ----------------- 6959 WAITING db file sequential read 26745 39165 file#= 621 block#= 266395 blocks= 1 2284 WAITING db file sequential read 38595 39165 file#= 655 block#= 569108 blocks= 1 5276 WAITING db file sequential read 19993 39165 file#= 764 block#= 1948676 blocks= 1 SQL> @sw "select sid from v$session where status='KILLED'" SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 ------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- ----------------- 6959 WAITING db file sequential read 26745 39166 file#= 621 block#= 266395 blocks= 1 2284 WAITING db file sequential read 38595 39166 file#= 655 block#= 569108 blocks= 1 5276 WAITING db file sequential read 19993 39166 file#= 764 block#= 1948676 blocks= 1 SQL> select name from v$datafile where file#=621; NAME -------------------------------------------------------------- /dev/rzwb_lv15_330 SQL> select name from v$datafile where file#=655; NAME -------------------------------------------------------------- /dev/rzwb_lv30_411 SQL> select name from v$datafile where file#=764; NAME -------------------------------------------------------------- /dev/rzwb_lv30_525 [oracle@weejar1:/home/oracle]$ lslv zwb_lv15_330 0516-1201 lslv: Warning: Volume group zwb_datavg09 is locked. This command will continue retries until lock is free. If lock is inadvertent and needs to be removed, execute 'chvg -u zwb_datavg09'. [oracle@weejar1:/home/oracle]$ errpt IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION 4B436A3D 0613172517 T H fscsi1 LINK ERROR 4B436A3D 0613172517 T H fscsi1 LINK ERROR 4B436A3D 0613172417 T H fscsi1 LINK ERROR 4B436A3D 0613172417 T H fscsi1 LINK ERROR 4B436A3D 0613172417 T H fscsi1 LINK ERROR 4B436A3D 0613172317 T H fscsi1 LINK ERROR
Note:
从数据库等待事件的数据文件查起,确认了该LV及所在的VG已被锁定,同步有4个VG受到影响, 第二个节点存储(11个VG)正常。 从节点一主机日志及系统确认该主机存储目前有4条存储链路, 其中1条存储链路故障(sicsi1 hba卡,且使用POWER存储命令查看该链路有3个读写I/O请求未完成)引起读写异常进而引发VG被锁,进程处于等待I/O资源的核心态而不是用户态,不能接受任何外部信号(包括KILL)。
当然这种情况下最安全的方法是重启主机, 当然这个情况数据库需要ABORT才能停下来,且DB INSTANCE和CRS LISTENER停掉后,KILLED状态的OS进程(LOCAL=NO)还一直存在,重启操作系统后,链路和数据库恢复正常。
对不起,这篇文章暂时关闭评论。