
12c R2 RAC instance crash due to CKPT is hung, using Multipathing

A while ago, a friend's 12c R2 three-node RAC on Linux kept crashing intermittently (roughly once every few days). The environment used the Linux native multipath software with its default configuration.

— alert log

2017-11-21T21:52:48.474246+08:00
Thread 3 advanced to log sequence 26159 (LGWR switch)
  Current log# 17 seq# 26159 mem# 0: +DATADG/anbob/ONLINELOG/group_17
2017-11-21T21:55:42.614168+08:00
IMR0 (ospid: 53353) waits for event 'control file sequential read' for 58 secs.
2017-11-21T21:55:42.614240+08:00
IMR0 (ospid: 53353) is hung in an acceptable location (cfio 0x11.00).
2017-11-21T21:55:55.374988+08:00
IMR0 (ospid: 53353) waits for event 'control file sequential read' for 81 secs.
2017-11-21T21:55:58.554970+08:00
IMR has experienced some problems and can't re-enable, retry count 1
2017-11-21T21:56:20.580385+08:00
CKPT (ospid: 53295) waits for event 'control file parallel write' for 95 secs.
2017-11-21T21:56:22.776564+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_lmhb_53275.trc  (incident=824802) (PDBNAME=CDB$ROOT):
ORA-29770: global enqueue process CKPT (OSID 53295) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob3/incident/incdir_824802/anbob3_lmhb_53275_i824802.trc
2017-11-21T21:56:25.265258+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_anbob by memno 0 
ERROR: Some process(s) is not making progress.
LMHB (ospid: 53275) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 53275): terminating the instance due to error 29770
2017-11-21T21:56:25.763179+08:00
System state dump requested by (instance=3, osid=53275 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_diag_53235_20171121215625.trc
2017-11-21T21:56:27.521552+08:00
License high water mark = 172
2017-11-21T21:56:31.945109+08:00
Instance terminated by LMHB, pid = 53275
2017-11-21T21:56:31.945889+08:00
Warning: 2 processes are still attach to shmid 458758:
 (size: 24576 bytes, creator pid: 53153, last attach/detach pid: 53235)
2017-11-21T21:56:32.530103+08:00
USER (ospid: 34072): terminating the instance
2017-11-21T21:56:32.535249+08:00
Instance terminated by USER, pid = 34072

— lmhb trace anbob3_lmhb_53275_i824802.trc

----- Beginning of Customized Incident Dump(s) -----
==================================================
=== CKPT (ospid: 53295) Heartbeat Report
==================================================
CKPT (ospid: 53295) has no heartbeats for 103 sec. (threshold 70) 
  : heartbeat state 0x11.00 (cfio) pso-flag 0x100
  : waiting for event 'control file parallel write' for 97 secs with wait_id 4255818.
===[ Wait Chain ]===
Wait chain is empty.
=[SNAPSHOT SUMMARY]=
Process moved according to collected snapshots
===[NO-HB PROCESS CALLSTACK]===
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-io_queue_run()+100<-skgfrwat()+196
<-ksfdwtio()+306<-ksfdwat_internal()+234<-kfk_reap_ufs_async_io()+166<-kfk_reap_ios_from_subsys()+1000<-kfk_reap_ios()
+409<-kfk_io1()+1449<-kfk_transitIO()+451<-kfioReapIO()+633<-kfioRequestPriv()+216<-kfioRequest()+337<-ksfdafRequest()
+742<-ksfdafWait()+371<-ksfdwtio()+773<-ksfdwat_internal()+234<-ksfdblock()+325<-ksfdbio()+2910<-ksfdbio()+3035<-kccwbp()
+1035<-kccbmp_wdl()+257<-kccfhb()+144<-kcccmt()+244<-kccecx()+80<-kctmttrest()+303<-kctrcp()+1942<-kcvcca()+189<-ksbcti()
+247<-ksbabs()+11524<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()
+262<-__libc_start_main()+253
==============================
==============================
Dumping PROCESS CKPT (ospid: 53295) States
==============================
===[ System Load State ]===
  CPU Total 56 Raw 56 Core 28 Socket 2
  Load normal: Cur 1144 Highmark 71680 (4.46 280.00)
===[ Latch State ]===
  Not in Latch Get
===[ Session State Object ]===
  ----------------------------------------
  SO: 0x23251ba70, type: 4, owner: 0x1f1d4c328, flag: INIT/-/-/-/0x00 if: 0x3 c: 0x3
   proc=0x1f1d4c328, name=session, file=ksu.h LINE:15737, pg=0, conuid=1
   SGA version=(1,0)
  (session) sid: 4720 ser: 27225 trans: (nil), creator: 0x1f1d4c328
            flags: (0x51) USR/- flags2: (0x409) -/-/INC
            flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
            DID: 0003-0027-0000000A0000-0000-00000000, short-term DID: 
            txn branch: (nil)
            con_id/con_uid/con_name: 1/1/CDB$ROOT
            con_logonuid: 1 con_logonid: 1
            con_scuid: 1 con_scid: 1
            edition#: 0            user#/name: 0/SYS
            oct: 0, prv: 0, sql: (nil), psql: (nil)
            stats: 0x1ffa5a0b0, PX stats: 0x1101de44
  service name: SYS$BACKGROUND
  Current Wait Stack:
   0: waiting for 'control file parallel write'
      files=0x1, block#=0x27, requests=0x1
      wait_id=4255818 seq_num=61824 snap_id=1
      wait times: snap=1 min 37 sec, exc=1 min 37 sec, total=1 min 37 sec
      wait times: max=infinite, heur=1 min 37 sec
      wait counts: calls=0 os=0
      in_wait=1 iflags=0x5a0
  There are 4 sessions blocked by this session.
  Dumping one waiter:
    inst: 3, sid: 3389, ser: 36254
    wait event: 'enq: CF - contention'
      p1: 'name|mode'=0x43460004
      p2: '0'=0x0
      p3: 'operation'=0x0
    row_wait_obj#: 213, block#: 1969, row#: 0, file# 1
    min_blocked_time: 92 secs, waiter_cache_ver: 59413
  Wait State:
    fixed_waits=0 flags=0x22 boundary=(nil)/-1
  Session Wait History:
      elapsed time of 0.000027 sec since current wait
   0: waited for 'control file parallel write'
      files=0x1, block#=0x29, requests=0x1
      wait_id=4255817 seq_num=61823 snap_id=1
      wait times: snap=0.000462 sec, exc=0.000462 sec, total=0.000462 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000029 sec of elapsed time
   1: waited for 'control file parallel write'
      files=0x1, block#=0x372, requests=0x1
      wait_id=4255816 seq_num=61822 snap_id=1
      wait times: snap=0.000589 sec, exc=0.000589 sec, total=0.000589 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000051 sec of elapsed time
   2: waited for 'control file sequential read'
      file#=0x0, block#=0x371, blocks=0x1
      wait_id=4255815 seq_num=61821 snap_id=1
      wait times: snap=0.000279 sec, exc=0.000279 sec, total=0.000279 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000010 sec of elapsed time
   3: waited for 'control file sequential read'
      file#=0x0, block#=0x2a, blocks=0x1
      wait_id=4255814 seq_num=61820 snap_id=1
      wait times: snap=0.000272 sec, exc=0.000272 sec, total=0.000272 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000004 sec of elapsed time
   4: waited for 'control file sequential read'
      file#=0x0, block#=0x28, blocks=0x1
      wait_id=4255813 seq_num=61819 snap_id=1
      wait times: snap=0.000280 sec, exc=0.000280 sec, total=0.000280 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000009 sec of elapsed time
   5: waited for 'control file sequential read'
      file#=0x0, block#=0x1, blocks=0x1
      wait_id=4255812 seq_num=61818 snap_id=1
      wait times: snap=0.000319 sec, exc=0.000319 sec, total=0.000319 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000070 sec of elapsed time
   6: waited for 'enq: CF - contention'
      name|mode=0x43460005, 0=0x0, operation=0x0
      wait_id=4255811 seq_num=61817 snap_id=1
      wait times: snap=0.000257 sec, exc=0.000257 sec, total=0.000257 sec
      wait times: max=0.000000 sec
      wait counts: calls=3 os=3
      occurred after 0.000800 sec of elapsed time
   7: waited for 'rdbms ipc message'
      timeout=0x12c, =0x0, =0x0
      wait_id=4255810 seq_num=61816 snap_id=1
      wait times: snap=3.002193 sec, exc=3.002193 sec, total=3.002193 sec
      wait times: max=3.000000 sec
      wait counts: calls=1 os=1
      occurred after 0.000124 sec of elapsed time
   8: waited for 'control file sequential read'
      file#=0x0, block#=0x34a, blocks=0x1
      wait_id=4255809 seq_num=61815 snap_id=1
      wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000005 sec of elapsed time
   9: waited for 'control file sequential read'
      file#=0x0, block#=0x2a, blocks=0x1
      wait_id=4255808 seq_num=61814 snap_id=1
      wait times: snap=0.000341 sec, exc=0.000341 sec, total=0.000341 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000006 sec of elapsed time
	  
	  ...
	  ...
*** 2017-11-21T21:56:24.918643+08:00
loadavg : 4.47 2.09 1.21
System user time: 0.00 sys time: 0.00 context switch: 19791
Memory (Avail / Total) = 107148.16M / 257929.37M
Swap (Avail / Total) = 32768.00M /  32768.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle   53295     1  0  80   0 - 19975755 read_e Nov13 ?     00:06:16 ora_ckpt_anbob3
Name:	ora_ckpt_anbob3
State:	S (sleeping)
Tgid:	53295
Pid:	53295
PPid:	1
TracerPid:	0
Uid:	602	602	602	602
Gid:	601	604	604	604
Utrace:	0
FDSize:	512
Groups:	601 602 603 604 606 
VmPeak:	79907024 kB
VmSize:	79903020 kB
VmLck:	       0 kB
VmHWM:	 3446908 kB
VmRSS:	 3438240 kB
VmData:	   25352 kB
VmStk:	     164 kB
VmExe:	  342688 kB
VmLib:	   30328 kB
VmPTE:	   14816 kB
VmSwap:	       0 kB
Threads:	1
SigQ:	1/1031607
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000016400207
SigCgt:	00000003c9887cf8
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	ffffffffffffffff
Cpus_allowed:	ffffff,ffffffff
Cpus_allowed_list:	0-55
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:	0-1
voluntary_ctxt_switches:	3914186
nonvoluntary_ctxt_switches:	43390
Short stack dump: 
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-io_queue_run()+100<-skgfrwat()+196<-ksfdwtio()
+306<-ksfdwat_internal()+234<-kfk_reap_ufs_async_io()+166<-kfk_reap_ios_from_subsys()+1000<-kfk_reap_ios()+409<-kfk_io1()
+1449<-kfk_transitIO()+451<-kfioReapIO()+633<-kfioRequestPriv()+216<-kfioRequest()+337<-ksfdafRequest()+742<-ksfdafWait()
+371<-ksfdwtio()+773<-ksfdwat_internal()+234<-ksfdblock()+325<-ksfdbio()+2910<-ksfdbio()+3035<-kccwbp()+1035<-kccbmp_wdl()
+257<-kccfhb()+144<-kcccmt()+244<-kccecx()+80<-kctmttrest()+303<-kctrcp()+1942<-kcvcca()+189<-ksbcti()+247<-ksbabs()+11524
<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+253
 

Note:
From the alert log, the CKPT process was hung for more than 70 seconds, so the LMHB process terminated the instance. The call stack in the CKPT trace file strongly suggests an I/O problem. As mentioned at the beginning, this database uses the Linux native multipath software with its default configuration. Normally, I/O errors are transparent to processes doing I/O above the ASM mirroring and multipath layers, but while the multipath layer is retrying I/O a process can hang, and a long hang is fatal for critical background processes such as CKPT and LMON. I did not find a matching bug or note on MOS, but there is a case of an Extended RAC database crashing because of remote I/O failures; following that case, I tentatively adjusted the multipath configuration parameters as follows:

polling_interval= 10
no_path_retry = 3
/sys/module/scsi_transport_fc/parameters/dev_loss_tmo = 30

polling_interval: Specifies the interval between two path checks in seconds. For properly functioning paths, the interval between checks will gradually increase to (4 * polling_interval). The default value is 5.

no_path_retry: A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.
A value of fail indicates immediate failure, without queueing. A value of queue indicates that queueing should not stop until the path is fixed.

The “dev_loss_tmo” option controls how long the SCSI layer waits after a SCSI device fails before marking it as failed.
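For reference, the first two settings would normally live in /etc/multipath.conf; the fragment below is a sketch using the stock device-mapper-multipath syntax, not the actual file from the system in question (which was not provided). dev_loss_tmo can also be set per device in multipath.conf rather than through the sysfs module parameter.

```
defaults {
    polling_interval    10   # base path-checker interval, in seconds
    no_path_retry       3    # queue I/O for at most 3 checker intervals after all paths fail
}
```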

The OS and multipath logs were not provided, but after making the adjustments above and restarting the multipath service, the database instance has not crashed again in a month of observation.
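A rough back-of-envelope check on why these values help, assuming (as a simplification) that retries against a failed path happen once per base polling_interval: with the new settings, queueing after total path loss is abandoned well inside the 70-second heartbeat threshold that LMHB enforced on CKPT, so a path failure now surfaces as an I/O error that ASM mirroring can handle, rather than an indefinite hang. With defaults that queue indefinitely, the hang would look exactly like the endless 'control file parallel write' wait in the trace.

```python
# Back-of-envelope stall estimate for device-mapper-multipath settings.
# Assumption (simplified model): after all paths fail, queued I/O is
# retried once per polling_interval, up to no_path_retry times, before
# multipath gives up and returns an error to the caller (Oracle/ASM).

def worst_case_stall(polling_interval: int, no_path_retry: int) -> int:
    """Approximate seconds an I/O can stay queued after total path loss."""
    return polling_interval * no_path_retry

LMHB_THRESHOLD = 70  # seconds; the ORA-29770 hang threshold seen in the alert log

new_stall = worst_case_stall(polling_interval=10, no_path_retry=3)
print(new_stall)  # 30 seconds: bounded, and well under the LMHB threshold

assert new_stall < LMHB_THRESHOLD
```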

