
Troubleshooting: Oracle instance start fails to join the cluster, waiting on control file enqueue

Recently, on a customer's Oracle Exadata X7, Oracle instance 2 was evicted; when db instance 2 was restarted, the startup hung and affected the other instance (instance 1), after which the startup was terminated and instance 1 kept running normally. Analysis showed that db instance 2 timed out waiting for the control file enqueue during startup, and the OS log showed "RDS/IB: conn <192.168.*.3,192.168.*.6,4> racing for more than 1s, retry". Here is a brief note of the case.
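To see at a glance which session holds the controlfile (CF) enqueue and who is queued behind it on a running RAC database, a query along the following lines can be used (a minimal sketch against the standard gv$lock / gv$session views; lmode > 0 marks holders, request > 0 marks waiters):

SQL> select l.inst_id, l.sid, s.program, l.lmode, l.request, l.ctime, l.block
       from gv$lock l, gv$session s
      where l.type = 'CF'
        and l.inst_id = s.inst_id
        and l.sid = s.sid
      order by l.request desc, l.ctime desc;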

DB alert log inst 2

WARNING: inbound connection timed out (ORA-3136)
2023-06-21 11:55:44.532000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594.trc  (incident=363606):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 2, osid 105678'
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_363606/anbob2_diag_105594_i363606.trc
2023-06-21 11:56:34.057000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_223793.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01013: user requested cancel of current operation
ORA-06512: at line 2
2023-06-21 11:56:44.173000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_105676.trc  (incident=363830):
ORA-00239: timeout waiting for control file enqueue: held by 'inst 2, osid 105678' for more than 900 seconds   《《《《《《《《《《《《《《《《《《《《《《《《《《《《
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_363830/anbob2_lgwr_105676_i363830.trc
opidrv aborting process LGWR ospid (105676) as a result of ORA-603
PMON (ospid: 105570): terminating the instance due to error 470
System state dump requested by (instance=2, osid=105570 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594_20230621115644.trc
ORA-1092 : opitsk aborting process
ORA-1092 : opitsk aborting process
2023-06-21 11:56:47.718000 +08:00
ORA-1092 : opitsk aborting process
2023-06-21 11:56:49.276000 +08:00
Instance terminated by PMON, pid = 105570

DB instance 1

SQL>  select * from v$lock where type='CF' ;
ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
00000018B32E5C30 00000018B32E5C88       2766 CF          0          0          2          0   26391859          2

 
 
SQL> select program,status,event,last_call_et,blocking_session,state from v$session where sid=2766;
                                                                              last_call
PROGRAM                                          STATUS  EVENT                       et BLOCKING_SESSION STATE
------------------------------------------------ ------- -------------------- --------- ---------------- -------------------
oracle@dm05db01.anbob.com.cn (CKPT)               ACTIVE  rdbms ipc message     26391931                  WAITING

DB instance 1 ckpt trace

Control file enqueue hold time tracking dump at time: 791292

*** 2023-06-21 10:06:36.561
  1: 12230ms (ro) file: kcrr.c line: 3628 count: 934 total: 1444730ms time: 42978
  2: 8880ms (rw) file: kcrf.c line: 10076 count: 320 total: 509630ms time: 43233
  3: 8050ms (rw) file: kcrf.c line: 10139 count: 926 total: 1893160ms time: 4251900
  4: 7010ms (rw) file: kcv.c line: 11974 count: 1674 total: 1969640ms time: 607190
  5: 4410ms (rw) file: kct.c line: 7821 count: 886 total: 881030ms time: 1707745
  6: 3170ms (rw) file: krse.c line: 1733 count: 1 total: 3170ms time: 3349300
  7: 2630ms (ro) file: krss.c line: 871 count: 1 total: 2630ms time: 475070
  8: 2460ms (rw) file: kcv.c line: 12507 count: 2 total: 3230ms time: 4221819
  9: 2220ms (rw) file: kcv.c line: 12509 count: 5 total: 8070ms time: 3583615
 10: 1970ms (rw) file: krbm.c line: 2595 count: 7 total: 7190ms time: 1006926
Control file enqueue hold time tracking dump at time: 798533

*** 2023-06-21 12:07:18.106
  1: 12230ms (ro) file: kcrr.c line: 3628 count: 140720308487078 total: 1444730ms time: 107374225378
  2: 8880ms (rw) file: kcrf.c line: 10076 count: 140720308486464 total: 509630ms time: 107374225633
  3: 8050ms (rw) file: kcrf.c line: 10139 count: 140720308487070 total: 1893160ms time: 107378434300
  4: 7010ms (rw) file: kcv.c line: 11974 count: 140720308487819 total: 1970240ms time: 107374789590
  5: 4410ms (rw) file: kct.c line: 7821 count: 140720308487030 total: 881030ms time: 107375890145
  6: 3170ms (rw) file: krse.c line: 1733 count: 140720308486145 total: 3170ms time: 107377531700
  7: 2630ms (ro) file: krss.c line: 871 count: 140720308486145 total: 2630ms time: 107374657470
  8: 2460ms (rw) file: kcv.c line: 12507 count: 140720308486146 total: 3230ms time: 107378404219
  9: 2220ms (rw) file: kcv.c line: 12509 count: 140720308486149 total: 8070ms time: 107377766015
 10: 1970ms (rw) file: krbm.c line: 2595 count: 140720308486151 total: 7190ms time: 107375189326
SQL> select program,status,event,last_call_et,blocking_session,state,paddr from v$session where sid=2766;

                                                                              last_call
PROGRAM                                          STATUS  EVENT                       et BLOCKING_SESSION STATE               PADDR
------------------------------------------------ ------- -------------------- --------- ---------------- ------------------- ----------------
oracle@dm05db01.anbob.com.cn (CKPT)               ACTIVE  rdbms ipc message     26392502                  WAITING             0000001951B65870

SQL> select pid ,spid from v$process where addr='0000001951B65870';

       PID SPID
---------- ------------------------
        35 104889

SQL> oradebug setorapid 35
Oracle pid: 35, Unix process pid: 104889, image: oracle@dm05db01.anbob.com.cn (CKPT)
SQL> oradebug short_stack;
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kslwait()+141<-ksarcv()+207<-ksbabs()+330<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
SQL> oradebug short_stack;
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kslwait()+141<-ksarcv()+207<-ksbabs()+330<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
SQL> oradebug short_stack;
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kslwait()+141<-ksarcv()+207<-ksbabs()+330<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
SQL>

[oracle@dm05db01 ~]$ sysctl -a|grep sem

kernel.sem = 10000      10240000        10000   1024
kernel.sem_next_id = -1

DB instance 1

SQL> alter system checkpoint;
SQL> select checkpoint_change# from v$database;

[root@dm05db01 ~]# strace  -f -c -p 104889
Process 104889 attached
^CProcess 104889 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.33    0.001742          20        89           semctl
 30.16    0.001044           0     10474       448 recvmsg
  9.53    0.000330           0      6384      3192 setsockopt
  5.20    0.000180           1       286           poll
  3.90    0.000135           0      3197           sendmsg
  0.87    0.000030           1        42        21 semtimedop
  0.00    0.000000           0        12           read
  0.00    0.000000           0        12           open
  0.00    0.000000           0        12           close
  0.00    0.000000           0        11           getrusage
  0.00    0.000000           0       161           times
------ ----------- ----------- --------- --------- ----------------
100.00    0.003461                 20680      3661 total


Note:
On instance 1 the checkpoint completes normally.
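A minimal way to confirm that the checkpoint on instance 1 really completes (and is not just issued) is to check that checkpoint_change# advances across the checkpoint:

SQL> select checkpoint_change# from v$database;
SQL> alter system checkpoint;
SQL> select checkpoint_change# from v$database;   -- should return a higher SCN than the first query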

Disable automatic startup of DB instance 2
Because CRS and ASM were both healthy, db instance 2 kept being pulled up automatically by CRS; each time it started it held the controlfile enqueue (CF ENQ) for a long time and hung instance 1, so its automatic startup had to be disabled before working on instance 2.

[oracle@dm05db02 ~]$ srvctl disable instance -d anbob -i anbob2
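Once the underlying problem on instance 2 is resolved, auto-start can be restored and the instance brought back with the matching srvctl commands (same database and instance names as above):

[oracle@dm05db02 ~]$ srvctl enable instance -d anbob -i anbob2
[oracle@dm05db02 ~]$ srvctl start instance -d anbob -i anbob2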

Analyzing the instance 2 startup failure trace

vi /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594.trc
Dump continued from file: /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594.trc
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 2, osid 105678'

========= Dump for incident 363606 (ORA 494) ========
----- Beginning of Customized Incident Dump(s) -----
-------------------------------------------------------------------------------
ENQUEUE [CF] HELD FOR TOO LONG

enqueue holder: 'inst 2, osid 105678'

Process 'inst 2, osid 105678' is holding an enqueue for maximum allowed time.
The process will be terminated after collecting some diagnostics.
It is possible the holder may release the enqueue before the termination
is attempted. If so, the process will not be terminated and
this incident will only be served as a warning.
Oracle Support Services triaging information: to find the root-cause, look
at the call stack of process 'inst 2, osid 105678' located below. Ask the
developer that owns the first NON-service layer in the stack to investigate.
Common service layers are enqueues (ksq), latches (ksl), library cache
pins and locks (kgl), and row cache locks (kqr).

Current Wait Stack:
 0: waiting for 'control file parallel write'
    files=0x2, block#=0x6e, requests=0x2
    wait_id=896400246 seq_num=62826 snap_id=1
    wait times: snap=18 min 31 sec, exc=18 min 31 sec, total=18 min 31 sec
    wait times: max=infinite, heur=18 min 31 sec
    wait counts: calls=0 os=0
    in_wait=1 iflags=0x5a0
There are 932 sessions blocked by this session.
Dumping one waiter:
  inst: 2, sid: 2687, ser: 1
  wait event: 'enq: CF - contention'
    p1: 'name|mode'=0x43460005
    p2: '0'=0x0
    p3: 'operation'=0x0
  row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
  min_blocked_time: 23 secs, waiter_cache_ver: 29350
Wait State:
  fixed_waits=0 flags=0x22 boundary=(nil)/-1
Session Wait History:
    elapsed time of 0.000125 sec since current wait
 0: waited for 'control file parallel write'
    files=0x2, block#=0x70, requests=0x2
    wait_id=896400245 seq_num=62825 snap_id=1
    wait times: snap=5 min 12 sec, exc=5 min 12 sec, total=5 min 12 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000329 sec of elapsed time
 1: waited for 'direct path write'
    file number=0x3db, first dba=0x1, block cnt=0x1
    wait_id=896400244 seq_num=62824 snap_id=1
    wait times: snap=10 min 25 sec, exc=10 min 25 sec, total=10 min 25 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.001413 sec of elapsed time
 2: waited for 'control file sequential read'
    file#=0x0, block#=0x6f, blocks=0x1
    wait_id=896400243 seq_num=62823 snap_id=1
    wait times: snap=0.001013 sec, exc=0.001013 sec, total=0.001013 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000021 sec of elapsed time
 3: waited for 'control file parallel write'
    files=0x2, block#=0x6c, requests=0x2
    wait_id=896400242 seq_num=62822 snap_id=1
    wait times: snap=5 min 11 sec, exc=5 min 11 sec, total=5 min 11 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.001367 sec of elapsed time
 4: waited for 'control file sequential read'
    file#=0x0, block#=0x6d, blocks=0x1
    wait_id=896400241 seq_num=62821 snap_id=1
    wait times: snap=0.000545 sec, exc=0.000545 sec, total=0.000545 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.001562 sec of elapsed time
 5: waited for 'control file sequential read'
    file#=0x0, block#=0x6b, blocks=0x1
    wait_id=896400240 seq_num=62820 snap_id=1
    wait times: snap=0.001161 sec, exc=0.001161 sec, total=0.001161 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000050 sec of elapsed time
 6: waited for 'control file sequential read'
    file#=0x0, block#=0x7c, blocks=0x1
    wait_id=896400239 seq_num=62819 snap_id=1
    wait times: snap=0.000470 sec, exc=0.000470 sec, total=0.000470 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000524 sec of elapsed time
 7: waited for 'control file sequential read'
    file#=0x0, block#=0x7a, blocks=0x1
    wait_id=896400238 seq_num=62818 snap_id=1
    wait times: snap=0.000357 sec, exc=0.000357 sec, total=0.000357 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000159 sec of elapsed time
 8: waited for 'control file sequential read'
    file#=0x0, block#=0x78, blocks=0x1
    wait_id=896400237 seq_num=62817 snap_id=1
    wait times: snap=0.000392 sec, exc=0.000392 sec, total=0.000392 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000179 sec of elapsed time
 9: waited for 'control file sequential read'
    file#=0x0, block#=0x76, blocks=0x1
    wait_id=896400236 seq_num=62816 snap_id=1
    wait times: snap=0.000373 sec, exc=0.000373 sec, total=0.000373 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000033 sec of elapsed time
Sampled Session History of session 2766 serial 1
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
samples  in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).

The history is displayed in reverse chronological order.

sample interval: 1 sec, max history 900 sec
---------------------------------------------------
  [1130 samples,                                           11:36:55 - 11:55:44]
    idle wait at each sample
[CF][current][num io 20][tot tim 1249367 ms]
Dumping process 35.105678 info:

*** 2023-06-21 11:55:44.539
Process diagnostic dump for oracle@dm05db02.anbob.com.cn (CKPT), OS id=105678, --blocker session
pid: 35, proc_ser: 1, sid: 2766, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 11:55:43 ]
    NOTE: scheduling delay has not been sampled for 0.694673 secs
  0.000000 secs from [ 11:55:39 - 11:55:44 ], 5 sec avg
  0.000000 secs from [ 11:54:44 - 11:55:44 ], 1 min avg
  0.000000 secs from [ 11:50:44 - 11:55:44 ], 5 min avg
loadavg : 3.66 3.31 5.50
Memory (Avail / Total) = 84835.05M / 772256.64M
Swap (Avail / Total) = 24576.00M /  24576.00M
F S UID         PID   PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle   105678      1  0  80   0 - 26284836 -    2022 ?        2-00:59:07 ora_ckpt_anbob2
Short stack dump:
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-__poll()+24<-sskgxp_selectex()+392<-skgxpiwait()+4318<-skgxpwaiti()+1916<-skgxpwait()+162<-ossnet_wait_all()+248<-ossnet_wait()+212
<-ossdisk_reopen()+293<-ossnet_reopen_dev_handles()+161<-ossnet_reconnect_box_handle_int()+949<-ossnet_vrpc_completion()+1285<-ossnet_process_netmsg_completion()+117<-ossnet_wait_all()+397<-ossnet_wait()+212
<-oss_wait()+991<-kfk_wait_on_oss_io()+406<-kfk_block_oss()+349<-kfk_block()+901<-kfk_transit_waitIO()+188<-kfk_transitIO()+1770<-kfioWaitIO()+721<-kfioRequestPriv()+351<-kfioRequest()+691<-ksfd_kfioRequest(
)+649<-ksfd_osmbio()+413<-ksfdbio()+1579<-ksfdss_bio()+75<-ksfdbio()+2637<-kccwbp()+607<-kccrec_wdf()+364<-kccfhb()+1345<-kcccmt()+251<-kccecx()+61<-kcvcca()+2904<-ksbcti()+513<-ksbabs()+1742<-ksbrdp()+1045<
-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253

vi /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_105676.trc
*** 2023-06-21 11:56:44.175
*** SESSION ID:(2687.1) 2023-06-21 11:56:44.175
*** CLIENT ID:() 2023-06-21 11:56:44.175
*** SERVICE NAME:(SYS$BACKGROUND) 2023-06-21 11:56:44.175
*** MODULE NAME:() 2023-06-21 11:56:44.175
*** ACTION NAME:() 2023-06-21 11:56:44.175

Dump continued from file: /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_105676.trc
ORA-00239: timeout waiting for control file enqueue: held by 'inst 2, osid 105678' for more than 900 seconds

========= Dump for incident 363830 (ORA 239) ========
----- Beginning of Customized Incident Dump(s) -----
-------------------------------------------------------------------------------

TIMEOUT WAITING FOR CONTROL FILE ENQUEUE

    requested mode : X
    enqueue holder : 'inst 2, osid 105678'
 enqueue held time : 900 seconds

The control file enqueue was held by process 'inst 2, osid 105678' for
more than 900 seconds. Attempts to terminate the enqueue holder were
unsuccesful, therefore terminating the current process.

To resolve this problem try restarting 'instance 2' or rebooting
the machine that that 'instance 2' is on.

From the very beginning, the db instance 2 alert log during startup already showed the hang manager (DIA0) process detecting that the CKPT process had held the CF enqueue for more than 900 seconds, while LGWR failed in its request for the CF enqueue in X mode because CKPT kept holding it. CKPT's write-related waits, 'control file parallel write' and 'direct path write', lasted as long as 5 or even 10 minutes per wait, while reads were normal. A pattern like this usually points to a problem in areas such as multipathing, synchronous replication, or the storage link.
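One way to confirm that it is the control file writes (and not the reads) that are slow is to look at the wait-event latency histograms on the affected instance, for example (a rough sketch using the standard v$event_histogram view; on RAC, gv$event_histogram can be used instead):

SQL> select event, wait_time_milli, wait_count
       from v$event_histogram
      where event in ('control file parallel write',
                      'control file sequential read',
                      'direct path write')
      order by event, wait_time_milli;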

OS log

Jun 21 13:55:51 dm05db02 kernel: [152204048.913900] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:52 dm05db02 kernel: [152204049.277575] RDS/IB: conn <192.168.*.3,192.168.*.6,4> racing for more than 1s, retry
Jun 21 13:55:52 dm05db02 kernel: [152204049.560413] IPv4: martian source 192.168.*.4 from 192.168.*.2, on dev ib1
Jun 21 13:55:52 dm05db02 kernel: [152204049.560419] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:52 dm05db02 kernel: [152204049.560421] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:52 dm05db02 kernel: [152204049.913696] IPv4: martian source 192.168.*.4 from 192.168.*.1, on dev ib1
Jun 21 13:55:52 dm05db02 kernel: [152204049.913716] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:52 dm05db02 kernel: [152204049.913718] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:53 dm05db02 kernel: [152204050.672772] IPv4: martian source 192.168.*.4 from 192.168.*.2, on dev ib1
Jun 21 13:55:53 dm05db02 kernel: [152204050.672791] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:53 dm05db02 kernel: [152204050.672793] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:53 dm05db02 kernel: [152204050.913285] IPv4: martian source 192.168.*.4 from 192.168.*.1, on dev ib1
Jun 21 13:55:53 dm05db02 kernel: [152204050.913305] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:53 dm05db02 kernel: [152204050.913307] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:54 dm05db02 kernel: [152204051.142976] RDS/IB: conn <192.168.*.3,192.168.*.6,4> racing for more than 1s, retry

# ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.35.5538
	Hardware version: 1
	Node GUID: 0x506b4b030061e530
	System image GUID: 0x506b4b030061e533
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 3
		LMC: 0
		SM lid: 2
		Capability mask: 0x02514868
		Port GUID: 0x506b4b030061e531
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 4
		LMC: 0
		SM lid: 2
		Capability mask: 0x02514868
		Port GUID: 0x506b4b030061e532
		Link layer: InfiniBand

# lspci|grep InfiniBand
3b:00.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3]

Note:
There are RDS/IB messages of this kind. Because an Oracle Exadata machine uses the RDS protocol for network traffic on the private interconnect and between database and storage servers, messages like these mean there is network latency between those endpoints. However, the on-site team reported that tests with the ibping and rds-ping commands were normal and no hardware fault LEDs were lit. ibstat shows all ports Active/LinkUp, and the firmware version is 2.35.5538.
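For reference, checks of this kind can be reproduced roughly as follows (a sketch only; rds-ping comes from the rds-tools package and ibqueryerrors/perfquery from infiniband-diags, and the exact options vary by version). Growing symbol-error or link-downed counters would point at a cable, port, or firmware problem rather than the database:

# RDS-level latency to the peer address over the IB network
$ rds-ping -c 10 192.168.*.6

# IB port error counters on the local HCA
$ ibqueryerrors
$ perfquery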

Control file test

SQL> SELECT * FROM V$CONTROLFILE;

STATUS  NAME                                                                             IS_ BLOCK_SIZE FILE_SIZE_BLKS
------- -------------------------------------------------------------------------------- --- ---------- --------------
        +DATA_DM01/anbob/controlfile/current01.ctl                                        NO       16384          29760
        +RECO_DM01/anbob/controlfile/current02.ctl                                        NO       16384          29760

SQL> SELECT 16*29760 FROM DUAL;
  16*29760
----------
    476160
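The same size can be computed directly from the view (block_size here is in bytes, so 16384 bytes x 29760 blocks is roughly 465 MB per control file, consistent with the 476160 KB above):

SQL> select name, block_size*file_size_blks/1024/1024 size_mb from v$controlfile;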

# su - grid

[grid@dm05db02 ~]$ asmcmd
ASMCMD>  cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl
^C

[grid@dm05db01 ~]$ asmcmd
ASMCMD>  cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl
^C

ASMCMD> ls  +DATA_DM01/anbob/controlfile/current01.ctl
current01.ctl
ASMCMD> ls -l  +DATA_DM01/anbob/controlfile/current01.ctl
^[[[A

Note:
The control file is roughly 450 MB in size. Copying the controlfile directly in ASMCMD hung on both nodes, yet a short while later it would occasionally work normally again and return the result in a second.

$ asmcmd
ASMCMD> ls -l +DATA_DM01/anbob/controlfile/current01.ctl
Type         Redund  Striped  Time             Sys  Name
                                               N    current01.ctl => +DATA_DM01/anbob/CONTROLFILE/current.765.985169275
ASMCMD> ls -l +RECO_DM01/anbob/controlfile/current02.ctl
Type         Redund  Striped  Time             Sys  Name
                                               N    current02.ctl => +RECO_DM01/anbob/CONTROLFILE/current.1522.985169281

ASMCMD> cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl_bak
copying +DATA_DM01/anbob/controlfile/current01.ctl -> /tmp/ctl_bak
ASMCMD> 
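To quantify the intermittent slowness, the same copy can also be timed from the OS shell as the grid user and compared between a hang window and a normal window (a simple sketch; /tmp/ctl_test is just a scratch destination):

[grid@dm05db01 ~]$ time asmcmd cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl_test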

This shows that the current IB storage network is unstable, or that some anomaly or bug exists in it.

No exactly matching case was found in MOS. Oracle Linux: RDS/IB Re-connection And Stalling for More Than 1 Min (Doc ID 2611709.1) records:

Jan 1 00:00:48 hostname kernel: [12751701.033437] RDS/IB: connection <::ffff:192.168.0.3,::ffff:192.168.1.142,0> dropped due to 'incoming REQ in CONN_UP state'
Jan 1 00:00:48 hostname kernel: [12751701.033771] RDS/IB: re-connect <::ffff:192.168.1.142,::ffff:192.168.0.3,0> stalling for more than 1 min...(drops=1 err=0)
Jan 1 00:00:48 hostname kernel: [12751701.035083] RDS/IB: re-connect <::ffff:192.168.1.142,::ffff:192.168.0.1,0> stalling for more than 1 min...(drops=1 err=0)

That note does not fully match this case; there the cause was an InfiniBand card firmware bug, and the fix was to upgrade the InfiniBand ConnectX-3 firmware to version 2.35.6302 or newer.

In the end, the customer reseated the IB cables on the instance 2 host, and, surprisingly, everything returned to normal.
