Troubloshooting Oracle RAC node reboot and OS log show “kernel: qla2xxx[ ] Abort command issued”

近期1客户Oracle RAC 节点OS重启，协助分析原因，db层无日志错误输出，RAC层有vote disk I/O timeout， OS层 qla2xxx [0000:81:00.0]-801c:7: Abort command 和DEVICE RESET 操作。qla2xxx 是QLogic FC HBA的驱动，怀疑重启是HBA卡导致IO失败，导致disk timeout, CRS发起reboot. 简单记录该问题。

db alert log

Mon Jun 19 10:11:17 2023

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
	TNS for Linux: Version 11.2.0.4.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 19-JUN-2023 10:11:17
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.51.124)(PORT=37840))
Mon Jun 19 10:20:17 2023                                                            <<<<<<<<<<<<<<<<<<< log  gap 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 96
Number of processor cores in the system is 48
Number of processor sockets in the system is 4

Note
db alert 无error

LHMB trce

*** 2023-06-19 10:09:45.578
==============================
LMON (ospid: 7091) has not moved for 128 sec (1687140585.1687140457)
kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2
  : waiting for event 'control file sequential read' for 116 secs with wait_id 843992236.
  ===[ Wait Chain ]===
  Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

*** 2023-06-19 10:10:06.058
==============================
LMON (ospid: 7091) has not moved for 148 sec (1687140605.1687140457)
kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2
  : waiting for event 'control file sequential read' for 137 secs with wait_id 843992236.
  ===[ Wait Chain ]===
  Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

*** 2023-06-19 10:10:26.537
==============================
LMON (ospid: 7091) has not moved for 169 sec (1687140626.1687140457)
kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2
  : waiting for event 'control file sequential read' for 157 secs with wait_id 843992236.
  ===[ Wait Chain ]===
  Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

*** 2023-06-19 10:10:47.017
==============================
LMON (ospid: 7091) has not moved for 189 sec (1687140646.1687140457)
kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2
  : waiting for event 'control file sequential read' for 178 secs with wait_id 843992236.
  ===[ Wait Chain ]===
  Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

*** 2023-06-19 10:11:07.497
==============================
LMON (ospid: 7091) has not moved for 209 sec (1687140666.1687140457)
kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2
  : waiting for event 'control file sequential read' for 198 secs with wait_id 843992236.
  ===[ Wait Chain ]===
  Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

*** 2023-06-19 10:11:27.977
==============================
LMON (ospid: 7091) has not moved for 230 sec (1687140687.1687140457)
kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2
  : waiting for event 'control file sequential read' for 219 secs with wait_id 843992236.
  ===[ Wait Chain ]===
  Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

Note:
出现LMON hang，等待control file sequential read持续时间增长,约从2023-06-19 10:07开始。

GI Alert log

2023-06-19 10:07:51.826: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 8010 msecs.
2023-06-19 10:07:51.826: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 7870 msecs.
2023-06-19 10:07:53.826: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 9360 msecs.
。。。
2023-06-19 10:08:08.979: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 25020 msecs.
2023-06-19 10:08:08.979: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 25160 msecs.
2023-06-19 10:08:08.979: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 25020 msecs.
2023-06-19 10:08:10.829: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 26360 msecs.
2023-06-19 10:08:16.831: 
。。。
。。。
2023-06-19 10:08:49.111: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 65290 msecs.
。。。
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 82880 msecs.
2023-06-19 10:09:15.843: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 91370 msecs.
2023-06-19 10:09:15.843: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 91880 msecs.
2023-06-19 10:09:15.843: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 92020 msecs.
2023-06-19 10:09:15.843: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 91880 msecs.
2023-06-19 10:09:18.971: 
[crsd(5234)]CRS-2765:Resource 'ora.OCR.dg' has failed on server 'ANBOB1'.
2023-06-19 10:09:19.025: 
[crsd(5234)]CRS-2878:Failed to restart resource 'ora.OCR.dg'
2023-06-19 10:09:19.026: 
[crsd(5234)]CRS-2769:Unable to failover resource 'ora.OCR.dg'.
2023-06-19 10:09:23.284: 
[cssd(4644)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/mapper/ocr01 will be considered not functional in 99530 milliseconds
2023-06-19 10:09:23.844: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 99380 msecs.
2023-06-19 10:09:23.844: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 99890 msecs.
2023-06-19 10:09:23.844: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 100030 msecs.
2023-06-19 10:09:23.844: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 99890 msecs.
2023-06-19 10:09:24.285: 
[cssd(4644)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/mapper/ocr03 will be considered not functional in 99180 milliseconds
2023-06-19 10:09:31.846: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 107380 msecs.
2023-06-19 10:09:31.846: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 107890 msecs.
2023-06-19 10:09:31.846: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 108030 msecs.
2023-06-19 10:09:31.846: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 107890 msecs.
。。。
。。。
2023-06-19 10:10:06.852: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 142890 msecs.
2023-06-19 10:10:06.852: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 143030 msecs.
2023-06-19 10:10:06.852: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 142890 msecs.
2023-06-19 10:10:13.134: 
[cssd(4644)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/mapper/ocr01 will be considered not functional in 49680 milliseconds
2023-06-19 10:10:14.134: 
[cssd(4644)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/mapper/ocr03 will be considered not functional in 49330 milliseconds
2023-06-19 10:10:14.854: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 150390 msecs.
2023-06-19 10:10:14.854: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 150900 msecs.
2023-06-19 10:10:14.854: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 151040 msecs.
2023-06-19 10:10:14.854: 
。。。
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 166900 msecs.
2023-06-19 10:10:30.857: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 167040 msecs.
2023-06-19 10:10:30.857: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 166900 msecs.
2023-06-19 10:10:39.858: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 175390 msecs.
2023-06-19 10:10:39.858: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 175900 msecs.
2023-06-19 10:10:39.858: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 176040 msecs.
2023-06-19 10:10:39.858: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 175900 msecs.
2023-06-19 10:10:43.339: 
[cssd(4644)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/mapper/ocr01 will be considered not functional in 19480 milliseconds
2023-06-19 10:10:44.339: 
[cssd(4644)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/mapper/ocr03 will be considered not functional in 19130 milliseconds
2023-06-19 10:10:47.860: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 183390 msecs.
2023-06-19 10:10:47.860: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 183900 msecs.
2023-06-19 10:10:47.860: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 184040 msecs.
2023-06-19 10:10:47.860: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 183900 msecs.
2023-06-19 10:10:56.494: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 192030 msecs.
2023-06-19 10:10:56.494: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 192540 msecs.
2023-06-19 10:10:56.494: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 192680 msecs.
2023-06-19 10:10:56.494: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 192540 msecs.
2023-06-19 10:11:03.216: 
[cssd(4644)]CRS-1604:CSSD voting file is offline: /dev/mapper/ocr01; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log.
2023-06-19 10:11:04.217: 
[cssd(4644)]CRS-1604:CSSD voting file is offline: /dev/mapper/ocr03; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log.
2023-06-19 10:11:04.217: 
[cssd(4644)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2,
 resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log
2023-06-19 10:11:04.217: 
[cssd(4644)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log
2023-06-19 10:11:04.262: 
[cssd(4644)]CRS-1652:Starting clean up of CRSD resources.
2023-06-19 10:11:05.422: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 200950 msecs.
2023-06-19 10:11:05.422: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 201460 msecs.
2023-06-19 10:11:05.422: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 201600 msecs.
2023-06-19 10:11:05.422: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 201460 msecs.
2023-06-19 10:11:05.590: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(6404)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/crsd/oraagent_grid//oraagent_grid.log"
2023-06-19 10:11:06.092: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(6404)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/crsd/oraagent_grid//oraagent_grid.log"
2023-06-19 10:11:06.095: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(6404)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/crsd/oraagent_grid//oraagent_grid.log"
2023-06-19 10:11:11.209: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 7990 msecs.
2023-06-19 10:11:13.649: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 209180 msecs.
2023-06-19 10:11:13.649: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 209690 msecs.
2023-06-19 10:11:13.649: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 209830 msecs.
2023-06-19 10:11:13.649: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 209690 msecs.
2023-06-19 10:11:19.854: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 16640 msecs.
2023-06-19 10:11:21.905: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 217440 msecs.
2023-06-19 10:11:21.905: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 217950 msecs.
2023-06-19 10:11:21.905: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 218090 msecs.
2023-06-19 10:11:21.905: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 217950 msecs.
2023-06-19 10:11:28.056: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 24840 msecs.
2023-06-19 10:11:30.109: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 225640 msecs.
2023-06-19 10:11:30.109: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 226150 msecs.
2023-06-19 10:11:30.109: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 226290 msecs.
2023-06-19 10:11:30.109: 
[cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 226150 msecs.
2023-06-19 10:18:38.098: 
[ohasd(4848)]CRS-2112:The OLR service started on node ANBOB1.
2023-06-19 10:18:38.125: 
[ohasd(4848)]CRS-8011:reboot advisory message from host: ANBOB1, component: cssmonit, with time stamp: L-2023-06-19-10:11:32.220      <<<<<<<<<<<<<<<
[ohasd(4848)]CRS-8013:reboot advisory message text: oracssdmonitor is rebooting this node due to vote disk I/O timeout 
(no disk activity for 28030 milliseconds).
2023-06-19 10:18:38.129: 
[ohasd(4848)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 1 were announced and 0 errors occurred
2023-06-19 10:18:38.125: 
[ohasd(4848)]CRS-1301:Oracle High Availability Service started on node ANBOB1.
2023-06-19 10:18:38.617: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(5570)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/ohasd/oraagent_grid//oraagent_grid.log"
2023-06-19 10:18:41.778: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5574)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 
2023-06-19 10:18:44.024: 
[gpnpd(5734)]CRS-2328:GPNPD started on node ANBOB1. 
2023-06-19 10:18:46.427: 
[cssd(5807)]CRS-1713:CSSD daemon is started in clustered mode

Note:
显示CSSD发vdiskping也就是disk hearbeat超时，也是从2023-06-19 10:07:43开始，11g RAC默认disktimeout 为200s, reboottime为3s. 在10:11:04 开始做rebootless 清理CRS资源，但cssd多个线程hang, 10:11:32发起reboot， 10:18重启动。

OS message

Jun 19 10:05:01 ANBOB1 su: (to oracle) root on none
Jun 19 10:05:02 ANBOB1 systemd: Removed slice User Slice of root.
Jun 19 10:05:05 ANBOB1 su: (to oracle) root on none
Jun 19 10:05:05 ANBOB1 systemd: Started Session c245246 of user oracle.
Jun 19 10:06:05 ANBOB1 su: (to oracle) root on none
Jun 19 10:06:05 ANBOB1 systemd: Started Session c245247 of user oracle.
Jun 19 10:07:05 ANBOB1 su: (to oracle) root on none
Jun 19 10:07:05 ANBOB1 systemd: Started Session c245248 of user oracle.
Jun 19 10:07:43 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:4 -- 2003.
Jun 19 10:08:28 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:0 -- 2003.  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Jun 19 10:08:41 ANBOB1 su: (to oracle) root on none
Jun 19 10:08:41 ANBOB1 systemd: Started Session c245249 of user oracle.
Jun 19 10:09:13 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:2 -- 2003.
Jun 19 10:09:33 ANBOB1 systemd-udevd: worker [26686] /devices/virtual/block/dm-5 is taking a long time <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Jun 19 10:09:33 ANBOB1 systemd-udevd: worker [26688] /devices/virtual/block/dm-12 is taking a long time  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Jun 19 10:10:01 ANBOB1 systemd: Created slice User Slice of root.
Jun 19 10:10:01 ANBOB1 systemd: Started Session 236697 of user root.
Jun 19 10:10:01 ANBOB1 systemd: Started Session 236698 of user root.
Jun 19 10:10:01 ANBOB1 su: (to oracle) root on none
Jun 19 10:10:32 ANBOB1 kernel: qla2xxx [0000:81:00.0]-8009:7: DEVICE RESET ISSUED nexus=7:1:0 cmd=00000000c2007889.  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Jun 19 10:10:32 ANBOB1 kernel: qla2xxx [0000:81:00.0]-800e:7: DEVICE RESET SUCCEEDED nexus:7:1:0 cmd=00000000c2007889.
Jun 19 10:10:41 ANBOB1 su: (to oracle) root on none
Jun 19 10:10:41 ANBOB1 systemd: Started Session c245250 of user oracle.
Jun 19 10:11:04 ANBOB1 avahi-daemon[2471]: Withdrawing address record for 10.10.53.153 on bond0.
Jun 19 10:11:04 ANBOB1 avahi-daemon[2471]: Withdrawing address record for 10.10.53.155 on bond0.
Jun 19 10:11:05 ANBOB1 ntpd[2629]: Deleting interface #12 bond0:3, 10.10.53.155#123, interface stats: received=0, sent=0, dropped=0, active_time=823276 secs
Jun 19 10:11:05 ANBOB1 ntpd[2629]: Deleting interface #10 bond0:1, 10.10.53.153#123, interface stats: received=0, sent=0, dropped=0, active_time=36846710 secs
Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:0 -- 2003.
Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-8009:7: DEVICE RESET ISSUED nexus=7:1:0 cmd=00000000c2007889.
Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-800e:7: DEVICE RESET SUCCEEDED nexus:7:1:0 cmd=00000000c2007889.
Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-8009:7: DEVICE RESET ISSUED nexus=7:1:2 cmd=00000000fd95afce.
Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-800e:7: DEVICE RESET SUCCEEDED nexus:7:1:2 cmd=00000000fd95afce.
Jun 19 10:18:03 ANBOB1 kernel: microcode: microcode updated early to revision 0x16, date = 2019-06-17                         <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< OS reboot
Jun 19 10:18:03 ANBOB1 kernel: Linux version 5.4.17-2011.6.2.el7uek.x86_64 (mockbuild@jenkins-172-17-0-2-64d560bf-4b94-4bde-a5f8-f55888be27a0) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39.0.3) (GCC)) #2 SMP Thu Sep 3 14:09:14 PDT 2020
Jun 19 10:18:03 ANBOB1 kernel: Command line: BOOT_IMAGE=/vmlinuz-5.4.17-2011.6.2.el7uek.x86_64 root=/dev/mapper/vg_ggdb1-lv_root ro crashkernel=auto rd.lvm.lv=vg_ggdb1/lv_root rd.lvm.lv=vg_ggdb1/lv_swap rhgb quiet LANG=en_US.UTF-8
Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Jun 19 10:18:03 ANBOB1 kernel: BIOS-provided physical RAM map:
Jun 19 10:18:03 ANBOB1 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009e7ff] usable

Note:
根据MESSAGE日志的报错涉及
7:1:0 — dm-12 — data01
7:1:2 — dm-13 — ocr01
7:1:4 — dm-4 — ocr03

7:1:1– dm-5 — asmdata02

日志含义
kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:0 — 2003.

qla2xxx is the name of the driver or kernel module.
[0000:81:00.0] is the PCI bus information of the device.
801c is a hexadecimal id which uniquely identifies the part of the code from where the message originated.
7 is the host number of the scsi target.
Abort command issued nexus=7:1:0 The driver aborted the command that was in progress to the scsi target 7:1:0.
2002 means the reset succeeded, 2003 means the reset failed.

10:11:04 开始释放CRS资源，对应OS日志的删除VIP Interface，在此之前只有data01的链路reset成功。而ocr依赖的链路在REBOOT时才成功。这也许是有双链路而依旧重启的原因，但长时间hang为什么没有链路切换或reset是HBA 驱动层的问题。需要联系硬件工程师分析HBA卡问题原因，如是否需要固件升级等。

dmesg

[    3.768929] [TTM] Zone  kernel: Available graphics memory: 528137490 KiB
[    3.768930] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[    3.768931] [TTM] Initializing pool allocator
[    3.768935] [TTM] Initializing DMA pool allocator
[    3.775376] fbcon: astdrmfb (fb0) is primary device
[    3.778660] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.01.00.25-k.
[    3.779073] qla2xxx [0000:81:00.0]-011c: : MSI-X vector count: 32.
[    3.779076] qla2xxx [0000:81:00.0]-001d: : Found an ISP2031 irq 35 iobase 0x00000000dc296659.
[    3.781010] qla2xxx 0000:81:00.0: DMAR: 32bit DMA uses non-identity mapping
[    3.790795] megaraid_sas 0000:41:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[    3.790810] megaraid_sas 0000:41:00.0: INIT adapter done
[    3.790812] megaraid_sas 0000:41:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5675
[    3.791859] megaraid_sas 0000:41:00.0: pci id		: (0x1000)/(0x005b)/(0x1000)/(0x9271)
[    3.791862] megaraid_sas 0000:41:00.0: unevenspan support	: no

Note：
有1个 qla2xxx [0000:81:00.0] QLogic Fibre Channel HBA

确认qla2xxx 设备

# lspci|grep 81
81:00.0 Fibre Channel: OLogic Corp.TSP8324-based16Gb Fibre Channel to pCi Express Adapter
81:00.1 Fibre  Channel: OLogic Corp.TSP8324-based16Gb Fibre Channel to pCi Express Adapter

当时没有查看该信息，但反馈有2块HBA卡，而不是同1 HBA的2个口。

检查卡错误基于上面的PCI地址

grep 81:00.0 var/log/messages | grep qla2xxx

启用qla2xxx driver扩展日志

Enable extended logging for the qla2xxx driver to try to capture any additional error messages when the issue occurs
Raw

$ chmod u+w /sys/module/qla2xxx/parameters/ql2xextended_error_logging
$ echo "1" > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

Increase scsi extended event logging to get more information from the SCSI layer.
$  sysctl -w dev.scsi.logging_level=0x1003

— enjoy —