
Exadata Instance crash ORA-600 [ksz_cln_proc1] and restart failure due to breakdown of one CellServer (Case Study)

Some time ago, on an Oracle Exadata Machine X2-2 two-node RAC system, both database instances crashed one after the other and then failed to start automatically. The ASM storage uses Normal redundancy and each disk group spans three cells, so in theory losing a single cell should not affect the database. Here is a brief record of the failure and the troubleshooting approach.

[oracle@kdexa1db01 (anbob1)oracle]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Fri Dec 1 09:51:38 2017
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to an idle instance.

SQL> startup
ORA-00304: requested INSTANCE_NUMBER is busy

Note:
The database is a RAC system and node 2 was still up at that moment, so I checked whether any processes of the local instance were still alive, whether the shared memory had been released, whether CRS was in the middle of auto-starting the DB instance, and whether the parameter file really contained a duplicate instance_number. Right at that point node 2 also crashed suddenly, and a manual startup hung.
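
A minimal sketch of those checks (the ORACLE_SID anbob1 and the database name orarpt are only taken from the trace paths in this case; adjust to your environment):

# Any leftover background processes of the local instance?
ps -ef | grep -v grep | grep ora_.*_anbob1
# Shared memory segments still held by oracle?
ipcs -m
# Is CRS itself in the middle of (re)starting the instance?
crsctl stat res ora.orarpt.db -t
# Where does the spfile live, and does the local pfile only point to it?
srvctl config database -d orarpt | grep -i spfile
cat $ORACLE_HOME/dbs/initanbob1.ora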

Database instance check

-- db1 alert log 

Immediate Kill Session: sess: 0xa59411b58  OS pid: 32215
Fri Dec 01 09:40:17 2017
opiodr aborting process unknown ospid (32520) as a result of ORA-28
Fri Dec 01 09:40:22 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_dia0_16581.trc  (incident=2085703):
ORA-32701: Possible hangs up to hang ID=109905 detected
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/anbob1/incident/incdir_2085703/anbob1_dia0_16581_i2085703.trc
DIA0 requesting termination of session sid:2334 with serial # 64947 (ospid:13691) on instance 2
    due to a LOCAL, HIGH confidence hang with ID=109905.
    Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=109905.
Fri Dec 01 09:40:28 2017
DSKM process appears to be hung. Initiating system state dump.
Fri Dec 01 09:40:28 2017
System state dump requested by (instance=1, osid=16565 (GEN0)), summary=[system state dump request (ksz_check_ds)].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_diag_16567.trc
Fri Dec 01 09:40:45 2017
Immediate Kill Session#: 2661, Serial#: 19795
Immediate Kill Session: sess: 0xa493a7190  OS pid: 1420
...
...
..
Fri Dec 01 09:41:02 2017
ORA-00020: maximum number of processes (2000) exceeded
 ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
Fri Dec 01 09:41:22 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_dia0_16581.trc  (incident=2085704):
ORA-32701: Possible hangs up to hang ID=109909 detected
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/anbob1/incident/incdir_2085704/anbob1_dia0_16581_i2085704.trc
Fri Dec 01 09:41:40 2017
DIA0 terminating blocker (ospid: 1842 sid: 130 ser#: 63021) of hang with ID = 109906
    requested by master DIA0 process on instance 1
    Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
    by terminating session sid: 130 ospid: 1842
DIA0 successfully terminated session sid:130 ospid:1842 with status 31.
DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=109906.
DIA0 terminating blocker (ospid: 3355 sid: 2586 ser#: 14507) of hang with ID = 109907
    requested by master DIA0 process on instance 1
    Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
    by terminating session sid: 2586 ospid: 3355
DIA0 successfully terminated session sid:2586 ospid:3355 with status 31.
DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=109907.
Fri Dec 01 09:41:51 2017
DIA0 terminating blocker (ospid: 3311 sid: 2938 ser#: 12823) of hang with ID = 109908
    requested by master DIA0 process on instance 1
    Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
    by terminating session sid: 2938 ospid: 3311
DIA0 successfully terminated session sid:2938 ospid:3311 with status 31.
DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=109908.
DIA0 terminating blocker (ospid: 3333 sid: 899 ser#: 12861) of hang with ID = 109909
    requested by master DIA0 process on instance 1
    Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
    by terminating session sid: 899 ospid: 3333
Fri Dec 01 09:41:57 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_pmon_16539.trc  (incident=2069655):
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x94B89D2A0], [3], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/anbob1/incident/incdir_2069655/anbob1_pmon_16539_i2069655.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_pmon_16539.trc:
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x94B89D2A0], [3], [], [], [], [], [], [], [], [], []
PMON (ospid: 16539): terminating the instance due to error 472

Fri Dec 01 09:42:04 2017
ORA-1092 : opitsk aborting process
Instance terminated by PMON, pid = 16539
USER (ospid: 23180): terminating the instance
Termination issued to instance processes. Waiting for the processes to exit
Fri Dec 01 09:42:13 2017
Instance terminated by USER, pid = 23180
Fri Dec 01 09:43:48 2017
Starting ORACLE instance (normal)

...
...
...
Fri Dec 01 09:55:15 2017
MMON started with pid=48, OS id=18350
Fri Dec 01 09:55:15 2017
MMNL started with pid=50, OS id=18356
Fri Dec 01 09:55:28 2017
USER (ospid: 14048): terminating the instance due to error 304
Instance terminated by USER, pid = 14048
Fri Dec 01 09:57:23 2017
ORA-01565: Unable to open Spfile ?/dbs/spfile@.ora.

Note:
The DB alert log shows that the DSKM (DiskMon) process started to hang. Because the database could no longer process transactions, sessions piled up until the maximum process count (ORA-00020) was reached, and then the internal error ORA-600 [ksz_cln_proc1] was raised. This ORA-600 is usually seen when DSKM hangs because ASM responds slowly (or for some other reason) and PMON's fencing/isolation request fails while it is cleaning up the resources of terminated sessions; there are two related bugs in this release. On top of that there is the startup failure ORA-304 and the unreadable spfile, all of which point to ASM being in trouble: the DB could no longer read from or write to ASM. Earlier, while checking whether the instance number was duplicated, 'create pfile from spfile' had already reported the spfile read error.
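
While an instance is still reachable, the backlog that ends in ORA-00020 can be watched with v$resource_limit; a minimal sketch:

sqlplus -S / as sysdba <<'EOF'
set lines 120
col resource_name for a20
select resource_name, current_utilization, max_utilization, limit_value
  from v$resource_limit
 where resource_name in ('processes', 'sessions');
EOF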

-- DB1 ASM alert log

Fri Dec 01 09:42:00 2017
NOTE: ASM client anbob1:anbob disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_16769.trc
Fri Dec 01 09:45:33 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 213s, pid 16769 mbr 0x1
Fri Dec 01 09:46:33 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 273s, pid 16769 mbr 0x1
Fri Dec 01 09:47:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 333s, pid 16769 mbr 0x1
Fri Dec 01 09:48:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 393s, pid 16769 mbr 0x1
Fri Dec 01 09:48:55 2017
NOTE: timeout waiting for prior umbilllicus process to exit; 300s
Fri Dec 01 09:49:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 453s, pid 16769 mbr 0x1
Fri Dec 01 09:50:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 513s, pid 16769 mbr 0x1
Fri Dec 01 09:51:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 573s, pid 16769 mbr 0x1
Fri Dec 01 09:52:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 633s, pid 16769 mbr 0x1
Fri Dec 01 09:53:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 693s, pid 16769 mbr 0x1
Fri Dec 01 09:54:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 753s, pid 16769 mbr 0x1
Fri Dec 01 09:54:48 2017
NOTE: timeout waiting for prior umbilllicus process to exit; 300s
Fri Dec 01 09:55:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 813s, pid 16769 mbr 0x1
Fri Dec 01 09:56:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 873s, pid 16769 mbr 0x1
Fri Dec 01 09:57:32 2017
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14282.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.160.37/DATA_CD_05_kdexa1cel03 at offset 0 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:45 AU:0 offset:0 size:4096
path:o/192.168.160.37/DATA_CD_05_kdexa1cel03
         incarnation:0xe9681366 asynchronous result:'I/O error'
         subsys:OSS iop:0x7f163de0fa80 bufp:0x7f163ddbe400 osderr:0xc9 osderr1:0x0
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14282.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.160.37/DATA_CD_07_kdexa1cel03 at offset 0 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:41 AU:0 offset:0 size:4096
path:o/192.168.160.37/DATA_CD_07_kdexa1cel03
         incarnation:0xe9681362 asynchronous result:'I/O error'
         subsys:OSS iop:0x7f163de13680 bufp:0x7f163de78800 osderr:0xc9 osderr1:0x0
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14282.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.160.37/DATA_CD_10_kdexa1cel03 at offset 0 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:43 AU:0 offset:0 size:4096
path:o/192.168.160.37/DATA_CD_10_kdexa1cel03

...
...
...
Fri Dec 01 09:58:45 2017
NOTE: PST update grp = 1 completed successfully!
NOTE: initiating PST update: grp = 1, dsk = 38/0xe968135f, mask = 0x7e, op = clear
GMON updating disk modes for group 1 at 105 for pid 16, osid 13436
NOTE: group DATA: updated PST location: disk 0012 (PST copy 0)
NOTE: group DATA: updated PST location: disk 0008 (PST copy 1)
NOTE: cache closing disk 38 of grp 1: DATA_CD_03_KDEXA1CEL03
Fri Dec 01 09:58:45 2017
NOTE: process _b000_+asm1 (6940) initiating offline of disk 36.3915912029 (DATA_CD_08_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 37.3915912030 (DATA_CD_00_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 39.3915912032 (DATA_CD_09_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 40.3915912033 (DATA_CD_04_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 41.3915912034 (DATA_CD_07_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 42.3915912035 (DATA_CD_11_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 43.3915912036 (DATA_CD_10_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 44.3915912037 (DATA_CD_01_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 45.3915912038 (DATA_CD_05_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 46.3915912039 (DATA_CD_06_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 47.3915912040 (DATA_CD_02_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
WARNING: Disk 36 (DATA_CD_08_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 37 (DATA_CD_00_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 39 (DATA_CD_09_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 40 (DATA_CD_04_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 41 (DATA_CD_07_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 42 (DATA_CD_11_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 43 (DATA_CD_10_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 44 (DATA_CD_01_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 45 (DATA_CD_05_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 46 (DATA_CD_06_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 47 (DATA_CD_02_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 38 (DATA_CD_03_KDEXA1CEL03) in group 1 in mode 0x1 is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 36/0xe968135d, mask = 0x6a, op = clear

Note:
ASM was delayed while cleaning up the client process, then reported Cell storage I/O errors: read failures against CELL03 (192.168.160.37), after which all disks belonging to cel03 were taken OFFLINE. In v$asm_disk the mount_status of those disks shows "MISSING"; this can be queried in the database with the following script:

col PATH for a20
col DG_NAME for a15
col DG_STATE for a10
col FAILGROUP for a15
col name for a20
set lines 200 pages 10000

select dg.name dg_name, dg.state dg_state, dg.type, d.disk_number dsk_no,
d.path, d.mount_status, d.FAILGROUP,d.name, d.state 
from v$asm_diskgroup dg, v$asm_disk d
where dg.group_number=d.group_number
order by dg_name, dsk_no;
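
On Exadata each failure group normally maps to one cell, so a per-failgroup summary over the same views (a sketch) makes a missing cell stand out immediately:

sqlplus -S / as sysdba <<'EOF'
set lines 120
col dg_name for a15
col failgroup for a25
select dg.name dg_name, d.failgroup, d.mode_status, d.mount_status, count(*) disks
  from v$asm_diskgroup dg, v$asm_disk d
 where dg.group_number = d.group_number
 group by dg.name, d.failgroup, d.mode_status, d.mount_status
 order by 1, 2;
EOF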

-- db2 db alert log

-- node2 db alert log 
Fri Dec 01 09:45:14 2017
ORA-00020: maximum number of processes (2000) exceeded
 ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
Fri Dec 01 09:45:38 2017
Immediate Kill Session#: 191, Serial#: 8141
...
Immediate Kill Session: sess: 0xa51130b58  OS pid: 18573
Immediate Kill Session#: 2808, Serial#: 13305
Immediate Kill Session: sess: 0xa31411a00  OS pid: 21909
Immediate Kill Session#: 2948, Serial#: 11987
Immediate Kill Session: sess: 0xa4140b1d8  OS pid: 21913
Fri Dec 01 09:45:39 2017
opiodr aborting process unknown ospid (14209) as a result of ORA-28
Fri Dec 01 09:46:39 2017
DSKM process appears to be hung. Initiating system state dump.
Fri Dec 01 09:47:15 2017
IPC Send timeout detected. Sender: ospid 21743 [oracle@kdexa1db02.hebei.mobile.com (TNS V1-V3)]
Receiver: inst 1 binc 429466025 ospid 16596
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29713 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
...
IPC Send timeout detected. Sender: ospid 29739 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29709 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429466025 ospid 16591
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 201
IPC Send timeout to 1.1 inc 4 for msg type 36 from opid 148
IPC Send timeout: Terminating pid 201 osid 29739
IPC Send timeout: Terminating pid 148 osid 29709
...
IPC Send timeout detected. Sender: ospid 29785 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29783 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 230
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29789 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429466025 ospid 16596
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 227
IPC Send timeout to 1.2 inc 4 for msg type 32 from opid 232
IPC Send timeout: Terminating pid 230 osid 29785
IPC Send timeout: Terminating pid 227 osid 29783
IPC Send timeout: Terminating pid 232 osid 29789
...
Receiver: inst 1 binc 429466025 ospid 16591
IPC Send timeout to 1.1 inc 4 for msg type 32 from opid 321
IPC Send timeout: Terminating pid 321 osid 29832
Fri Dec 01 09:47:18 2017
IPC Send timeout detected. Sender: ospid 29836 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 340
IPC Send timeout: Terminating pid 340 osid 29836
Fri Dec 01 09:47:37 2017
IPC Send timeout to 1.2 inc 4 for msg type 32 from opid 55
IPC Send timeout: Terminating pid 55 osid 21743
Fri Dec 01 09:48:08 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/orarpt2/trace/orarpt2_pmon_18433.trc  (incident=594684):
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x9FD5C5C10], [3], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/orarpt2/incident/incdir_594684/orarpt2_pmon_18433_i594684.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/orarpt2/trace/orarpt2_pmon_18433.trc:
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x9FD5C5C10], [3], [], [], [], [], [], [], [], [], []
PMON (ospid: 18433): terminating the instance due to error 472
Fri Dec 01 09:48:09 2017
ORA-1092 : opitsk aborting process
Fri Dec 01 09:48:10 2017
ORA-1092 : opitsk aborting process
Fri Dec 01 09:48:10 2017
License high water mark = 1902
Instance terminated by PMON, pid = 18433
USER (ospid: 3194): terminating the instance
Instance terminated by USER, pid = 3194

Log on to CellServer03 and check its alert log

[root@kdexa1cel03 localhost]# cd trace
/var/log/oracle/diag/asm/cell/localhost/trace
[root@kdexa1cel03 trace]# ls -l alert.log 
-rw-rw---- 1 root celladmin 773 Feb  2  2012 alert.log
[root@kdexa1cel03 trace]# cat alert.log 
cat: alert.log: Input/output error
[root@kdexa1cel03 trace]# more alert.log 
Bus error

[root@kdexa1cel03 ~]# ls
ls: reading directory .: Input/output error

Note:
The local file system of CellServer03 was in a bad state. To restore service as quickly as possible, and since the DB could no longer communicate with ASM, we decided to restart CRS on node 1. After the CRS restart, database instance 1 came back to normal and the instance 2 startup that had been hanging also recovered; the only difference was that, with one cell missing, the total and free space of the ASM disk groups shrank. Note that at this point the ASM disk groups do not immediately DROP the disks or rebalance; the DISK_REPAIR_TIME attribute needs to be adjusted.

DISK_REPAIR_TIME

The DISK_REPAIR_TIME attribute of a disk group controls the maximum acceptable outage duration. Once one or more disks become unavailable to ASM, it will wait up to the interval specified by DISK_REPAIR_TIME for the disk(s) to come back online. If they return within this interval, a resync operation occurs, where only the extents modified while the disks were offline are written back to them. If they do not return within this interval, ASM initiates a forced drop of the disk(s), which triggers a rebalance to restore redundancy using the surviving disks. Once the disk(s) come back after that, they are added to the disk group with all of their existing extents discarded, and another rebalance begins. In other words, DISK_REPAIR_TIME is the window you have to fix the failure; it is also the countdown after which ASM drops the disk(s) that were taken offline. The default setting for DISK_REPAIR_TIME is 3.6 hours.

(1). Check disk_repair_time for all disk groups. Log in to the ASM instance and run the following query:

SQL> select dg.name, a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';

(2). Change disk_repair_time for all disk groups:

SQL> ALTER DISKGROUP <DISKGROUP NAME> SET ATTRIBUTE 'DISK_REPAIR_TIME'='8.5H';

If the disks belonging to the cell are already OFFLINE, as in this case, the method above can no longer be used: the cell itself is unavailable, and changing the attribute does not extend the timer of disks that are already offline. Instead, extend the drop deadline for the offline disks directly:

SQL>   ALTER DISKGROUP <DISKGROUP NAME> OFFLINE DISKS IN FAILGROUP <FAILGROUP NAME> DROP AFTER 24H;

Verify that all disk groups have been changed:

SQL>  select dg.name ,a.value from v$asm_diskgroup
  2          dg, v$asm_attribute a where dg.group_number=a.group_number and
  3          a.name ='disk_repair_time';

NAME                           VALUE
------------------------------ -------------------------------
DATA                           24h
DBFS_DG                        24h
FLASHTS                        24h
RECO                           24h
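
The per-disk countdown can be checked as well: the REPAIR_TIMER column of v$asm_disk shows the seconds left before ASM force-drops an offline disk (a sketch, run on the ASM instance):

sqlplus -S / as sysasm <<'EOF'
set lines 120
col name for a30
col failgroup for a25
select name, failgroup, mode_status, repair_timer
  from v$asm_disk
 where mode_status <> 'ONLINE';
EOF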

CellServer03 checks

1, no errors reported on the machine's front panel;
2, log in to the ILOM:

[root@kdexa1db01 ~]# ssh 192.168.60.xx
Password: 
Oracle(R) Integrated Lights Out Manager
Version 3.0.16.10.d r74499
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.


-> show /SP/faultmgmt

 /SP/faultmgmt
    Targets:
        shell
    Properties:
    Commands:
        cd
        show

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> fmadm faulty -a
No faults found
faultmgmtsp> 

3, check the OS-layer logs

[root@kdexa1cel03 /]# dmesg
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
...

[root@kdexa1cel03 /]# lsscsi
[0:0:20:0]   enclosu SUN      HYDE12           0341  -       
[0:2:0:0]    disk    LSI      MR9261-8i        2.13  /dev/sda
[0:2:1:0]    disk    LSI      MR9261-8i        2.13  /dev/sdb
[0:2:2:0]    disk    LSI      MR9261-8i        2.13  /dev/sdc
[0:2:3:0]    disk    LSI      MR9261-8i        2.13  /dev/sdd
[0:2:4:0]    disk    LSI      MR9261-8i        2.13  /dev/sde
[0:2:5:0]    disk    LSI      MR9261-8i        2.13  /dev/sdf
[0:2:6:0]    disk    LSI      MR9261-8i        2.13  /dev/sdg
[0:2:7:0]    disk    LSI      MR9261-8i        2.13  /dev/sdh
[0:2:8:0]    disk    LSI      MR9261-8i        2.13  /dev/sdi
[0:2:9:0]    disk    LSI      MR9261-8i        2.13  /dev/sdj
[0:2:10:0]   disk    LSI      MR9261-8i        2.13  /dev/sdk
[0:2:11:0]   disk    LSI      MR9261-8i        2.13  /dev/sdl
[1:0:0:0]    disk    Unigen   PSA4000          1100  /dev/sdm
[8:0:0:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdn
[8:0:1:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdo
[8:0:2:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdp
[8:0:3:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdq
[9:0:0:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdr
...

[root@kdexa1cel03 mptsas]# cat /proc/mdstat 
Personalities : [raid1] 
md4 : active raid1 sda1[2](F) sdb1[1]
      120384 blocks [2/1] [_U]
      
md5 : active raid1 sda5[0] sdb5[2](F)
      10485696 blocks [2/1] [U_]
      
md6 : active raid1 sda6[2](F) sdb6[1]
      10485696 blocks [2/1] [_U]
      
md7 : active raid1 sda7[0] sdb7[2](F)
      3145664 blocks [2/1] [U_]
      
md8 : active raid1 sda8[2](F) sdb8[1]
      3145664 blocks [2/1] [_U]
      
md2 : active raid1 sda9[0] sdb9[2](F)
      2097088 blocks [2/1] [U_]
      
md11 : active raid1 sda11[2](F) sdb11[1]
      5242752 blocks [2/1] [_U]
      
md1 : active raid1 sda10[2](F) sdb10[1]
      714752 blocks [2/1] [_U]
      
unused devices: <none>
	   
[root@kdexa1cel03 mptsas]# mdadm --detail /dev/md6
/dev/md6:
        Version : 0.90
  Creation Time : Thu Feb  2 15:16:27 2012
     Raid Level : raid1
     Array Size : 10485696 (10.00 GiB 10.74 GB)
  Used Dev Size : 10485696 (10.00 GiB 10.74 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 6
    Persistence : Superblock is persistent

    Update Time : Fri Dec  1 13:49:39 2017
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       22        1      active sync   /dev/sdb6

       2       8        6        -      faulty spare   /dev/sda6

With the local file system in this state, many logs could not be read at all; the only remaining option was to reboot the operating system.

Usually when Linux starts throwing "Bus error", the machine may no longer shut down cleanly; and indeed, after trying various shutdown commands we had to go to the machine room and press the power button. If the host carries no label, you first have to confirm which physical machine is CELL03, which can be done by lighting the cell host's locator LED. Similarly, when replacing a hard disk there is a way to light the disk's own LED (see the sketch after Method 2 below).

How to identify the cell host by lighting its locator LED:

Method 1

CellCLI> alter cell led on;
Cell kdexa1cel03 Locator LED switched on

CellCLI> alter cell led off;
Cell kdexa1cel03 Locator LED switched off

Method 2

[root@kdexa1db01 ~]# ssh kdexa1cel03-ilom
Password:

Oracle(R) Integrated Lights Out Manager

Version 3.0.16.10.d r74499
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.
-> set /SYS/LOCATE value=fast_blink
Set 'value' to 'fast_blink'

-> set /SYS/LOCATE value=off
Set 'value' to 'off'
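
For the disk LED mentioned earlier, the physical disk's service LED can be lit in a similar way (a sketch; 20:2 is only a placeholder slot name, and the SERVICELED attribute is assumed to be available in your cell software release):

# Run on the cell; get the real disk name from "list physicaldisk" first
cellcli -e "alter physicaldisk 20:2 serviceled on"
cellcli -e "alter physicaldisk 20:2 serviceled off"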

After power-cycling the cell03 host, the machine came back up normally, the cell server recovered, and its disks automatically rejoined the ASM disk groups -- provided DISK_REPAIR_TIME has not been exceeded.
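
Whether the grid disks really rejoined and the resync finished can be verified with something like the following (a sketch; attribute and view names assumed unchanged in your release):

# On the cell: all grid disks should end up with asmmodestatus=ONLINE
cellcli -e "list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome"

# On the ASM instance: nothing should stay non-ONLINE once the resync completes
sqlplus -S / as sysasm <<'EOF'
select name, failgroup, mode_status, mount_status
  from v$asm_disk
 where mode_status <> 'ONLINE';
EOF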

Collect logs and open an SR

1, cell alert log
# /var/log/oracle/diag/asm/cell/localhost/trace

2, sundiag
# /opt/oracle.SupportTools/sundiag.sh

3, OS hardware log
# dmesg or cat /var/log/dmesg
# cat /var/log/messages

4, ILOM snapshot
Via the ILOM web interface (steps omitted here).
Via the shell:

# ssh kdexa1cel03-ilom
Password:
Oracle(R) Integrated Lights Out Manager

Version 3.0.16.10.d r74499
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.
-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://root:welcome@192.168.60.42/tmp
-> cd /SP/diag/snapshot
-> show

Check the status a few times; after several minutes it will show "Snapshot Complete", and a compressed file will have been generated and transferred via SFTP to the /tmp directory of the remote host above.

5, sosreport as root user
# sosreport


Summary:

A file system fault on the cell03 storage server caused ASM to hang and the database instances to crash. Although the disk groups use NORMAL redundancy, the instances could not communicate with ASM at that point; restarting CRS brought them back, and the remaining two cells continued to serve the database. After extending disk_repair_time, waiting, and then forcing a reboot of the CELL03 operating system, everything returned to normal. The Exadata hardware team later found no hardware fault; the Oracle Linux team suspected a file system problem and said it could be repaired with e2fsck (not attempted), and it was never confirmed whether the root cause lay in the software RAID at the MD layer.
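
For reference only, the repair suggested by the Oracle Linux team (not attempted in this case) would look roughly like this against the degraded md device; /dev/md6 and /dev/sda6 are taken from the mdstat output above purely as an example, and this is a sketch rather than a verified procedure:

# Assumes the file system is unmounted (or the cell is booted into a rescue environment)
umount /dev/md6
e2fsck -f -y /dev/md6               # force a full ext file system check and repair
mdadm /dev/md6 --remove /dev/sda6   # drop the faulty mirror member ...
mdadm /dev/md6 --add /dev/sda6      # ... then re-add it so the RAID1 pair resyncs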

