Troubleshooting 因存储断电导致数据库ora-600 [2131], [33], [32] & ORA-600 [kffbAddBlk04] ORA-15081

简单记录一起存储突然断电导致的11.2.0.4 2-Nodes RAC恢复案例。

存储恢复后第一次尝试启动数据库

Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /s01/oracle/app/oracle/product/11.2.0/dbhome_1
System name:Linux
Node name:anbob01
Release:2.6.32-358.el6.x86_64
Version:#1 SMP Tue Jan 29 11:47:41 EST 2013
Machine:x86_64
VM name:VMWare Version: 6
Using parameter settings in server-side pfile /s01/oracle/app/oracle/product/11.2.0/dbhome_1/dbs/initanbob1.ora
System parameters with non-default values:
  processes                = 1000
  sessions                 = 1536
  memory_target            = 19200M
  control_files            = "+DATADG/anbob/controlfile/current.260.897563029"
  db_block_size            = 8192
  compatible               = "11.2.0.4.0"
  log_archive_dest_1       = "location=+DATADG/arch"
  log_archive_format       = "%t_%s_%r.arc"
  cluster_database         = FALSE
  db_create_file_dest      = "+DATADG"
  thread                   = 1
  undo_tablespace          = "UNDOTBS1"
  instance_number          = 1
  remote_login_passwordfile= "EXCLUSIVE"
  db_domain                = ""
  dispatchers              = "(PROTOCOL=TCP) (SERVICE=anbobXDB)"
  remote_listener          = "anbob-cluster-scan:1521"
  audit_file_dest          = "/s01/oracle/app/oracle/admin/anbob/adump"
  audit_trail              = "DB"
  db_name                  = "anbob"
  open_cursors             = 300
  diagnostic_dest          = "/s01/oracle/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.15.56
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=6300
2019-09-10 15:11:39.825000 +08:00
PSP0 started with pid=3, OS id=6302
2019-09-10 15:11:40.866000 +08:00
VKTM started with pid=4, OS id=6305 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
GEN0 started with pid=5, OS id=6309
DIAG started with pid=6, OS id=6311
DBRM started with pid=7, OS id=6313
PING started with pid=8, OS id=6315
ACMS started with pid=9, OS id=6317
DIA0 started with pid=10, OS id=6319
LMON started with pid=11, OS id=6321
LMD0 started with pid=12, OS id=6323
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [15360 - 20480]
RMS0 started with pid=13, OS id=6325
LMHB started with pid=14, OS id=6327
MMAN started with pid=15, OS id=6329
DBW0 started with pid=16, OS id=6331
DBW1 started with pid=17, OS id=6333
LGWR started with pid=18, OS id=6335
CKPT started with pid=19, OS id=6337
SMON started with pid=20, OS id=6339
RECO started with pid=21, OS id=6341
RBAL started with pid=22, OS id=6343
ASMB started with pid=23, OS id=6345
MMON started with pid=24, OS id=6347
MMNL started with pid=25, OS id=6351
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
NOTE: initiating MARK startup
starting up 1 shared server(s) ...
Starting background process MARK
MARK started with pid=27, OS id=6355
NOTE: MARK has subscribed
lmon registered with NM - instance number 1 (internal mem no 0)
2019-09-10 15:11:42.035000 +08:00
Reconfiguration started (old inc 0, new inc 2)
List of instances:
 1 (myinst: 1)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
Reconfiguration complete
ORACLE_BASE from environment = /s01/oracle/app/oracle
ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.129.252.206)(PORT=1521))' SCOPE=MEMORY SID='anbob1';
2019-09-10 15:12:07.811000 +08:00
alter database mount
NOTE: Loaded library: System
SUCCESS: diskgroup DATADG was mounted
2019-09-10 15:12:16.746000 +08:00
NOTE: dependency between database anbob and diskgroup resource ora.DATADG.dg is established
2019-09-10 15:12:21.022000 +08:00
Errors in file /s01/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_6376.trc  (incident=304298):
ORA-00600: internal error code, arguments: [2131], [33], [32], [], [], [], [], [], [], [], [], []
Incident details in: /s01/oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_304298/anbob1_ora_6376_i304298.trc
Dumping diagnostic data in directory=[cdmp_20190910151221], requested by (instance=1, osid=6376), summary=[incident=304298].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ORA-600 signalled during: alter database mount...
2019-09-10 15:12:41.193000 +08:00
Shutting down instance (immediate)
Shutting down instance: further logons disabled
Stopping background process MMNL
Stopping background process MMON
2019-09-10 15:12:43.200000 +08:00
License high water mark = 3
All dispatchers and shared servers shutdown
ALTER DATABASE CLOSE NORMAL
ORA-1507 signalled during: ALTER DATABASE CLOSE NORMAL...
2019-09-10 15:12:48.269000 +08:00
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Stopping background process VKTM
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
NOTE: Shutting down MARK background process
2019-09-10 15:12:50.979000 +08:00
freeing rdom 0
2019-09-10 15:12:54.642000 +08:00
Instance shutdown complete

Note:
提示因ORA-600 [2131] 致命错误实例再次crash, 因为存储IO问题导致控制文件损坏，尝试重建控制文件从自动备份的文件。

SQL> startup nomount
ORACLE instance started.

Total System Global Area 2.0043E+10 bytes
Fixed Size                  2261848 bytes
Variable Size            1.1744E+10 bytes
Database Buffers         8254390272 bytes
Redo Buffers               42065920 bytes
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
[oracle@anbob01 dbs]$ rman target /
Recovery Manager: Release 11.2.0.4.0 - Production on Tue Sep 10 15:16:06 2019
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
connected to target database: anbob (not mounted)

RMAN> restore controlfile from '/s01/oracle/app/oracle/product/11.2.0/dbhome_1/dbs/snapcf_anbob1.f';

RMAN> alter database mount;

RMAN> recover database;

RMAN> alter database open resetlogs;
database opened

Note:
到这里数据库已打开，但是没过多久数据库再次crash.

2019-09-10 15:18:59.733000 +08:00
alter database open resetlogs
RESETLOGS after complete recovery through change 3673444129
Archived Log entry 31169 added for thread 1 sequence 19643 ID 0x5e3f8411 dest 1:
2019-09-10 15:19:01.544000 +08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
NOTE: ASM instance returned error dumped to trace file /s01/oracle/app/oracle/diag/rdbms/anbob/ANBOB1/trace/ANBOB1_ora_7530.trc
Errors in file /s01/oracle/app/oracle/diag/rdbms/anbob/ANBOB1/trace/ANBOB1_ora_7530.trc:
ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600], [ORA-00600: internal error code, arguments: [kffbAddBlk04], [], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Unable to create archive log file '+DATADG/arch/1_19645_897563030.arc'
ARCH: Error 19504 Creating archive log file to '+DATADG/arch/1_19645_897563030.arc'
2019-09-10 15:22:42.274000 +08:00
Completed: alter database open resetlogs
Starting background process CJQ0
CJQ0 started with pid=45, OS id=8726
2019-09-10 15:26:29.596000 +08:00
WARNING: Read Failed. group:1 disk:0 AU:14724 offset:458752 size:16384
WARNING: failed to read mirror side 1 of virtual extent 5801 logical extent 0 of file 269 in group [1.2114156029] from disk DATADG_0000  allocation unit 14724 reason error; if possible, will try another mirror side
WARNING: Write Failed. group:1 disk:0 AU:1430 offset:863744 size:55808
Errors in file /s01/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lgwr_6915.trc:
ORA-15080: synchronous I/O operation to a disk failed
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 55808
WARNING: failed to write mirror side 1 of virtual extent 58 logical extent 0 of file 261 in group 1 on disk 0 allocation unit 1430
Errors in file /s01/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lgwr_6915.trc:
ORA-00345: redo log write error block 120471 count 109
ORA-00312: online log 1 thread 1: '+DATADG/anbob/onlinelog/group_1.261.897563031'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
Errors in file /s01/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lgwr_6915.trc:
ORA-00340: IO error processing online log 1 of thread 1
ORA-00345: redo log write error block 120471 count 109
ORA-00312: online log 1 thread 1: '+DATADG/anbob/onlinelog/group_1.261.897563031'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
Errors in file /s01/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_j001_8771.trc:
ORA-00449: 后台进程 'LGWR' 因错误 340 而异常终止
ORA-01115: 从文件  读取块时出现 IO 错误 (块 # )
ORA-01110: 数据文件 7: '+DATADG/anbob/datafile/prd_soainfra.269.897845057'
ORA-15081: 无法将 I/O 操作提交到磁盘
LGWR (ospid: 6915): terminating the instance due to error 340

Note:
在ASM存储读写时再次发生错误，有出现ORA-600 [kffbAddBlk04]和LGWR写进程失败导致实例再次crash, 这是因为存储异常断电导致ASM Metadata数据损坏，剩下的工作重构ASM DISKGROUP即可，启动后记的先禁用那个自动任务防止期间再次CRASH，最终数据库里的数据已全部恢复可以正常读写，较为幸运，但同存储上另外的库就没那么幸运了。

如果您解决不了，可以联系www.anbob.com 首页上的联系方式。