Alert: In Oracle ADG, if the redo apply instance crashes, all other instances will go from ‘OPEN’ to ‘MOUNT’

Today, on a 2-node 11gR2 RAC ADG environment, node 1 (the redo apply node) crashed abnormally due to a hardware fault, but the applications connected to instance 2 were also disconnected (both instances had been OPEN, and the ADG was serving some read-only workload). Instance 2's alert log showed no obvious sign of a manual instance close, yet the instance automatically dropped back to MOUNT state. Isn't RAC supposed to be highly available? Why does the death of one node drag the other node down with it?

Checking instance 2's status at that point would in fact show "MOUNT". Not many people know that the database has an ALTER DATABASE CLOSE command, but once a database has been closed within an instance's lifetime it cannot be opened again in that same lifetime. And as just mentioned, instance 2's alert log showed no trace of a manual close. An excerpt follows:
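As a quick check, the status can be confirmed from SQL*Plus on the surviving instance (a sketch; the instance name is taken from this environment, and the exact error on re-open may vary by version):

```sql
-- On the surviving standby instance (instance 2), confirm the status:
SELECT instance_name, status FROM v$instance;
-- INSTANCE_NAME  STATUS
-- tbcsc2         MOUNTED

-- After a database has been closed within an instance's lifetime,
-- a direct re-open typically fails (e.g. with ORA-16196: database has
-- been previously opened and closed); the instance must be restarted.
ALTER DATABASE OPEN READ ONLY;
```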

# tbcsc2 dg
2020-09-03 00:10:39.601000 +08:00
ORA-01555 caused by SQL statement below (SQL ID: 4snkhx5vxrmv2, Query Duration=7340 sec, SCN: 0x0f46.c04d2402):
select....
2020-09-04 11:35:29.504000 +08:00
Archived Log entry 105312 added for thread 2 sequence 200520 ID 0x1fcb56a7 dest 1:
2020-09-08 14:16:17.954000 +08:00
Reconfiguration started (old inc 22, new inc 24)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 23 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 34 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 1: 19 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
2020-09-08 14:16:19.005000 +08:00
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
2020-09-08 14:16:22.014000 +08:00
ARC1: Becoming the active heartbeat ARCH
ARC1: Becoming the active heartbeat ARCH
2020-09-08 14:16:23.328000 +08:00
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
2020-09-08 14:16:24.368000 +08:00
Reconfiguration complete
Recovery session aborted due to instance crash
Close the database due to aborted recovery session
SMON: disabling tx recovery
2020-09-08 14:16:54.955000 +08:00
Stopping background process MMNL
Stopping background process MMON
2020-09-08 14:17:26.530000 +08:00
Background process MMON not dead after 30 seconds
Killing background process MMON
2020-09-08 14:18:04.907000 +08:00
Starting background process MMON
MMON started with pid=27, OS id=18743
Starting background process MMNL
MMNL started with pid=1865, OS id=18745
CLOSE: killing server sessions.
2020-09-08 14:18:07.003000 +08:00
Active process 3858 user 'grid' program 'oracle@anbob2'
Active process 14847 user 'grid' program 'oracle@anbob2'
Active process 3435 user 'grid' program 'oracle@anbob2'
Active process 25029 user 'grid' program 'oracle@anbob2'
Active process 9789 user 'grid' program 'oracle@anbob2'
Active process 23815 user 'grid' program 'oracle@anbob2'
...
Active process 24285 user 'itmuser' program 'oracle@anbob2 (TNS V1-V3)'
Active process 10045 user 'grid' program 'oracle@anbob2'
Active process 24229 user 'grid' program 'oracle@anbob2'
CLOSE: all sessions shutdown successfully.
SMON: disabling tx recovery
SMON: disabling cache recovery
2020-09-08 14:19:06.638000 +08:00


2020-09-08 14:26:56.608000 +08:00
alter database recover managed standby database using current logfile disconnect from session
Attempt to start background Managed Standby Recovery process (tbcsc2)
MRP0 started with pid=73, OS id=9746
MRP0: Background Managed Standby Recovery process started (tbcsc2)
2020-09-08 14:27:01.712000 +08:00
 started logmerger process
Managed Standby Recovery starting Real Time Apply
2020-09-08 14:27:08.171000 +08:00
Parallel Media Recovery started with 32 slaves
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Recovery of Online Redo Log: Thread 2 Group 41 Seq 201120 Reading mem 0
  Mem# 0: /dev/yyc_oravg04/ryyc_redo41
Media Recovery Log /yyc1_arch/arch_1_288357_920590168.arc
Completed: alter database recover managed standby database using current logfile disconnect from session
2020-09-08 14:27:29.221000 +08:00

-- Because the archived logs from host 1 were missing, redo could not be applied, so we cancelled redo apply and aborted the database
alter database recover managed standby database cancel
2020-09-08 14:31:57.640000 +08:00


WARNING: inbound connection timed out (ORA-3136)
2020-09-08 14:33:30.958000 +08:00
Shutting down instance (abort)

So "Close the database due to aborted recovery session" gives the reason: the database was closed because the recovery session was aborted. This is in fact expected behavior for a RAC ADG. Here I have to grumble that MOS document titles are written for Oracle engineers or specialists and can be baffling: the 12c alert log relocation is titled "12.1.0.2 Oracle Clusterware Diagnostic and Alert Log Moved to ADR (Doc ID 1915729.1)", and reading OS file contents from inside the database is called an "external table" (as if anyone knows what an "internal table" would be). For this particular problem, though, the title is not far off: Active Data Guard RAC Standby – Apply Instance Node Failure Impact (Doc ID 1357597.1) gives a clear explanation.

In short, when the redo apply instance terminates abnormally, all other instances that are OPEN READ ONLY will be closed. In a RAC ADG environment, an instance that crashes in the middle of applying redo leaves its cache fusion locks behind on the surviving instances, so queries there could see inconsistent data; the entire database therefore has to be closed and reopened to bring the buffer cache and the datafiles back to a consistent state. If DG Broker is configured (on versions above 11.2.0.2), this is done automatically; without it, simply open the surviving instance manually, then run the redo apply command again to continue applying logs on the surviving node.
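The manual recovery on a surviving node can be sketched as follows (the apply command is the same one seen in the alert log above):

```sql
-- Re-open the surviving standby instance for read-only access:
ALTER DATABASE OPEN READ ONLY;

-- Then restart redo apply on this node:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;
```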

The relevant MOS excerpt:
Symptoms
In an Active Data Guard RAC standby, if the redo apply instance crashes, all other instances of that standby that were open Read Only will be closed and returned to the MOUNT state. This disconnects all readers of the Active Data Guard standby.

Cause
In an Active Data Guard RAC standby, if the redo apply instance crashes in the middle of media recovery, it leaves the RAC cache fusion locks on the surviving instances and the data files on disk in an in-flux state. In such a situation, queries on the surviving instances can potentially see inconsistent data. To resolve this in-flux state, the entire standby database is closed. Upon opening the first instance after such a close, the buffer caches and datafiles are made consistent again.

Starting with 12.1, a new feature called "ADG instance recovery" addresses exactly this problem of a redo apply instance crash forcing the other instances to close. From 12.1 onward, a surviving ADG instance automatically performs ADG instance recovery to guarantee data consistency; this can be seen in the instance alert log as "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery". The instances then remain OPEN READ ONLY, so the read-only workload on the ADG is no longer interrupted, and if DG Broker is configured it automatically starts MRP on one of the surviving instances so that redo apply continues. The feature was also backported to 11.2.0.4, provided a recent PSU containing the fixes for bug 18331944 and bug 19516448 is installed and the hidden parameter "_adg_instance_recovery"=TRUE is set (by default the surviving instances are still closed).
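On 11.2.0.4 with the required PSU, enabling the backported behavior could look like this (a sketch; hidden ("_") parameters should only be changed under Oracle Support's guidance):

```sql
-- Check the current value of the hidden parameter (x$ views require SYS):
SELECT a.ksppinm name, b.ksppstvl value
  FROM x$ksppi a, x$ksppsv b
 WHERE a.indx = b.indx
   AND a.ksppinm = '_adg_instance_recovery';

-- Enable ADG instance recovery on all RAC instances:
ALTER SYSTEM SET "_adg_instance_recovery" = TRUE SCOPE=BOTH SID='*';
```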

But from 12.1, when the apply instance crashes in the middle of applying changes, one of the remaining open instances will be automatically posted to do "ADG instance recovery". We can see this by checking the alert log for messages like "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery". If DG Broker is enabled, then Broker will start the MRP on any of the surviving instances.
