首页 » ORACLE » Troubleshooting One RAC node reboot with ora-600 [kjdrisRMnovalid:msgRMno]

Troubleshooting One RAC node reboot with ora-600 [kjdrisRMnovalid:msgRMno]

One RAC node reboot with ora-600 [kjdrisRMnovalid:msgRMno]

今天一套 2节点的11.2.0.3 RAC 数据库计划在0点的维护窗口做一些参数优化,其中包括禁用DRM ,所以在下午提前修改了参数scope=spfile等晚上重启时生效,计划赶不上变化,就在大伙等待0点到来时,突然节点2主机重启,随后节点1跟随instance reboot(主机未重启),在Node1 DB alert.log 发现了ora-600 [kjdrisRMnovalid:msgRMno]. 下面记录一下这个案例原因。

Note:节点2主机原因不在本篇范围内,只分析节点1实例重启

禁用DRM 使用语句

Disable DRM in 11g Release 1 and onward

alter system set “_gc_undo_affinity”=FALSE scope=spfile sid=’*';
alter system set “_gc_policy_time”=0 scope=spfile sid=’*';

TIP:
Disable DRM In 10g

alter system set “_gc_undo_affinity”=FALSE scope=spfile sid=’*';
alter system set “_gc_affinity_time”=0 scope=spfile sid=’*';

# 2 instances of RAC startup time

SQL>  select inst_id,startup_time from gv$instance;

   INST_ID STARTUP_TIME
---------- -------------------
         1 2014-12-23 19:22:06
         2 2014-12-23 19:21:24

# Node1 alert log
Mon Dec 22 11:38:08 2014

Started redo scan
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Completed redo scan
read 49416 KB redo, 21194 data blocks need recovery
Reconfiguration complete
Started redo application at
Thread 2: logseq 11964, block 2039924
Recovery of Online Redo Log: Thread 2 Group 24 Seq 11964 Reading mem 0
Mem# 0: +ACCTDG1/anbob/onlinelog/group_24.524.837100973
Mem# 1: +ACCTDG2/anbob/onlinelog/group_24.519.837100979
Completed redo application of 29.93MB
Completed instance recovery at
Thread 2: logseq 11964, block 2138757, scn 14112071224855
18297 data blocks read, 37970 data blocks written, 49416 redo k-bytes read
Thread 2 advanced to log sequence 11965 (thread recovery)
Tue Dec 23 19:09:55 2014
minact-scn: master continuing after IR
Tue Dec 23 19:14:52 2014

Thread 1 advanced to log sequence 99402 (LGWR switch)
Current log# 16 seq# 99402 mem# 0: +ACCTDG1/anbob/onlinelog/group_16.516.837100441
Current log# 16 seq# 99402 mem# 1: +ACCTDG2/anbob/onlinelog/group_16.511.837100447
Tue Dec 23 19:21:35 2014

***********************************************************************

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 – Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 – Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 – Production
Time: 23-DEC-2014 19:21:35
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 78
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=136.142.29.79)(PORT=64736))
Tue Dec 23 19:21:45 2014
Reconfiguration started (old inc 148, new inc 150)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Dec 23 19:21:47 2014
Tue Dec 23 19:21:47 2014
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Dec 23 19:21:47 2014
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Dec 23 19:21:47 2014
LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Tue Dec 23 19:21:48 2014
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:150 errcnt:0
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms1_47907010.trc (incident=747922):
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_747922/anbob1_lms1_47907010_i747922.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms2_51708066.trc (incident=747930):
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_747930/anbob1_lms2_51708066_i747930.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms3_41746780.trc (incident=747938):
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_747938/anbob1_lms3_41746780_i747938.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Dec 23 19:21:53 2014
Dumping diagnostic data in directory=[cdmp_20141223192153], requested by (instance=1, osid=47907010 (LMS1)), summary=[incident=747922].
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms1_47907010.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms2_51708066.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms3_41746780.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
Tue Dec 23 19:21:55 2014

***********************************************************************

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 – Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 – Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 – Production
Time: 23-DEC-2014 19:21:55
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 78
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=136.142.29.169)(PORT=47090))
Tue Dec 23 19:21:58 2014
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms1_47907010.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
LMS1 (ospid: 47907010): terminating the instance due to error 484
Tue Dec 23 19:22:02 2014
ORA-1092 : opitsk aborting process
Tue Dec 23 19:22:02 2014
ORA-1092 : opitsk aborting process
Tue Dec 23 19:22:02 2014
ORA-1092 : opitsk aborting process
Tue Dec 23 19:22:02 2014
ORA-1092 : opitsk aborting process
Tue Dec 23 19:22:02 2014
License high water mark = 1568
Instance terminated by LMS1, pid = 47907010
USER (ospid: 50331738): terminating the instance
Instance terminated by USER, pid = 50331738
Tue Dec 23 19:22:06 2014
Starting ORACLE instance (normal)

# /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms1_47907010.trc

*** 2014-12-23 19:21:58.074
lms abort after exception 600
KJC Communication Dump:
state 0x35 flags 0x0 mode 0x0 inst 0 inc 150
nrcv 5 nsp 5 nrcvbuf 1000
reg_msg: sz 1240 cur 1760 (s:0 i:1760) max 3425 ini 6730
big_msg: sz 8440 cur 77 (s:0 i:77) max 3995 ini 6730
rsv_msg: sz 8440 cur 0 (s:0 i:0) max 0 tot 1000
rcvr: id 0 orapid 12 ospid 10158530
rcvr: id 3 orapid 15 ospid 51708066
rcvr: id 1 orapid 13 ospid 5571150
rcvr: id 4 orapid 16 ospid 41746780
rcvr: id 2 orapid 14 ospid 47907010
send proxy: id 0 ndst 1 (2:0 )
send proxy: id 3 ndst 1 (2:3 )
send proxy: id 1 ndst 1 (2:1 )
send proxy: id 4 ndst 1 (2:4 )
send proxy: id 2 ndst 1 (2:2 )
GES resource limits:
ges resources: cur 121439 max 280102 ini 86629
ges enqueues: cur 68456 max 94564 ini 132386
ges cresources: cur 15126 max 57610
gcs resources: cur 2448694 max 4146208 ini -1
gcs shadows: cur 1374260 max 1510396 ini -1
KJCTS state: seq-check:no timeout:yes waitticks:0x3 highload no

#255 id 371 exid 69075 dests 00
kjctseventdump-end tail 50 heads 50 @ 0 50 @ 134050450
End of KJC Communication Dump
DEFER MSG QUEUE 0 ON LMS1 IS EMPTY
SEQUENCES:
[sidx 0] 0:0.0 1:0.0
DEFER MSG QUEUE 1 ON LMS1 IS EMPTY
SEQUENCES:
[sidx 1] 0:0.0 1:15.0
DEFER MSG QUEUE 2 ON LMS1 IS EMPTY
SEQUENCES:
[sidx 2] 0:0.0 1:15.0
error 484 detected in background process
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [63056], [0], [], [], [], [], [], [], [], []
kjzduptcctx: Notifying DIAG for crash event
—– Abridged Call Stack Trace —–
ksedsts()+360<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+100<-ksuitm()+5136<-ksbrdp()+4508<-opirip()+1624<-opidrv()+608<-sou2o()+
136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112
—– End of Abridged Call Stack Trace —–

*** 2014-12-23 19:21:58.104
LMS1 (ospid: 47907010): terminating the instance due to error 484
ksuitm: waiting up to [5] seconds before killing DIAG(3408628)

— check parameter value after instance reboot

SQL> select value,inst_id,ISINSTANCE_MODIFIABLE from gv$parameter where name='_gc_undo_affinity';

VALUE                   INST_ID ISINS
-------------------- ---------- -----
FALSE                         2 FALSE
FALSE                         1 FALSE

SQL> select value,inst_id,ISINSTANCE_MODIFIABLE from gv$parameter where name='_gc_policy_time';

VALUE                   INST_ID ISINS
-------------------- ---------- -----
0                             1 FALSE
0                             2 FALSE

SQL> select * from REGISTRY$HISTORY

ACTION_TIME                    ACTION     NAMESPACE  VERSION            ID COMMENTS        BUNDLE_SERIES
------------------------------ ---------- ---------- ---------- ---------- --------------- ------------------------------
05-AUG-13 01.24.15.969777 AM   APPLY      SERVER     11.2.0.3            6 PSU 11.2.0.3.6  PSU

MOS 有记录一个EXADATA MACHINE 的bug.

Issue

Bug 16888850 – If RAC dynamic resource management (DRM) is disabled (i.e. initialization parameters “_gc_undo_affinity=FALSE” and “_gc_policy_time=0”), RAC rolling upgrade from 11.2.0.3.9, or earlier, to 11.2.0.3.10, or later, will fail. A database instance running the later software fails to start with error ORA-600 [kjdrisRMnovalid:msgRMno].

Fix or Workaround

Perform the upgrade non-rolling
– or –
Before performing RAC rolling upgrade, enable DRM by unsetting _gc_undo_affinity, or _gc_policy_time, or both.

Summary:
Disable DRM 的参数不是实例级的,在当同一实例中open节点间的DRM参数不一致时会异常如本案例的ORA-600 [kjdrisRMnovalid:msgRMno], DRM 不允许rolling Disable. 所以当节点2重启后发现节点1参数不一致,后节点1也因ORA-600 导致LMS进程Abort,随后instance reboot. 重启后实例DRM 参数一致, 维护工作也“被”提前完成。

打赏

,

对不起,这篇文章暂时关闭评论。