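Troubleshooting an Oracle 19c RAC DB crash after ORA-600 [kjblpgorm:!antilock] and startup failure with ORA-600 [kfmdPriRegRclient04]
Recently, one node of a customer's three-node Oracle 19c RAC crashed unexpectedly with ORA-600 [kjblpgorm:!antilock]. When the instance was restarted it reported ORA-600 [kfmdPriRegRclient04], and during that startup it also caused the previously surviving nodes to hang and restart. Oracle base releases tend to carry a lot of bugs. I was asked to analyze the problem and put a temporary fix in place; this is a brief record of it.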
— version 19.3
ORA-00600: internal error code, arguments: [kjblpgorm:!antilock]
db alert log
2025-06-06T13:05:24.742963+08:00
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_lms3_3782790_3782800.trc (incident=656189):
ORA-00600: internal error code, arguments: [kjblpgorm:!antilock], [680864], [115], [0], [11], [1269857], [5], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob3/incident/incdir_656189/anbob3_lms3_3782790_3782800_i656189.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2025-06-06T13:05:26.709825+08:00
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_lms3_3782790_3782800.trc:
ORA-00600: internal error code, arguments: [kjblpgorm:!antilock], [680864], [115], [0], [11], [1269857], [5], [], [], [], [], []
2025-06-06T13:05:26.848797+08:00
Dumping diagnostic data in directory=[cdmp_20250606130526], requested by (instance=3, osid=3782800 (LMS3)), summary=[incident=656189].
2025-06-06T13:05:28.612222+08:00
opidrv aborting process LMS3 ospid (3782790_3782800) as a result of ORA-600
2025-06-06T13:05:28.653226+08:00
PMON (ospid: ): terminating the instance due to ORA error
Note:
The LMS background process hit the ORA-600 error and was aborted, after which the instance was terminated.
Format: ORA-600 [kjblpgorm:!antilock] [a] [b] [c] [d] [e]
a = id1
b = id2
c = pkey pdb
d = tablespace number
e = object #
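To make the format above concrete, here is a minimal Python sketch that maps the bracketed arguments from the alert log line onto the names a to e. The function name `decode_antilock_args` and the field names are hypothetical, and the sketch assumes the arguments are printed in decimal. The alert log carries a sixth non-empty argument that the format list does not describe, so it is kept undecoded.

```python
import re

# Field names follow the "Format" list above; only a..e are documented there.
FIELD_NAMES = ["id1", "id2", "pkey_pdb", "tablespace_number", "object_number"]


def decode_antilock_args(line: str) -> dict:
    """Map the ORA-600 [kjblpgorm:!antilock] arguments to named fields."""
    m = re.search(r"\[kjblpgorm:!antilock\]((?:,\s*\[[^\]]*\])*)", line)
    if m is None:
        raise ValueError("not a kjblpgorm:!antilock error line")
    values = re.findall(r"\[([^\]]*)\]", m.group(1))
    # Assumption: arguments are decimal integers; empty brackets are skipped.
    decoded = {name: int(val) for name, val in zip(FIELD_NAMES, values) if val}
    # Anything beyond the five documented arguments is kept as raw text.
    decoded["undocumented"] = [v for v in values[len(FIELD_NAMES):] if v]
    return decoded


line = ("ORA-00600: internal error code, arguments: [kjblpgorm:!antilock], "
        "[680864], [115], [0], [11], [1269857], [5], [], [], [], [], []")
print(decode_antilock_args(line))
```

For the incident above this yields tablespace number 11 and object # 1269857, which narrows down the object that was involved when LMS3 failed.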
kjblpgorm antilock (kjbl)pgorm antilock – kernel lock management global cache service lock table ?
| NB | Prob | Bug | Fixed | Description |
| | II | 29531836 | 19.6, 20.1 | RAC INSTANCE PRODUCES ORA-00600 [kjblpgorm:!antilock] |
| | III | 36354638 | 19.25, 23.4 | LMS Hit ORA-00600: internal error code, arguments: [kjblpgorm:!antilock] |
| | IIII | 35843249 | 19.22, 23.4 | [RAC] LMS Hit ORA-600[kjblpgorm:!antilock] |
| | III | 35151872 | 19.22 | HA Mode: Hit ORA-600 [kjblpgorm:!antilock] |
| | III | 32783456 | 19.17 | ORA-600[kjblpgorm:!antilock] Instance Crash |
| | III | 29646315 | 19.20, 20.1 | ASM, DB LMS HIT ORA-600[KJBLPGORM:!ANTILOCK] |
| | IIII | 29464779 | 12.1.0.2.200714, 12.2.0.1.DBRU:200114, 18.18, 18.8, 19.4, 20.1 | LMS: ORA-600 [kjblpgorm:!antilock] crashing the instance, ORA-600 [3020], ORA-752 during media recover |
| | III | 29372069 | 12.2.0.1.DBRU:200114, 18.11, 18.18, 19.8, 20.1 | Instance Crash With ORA-600[kjblpgorm:!antilock] |
| | III | 29038730 | 19.12, 20.1 | Hitting the ORA-600[kjblpgorm:!antilock] followed by instance crash |
| | IIII | 35045932 | 19.21 | [RAC] Instance crash after ORA-600 [kjblpgorm:!antilock] |
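As a rough illustration of how exposed a 19.3 base release is, the sketch below compares the installed RU with the "Fixed" column of the table. The bug numbers and fix versions are copied from the table; the helper names are hypothetical, only the 19.x entry of each cell is considered, and one-off patches are ignored.

```python
# (bug number -> "Fixed" column), copied from the table above.
FIXED_IN = {
    29531836: "19.6, 20.1",
    36354638: "19.25, 23.4",
    35843249: "19.22, 23.4",
    35151872: "19.22",
    32783456: "19.17",
    29646315: "19.20, 20.1",
    29464779: "12.1.0.2.200714, 12.2.0.1.DBRU:200114, 18.18, 18.8, 19.4, 20.1",
    29372069: "12.2.0.1.DBRU:200114, 18.11, 18.18, 19.8, 20.1",
    29038730: "19.12, 20.1",
    35045932: "19.21",
}


def first_19c_fix(fixed: str):
    """Return the RU number of the 19.x entry in a 'Fixed' cell, or None."""
    for token in fixed.split(","):
        token = token.strip()
        if token.startswith("19."):
            return int(token.split(".")[1])
    return None


def unfixed_bugs(current_ru: int) -> list:
    """Bugs whose 19c fix arrives in a later RU than the installed one."""
    result = []
    for bug, fixed in FIXED_IN.items():
        ru = first_19c_fix(fixed)
        if ru is not None and ru > current_ru:
            result.append(bug)
    return sorted(result)


current_ru = 3  # the crashed database runs 19.3
exposed = unfixed_bugs(current_ru)
lowest_ru_with_all_fixes = max(first_19c_fix(f) for f in FIXED_IN.values())
print(len(exposed), lowest_ru_with_all_fixes)
```

On 19.3 none of the ten listed fixes is present, and according to the table the first RU that contains all of them is 19.25.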
Quite a few bugs are related to this error, for example:
Bug 35045932 - Instance crash after ORA-600 [kjblpgorm:!antilock]
Bug 29464779 - LNX64-20.1-ASM,DB LMS HIT ORA-600[KJBLPGORM:!ANTILOCK] THEN CRASH
Bug 35843249 - [RAC] LMS Hit ORA-600[kjblpgorm:!antilock]
All of them are triggered by read-mostly locking, a DRM feature introduced in 11g.
Read-mostly locking is enabled by default. The pkey check is missing when the anti-lock is not an LE, which can cause the wrong anti-lock to be closed after an object is reused.
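What is read-mostly locking?
DRM (Dynamic Resource Remastering) introduced affinity locks and object-level DRM in 10gR2, and read-mostly and reader bypass in 11g. Read-mostly locking is driven by the history of global operations on an object and is meant to reduce the messaging and CPU cost of read access. Oracle's cache layer tracks the number of S locks and X locks on each object. If a node opens a large number of S locks and very few X locks, and few blocks are being transferred, the object becomes read-mostly on that node. Once that happens, sharing of the object stops and its blocks are no longer shipped over the interconnect (unless a block is modified).
When an object is defined as read-mostly, the master node grants an S affinity lock on it to all nodes. Every node is therefore granted read access to its blocks "in advance", which reduces the number of S lock messages exchanged between nodes.
Oracle uses a special lock called an anti-lock to control X locks on read-mostly objects. When an X lock is requested, all nodes are told by broadcast to open an anti-lock, and every access to that block (S lock or X lock) falls back to standard cache fusion locking, even though the object itself is still read-mostly. The broadcast completes before the X lock is granted, and it is sent only when no anti-lock is already open on the block. The anti-lock is cleared when the read-mostly state is dissolved or when the dirty block is written to disk, and the X lock is downgraded.
Opening an X lock on a read-mostly object is a very expensive operation: before granting the X lock, the master node has to broadcast the anti-lock to all nodes, and the anti-lock cannot be removed until the X lock is closed. In addition, a node also creates anti-locks when it joins the cluster. An anti-lock is simply an LE flagged with KCLL_F_ANTI, and while an anti-lock exists a read-mostly lock cannot be granted.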
Further reading: 【深入解析】DRM和read-mostly locking (In-Depth Analysis: DRM and read-mostly locking)
Solution
Apart from upgrading, the options are to disable GCS read-mostly locking or to disable DRM altogether.
-- Dynamic workaround
alter system set "_lm_drm_disable"=4;
oradebug setmypid
oradebug lkdebug -m reconfig disrm
-- Static workaround
alter system set "_gc_read_mostly_locking"=false scope=spfile sid='*';
alter system set "_gc_persistent_read_mostly"=false scope=spfile sid='*';
ORA-00600: internal error code, arguments: [kfmdPriRegRclient04]
db alert log
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_fenc_758375.trc:
ORA-00600: internal error code, arguments: [kfmdPriRegRclient04], [], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_fenc_758375.trc (incident=853023):
ORA-854 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob3/incident/incdir_853023/anbob3_fenc_758375_i853023.trc
2025-06-06T21:23:00.975935+08:00
Dumping diagnostic data in directory=[cdmp_20250606212300], requested by (instance=3, osid=758375 (FENC)), summary=[incident=853022].
2025-06-06T21:23:02.073135+08:00
USER (ospid: ): terminating the instance due to ORA error
fenc trace file
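The FENC background process handles fencing: when a DB instance crashes and CSS detects it, this process is used to fence off the instance's I/O requests to the ASM layer.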
000000000 ? 000000082 ?
kgerinv_internal()+44        call  kgeadse()             7F39A6A829A0 ? 7F39A6960048 000000258 ? 012F67394 7F3900000000 7FFC00000000
kgerinv()+40                 call  kgerinv_internal()    7F39A6A829A0 ? 7F39A6960048 ? 000000258 ? 012F67394 ? 7F3900000000 ? 7FFC00000000 ?
kgeasnmierr()+146            call  kgerinv()             7F39A6A829A0 ? 7F39A6960048 ? 000000258 ? 012F67394 ? 7F3900000000 ? 7FFC00000000 ?
kfmdPriRegRclient()+1421     call  kgeasnmierr()         7F39A6A829A0 ? 7F39A6960048 ? 000000258 ? 012F67394 ? 14019C5608 000000000
kfmdProcessRclient()+276     call  kfmdPriRegRclient()   7F39A6A829A0 ? 7F39A6960048 ? 000000258 ? 6264637A78 14019C5608 ? 000000000 ?
kfnbListenNodeReconf()+1224  call  kfmdProcessRclient()  7F39A6A829A0 ? 000000004 000000258 ? 6264637A78 ? 14019C5608 ? 000000000 ?
kfmdPriRegRclient04 (kfmd)PriRegRclient04 – kernel automatic storage management node monitor interface implementation layer for diskgroup registration
Bug 32656231 @ Slow Instance Startup and ORA-00600 [kfmdPriRegRclient04] on FENC Process
Bug 35469192 Instance Crashes With ORA-600 [kfmdpriregrclient04] During Reconfiguration
During instance start, it takes too long to complete FIXWRITE step and instance is killed and restarted when using Real Application Clusters (RAC)
- Stack is likely to include kgeasnmierr
- Stack is likely to include kfmdPriRegRclient
- Stack is likely to include kfmdProcessRclient
- Stack is likely to include kfnbListenNodeRecon
- Stack is likely to include ksbrdp
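Checking a trace against this signature can be scripted. The following is a minimal sketch, not an official matching tool: the function name `match_signature` is hypothetical, and the sample stack contains only frame names taken from the FENC trace excerpt in this post, with `kfnbListenNodeReconf` rejoined from its line-wrapped form.

```python
import re

# Function names from the "Stack is likely to include" list above.
SIGNATURE = ["kgeasnmierr", "kfmdPriRegRclient", "kfmdProcessRclient",
             "kfnbListenNodeRecon", "ksbrdp"]


def match_signature(stack_text: str) -> dict:
    """Report, for each signature function, whether the stack mentions it."""
    return {
        # Prefix match, so kfnbListenNodeReconf counts for kfnbListenNodeRecon.
        name: re.search(rf"\b{re.escape(name)}", stack_text) is not None
        for name in SIGNATURE
    }


# Frames taken from the FENC trace excerpt (arguments omitted).
stack = """
kgeasnmierr()+146 call kgerinv()
kfmdPriRegRclient()+1421 call kgeasnmierr()
kfmdProcessRclient()+276 call kfmdPriRegRclient()
kfnbListenNodeReconf()+1224 call kfmdProcessRclient()
"""
hits = match_signature(stack)
print(hits)
```

Four of the five functions show up in the excerpt. ksbrdp is not in the portion of the stack quoted here, so the full trace file would need to be checked for it.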
This is a fairly close match for our case; the fix is to upgrade to a newer RU.
— OVER —