Troubleshooting ORA-00600: internal error code, arguments: [kcbbxsv_nwp] and opatch fail one-off patch

几年前在某个客户查看Oracle数据库 RAC补丁时发现节点间存在不一致的现象，当时在这篇《Using ‘opatch lsinventory’ show patched is real? （看到的补丁信息真的靠谱么？）》记录过，最近又遇到了一片“沼泽地”环境，运维人的窘境是无法要求建设阶段如何规范化，什么样的环境都要接，即使到处是坑。这里再分享一例有个ora-600错误引起的一系列问题。

–环境ORACLE RAC 19.11 ON RHEL 7.4.(没看错7.4,虽然官方要求是7.5+).

这环境据说是几日前出现过logfile switch checkpoint incomplete问题，当时发现数据库配置每个thread只有2组redo log. ok 太“节俭”了，后来增加到了4组，这次故障发生前是数据库sqlplus / as sysdba夯，但没有记录hanganalye信息，没有ASH, OSW，从db alert log中平时段在半小时左右会有log file switch而故障实例3小时未switch，并提示过一次checkpoint incompleted。2小时左右后dbwr 抛错crash了实例1,CRS自动拉起后恢复正常，期间node2 gc event也明显增加受到了影响，在实例1crash后，实例2恢复正常。猜测实例1crash前应该是长时间未checkpoint，所有online redo 都active无法复用。

DB ALERT LOG

Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_dbw0_70127.trc  (incident=921953) (PDBNAME=CDB$ROOT):
ORA-00600: internal error code, arguments: [kcbbxsv_nwp], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_921953/anbob1_dbw0_70127_i921953.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2022-11-17T15:06:56.316462+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_dbw0_70127.trc:
ORA-00600: internal error code, arguments: [kcbbxsv_nwp], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_dbw0_70127.trc  (incident=921954) (PDBNAME=CDB$ROOT):
ORA-471 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_921954/anbob1_dbw0_70127_i921954.trc
2022-11-17T15:06:56.372478+08:00
Dumping diagnostic data in directory=[cdmp_20221117150656], requested by (instance=1, osid=70127 (DBW0)), summary=[incident=921953].
2022-11-17T15:06:57.575063+08:00
USER (ospid: 70127): terminating the instance due to ORA error 471
2022-11-17T15:06:57.601250+08:00
TYJB(4):opiodr aborting process unknown ospid (122215) as a result of ORA-1092
2022-11-17T15:06:57.601832+08:00
TYJB(4):Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_122215.trc:
2022-11-17T15:07:03.672394+08:00
Instance terminated by USER, pid = 70127
2022-11-17T15:07:04.132714+08:00
Warning: 2 processes are still attacheded to shmid 3964937:
 (size: 49152 bytes, creator pid: 69713, last attach/detach pid: 70076)
2022-11-17T15:07:04.880460+08:00
USER(prelim) (ospid: 114832): terminating the instance

anbob1_dbw0_70127_i921954.trc trace file

----- Abridged Call Stack Trace -----
ksedsts()+426<-kjzduptcctx()+724<-kjzdpcrshnfy()+353<-dbkedKstDump()+27<-dbgdaExecuteAction()+354<-dbgerRunAction()+83
<-dbgerRunActions()+1352<-dbgexPhaseII()+549<-dbgexExplicitEndInc()+285<-dbgeEndDDEInvocationImpl()+314<-kcbbxsv()+17643
<-kcbb_coalesce_int()+326<-kcbb_coalesce()+438<-kcbbwthc()+820<-kcbbdrv()+8918<-ksb_act_run_int()+117<-ksb_act_run()+130
<-ksbcti()+1871<-ksbabs()+2363<-ksbrdp()+1167 
----- End of Abridged Call Stack Trace -----

根据CALL stack在MOS可以匹配Bug 31711720，直到19.13 RU修复。

Description
    DBWR process might encounter ORA-600 [kcbbxsv_nwp] which will 
    lead to instance crash. 
     
    Call stack might include: 
    ... kcbbxsv kcbb_coalesce_int kcbb_coalesce kcbbwdl ...
     
    Rediscovery Notes
    If DBW process fails with ORA-600 [kcbbxsv_nwp], this bug might 
    be hit.

其于这个问题建议
1，安装one-off patch 31711720
2, 优化checkpoint参数，如MTTR

安装31711720
安装oneoff patch比较简单的操作,有停机时间，选择了offline安装，那就很简单opatch apply就可以，但这里又一个“坑”。前期检查opatch lsinv输出正常。但是在sqlplus登陆两个节点发现SQL*Plus banner输出不同。node2提示是19.3.

安装补丁会发生什么？

节点1安装patch 顺利，但节点2应用报错，如下

[oracle@rac2 31711720]$ /u01/app/oracle/product/19.0.0/dbhome_1/OPatch/opatch apply
Oracle Interim Patch Installer version 12.2.0.1.25
Copyright (c) 2022, Oracle Corporation.  All rights reserved.


Oracle Home       : /u01/app/oracle/product/19.0.0/dbhome_1
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oracle/product/19.0.0/dbhome_1/oraInst.loc
OPatch version    : 12.2.0.1.25
OUI version       : 12.2.0.7.0
Log file location : /u01/app/oracle/product/19.0.0/dbhome_1/cfgtoollogs/opatch/opatch2022-11-18_19-41-01PM_1.log
...

Is the local system ready for patching? [y|n]
y
User Responded with: Y
Backing up files...
Applying interim patch '31711720' to OH '/u01/app/oracle/product/19.0.0/dbhome_1'

Patching component oracle.rdbms, 19.0.0.0.0...
Make failed to invoke "/usr/bin/make -f ins_rdbms.mk ioracle ORACLE_HOME=/u01/app/oracle/product/19.0.0/dbhome_1 OPATCH_SESSION=apply"....'/u01/app/oracle/product/19.0.0/dbhome_1/lib//libserver19.a(ksl.o): In function `kslges':
ksl.c:(.text+0x1fdfd): undefined reference to `sslss_async_call_cnt_'
ksl.c:(.text+0x1fe0f): undefined reference to `sslss_sig_pending_'
ksl.c:(.text+0x1fecb): undefined reference to `sslss_async_call_cnt_'
ksl.c:(.text+0x1fed7): undefined reference to `sslss_sigset_raise_signals'
ksl.c:(.text+0x1feeb): undefined reference to `sslss_async_call_cnt_'
/u01/app/oracle/product/19.0.0/dbhome_1/lib//libserver19.a(ksmmga.o): In function `ksm_mga_sctx_rcv_do':
ksmmga.c:(.text+0x361f): undefined reference to `sslss_async_call_cnt_'
ksmmga.c:(.text+0x3672): undefined reference to `sslss_async_call_cnt_'
ksmmga.c:(.text+0x3680): undefined reference to `sslss_async_call_cnt_'
ksmmga.c:(.text+0x368c): undefined reference to `sslss_sig_pending_'
ksmmga.c:(.text+0x375b): undefined reference to `sslss_sigset_raise_signals'
/u01/app/oracle/product/19.0.0/dbhome_1/lib//libserver19.a(ksmmga.o): In function `ksm_mga_detach':
...

Note:
对比了libserver19.a文件一致，即使从node1复制过来依旧有问题。

lib 节点间不同

$ cd /u01/app/oracle/product/19.0.0/dbhome_1/
$ ls -l lib/*|sort -k 9|awk '{print $1" "$3" "$5" "$9}'>node1lib
$ ls -l lib/*|sort -k 9|awk '{print $1" "$3" "$5" "$9}'>node2lib

[root@oel7db1 ~]# diff node1lib node2lib
1d0
< [oracle@rac1
34,36c33,35
< -rw-r--r-- oracle 39603 classgen.jar
< -rw-r--r-- oracle 6567902 clntsh.map
< -rw-r--r-- oracle 1818540 clntshcore.map --- > -rw-r----- oracle 39293 classgen.jar
> -rw-r--r-- oracle 6519050 clntsh.map
> -rw-r--r-- oracle 1818066 clntshcore.map
52c51
< -rwxr-xr-x oracle 1676112 libagtsh.so.1.0 --- > -rwxr-xr-x oracle 1674632 libagtsh.so.1.0
54c53
< -rw-r--r-- oracle 8677152 libanllexer19.so --- > -rw-r--r-- oracle 8676584 libanllexer19.so
57c56
...

Note:
如果可以把整合lib目录全从好的节点复制一份到报错节点，但前面注意到了sqlplus banner，有必要考虑下这套软件环境是不是之前安装的就有问题，也对比了OS 的rpm lib安装一致。

检查补丁信息

\u01\app\oracle\product\19.0.0\dbhome_1\.opatchauto_storage\system_patches
-- node1
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<system-patch-information>
 <system-patch id="29547676">
 <sub-patch>
 <id>29517242</id>
 <time>2019/04/18 07:21:20</time>
 </sub-patch>
 <sub-patch>
 <id>29585399</id>
 <time>2019/04/18 07:21:37</time>
 </sub-patch>
 </system-patch>
 <system-patch id="32545008">
 <sub-patch>
 <id>32545013</id> ------19.11 RU
 <time>2022/10/24 14:45:53</time>
 </sub-patch>
 <sub-patch>
 <id>32579761</id>
 <time>2022/10/24 14:46:38</time>
 </sub-patch>
 </system-patch>
</system-patch-information>

-- node2
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<system-patch-information>
 <system-patch id="29547676">
 <sub-patch>
 <id>29517242</id>
 <time>2019/04/18 07:21:20</time>
 </sub-patch>
 <sub-patch>
 <id>29585399</id>
 <time>2019/04/18 07:21:37</time>
 </sub-patch>
 </system-patch>
 <system-patch id="32545008">
 <sub-patch>
 <id>32579761</id>
 <time>2022/10/24 15:35:07</time>
 </sub-patch>
 </system-patch>
</system-patch-information>

Note:
node2缺少32545013 RU信息，开始我提到之前的blog提到过，如果怀疑软件是否一致，可以查看一下oracle binary执行文件是不是大小一致，节点2确认是不一致的（注意要和节点1apply补丁前对比），所以后期打补丁时不防检查一下节点间oracle执行文件是否大小一致。如果软件有bug,这里可以尝试rollback 节点的RU 重新安装试试，节点1软件是正常的，这里使用另一方案,delete node /add node重新初始化软件。

-- delete node2
# dbca -silent -deleteInstance -nodeList "rac2" -gdbName "anbob" -instanceName "anbob2" -sysDBAUserName "sys" -sysDBAPassword xxxxxxx
[grid@rac1 ~]$ srvctl disable listener -listener LISTENER -node rac2
[grid@rac1 ~]$ srvctl stop listener -listener LISTENER -node rac2
[oracle@rac2 bin]$ ./runInstaller -updateNodeList ORACLE_HOME =/u01/app/oracle/product/19.0.0/dbhome_1 "CLUSTER_NODES={rac2}" -local
[oracle@rac2 bin]$ cd $ORACLE_HOME/deinstall
[oracle@rac2 deinstall]$ ./deinstall -local
[oracle@rac1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME =/u01/app/oracle/product/19.0.0/dbhome_1 "CLUSTER_NODES={rac1}"
[grid@rac2 bin]$ ./runInstaller -updateNodeList ORACLE_HOME =/u01/app/19.0.0/grid "CLUSTER_NODES={rac2}" CRS=TRUE -silent -local
[grid@rac2 ContentsXML]$ cd $ORACLE_HOME/deinstall
[grid@rac2 deinstall]$ ./deinstall -local
[root@rac2 ~]# /u01/app/19.0.0/grid/crs/install/rootcrs.sh -force  -deconfig -paramfile "/tmp/deinstall2022-11-18_11-33-32PM/response/deinstall_OraGI19Home1.rsp"
[grid@rac1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/19.0.0/grid "CLUSTER_NODES={rac1}" CRS=TRUE -silent -local
[root@rac1 bin]# ./crsctl delete node -n rac2
CRS-4661: Node rac2 successfully deleted.
[root@rac1 bin]# ./olsnodes -s -t
rac1    Active  Unpinned

-- add node2
[grid@rac1 ~]$ cluvfy stage -pre nodeadd -n rac2 -verbose -fixup
[grid@rac1 ~]$ cd /u01/app/19.0.0/grid/addnode/
[grid@rac1 ~]$export IGNORE_PREADDNODE_CHECKS=Y
[grid@rac1 addnode]$ ./addnode.sh -silent -ignorePrereq "CLUSTER_NEW_NODES={rac2}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={rac2-vip}" "CLUSTER_NEW_NODE_ROLES={hub}"
[root@rac2 ~]# /u01/app/19.0.0/grid/root.sh

[oracle@rac1 ~]$ cd /u01/app/oracle/product/19.0.0/dbhome_1/addnode/
[oracle@rac1 addnode]$ ./addnode.sh -silent -ignorePrereq "CLUSTER_NEW_NODES={rac2}"
[root@rac2 ~]# /u01/app/oracle/product/19.0.0/dbhome_1/root.sh

[oracle@rac1 ~]$ dbca -silent -addInstance -gdbName "anbob" -nodeName "rac2" -instanceName "anbob2" -sysDBAUserName "sys" -sysDBAPassword xxxxxxx
Prepare for db operation
40% complete
Adding instance
48% complete
52% complete
56% complete
60% complete
64% complete
68% complete
72% complete
80% complete
Completing instance management.
86% complete
100% complete
[FATAL] PRCR-1013 : Failed to start resource ora.tyjbcdb.db
PRCR-1064 : Failed to start resource ora.tyjbcdb.db on node rac2
CRS-5017: The resource action "ora.anbob.db start" encountered the following error:
ORA-12547: TNS:lost contact
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/rac2/crs/trace/crsd_oraagent_oracle.trc".

ORA-12547: TNS:lost contact
CRS-2674: Start of 'ora.anbob.db' on 'rac2' failed
Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/anbob/anbob.log" for further details.

NOTE:
在启实例阶段失败了。ora-12547

检查oracle binary权限

[oracle@rac1 bin]$ ll oracle
-rwsr-s--x 1 oracle asmadmin 448711384 Nov 18 23:00 oracle

[oracle@rac2 bin]$ ll oracle
-rwsr-s--x 1 oracle asmadmin 448711384 Nov 19 00:12 oracle

strace跟踪

[oracle@rac2 ~]$ strace -f -o /tmp/trace.1.log sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 19 00:41:19 2022
Version 19.11.0.0.0

Copyright (c) 1982, 2020, Oracle.  All rights reserved.

ERROR:
ORA-12547: TNS:lost contact

分析 trace文件
检查“exited”

 close(6) = 0 <0.000006>
  close(4) = 0 <0.000006>
  close(5) = 0 <0.000011>
  close(3) = 0 <0.000006>
  close(5) = -1 EBADF (Bad file descriptor) <0.000005>
  sigaltstack(NULL, {ss_sp=0x7f6df2186000, ss_flags=0, ss_size=1048576}) = 0 <0.000005>
  sigaltstack({ss_sp=0x7f6df2186000, ss_flags=SS_DISABLE, ss_size=1048576}, NULL) = 0 <0.000005>
  exit_group(0) = ?
  +++ exited with 0 +++

继续向上找file句柄 5

 open("/u01/app/oracle/product/19.0.0/dbhome_2/rdbms/log/tyjbcdb2_ora_439527.trc", O_WRONLY|O_CREAT|O_EXCL, 0660) = 5 <0.000047>
  lstat("/u01/app/oracle/product/19.0.0/dbhome_2/rdbms/log/tyjbcdb2_ora_439527.trc", {st_mode=S_IFREG|0640, st_size=0, ...}) = 0 <0.000006>
  fcntl(5, F_SETFD, FD_CLOEXEC) = 0 <0.000005>
  write(5, "Dump file ", 10) = 10 <0.000014>
  write(5, "/u01/app/oracle/product/19.0.0/d"..., 73) = 73 <0.000007>
  write(5, "\n", 1) = 1 <0.000007>
  write(5, "Build label: RDBMS_19.11.0.0."..., 54) = 54 <0.000007>
  write(5, "Oracle Database 19c Enterprise E"..., 90) = 90 <0.000007>
  write(5, "\n", 1) = 1 <0.000007>
  write(5, "Build label: RDBMS_19.11.0.0."..., 54) = 54 <0.000007>
  write(5, "\n", 1) = 1 <0.000014>
  access("/u01/app/oracle/common/oracle/sql/ORACLE_HOME: /u01/app/oracle/product/19.0.0/dbhome_2\n", F_OK) = -1 ENOENT (No such file or directory) <0.000007>
  access("./ORACLE_HOME: /u01/app/oracle/product/19.0.0/dbhome_2\n", F_OK) = -1 ENOENT (No such file or directory) <0.000007>
  access("/u01/app/oracle/product/19.0.0/dbhome_1/rdbms/admin/ORACLE_HOME: /u01/app/oracle/product/19.0.0/dbhome_2\n", F_OK) = -1 ENOENT (No such file or directory) <0.000007>
  write(5, "ORACLE_HOME: /u01/app/oracle/"..., 56) = 56 <0.000007>
  uname({sysname="Linux", nodename="rac2", ...}) = 0 <0.000006>
  write(5, "System name:\tLinux\n", 19) = 19 <0.000007>
  write(5, "Node name:\trac2\n", 16) = 16 <0.000007>
  write(5, "Release:\t3.10.0-693.el7.x86_64\n", 31) = 31 <0.000006>
  write(5, "Version:\t#1 SMP Thu Jul 6 19:56:"..., 44) = 44 <0.000007>
  write(5, "Machine:\tx86_64\n", 16) = 16 <0.000008>
  write(5, "Storage:\t?\n", 11) = 11 <0.000007>
  write(5, "Instance name: tyjbcdb2\n", 24) = 24 <0.000007>
  write(5, "Redo thread mounted by this inst"..., 47) = 47 <0.000006>
  write(5, "Oracle process number: 0\n", 25) = 25 <0.000007>
  write(5, "Unix process pid: 439527, image:"..., 33) = 33 <0.000006>
  write(5, "\n", 1) = 1 <0.000006>
  write(5, "\n", 1) = 1 <0.000006>
  write(5, "\n*** 2022-11-19T12:45:43.868985+"..., 38) = 38 <0.000007>
  write(5, "ORA-12753: The Oracle home user "..., 131) = 131 <0.000006>
  write(5, "\n", 1) = 1 <0.000006>
  write(5, "ORA-27300: OS system dependent o"..., 91) = 91 <0.000006>
  write(5, "\n", 1) = 1 <0.000007>
  write(5, "ORA-27301: OS failure message: E"..., 38) = 38 <0.000007>
  write(5, "\n", 1) = 1 <0.000007>
  write(5, "ORA-27302: failure occurred at: "..., 39) = 39 <0.000007>
  write(5, "\n", 1) = 1 <0.000006>
  write(5, "ORA-27303: additional informatio"..., 94) = 94 <0.000007>
  write(5, "\n", 1) = 1 <0.000007>

查看trace file

Dump file /u01/app/oracle/product/19.0.0/dbhome_2/rdbms/log/tyjbcdb2_ora_439527.trc

*** 2022-11-19T12:45:43.868985+08:00
ORA-12753: The Oracle home user ID does not match the effective user ID of the local client which connected to the Oracle instance.
ORA-27300: OS system dependent operation:check $ORACLE_HOME ownership failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: spstp:4
ORA-27303: additional information: ORACLE_HOME uid does not match euid. uid: 1002 euid: 1001

检查ORACLE HOME owner uid

[oracle@rac2 19.0.0]$ ls -l
total 10510948
drwxrwxr-x 73 grid   oinstall        4096 Nov 19 00:13 dbhome_1
[oracle@rac2 19.0.0]$ id oracle
uid=1001(oracle) gid=501(oinstall) groups=501(oinstall),502(dba),503(oper),506(asmdba)
[oracle@rac2 19.0.0]$ id grid
uid=1002(grid) gid=501(oinstall) groups=501(oinstall),502(dba),503(oper),504(asmadmin),505(asmoper),506(asmdba)

Note:
应该是1001 oracle,当前是1002 grid。

修正OH owner

[root@rac2 19.0.0]# chown oracle:oinstall dbhome_1/

[oracle@rac2 ~]$ sqlplus / as sysdba
SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 19 13:28:35 2022
Version 19.11.0.0.0

Copyright (c) 1982, 2020, Oracle. All rights reserved.
Connected to an idle instance.

Summary:
一个错误引起的一系列“闯关”。源于一个软件最开始的安装与配置，以后再安装补丁时，记的增加一个检查吧。

— over —