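Here is a fairly fresh case from a recent engagement. The environment is a 2-node Oracle RAC running a multitenant architecture with 3 PDBs. On node 2, after nothing more than a change to the PDB-level PGA size parameters of one PDB, instance 2 crashed. After the database instance on node 2 was restarted and the PDBs were opened one by one, instance 2 crashed again only when this particular PDB was opened, reporting the following error: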
2025-04-15T12:33:53.433625+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_50145.trc:
ORA-29740: evicted by instance number 2, group incarnation 121
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_50145.trc (incident=1398624) (PDBNAME=CDB$ROOT):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_1398624/anbob1_lmon_50145_i1398624.trc
2025-04-15T12:33:53.578776+08:00
USER (ospid: 80938): terminating the instance due to ORA error 481
Operations performed
# grep -i "alter system " alert*
alert_anbob1.log:QTP_ZZ(3):ALTER SYSTEM SET pga_aggregate_limit=150G SCOPE=SPFILE PDB='QTP_ZZ';
alert_anbob1.log:QTP_ZZ(3):ALTER SYSTEM SET pga_aggregate_target=50G SCOPE=SPFILE PDB='QTP_ZZ';
DB ALERT log
Completed: alter pluggable database QTP_ZZ close immediate
2025-04-15T00:58:16.171882+08:00
alter pluggable database QTP_ZZ open
2025-04-15T00:58:16.174395+08:00
QTP_ZZ(3):Pluggable database QTP_ZZ opening in read write
QTP_ZZ(3):SUPLOG: Initialize PDB SUPLOG SGA, old value 0x0, new value 0x18
QTP_ZZ(3):Autotune of undo retention is turned on.
QTP_ZZ(3):queued attach DA request 0xa849e3f58 for pdb 3, ospid 41223
2025-04-15T00:58:16.425005+08:00
Increasing priority of 32 RS
Domain Action Reconfiguration started (domid 3, new da inc 4, cluster inc 4)
Instance 1 is attaching to domain 3
Global Resource Directory partially frozen for domain action
Non-local Process blocks cleaned out
Set master node info
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
2025-04-15T00:59:01.857616+08:00
CLMN: clean deferred state objects - failed
2025-04-15T01:03:51.414380+08:00
minact-scn: got error during useg scan e:12751 usn:11
minact-scn: useg scan erroring out with error e:12751
2025-04-15T01:04:24.314328+08:00
Decreasing priority of 32 RS
2025-04-15T01:05:12.397123+08:00
Detected an inconsistent instance membership by instance 2
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_24376.trc (incident=782577) (PDBNAME=CDB$ROOT):
ORA-29740: evicted by instance number 2, group incarnation 6
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_782577/anbob1_lmon_24376_i782577.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2025-04-15T01:05:13.189357+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_24376.trc:
ORA-29740: evicted by instance number 2, group incarnation 6
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_24376.trc (incident=782578) (PDBNAME=CDB$ROOT):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_782578/anbob1_lmon_24376_i782578.trc
2025-04-15T01:05:13.269464+08:00
USER (ospid: 57215): terminating the instance due to ORA error 481
2025-04-15T01:05:13.269822+08:00
Cause - 'Instance is being terminated due to dlm error '
2025-04-15T01:05:19.951483+08:00
ORA-1092 : opitsk aborting process
2025-04-15T01:05:21.288270+08:00
License high water mark = 1222
2025-04-15T01:05:28.316991+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec
2025-04-15T01:05:29.317677+08:00
Instance terminated by USER, pid = 57215
2025-04-15T01:05:29.759456+08:00
Warning: 2 processes are still attached to shmid 25: (size: 118784 bytes, creator pid: 23958, last attach/detach pid: 24483)
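To get a feel for how long the PDB open hung before the eviction, the alert-log timestamps can be subtracted directly. The following is only a small sketch: the three timestamps are copied from the excerpt above, while the constant and function names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied verbatim from the alert log excerpt.
PDB_OPEN = "2025-04-15T00:58:16.174395+08:00"              # "alter pluggable database QTP_ZZ open"
MEMBERSHIP_ERROR = "2025-04-15T01:05:12.397123+08:00"      # "Detected an inconsistent instance membership"
INSTANCE_TERMINATING = "2025-04-15T01:05:13.269464+08:00"  # "terminating the instance due to ORA error 481"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two ISO-8601 alert-log timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


open_to_error = seconds_between(PDB_OPEN, MEMBERSHIP_ERROR)
open_to_kill = seconds_between(PDB_OPEN, INSTANCE_TERMINATING)

print(f"PDB open -> membership error : {open_to_error:.3f} s")
print(f"PDB open -> instance shutdown: {open_to_kill:.3f} s")
```

In this excerpt the open sat for roughly seven minutes before the instance was evicted, which fits a reconfiguration that kept waiting on a timeout rather than an immediate failure.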
PDB PARAMETER
sys@anbob2(762)> select * from v$pdbs;
CON_ID DBID CON_UID GUID NAME OPEN_MODE RES OPEN_TIME CREATE_SCN TOTAL_SIZE BLOCK_SIZE RECOVERY SNAPSHOT_PARENT_CON_ID APP APP APP APPLICATION_ROOT_CON_ID APP PRO LOCAL_UNDO UNDO_SCN UNDO_TIMESTAMP CREATION_TIME DIAGNOSTICS_SIZE PDB_COUNT AUDIT_FILES_SIZE MAX_SIZE MAX_DIAGNOSTICS_SIZE MAX_AUDIT_SIZE LAST_CHANGE TEM TENANT_ID UPGRADE_LEVEL GUID_BASE64
---------- ---------- ---------- -------------------------------- -------------------------------------------------- ---------- --- --------------------------------------------------------------------------- ---------- ---------- ---------- -------- ---------------------- --- --- --- ----------------------- --- --- ---------- ---------- ------------------- ------------------- ---------------- ---------- ---------------- ---------- -------------------- -------------- ----------- --- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------- ------------------------------
2 2285472469 2285472469 DA2EC12AE5F8B1D5E0532303D80A6F97 PDB$SEED READ ONLY NO 18-JUN-22 11.13.48.006 AM +08:00 317 8.6282E+10 8192 ENABLED NO NO NO NO NO 1 0 2022-03-14 21:36:30 0 0 0 0 0 0 COMMON USER NO 1 2i7BKuX4sdXgUyMD2ApvlwA=
3 3808343839 3808343839 DAF174EB35C75284E0532403D80AC894 QTP_ZZ READ WRITE NO 18-JUN-22 11.32.19.766 AM +08:00 3221042 1.0267E+13 8192 ENABLED NO NO NO NO NO 1 317 2022-03-24 13:52:52 0 0 0 0 0 0 COMMON USER NO 1 2vF06zXHUoTgUyQD2ArIlAA=
4 2224247423 2224247423 DE3F4BF37670A8D8E0532303D80A599D QTP_XX READ WRITE NO 18-JUN-22 11.32.19.767 AM +08:00 12418374 2.1310E+11 8192 ENABLED NO NO NO NO NO 1 317 2022-05-05 15:00:27 0 0 0 0 0 0 COMMON USER NO 1 3j9L83ZwqNjgUyMD2ApZnQA=
5 582974699 582974699 DF2DD911E6D74D1FE0532303D80A8284 QTP_YY READ WRITE NO 18-JUN-22 11.32.19.767 AM +08:00 22662829 3.2740E+11 8192 ENABLED NO NO NO NO NO 1 317 2022-05-17 11:36:36 0 0 0 0 0 0 COMMON USER NO 1 3y3ZEebXTR/gUyMD2AqChAA=
sys@anbob2(258)> select * from pdb_spfile$;
DB_UNIQ_NAME PDB_UID SID NAME VALUE$ COMMENT$ SPARE1 SPARE2 SPARE3
------------------------------ --------------- -------------------- ---------------------------------------- -------------------------------------------------- ---------------------------------------- --------------- --------------- --------------------------------------------------------------------------------------------------------------------------------
* 2285472469 * db_securefile 'PREFERRED' 0 0
* 3808343839 * db_securefile 'PREFERRED' 0 0
* 3808343839 * sga_target 322122547200 322122547200 0
* 3808343839 * sga_min_size 64424509440 64424509440 0
* 3808343839 * pga_aggregate_limit 107374182400 161061273600 0
* 3808343839 * pga_aggregate_target 53687091200 69793218560 0
* 3808343839 * open_cursors 20000 20000 0
* 2224247423 * db_securefile 'PREFERRED' 0 0
* 582974699 * db_securefile 'PREFERRED' 0 0
* 2224247423 * sga_target 214748364800 214748364800 0
* 2224247423 * sga_min_size 64424509440 64424509440 0
* 2224247423 * pga_aggregate_limit 107374182400 107374182400 0
* 2224247423 * pga_aggregate_target 53687091200 53687091200 0
* 2224247423 * open_cursors 200 200 0
14 rows selected.
Note:
PDB-level spfile parameters are not stored in the database spfile; they are kept inside the database, in the pdb_spfile$ table.
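The byte values in pdb_spfile$ are easier to compare once converted to GiB. The sketch below copies the numeric rows from the query output above and lists those where VALUE$ and the numeric column printed right after it differ. Which dictionary column that second number belongs to cannot be determined from the flattened output, so it is simply labeled "next column" here; the function names are illustrative only.

```python
GIB = 1024 ** 3  # bytes per GiB

# (PDB_UID, NAME, VALUE$, numeric column printed after VALUE$), copied from the output above.
ROWS = [
    (3808343839, "sga_target", 322122547200, 322122547200),
    (3808343839, "sga_min_size", 64424509440, 64424509440),
    (3808343839, "pga_aggregate_limit", 107374182400, 161061273600),
    (3808343839, "pga_aggregate_target", 53687091200, 69793218560),
    (2224247423, "sga_target", 214748364800, 214748364800),
    (2224247423, "sga_min_size", 64424509440, 64424509440),
    (2224247423, "pga_aggregate_limit", 107374182400, 107374182400),
    (2224247423, "pga_aggregate_target", 53687091200, 53687091200),
]


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def differing_rows(rows):
    """Return rows whose two numeric columns disagree, with both values in GiB."""
    return [
        (uid, name, to_gib(value), to_gib(other))
        for uid, name, value, other in rows
        if value != other
    ]


for uid, name, value_gib, other_gib in differing_rows(ROWS):
    print(f"PDB_UID {uid} {name}: VALUE$ = {value_gib:g} GiB, next column = {other_gib:g} GiB")
```

Only the two PGA parameters of PDB_UID 3808343839 (QTP_ZZ, per CON_UID in v$pdbs) show a difference, which matches the PDB whose PGA settings were changed.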
Incident log
adrci> show trace /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_782577/anbob1_lmon_24376_i782577.trc
Output the results to file: /tmp/utsout_185122_1404_4.ado
000000000 ? 000000082 ?
ksedmp()+577 call dbkedDefDump() 000000003 000000002
7FFF11575E50 ? 7FFF11575F68 ?
000000000 ? 000000082 ?
dbgexPhaseII()+2092 call ksedmp() 0000003EB 000000002 ?
7FFF11575E50 ? 7FFF11575F68 ?
000000000 ? 000000082 ?
dbgexProcessError() call dbgexPhaseII() 7FED77D156D8 7FED77CD92A0
+1871 7FFF1157D500 7FFF11575F68 ?
000000000 ? 000000082 ?
dbgePostErrorKGE()+ call dbgexProcessError() 7FED77D156D8 7FED77CD92A0
1853 000000001 000000000
000000000 ? 000000082 ?
dbkePostKGE_kgsf()+ call dbgePostErrorKGE() 7FED77D559C0 7FED71774380
71 00000742C 000000000 ?
000000000 ? 000000082 ?
kgeade()+392 call dbkePostKGE_kgsf() 7FED77D559C0 7FED71774380
00000742C 000000000 ?
000000000 ? 000000082 ?
kgeselv()+89 call kgeade() 7FED77D559C0 ? 7FED77D55C08 ?
7FED71774380 ? 00000742C ?
000000000 000000000
ksesec2()+205 call kgeselv() 7FED77D559C0 ? 7FED71774380 ?
00000742C ? 012FC20E0
012FC20E8 000000002
kjxgrdtrt()+1317 call ksesec2() 7FED77D559C0 ? 000000000
000000002 000000000 000000006
0FFFFFFFF
kjxgrDiskVote_Valid call kjxgrdtrt() 7FED723BC6F8 000000001
ateMembership()+187 000000006 000000000 ?
4 000000006 ? 0FFFFFFFF ?
kjxgrDiskVote_Execu call kjxgrDiskVote_Valid 000000008 000000007
te()+86 ateMembership() 7FED723BC6F8 000000000 ?
000000006 ? 0FFFFFFFF ?
kjxgrrcfgchk()+7788 call kjxgrDiskVote_Execu 7FED723BC6F8 000000007 ?
te() 7FED723BC6F8 ? 000000000 ?
000000006 ? 0FFFFFFFF ?
kjxggpoll()+171 call kjxgrrcfgchk() 7FED723BC6F8 000000000
7FED723BC6F8 ? 000000000 ?
000000006 ? 0FFFFFFFF ?
kjfmact()+104 call kjxggpoll() 0722BC200 000000000
7FED723BC6F8 ? 000000000 ?
000000006 ? 0FFFFFFFF ?
kjfcln()+4310 call kjfmact() 7FED722BC200 A65F575A0
000000000 000000000 ?
000000006 ? 0FFFFFFFF ?
ksbrdp()+1167 call kjfcln() 0600124D0 A65F575A0 ?
000000000 ? 000000000 ?
000000006 ? 0FFFFFFFF ?
opirip()+541 call ksbrdp() 0600124D0 ? A65F575A0 ?
000000000 ? 000000000 ?
000000006 ? 0FFFFFFFF ?
opidrv()+581 call opirip() 000000032 000000004
7FFF11614E18 000000000 ?
000000006 ? 0FFFFFFFF ?
sou2o()+165 call opidrv() 000000032 000000004
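One detail worth noticing in the stack: the argument 00000742C that is passed down through ksesec2(), kgeselv() and kgeade() is hexadecimal, and it decodes to the ORA error number seen in the alert log. A tiny sketch for decoding such arguments follows; the helper name is made up, and reading this argument as the error code is an interpretation, not something the trace states.

```python
def hex_arg_to_int(arg: str) -> int:
    """Convert a hex argument as printed in the call stack (e.g. '00000742C ?') to an integer."""
    return int(arg.rstrip(" ?"), 16)


print(hex_arg_to_int("00000742C ?"))  # 29740, i.e. ORA-29740
print(hex_arg_to_int("000000006 ?"))  # 6
```

The value 6 that recurs in the kjxgr* frames equals the group incarnation reported by ORA-29740 in the alert log, although that match may be coincidental.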
lmon trace
LMD0 group 0 GES resources 111744 pool 38
LMD1 group 0 GES resources 111744 pool 38
LMD2 group 0 GES resources 111744 pool 38
LMD3 group 0 GES resources 111744 pool 38
LMD4 group 0 GES resources 111744 pool 38
GES enqueues 169250
GCS latches 4096
GES IPC: Receivers 37 Senders 37
GES IPC: Buffers Receive 1000 Send (i:0 b:0) Reserve 0
GES IPC: Msg Size Regular 512 Batch 8192
Batching factor: enqueue replay 201, ack 223
Batching factor: cache replay 91 size per lock 88
Read-write Instance? 1, Designated Master? 1, BOC? 1, Broadcast SCN mode: 1
CSS cluster type is UNKNOWN (1)
*** 2025-04-15T12:14:13.128153+08:00 (CDB$ROOT(1))
kjxggin: CGS tickets = 1000
kjxgmin: set instance reconnect max time to 40 secs
kjxgmin: local IPv4 169.254.8.240 (UDP)
kjxgrdmpcpu: CPU Total (raw:192 eff:192) Core 96 Socket 4 OCPU 192
kjxgrdmpcpu: High load threshold 245760
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmjoin: rimlost event instmap:
*** 2025-04-15T12:14:13.222430+08:00 (CDB$ROOT(1))
kjxgmrcfg: Reconfiguration started, type 1
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 0 0.
2025-04-15 12:14:13.742 : * Begin lmon rcfg step KJGA_RCFG_BEGIN (kjidomena 0, rcfginfo x0)
* local undo 0 (0.0.1.0), kjitxtsn 1, kju_tx_tsn_affinity 1
* kjga: st x0, flg x2000, stp 1.0(0).0, rmno 0
* kjdrmst: domid 0, requester 32767, pt 0, hv 0, rm 0, rcfg int 0,
undo 0, datype x0, sizefltr stg 0, intda chkinc 0, intda setinc 0
* adg_enabled? 1
domain 0 valid? 0
* RORA mode = FALSE
* ----- RORA state at the beginning of rcfg ------ *
adg_enabled 1, roram 32767, last roram 32767, rora_requester 32767
rora_invalid 0, rora_expand 0
adg_roram 32767, adg last roram 32767, adg_rora_requester 32767
rcvinst 32767, domain valid? 0
* ------------------------------------------------- *
* kjfcrfg: Dump rbuddy info at the beginning of rcfg:
* kji_rbuddy_dmpi2t: dump i2t array:
* array is empty
* kji_rbuddy_dmpall: dump rbuddy array (rcvinst 32767, dom0 valid 0):
* kji_rbuddy_graph: (cinc 0, valid 0, rinst 32767) [ ]
* End of rbuddy info dump
* Begin rcfg: free use mem = 710104928 (freemem 924863904, count 8)
* kjfcqiora: query MULTIPLE LMD ENQUEUE INFO of inst 1 = 5@*@*, vallen 5, strlen 5
* kjfcqiora: parsing of string complete
kjfcqiora: skipping namespace 14, not in use anymore
* kjfcqiora: query MULTIPLE LMD ENQUEUE INFO of inst 2 = 5@*@*, vallen 5, strlen 5
* kjfcqiora: parsing of string complete
kjfcqiora: skipping namespace 14, not in use anymore
* kjfcrfg: kjfcqiora returned success
2025-04-15 12:14:13.746 :
Reconfiguration started (old inc 0, new inc 119)
* kjfcrfg: drm size limit is -1 buffers
Dynamic remastering is disabled
List of instances (total 2) :
1 2
My inst 1 (I'm a new instance)
* kjfcrfg: Dump rbuddy info before kji_rbuddy_rcfg:
* kji_rbuddy_dmpi2t: dump i2t array:
* array is empty
* kji_rbuddy_dmpall: dump rbuddy array (rcvinst 32767, dom0 valid 0):
* kji_rbuddy_graph: (cinc 0, valid 0, rinst 32767) [ ]
* End of rbuddy info dump
* kjfcrfg: Dump rbuddy info after kji_rbuddy_rcfg:
* kji_rbuddy_dmpi2t: dump i2t array:
* array is empty
* kji_rbuddy_dmpall: dump rbuddy array (rcvinst 32767, dom0 valid 0):
* kji_rbuddy_graph: (cinc 0, valid 0, rinst 32767) [ ]
* End of rbuddy info dump
* kjfcrfg: sync timeout = 326 secs (2x default)
TIMEOUTS:
Local health check timeout: 70 sec
Rcfg process freeze timeout: 70 sec
Remote health check timeout: 140 sec
*** 2025-04-15T12:30:43.091118+08:00 (CDB$ROOT(1))
kjfmReceiverHealthCB_Check: Reciever [26] is healthy.
kjfmReceiverHealthCB_Check: Reciever [9] is healthy.
kjfmReceiverHealthCB_Check: Reciever [5] is healthy.
*** 2025-04-15T12:33:04.022623+08:00 (CDB$ROOT(1))
kjfmReceiverHealthCB_Check: Reciever [9] is healthy.
kjfmReceiverHealthCB_Check: Reciever [2] is healthy.
kjfmReceiverHealthCB_Check: Reciever [11] is healthy.
kjfmReceiverHealthCB_Check: Reciever [28] is healthy.
*** 2025-04-15T12:33:04.664247+08:00 (CDB$ROOT(1))
2025-04-15 12:33:04.664 : kjxgrDD_rr_read: Detect reconfig from inst 2, seq 120, reason 3
================================
== System Network Information ==
================================
==[ Network Interfaces : 5 (5 max) ]============
lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING
bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING
bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING
bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING
== [ Network Transport Usage (ksipc: avail[xa8] sel[UDP]) (IPv4) ] ==
===[ IPv4 Route Table : 4 entries ]============
Destination Gateway Iface
0.0.0.0/0 10.216.3.1 bond0
10.216.3.0/24 0.0.0.0 bond0
169.254.0.0/19 0.0.0.0 bond1
192.168.103.0/24 0.0.0.0 bond1
===[ ARP Table ]============
IP address HW type Flags HW address Mask Device
10.216.3.36 0x1 0x2 5c:6f:69:55:cb:70 * bond0
10.216.3.39 0x1 0x2 5c:6f:69:55:cb:70 * bond0
169.254.27.60 0x1 0x2 74:50:4e:d8:df:c7 * bond1
10.216.3.38 0x1 0x2 5c:6f:69:55:cb:70 * bond0
192.168.103.2 0x1 0x2 74:50:4e:d8:df:c7 * bond1
10.216.3.37 0x1 0x0 00:00:00:00:00:00 * bond0
10.216.3.1 0x1 0x2 a0:69:d9:91:1e:36 * bond0
===[ Network Config : 15 devices ]============
bond0 .rp_filter = 1
bond1 .rp_filter = 1
ens14f0 .rp_filter = 1
ens14f1d1 .rp_filter = 1
ens15f0 .rp_filter = 1
ens15f1d1 .rp_filter = 1
ens16f0 .rp_filter = 1
ens16f1 .rp_filter = 1
ens16f2 .rp_filter = 1
ens16f3 .rp_filter = 1
ens31f0 .rp_filter = 1
ens31f1 .rp_filter = 1
ens31f2 .rp_filter = 1
ens31f3 .rp_filter = 1
lo .rp_filter = 0
==[ Network Interface States: num IF 5 Snapshots 5 ]==
***** info from 292s ago
lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING
bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING
bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING
bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING
***** info from 52s ago
lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING
bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING
bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING
bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING
***** info from 112s ago
lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING
bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING
bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING
bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING
***** info from 172s ago
lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING
bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING
bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING
bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING
***** info from 232s ago
lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING
bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING
bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING
bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2025-04-15 12:33:04.688 : kjxgrnetchk: start 0x2105f75, end 0x21115bf
2025-04-15 12:33:04.688 : kjxgrnetchk: Network Validation wait: 46 sec
2025-04-15 12:33:04.688 : kjxgrnetchk: Sending comm check req to inst 2
kjxgrrcfgchk: prev pstate 6 mapsz 1024
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: cnct bmp: 1 2
kjxgrrcfgchk: disc bmp:
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2
*** 2025-04-15T12:33:04.689167+08:00 (CDB$ROOT(1))
kjxgmrcfg: Reconfiguration started, type 3
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 119 0.
kjxgrs0h: disable CGS timeout
2025-04-15 12:33:04.705 : kjxgrDD_rr_read: Detect reconfig from inst 2, seq 120, reason 3
kjxgrsyncnewmap: mem info history
mem[1]:0x39 mem[2]:0x39
*** 2025-04-15T12:33:04.739156+08:00 (CDB$ROOT(1))
Name Service frozen
kjxgmcs: Setting state to 119 1.
kjxgrsyncnewmap: mem info history
mem[1]:0x39 mem[2]:0x39
kjxggpoll: change db group poll time to 50 ms
kjmsetrmvtots: reconfig ending, lowering RS priority
* kjfcdarmrfg: real reconfiguration detected, break out of kjfcdarmrfg
* kjfcln: domain action rcfg aborted due to CGS RCFG
*** 2025-04-15T12:33:09.565067+08:00 (CDB$ROOT(1))
=====================================================
kjxgmpoll: CGS state (119 1) start 0x67fde180 cur 0x67fde185 rcfgtm 5 sec
*** 2025-04-15T12:33:14.572779+08:00 (CDB$ROOT(1))
=====================================================
=====================================================
kjxgmpoll: CGS state (119 1) start 0x67fde180 cur 0x67fde18a rcfgtm 10 sec
*** 2025-04-15T12:33:34.580575+08:00 (CDB$ROOT(1))
=====================================================
kjxgmpoll: CGS state (119 1) start 0x67fde180 cur 0x67fde19e rcfgtm 30 sec
kjxgmpngin: started oraping facility
=====================================================
Group name: anbob
Member id: node 0 inst 1
Cached KGXGN event: 0
Group State:
State: 119 1
Flags: 0xc4:70100001 SSFlags: 0x0
Reconfig started cur-tm 0x210d183 start-tm 0x2105f76 tmout 0x55
Reconfig state 0x2 chkcnt 0
Reconfig INPG type 3 inc 119 rsn 0 data 0x0
Reconfig COMP type 1 inc 119 rsn 0 data 0x0
Commited Map: 1 2
Commited DISC Map:
Commited RECN Map:
New Map: 1 2
KGXGN Map: 1 2
DISC Map:
RECN Map:
KGXGN Map (tmp): 1 2
Master inst: 1
...
Dumping the osd state
Dumping the osd context (verbose)
dumping IPCLW connections
IPCLW:[0.26]{-}[LMOD]:UTIL: [1744691632832754]cnh 0x7f15842fb990 id 5377251930210828319 lport 169.254.8.240:28276 rport 169.254.27.60:47441 trans=UDP ts=1744690453278397 type=RECV ctx 22@2.4
IPCLW:[0.27]{-}[LMOD]:UTIL: [1744691632832754] PCNH 0x7f15842fb990 State: 1 SMSN: 521181074 PKT(521183049.1215248887) Last Rcv 0:0:48.581.581940 Last valid Rcv 0:0:48.581.581940
IPCLW:[0.28]{-}[LMOD]:UTIL: [1744691632832754] Peer: LMON.KSXP_ksipc.36096 AckSeq: 1215248887. # Coalesced: 0
ksxp:lwcnh: pt (nil) cookie 139730491191120 (unknown) (LMON) pd len 32 magic 0x2793aa31 inst 1 inc 4 pid 22 ser 1 unid 22 status 1
IPCLW:[0.29]{-}[LMOD]:UTIL: [1744691632832754]cnh 0x7f15842e7ab0 id 1263109037 lport 169.254.8.240:40281 rport 169.254.27.60:62772 trans=UDP ts=1744690453223269 type=SEND ctx 22@2.4
IPCLW:[0.30]{-}[LMOD]:UTIL: [1744691632832754] ACNH 0x7f15842e7ab0 State: 1 SMSN: 10974043 PKT(10976013.1760611214) # Pending: 0
IPCLW:[0.31]{-}[LMOD]:UTIL: [1744691632832754] Peer: LMON.KSXP_cgs.36096 AckSeq: 1760611214
ksxp:lwcnh: pt (nil) cookie 0 (unknown) (LMON) pd len 32 magic 0x2793aa31 inst 1 inc 4 pid 22 ser 1 unid 22 status 1
KSXPLW: oustanding connections 2, sysinc 119, nodes 2
dumping OSD IPCLW ctx
Dumping ksxp state
ksxppg=0x7f158a51d6f8 ksxpsg=0x4143a0b898 ksxpsg_a=0x4143a0b898ksxpssg=0x4143a0b5d0 rm=0x4083d340a0
proc state: (pid: 22) [flg: 3 sg: 1]
curts 1744691632 wtctr 1172762
Dumping ksxp contexts
Context[5] 0x7f158a4aaf50 CGS state 1
Dumping connection queue
connection count: 1
port[0] state 1 flag 1 osd 0x40003e9d7438 [(invalid key)] has requests
port count: 1 ports
2025-04-15T12:33:52.872276+08:00
The reconfiguration reason codes are as follows:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
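When scanning a long LMON trace, it can help to decode the reason code automatically. The sketch below uses the reason list above as a lookup table and pulls the code out of lines such as the kjxgrDD_rr_read and kjxgrrcfgchk messages in the trace. The regular expression and function name are illustrative assumptions, not part of any Oracle tooling.

```python
import re
from typing import Optional

# Reason codes as listed above.
RCFG_REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration.",
    2: "An instance death was detected.",
    3: "Communications Failure",
    4: "Reconfiguration after suspend",
}

# Matches both "reason 3" and "reason=3" as they appear in the LMON trace.
REASON_PATTERN = re.compile(r"reason\s*=?\s*(\d+)")


def describe_reason(trace_line: str) -> Optional[str]:
    """Return the textual meaning of the reason code in a trace line, if one is present."""
    match = REASON_PATTERN.search(trace_line)
    if match is None:
        return None
    return RCFG_REASONS.get(int(match.group(1)), "unknown reason code")


for line in (
    "kjxgrDD_rr_read: Detect reconfig from inst 2, seq 120, reason 3",
    "kjxgrrcfgchk: Initiating reconfig, reason=3",
):
    print(describe_reason(line))
```

Both trace lines decode to "Communications Failure", which is why attention turns to the interconnect rather than the PGA parameters.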
Note:
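The parameter itself was not the problem: after it was reset, startup still failed in the same way. The issue appears to lie in the ksxp network layer.
Approaches that might help: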
Database instances failed to start with error-LCK0 (ospid: xxxxxx): terminating the instance due to ORA error 481 (Doc ID 3058827.1)
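We have previously run into communication problems caused by rp_filter; setting it to 0 or 2 is recommended:
net.ipv4.conf..rp_filter = 2
net.ipv4.conf..rp_filter = 2
Try killing gpnpd.bin and gipcd.bin on all nodes.
Bug 32544124 - Instance Restart Terminated Due to DLM Error (Doc ID 32544124.8), although the kgnfscrechan stack described there did not match this case.
The customer tried none of these and simply rebooted all nodes, after which everything returned to normal.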
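For the rp_filter suggestion, the "Network Config" section of the LMON trace already records the current setting per device, so it can be checked without logging in to the host. The sketch below parses a few of those lines (copied from the trace above) and lists the interfaces still at 1, which is strict reverse-path filtering on Linux. The regular expression and function name are assumptions made for illustration.

```python
import re

# A few lines copied from the "Network Config" section of the LMON trace.
TRACE_EXCERPT = """\
bond0 .rp_filter = 1
bond1 .rp_filter = 1
ens14f0 .rp_filter = 1
lo .rp_filter = 0
"""

# One device per line: "<device> .rp_filter = <value>".
RP_FILTER_LINE = re.compile(r"^\s*(\S+)\s+\.rp_filter\s*=\s*(\d)\s*$", re.MULTILINE)


def strict_rp_filter_interfaces(trace_text: str) -> list:
    """Return the devices whose rp_filter is 1 (strict mode)."""
    return [name for name, value in RP_FILTER_LINE.findall(trace_text) if value == "1"]


print(strict_rp_filter_interfaces(TRACE_EXCERPT))
```

In this trace the private interconnect addresses sit on bond1, which is reported with rp_filter = 1, so that device would be the first candidate to review if the rp_filter theory were pursued.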