首页 » ORACLE 9i-23c » Troubleshooting 19C(19.4) CTSS start failed “Failed in clsctssslave_sync_with_master [9]” on LinuxONE

Troubleshooting 19C(19.4) CTSS start failed “Failed in clsctssslave_sync_with_master [9]” on LinuxONE

某客户一套linux container环境(LinuxOne大机)中的Oracle 19c RAC, 在启动阶段总是因为CTSS资源无法启动,导致Crs重启后需要手动干预, 通过kill local node gipcd.bin进程可以启动成功。多套环境有这问题,看来可能是Oracle软件在该环境存在缺陷,简单记录处理方法。

# GI ALERT LOG

2022-08-01 22:26:47.769 [OHASD(29527)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 29527
2022-08-01 22:26:47.941 [OHASD(29527)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2022-08-01 22:26:47.951 [OHASD(29527)]CRS-2112: The OLR service started on node anbob1.
2022-08-01 22:26:48.801 [OHASD(29527)]CRS-1301: Oracle High Availability Service started on node anbob1.
2022-08-01 22:26:48.801 [OHASD(29527)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2022-08-01 22:26:49.894 [CSSDAGENT(29706)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 29706
2022-08-01 22:26:49.910 [ORAROOTAGENT(29697)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 29697
2022-08-01 22:26:49.884 [ORAAGENT(29687)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 29687
2022-08-01 22:26:49.906 [CSSDMONITOR(29715)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 29715
2022-08-01 22:26:50.712 [ORAAGENT(29778)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 29778
2022-08-01 22:26:51.259 [MDNSD(29801)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 29801
2022-08-01 22:26:51.266 [EVMD(29803)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 29803
2022-08-01 22:26:52.294 [GPNPD(29832)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 29832
2022-08-01 22:26:52.991 [GPNPD(29832)]CRS-2328: GPNPD started on node anbob1.
2022-08-01 22:26:53.333 [GIPCD(29902)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 29902
2022-08-01 22:26:57.350 [OSYSMOND(29971)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 29971
2022-08-01 22:26:57.339 [CSSDMONITOR(29969)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 29969
2022-08-01 22:26:58.001 [CSSDAGENT(30019)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 30019
2022-08-01 22:26:58.660 [OCSSD(30097)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 30097
2022-08-01 22:26:59.824 [OCSSD(30097)]CRS-1713: CSSD daemon is started in hub mode
2022-08-01 22:27:06.052 [OCSSD(30097)]CRS-1707: Lease acquisition for node anbob1 number 1 completed
2022-08-01 22:27:07.304 [OCSSD(30097)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /u01/app/grid/diag/crs/anbob1/crs/trace/ocssd.trc
2022-08-01 22:27:07.304 [OCSSD(30097)]CRS-1617: The information required to do node kill for node anbob1 is incomplete; details at (:CSSNM00004:) in /u01/app/grid/diag/crs/anbob1/crs/trace/ocssd.trc
2022-08-01 22:27:07.308 [OCSSD(30097)]CRS-1605: CSSD voting file is online: /dev/asm-crs01; details in /u01/app/grid/diag/crs/anbob1/crs/trace/ocssd.trc.
2022-08-01 22:27:07.312 [OCSSD(30097)]CRS-1605: CSSD voting file is online: /dev/asm-crs02; details in /u01/app/grid/diag/crs/anbob1/crs/trace/ocssd.trc.
2022-08-01 22:27:07.316 [OCSSD(30097)]CRS-1605: CSSD voting file is online: /dev/asm-crs03; details in /u01/app/grid/diag/crs/anbob1/crs/trace/ocssd.trc.
2022-08-01 22:27:08.479 [OCSSD(30097)]CRS-1601: CSSD Reconfiguration complete. Active nodes are anbob1 anbob2 .
2022-08-01 22:27:10.241 [OCSSD(30097)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2022-08-01 22:27:10.703 [OCTSSD(30383)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 30383
2022-08-01 22:27:13.205 [OCTSSD(30383)]CRS-2407: The new Cluster Time Synchronization Service reference node is host anbob2.
2022-08-01 22:27:13.206 [OCTSSD(30383)]CRS-2401: The Cluster Time Synchronization Service started on host anbob1.
2022-08-01 22:27:13.301 [OCTSSD(30383)]CRS-2402: The Cluster Time Synchronization Service aborted on host anbob1. Details at (:ctsselect_mstm5:) in /u01/app/grid/diag/crs/anbob1/crs/trace/octssd.trc.

# CTSSD LOG

grid@anbob1:/u01/app/grid/diag/crs/anbob1/crs/trace$ sed -n '/2022-08-01 22:06/,$p' /u01/app/grid/diag/crs/anbob1/crs/trace/octssd.trc|egrep -i 'fail|error'
 default:2493454480: u_set_comp_error: comptype '103' : error '29'
2022-08-01 22:27:11.402 :  CRSCCL:2493454480: Failed to init OCR to get active version. PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]Checking active version in OLR.
2022-08-01 22:27:13.071 :GIPCHGEN:2355624208:  gipchaNodeAddInterfaceF [gipchaDaemonProcessInfUpdate : gipchaDaemonThread.c : 6109]: adding interface info 0x3ff7c073ed0 { host '', haName '2f95-xxx-xxx-c28d', local (nil), ip '10.xxx.xxx.1', subnet '10.xxx.xxx.0', mask '255.255.255.0', mac '82-02-9b-xx-a1-ab', ifname 'bond1.3202', numRef 0, numFail 0, idxBoot 0, flags 0x1 }
2022-08-01 22:27:13.071 :GIPCHDEM:2355624208:  gipchaDaemonProcessFailTransientInfs: failed transient interfaces (if any) for host anbob1 haname 2f95-4450-5dcb-c28d
2022-08-01 22:27:13.251 :GIPCHGEN:2355624208:  gipchaNodeAddInterfaceF [gipchaDaemonProcessInfUpdate : gipchaDaemonThread.c : 6109]: adding interface info 0x3ff7c08f5c0 { host 'anbob2', haName 'f691-xxx-xxx-1179', local (nil), ip '10.xxx.xxx.2:37277', subnet '10.xxx.xxx.0', mask '255.255.255.0', mac '', ifname 'bond1.3202', numRef 0, numFail 0, idxBoot 0, flags 0x2 }
2022-08-01 22:27:13.251 :GIPCHDEM:2355624208:  gipchaDaemonProcessFailTransientInfs: failed transient interfaces (if any) for host anbob2 haname f691-e88e-fadc-1179
2022-08-01 22:27:13.251 :GIPCHGEN:2357721360:  gipchaWorkerAttachInterface: Interface attached inf 0x3ff7c08f5c0 { host 'anbob2', haName 'f691-e88e-fadc-1179', local 0x3ff7c073ed0, ip '10.xxx.xxx.2:37277', subnet '10.xxx.xxx.0', mask '255.255.255.0', mac '', ifname 'bond1.3202', numRef 0, numFail 0, idxBoot 0, flags 0x6 }
2022-08-01 22:27:13.300 :GIPCXCPT:2357721360:  gipchaUpperProcessDisconnect: dropping Disconnect to unknown msg 0x3ff78051480 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-0e0fd1d3, dstCid 00000000-00001314 }, node 0x3ff7c074820 { host 'anbob2', haName 'f691-e88e-fadc-1179', srcLuid b997f400-f243ff47, dstLuid 9c681c2a-3a876b16 numInf 1, sentRegister 1, localMonitor 0, baseStream 0x3ff7c073670 type gipchaNodeType12001 (20), nodeIncarnation 00000000-0000da90, incarnation 3, cssIncarnation 0, negDigest 4294967295, roundTripTime 4294967295 lastSeenPingAck 0 nextPingId 1 latencySrc 0 latencyDst 0 flags 0x800080c}, ret gipcretFail (1)
2022-08-01 22:27:13.300 :GIPCHAUP:2357721360:  gipchaUpperProcessDisconnect: EXCEPTION[ ret gipcretFail (1) ]  error during DISCONNECT processing for node 0x3ff7c074820 { host 'anbob2', haName 'f691-e88e-fadc-1179', srcLuid b997f400-f243ff47, dstLuid 9c681c2a-3a876b16 numInf 1, sentRegister 1, localMonitor 0, baseStream 0x3ff7c073670 type gipchaNodeType12001 (20), nodeIncarnation 00000000-0000da90, incarnation 3, cssIncarnation 0, negDigest 4294967295, roundTripTime 4294967295 lastSeenPingAck 0 nextPingId 1 latencySrc 0 latencyDst 0 flags 0x800080c}
2022-08-01 22:27:13.301 :    CTSS:1995954448: ctssslave_swm: Failed in sclsctss_set_timeofday [1] with argument newtime{sec[1659364034], usec[-528297]}.
2022-08-01 22:27:13.301 :    CTSS:1995954448: (:ctsselect_mstm5:): Failed in clsctssslave_sync_with_master [9].

检查另一节点CTSS MODE

grid@anbob2:/home/grid$ crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       anbob2                Started,STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       anbob2                STABLE
ora.crf
      1        ONLINE  ONLINE       anbob2                STABLE
ora.crsd
      1        ONLINE  ONLINE       anbob2                STABLE
ora.cssd
      1        ONLINE  ONLINE       anbob2                STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       anbob2                STABLE
ora.ctssd
      1        ONLINE  ONLINE       anbob2                ACTIVE:0,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       anbob2                STABLE
ora.gipcd
      1        ONLINE  ONLINE       anbob2                STABLE
ora.gpnpd
      1        ONLINE  ONLINE       anbob2                STABLE
ora.mdnsd
      1        ONLINE  ONLINE       anbob2                STABLE
ora.storage
      1        ONLINE  ONLINE       anbob2                STABLE
--------------------------------------------------------------------------------

grid@anbob2:/u01/app/grid/diag/crs/anbob2/crs/trace$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0

# ps -ef|grep chronyd
--chronyd运行中

Note:
另一节点运行在ACTIVE MODE, 虽然OS 没有配置NTP, 但是在LINUX7中默认使用了chronyd, oracle 19c初期版本并不会检查chronyd 服务。

什么是CTSS
Cluster time synchronization service daemon (CTSS): Oracle Clusterware 11g release 2 (11.2)引入的新特性,用于检查集群间的时间同步,如果OS配置了NTP没有配置或配置不当时, CTSS可以同步所有节点的时间与CTSS Mast node保持一致。在安装和启动时会检查NTP是否存在,Windows上还会检查 Windows Time Service,如果发现已经配置了NTP 进程, 并且已经配置NTP如/etc/ntd.conf存在, 也不会影响CTSS的安装和启动,只是运行在observer mode观望模式,此状态的CTSS不会修复时间。 像CRS/EVM/OCR/CTSS从 11.2.0.2起使用gipc通信, 所以上面为什么有时kill GIPC进程,CTSS又可以启动。 在LINUX 7中引入了chronyd服务 替换了过去的NTP, 但是Oracle 19c CRS初期的的版本只检查NTP,不检查chronyd, 所以注意CTSS和chronyd同时运行,推荐顺序 chronyd NTP CTSS。

检查CTSS 运行在 active or observer mode

$ crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode. 
or
CRS-4701: The Cluster Time Synchronization Service is in Active mode.

# NODE1 CTSS log

2022-08-01 22:54:13.553 :    CTSS:2195769616: ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2022-08-01 22:54:13.553 :    CTSS:2197866768: ctsscomm_recv_cb4_2: Receive active version change msg. Old active version [318767104] New active version [318767104].
2022-08-01 22:54:13.553 :    CTSS:2197866768: ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2022-08-01 22:54:13.553 :    CTSS:2195769616: ctssslave_swm2_3: Received time sync message from master.
2022-08-01 22:54:13.553 :    CTSS:2195769616: clsctssslave.c:781:CLSCTSS_CVAL_SET: &cval[ac51980c] prev_value[0] new_value[2408] start_tm [0] curr_tm [2403574996] period [1800000].
2022-08-01 22:54:13.553 :    CTSS:2195769616: ctssslave_swm: Failed in sclsctss_set_timeofday [1] with argument newtime{sec[1659365655], usec[-565199]}.
    CTSS:2195769616: SLOS : SLOS: cat=-2, opn=settimeofday, dep=22, loc=sctss_std.2

2022-08-01 22:54:13.553 :    CTSS:2195769616: (:ctsselect_mstm5:): Failed in clsctssslave_sync_with_master [9].
2022-08-01 22:54:13.553 :    CTSS:2195769616: ctsselect_monitor_steysync_mode: CTSS daemon exiting [9].
2022-08-01 22:54:13.553 :    CTSS:2197866768: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2022-08-01 22:54:13.553 :    CTSS:2195769616: CTSS daemon aborting

clsctssslave_sync_with_master [9] 内部含义
CTSSD in a Linux container try to become active and failed in sclsctss_set_timeofday() ,Added sclsctss_is_container() to detect if CTSSD is running in a Linux  container and force it to run in observer mode

像Bug 20827053 Linux: On Linux containers CTSSD may not start中记录,该BUG在12.2已修复, 但是报错内容一样, 如果错误代码是[12]时 启动检查节点间是否时间相差太大。

尝试创建ntp.conf
$ touch /etc/ntp.conf

重启CRS,启动正常。

打赏

,

对不起,这篇文章暂时关闭评论。