
Oracle 19c RAC frequent restarts, OS log shows "avahi-daemon: Withdrawing address record"

There are always innovative customers at the leading edge of technology, but the biggest worry in such environments is that problems have no precedent to consult. Recently, in a very new environment, an Oracle 19c two-node RAC on an IBM LinuxONE mainframe, the Oracle instances on some nodes of the machine restarted frequently, and before each restart the OS log contained the message "avahi-daemon[4537]: Withdrawing address record for 28.83.70.4 on bond0.3112". This post briefly documents the case.

Environment information (from the DB alert log)

NOTE: remote asm mode is remote (mode 0x2; from cluster type)
NOTE: Cluster configuration type = CLUSTER [4]
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.4.0.0.0.
ORACLE_HOME:    /u01/app/oracle/product/19.0.0/dbhome_1
System name:	Linux
Node name:	anboblpa4
Release:	3.10.0-693.el7.s390x
Version:	#1 SMP Thu Jul 6 20:01:53 EDT 2017
Machine:	s390x
...
Dynamic remastering is disabled
List of instances (total 2) :
 1 2
My inst 2 (I'm a new instance) 
...
Ping without log force is disabled:
  not an Exadata system.
Buffer Cache Full DB Caching mode changing from FULL CACHING DISABLED to FULL CACHING ENABLED 
Picked broadcast on commit scheme to generate SCNs
Endian type of dictionary set to big
2019-10-11T14:07:47.939549+08:00
Redo log for group 3, sequence 1 is not located on DAX storage
...
===========================================================
Dumping current patch information
===========================================================
Patch Id: 29585399
Patch Description: OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
Patch Apply Time: 2019-06-02T04:42:42+08:00

...
Patch Id: 29834717
Patch Description: Database Release Update : 19.4.0.0.190716 (29834717)
Patch Apply Time: 2019-10-09T13:50:10+08:00
...
Patch Id: 29774421
Patch Description: OJVM RELEASE UPDATE: 19.4.0.0.190716 (29774421)
Patch Apply Time: 2019-10-09T13:51:42+08:00

Database alert log

2020-01-04T10:04:31.601196+08:00
Thread 2 advanced to log sequence 73905 (LGWR switch)
  Current log# 24 seq# 73905 mem# 0: +DATADG/A2CDB/ONLINELOG/group_24.405.1028475833
  Current log# 24 seq# 73905 mem# 1: +DATADG/A2CDB/ONLINELOG/group_24.392.1028475843
2020-01-04T10:04:31.747748+08:00
ARC1 (PID:9942): Archived Log entry 38639 added for T-2.S-73904 ID 0xffffffff9876a713 LAD:1
2020-01-04T10:17:28.893387+08:00
Thread 2 advanced to log sequence 73906 (LGWR switch)
  Current log# 25 seq# 73906 mem# 0: +DATADG/A2CDB/ONLINELOG/group_25.404.1028475833
  Current log# 25 seq# 73906 mem# 1: +DATADG/A2CDB/ONLINELOG/group_25.388.1028475843
2020-01-04T10:17:29.016138+08:00
ARC2 (PID:9944): Archived Log entry 38643 added for T-2.S-73905 ID 0xffffffff9876a713 LAD:1
2020-01-04T10:29:23.921516+08:00
PMON (ospid: ): terminating the instance due to ORA error 
2020-01-04T10:29:23.921625+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 5, ospid: 8726, IPC0)'
2020-01-04T10:29:25.777379+08:00
Instance terminated by PMON, pid = 8718
2020-01-04T10:32:01.383411+08:00
Starting ORACLE instance (normal) (OS id: 8531)
....
2020-01-04T10:32:30.496195+08:00
CJQ0 started with pid=96, OS id=10437 
Completed: ALTER DATABASE OPEN /* db agent *//* {2:17390:2} */
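The pattern in the excerpt above (normal log switches, then a sudden PMON termination and restart) repeats on each occurrence, so the key events can be pulled out of the alert log with a quick filter. A minimal sketch, with the function name my own and the patterns taken from the messages above:

```shell
# alert_key_events FILE  -- print instance-termination and startup lines from a
# database alert log (the patterns match the excerpt above)
alert_key_events() {
  grep -E 'terminating the instance|Instance terminated|Starting ORACLE instance' "$1"
}
```

Running it over the alert logs of both nodes gives a compact timeline to line up against the CRS and OS logs.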

CRSD.LOG

## crsd.log
2020-01-04 10:27:31.168 :GIPCHTHR:2329880848:  gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30041 loopCount 60 sendCount 12 recvCount 36 postCount 12 sendCmplCount 12 recvCmplCount 12
2020-01-04 10:27:32.269 :GIPCHTHR:2327783696:  gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30041loopCount 47
2020-01-04 10:28:01.229 :GIPCHTHR:2329880848:  gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30061 loopCount 62 sendCount 12 recvCount 36 postCount 12 sendCmplCount 12 recvCmplCount 12
2020-01-04 10:28:02.331 :GIPCHTHR:2327783696:  gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30062loopCount 46
2020-01-04 10:28:31.273 :GIPCHTHR:2329880848:  gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30044 loopCount 67 sendCount 14 recvCount 42 postCount 14 sendCmplCount 14 recvCmplCount 14
2020-01-04 10:28:32.374 :GIPCHTHR:2327783696:  gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30043loopCount 46
2020-01-04 10:29:01.315 :GIPCHTHR:2329880848:  gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30042 loopCount 60 sendCount 12 recvCount 36 postCount 12 sendCmplCount 12 recvCmplCount 12
2020-01-04 10:29:02.415 :GIPCHTHR:2327783696:  gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30041loopCount 46
2020-01-04 10:29:19.978 :UiServer:749717776: [     INFO] {2:17417:6560} Container [ Name: FENCESERVER
	API_HDR_VER: 
	TextMessage[3]
	CLIENT: 
	TextMessage[]
	CLIENT_NAME: 
	TextMessage[ocssd.bin]
	CLIENT_PID: 
	TextMessage[6809]
	CLIENT_PRIMARY_GROUP: 
	TextMessage[asmadmin]
	LOCALE: 
	TextMessage[AMERICAN_AMERICA.AL32UTF8]
]
2020-01-04 10:29:19.978 :UiServer:749717776: [     INFO] {2:17417:6560} Sending message to AGFW. ctx= 0x3fec404e010, Client PID: 6809
2020-01-04 10:29:19.978 :  OCRAPI:749717776: procr_beg_asmshut: OCR ctx set to donotterminate state. Return [0].  <<<<<<<<<<<<<<<<<<<
2020-01-04 10:29:19.978 :UiServer:749717776: [     INFO] {2:17417:6560} Force-disconnecting [21]  existing PE clients...
2020-01-04 10:29:19.978 :UiServer:749717776: [     INFO] {2:17417:6560} Disconnecting client of command id :28
2020-01-04 10:29:19.978 :UiServer:749717776: [     INFO] {2:17417:6560} Disconnecting client of command id :43
2020-01-04 10:29:19.978 :UiServer:749717776: [     INFO] {2:17417:6560} Disconnecting client of command id :46
...
2020-01-04 10:29:19.979 :UiServer:749717776: [     INFO] {2:17417:6560} Disconnecting client of command id :272
2020-01-04 10:29:19.979 :UiServer:749717776: [     INFO] {2:17417:6560} Disconnecting client of command id :273
2020-01-04 10:29:19.979 :UiServer:749717776: [     INFO] {2:17417:6560} Disconnecting client of command id :275
2020-01-04 10:29:19.979 :UiServer:749717776: [     INFO] {2:17417:6560} Disconnecting client of command id :292
2020-01-04 10:29:19.979 :UiServer:749717776: [     INFO] {2:17417:6560} Sending message: 554957 to AGFW proxy server.
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} Agfw Proxy Server received the message: FENCE_CMD[Proxy] ID 20489:554957   
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.ASMNET1LSNR_ASM.lsnr 2 1] ID 4100:554958 to the agent /u01/app/19.0.0/grid/bin/oraagent_grid
2020-01-04 10:29:19.979 :   CRSPE:768592144: [     INFO] {2:17417:6560} Skipping Fence of : ora.CRSDG.dg
2020-01-04 10:29:19.979 :   CRSPE:768592144: [     INFO] {2:17417:6560} Skipping Fence of : ora.DATADG.dg
2020-01-04 10:29:19.979 :   CRSPE:768592144: [     INFO] {2:17417:6560} Skipping Fence of : ora.FRADG.dg
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.LISTENER.lsnr anboblpa4 1] ID 4100:554959 to the agent /u01/app/19.0.0/grid/bin/oraagent_grid
2020-01-04 10:29:19.979 :   CRSPE:768592144: [     INFO] {2:17417:6560} Skipping Fence of : ora.a2cdb.db
2020-01-04 10:29:19.979 :   CRSPE:768592144: [     INFO] {2:17417:6560} Skipping Fence of : ora.asm
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.asmnet1.asmnetwork 2 1] ID 4100:554960 to the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.net1.network anboblpa4 1] ID 4100:554961 to the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.ons anboblpa4 1] ID 4100:554962 to the agent /u01/app/19.0.0/grid/bin/oraagent_grid
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.anboblpa4.vip 1 1] ID 4100:554963 to the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} dumpPendingFences:Started Resource Fencing, Count = 6
2020-01-04 10:29:19.979 :    AGFW:768592144: [     INFO] {2:17417:6560} ora.ASMNET1LSNR_ASM.lsnr 2 1;ora.LISTENER.lsnr anboblpa4 1;ora.asmnet1.asmnetwork 2 1;ora.net1.network anboblpa4 1;ora.ons anboblpa4 1;ora.anboblpa4.vip 1 1;---
2020-01-04 10:29:19.981 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.asmnet1.asmnetwork 2 1] ID 4100:554960 from the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:19.981 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.asmnet1.asmnetwork 2 1] ID 4100:554960 from the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:19.981 :    AGFW:768592144: [     INFO] {2:17417:6560} Fenced off the resource [ora.asmnet1.asmnetwork]  
2020-01-04 10:29:19.981 :    AGFW:768592144: [     INFO] {2:17417:6560} dumpPendingFences:Pending Fence(s), Count = 5
2020-01-04 10:29:19.981 :    AGFW:768592144: [     INFO] {2:17417:6560} ora.ASMNET1LSNR_ASM.lsnr 2 1;ora.LISTENER.lsnr anboblpa4 1;ora.net1.network anboblpa4 1;ora.ons anboblpa4 1;ora.anboblpa4.vip 1 1;---
2020-01-04 10:29:19.982 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.net1.network anboblpa4 1] ID 4100:554961 from the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:19.982 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.net1.network anboblpa4 1] ID 4100:554961 from the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:19.982 :    AGFW:768592144: [     INFO] {2:17417:6560} Fenced off the resource [ora.net1.network]  
2020-01-04 10:29:19.982 :    AGFW:768592144: [     INFO] {2:17417:6560} dumpPendingFences:Pending Fence(s), Count = 4
2020-01-04 10:29:19.982 :    AGFW:768592144: [     INFO] {2:17417:6560} ora.ASMNET1LSNR_ASM.lsnr 2 1;ora.LISTENER.lsnr anboblpa4 1;ora.ons anboblpa4 1;ora.anboblpa4.vip 1 1;---
2020-01-04 10:29:19.984 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.ASMNET1LSNR_ASM.lsnr 2 1] ID 4100:554958 from the agent /u01/app/19.0.0/grid/bin/oraagent_grid
2020-01-04 10:29:19.984 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.LISTENER.lsnr anboblpa4 1] ID 4100:554959 from the agent /u01/app/19.0.0/grid/bin/oraagent_grid
2020-01-04 10:29:20.089 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.anboblpa4.vip 1 1] ID 4100:554963 from the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:20.089 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.anboblpa4.vip 1 1] ID 4100:554963 from the agent /u01/app/19.0.0/grid/bin/orarootagent_root
2020-01-04 10:29:20.089 :    AGFW:768592144: [     INFO] {2:17417:6560} Fenced off the resource [ora.anboblpa4.vip] <<<<<<<<<<<<<<<<<<<
2020-01-04 10:29:20.089 :    AGFW:768592144: [     INFO] {2:17417:6560} dumpPendingFences:Pending Fence(s), Count = 3
2020-01-04 10:29:20.089 :    AGFW:768592144: [     INFO] {2:17417:6560} ora.ASMNET1LSNR_ASM.lsnr 2 1;ora.LISTENER.lsnr anboblpa4 1;ora.ons anboblpa4 1;---
...
2020-01-04 10:29:21.640 :  OCRSRV:805304592: proas_amiwriter: ctx is in some other state
2020-01-04 10:29:21.687 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.LISTENER.lsnr anboblpa4 1] ID 4100:554959 from the agent /u01/app/19.0.0/grid/bin/oraagent_grid
2020-01-04 10:29:21.687 :    AGFW:768592144: [     INFO] {2:17417:6560} Fenced off the resource [ora.LISTENER.lsnr]
2020-01-04 10:29:21.687 :    AGFW:768592144: [     INFO] {2:17417:6560} dumpPendingFences:Pending Fence(s), Count = 1
2020-01-04 10:29:21.687 :    AGFW:768592144: [     INFO] {2:17417:6560} ora.ASMNET1LSNR_ASM.lsnr 2 1;---
2020-01-04 10:29:21.687 :    AGFW:768592144: [     INFO] {2:17417:6560} Received the reply to the message: RESOURCE_CLEAN[ora.ASMNET1LSNR_ASM.lsnr 2 1] ID 4100:554958 from the agent /u01/app/19.0.0/grid/bin/oraagent_grid
2020-01-04 10:29:21.687 :    AGFW:768592144: [     INFO] {2:17417:6560} Fenced off the resource [ora.ASMNET1LSNR_ASM.lsnr]
2020-01-04 10:29:21.687 :    AGFW:768592144: [     INFO] {2:17417:6560} dumpPendingFences:Pending Fence(s), Count = 0
2020-01-04 10:29:21.688 :   CRSPE:768592144: [     INFO] {2:17417:6560} Fence command status: UI_DATA
2020-01-04 10:29:21.688 :    AGFW:768592144: [     INFO] {2:17417:6560} Fence command completed, rc = 0
2020-01-04 10:29:21.688 :UiServer:749717776: [     INFO] {2:17417:6560} UI server recvd reply from Agfw Proxy Server: 116530
2020-01-04 10:29:21.688 :UiServer:749717776: [     INFO] {2:17417:6560} Response: c1|7!UI_DATAk6|RESULTt1|0
2020-01-04 10:29:21.688 :UiServer:749717776: [     INFO] {2:17417:6560} Done for ctx=0x3fec404e010
2020-01-04 10:29:21.690 :  OCRSRV:805304592: proas_amiwriter: ctx is in some other state
Trace file /u01/app/grid/diag/crs/anboblpa4/crs/trace/crsd.trc
Oracle Database 19c Clusterware Release 19.0.0.0.0 - Production
Version 19.4.0.0.0 Copyright 1996, 2019 Oracle. All rights reserved.
 default:2381237440: 1: clskec:has:CLSU:910 4 args[clsdAdr_CLSK_err][mod=clsdadr.c][loc=(:CLSD00302:)][msg=2020-01-04 10:29:21.861 (:CLSD00302:) Trace file size and number of segments fetched from environment variable: ORA_DAEMON_TRACE_FILE_OPTIONS filesize=26214400,numsegments=10    Detected in function clsdAdrGetEnvVar_TFO at line number 6819. ]
...
    CLSB:2652294336: [     INFO] Argument count (argc) for this daemon is 2
    CLSB:2652294336: [     INFO] Argument 0 is: /u01/app/19.0.0/grid/bin/crsd.bin
    CLSB:2652294336: [     INFO] Argument 1 is: reboot
2020-01-04 10:31:18.928 : CRSMAIN:2652294336: [     INFO]  First attempt: init CSS context succeeded.
2020-01-04 10:31:18.928 : CRSMAIN:2652294336: [     INFO]  Start mode: normal
2020-01-04 10:31:18.930 :  CLSDMT:2146957584: [     INFO] PID for the Process [7535], connkey CRSD
2020-01-04 10:31:19.305 : CRSMAIN:2652294336: [     INFO]  CRS Daemon Starting

ocssd.log

2020-01-04 10:29:15.128 :    CSSD:2514209040: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:2
2020-01-04 10:29:15.128 :    CSSD:2526538000: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:2, datatype 1 datasize 4
2020-01-04 10:29:15.129 :    CSSD:2511063312: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:2 from clientID 2:49:2
2020-01-04 10:29:16.683 :    CSSD:2520246544: [     INFO] clssgmpcGMCReqWorkerThread: processing msg (0x3ff74043d70) type 2, msg size 76, payload (0x3ff74043d9c) size 32, sequence 1505227, for clientID 2:49:2
2020-01-04 10:29:17.178 :    CSSD:2514209040: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:2
2020-01-04 10:29:17.178 :    CSSD:2514209040: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:2
2020-01-04 10:29:17.178 :    CSSD:2526538000: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:2, datatype 1 datasize 4
2020-01-04 10:29:17.179 :    CSSD:2511063312: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:2 from clientID 2:49:2
2020-01-04 10:29:18.229 :    CSSD:2511063312: [     INFO] clssgmcpGroupDataResp: sending type 5, size 166, status 0 to clientID 2:24:0
2020-01-04 10:29:18.229 :    CSSD:2514209040: [     INFO] clssgmMemberPublicInfo: group DG_CRSDG member 0 not found
2020-01-04 10:29:18.229 :    CSSD:2517100816: [     INFO] clssgmpcGMCReqWorkerThread: processing msg (0x3ff7403e390) type 2, msg size 85, payload (0x3ff7403e3bc) size 41, sequence 1505230, for clientID 2:24:0
2020-01-04 10:29:18.552 :    CSSD:839383312: [     INFO] clssnmSendingThread: sending status msg to all nodes
2020-01-04 10:29:18.552 :    CSSD:839383312: [     INFO] clssnmSendingThread: sent 4 status msgs to all nodes
2020-01-04 10:29:19.238 :    CSSD:2514209040: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:2
2020-01-04 10:29:19.238 :    CSSD:2514209040: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:2
2020-01-04 10:29:19.238 :    CSSD:2526538000: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:2, datatype 1 datasize 4
2020-01-04 10:29:19.239 :    CSSD:2511063312: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:2 from clientID 2:49:2
2020-01-04 10:29:19.908 :    CSSD:2514209040: [     INFO]   : Processing member data change type 1, size 600 for group GR+DB_A2CDB, memberID 119:2:1  
2020-01-04 10:29:19.908 :    CSSD:2514209040: [     INFO]   : Sending member data change to GMP for group GR+DB_A2CDB, memberID 119:2:1 
2020-01-04 10:29:19.908 :    CSSD:2526538000: [     INFO] clssgmpcMemberDataUpdt: grockName GR+DB_A2CDB memberID 119:2:1, datatype 1 datasize 600
2020-01-04 10:29:19.909 :    CSSD:2511063312: [     INFO] clssgmSendEventsToMbrs: Group GR+DB_A2CDB, member count 1, event master 0, event type 6, event incarn 21350, event member count 1, pids 8779-13034,  
2020-01-04 10:29:19.909 :    CSSD:2511063312: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 119:2:1 from clientID 2:87:2
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssgmTermMember: Terminating memberID 119:2:1 (0x3ff400783a0) in grock GR+DB_A2CDB <<<<<<<<<<<<<<<<<<<
2020-01-04 10:29:19.909 :    CSSD:2514209040: ASSERT clsssc.c 8607
2020-01-04 10:29:19.909 :    CSSD:2514209040: clssscRefFree: object(0x3ff4842b670) has 0 reference prior to decrement, object may have been deallocated! <<<<<<<<<<<<<<<<<<<
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: Entered 
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: skipping 0 defined 0 
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: expiring 0  evicted 0 evicting node 0 this node 1
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: expiring 0  evicted 0 evicting node 0 this node 2
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: skipping 3 defined 0 
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: skipping 4 defined 0 
... 
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: skipping 30 defined 0 
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmCheckForNetworkFailure: skipping 31 defined 0 
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssscExit: Call to clscal flush successful and clearing the CLSSSCCTX_INIT_CALOG flag so that no further CA logging happens
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] clssnmRemoveNodeInTerm: node 2, anboblpa4 terminated. Removing from its own member and connected bitmaps
2020-01-04 10:29:19.909 :    CSSD:2514209040: [    ERROR] ###################################
2020-01-04 10:29:19.909 :    CSSD:2514209040: [    ERROR] clssscExit: CSSD aborting from thread GMClientListener
2020-01-04 10:29:19.909 :    CSSD:2514209040: [    ERROR] ###################################
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] (:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2020-01-04 10:29:19.909 :    CSSD:2514209040: [     INFO] ####### Begin Diagnostic Dump #######
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO] ### Begin diagnostic data for the Core layer ###
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO] Initialization successfully completed OK
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO] Initialization of EXADATA fencing successfully completed OK
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO] #### End diagnostic data for the Core layer ####
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO] ### Begin diagnostic data for the GM Client layer ###
2020-01-04 10:29:19.910 :    CSSD:2514209040: Status for clientID 2:11:2, pid(6609-5118), GIPC endpt 0x261e, flags 0x0004, refcount 3, aborted at 0, fence is not progress   OK
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO]   memberID 5:2:2, group EVMDMAIN2 refcount 3, state 0x0000, granted 0, fence is not in progress  OK
2020-01-04 10:29:19.910 :    CSSD:2514209040: Status for clientID 2:13:1, pid(6758-5541), GIPC endpt 0x3ebd, flags 0x0004, refcount 3, aborted at 0, fence is not progress   OK
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO]   memberID 11:2:0, group CRF- refcount 3, state 0x0000, granted 0, fence is not in progress  OK
...
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO]   memberID 26:2:2, group ocr_dblpa4-cluster refcount 3, state 0x0000, granted 0, fence is not in progress  OK
2020-01-04 10:29:19.910 :    CSSD:2514209040: Status for clientID 2:29:4, pid(7488-7910), GIPC endpt 0x3a1b, flags 0x0006, refcount 3, aborted at 0, fence is not progress   OK
2020-01-04 10:29:19.910 :    CSSD:2514209040: [     INFO]   memberID 7:2:1, group crs_version refcount 3, state 0x0000, granted 0, fence is not in progress  OK
...
01-04 10:29:19.911 :    CSSD:2514209040: Status for clientID 2:579:2, pid(9208-13294), GIPC endpt 0x45fbd4, flags 0x0002, refcount 2, aborted at 0, fence is not progress   OK
2020-01-04 10:29:19.911 :    CSSD:2514209040: Status for clientID 2:84:1, pid(8726-13006), GIPC endpt 0xbc75, flags 0x3000, refcount 3, aborted at 0, fence is not progress   OK
2020-01-04 10:29:19.911 :    CSSD:2514209040: [     INFO]   memberID 104:2:1, group IPC0_GRP_dblpa4-cluster_a2cdb refcount 3, state 0x0000, granted 0, fence is not in progress  OK
2020-01-04 10:29:19.911 :    CSSD:2514209040: Status for clientID 2:85:1, pid(8742-13021), GIPC endpt 0xbe61, flags 0x2000, refcount 3, aborted at 0, fence is not progress   OK
----- Call Stack Trace -----
2020-01-04 10:29:19.913 :    CSSD:2514209040: calling              call     entry                argument values in hex      
2020-01-04 10:29:19.913 :    CSSD:2514209040: location             type     point                (? means dubious value)     
2020-01-04 10:29:19.913 :    CSSD:2514209040: -------------------- -------- -------------------- ----------------------------
2020-01-04 10:29:19.913 :    CSSD:2514209040: ssdgetcall: Failure to recover Stack Trace: starting frame address is (nil)
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscExit()+1860    call     kgdsdst()            
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscAssert()+210   call     clssscExit()         
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscRefFreeInt()+256  call     clssscAssert()                                                  
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssgmTermMember()+206   call     clssscRefFreeInt()                                                
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssgmcClientDestroy()+266  call     clssgmTermMember()   
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscHTDestroyObj()+334  call     clssgmcClientDestroy()                                         
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscHTRefDestroyObj()+50  call     clssscHTDestroyObj()                                          
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscRefFreeInt()+180   call     clssscHTRefDestroyObj()  
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssgmclienteventhndlr()+2556   call     clssscRefFreeInt()                   
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscSelect()+1568  call     clssgmclienteventhndlr()     
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssgmProcClientReqs()+2204  call     clssscSelect()                                     
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssgmclientlsnr()+  call     clssgmProcClientReqs()                  
2020-01-04 10:29:19.914 :    CSSD:2514209040: clssscthrdmain()+26  call     clssgmclientlsnr()                                                  
2020-01-04 10:29:19.914 :    CSSD:2514209040: start_thread()+234   call     clssscthrdmain()     
...
2020-01-04 10:29:20.054 :    CSSD:840956176: [     INFO] clssnmCheckForNetworkFailure: skipping 30 defined 0 
2020-01-04 10:29:20.054 :    CSSD:840956176: [     INFO] clssnmCheckForNetworkFailure: skipping 31 defined 0 
2020-01-04 10:29:20.054 :    CSSD:840956176: [    ERROR] clssscExit: CSSD aborting from thread clssnmPollingThread
2020-01-04 10:29:20.054 :    CSSD:840956176: [     INFO] clssscExit: abort already set 1
...
2020-01-04 10:29:21.057 :    CSSD:2520246544: [     INFO] clssgmRPC: RPC(#1) to node(8298) not sent due to impending local GM shutdown
2020-01-04 10:29:21.057 :    CSSD:2520246544: [     INFO] clssgmRPC: failed to send RPC#8298 to node(1), rpcret(10), master(1), DBInfo(1), masterRPC(1),unsentRPC(0), queuing RPC to unsent queue
2020-01-04 10:29:21.057 :    CSSD:2520246544: [     INFO] clssgmDelMemCmpl: rpc 0x3ff96c83f90, ret 10, clientID 2:34:10 memberID 7:2:4
2020-01-04 10:29:21.058 :    CSSD:2511063312: [     INFO] clssgmcpMbrDeleteResp: Status -16 deleting memberID 7:2:4 from clientID 2:34:10
2020-01-04 10:29:21.288 :    CSSD:2751404192: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:2
2020-01-04 10:29:21.288 :    CSSD:2751404192: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:2
2020-01-04 10:29:21.288 :    CSSD:2526538000: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:2, datatype 1 datasize 4
2020-01-04 10:29:21.288 :    CSSD:2526538000: [     INFO] clssgmpeersend: Local node is terminating, send aborted for node dx2dblpa4, number 1
2020-01-04 10:29:21.288 :    CSSD:2526538000: [  WARNING] clssgmBroadcastMap: clssgmpeersend node(1) failed - 0
2020-01-04 10:29:21.288 :    CSSD:2511063312: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:2 from clientID 2:49:2
2020-01-04 10:29:21.688 :    CSSD:2514209040: [     INFO] clssscExit: CRSD cleanup status 0
2020-01-04 10:29:21.690 :    CSSD:2514209040: [     INFO] clssscagProcAgReq: Sending response that CSSD is shutting down reason 0x0
2020-01-04 10:29:21.690 :    CSSD:2514209040: [     INFO] clssscExit: CRSD cleanup successfully completed
2020-01-04 10:29:21.691 :    CSSD:2514209040: [     INFO] clssgmKillAllClients: Adding pid 7488-7910 to the kill request
2020-01-04 10:29:21.691 :    CSSD:2514209040: [     INFO] clssgmKillAllClients: Adding pid 7812-8904 to the kill request

ohasd_cssdagent_root

 2020-01-04 10:29:33.535 [CSSDAGENT(6791)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 13960 milliseconds

OS messages LOG

Jan  4 10:29:17 anboblpa4 systemd-logind: New session 43767 of user grid.
Jan  4 10:29:17 anboblpa4 systemd: Started Session 43767 of user grid.
Jan  4 10:29:17 anboblpa4 systemd: Starting Session 43767 of user grid.
Jan  4 10:29:18 anboblpa4 systemd-logind: Removed session 43767.
Jan  4 10:29:19 anboblpa4 avahi-daemon[4537]: Withdrawing address record for 28.83.70.4 on bond0.3112.
Jan  4 10:27:14 anboblpa4 journal: Runtime journal is using 8.0M (max allowed 3.1G, trying to leave 4.0G free of 31.4G available → current limit 3.1G).
Jan  4 10:27:14 anboblpa4 kernel: Initializing cgroup subsys cpuset
Jan  4 10:27:14 anboblpa4 kernel: Initializing cgroup subsys cpu
Jan  4 10:27:14 anboblpa4 kernel: Initializing cgroup subsys cpuacct
Jan  4 10:27:14 anboblpa4 kernel: Linux version 3.10.0-693.el7.s390x (mockbuild@s390-014.build.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Jul 6 20:01:53 EDT 2017
Jan  4 10:27:14 anboblpa4 kernel: setup: Linux is running natively in 64-bit mode

OSW

-- VMSTAT
zzz ***Sat Jan 4 10:29:09 CST 2020
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 12239196  46764 40267704    0    0   432    93    2    3  2  1 97  0  0
 1  0      0 12227552  46764 40268336    0    0   136   292 10503 16484  3  3 94  0  0
 0  0      0 12229828  46764 40267980    0    0  8368   156 10164 17217  2  1 97  0  0
zzz ***Sat Jan 4 10:29:39 CST 2020
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 53626732  44196 9418284    0    0   432    93    2    3  2  1 97  0  0
 0  0      0 53624900  44196 9418216    0    0    36   779 4824 8191  2  2 97  0  0
 0  0      0 53625828  44196 9418264    0    0    36   311 3262 5288  0  0 99  0  0

-- TOP
zzz ***Sat Jan 4 10:29:39 CST 2020
top - 10:29:41 up 16 days,  1:33,  1 user,  load average: 0.59, 0.80, 0.93    
Tasks: 277 total,   1 running, 276 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.0 us,  0.9 sy,  0.0 ni, 97.9 id,  0.1 wa,  0.0 hi,  0.0 si,  0.1 st
KiB Mem : 65946540 total, 53624596 free,  2859464 used,  9462480 buff/cache
KiB Swap: 33550332 total, 33550332 free,        0 used. 61802316 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4578 node_ex+  20   0  117028  13180   5924 S   3.0  0.0  55:01.24 node_expor+
 6758 root      rt   0 1530652 189620 105952 S   2.0  0.3 309:12.92 osysmond.b+
 6791 root      rt   0 1131744 153004 107768 S   2.0  0.2  15:29.29 cssdagent
 7648 root      rt  -5 1114976 252424 104644 S   2.0  0.4  96:31.80 ologgerd
 6809 grid      rt   0 3047524 261744 112608 S   1.0  0.4  81:24.56 ocssd.bin

zzz ***Sat Jan 4 10:32:41 CST 2020 
top - 10:32:42 up 2 min,  1 user,  load average: 3.50, 1.15, 0.42       <<< the node has just restarted
Tasks: 454 total,   1 running, 452 sleeping,   0 stopped,   1 zombie
%Cpu(s):  2.6 us,  2.5 sy,  0.0 ni, 92.1 id,  2.5 wa,  0.1 hi,  0.2 si,  0.0 st
KiB Mem : 65946540 total, 29817876 free,  4089312 used, 32039352 buff/cache
KiB Swap: 33550332 total, 33550332 free,        0 used. 31010936 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7633 root      20   0 1237724  60736  28932 S   2.0  0.1   0:00.91 orarootage+
10619 grid      20   0    7124   2372   1440 R   2.0  0.0   0:00.02 top
 5711 root      20   0 2900512  89540  28720 S   1.0  0.1   0:06.56 ohasd.bin
-- IFCONFIG 
zzz ***Sat Jan 4 10:29:09 CST 2020
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet6 fe80::8006:9bff:feb4:ea26  prefixlen 64  scopeid 0x20
        ether 82:06:9b:b4:ea:26  txqueuelen 1000  (Ethernet)
        RX packets 209436160  bytes 143740137716 (133.8 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 121960004  bytes 46934313850 (43.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

bond0.3112: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 28.83.70.2  netmask 255.255.255.0  broadcast 28.83.70.255
        inet6 fe80::8006:9bff:feb4:ea26  prefixlen 64  scopeid 0x20
        ether 82:06:9b:b4:ea:26  txqueuelen 1000  (Ethernet)
        RX packets 122078240  bytes 117461299874 (109.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 87350042  bytes 36732906875 (34.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

bond0.3112:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 28.83.70.4  netmask 255.255.255.0  broadcast 28.83.70.255   <<< this IP is gone in the next snapshot
        ether 82:06:9b:b4:ea:26  txqueuelen 1000  (Ethernet)


zzz ***Sat Jan 4 10:29:39 CST 2020
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet6 fe80::8006:9bff:feb4:ea26  prefixlen 64  scopeid 0x20
        ether 82:06:9b:b4:ea:26  txqueuelen 1000  (Ethernet)
        RX packets 209438026  bytes 143740812908 (133.8 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 121961571  bytes 46935014086 (43.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

bond0.3112: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 28.83.70.2  netmask 255.255.255.0  broadcast 28.83.70.255
        inet6 fe80::8006:9bff:feb4:ea26  prefixlen 64  scopeid 0x20
        ether 82:06:9b:b4:ea:26  txqueuelen 1000  (Ethernet)
        RX packets 122080043  bytes 117461945938 (109.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 87351401  bytes 36733594709 (34.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Recommended logs to collect
— note: preferably from all nodes

— OS
/var/log/messages
OSWatcher (OSW) logs covering the problem window; systems keep getting faster, which makes it hard to order events on a timeline, so a sampling interval of 3-5 seconds with gzip compression is recommended

— GI
su - grid
cd $ORACLE_BASE/diag/crs/*/crs/trace
ls -l alert.log
ls -l crsd.trc
ls -l ocssd.trc
ls -l ohasd_cssdagent_root.trc

— DB
su - oracle
cd $ORACLE_BASE/diag/rdbms/*/$ORACLE_SID/trace

ls -l alert*.log
ls -l *dbw*.trc
ls -l *lmhb*.trc
traces of the processes related to the crash
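For the OSWatcher interval suggested above, the collector takes the sampling interval, retention, and compressor as positional arguments. A sketch that only builds the command line; the startOSWbb.sh argument order (interval in seconds, hours to retain, compress utility) is an assumption to verify against the README shipped with your OSWatcher version:

```shell
# osw_start_cmd SECONDS -- build the OSWatcher start command for a given
# sampling interval (argument order is an assumption; check your OSW README)
osw_start_cmd() {
  echo "./startOSWbb.sh ${1:-5} 48 gzip"
}
```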

Avahi (the avahi daemon)

Avahi is a free implementation of Apple's "ZeroConf" specification that uses the mDNS/DNS-SD protocol suite for service discovery, scanning the network for printers and other shared resources; it can also provide DNS and DHCP services. That is very useful on a laptop, but it is not needed on a server, where it can degrade network performance and become unstable under heavy load. Avahi brings no benefit to Oracle, and disabling the service in Oracle environments is recommended: Oracle has reported that avahi-daemon can interfere with Oracle RAC's multicast heartbeats, causing the application-layer interface to assume it has been disconnected on the node and to restart.

The avahi daemon can also spin or accumulate high CPU time.
Bug# 14027941: Possible rac eviction avahi-daemon: withdrawing address record
Bug# 14739888: CSSD FROM NODE 2 CANNOT JOIN THE CLUSTER AFTER REBOOT
Bug# 12717666: LNX64-11203 CVU TO CHECK IF AVAHI-DAEMON IS DISABLED AND PROVIDE FIX-UP SCRIPT

The OS messages log recorded:
anboblpa4 avahi-daemon[4537]: Withdrawing address record for 28.83.70.4 on bond0.3112
Reports elsewhere note that this "Withdrawing address" message only means avahi-daemon observed the IP being removed; avahi-daemon did not initiate the removal itself.

Disabling Avahi
To stop the avahi-daemon, for OL5/OL6:

# service avahi-dnsconfd stop # ignore any errors
# service avahi-daemon stop
# chkconfig avahi-dnsconfd off
# chkconfig avahi-daemon off

for OL7:

# systemctl stop avahi-daemon.socket avahi-daemon.service
# systemctl disable avahi-daemon.socket avahi-daemon.service
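Since a fleet may mix SysV (OL5/OL6) and systemd (OL7) hosts, the right pair of commands can be selected automatically. A minimal sketch that only prints the commands from the two lists above instead of running them, so it can be dry-run safely:

```shell
# avahi_disable_cmds -- print the commands to disable avahi on this host,
# choosing systemctl when available and the SysV tools otherwise (sketch)
avahi_disable_cmds() {
  if command -v systemctl >/dev/null 2>&1; then
    echo "systemctl stop avahi-daemon.socket avahi-daemon.service"
    echo "systemctl disable avahi-daemon.socket avahi-daemon.service"
  else
    echo "service avahi-dnsconfd stop"
    echo "service avahi-daemon stop"
    echo "chkconfig avahi-dnsconfd off"
    echo "chkconfig avahi-daemon off"
  fi
}
```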

Note: another unfortunate aspect of this case is that Oracle explicitly states that network interface names should not contain a "." (here the addresses sat on the VLAN-tagged bond0.3112).
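A pre-check for this naming rule is easy to script. A minimal sketch (the function name is mine) that flags any interface name containing a dot, such as bond0.3112 in this case:

```shell
# check_ifname NAME -- warn when an interface name contains '.', which Oracle
# advises against for cluster interfaces (sketch)
check_ifname() {
  case "$1" in
    *.*) echo "WARN: interface name '$1' contains '.'" ;;
    *)   echo "OK: $1" ;;
  esac
}
```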

