Troubleshooting Oracle RAC: a Node Fails to Join the Cluster with “no network HB”

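Recently, at a customer site, several nodes of a 6-node Oracle 12cR2 RAC could not start and rejoin the cluster after a split-brain. The logs showed the classic "has a disk HB, but no network HB" message. Security-hardening requests have been frequent lately, so take care that over-restrictive blocking does not break the interconnect traffic between RAC nodes. This post briefly records how to analyze this kind of case.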

Analysis steps:
1, Check the CRS status and resources on all nodes

crsctl check crs
crsctl stat res -t
crsctl stat res -t -init

2, Check the software environment on the problem nodes

 cluvfy stage -post crsinst -n hract21,hract22

3, Check the logs
GI alert log
ocssd.log
crs.log
ASM alert.log
DB alert.log

4, If CSSD fails to start, enable ocssd debug logging
#  $GRID_HOME/bin/crsctl set  log css CSSD:3
Set CSSD Module: CSSD  Log Level: 3
#   $GRID_HOME/bin/crsctl get log css CSSD
Get CSSD Module: CSSD  Log Level: 3

5, Check the CSS timeout values
#  $GRID_HOME/bin/crsctl  get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.

6, ocssd log

$ cat $GRID_HOME/log/grac2/cssd/ocssd.log | egrep -i 'Removal|evict|30000|network HB|splitbrain|aborting'
$ cat $GRID_HOME/log/grac2/cssd/ocssd.log | egrep -i 'fail|error|exception|fatal'
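To see quickly which peers are still visible on the voting disk but not over the interconnect, the matching trace lines can be tallied per node. The following is a minimal sketch, assuming the trace lines have the `clssnmvDHBValidateNCopy: node N, <name>, has a disk HB, but no network HB` shape shown in this post. The function name `count_no_network_hb` is hypothetical and not part of any Oracle tool.

```shell
#!/bin/sh
# count_no_network_hb FILE
# Print "<node name> <count>" for every peer that the local CSSD reports as
# "has a disk HB, but no network HB" in the given ocssd log or trace file.
# (Hypothetical helper; the line format is assumed from the excerpt in this post.)
count_no_network_hb() {
    grep 'has a disk HB, but no network HB' "$1" |
        sed -n 's/.*clssnmvDHBValidateNCopy: node [0-9]*, \([^,]*\), has a disk HB.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be called as `count_no_network_hb /path/to/ocssd.trc`. A peer whose count keeps growing is one that still writes its disk heartbeat but cannot be reached over the private network.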

alert.log:

2015-02-17 09:42:27.823 [OCSSD(15855)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-17 09:42:27.824 [OCSSD(15855)]CRS-1603: CSSD on node hract21 shutdown by user.
2015-02-17 09:42:27.823 [CSSDAGENT(15844)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_cssdagent_root.trc.
Tue Feb 17 09:42:32 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc  (incident=2977):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc

2015-02-17 09:42:33.019 [OCSSD(15855)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID
 15855 experienced fatal signal or exception code 6
Sweep [inc][2977]: completed
2015-02-17 09:42:38.005 [OHASD(11954)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00163:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd.trc.

ocssd.trc:

2015-02-17 09:42:32.451021 :    CSSD:2417551104: 
   clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963949, LATS 92477974, lastSeqNo 963946, uniqueness 1424074596, timestamp 1424162551/21220694
2015-02-17 09:42:32.451113 :    CSSD:2422281984: 
   clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963950, LATS 92477974, lastSeqNo 963947, uniqueness 1424074596, timestamp 1424162552/21220904
Trace file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
DDE: Flood control is not active
CLSB:2467473152: Oracle Clusterware infrastructure error in OCSSD (OS PID 15855): Fatal signal 6 has occurred in program ocssd thread 2467473152; nested signal count is 1
Incident 2977 created, dump file: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2015-02-17 09:42:33.108629 :    CSSD:2450904832: clssscWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2015-02-17 09:42:33.451785 :    CSSD:2417551104: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963952, LATS 92478974, lastSeqNo 963949, uniqueness 1424074596, timestamp 1424162552/21221694
2015-02-17 09:42:33.451933 :    CSSD:2422281984: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963953, LATS 92478974, lastSeqNo 963950, uniqueness 1424074596, timestamp 1424162553/21221904
--> Here we know that we have a networking problem

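Note:
The keywords "has a disk HB, but no network HB" and "OCSSD (OS PID 15855): Fatal signal 6" usually mean the network is unreachable. (Only screenshots could be taken of the on-site failure, so the log output above comes from an identical case.)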

7, Check the gpnp profile.xml

$GRID_HOME/bin/gpnptool get 2>/dev/null | xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'

8, Check the network

ping     
traceroute     
ifconfig xxx
ip addr
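With six nodes, probing every private address by hand is tedious, so the reachability check can be looped. This is a minimal sketch, assuming Linux `ping` (iputils) and that the private addresses or host names are passed in by the caller. The function name `check_peers` and the `PROBE` override are hypothetical conveniences, and the host names in the usage line are placeholders.

```shell
#!/bin/sh
# check_peers HOST...
# Probe each host with $PROBE (default: two ICMP echo requests, 2 s timeout)
# and print "OK <host>" or "FAIL <host>".
# Returns non-zero if at least one probe failed.
check_peers() {
    rc=0
    for host in "$@"; do
        # $PROBE is left unquoted on purpose so that it splits into command + args.
        if ${PROBE:-ping -c 2 -W 2} "$host" >/dev/null 2>&1; then
            echo "OK $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (placeholder private host names):
#   check_peers node1-priv node2-priv node3-priv
```

A "FAIL" for the private address of a surviving node, while its public address still answers, points at the interconnect path rather than at the node itself.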

9, Multicast test
Oracle provides a multicast test script, mcasttest.pl, to confirm that multicast is enabled on the OS and the network devices.

$ ./mcasttest.pl -n db01,db02 -i ib0,ib3

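10, Check for packet reassembly problems

Check whether packet reassembly is failing at the OS layer. This has come up many times in earlier cases; see "Troubleshooting 11gR2 Grid Infrastructure Node not Join the Cluster After Evicted error show disk and network HB failed".

On all nodes: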

LINUX 6

netstat -s | grep reass
sleep 5
netstat -s | grep reass

LINUX 7 
nstat -az | grep IpReasmFails
sleep 5
nstat -az | grep IpReasmFails
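What matters in the two samples is whether the failure counter grows between them, not its absolute value. A minimal sketch of that comparison follows, assuming `nstat -az` prints one `name value rate` row per counter. The helper names `reasm_fails` and `reasm_delta` are hypothetical.

```shell
#!/bin/sh
# reasm_fails: read `nstat -az` output on stdin and print the IpReasmFails
# counter (prints 0 if the counter is not listed).
reasm_fails() {
    awk '$1 == "IpReasmFails" { print $2; found = 1 }
         END { if (!found) print 0 }'
}

# reasm_delta BEFORE AFTER: print how much the counter grew between two samples.
reasm_delta() {
    echo $(( $2 - $1 ))
}

# Usage on a live node:
#   before=$(nstat -az | reasm_fails)
#   sleep 5
#   after=$(nstat -az | reasm_fails)
#   echo "IpReasmFails grew by $(reasm_delta "$before" "$after") in 5 seconds"
```

A counter that keeps increasing on the private interface while the node is trying to join suggests fragments are being dropped during reassembly.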

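11, Check firewalls
Check whether a software or hardware firewall has changed the network policy, for example iptables or firewall services that start at OS boot.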
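On the OS side, one quick way to spot blocking rules is to filter the rule listing for DROP or REJECT targets. This is a minimal sketch, assuming the rules are listed with `iptables -S` (run as root). The function name `blocking_rules` is hypothetical, and it only inspects rules given on stdin; it does not cover nftables or external network firewalls.

```shell
#!/bin/sh
# blocking_rules: read `iptables -S` output on stdin and print
#   - chain policies set to DROP        (-P <chain> DROP)
#   - rules whose target is DROP/REJECT (-j DROP, -j REJECT ...)
blocking_rules() {
    grep -E -e '^-P [A-Za-z0-9_-]+ DROP$' -e ' -j (DROP|REJECT)( |$)'
}

# Usage on a live node (as root):
#   iptables -S | blocking_rules
#   systemctl is-enabled firewalld    # does the firewall service start at boot?
```

Any rule in the output that matches the private interconnect interface or subnet deserves a closer look.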

12, Check the ASM disks

$GRID_HOME/bin/kfod disks=asm  st=true ds=true cluster=true

13, Check the OS messages
Look for hardware faults, such as link down/up events.
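Link flaps can be pulled out of the OS log with a simple filter. The sketch below assumes the log is a plain text file such as /var/log/messages. The match patterns are assumptions that cover message wording used by some common NIC and bonding drivers, so they should be adjusted to the drivers actually in use. The function name `link_events` is hypothetical.

```shell
#!/bin/sh
# link_events FILE
# Print log lines that look like NIC or bond link state changes.
# The patterns are assumptions; driver message wording differs between NICs.
link_events() {
    grep -Ei 'link is (down|up)|link status definitely (down|up)|link (down|up)' "$1"
}

# Usage on a live node (as root):
#   link_events /var/log/messages | tail -20
```

Repeated down/up pairs on the private interface around the eviction time would point to a cable, port, or NIC problem rather than a policy change.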

14, Network socket errors
For example, network socket files under /tmp are missing or have wrong permissions.

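In this case the ASM disks checked out fine, sqlnet.ora had no whitelist configured (which could affect Flex ASM listener communication), and the OS message log showed no hardware errors, so the next step was to try restarting the CRS stack. When that still failed, a traceroute to the private IP of a surviving node showed that it was unreachable. Combined with the keywords "no network HB" and "OCSSD (OS PID 15855): Fatal signal 6", the preliminary diagnosis was one of the following:
OS layer: iptables / firewall
Network layer: access policies such as a network firewall

The network engineers then confirmed that a network policy change had just been made. After all network restrictions on the private interconnect were disabled, firewalls in particular, the cluster returned to normal.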
