12cR2 ASM start fail when using multiple private interconnects, HAIP issue

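Starting with 11.2.0.2, Oracle introduced HAIP, a network redundancy feature for the private interconnect. HAIP is meant to replace OS-level NIC bonding and to carry traffic in active-active mode, and it supports multiple private networks. Before HAIP, redundancy was usually handled by NIC bonding at the OS layer, but Oracle has long wanted to rely on its own technology rather than on anything external. HAIP has had quite a few bugs, though, and I personally still recommend OS bonding. This post briefly records one case: with two private NICs configured for HAIP, each node was missing one HAIP, and the HAIPs that did start sat on different NICs on the two nodes, so ASM could not start.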

 

ASM ALERT LOG

 
NOTE: remote asm mode is remote (mode 0x202; from cluster type)
2023-01-11T15:31:14.287419+08:00
Cluster Communication is configured to use IPs from: GPnP
IP: 169.254.29.170	 Subnet: 169.254.0.0
IP: 169.254.191.75	 Subnet: 169.254.128.0
KSIPC Loopback IP addresses(DEF): 
127.0.0.1	
KSIPC Available Transports: UDP:TCP
....
Warning: Oraping detected connectivity issues.
An eviction is expected due environment issues
  OS ping to instance: 2 has failed.
Please see LMON and oraping trace files for details.
2023-01-11T15:46:15.961104+08:00
LMON (ospid: 36421) detects hung instances during IMR reconfiguration
LMON (ospid: 36421) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
2023-01-11T15:46:53.020074+08:00
Remote instance kill is issued with system inc 10
Remote instance kill map (size 1) : 2 
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2 
2023-01-11T15:46:53.587757+08:00
Reconfiguration started (old inc 10, new inc 12)
List of instances (total 1) :

ohasd_orarootagent_root.trc

2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: assigned ip '169.254.116.29'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: check ip '169.254.116.29'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP:  CleanDeadThreads entry

2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: assigned ip '169.254.232.245'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: check ip '169.254.232.245'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} Start: 2 HAIP assignment, 2, 1, 1, 1
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} to verify wt, 1-2-2
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} to verify inf event
2023-01-11 15:52:49.036 : USRTHRD:1194911488:  HAIP: event GIPCD_METRIC_UPDATE
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} dequeue change event 0x7f2728018290, GIPCD_METRIC_UPDATE
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} InitializeHaIps[ 1]  infList 'inf ib1, ip 11.11.11.3, sub 11.11.11.0'
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} Network interface ib1 NOT functioning (bad rank -1, 20) from anbob1, HAIP to failover now
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} InitializeHaIps[ 0]  infList 'inf ib0, ip 10.10.10.3, sub 10.10.10.0'
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP: updateCssMbrData for ib1:jzsjk1:0;ib0:jzsjk1:99;ib0:jzsjk2:99;ib1:jzsjk2:99, Threshold 20
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP: enter UpdatecssMbrdata
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP: BC to Update member info HAIP-RM1;10.10.10.0#0
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} 1 - HAIP enable is 2.
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} to verify routes
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} to verify start completion 2
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP:  CleanDeadThreads entry

..
2023-01-11 16:00:07.005 : USRTHRD:171906816: {0:5:3} HAIP: to set HAIP
2023-01-11 16:00:07.005 : CLSINET:171906816:  (:CLSINE0018:)WARNING: failed to find interface available for interface definition ib1(:.*)?:11.11.11.0
2023-01-11 16:00:07.006 : USRTHRD:171906816: {0:5:3} HAIP: number of inf from clsinet -- 1
2023-01-11 16:00:07.006 : USRTHRD:171906816: {0:5:3} HAIP: read grpdata, len 8
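
The two telltale lines in this trace are the "NOT functioning ... HAIP to failover now" message and the CLSINE0018 warning. A minimal sketch for pulling such lines out of a trace file follows. The function name `haip_trace_warnings` is hypothetical, and the patterns are taken only from the excerpt above, so other releases may word these messages differently.

```shell
#!/bin/sh
# Hypothetical helper: print HAIP-related warning lines from an
# ohasd_orarootagent_root.trc file whose path is passed as $1.
# The patterns are copied from the excerpt in this post.
haip_trace_warnings() {
    grep -E 'NOT functioning|CLSINE0018' "$1"
}

# Usage (the trace location depends on the installation, so pass it explicitly):
#   haip_trace_warnings /path/to/ohasd_orarootagent_root.trc
```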

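1. Ping between the nodes using the private IPs.
2. Check that a HAIP exists on every private NIC.
3. Ping between the nodes using the HAIPs.
4. Use ifconfig to check that the subnets match the output of oifcfg iflist.
5. Check the routing table.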
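
The checks above can be sketched on Linux roughly as follows. This is only an illustration: `list_haip` is a hypothetical helper name, and the placeholders in angle brackets must be replaced with real addresses from your own cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# "<interface> <address/prefix>" for every link-local (169.254.x.x) address,
# which is where HAIPs live.
list_haip() {
    awk '$3 == "inet" && $4 ~ /^169\.254\./ { print $2, $4 }'
}

# The five checks, run by hand on each node:
#   ping -c 3 -I ib0 <peer private IP>        # 1. private IP reachability
#   ip -o -4 addr show | list_haip            # 2. is there a HAIP on every private NIC?
#   ping -c 3 -I <local HAIP> <peer HAIP>     # 3. HAIP reachability
#   oifcfg iflist -p -n                       # 4. compare subnets with ifconfig
#   ip route show                             # 5. routing table
```

If check 2 lists a HAIP on only one of the private interfaces, or on a different interface than on the peer node, that matches the symptom described in this post.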

Known Issues: Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip (Doc ID 1640865.1)

HAIP BUG
Bug 10332426 – HAIP fails to start due to network mismatch
Bug 19270660 – AIX: category: -2, operation: open, loc: bpfopen:1,os, OS error: 2, other: ARP device /dev/bpf4, interface en8
Bug 16445624 – AIX: HAIP fails to start
Bug 13989181 – AIX: HAIP fails to start with: category: -2, operation: SETIF, loc: bpfopen:21,o, OS error: 6, other: dev /dev/bpf0
Note 1447517.1 – AIX: HAIP fails to start if bpf and other devices using same major/minor number
Bug 10253028 – “oifcfg iflist -p -n” not showing HAIP on AIX as expected
Bug 13332363 – Wrong MTU for HAIP on Solaris
Bug 10114953 – only one HAIP is create on HP-UX
Bug 10363902 – HAIP Infiniband support for Linux and Solaris
Bug 10357258 – Many HAIP started on Solaris IPMP – not affecting 11.2.0.3
Bug 10397652/ 12767231 – HAIP not failing over when private network fails – not affecting 11.2.0.3
Bug 11077756 – allow root script to continue upon HAIP failure
Bug 12546712 – not affecting 11.2.0.3
HAIP fails to start if default gateway is configured for VLAN for private network on network switch
Bug 12425730 – HAIP does not start, 11.2.0.3 not affected
ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481
11gR2 GI HAIP Resource Not Created in Solaris 11 if IPMP is Used for Private Network
BUG 20900984 – HAIP CONFLICTION DETECTED BETWEEN TWO RAC CLUSTERS IN THE SAME VLAN
note 2059802.1 – AIX HAIP: failed to create arp – operation: open, loc: bpfopen:1,os, OS error: 2
note 2059461.1 – AIX: HAIP Fails to Start: operation: open, loc: bpfopen:1,os, OS error: 2
note 1963248.1 – AIX HAIP: operation: SETIF, loc: bpfopen:21,o, OS error: 6, other: dev /dev/bpf0
note 2066903.1 – operation: routedel, loc: system, OS error: 76, other: failed to execute route del, ret 256
Bug 29379299 – HAIP SUBNET FLIPS UPON BOTH LINK DOWN/UP EVENT
HAIP failover improvement
BUG 23472436 – HAIP USING ONLY 2 OF 4 NETWORK INTERFACES ASSIGNED TO THE INTERCONNECT
BUG 25073182 – ONE GIPC RANK FOR EACH PRIVATE NETWORK INTERFACE
Bug 24509481 – HAIP TO HAVE SMALLER SUBNET INSTEAD OF WHOLE LINK LOCAL 169.254.*.*

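In this case the HAIP resource itself started normally, but ASM could not start because oraping between the nodes over the HAIPs failed. The hardware is InfiniBand, with two interconnect networks used as the private network over direct connections, which is not a recommended configuration. Oddly, each node started only one HAIP: on ib1 on node1 and on ib0 on node2. Because both links are direct connections, pinging the private IP on each IB interface worked, but the two HAIPs sat on different physical links and could not ping each other. As a result ASM could not start, and CRS on whichever node started second could not come up.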
NODE1
NODE2
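
When comparing the two nodes it helps to know which HAIP subnet each address belongs to. The ASM alert log above shows two subnets, 169.254.0.0 and 169.254.128.0, which suggests the link-local range is split in half when two private interfaces are in use. A small sketch under that assumption (the helper name `haip_subnet17` is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print which half of 169.254.0.0/16 a HAIP address
# falls into. Assumes two private interfaces, so two /17 subnets, as the
# alert log in this post suggests.
haip_subnet17() {
    third=$(printf '%s\n' "$1" | cut -d. -f3)
    if [ "$third" -lt 128 ]; then
        echo 169.254.0.0
    else
        echo 169.254.128.0
    fi
}

# Usage:
#   haip_subnet17 169.254.29.170
#   haip_subnet17 169.254.191.75
```

With direct-connected links, a HAIP on one node can only reach a HAIP on the peer if both sit on the same physical link, so the interface and the subnet should be compared together.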

Solution:

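1. Because the links are direct connections, bonding is not an option either (with active-backup bonding, when the active ib0 on node1 fails while the active interface on node2 is ib1, the physical path is broken in the same way). The following restart sequence brought everything up normally: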

# node1
crsctl stop crs

# node2
crsctl stop crs
-- 2 HAIPs were started

# node1
crsctl start crs
-- started normally

2. Disable HAIP

Risk: this may affect future upgrades or adding new nodes.

1. Run "crsctl stop crs" on all nodes to stop CRS stack.

2. On one node, run the following commands:
     $CRS_HOME/bin/crsctl start crs -excl -nocrs
     $CRS_HOME/bin/crsctl stop res ora.asm -init
     $CRS_HOME/bin/crsctl modify res ora.cluster_interconnect.haip -attr  "ENABLED=0" -init
     $CRS_HOME/bin/crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)',STOP_DEPENDENCIES='hard(intermediate:ora.cssd)'" -init
     $CRS_HOME/bin/crsctl stop crs
   
3. Repeat Step(2) on other nodes.

4. Run "crsctl start crs" on all nodes to restart CRS stack.
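
After the change it is worth confirming that the attribute was actually applied. A minimal sketch, assuming the resource profile printed by `crsctl stat res ... -init -p` contains an `ENABLED=<value>` line; the helper name `res_enabled` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read a resource profile on stdin and print the value
# of its ENABLED attribute (assumes a plain "ENABLED=<value>" line).
res_enabled() {
    awk -F= '$1 == "ENABLED" { print $2 }'
}

# Usage, on each node:
#   $CRS_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init -p | res_enabled
```

A printed value of 0 would indicate that HAIP is disabled on that node.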

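3. To avoid having to stop all nodes every time, the temporary workaround was to disable ib1 and use only ib0. After that, restarting a single node worked normally (but the interconnect network no longer has any redundancy). For example: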

# as grid user
$ oifcfg getif

$ oifcfg delif -global eth3/xxx.xxx.xx.0

-- CRS must be restarted on all nodes; a rolling restart of CRS will not work.
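
Before deleting an interface it helps to list exactly what is registered as a cluster interconnect. A minimal sketch, assuming `oifcfg getif` prints one line per interface with the columns name, subnet, scope and type; the helper name `list_interconnects` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `oifcfg getif` output on stdin and print
# "<interface>/<subnet>" for every entry whose type includes
# cluster_interconnect. Assumes the columns: name subnet scope type.
list_interconnects() {
    awk '$4 ~ /cluster_interconnect/ { print $1 "/" $2 }'
}

# Usage, as the grid user:
#   oifcfg getif | list_interconnects
```

The printed form is the same `interface/subnet` form that `oifcfg delif -global` takes. Going by the trace above (ib1 on subnet 11.11.11.0), the entry to remove in this case would presumably be `ib1/11.11.11.0`.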
