12cR2: ASM fails to start when using multiple private interconnects (HAIP issue)
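Starting with 11.2.0.2, Oracle introduced a network redundancy feature called HAIP. HAIP is meant to replace OS-level NIC bonding and to transfer data in Active-Active mode. It supports multiple private networks, a job previously left to NIC bonding at the OS layer. Oracle has long wanted to use its own technology rather than depend on other layers, but HAIP has quite a few bugs, and I personally still recommend OS-level bonding. This post briefly records one case: with two HAIP interfaces, one HAIP was missing on each node and the remaining HAIPs sat on different NICs across the nodes, so ASM could not start.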
ASM ALERT LOG
NOTE: remote asm mode is remote (mode 0x202; from cluster type)
2023-01-11T15:31:14.287419+08:00
Cluster Communication is configured to use IPs from: GPnP
IP: 169.254.29.170 Subnet: 169.254.0.0
IP: 169.254.191.75 Subnet: 169.254.128.0
KSIPC Loopback IP addresses(DEF):
127.0.0.1
KSIPC Available Transports: UDP:TCP
....
Warning: Oraping detected connectivity issues. An eviction is expected due environment issues
OS ping to instance: 2 has failed.
Please see LMON and oraping trace files for details.
2023-01-11T15:46:15.961104+08:00
LMON (ospid: 36421) detects hung instances during IMR reconfiguration
LMON (ospid: 36421) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
2023-01-11T15:46:53.020074+08:00
Remote instance kill is issued with system inc 10
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
2023-01-11T15:46:53.587757+08:00
Reconfiguration started (old inc 10, new inc 12)
List of instances (total 1) :
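The two subnets in the alert log (169.254.0.0 and 169.254.128.0) look like the link-local range split into two halves. The sketch below assumes a /17 prefix, which is my reading of those two subnets and is not stated in the log, and computes which half a given HAIP falls into. The function name `network_of` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the network address of an IPv4 address for a
# given prefix length. Usage: network_of <dotted-quad> <prefix-length>
network_of() {
    prefix=$2
    old_ifs=$IFS
    IFS=.
    set -- $1                      # split the dotted quad into $1..$4
    IFS=$old_ifs
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    net=$(( addr & mask ))
    printf '%d.%d.%d.%d\n' \
        $(( (net >> 24) & 255 )) $(( (net >> 16) & 255 )) \
        $(( (net >> 8) & 255 ))  $(( net & 255 ))
}
```

Calling `network_of 169.254.191.75 17` and `network_of 169.254.29.170 17` is a quick way to confirm that two HAIPs which are supposed to talk to each other really sit in the same half.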
ohasd_orarootagent_root.trc
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: assigned ip '169.254.116.29'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: check ip '169.254.116.29'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: CleanDeadThreads entry
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: assigned ip '169.254.232.245'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} HAIP: check ip '169.254.232.245'
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} Start: 2 HAIP assignment, 2, 1, 1, 1
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} to verify wt, 1-2-2
2023-01-11 15:52:47.980 : USRTHRD:1197451008: {0:5:3} to verify inf event
2023-01-11 15:52:49.036 : USRTHRD:1194911488: HAIP: event GIPCD_METRIC_UPDATE
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} dequeue change event 0x7f2728018290, GIPCD_METRIC_UPDATE
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} InitializeHaIps[ 1] infList 'inf ib1, ip 11.11.11.3, sub 11.11.11.0'
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} Network interface ib1 NOT functioning (bad rank -1, 20) from anbob1, HAIP to failover now
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} InitializeHaIps[ 0] infList 'inf ib0, ip 10.10.10.3, sub 10.10.10.0'
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP: updateCssMbrData for ib1:jzsjk1:0;ib0:jzsjk1:99;ib0:jzsjk2:99;ib1:jzsjk2:99, Threshold 20
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP: enter UpdatecssMbrdata
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP: BC to Update member info HAIP-RM1;10.10.10.0#0
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} 1 - HAIP enable is 2.
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} to verify routes
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} to verify start completion 2
2023-01-11 15:52:49.036 : USRTHRD:1197451008: {0:5:3} HAIP: CleanDeadThreads entry
..
2023-01-11 16:00:07.005 : USRTHRD:171906816: {0:5:3} HAIP: to set HAIP
2023-01-11 16:00:07.005 : CLSINET:171906816: (:CLSINE0018:)WARNING: failed to find interface available for interface definition ib1(:.*)?:11.11.11.0
2023-01-11 16:00:07.006 : USRTHRD:171906816: {0:5:3} HAIP: number of inf from clsinet -- 1
2023-01-11 16:00:07.006 : USRTHRD:171906816: {0:5:3} HAIP: read grpdata, len 8
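The trace is noisy, so it can help to pull out only the interfaces that HAIP complains about. The sketch below matches just the two message shapes visible in the excerpt above ("Network interface ... NOT functioning" and the CLSINE0018 warning); other releases may word these differently. The function name `haip_bad_interfaces` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list interfaces that the orarootagent trace reports as
# not functioning or not found. $1 is the path to the trace file.
haip_bad_interfaces() {
    # Each expression keeps only the interface name from a matching line;
    # sort -u collapses repeated reports of the same interface.
    sed -n \
        -e 's/.*Network interface \([^ ]*\) NOT functioning.*/\1/p' \
        -e 's/.*CLSINE0018.*interface definition \([A-Za-z0-9_]*\).*/\1/p' \
        "$1" | sort -u
}
```

Run against the trace shown above, `haip_bad_interfaces ohasd_orarootagent_root.trc` would list only ib1, which matches the interface that later had to be taken out of the configuration.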
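1, Ping between the nodes using the private IPs
2, Check whether a HAIP exists on every private NIC
3, Ping between the nodes using the HAIPs
4, Use ifconfig to check that the subnets match the output of oifcfg iflist
5, Check the routing table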
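The checks above can be sketched on Linux roughly as follows. This is a minimal sketch, not a complete health check: `list_haip` is a made-up helper that filters `ip -o -4 addr show` output for link-local addresses, and the interface names and angle-bracket addresses in the comments are placeholders to substitute for your own environment.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# "interface address/prefix" for every 169.254.x.x address (the HAIP range).
list_haip() {
    awk '$3 == "inet" && $4 ~ /^169\.254\./ { print $2, $4 }'
}

# Typical use on each node (placeholders in angle brackets):
#   ping -c 3 -I ib0 <peer-private-ip>   # check 1: private IP over a given NIC
#   ip -o -4 addr show | list_haip       # check 2: which NIC carries which HAIP
#   ping -c 3 -I ib0 <peer-haip>         # check 3: HAIP to HAIP over the same NIC
#   oifcfg iflist -p -n                  # check 4: compare with the OS subnets
#   ip route show                        # check 5: routes for the 169.254 subnets
```

Running check 2 on both nodes and comparing the results is what exposes a HAIP that is missing or that sits on a different NIC than its peer.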
Known Issues: Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip (Doc ID 1640865.1)
HAIP BUG
Bug 10332426 – HAIP fails to start due to network mismatch
Bug 19270660 – AIX: category: -2, operation: open, loc: bpfopen:1,os, OS error: 2, other: ARP device /dev/bpf4, interface en8
Bug 16445624 – AIX: HAIP fails to start
Bug 13989181 – AIX: HAIP fails to start with: category: -2, operation: SETIF, loc: bpfopen:21,o, OS error: 6, other: dev /dev/bpf0
Note 1447517.1 – AIX: HAIP fails to start if bpf and other devices using same major/minor number
Bug 10253028 – “oifcfg iflist -p -n” not showing HAIP on AIX as expected
Bug 13332363 – Wrong MTU for HAIP on Solaris
Bug 10114953 – only one HAIP is create on HP-UX
Bug 10363902 – HAIP Infiniband support for Linux and Solaris
Bug 10357258 – Many HAIP started on Solaris IPMP – not affecting 11.2.0.3
Bug 10397652/ 12767231 – HAIP not failing over when private network fails – not affecting 11.2.0.3
Bug 11077756 – allow root script to continue upon HAIP failure
Bug 12546712 – not affecting 11.2.0.3
HAIP fails to start if default gateway is configured for VLAN for private network on network switch
Bug 12425730 – HAIP does not start, 11.2.0.3 not affected
ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481
11gR2 GI HAIP Resource Not Created in Solaris 11 if IPMP is Used for Private Network
BUG 20900984 – HAIP CONFLICTION DETECTED BETWEEN TWO RAC CLUSTERS IN THE SAME VLAN
note 2059802.1 – AIX HAIP: failed to create arp – operation: open, loc: bpfopen:1,os, OS error: 2
note 2059461.1 – AIX: HAIP Fails to Start: operation: open, loc: bpfopen:1,os, OS error: 2
note 1963248.1 – AIX HAIP: operation: SETIF, loc: bpfopen:21,o, OS error: 6, other: dev /dev/bpf0
note 2066903.1 – operation: routedel, loc: system, OS error: 76, other: failed to execute route del, ret 256
Bug 29379299 – HAIP SUBNET FLIPS UPON BOTH LINK DOWN/UP EVENT
HAIP failover improvement
BUG 23472436 – HAIP USING ONLY 2 OF 4 NETWORK INTERFACES ASSIGNED TO THE INTERCONNECT
BUG 25073182 – ONE GIPC RANK FOR EACH PRIVATE NETWORK INTERFACE
Bug 24509481 – HAIP TO HAVE SMALLER SUBNET INSTEAD OF WHOLE LINK LOCAL 169.254.*.*
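In this case the HAIP service itself started normally, but ASM could not start because oraping between the nodes over HAIP failed. The hardware is InfiniBand, with two interconnect networks used as the private network and cabled directly node to node, which is not a recommended configuration. Oddly, each node brought up only one HAIP: on ib1 on node1 and on ib0 on node2. Because both links are direct connections, pinging the private IP on each ib interface works, but the two HAIPs sit on different physical links and cannot ping each other. As a result ASM cannot start, and CRS on whichever node starts second fails to come up.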
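The mismatch described here can be detected mechanically once each node's "interface address" pairs have been saved to a file. The sketch below is only an illustration: `common_haip_ifaces` is a made-up name, and it assumes that with direct cabling the same interface name on both nodes means the same physical link (ib0 to ib0, ib1 to ib1).

```shell
#!/bin/sh
# Hypothetical helper: print the interfaces that carry a HAIP on BOTH nodes.
# $1 and $2 are files with one "interface address" pair per line, one file
# per node. Empty output means no link has a HAIP at both ends.
common_haip_ifaces() {
    t1=$(mktemp) && t2=$(mktemp) || return 1
    awk '{ print $1 }' "$1" | sort -u > "$t1"
    awk '{ print $1 }' "$2" | sort -u > "$t2"
    comm -12 "$t1" "$t2"       # keep only names present in both sorted lists
    rm -f "$t1" "$t2"
}
```

For the situation in this post (HAIP only on ib1 on node1 and only on ib0 on node2) the function prints nothing, while a healthy two-link setup would print both ib0 and ib1.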
NODE1
NODE2
Solution:
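1, With direct cabling, bonding is not an option either (active-backup bonding has the same problem: if the active ib0 on node1 fails while the active interface on node2 is ib1, the physical link between them is broken). The following restart sequence works: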
# node1
crsctl stop crs
# node2
crsctl stop crs
-- 2 HAIPs were started
# node1
crsctl start crs
-- started normally
2, Disable HAIP
Risk: this may affect future upgrades or adding new nodes.
1. Run "crsctl stop crs" on all nodes to stop CRS stack.
2. On one node, run the following commands:
$CRS_HOME/bin/crsctl start crs -excl -nocrs
$CRS_HOME/bin/crsctl stop res ora.asm -init
$CRS_HOME/bin/crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
$CRS_HOME/bin/crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)',STOP_DEPENDENCIES='hard(intermediate:ora.cssd)'" -init
$CRS_HOME/bin/crsctl stop crs
3. Repeat Step(2) on other nodes.
4. Run "crsctl start crs" on all nodes to restart CRS stack.
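3, To avoid having to stop all nodes every time, ib1 was disabled as a temporary workaround so that only ib0 is used. After that, restarting a single node works normally (but the interconnect network no longer has any redundancy). For example: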
# as grid user
$ oifcfg getif
$ oifcfg delif -global eth3/xxx.xxx.xx.0
-- CRS must be restarted on all nodes; it cannot be restarted in a rolling fashion
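After removing an interface it is worth confirming what the cluster still treats as a private interconnect. The sketch below filters `oifcfg getif` output, assuming the usual layout of interface name, subnet, scope and type with the type in the last column. The function name `interconnect_ifaces` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `oifcfg getif` output on stdin and print the
# "interface subnet" pairs whose type includes cluster_interconnect.
interconnect_ifaces() {
    awk '$NF ~ /cluster_interconnect/ { print $1, $2 }'
}

# Typical use, as the grid user:
#   oifcfg getif | interconnect_ifaces
```

With the workaround applied, only one line (the ib0 subnet in this case) should remain in the output.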