Case study: 11g R2 (11.2.0.4) RAC addnode on RHEL7 (Linux 7), a summary of issues

A recent attempt to add a node to a two-node 11.2.0.4 RAC on Linux 7 did not go smoothly. This combination is Oracle-certified, but compatibility is still not entirely smooth. Nine years ago I shared a fairly uneventful addnode on Linux 6 in 《Oracle 11g R2 RAC addnode (增加RAC节点) 实践和注意事项》. This time the problems touched InfiniBand, the network, bugs, corruption and patching; here is a brief record.

1. IB heartbeat (interconnect) network unreachable
The existing nodes should be checked before running addnode. The interconnect on node2 turned out to be unreachable. This environment runs an all-InfiniBand stack, so the first step was to look at the IB interface status:

[oracle@anbob001:/home/oracle]$ethtool ib0
Settings for ib0:
        Supported ports: [ ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 56000Mb/s
        Duplex: Full
        Port: Other
        PHYAD: 255
        Transceiver: internal
        Auto-negotiation: off
Cannot get wake-on-lan settings: Operation not permitted
        Link detected: yes
		
Dec 13 23:34:42 anbob002 kernel: ib1: transmit timeout: latency 1002653 msecs
Dec 13 23:34:42 anbob002 kernel: ib1: queue (5) stopped, tx_head 2318340524, tx_tail 2318340429
Dec 13 23:34:52 anbob002 kernel: ib1: transmit timeout: latency 1012637 msecs
Dec 13 23:34:52 anbob002 kernel: ib1: queue (5) stopped, tx_head 2318340524, tx_tail 2318340429
Dec 13 23:35:02 anbob002 python: File "/opt/zdata/compute/python/lib/python2.7/logging/handlers.py", line 117, in __init__
Dec 13 23:35:02 anbob002 python: File "/opt/zdata/compute/python/lib/python2.7/logging/handlers.py", line 64, in __init__
Dec 13 23:35:02 anbob002 python: File "/opt/zdata/compute/python/lib/python2.7/logging/__init__.py", line 913, in __init__
Dec 13 23:35:02 anbob002 python: File "/opt/zdata/compute/python/lib/python2.7/logging/__init__.py", line 943, in _open
Dec 13 23:35:02 anbob002 kernel: ib1: transmit timeout: latency 1022653 msecs
Dec 13 23:35:02 anbob002 kernel: ib1: queue (5) stopped, tx_head 2318340524, tx_tail 2318340429
Dec 13 23:35:12 anbob002 kernel: ib1: transmit timeout: latency 1032669 msecs
Dec 13 23:35:12 anbob002 kernel: ib1: queue (5) stopped, tx_head 2318340524, tx_tail 2318340429
Dec 13 23:35:22 anbob002 kernel: ib1: transmit timeout: latency 1042653 msecs

The problem is an IB driver bug that stalls the IB transmit queue; a reboot works around it for the time being.
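Before retrying addnode after the reboot, connectivity and the add-node prerequisites can be re-verified from an existing node with cluvfy. A minimal sketch, run as the grid owner; the Grid home path is taken from the logs above and the new node name from later sections:

# verify public/private connectivity among all current nodes
/u01/app/11.2.0/grid/bin/cluvfy comp nodecon -n all -verbose

# pre-checks for adding the new node anbob003
/u01/app/11.2.0/grid/bin/cluvfy stage -pre nodeadd -n anbob003 -verbose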

2. addnode takes hours while moving little data
addnode copies files from the executing node to the new node with scp. When there are a lot of trace and log files, for example hundreds of thousands of small audit files under adump, transferring them takes a long time, and root.sh on the target node then has to set ownership on the same files again, so several hours can easily be wasted.

The fix is to clean out trace and audit files before adding the node.
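A rough sketch of that cleanup, assuming a default adump location and a 30-day retention (both are assumptions for illustration; check the paths and your retention policy before deleting anything):

# count, then delete, audit files older than 30 days
find /u01/app/oracle/admin/*/adump -name "*.aud" -mtime +30 | wc -l
find /u01/app/oracle/admin/*/adump -name "*.aud" -mtime +30 -delete

# old trace files under the diag destination can be purged with adrci, e.g.
# adrci exec="set home diag/rdbms/<db>/<inst>; purge -age 43200 -type trace"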

3. addnode fails with PRCF-2015 and PRCF-2002
The connection dropped in the middle of addnode, and repeated attempts showed the same symptom. The interconnect is IB, but the public network in this environment is a 1000M copper link, which in theory should not be a bottleneck either, and addnode uses the public network.

Instantiation of add node scripts complete

Copying to remote nodes (Tuesday, April 27, 2021 3:28:31 PM CST)
..............................................................................................
.WARNING:Error while copying directory /u01/app/11.2.0/grid with exclude file 
list '/tmp/OraInstall2021-04-27_03-28-20PM/installExcludeFile.lst' 
to nodes 'anbob003'. [PRKC-PRCF-2015 : One or more commands were not executed successfully on one or more nodes : ]
----------------------------------------------------------------------------------
anbob003:
    PRCF-2002 : Connection to node "anbob003" is lost

----------------------------------------------------------------------------------

Refer to '/u01/app/oraInventory/logs/addNodeActions2021-04-27_03-28-20PM.log' for details. You may fix the errors on the required remote nodes. Refer to the install guide for error recovery.
                                 96% Done.

While the files were being transferred, ping latency to the remote node reached 40 ms, so the network was in poor shape. The copy is not an idle connection killed by a firewall, and there is an Oracle-documented issue that matches this symptom closely: because the copy is done over scp, the default concurrency limit of the SSH server can be too low, so Oracle suggests raising MaxStartups on the sshd side. The default MaxStartups is 10:30:60, in the form "start:rate:full": at most 10 concurrent unauthenticated connections; above that, new connections are dropped with a probability starting at 30%; above the full value of 60 they are always dropped. The limit only applies to connections that have not yet authenticated, and sshd must be restarted for a change to take effect. It was first raised to 40 and the retry still failed; raising it further to 100:30:100 and retrying failed again, so the real problem remained the poor network.
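A minimal sketch of that sshd change on the remote node, assuming the standard RHEL7 sshd_config location:

# as root on the remote node: check the current setting
grep -i maxstartups /etc/ssh/sshd_config

# set e.g. "MaxStartups 100:30:100" in /etc/ssh/sshd_config, then restart sshd
systemctl restart sshd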

4. Poor network

On a network this bad, packets may fail to get through, which brings up retransmission behaviour. RFC 2018 defines SACK (Selective Acknowledgment), a TCP option tied to the acknowledgment mechanism that lets TCP acknowledge non-contiguous segments individually, so the sender knows which segments were actually lost and retransmits only those. SACK was disabled in this environment, so the next step was to enable it:

sysctl -w net.ipv4.tcp_sack=1

The next addnode attempt then succeeded.
tcp_sack is not a silver bullet: if you look at Exadata configurations, the best practice there is actually to disable SACK, and on OEL with UEK4 there have also been cases where enabling TCP SACK made performance worse. When you hit a problem, test it yourself; facts beat argument.
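To check the current value and keep the setting across reboots, a small sketch using plain sysctl (nothing Oracle-specific here):

# current value: 0 = disabled, 1 = enabled
sysctl net.ipv4.tcp_sack

# enable now and persist it
sysctl -w net.ipv4.tcp_sack=1
echo "net.ipv4.tcp_sack = 1" >> /etc/sysctl.conf
sysctl -p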

5. HAS cannot start
This is the well-known 11g problem on Linux 7: init.d has been replaced by systemd, so the ohasd service needs either the patch applied before root.sh, or a manually created service, or very quick hands to start the service manually during root.sh until the init.ohasd run process shows up. Installing patch 18370031 also works.
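A commonly used sketch of the manual workaround is to wrap init.ohasd in a systemd unit before running root.sh; the unit name ohas.service is illustrative, the init.ohasd path is the standard one:

# as root, create a unit that keeps init.ohasd running
cat > /usr/lib/systemd/system/ohas.service <<'EOF'
[Unit]
Description=Oracle High Availability Services
After=syslog.target

[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Type=simple
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable ohas.service
systemctl start ohas.service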

Note that if root.sh fails, starting with 11.2.0.2 it can simply be rerun and the installation will continue from where it stopped.

6. CRS-10131 mkfifo: cannot create fifo '/var/tmp/.oracle/npohasd': file exists
If there is no /etc/init.d/init.ohasd run process at all, the ohasd service has not even started. The log file shows:

crs-10131 /var/tmp/.oracle/npohasd mkfifo: cannot create fifo '/var/tmp/.oracle/npohasd': file exists

The reason is that /var/tmp/.oracle/npohasd was copied over during the earlier failed network transfer, so the file's creation time is earlier than the process start time; it should be deleted.
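A small sketch of that cleanup, to be done before root.sh is rerun on the new node:

# as root on the new node: confirm it is a stale copied fifo, then remove it
ls -l /var/tmp/.oracle/npohasd
rm -f /var/tmp/.oracle/npohasd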

7. PROCL-26 OLR initalization failured, rc=26
root.sh was tried again and failed again:

2021-04-27 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2021-04-27 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2021-04-27 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2021-04-27 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2021-04-27 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2021-04-27 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2021-04-27 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2021-04-27 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2021-04-27 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR

The root cause is a corrupted OLR; root.sh re-creates it.

Root.sh Failed USM Driver Install Actions Failed On Second Node ( ACFS-11022 ) (Doc ID 1477313.1). Patch 17475946 had previously been applied to this RAC.
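To confirm the state of the OLR before or after rerunning root.sh, the local registry can be checked directly. A minimal sketch, run as root; the Grid home path is the same assumption as before:

# logical/physical integrity check of the OLR
/u01/app/11.2.0/grid/bin/ocrcheck -local

# list OLR backups in case a restore is preferred
/u01/app/11.2.0/grid/bin/ocrconfig -local -showbackup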

This environment was clearly in rough shape. Everything was cleaned up, and addnode plus root.sh were run again.

8. addnode from node1 complains that node3 already exists, so the stale entry has to be cleaned up first.

# as root
$GRID_HOME/bin/crsctl delete node -n anbob003
# as grid
./runInstaller -updateNodeList ORACLE_HOME=CRS_home  "CLUSTER_NODES={remaining_nodes_list}" CRS=TRUE
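For this environment that works out to roughly the following; the node names are the two surviving nodes from this article and the Grid home path is an assumption based on the earlier logs:

# as root on an existing node: remove the stale node entry
/u01/app/11.2.0/grid/bin/crsctl delete node -n anbob003

# as grid: refresh the inventory node list on the remaining nodes
/u01/app/11.2.0/grid/oui/bin/runInstaller -updateNodeList ORACLE_HOME=/u01/app/11.2.0/grid "CLUSTER_NODES={anbob001,anbob002}" CRS=TRUE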

That resolved the addnode failure.

Running root.sh again on node3 then failed with:

crs-4046 invalid oracle clusterware configuration
crs-4000 command create failed, or completed with errors

To resolve this, deconfigure the failed node (rootcrs.pl lives under the Grid home's crs/install directory):

# /crs/install/rootcrs.pl -deconfig -force -verbose

9. ACFS-9459 error
However, the deconfig then failed with an ACFS-9459 error. First check whether anything is still running out of the Grid home:

ps -ef|grep u01

There were no active processes under the ORACLE_HOME, so the main thing is to resolve the ACFS-9459 error itself, which means the ACFS/ADVM driver does not support the current OS kernel; a related bug is 21233961. One attempt was to fake a supported kernel version by symlinking the driver directory:

cd /u01/app/grid/11.2.0/install/usm/Novell/SLES11/x86_64
mv 3.0.61-0.9 3.0.61-0.9_bak
ln -s 3.0.13-0.27 3.10.0-957.21

Symlinking the version this way did not solve it. After the ACFS patch on node3 was rolled back, deconfig succeeded and root.sh then completed as well. This looks like an Oracle bug, a conflict between the ACFS patch and the installation order.
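When chasing this kind of ACFS/kernel mismatch, it helps to ask the Grid home what it thinks of the running kernel and which interim patches are installed. A small sketch; both utilities ship with the 11.2 Grid home, and the paths are the same assumption as above:

# does the installed ACFS/ADVM driver support the running kernel?
/u01/app/11.2.0/grid/bin/acfsdriverstate supported
/u01/app/11.2.0/grid/bin/acfsdriverstate version

# which interim patches (e.g. ACFS-related ones) are installed in this home?
/u01/app/11.2.0/grid/OPatch/opatch lsinventory -oh /u01/app/11.2.0/grid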
