
Troubleshooting a RAC instance crash caused by a private network IP address conflict

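Today one node of a RAC cluster (10.2.0.5, two-node RAC on AIX) rebooted, and CRS did not start afterwards. Starting CRS manually did not bring it up either. The troubleshooting process is recorded below.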

$ ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM

$ ps -ef|grep d.bin
root 9109626 10813558 0 08:51:54 - 0:00 /oracle/product/102/crs/bin/crsd.bin reboot
oracle 9371872 7864454 0 08:52:09 pts/6 0:00 grep d.bin
oracle 9961690 11272320 0 08:51:46 - 0:00 /oracle/product/102/crs/bin/evmd.bin
oracle 10289302 11010270 0 08:52:01 - 0:00 /oracle/product/102/crs/bin/ocssd.bin

$ ./crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.anbobdb.db application    ONLINE    ONLINE    anbobrac1
ora....b1.inst application    ONLINE    ONLINE    anbobrac1
ora....b2.inst application    ONLINE    OFFLINE
ora....srac.cs application    ONLINE    ONLINE    anbobrac1
ora....db1.srv application    ONLINE    ONLINE    anbobrac1
ora....db2.srv application    ONLINE    OFFLINE
ora....C1.lsnr application    ONLINE    ONLINE    anbobrac1
ora....ac1.gsd application    ONLINE    ONLINE    anbobrac1
ora....ac1.ons application    ONLINE    ONLINE    anbobrac1
ora....ac1.vip application    ONLINE    ONLINE    anbobrac1
ora....C2.lsnr application    ONLINE    ONLINE    anbobrac2
ora....ac2.gsd application    ONLINE    OFFLINE
ora....ac2.ons application    ONLINE    OFFLINE
ora....ac2.vip application    ONLINE    ONLINE    anbobrac2

1,
# crs alert
2014-12-08 02:47:42.168
[cssd(21889152)]CRS-1612:node anbobrac1 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2014-12-08 02:47:43.175
[cssd(21889152)]CRS-1612:node anbobrac1 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2014-12-08 02:47:50.197
[cssd(21889152)]CRS-1611:node anbobrac1 (0) at 75% heartbeat fatal, eviction in 0.000 seconds
2014-12-08 02:47:54.200
[cssd(21889152)]CRS-1610:node anbobrac1 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2014-12-08 02:47:55.200
[cssd(21889152)]CRS-1610:node anbobrac1 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2014-12-08 02:47:56.203
[cssd(21889152)]CRS-1610:node anbobrac1 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2014-12-08 02:57:47.996

# crs log
2014-12-08 02:57:43.374: [ default][1][ENTER]32
Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2004, Oracle. All rights reserved
2014-12-08 02:57:43.382: [ default][1]32CRS Daemon Starting
2014-12-08 02:57:43.391: [ CRSMAIN][1]32Checking the OCR device
2014-12-08 02:57:43.394: [ CRSMAIN][1]32Connecting to the CSS Daemon
2014-12-08 02:57:43.564: [ COMMCRS][258]clsc_connect: (1103bf910) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_anbobrac2_))

2014-12-08 02:57:43.564: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

2014-12-08 02:57:43.564: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

# ocssd log
[ CSSD]2014-12-08 02:47:36.712 [2058] >TRACE: clssgmAddGrockMember: adding member to grock SRVM.DATABASE.NODEAPPS.anbobrac2
[ CSSD]2014-12-08 02:47:36.712 [2058] >TRACE: clssgmAllocateRPCIndex: allocated rpc 1548 (11081ecb0)
[ CSSD]2014-12-08 02:47:36.712 [2058] >TRACE: clssgmpeersend: send failed type 3, node 1, unreachable, flags 0x0, quiesced 0
[ CSSD]2014-12-08 02:47:38.563 [3600] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2014-12-08 02:47:38.563 [3600] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2014-12-08 02:47:42.168 [3343] >WARNING: clssnmPollingThread: node anbobrac1 (1) at 50% heartbeat fatal, eviction in 14.696 seconds seedhbimpd 0
[ CSSD]2014-12-08 02:47:42.168 [3343] >TRACE: clssnmPollingThread: node anbobrac1 (1) is impending reconfig, flag 1, misstime 15304
[ CSSD]2014-12-08 02:47:42.168 [3343] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
[ CSSD]2014-12-08 02:47:42.568 [3600] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2014-12-08 02:47:42.568 [3600] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2014-12-08 02:47:43.175 [3343] >WARNING: clssnmPollingThread: node anbobrac1 (1) at 50% heartbeat fatal, eviction in 13.689 seconds seedhbimpd 1
[ CSSD]2014-12-08 02:47:46.569 [3600] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2014-12-08 02:47:46.569 [3600] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2014-12-08 02:47:50.197 [3343] >WARNING: clssnmPollingThread: node anbobrac1 (1) at 75% heartbeat fatal, eviction in 6.667 seconds seedhbimpd 1
[ CSSD]2014-12-08 02:47:50.573 [3600] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2014-12-08 02:47:50.573 [3600] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2014-12-08 02:47:54.200 [3343] >WARNING: clssnmPollingThread: node anbobrac1 (1) at 90% heartbeat fatal, eviction in 2.665 seconds seedhbimpd 1
[ CSSD]2014-12-08 02:47:54.582 [3600] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2014-12-08 02:47:54.582 [3600] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2014-12-08 02:47:55.200 [3343] >WARNING: clssnmPollingThread: node anbobrac1 (1) at 90% heartbeat fatal, eviction in 1.665 seconds seedhbimpd 1
[ CSSD]2014-12-08 02:47:55.689 [1287] >TRACE: clssgmDiscOmonReady: omon was posted for member 2
[ CSSD]2014-12-08 02:47:55.689 [1287] >ERROR: clssnmvDiskKillCheck: Aborting, evicted by node 1, anbobrac1, sync 7, stamp 4281073396, unique 1396521241
[ CSSD]2014-12-08 02:47:55.689 [1287] >ERROR: ###################################
[ CSSD]2014-12-08 02:47:55.689 [1287] >ERROR: clssscExit: CSSD aborting from thread clssnmvKillBlockThread0
[ CSSD]2014-12-08 02:47:55.689 [1287] >ERROR: ###################################
..
WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
..
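
A side note on the numbers in the ocssd log: the first 50% warning (02:47:42.168) reports a misstime of 15304 ms together with an eviction countdown of 14.696 seconds. Adding the two gives roughly 30 seconds, which would be consistent with the CSS heartbeat timeout (misscount) in effect on this cluster. This is inferred from the log lines only, not stated in the original notes, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```shell
# Hypothetical helper: add the already-missed heartbeat time (milliseconds)
# to the remaining eviction countdown (seconds) taken from one CSSD warning.
misscount_from_warning() {
    # $1 = misstime in ms, $2 = "eviction in" value in seconds
    awk -v miss_ms="$1" -v left_s="$2" \
        'BEGIN { printf "%.3f\n", miss_ms / 1000 + left_s }'
}

# Values copied from the 02:47:42.168 lines of the ocssd log above
misscount_from_warning 15304 14.696
```

If the sum stays the same across warnings, the countdown is simply the timeout minus the time since the last heartbeat was received from the other node.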

[oracle@anbobrac2:/oracle/product/102/crs/log/anbobrac2/cssd]# errpt|more
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
FE2DEE00 1208081914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208073914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208065914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208061114 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208053914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208052914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208045914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208044914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208040914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00 1208032914 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
AFA89905 1208025614 I O grpsvcs Group Services daemon started
97419D60 1208025614 I O topsvcs Topology Services daemon started
A6DF45AA 1208025514 I O RMCdaemon The daemon is started.
FE2DEE00 1208025514 P S SYSXAIXIF DUPLICATE IP ADDRESS DETECTED IN THE NET
D221BD55 1208025414 I O perftune RESTRICTED TUNABLES MODIFIED AT REBOOT
67145A39 1208025314 U S SYSDUMP SYSTEM DUMP
F48137AC 1208025314 U O minidump COMPRESSED MINIMAL DUMP
A924A5FC 1208025314 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
9DBCFDEE 1208025414 T O errdemon ERROR LOGGING TURNED ON
A924A5FC 1208024714 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
3D32B80D 1015112214 P S topsvcs NIM thread blocked
F7FA22C9 0730010214 I O SYSJ2 UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
3D32B80D 0729174814 P S topsvcs NIM thread blocked

[oracle@anbobrac2:/oracle]# errpt -aj FE2DEE00|more
---------------------------------------------------------------------------
LABEL: AIXIF_ARP_DUP_ADDR
IDENTIFIER: FE2DEE00

Date/Time: Mon Dec 8 06:59:33 2014
Sequence Number: 36503
Machine Id: 00F722C44C00
Node Id: anbobrac2
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSXAIXIF

Description
DUPLICATE IP ADDRESS DETECTED IN THE NET

Failure Causes
ARP RESPONSE RECEIVED FOR MY IP ADDRESS

Recommended Actions
CONTACT NETWORK ADMINISTRATOR

Detail Data
DUPLICATE IP ADDRESS
C0A8 C802
MAC ADDRESS
C434 6BC4 18DB

[oracle@anbobrac2:/oracle]# echo "ibase=16; C0"|bc
192
[oracle@anbobrac2:/oracle]# echo "ibase=16; A8"|bc
168
[oracle@anbobrac2:/oracle]# echo "ibase=16; C8"|bc
200
[oracle@anbobrac2:/oracle]# echo "ibase=16; 02"|bc
2
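
The four bc conversions above can be folded into one small helper. This is only a sketch: the function name hex_to_ip is made up for illustration, and it assumes the "DUPLICATE IP ADDRESS" detail field always holds exactly eight hex digits, as it does in the errpt output shown here.

```shell
# Hypothetical helper: turn the hex IP from errpt detail data
# (for example "C0A8 C802") into dotted-decimal notation.
hex_to_ip() {
    # Drop the space between the two 16-bit groups: "C0A8 C802" -> "C0A8C802"
    hex=$(printf '%s' "$1" | tr -d ' ')
    # Take each pair of hex digits and let printf convert it to decimal
    printf '%d.%d.%d.%d\n' \
        "0x$(printf '%s' "$hex" | cut -c1-2)" \
        "0x$(printf '%s' "$hex" | cut -c3-4)" \
        "0x$(printf '%s' "$hex" | cut -c5-6)" \
        "0x$(printf '%s' "$hex" | cut -c7-8)"
}

# Value copied from the errpt detail data above
hex_to_ip "C0A8 C802"
```

For the value in this incident the result is 192.168.200.2, the same address the manual conversions produce.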

[oracle@anbobrac2:/oracle]# ifconfig -a|grep inet
inet 192.168.199.2 netmask 0xffffff00 broadcast 192.168.199.255
inet 136.142.31.62 netmask 0xffffff00 broadcast 136.142.31.255
inet 136.142.31.64 netmask 0xffffff00 broadcast 136.142.31.255
inet 192.168.200.2 netmask 0xffffff00 broadcast 192.168.200.255
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1%1/0

[oracle@anbobrac2:/oracle]# ping 192.168.200.1

[oracle@anbobrac2:/oracle]# cat /etc/hosts

192.168.200.1 anbobrac1-priv anbobrac1_boot2
192.168.200.2 anbobrac2-priv anbobrac2_boot2

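In the end, the cause turned out to be that the private IP address of RAC node 2 had been taken by another device. It was later confirmed that someone had assigned this IP to a blade chassis the day before. After the IP was reclaimed, CRS was started manually again, and this time it came up successfully. Problems like this are ultimately a matter of IP planning and management, and hopefully it will not happen again.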

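Below is an excerpt reposted from a blog post by Gwen Shapira. Besides checking the OS error log, the following steps can also uncover this kind of problem.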

Troubleshooting Broken Clusterware

Here are the steps I found useful when debugging cluster issues:

Check DB alert log on all nodes
Check clusterware logs on all nodes. They are found in $CRS_HOME/log. The useful ones are the alert log, crsd log and cssd log.
Check write permissions to voting disk. From all nodes. As Oracle and as root.
Check the network interfaces, both by looking at ifconfig on all nodes and by pinging every node from every other node using all its names and interfaces (public, private, vip).
Verify SSH the same way.
Check that both nodes run the same OS version and same DB and clusterware versions (including patches).
Stop and start clusterware on each node separately and then on both nodes together.
Reboot both nodes.
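
The network step in the list above can be scripted as a simple loop. The sketch below is an illustration only: the function name check_hosts and the PING_CMD variable are made up here, and "ping -c 1" is an assumed default that may need adjusting for the platform.

```shell
# Command used to probe one host; override PING_CMD if the platform needs
# different options (assumption: "ping -c 1 <host>" sends a single probe).
PING_CMD=${PING_CMD:-"ping -c 1"}

# Hypothetical helper: probe every name given as an argument and
# print one status line per name.
check_hosts() {
    for h in "$@"; do
        if $PING_CMD "$h" >/dev/null 2>&1; then
            echo "$h OK"
        else
            echo "$h UNREACHABLE"
        fi
    done
}
```

It would be run on each node with every public, private and VIP name, for example check_hosts anbobrac1-priv anbobrac2-priv for the private names listed in /etc/hosts above. Note that with a duplicate address a ping may still be answered by the wrong machine, so comparing the MAC address in the ARP table with the expected adapter is a useful extra check.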
