首页 » ORACLE » After OS reboot, Ohasd(cssd) start fail is due to OLR corrupted

After OS reboot, Ohasd(cssd) start fail is due to OLR corrupted

前几天帮助同事处理了个案例, 主机意外重启后数据库无法启动, 环境是11.2.0.3 standalone on aix, 用的是ASM.最后确认是OLR损坏, 因为当时没有记录具体流程,在这里只简单的记录.

# check init.ohasd process is running
ps -ef | grep init.ohasd
# if not the run as root 
/etc/init.d/init.ohasd run >/dev/null 2>&1

到这步发现HAS已启动,但是cssd都没启动, cssd和一些local OCR做为资源是随HAS启动的, 如果没有数据库使用ASM, 默认是不会自动启动cssd和asm的。 但是使用crsctl start res ora.cssd -init又提示没有资源, 查看GI alert log 如下

# gi alert
[ohasd(7210)]CRS-8017:location: /etc/oracle/lastgasp has 4 reboot advisory log files, 0 were announced and 0 errors occurred
2015-10-18 12:51:23.952
[ohasd(7210)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
2015-10-18 13:52:40.249
[ohasd(8085)]CRS-2112:The OLR service started on node anbob.
2015-10-18 13:52:40.314
[ohasd(8085)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
2015-10-18 14:01:57.696
[ohasd(8668)]CRS-2112:The OLR service started on node anbob.
2015-10-18 14:01:57.761
[ohasd(8668)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
2015-10-18 15:03:27.157
[ohasd(11026)]CRS-2112:The OLR service started on node anbob.
2015-10-18 15:03:27.202
[ohasd(11296)]CRS-1339:Oracle High Availability Service aborted due to an unexpected error 
[Failed to initialize Oracle Local Registry]. Details at (:OHAS00106:) in 
/oracle/app/oracle/product/11.2.0/grid/log/anbob/ohasd/ohasd.log.
2015-10-18 15:03:27.220
[ohasd(11026)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
2015-10-18 15:23:08.435
[ohasd(12135)]CRS-2112:The OLR service started on node anbob.
2015-10-18 15:23:08.499
[ohasd(12135)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.

# OHASD.LOG  

[grid@anbob anbob]$ vi /oracle/app/oracle/product/11.2.0/grid/log/anbob/ohasd/ohasd.log

2015-10-18 15:03:27.159: [  OCRSRV][1032845088]th_init: Local listener did not reach valid state
2015-10-18 15:03:27.159: [  CRSOCR][555579168] CAAOCR GET Debug sblevel Level[default]: 0
...
[  clsdmt][4118787840]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=anbobDBG_OHASD))
2015-10-18 15:03:27.170: [  clsdmt][4118787840]PID for the Process [11026], connkey 8
2015-10-18 15:03:27.176: [ default][555579168] Ohasd Daemon Started.
2015-10-18 15:03:27.176: [   CLSLG][3747604224] Last Gasp Monitor thread started
2015-10-18 15:03:27.176: [   CLSLG][3747604224] processing Last Gasp disk location /etc/oracle/lastgasp
2015-10-18 15:03:27.177: [    CLSE][555579168]clse_get_auth_loc: Returning default authloc: /oracle/app/oracle/product/11.2.0/grid/auth/ohasd/anbob
2015-10-18 15:03:27.177: [ default][555579168] AuthLoc /oracle/app/oracle/product/11.2.0/grid/auth/ohasd/anbob
2015-10-18 15:03:27.177: [ default][555579168] PE Engine: NEW
2015-10-18 15:03:27.177: [ default][555579168] Using OCR batch ops : ENABLED
2015-10-18 15:03:27.177: [ default][555579168] RD registrations and Clusterization disabled.
2015-10-18 15:03:27.177: [   CLSLG][3747604224] monitoring new interface 0.0.0.0
2015-10-18 15:03:27.178: [ default][555579168][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11))
2015-10-18 15:03:27.178: [CLSFRAME][555579168] Inited lsf context 0x220efe0
2015-10-18 15:03:27.178: [CLSFRAME][555579168] Initing CLS Framework messaging
2015-10-18 15:03:27.179: [  CLSVER][555579168] Static Version 11.2.0.1.0
2015-10-18 15:03:27.179: [ default][555579168][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11))
2015-10-18 15:03:27.180: [UiServer][555579168] UI Comms initalize() 1
2015-10-18 15:03:27.181: [CLSFRAME][555579168] New Framework state: 2
2015-10-18 15:03:27.181: [CLSFRAME][555579168] M2M is starting...
2015-10-18 15:03:27.182: [ CRSCOMM][555579168] m_pClscCtx=0x22c8030m_pUgblm=0x22d32f0
2015-10-18 15:03:27.182: [ CRSCOMM][555579168] Starting send thread
2015-10-18 15:03:27.182: [ CRSCOMM][555579168] IPC Listener instantiated for: (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11))
2015-10-18 15:03:27.182: [ CRSCOMM][205494016] clsIpc: sendWork thread started.
2015-10-18 15:03:27.182: [ CRSCOMM][555579168] IPC Listener started listening.
2015-10-18 15:03:27.183: [ CRSCOMM][4097808128] IPCL thread started listening
2015-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: AgfwProxySrvTM
2015-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: OcrModuleTM
2015-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: PolicyEngineTM
2015-10-18 15:03:27.184: [CLSFRAME][555579168] Starting thread model named: SharedThreadTM
2015-10-18 15:03:27.184: [CLSFRAME][555579168] Starting thread model named: UiServerTM
2015-10-18 15:03:27.184: [CLSFRAME][555579168] New Framework state: 3
2015-10-18 15:03:27.185: [  CRSRPT][3699324672] Enabled
2015-10-18 15:03:27.185: [   CRSPE][3701425920] PE Role|State Update: old role [INVALID] new [INVALID]; old state [Not yet initialized] new [Enabling: waiting for role]
2015-10-18 15:03:27.185: [   CRSSE][3699324672] SE module master election disabled
2015-10-18 15:03:27.185: [   CRSSE][3699324672] Master Change Event; New Master Node ID:0 This Node's ID:0
2015-10-18 15:03:27.189: [   CRSPE][3701425920] Sent request to write event sequence number 1400000 to repository
2015-10-18 15:03:27.189: [   CRSPE][3701425920] Reading (1) servers
2015-10-18 15:03:27.189: [   CRSPE][3701425920] There are no resource types to read.
2015-10-18 15:03:27.189: [   CRSPE][3701425920] There are no resources to read.
2015-10-18 15:03:27.191: [   CRSPE][3701425920] Wrote new event sequence to repository
2015-10-18 15:03:27.192: [   CRSPE][3701425920] Reading (1) server pools
2015-10-18 15:03:27.196: [   CRSPE][3701425920] Finished reading configuration. Parsing...
2015-10-18 15:03:27.202: [   CRSPE][3701425920] Parsing server pools...
2015-10-18 15:03:27.202: [  CRSOCR][1032845088] OCR context init failure.  Error: PROCL-24: Error in the messaging layer Messaging error [18]
2015-10-18 15:03:27.202: [ default][1032845088] OLR initalization failured, rc=24
2015-10-18 15:03:27.202: [   CRSPE][3701425920] Parsed and validated SERVERPOOL: Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED
2015-10-18 15:03:27.202: [ default][1032845088]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2015-10-18 15:03:27.202: [   CRSPE][3701425920] Server pools parsed
2015-10-18 15:03:27.202: [ default][1032845088][PANIC] OHASD exiting; Could not init OLR
2015-10-18 15:03:27.202: [   CRSPE][3701425920] Server Pool Free has been registered
2015-10-18 15:03:27.202: [ default][1032845088] Done.

2015-10-18 15:03:27.202: [   CRSPE][3701425920] Cluster reboot took place.
2015-10-18 15:03:27.203: [   CRSPE][3701425920] Configuration has been parsed
2015-10-18 15:03:27.203: [ default][3697223424][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_UI_SOCKET))
2015-10-18 15:03:27.204: [UiServer][3697223424] UI socket on: (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_UI_SOCKET))
2015-10-18 15:03:27.204: [ default][3697223424][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
2015-10-18 15:03:27.204: [UiServer][3697223424] UI socket on: (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
2015-10-18 15:03:27.204: [UiServer][3695122176] UI comms listening for events.
2015-10-18 15:03:27.204: [CLSFRAME][3711932160] Module Enabling is complete

从上面可以看到OLR初始化失败,应该是olr miss或corrupted,因为是Standalone所有找个本地的olr备份还原一下应该就可以解决。

[grid@anbob ~]$ cd $ORACLE_HOME
[grid@anbob grid]$ cd cdata
[grid@anbob cdata]$ ls
anbob  localhost
[grid@anbob cdata]$ cd anbob
[grid@anbob anbob]$ ls
backup_20140818_113812.olr
[grid@anbob anbob]$ ls -lrt
total 5260
-rwxr-xr-x 1 grid oinstall 5386240 Aug 18  2014 backup_20140818_113812.olr

[grid@anbob anbob]$ ocrconfig -local -showbackup
PROTL-25: Manual backups for the Oracle Local Registry are not available

可以看到有个2014年的一个olr备份,ocr备份存放在$GRID_HOME/cdata, 其它没有手动备份的。

[grid@anbob anbob]$ ls -lrt /etc/oracle/
total 24
drwxr-xr-x 3 root oinstall 4096 Oct 18 04:58 scls_scr
drwxrwxr-x 5 root oinstall 4096 Oct 18 04:58 oprocd
-rw-r--r-- 1 root root        0 Oct 18 04:58 olr.loc.orig
-rw-r--r-- 1 root oinstall  130 Oct 18 04:58 olr.loc
-rw-r--r-- 1 root root       16 Oct 18 04:58 ocr.loc.orig
-rw-r----- 1 grid oinstall   95 Oct 18 04:58 ocr.loc
drwxrwx--- 3 root oinstall 4096 Oct 18 13:33 lastgasp

# To restore OLR

crsctl stop has
ocrconfig -local -restore  /oracle/app/oracle/product/11.2.0/grid/cdata/anbob/backup_20140818_113812.olr

crsctl start has
crsctl start res ora.cssd -init

终于可以显示资源了,并且cssd已自动启动,剩下的工作就是手动启asm,mount asm磁盘组,启数据库

su - grid
sqlplus / as sysasm
# startup asm
alter diskgroup  mount
su - oracle
# startup database;

这里因为刚好有以前的备份,如果没有备份时可以参考NOTE 1539020.1, 重建OLR

In an environment with fresh GI installation:
1. Deconfig the existing clusterware, as this is only done on the problematic node, the other nodes should have Clusterware up and running, this command should only deconfig the clusterware configuration for local node, it should not touch/change OCR.
# /crs/install/rootcrs.pl -deconfig -force
2. Rerun root.sh
# /root.sh

打赏

对不起,这篇文章暂时关闭评论。