Oracle 11g RAC on Linux 7: ohasd.bin does not start after a reboot

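As is well known, installing Oracle 11g (11.2.0.4) RAC on Linux 7 is not entirely smooth. I have written up a few pitfalls before. The most common one is ohasd.bin or ohasd.service failing to start, which shows up while running root.sh, after an operating system reboot, or while applying a patch. Usually it is enough to create the service by hand or to install the patch that introduces it. This case was more complicated: after a power loss and reboot, CRS would not start. Brief notes follow.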

Symptoms

$ps -ef| grep d.bin
grid 24180 20219 0 18:36 pts/1 00:00:00 grep --color=auto d.bin

$ps -ef| grep ohas
grid 24180 20219 0 18:36 pts/1 00:00:00 grep --color=auto ohas

$crsctl start crs|has
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

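CRS is not running, ohasd.bin is not running, not even /etc/init.d/init.ohasd is running, and the CRS logs show no output at all. As we know, the CRS startup order is: init.ohasd spawns ohasd.bin, which then starts a set of agents and the crs, css and other service processes.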

Common causes

  • ohasd service missing
  • CRS autostart disabled
  • ohasd.bin binary damaged
  • OLR corrupted
  • operating system restrictions

Analysis

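On the operating system side, there is a known issue with RHEL 7.3 kernel 3.10.0-514.21.1: after an OS kernel upgrade with KernelCare turned on, a security restriction blocks the ohasd process. That was ruled out in this case.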

Checking the OLR

[root@anbob1 anbob1]# tail -n 6 alertanbob1.log 
2025-12-03 17:05:17.504: 
[ohasd(8688)]CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started.
2025-12-03 17:09:40.806: 
[ohasd(8851)]CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started.
2025-12-03 17:30:22.591: 
[ohasd(13348)]CRS-0704:Oracle High Availability Service aborted due to Oracle Local Registry error [PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]]. Details at (:OHAS00106:) in /oracle/app/11.2.0/grid/log/anbob1/ohasd/ohasd.log.
[root@anbob1 anbob1]# vi /oracle/app/11.2.0/grid/log/anbob1/ohasd/ohasd.log
[root@anbob1 anbob1]# tail -f 8 /oracle/app/11.2.0/grid/log/anbob1/ohasd/ohasd.log
tail: cannot open ‘8’ for reading: No such file or directory
==> /oracle/app/11.2.0/grid/log/anbob1/ohasd/ohasd.log <==
2025-12-03 17:30:22.590: [ default][923313984] Initializing OLR
2025-12-03 17:30:22.590: [  OCROSD][923313984]utopen:6m': failed in stat OCR file/disk /oracle/app/11.2.0/grid/cdata/anbob1.olr, errno=2, os err string=No such file or directory
2025-12-03 17:30:22.590: [  OCROSD][923313984]utopen:7: failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2025-12-03 17:30:22.590: [  OCRRAW][923313984]proprinit: Could not open raw device 
2025-12-03 17:30:22.590: [  OCRAPI][923313984]a_init:16!: Backend init unsuccessful : [26]
2025-12-03 17:30:22.591: [  CRSOCR][923313984] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2025-12-03 17:30:22.591: [ default][923313984] Created alert : (:OHAS00106:) :  OLR initialization failed, error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2025-12-03 17:30:22.591: [ default][923313984][PANIC] OHASD exiting; Could not init OLR
2025-12-03 17:30:22.591: [ default][923313984] Done.

The OLR file is indeed missing.

[root@anbob1 oracle]# cd /etc/oracle

[root@anbob1 oracle]# ll
total 2224
drwxrwx--- 2 root oinstall      58 Aug 17  2023 lastgasp
-rw-r--r-- 1 root oinstall      36 Aug 17  2023 ocr.loc
-rw-r--r-- 1 root root          36 Aug 17  2023 ocr.loc.orig
-rw-r--r-- 1 root oinstall      88 Aug 17  2023 olr.loc
-rw-r--r-- 1 root root          88 Aug 17  2023 olr.loc.orig
drwxrwxr-x 5 root oinstall      41 Aug 17  2023 oprocd
drwxr-xr-x 3 root oinstall      19 Aug 17  2023 scls_scr
-rws--x--- 1 root oinstall 2260808 Aug 17  2023 setasmgid

[root@anbob1 oracle]# cat olr.loc
olrconfig_loc=/oracle/app/11.2.0/grid/cdata/anbob1.olr
crs_home=/oracle/app/11.2.0/grid
[root@anbob1 oracle]#



[root@anbob1 ~]# cd /oracle/app/11.2.0/grid/cdata/
[root@anbob1 cdata]# ll
total 0
drwxr-xr-x 5 grid oinstall 64 Dec  3 12:04 bak20251203

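It turned out that the on-site staff, suspecting an OLR problem, had moved the directory contents into a backup directory. The file itself was not damaged, so copying it back was enough. After a restart the problem remained, except that the OLR error was no longer reported.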
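The OLR check above can be sketched as a small shell helper: read the olrconfig_loc entry from olr.loc and test whether the file it names exists and is non-empty. This is only a sketch; the function names olr_path_from_loc and check_olr are made up for illustration and are not part of Grid Infrastructure.

```shell
# Print the OLR path recorded in an olr.loc file (first olrconfig_loc= entry).
# Hypothetical helper, not an Oracle tool.
olr_path_from_loc() {
    # $1: path to olr.loc (normally /etc/oracle/olr.loc)
    sed -n 's/^olrconfig_loc=//p' "$1" | head -n 1
}

# Report whether the OLR file named in olr.loc exists and is non-empty.
# Returns 0 if present, 1 if missing or empty, 2 if olr.loc has no entry.
check_olr() {
    loc_file=${1:-/etc/oracle/olr.loc}
    olr=$(olr_path_from_loc "$loc_file")
    if [ -z "$olr" ]; then
        echo "no olrconfig_loc entry in $loc_file"
        return 2
    fi
    if [ -s "$olr" ]; then
        echo "OLR present: $olr"
    else
        echo "OLR missing or empty: $olr"
        return 1
    fi
}
```

Calling check_olr with no argument inspects /etc/oracle/olr.loc. It only confirms that the file is there; once the file is back in place, running ocrcheck -local as root from the GI home is the usual way to check its integrity.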

Checking the service

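On Linux 4 and Linux 5, once the kernel has booted it starts the user-level program /sbin/init, which in turn starts the other user-level processes and services. /sbin/init reads /etc/inittab, which contains the commands used to start those processes and services. After RAC (10.2 or 11.2) is installed on Linux 5, the installation scripts append a line to this file that starts the ohasd daemon. For CRS to start at boot, /etc/inittab needs the following entry: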


h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

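On Linux 6, init only reads configuration files, resolves the dependencies between services and applications, starts them in response to events, and manages them dynamically. Events are handled by the Upstart event manager, and ohasd is started from the oracle-ohasd.conf file under /etc/init rather than from /etc/inittab as on Linux 5. Nothing is written to /etc/inittab for ohasd, so the event configuration files in /etc/init/ determine which services the system runs at boot: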

cat /etc/init/oracle-ohasd.conf

exec /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

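On Linux 7 the system uses systemd, which starts services in parallel using sockets and D-Bus and supports on-demand activation of daemons. There is no concept of runlevels, although systemd remains compatible with sysvinit. Unit files for services are stored under /lib/systemd/system/ with names of the form *.service. On Linux 7 the ohasd unit that starts Oracle RAC must be configured as shown below. Running root.sh during installation normally configures the ohasd service unit; if it did not, you can configure it by hand. The following output is from a healthy node: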

[root@node01 ~]# systemctl -a|grep ohas
ohasd.service loaded active exited LSB: Start and Stop Oracle High Availability Service
oracle-ohasd.service loaded active running Oracle High Availability Services
[root@node01 ~]# systemctl status oracle-ohasd
● oracle-ohasd.service - Oracle High Availability Services
Loaded: loaded (/etc/systemd/system/oracle-ohasd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2025-10-28 14:13:47 CST; 1 months 7 days ago
Main PID: 1158 (init.ohasd)
Tasks: 1 (limit: 48800)
Memory: 260.0K
CGroup: /system.slice/oracle-ohasd.service
├─ 1158 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
├─ 2197 /u01/app/19.3.0.0/grid/bin/ohasd.bin reboot
├─ 2635 /u01/app/19.3.0.0/grid/bin/orarootagent.bin
├─ 2714 /u01/app/19.3.0.0/grid/bin/oraagent.bin
├─ 2756 /u01/app/19.3.0.0/grid/bin/mdnsd.bin
...

[root@node01 ~]# systemctl status ohasd
● ohasd.service - LSB: Start and Stop Oracle High Availability Service
   Loaded: loaded (/etc/rc.d/init.d/ohasd; generated)
   Active: active (exited) since Tue 2025-10-28 14:13:49 CST; 1 months 7 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1160 ExecStart=/etc/rc.d/init.d/ohasd start (code=exited, status=0/SUCCESS)
    Tasks: 670 (limit: 48800)
   Memory: 5.6G
   CGroup: /system.slice/ohasd.service

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Normally systemctl shows two services, ohasd.service and oracle-ohasd.service. On RHEL 7, oracle-ohasd.service (/etc/systemd/system/oracle-ohasd.service) calls /etc/init.d/init.ohasd, and ohasd.service (/etc/rc.d/init.d/ohasd) also ends up calling /etc/init.d/init.ohasd.

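In a GI environment the first daemon to start is OHAS. It relies on the init process calling /etc/init.d/init.ohasd, after which /etc/rc.d/init.d/ohasd is started and in turn executes $GRID_HOME/bin/ohasd.bin. Without a working ohasd.bin process, none of the other processes can run. An administrator can disable startup of the high availability services stack by calling `crsctl disable crs`. This call updates a flag in the file `/etc/oracle/scls_scr/hostname/root/ohasdstr`. The file contains a single word, either `enable` or `disable`, with no trailing newline. If it is set to `disable`, /etc/rc.d/init.d/ohasd will not continue with startup, and the cluster stack has to be started manually with crsctl start crs.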

In short, "crsctl disable crs" updates "/etc/oracle/scls_scr/<host_name>/root/ohasdstr", the CRS autostart flag, which has only two possible values: "enable" or "disable".

[root@node01 root]# pwd
/etc/oracle/scls_scr/node01/root

[root@node01 root]# /u01/app/19.3.0.0/grid/bin/crsctl disable crs
CRS-4621: Oracle High Availability Services autostart is disabled.
[root@node01 root]# strings ohasdstr
disable
[root@node01 root]# /u01/app/19.3.0.0/grid/bin/crsctl enable crs
CRS-4622: Oracle High Availability Services autostart is enabled.
[root@node01 root]# strings ohasdstr
enable

CRS autostart can also be ruled out here, because a manual start fails in the same way.
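The autostart flag described above can also be read directly, without calling crsctl. The following is a minimal sketch; crs_autostart_state is a hypothetical helper name, and the path in the comment is the one given in the text with <host_name> left as a placeholder.

```shell
# Print the autostart flag stored in an ohasdstr file: enable, disable,
# or unknown if the file is unreadable or holds anything else.
# Hypothetical helper, not an Oracle tool.
crs_autostart_state() {
    # $1: path to ohasdstr, e.g. /etc/oracle/scls_scr/<host_name>/root/ohasdstr
    if [ ! -r "$1" ]; then
        echo "unknown"
        return 1
    fi
    # The file holds one word, possibly without a trailing newline.
    state=$(tr -d '[:space:]' < "$1")
    case "$state" in
        enable|disable) echo "$state" ;;
        *) echo "unknown" ;;
    esac
}
```

On the node01 session shown above, crs_autostart_state /etc/oracle/scls_scr/node01/root/ohasdstr would print the same word that strings ohasdstr shows.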

In this case only ohasd.service existed and CRS autostart was not disabled, so we created one more service by hand. It is named ohas so that it does not clash with ohasd.

$ vi /usr/lib/systemd/system/ohas.service

[Unit]
Description=Oracle High Availability Services
After=syslog.target

[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1
Type=simple
Restart=always

[Install]
WantedBy=multi-user.target

$ systemctl daemon-reload
$ systemctl enable ohas.service

The service was created and started, but ohasd still did not come up and there was still no log output.
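When a hand-written unit appears to do nothing, one thing worth confirming is that every directive sits on its own line, since systemd treats the remainder of an ExecStart= line as arguments to the command. The sketch below prints the value of a key within a section of a unit file; unit_directive is a hypothetical helper written for this note, not a systemd command.

```shell
# Print the value of KEY inside [SECTION] of a systemd unit file.
# Prints nothing if the key does not start a line in that section.
# Hypothetical helper, not a systemd command.
unit_directive() {
    # $1: unit file, $2: section name (e.g. Service), $3: key (e.g. Type)
    awk -v section="[$2]" -v key="$3" '
        /^\[/ { in_section = ($0 == section); next }
        in_section && index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
        }
    ' "$1"
}
```

For the unit above, unit_directive /usr/lib/systemd/system/ohas.service Service Type should print simple if the Type directive is on its own line. The standard commands systemctl status ohas.service and journalctl -u ohas.service show what systemd itself recorded for the unit.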

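In some cases Red Hat has also suggested removing the redirections from the startup entry so that output is not discarded, for example in /etc/inittab: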

Change inittab entry from:
 h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
To
  h1:35:respawn:/etc/init.d/init.ohasd run

Tried it; no luck.

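Why create the service by hand? Because this environment is subject to a known ohasd startup bug, and as far as we could tell the patch for it had not been installed.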

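Oracle Linux 7 (and Red Hat 7) uses systemd rather than initd to start and restart processes and run them as services. As a result, software installations of 11.2.0.4 and 12.1.0.1 do not succeed as shipped, because the ohasd process cannot start properly. The known problem is that OL7 expects systemd, not initd, to run and restart processes, and root.sh does not currently handle this.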

The issue is covered by the following unpublished bug:

Bug 18370031: RC scripts on OL7 for Clusterware (/ETC/RC.D/RC.,/ETC/INIT.D/

Cleaning up the sockets

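The cluster stack is started manually with crsctl start crs. Many Grid Infrastructure background processes depend on sockets created under /var/tmp/.oracle, so the directory was moved aside and recreated: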

cd /var/tmp
mv .oracle .oracle_bak
/bin/mkdir -p /var/tmp/.oracle
/bin/chmod 01755 /var/tmp/             
/bin/chown root /var/tmp/
/bin/chown root:oinstall /var/tmp/.oracle
/bin/chmod 01755 /var/tmp/.oracle       

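Another reason ohasd.bin may fail to start is that the file system holding $GRID_HOME is corrupted or not mounted. As mentioned earlier, ohasd.bin lives in $GRID_HOME/bin, so if $GRID_HOME is not mounted the daemon cannot start.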

After recreating /var/tmp/.oracle, startup still failed. Running /etc/init.d/init.ohasd run by hand did create the npohasd file in /var/tmp/.oracle, but no other sockets appeared.
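To see what has actually been created under /var/tmp/.oracle after a start attempt, the named pipes and sockets can be listed on their own. A minimal sketch follows; list_ipc_files is a hypothetical helper name, and -maxdepth assumes the find shipped with Linux.

```shell
# List named pipes and sockets directly under a directory, one name per line.
# Hypothetical helper, not an Oracle tool.
list_ipc_files() {
    # $1: directory to inspect (default /var/tmp/.oracle)
    dir=${1:-/var/tmp/.oracle}
    find "$dir" -maxdepth 1 \( -type p -o -type s \) -exec basename {} \; | sort
}
```

In this case the listing would have contained only npohasd, which is consistent with init.ohasd running while ohasd.bin never got far enough to create its own sockets.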

Rebooting the operating system did not help either.

Starting ohasd.bin manually

[root@oarac2 ~]# /oracle/app/11.2.0/grid/bin/ohasd.bin restart
/oracle/app/11.2.0/grid/bin/ohasd.bin: error while loading shared libraries: libocr11.so: cannot open shared object file: No such file or directory

[root@oarac2 ~]# ldd `which ohasd.bin` |grep ocr11
        libocr11.so => /oracle/app/11.2.0/grid/lib/libocr11.so (0x00007fc76daf7000)
		
[root@oarac2 ~]# ls -l /oracle/app/11.2.0/grid/lib/libocr11.so
-rwxr-xr-x 1 grid oinstall 1612720 Aug 17  2023 /oracle/app/11.2.0/grid/lib/libocr11.so

So the problem lies with the shared libraries the binary is linked against. The next step was to relink the GI home.
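A quick way to scan a binary for libraries the loader cannot find is to filter ldd output for "not found" entries. The sketch below does only that; unresolved_libs is a hypothetical helper name.

```shell
# Read `ldd` output on stdin and print the names of unresolved libraries.
# Hypothetical helper, not an Oracle tool.
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Usage would be ldd /oracle/app/11.2.0/grid/bin/ohasd.bin | unresolved_libs. In the session above ldd resolved libocr11.so even though the direct start failed, so an empty result here does not by itself prove that the binary can load.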

Relinking the GI home

[root@oarac2 system]# cd /oracle/app/11.2.0/grid/crs/install

[root@oarac2 install]# perl rootcrs.pl -unlock
Can't locate Env.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 . .) at crsconfig_lib.pm line 703.
BEGIN failed--compilation aborted at crsconfig_lib.pm line 703.
Compilation failed in require at rootcrs.pl line 305.
BEGIN failed--compilation aborted at rootcrs.pl line 305.

[root@oarac2 5.10.0]# cd $ORACLE_HOME/perl/lib/5.*

[root@oarac2 5.10.0]# cp -p Env.pm /usr/lib64/perl5/vendor_perl/

[root@oarac2 5.10.0]# cd /oracle/app/11.2.0/grid/crs/install

[root@oarac2 install]# perl rootcrs.pl -unlock
Using configuration parameter file: ./crsconfig_params
CRS-4544: Unable to connect to OHAS
CRS-4000: Command Stop failed, or completed with errors.
Successfully unlock /oracle/app/11.2.0/grid

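The perl script was also missing a module (Env.pm). Copying it from the perl directory under the Oracle home into the operating system's perl library path, as shown above, is enough.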

[root@oarac2 install]# su - grid
Last login: Thu Dec  4 18:04:47 CST 2025 on pts/2

[grid@oarac2 ~]$ which relink
/oracle/app/11.2.0/grid/bin/relink
[grid@oarac2 ~]$ relink all
writing relink log to: /oracle/app/11.2.0/grid/install/relink.log
[grid@oarac2 ~]$ exit
logout
[root@oarac2 install]# perl rootcrs.pl -patch
Using configuration parameter file: ./crsconfig_params
Installing Trace File Analyzer


CRS-4123: Oracle High Availability Services has been started.
[root@oarac2 install]# 

After the relink, perl rootcrs.pl -patch brought GI up directly.

[root@oarac2 install]# crsctl stat res -t 
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
               ONLINE  ONLINE       anbob1                                       
               ONLINE  ONLINE       oarac2                                       
ora.DATA.dg
               ONLINE  ONLINE       anbob1                                       
               ONLINE  ONLINE       oarac2                                       
ora.LISTENER.lsnr
               ONLINE  ONLINE       anbob1                                       
               ONLINE  ONLINE       oarac2                                       
ora.OCR.dg
               ONLINE  ONLINE       anbob1                                       
               ONLINE  ONLINE       oarac2                                       
ora.asm
               ONLINE  ONLINE       anbob1                   Started             
               ONLINE  ONLINE       oarac2                   Started             
ora.gsd
               OFFLINE OFFLINE      anbob1                                       
               OFFLINE OFFLINE      oarac2                                       
ora.net1.network
               ONLINE  ONLINE       anbob1                                       
               ONLINE  ONLINE       oarac2                                       
ora.ons
               ONLINE  ONLINE       anbob1                                       
               ONLINE  ONLINE       oarac2                                       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       anbob1                                       
ora.cvu
      1        ONLINE  ONLINE       anbob1                                       
ora.oarac.db
      1        ONLINE  ONLINE       oarac2                   Open                
      2        ONLINE  ONLINE       anbob1                   Open                
ora.anbob1.vip
      1        ONLINE  ONLINE       anbob1                                       
ora.oarac2.vip
      1        ONLINE  ONLINE       oarac2                                       
ora.oc4j
      1        ONLINE  ONLINE       anbob1                                       
ora.scan1.vip
      1        ONLINE  ONLINE       anbob1

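OK, the problem is solved.

Summary:

Starting from a stack that gave no response at startup, we restored the OLR file, recreated ohas.service, cleaned up /var/tmp/.oracle, and finally relinked the GI home, which resolved the OHASD startup failure.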
