首页 » Exadata, ORACLE 9i-23c » Troubleshooting Oracle Exadata X5 db instance mount fail with ORA-01105 & ORA-01154

Troubleshooting Oracle Exadata X5 db instance mount fail with ORA-01105 & ORA-01154

最近一个Oracle Exadata x5 2节点RAC 11.2.0.4环境,每个节点中有2套DB实例。 Node1正常运行,计划性重启Node 2后,CRS启动正常,主机上1个db的 instance 2启动正常, 但另1个db的node2 db instance启动失败,提示下面的错误:

SQL> startup mount;
ORACLE instance started.

Total System Global Area 3.8482E+10 bytes
Fixed Size                  2261368 bytes
Variable Size            1.7583E+10 bytes
Database Buffers         2.0670E+10 bytes
Redo Buffers              227811328 bytes
ORA-01105: mount is incompatible with mounts by other instances
ORA-01154: database busy. Open, close, mount, and dismount not allowed now

分析方法

检查db alert log.
确认2个实例使用的是同一个spfile
实例1 create pfile from memory 和spfile中参数对比
查看实例1的db alert log,查找最后一次dbinstance 启动显示的参数及后期alter system变更的参数
查找实例1启动后的其他所有实例alert log中alter system修改的参数
因为是mount阶段,尝试把controlfile 从ASM copy出来一份,手动启动一测试instance验证mount是否正常
10046跟踪,或dump contrlf, strace跟踪
检查socket file 在.oracle目录下
重启另一存活节点

db alert log

ALTER DATABASE MOUNT
NOTE: Loaded library: OSS
NOTE: Loaded library: System
SUCCESS: diskgroup RECOC1 was mounted
NOTE: dependency between database XXX and diskgroup resource ora.RECOC1.dg is established
ORA-1105 signalled during: ALTER DATABASE MOUNT...

socket files

/tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle
因为node2 操作系统已发生过重启,且crs启动正常,并且同一node上另一db instance启动正常,排除socket文件问题.

sql trace跟踪

WAIT #139911440860624: nam='control file sequential read' ela= 413 file#=0 block#=816 blocks=1 obj#=-1 tim=1688619713189879
WAIT #139911440860624: nam='control file sequential read' ela= 382 file#=0 block#=96 blocks=1 obj#=-1 tim=1688619713190366
WAIT #139911440860624: nam='control file sequential read' ela= 291 file#=0 block#=423 blocks=1 obj#=-1 tim=1688619713190706
WAIT #139911440860624: nam='control file sequential read' ela= 337 file#=0 block#=47 blocks=1 obj#=-1 tim=1688619713191090
WAIT #139911440860624: nam='enq: DM - contention' ela= 139 name|mode=1145896964 type=0 type=3 obj#=-1 tim=1688619713191284
WAIT #139911440860624: nam='CSS initialization' ela= 2435 p1=0 p2=0 p3=0 obj#=-1 tim=1688619713193772
WAIT #139911440860624: nam='CSS operation: action' ela= 548 function_id=65 p2=0 p3=0 obj#=-1 tim=1688619713194360
WAIT #139911440860624: nam='control file sequential read' ela= 335 file#=0 block#=2 blocks=1 obj#=-1 tim=1688619713194747
WAIT #139911440860624: nam='enq: SW - contention' ela= 280 name|mode=1398210563 0=0 0=0 obj#=-1 tim=1688619713195185
WAIT #139911440860624: nam='rdbms ipc reply' ela= 312 from_process=21 timeout=910 p3=0 obj#=-1 tim=1688619713195538
WAIT #139911440860624: nam='rdbms ipc reply' ela= 145 from_process=22 timeout=910 p3=0 obj#=-1 tim=1688619713195724
WAIT #139911440860624: nam='rdbms ipc reply' ela= 188 from_process=23 timeout=910 p3=0 obj#=-1 tim=1688619713195962
WAIT #139911440860624: nam='rdbms ipc reply' ela= 145 from_process=24 timeout=910 p3=0 obj#=-1 tim=1688619713196138
WAIT #139911440860624: nam='rdbms ipc reply' ela= 94 from_process=25 timeout=910 p3=0 obj#=-1 tim=1688619713196262
WAIT #139911440860624: nam='rdbms ipc reply' ela= 198 from_process=27 timeout=910 p3=0 obj#=-1 tim=1688619713196507
WAIT #139911440860624: nam='rdbms ipc reply' ela= 95 from_process=26 timeout=910 p3=0 obj#=-1 tim=1688619713196643
WAIT #139911440860624: nam='rdbms ipc reply' ela= 141 from_process=28 timeout=910 p3=0 obj#=-1 tim=1688619713196823
WAIT #139911440860624: nam='rdbms ipc reply' ela= 105 from_process=29 timeout=910 p3=0 obj#=-1 tim=1688619713196965
WAIT #139911440860624: nam='rdbms ipc reply' ela= 1867 from_process=31 timeout=900 p3=0 obj#=-1 tim=1688619713198864
WAIT #139911440860624: nam='rdbms ipc reply' ela= 66 from_process=31 timeout=900 p3=0 obj#=-1 tim=1688619713198971
WAIT #139911440860624: nam='rdbms ipc reply' ela= 75 from_process=21 timeout=2147483647 p3=0 obj#=-1 tim=1688619713199090
EXEC #139911440860624:c=39068,e=35990216,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=0,tim=1688619713199245
ERROR #139911440860624:err=1105 tim=1688619713199262
WAIT #139911440860624: nam='Disk file operations I/O' ela= 58 FileOperation=8 fileno=0 filetype=8 obj#=-1 tim=1688619713199414
WAIT #139911440860624: nam='SQL*Net break/reset to client' ela= 7 driver id=1650815232 break?=1 p3=0 obj#=-1 tim=1688619713199449
WAIT #139911440860624: nam='SQL*Net break/reset to client' ela= 74 driver id=1650815232 break?=0 p3=0 obj#=-1 tim=1688619713199541
WAIT #139911440860624: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1688619713199554

strace跟踪发现

disconnect: sosstcpopen failed.boxname = 192.168.XX.21,port =5042
OS system dependent operation:poll_timeout failed with status: 115
OS failure message: Operation now in progress
Failure occurred at: sosstcpconne

Note:
strace过程中发现有write trace file 上面报错信息,也在ora trace找到上面的内容,启动过程中有发现db到cellserver的private IB链路一半的链路报错,但controlfile read block尝试另一链路后正常, 使用ibping和rds-ping 在db server和cell server之间正常。这似乎是XD的问题,历史问题, 这个问题之前一直存在不影响启动,该问题先忽略,在X5 X7相同问题都存在, X8正常。

ORA-01154: database busy

Oracle RAC中有些certain parameters 必须在node间参数具有相同的值,如DRM( _gc_read_mostly_locking, _gc_policy_time _gc_undo_affinity etc..), log_archive_config, db_recovery_file_dest & db_recovery_file_dest_size , DB_FILES ,db_file_name_convert ,log_file_name_convert 等

否则可能报错:

ORA-01105: mount is incompatible with mounts by other instances
ORA-19808: recovery destination parameter mismatch

ORA-01105: mount is incompatible with mounts by other instances
ORA-01606: gc_files_to_locks not identical to that of another mounted instance

ORA-01105: mount is incompatible with mounts by other instances
ORA-01174: DB_FILES is 201 buts needs to be 200 to be compatible

ORA-01105: mount is incompatible with mounts by other instances
ORA-01677: standby file name conversion parameters differ from other instance

上面的报错 ORA-01154报错应该是主要的问题。

Oracle DB Error ORA-01154 database busy. Open, close, mount, and dismount not allowed now

The following in the cause of this error:
Some operation is in progress that expects the state of the instance to remain open or mounted.
Action you can take to resolve this issue: Wait for the operation to complete and try again. If attempting a normal database shutdown, try SHUTDOWN ABORT.

包括把controlfile copy出来mount正常和strace结果说明在读controlfile均正常, 不确认是什么原因另一实例持有了DB的某状态与现有mount冲突,也可能在另一实例database close中。做好准备后尝试重启存活节点1 db instance.

重启存活节点

尝试重启实例1 db instance..

SQL> alter system checkpoint
SQL> shutdown immediate

Note:
挂起10分钟,查看db alert log

# shut immediate alert log

2023-07-06 23:10:58.025000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
ORA-24756: transaction does not exist
2023-07-06 23:11:01.031000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
ORA-24756: transaction does not exist
2023-07-06 23:11:04.036000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
ORA-24756: transaction does not exist
2023-07-06 23:11:07.041000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
ORA-24756: transaction does not exist
2023-07-06 23:11:10.046000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
ORA-24756: transaction does not exist
2023-07-06 23:11:13.051000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
ORA-24756: transaction does not exist
2023-07-06 23:11:16.056000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
ORA-24756: transaction does not exist
2023-07-06 23:11:19.062000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:

PMON trace file

# /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc
*** 2023-07-06 23:11:16.055
found process 0x959db8d98 pid=974 serial=246 ospid = 243846 dead

*** 2023-07-06 23:11:16.056
kssxdl: error deleting SO: 0x927452318 = transaction (56)
  ----------------------------------------
  SO: 0x927452318, type: 56, owner: 0x94ac658a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x959db8d98, name=transaction, file=ktccts.h LINE:410, pg=0
ORA-24756: transaction does not exist

*** 2023-07-06 23:11:19.061
found process 0x959db8d98 pid=974 serial=246 ospid = 243846 dead

*** 2023-07-06 23:11:19.061
kssxdl: error deleting SO: 0x927452318 = transaction (56)
  ----------------------------------------
  SO: 0x927452318, type: 56, owner: 0x94ac658a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x959db8d98, name=transaction, file=ktccts.h LINE:410, pg=0
ORA-24756: transaction does not exist

*** 2023-07-06 23:11:22.066
found process 0x959db8d98 pid=974 serial=246 ospid = 243846 dead

*** 2023-07-06 23:11:22.067
kssxdl: error deleting SO: 0x927452318 = transaction (56)
  ----------------------------------------
  SO: 0x927452318, type: 56, owner: 0x94ac658a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x959db8d98, name=transaction, file=ktccts.h LINE:410, pg=0

Note:
MOS中查询相关报错PMON cleanup fails with ORA-24756: transaction does not exist (Doc ID 2299927.1),存在bug。
Solution
The only workaround was to restart the instance.

尝试shudown abort重启实例,再次启动实例正常open. 另一实例也启动正常。 大胆猜测一下,也许这两个事情有关连性。

启动后检查

DB server 检查

# dbmcli -e list alerthistory

CELL Server检查

# cellcli -e list alerthistory

— 一切正常 over —

打赏

,

对不起,这篇文章暂时关闭评论。