Oracle 19c RAC(19.11) ‘crsctl stop crs’ without -f Terminated Database Another node Instances, and more about Flex ASM
最近,在对一套Oracle 19c RAC(Real Application Clusters)进行计划内的维护操作时,遇到了一个预料之外的问题。原本打算通过一个常规的滚动停实例操作来维护其中一个节点,即通过执行 crsctl stop crs
命令来停止节点1的实例。然而,在执行此命令后,意外地发现节点2的实例也随之自动终止了。这一情况发生在Oracle 19.11版本中,如果此类问题频发,无疑会对未来的维护操作带来不确定性。
在去年《Troubleshooting Oracle RAC node OS shutdown (‘crsctl stop crs -f’) cause db instance stop on another node》我们也曾遇到过类似的问题,当时使用的版本是19.10。那次的问题是在使用 -f
参数强制停止时发生的。Oracle当时的建议是尽量避免使用 -f
参数,因为这可能会导致不必要的副作用。而现在的情况有所不同,此次并没有使用 -f
参数。
考虑到上次的问题,这次可能是相同的软件设计缺陷或bug重现,或者是操作人员误操作导致的。另一种可能性是存在已删除的ASM(Automatic Storage Management)磁盘组依赖,正如Oracle文档 Doc ID 2948397.1 中所提到的那样。 也可能又有新的bug吧,目前还不确认,相信有一定的特殊场景,因为我现在手里的日志有限,仅记录一下事件,刚好也测试Flex ASM,因为日志中出现了
WARNING: giving up on client id 0x3f790c8721c74796 [anbob2:anbob:anbobdb-cluster] which has not reconnected for 2 seconds (originally from ASM inst +ASM2, reg:1051058463) [dead]
当然,我不认为是当时是因为db实例2 是使用FLEX ASM ,刚好之前在+ASM 1,当node1停止时未能failover +ASM2, 而是看到了node2的ASM instance同样被stop 了。 但没关系,你如何验证当前的DB 实例在哪个ASM实例呢?后面我会演示。
2024-07-26 21:30:59,Node 1 crsctl stop crs;
node 2 Agent log
2024-07-26 21:31:04.506 : AGFW:2628663040: [ INFO] {1:25042:26991} Agfw Proxy Server received the message: RESOURCE_STOP[ora.weejar.weejartaf.svc 1 1] ID 4099:125259 2024-07-26 21:31:04.506 : AGFW:2628663040: [ INFO] {1:25042:26991} Agfw Proxy Server forwarding the message: RESOURCE_STOP[ora.weejar.weejartaf.svc 1 1] ID 4099:125259 to the agent /oracle19c/app/19.0.0/grid/bin/oraagent_oracle. 2024-07-26 21:31:04.508 : AGFW:2628663040: [ INFO] {1:25042:26991} Agfw Proxy Server received the message: RESOURCE_STOP[ora.ARCHDG.dg 3 1] ID 4099:125262 2024-07-26 21:31:04.508 : AGFW:2628663040: [ INFO] {1:25042:26991} Agfw Proxy Server forwarding the message: RESOURCE_STOP[ora.ARCHDG.dg 3 1] ID 4099:125262 to the agent /oracle19c/app/19.0.0/grid/bin/oraagent_grid 2024-07-26 21:31:04.509 : AGFW:2628663040: [ INFO] {1:25042:26991} Agfw Proxy Server received the message: RESOURCE_STOP[ora.OCRDG.dg 3 1] ID 4099:125264 2024-07-26 21:31:04.509 : AGFW:2628663040: [ INFO] {1:25042:26991} Agfw Proxy Server forwarding the message: RESOURCE_STOP[ora.OCRDG.dg 3 1] ID 4099:125264 to the agent /oracle19c/app/19.0.0/grid/bin/oraagent_grid -- 我不确认是因为日志截取的少,还是因为之前确实DATA之类的 DG在node2 是offline。 2024-07-26 21:31:04.534 : AGFW:3851085568: [ INFO] {1:25042:26991} Agent received the message: RESOURCE_STOP[ora.anbob.db 2 1] ID 4099:62427 2024-07-26 21:31:04.534 : AGFW:3851085568: [ INFO] {1:25042:26991} Preparing STOP command for: ora.anbob.db 2 1 2024-07-26 21:31:04.534 : AGFW:3851085568: [ INFO] {1:25042:26991} ora.anbob.db 2 1 state changed from: ONLINE to: STOPPING 2024-07-26 21:31:04.534 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] (:CLSN00108:) clsn_agent::stop { 2024-07-26 21:31:04.534 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] InstAgent::getStopModeAttrib stop mode immediate option 1 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] InstAgent::stop 000 { 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] Agent::valueOfAttribIs attrib: CRS_ORIGINAL_REASON compare value: system attribute value: system 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] Agent::valueOfAttribIs returns 1 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] InstAgent::stop 040 isClean:0 or original reason is system - stopOption:ABORT 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] ConnectionPool::removeConnection 999 exit } this:0x7fffc42570d0 m_dbPool:0x7fffc4252040 freed:0 freeCount:1 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] InstAgent::stop 120 shutdown mode:4 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] DbAgent::preStopCbk { 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] DbAgentEventModule::stop stopping event bridge 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] DbAgentEventModule::stop stopping rlb event bridge 2024-07-26 21:31:04.535 :CLSDYNAM:3853186816: [ora.anbob.db]{1:25042:26991} [stop] DbAgent::preStopCbk dThread stop
Note:
几乎node2 是同步响应了node1的stop crs命令,相信当时的人要确认一下ip ,是不是真的搞错服务器了,当然这低级错误不允许,不可能在专业的DBA身上发生。不过仅从上面的报错信息google仅能查到一篇,就是我上面遇到的自己一年前写的blog.
另外需要注意,我上面日志中没有看到node 2停data dg 资源。仅有arch和ocr, 如果真的是当时DB instance 用的是node1 ASM DG呢?
DB Instance 2 ALERT LOG
2024-07-26T21:31:04.523
Stopped service weejartaf
Shutting down ORACLE instance (abort) (OS id: 77506)
2024-07-26T21:31:05.619497+08:00
NOTE: ASM client anbob2:anbob:xxxxdb-cluster disconnected unexpectedly.
2024-07-26T21:31:05.619506+08:00
NOTE: ASM client weejar2:weejar:xxxxdb-cluster disconnected unexpectedly.
NOTE: check client alert log.
NOTE: check client alert log.
NOTE: cleaned up ASM client anbob2:anbob:xxxxdb-cluster connection state (reg:1051058463)
2024-07-26T21:31:05.672820+08:00
NOTE: cleaned up ASM client weejar2:weejar:xxxxdb-cluster connection state (reg:3168113646)
如何判断 DB instance 在哪个ASM instance上?
环境 oracle 23c AI RAC
[grid@db1 ~]$ srvctl status database -database anbob Instance anbob1 is running on node db1 Instance anbob2 is running on node db2 [grid@db1 ~]$ srvctl status asm ASM is running on db1,db2 [grid@db1 ~]$ asmcmd showclustermode ASM cluster : Flex mode enabled - Direct Storage Access [grid@db2 ~]$ srvctl config asm ASM home: Password file: +OCR/orapwASM Backup of Password file: +OCR/orapwASM_backup ASM listener: LISTENER ASM instance count: 2 Cluster ASM listener: ASMNET1LSNR_ASM
仅停止node 1 ASM
[grid@db1 ~]$ srvctl stop asm -node db1 -stopoption abort [grid@db1 ~]$ srvctl status database -database anbob Instance anbob1 is running on node db1 Instance anbob2 is running on node db2 [grid@db1 ~]$ srvctl status asm ASM is running on db2 [grid@db1 ~]$ crsctl check crs CRS-4638: Oracle High Availability Services is online CRS-4537: Cluster Ready Services is online CRS-4529: Cluster Synchronization Services is online CRS-4533: Event Manager is online [grid@db1 ~]$ crsctl stat res -t -------------------------------------------------------------------------------- Name Target State Server State details -------------------------------------------------------------------------------- Local Resources -------------------------------------------------------------------------------- ora.LISTENER.lsnr ONLINE ONLINE db1 STABLE ONLINE ONLINE db2 STABLE ora.chad ONLINE ONLINE db1 STABLE ONLINE ONLINE db2 STABLE ora.cvuadmin OFFLINE OFFLINE db1 STABLE OFFLINE OFFLINE db2 STABLE ora.net1.network ONLINE ONLINE db1 STABLE ONLINE ONLINE db2 STABLE ora.ons ONLINE ONLINE db1 STABLE ONLINE ONLINE db2 STABLE -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup) 1 OFFLINE OFFLINE STABLE 2 ONLINE ONLINE db2 STABLE ora.DATA.dg(ora.asmgroup) 1 OFFLINE OFFLINE STABLE 2 ONLINE ONLINE db2 STABLE ora.LISTENER_SCAN1.lsnr 1 ONLINE ONLINE db2 STABLE ora.OCR.dg(ora.asmgroup) 1 OFFLINE OFFLINE STABLE 2 ONLINE ONLINE db2 STABLE ora.asm(ora.asmgroup) 1 OFFLINE OFFLINE Instance Shutdown,ST ABLE 2 ONLINE ONLINE db2 Started,STABLE ora.asmnet1.asmnetwork(ora.asmgroup) 1 OFFLINE OFFLINE STABLE 2 ONLINE ONLINE db2 STABLE ora.cdp1.cdp 1 ONLINE ONLINE db2 STABLE ora.cvu 1 ONLINE ONLINE db2 STABLE ora.db1.vip 1 ONLINE ONLINE db1 STABLE ora.db2.vip 1 ONLINE ONLINE db2 STABLE ora.anbob.db 1 ONLINE ONLINE db1 Open,HOME=/u01/app/o racle/product/23.0.0 /dbhome_1,STABLE 2 ONLINE ONLINE db2 Open,HOME=/u01/app/o racle/product/23.0.0 /dbhome_1,STABLE ora.anbob.test.pdb 1 ONLINE ONLINE db1 STABLE 2 ONLINE ONLINE db2 STABLE ora.free.db 1 OFFLINE OFFLINE Instance Shutdown,ST ABLE ora.free.freepdb1.pdb 1 OFFLINE OFFLINE STABLE ora.qosmserver 1 ONLINE ONLINE db2 STABLE ora.scan1.vip 1 ONLINE ONLINE db2 STABLE -------------------------------------------------------------------------------- [grid@db1 ~]$ crsctl stat res -t -init -------------------------------------------------------------------------------- Name Target State Server State details -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.asm 1 ONLINE ONLINE db1 STABLE ora.cluster_interconnect.haip 1 ONLINE ONLINE db1 STABLE ora.crf 1 ONLINE ONLINE db1 STABLE ora.crsd 1 ONLINE ONLINE db1 STABLE ora.cssd 1 ONLINE ONLINE db1 STABLE ora.cssdmonitor 1 ONLINE ONLINE db1 STABLE ora.ctssd 1 ONLINE ONLINE db1 OBSERVER,STABLE ora.diskmon 1 OFFLINE OFFLINE STABLE ora.evmd 1 ONLINE ONLINE db1 STABLE ora.gipcd 1 ONLINE ONLINE db1 STABLE ora.gpnpd 1 ONLINE ONLINE db1 STABLE ora.mdnsd 1 ONLINE ONLINE db1 STABLE ora.storage 1 ONLINE ONLINE db1 STABLE -------------------------------------------------------------------------------- [grid@db1 ~]$ asmcmd --privilege sysasm --inst +ASM2 lsdg State Type Rebal Sector Logical_Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name MOUNTED EXTERN N 512 512 4096 4194304 51200 8960 0 8960 0 N DATA/ MOUNTED NORMAL N 512 512 4096 4194304 30720 29676 10240 9718 0 Y OCR/ [grid@db1 ~]$ asmcmd --privilege sysasm --inst +ASM1 lsdg Connected to an idle instance. ASMCMD-8102: no connection to Oracle ASM; command requires Oracle ASM to run [grid@db1 ~]$ ps -ef|grep lsnr grid 4734 1 0 Aug02 ? 00:00:34 /u01/app/23.0.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit grid 911559 896293 0 20:48 pts/3 00:00:00 grep --color=auto lsnr -- node2 [grid@db1 ~]$ ssh db2 Last login: Wed Aug 21 20:46:36 2024 from 172.20.22.166 [grid@db2 ~]$ ps -ef|grep lsnr grid 4974 1 0 Aug02 ? 00:00:37 /u01/app/23.0.0/grid/bin/tnslsnr ASMNET1LSNR_ASM -no_crs_notify -inherit grid 5203 1 0 Aug02 ? 00:00:33 /u01/app/23.0.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit grid 5458 1 0 Aug02 ? 00:00:31 /u01/app/23.0.0/grid/bin/tnslsnr LISTENER_SCAN1 -no_crs_notify -inherit grid 84688 84488 0 20:48 pts/1 00:00:00 grep --color=auto lsnr [grid@db2 ~]$ lsnrctl status ASMNET1LSNR_ASM LSNRCTL for Linux: Version 23.0.0.0.0 - Beta on 21-AUG-2024 20:49:33 Copyright (c) 1991, 2022, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) STATUS of the LISTENER ------------------------ Alias ASMNET1LSNR_ASM Version TNSLSNR for Linux: Version 23.0.0.0.0 - Beta Start Date 02-AUG-2024 05:20:20 Uptime 19 days 15 hr. 29 min. 13 sec Trace Level off Security ON: Local OS Authentication SNMP OFF Listener Parameter File /u01/app/23.0.0/grid/network/admin/listener.ora Listener Log File /u01/app/grid/diag/tnslsnr/db2/asmnet1lsnr_asm/alert/log.xml Listening Endpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=ASMNET1LSNR_ASM))) (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.22.1x7)(PORT=1525))) Services Summary... Service "+ASM" has 1 instance(s). Instance "+ASM2", status READY, has 2 handler(s) for this service... Service "+ASM_DATA" has 1 instance(s). Instance "+ASM2", status READY, has 2 handler(s) for this service... Service "+ASM_OCR" has 1 instance(s). Instance "+ASM2", status READY, has 2 handler(s) for this service... The command completed successfully
Note:
停了节点1的ASM, 但是db instance依旧正常,这就是Flex ASM的魅力,和本地的ASM 不再是1对1绑定,通过ASM LISTENER连上了远端的存储实例。node1 的asm listener也跟着停了。
检查DB在用哪个ASM instance
[grid@db2 ~]$ asmcmd lsct data
2024-08-21 20:53:45.133
CLSD: An error occurred while attempting to generate a full name. Logging may not be active for this process
Additional diagnostics: CLSU-00107: operating system function: sclsdgcwd; failed with error data: 0; at location: sclsdgcwd13
CLSU-00101: operating system error message: Error 0
(:CLSD00183:)
DB_Name Status Software_Version Compatible_version Instance_Name Disk_Group
+ASM CONNECTED 23.0.0.0.0 23.0.0.0.0 +ASM2 DATA
anbob CONNECTED 23.0.0.0.0 23.0.0.0.0 anbob1 DATA
anbob CONNECTED 23.0.0.0.0 23.0.0.0.0 anbob2 DATA
SQL> SELECT instance_name, db_name, status,inst_id FROM gV$ASM_CLIENT
INSTANCE_NAME DB_NAME STATUS INST_ID
------------------------------ ---------------- ------------------------ ----------
+ASM2 +ASM CONNECTED 2
anbob1 anbob CONNECTED 2 -- db instace 1 on ASM instance 2
anbob2 anbob CONNECTED 2
+ASM2 +ASM CONNECTED 2
db2 _OCR CONNECTED 2
恢复启动ASM1 instance
[grid@db1 ~]$ srvctl start asm -node db1
[grid@db1 ~]$ srvctl status asm
ASM is running on db1,db2
[grid@db1 ~]$ ps -ef|grep lsnr
grid 4734 1 0 Aug02 ? 00:00:34 /u01/app/23.0.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit
grid 930490 1 0 21:00 ? 00:00:00 /u01/app/23.0.0/grid/bin/tnslsnr ASMNET1LSNR_ASM -no_crs_notify -inherit
grid 931909 887818 0 21:01 pts/2 00:00:00 grep --color=auto lsnr
[grid@db1 ~]$ lsnrctl status ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 23.0.0.0.0 - Beta on 21-AUG-2024 21:01:50
Copyright (c) 1991, 2022, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
STATUS of the LISTENER
------------------------
Alias ASMNET1LSNR_ASM
Version TNSLSNR for Linux: Version 23.0.0.0.0 - Beta
Start Date 21-AUG-2024 21:00:52
Uptime 0 days 0 hr. 0 min. 57 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/23.0.0/grid/network/admin/listener.ora
Listener Log File /u01/app/grid/diag/tnslsnr/db1/asmnet1lsnr_asm/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=ASMNET1LSNR_ASM)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.22.166)(PORT=1525)))
Services Summary...
Service "+ASM" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DATA" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_OCR" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Instance "+ASM2", status READY, has 1 handler(s) for this service...
The command completed successfully
SQL> select INST_ID,GROUP_NUMBER, INSTANCE_NAME, DB_NAME, INSTANCE_NAME||':'||DB_NAME client_id, STATUS from gv$asm_client
INST_ID GROUP_NUMBER INSTANCE_N DB_NAME CLIENT_ID STATUS
---------- ------------ ---------- ---------------- ------------------------------ ------------------------
2 1 +ASM2 +ASM +ASM2:+ASM CONNECTED
2 1 anbob2 anbob anbob2:anbob CONNECTED
2 2 +ASM2 +ASM +ASM2:+ASM CONNECTED
2 2 db2 _OCR db2:_OCR CONNECTED
1 1 anbob1 anbob anbob1:anbob CONNECTED
1 2 +ASM1 +ASM +ASM1:+ASM CONNECTED
1 2 db1 _OCR db1:_OCR CONNECTED
7 rows selected.
Note:
恢复启动后, DB1 instance自动切回了ASM1. 另外注意这里就可以看到每个ASM磁盘组的client 也就是instance_name
:
db_name
. 重启前重点检查这里。
如何手动relocate DB instance?
如果发现db instance没有在预定的ASM intance,也可以手动relocate,和VIP资源一样,这里直接两种方法,1 srvctl, 2 ASM sql命令 alter system.
[grid@db2 ~]$ srvctl update instance -db anbob -instance anbob1 -targetinstance +ASM2 [grid@db2 ~]$ SQL> select INST_ID,GROUP_NUMBER, INSTANCE_NAME, DB_NAME, INSTANCE_NAME||':'||DB_NAME client_id, STATUS from gv$asm_client INST_ID GROUP_NUMBER INSTANCE_NAME DB_NAME CLIENT_ID STATUS ---------- ------------ ------------------------------ ---------------- ---------------------------------------- ------------------------ 2 1 +ASM2 +ASM +ASM2:+ASM CONNECTED 2 1 anbob1 anbob anbob1:anbob CONNECTED 2 1 anbob2 anbob anbob2:anbob CONNECTED 2 2 +ASM2 +ASM +ASM2:+ASM CONNECTED 2 2 db2 _OCR db2:_OCR CONNECTED 1 2 +ASM1 +ASM +ASM1:+ASM CONNECTED 1 2 db1 _OCR db1:_OCR CONNECTED 7 rows selected. -- node1 as sysasm SQL> alter system relocate client 'anbob1:anbob'; alter system relocate client 'anbob1:anbob' * ERROR at line 1: ORA-15344: client anbob1:anbob:db@+ASM1 not found -- node2 [grid@db2 ~]$ sqlplus / as sysasm SQL*Plus: Release 23.0.0.0.0 - Beta on Wed Aug 21 21:17:21 2024 Version 23.1.0.0.0 Copyright (c) 1982, 2022, Oracle. All rights reserved. Connected to: Oracle Database 23c Enterprise Edition Release 23.0.0.0.0 - Beta Version 23.1.0.0.0 SQL> alter system relocate client 'anbob1:anbob'; System altered. SQL> r 1* select INST_ID,GROUP_NUMBER, INSTANCE_NAME, DB_NAME, INSTANCE_NAME||':'||DB_NAME client_id, STATUS from gv$asm_client INST_ID GROUP_NUMBER INSTANCE_NAME DB_NAME CLIENT_ID STATUS ---------- ------------ ------------------------------ ---------------- ---------------------------------------- ------------------------ 2 1 +ASM2 +ASM +ASM2:+ASM CONNECTED 2 1 anbob2 anbob anbob2:anbob CONNECTED 2 2 +ASM2 +ASM +ASM2:+ASM CONNECTED 2 2 db2 _OCR db2:_OCR CONNECTED 1 1 anbob1 anbob anbob1:anbob CONNECTED 1 2 +ASM1 +ASM +ASM1:+ASM CONNECTED 1 2 db1 _OCR db1:_OCR CONNECTED 7 rows selected.
Note:
需要在client所在的实例执行。
小结:
这里因为缺少数据证明CRSCTL STOP CRS 在节点1操作,错误的停止了节点2的真正原因,但除了bug外,我们可以检查是否真的存在已删除ASMDG 依赖,或所有db instance存在当前的ASM instance,同时也不要在业务时间在当前的版本认为crsctl stop 绝对没有意外。
目前这篇文章还没有评论(Rss)