首页 » ORACLE 9i-23ai » 改变操作系统时间对Oracle RAC的影响

改变操作系统时间对Oracle RAC的影响

在服务器运维中,由于未配置NTP服务或存在CPU时钟精度问题,操作系统时间常会逐渐产生偏差。当需要修正这类时间差异时,通常希望在不停库的前提下完成操作以避免影响业务。

对于采用本地文件系统的Oracle单实例数据库,时间修正主要风险在于业务逻辑中若直接调用SYSDATE函数,可能导致事务时间戳跳跃(如业务单据时间异常),但一般不会影响数据库可用性。然而在Oracle RAC(Real Application Clusters)环境中,时间同步问题可能引发更严重的后果——包括集群实例崩溃和强制重启。

近期我们遇到一个典型案例:一套稳定运行2000余天的Oracle 11g RAC环境,因时间偏差数小时进行修正时,仍未能避免实例重启。此现象印证了RAC架构对时间同步的敏感性,特此记录以供参考。

正常调整OS时间的建议

1,  停止ORACLE DB 和Clusterware

2, 调整OS 时间

3, 启动Clusterware和Oracle database

或配置NPT 的微调模式

没有停库时间时,在线调整OS datatime, 防止节点间时间相差过大,导致脑裂,优先关闭了一个节点。仅剩1个节点,在调整时建议降低每次调整的时间长度,如每次调1分钟,这次我们使用了shell 每次调整5分钟,sleep 10s。 但还是重启了,现象如下。

DB ALERT LOG

Archived Log entry 289315 added for thread 3 sequence 72112 ID 0x1de24863 dest 1:
Mon Aug 28 07:45:49 2017
NOTE: ASMB terminating
Errors in file /u10/admin/diag/rdbms/cms/cms28/trace/cms28_asmb_12527.trc:
ORA-15064: communication failure with ASM instance         
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 597 Serial number: 3
Errors in file /u10/admin/diag/rdbms/cms/cms28/trace/cms28_asmb_12527.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 597 Serial number: 3
ASMB (ospid: 12527): terminating the instance due to error 15064

-- or --
Warning: VKTM detected a time drift.
WARNING: ASMB has not responded for 190 seconds

NOTE: ASMB terminating

ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel

Note:
Without the connection to the ASM, the database instances shutdown since they were no longer able to access the data.

ASMB hang的原因
1, ASMB进程hang,如等某种latch ,从diag日志中查看进程状态
2, ASMB无法申请到资源,检查操作系统CPU和内存使用率
3, 修改了操作系统时间大跨度
4, ASMB 被 GEN0进程关闭

Database Instance Crashes With ORA-15064 ORA-03113: Possible Causes and Solution (Doc ID 2378963.1)
GEN0 background process terminate ASMB background process just after ASMB starts.
During startup phase of DB Instance, GEN0 detected hang of ASMB process and instance was terminated by ORA-15082 on Node1.

–> GEN0 checks ASMB heartbeat before getting the elapsed time and you may be hitting the “Bug 27314390 : ASMB IS TERMINATED BY GEN DURING STARTUP” or an internal “Bug 19032250 : ASMB IS STUCK,FAILED TO START DB INSTANCE”

ASM alert LOG

Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
WARNING: client [DB:DB] not responsive for 301s; state=0x1. killing pid 12407

ERROR: Terminating Instance Because ASMB Is Stuck For XXX Seconds (Doc ID 2934118.1)

messages log
================

Mar 5 00:32:30  systemd: Time has been changed

而我们在19c 测试,仅DB ALERT LOG提示以下日志,并未重启.

LMS3 (ospid: XX) waits for event 'gcs remote message' for 0 secs.
LMS3 (ospid: XX) is hung in an acceptable location (inwait 0x1.ffff).

LMON (ospid: xx) waits for event 'rdbms ipc messgae' for 0 secs.
LMON (ospid: XX) is hung in an acceptable location (inwait 0x1.ffff).

LMD0 (ospid: XX) waits for event 'gcs remote message' for 0 secs.
..

而我们另一套Oracle 11g RAC环境相同的方法,测试300s 确实会直接crash, 但是降到250s时发现并没有发生crash, 所以后期如果有类似的需求,不妨尝试一下,每次调整的时间控制在200秒以内。

Shell如下

#!/bin/bash

# 循环执行48次
for ((i = 0; i < 48; i++)); do
    # 获取当前时间戳
    current_timestamp=$(date +%s)
    # 计算快3分钟后的时间戳
    new_timestamp=$((current_timestamp + 3 * 60))
    # 将新的时间戳转换为日期时间格式
    new_date=$(date -d "@$new_timestamp" +"%Y-%m-%d %H:%M:%S")
    # 设置系统时间
    date -s "$new_date"
    # 等待10秒
    sleep 10
done
打赏

目前这篇文章还没有评论(Rss)

我要评论