最近一个客户的数据库出现了短暂1分钟左右的高负载,环境为oracle 11.2.0.4 2nodes RAC , 看当时的top event主要是buffer busy wait和log file sync, 关于buffer busy wait 在Troubleshooting Oracle wait event ‘buffer busy wait’ 有记录过分析方法,是一种hot block现象, 而log file sync又是提交类的,当两者同时发生时,可能是后者影响前者,首先建议排查I/O, 而另1个节点此时正在做RMAN增量备份,简单记录该问题.
ASH 报告
— Instance 2
Top User Events
| Event | Event Class | % Event | Avg Active Sessions |
|---|---|---|---|
| buffer busy waits | Concurrency | 32.78 | 11.27 |
| log file sync | Commit | 27.64 | 9.50 |
| CPU + Wait for CPU | CPU | 27.22 | 9.36 |
| db file sequential read | User I/O | 3.12 | 1.07 |
| enq: TX – row lock contention | Application | 1.87 | 0.64 |
Top Background Events
| Event | Event Class | % Activity | Avg Active Sessions |
|---|---|---|---|
| Streams AQ: qmn coordinator waiting for slave to start | Other | 1.12 | 0.39 |
Top Event P1/P2/P3 Values
| Event | % Event | P1 Value, P2 Value, P3 Value | % Activity | Parameter 1 | Parameter 2 | Parameter 3 |
|---|---|---|---|---|---|---|
| buffer busy waits | 32.78 | “1”,”59943″,”1″ | 3.01 | file# | block# | class# |
|
buffer busy waits
|
32.78243590364949352978238743759996122716
|
“1”,”62769″,”1″ | 2.73 |
file#
|
block#
|
class#
|
|
buffer busy waits
|
32.78243590364949352978238743759996122716
|
“1”,”228″,”1″ | 2.73 |
file#
|
block#
|
class#
|
| log file sync | 27.66 | “4294967295”,”3388657110″,”0″ | 0.15 | buffer# | sync scn | NOT DEFINED |
Top Blocking Sessions
| Blocking Sid (Inst) | % Activity | Event Caused | % Event | User | Program | # Samples Active | XIDs |
|---|---|---|---|---|---|---|---|
| 4348, 1( 2) | 27.98 | log file sync | 27.66 | SYS | oracle@anbob2 (LGWR) | 248/1,200 [ 21%] | 0 |
| 4479,20405( 2) | 2.09 | buffer busy waits | 2.09 | USER1 | proc1#svr31 | 14/1,200 [ 1%] | 1 |
| 1148,53443( 2) | 1.66 | buffer busy waits | 1.66 | USER1 | proc1#svr25 | 18/1,200 [ 2%] | 0 |
| 5868,21505( 2) | 1.50 | buffer busy waits | 1.50 | USER1 | proc1#svr26 | 21/1,200 [ 2%] | 0 |
| 3663, 7499( 2) | 1.41 | buffer busy waits | 1.41 | USER1 | proc1#svr29 | 17/1,200 [ 1%] | 1 |
— Instance 1
Top User Events
| Event | Event Class | % Event | Avg Active Sessions |
|---|---|---|---|
| CPU + Wait for CPU | CPU | 6.67 | 0.26 |
| gc buffer busy acquire | Cluster | 3.18 | 0.12 |
| gc cr block busy | Cluster | 1.18 | 0.05 |
Top Background Events
| Event | Event Class | % Activity | Avg Active Sessions |
|---|---|---|---|
| Backup: MML write backup piece | Administrative | 62.90 | 2.40 |
| CPU + Wait for CPU | CPU | 12.93 | 0.49 |
| enq: CF – contention | Other | 2.62 | 0.10 |
| RMAN backup & recovery I/O | System I/O | 1.94 | 0.07 |
| ASM file metadata operation | Other | 1.35 | 0.05 |
Top Event P1/P2/P3 Values
| Event | % Event | P1 Value, P2 Value, P3 Value | % Activity | Parameter 1 | Parameter 2 | Parameter 3 |
|---|---|---|---|---|---|---|
| gc buffer busy acquire | 3.18 | “22”,”177″,”1″ | 1.55 | file# | block# | class# |
|
gc buffer busy acquire
|
3.18290821887944190102463483758447787225
|
“22”,”174″,”1″ | 1.53 |
file#
|
block#
|
class#
|
| enq: CF – contention | 2.64 | “1128660996”,”0″,”0″ | 2.62 | name|mode | 0 | operation |
| RMAN backup & recovery I/O | 1.94 | “1”,”256″,”2147483647″ | 1.90 | count | intr | timeout |
| ASM file metadata operation | 1.55 | “6268”,”9″,”0″ | 0.78 | msgop | locn | NOT DEFINED |
| log file sequential read | 1.33 | “1”,”1″,”1″ | 0.15 | log# | block# | blocks |
Top Blocking Sessions
| Blocking Sid (Inst) | % Activity | Event Caused | % Event | User | Program | # Samples Active | XIDs |
|---|---|---|---|---|---|---|---|
| 5307,19313( 1) | 2.62 | enq: CF – contention | 2.62 | SYS | rman@anbob1(TNS V1-V3) | 48/1,200 [ 4%] | 0 |
| 2650,45619( 1) | 1.57 | gc buffer busy acquire | 1.57 | USER1 | JDBC Thin Client | 28/1,200 [ 2%] | 0 |
| 1906,10651( 1) | 1.53 | gc buffer busy acquire | 1.53 | USER1 | JDBC Thin Client | 20/1,200 [ 2%] | 0 |
诊断脚本
当然可以使用ASH,DASH,根据sample_time或snap_id再结合blocking_session和blocking_inst_id手动递归的查等待链条,Top 1 Event并不一定是问题的根源,而是现象。
使用 Tanel’s ash_wait_chains.sql
@ash_wait_chains <grouping_cols> <filters> <fromtime> <totime>
@ash_wait_chains username||':'||program2||event2 session_type='FOREGROUND' "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"
@ash_wait_chains username||':'||event 1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"
@ash_wait_chains instance_number||':'||session_id 1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"
通过ASH
-- ASH TEMP table
create table system.ash_20250307 as
select * from gv$active_session_history
where sample_time between to_date('20250307 13:40','yyyymmdd hh24:mi') and to_date('20250307 13:50','yyyymmdd hh24:mi');
-- Diag wait chains
with t as
(select * from system.ash_20250307),
t2 as
(select
sample_time ts
, level lv,
session_id,
session_serial#,
program,
event,
connect_by_iscycle iscycle,
connect_by_isleaf leaf,
LEVEL,
SYS_CONNECT_BY_PATH('[' || program || ']' || '(' || nvl(event, session_state) || ')','->') wait_chain
from t
connect by NOCYCLE prior blocking_session = session_id
and prior blocking_session_serial# = session_serial#
and prior blocking_inst_id = inst_id
and prior to_char(sample_time, 'hh24miss') =
to_char(sample_time, 'hh24miss')
start with session_state = 'WAITING')
select wait_chain, count(*), max(iscycle) iscycle, TO_CHAR(MIN(ts), 'YYYY-MM-DD HH24:MI:SS') first_seen
, TO_CHAR(MAX(ts), 'YYYY-MM-DD HH24:MI:SS') last_seen
from t2
where leaf = 1 AND LV > 1
group by wait_chain
order by count(*) desc;
Note:
返回的堵塞链条
->[proc1#SRV31](buffer busy waits)->[proc1#SRV21](buffer busy waits)->[proc2#svr31](log file sync)->[oracle@anbob2 (LGWR)](enq: CF – contention)->[rman@anbob1 (TNS V1-V3)](ON CPU)
目前可以看到最终是有RMAN进程备份时,持有的Control File 队列,堵塞了另一个节点的BG进程LGWR,又导致提前的FG进程logfile sync,和未提交FG进程的buffer busy waits。
buffer busy waits
===>
log file sync
===>
enq: CF - contention [lgwr]
===>
RMAN
网上一个案例,因为应用调用sequence比较多,TOP 1 是enq: SQ – contention, 可能给人的错觉是seq出了问题,比如cache不足等,但那不是问题的根源。
@dash_wait_chains username||':'||event 1=1 1596 1596 %This SECONDS WAIT_CHAIN ------ ---------- ---------------------------------------------------------------------------------------------------------------------------------------------------- 70% 105023520 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:control file sequential read 9% 13698720 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:Disk file Mirror Read
另一案例
SQL> select * from (
2 with sf_sql as
3 (select distinct sql_id, plan_hash_value from v$sql_plan where operation = 'TEMP TABLE TRANSFORMATION'
4 union
5 select distinct sql_id, plan_hash_value from dba_hist_sql_plan where operation = 'TEMP TABLE TRANSFORMATION')
6 select PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities') AS CF_ENQUEUE_REASON, count(*)
7 from v$active_session_history ash, sf_sql p
8 where EVENT in ('enq: CF - contention',
9 'control file sequential read')
10 and SESSION_STATE = 'WAITING'
11 and ash.SQL_ID = p.sql_id(+)
12 and ash.SQL_PLAN_HASH_VALUE = p.plan_hash_value(+)
13 group by PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities')
14 order by count(*) desc
15 ) where rownum <= 5
16 /
PROGRAM EVENT SQL_ID CF_ENQUEUE_REASON COUNT(*)
------------------------------------------------ ----------------------------- ------------- ------------------------- ----------
JDBC Thin Client enq: CF - contention 0cufxa3h8bp1b TEMP TABLE TRANSFORMATION 1132
JDBC Thin Client enq: CF - contention 60tftk50vu96h TEMP TABLE TRANSFORMATION 266
检查control file sequential read等待分布
@event_hist control.*read session_id=1262 1596 1596
Wait time Num ASH Estimated Estimated % Event Estimated
Wait Event bucket ms+ Samples Total Waits Total Sec Time Time Graph
------------------------------ ----------- ---------- ----------- ------------ ---------- ------------
control file sequential read < 1 17 66149.3 66.1 88.3 |######### |
< 16 2 171.8 2.7 3.6 | |
< 32 3 128.2 4.1 5.5 |# |
< 128 1 15.6 2.0 2.7 | |
关于 ‘enq: CF – contention’的等待可以去我另1个之前另一篇《 Troubleshooting performance event ‘enq: CF – contention’》,CF lock通常发生在:
— checkpointing
— switching logfiles
— archiving redologs
— performing crash recovery
— logfile manipulation
— begin/end hot backup
— DML access for NOLOGGING objects
持有CF lock的常见进程
SQL> select p.program, s.event from v$lock l, v$session s, v$process p where l.sid = s.sid and s.paddr = p.addr and l.type='CF' and l.lmode >= 5; oracle@hostname.ru (CKPT) control file sequential read oracle@hostname.ru (ARC2) control file sequential read oracle@hostname.ru (LGWR) control file sequential read
这个案例继续可以排查control file 大小,和当时的备份进程状态,如是否因为RMAN备份磁带问题而挂起,是否当时ASM层有问题? 是否控制文件读取较慢,尝试asmcmd cp CF文件。