Troubleshooting Oracle Wait Chains ‘buffer busy wait’、’log file sync’、 ‘enq: CF – contention’ during RMAN backup

最近一个客户的数据库出现了短暂1分钟左右的高负载，环境为oracle 11.2.0.4 2nodes RAC , 看当时的top event主要是buffer busy wait和log file sync, 关于buffer busy wait 在Troubleshooting Oracle wait event ‘buffer busy wait’ 有记录过分析方法，是一种hot block现象，而log file sync又是提交类的，当两者同时发生时，可能是后者影响前者，首先建议排查I/O，而另1个节点此时正在做RMAN增量备份，简单记录该问题.

ASH 报告

— Instance 2

Top User Events

Event	Event Class	% Event	Avg Active Sessions
buffer busy waits	Concurrency	32.78	11.27
log file sync	Commit	27.64	9.50
CPU + Wait for CPU	CPU	27.22	9.36
db file sequential read	User I/O	3.12	1.07
enq: TX – row lock contention	Application	1.87	0.64

Top Background Events

Event	Event Class	% Activity	Avg Active Sessions
Streams AQ: qmn coordinator waiting for slave to start	Other	1.12	0.39

Top Event P1/P2/P3 Values

Event	% Event	P1 Value, P2 Value, P3 Value	% Activity	Parameter 1	Parameter 2	Parameter 3
buffer busy waits	32.78	“1”,”59943″,”1″	3.01	file#	block#	class#
buffer busy waits	32.78243590364949352978238743759996122716	“1”,”62769″,”1″	2.73	file#	block#	class#
buffer busy waits	32.78243590364949352978238743759996122716	“1”,”228″,”1″	2.73	file#	block#	class#
log file sync	27.66	“4294967295”,”3388657110″,”0″	0.15	buffer#	sync scn	NOT DEFINED

Top Blocking Sessions

Blocking Sid (Inst)	% Activity	Event Caused	% Event	User	Program	# Samples Active	XIDs
4348, 1( 2)	27.98	log file sync	27.66	SYS	oracle@anbob2 (LGWR)	248/1,200 [ 21%]	0
4479,20405( 2)	2.09	buffer busy waits	2.09	USER1	proc1#svr31	14/1,200 [ 1%]	1
1148,53443( 2)	1.66	buffer busy waits	1.66	USER1	proc1#svr25	18/1,200 [ 2%]	0
5868,21505( 2)	1.50	buffer busy waits	1.50	USER1	proc1#svr26	21/1,200 [ 2%]	0
3663, 7499( 2)	1.41	buffer busy waits	1.41	USER1	proc1#svr29	17/1,200 [ 1%]	1

— Instance 1

Top User Events

Event	Event Class	% Event	Avg Active Sessions
CPU + Wait for CPU	CPU	6.67	0.26
gc buffer busy acquire	Cluster	3.18	0.12
gc cr block busy	Cluster	1.18	0.05

Top Background Events

Event	Event Class	% Activity	Avg Active Sessions
Backup: MML write backup piece	Administrative	62.90	2.40
CPU + Wait for CPU	CPU	12.93	0.49
enq: CF – contention	Other	2.62	0.10
RMAN backup & recovery I/O	System I/O	1.94	0.07
ASM file metadata operation	Other	1.35	0.05

Top Event P1/P2/P3 Values

Event	% Event	P1 Value, P2 Value, P3 Value	% Activity	Parameter 1	Parameter 2	Parameter 3
gc buffer busy acquire	3.18	“22”,”177″,”1″	1.55	file#	block#	class#
gc buffer busy acquire	3.18290821887944190102463483758447787225	“22”,”174″,”1″	1.53	file#	block#	class#
enq: CF – contention	2.64	“1128660996”,”0″,”0″	2.62	name\|mode	0	operation
RMAN backup & recovery I/O	1.94	“1”,”256″,”2147483647″	1.90	count	intr	timeout
ASM file metadata operation	1.55	“6268”,”9″,”0″	0.78	msgop	locn	NOT DEFINED
log file sequential read	1.33	“1”,”1″,”1″	0.15	log#	block#	blocks

Top Blocking Sessions

Blocking Sid (Inst)	% Activity	Event Caused	% Event	User	Program	# Samples Active
5307,19313( 1)	2.62	enq: CF – contention	2.62	SYS	rman@anbob1(TNS V1-V3)	48/1,200 [ 4%]
2650,45619( 1)	1.57	gc buffer busy acquire	1.57	USER1	JDBC Thin Client	28/1,200 [ 2%]
1906,10651( 1)	1.53	gc buffer busy acquire	1.53	USER1	JDBC Thin Client	20/1,200 [ 2%]

诊断脚本

当然可以使用ASH,DASH,根据sample_time或snap_id再结合blocking_session和blocking_inst_id手动递归的查等待链条，Top 1 Event并不一定是问题的根源，而是现象。

使用 Tanel’s ash_wait_chains.sql

@ash_wait_chains <grouping_cols> <filters> <fromtime> <totime>

@ash_wait_chains username||':'||program2||event2 session_type='FOREGROUND'  "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"

@ash_wait_chains username||':'||event  1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"

@ash_wait_chains instance_number||':'||session_id  1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"

通过ASH

-- ASH TEMP table
create table system.ash_20250307  as
select * from gv$active_session_history 
where sample_time between to_date('20250307 13:40','yyyymmdd hh24:mi') and to_date('20250307 13:50','yyyymmdd hh24:mi'); 

-- Diag wait chains
with t as
    (select * from system.ash_20250307),
   t2 as
    (select 
	 sample_time ts
      , level lv,
            session_id,
            session_serial#,
            program,
            event,
            connect_by_iscycle iscycle,
            connect_by_isleaf leaf,
            LEVEL,
            SYS_CONNECT_BY_PATH('[' || program || ']' || '(' || nvl(event, session_state) || ')','->') wait_chain
       from t
     connect by NOCYCLE prior blocking_session = session_id
            and prior blocking_session_serial# = session_serial#
			and prior blocking_inst_id = inst_id
            and prior to_char(sample_time, 'hh24miss') =
                 to_char(sample_time, 'hh24miss')
      start with session_state = 'WAITING')
   select wait_chain, count(*), max(iscycle) iscycle, TO_CHAR(MIN(ts), 'YYYY-MM-DD HH24:MI:SS') first_seen
      , TO_CHAR(MAX(ts), 'YYYY-MM-DD HH24:MI:SS') last_seen
     from t2
    where leaf = 1   AND LV > 1
    group by wait_chain
    order by count(*) desc;

Note:

返回的堵塞链条

->[proc1#SRV31](buffer busy waits)->[proc1#SRV21](buffer busy waits)->[proc2#svr31](log file sync)->[oracle@anbob2 (LGWR)](enq: CF – contention)->[rman@anbob1 (TNS V1-V3)](ON CPU)

目前可以看到最终是有RMAN进程备份时，持有的Control File 队列,堵塞了另一个节点的BG进程LGWR，又导致提前的FG进程logfile sync，和未提交FG进程的buffer busy waits。

buffer busy waits
   ===>
      log file sync
                  ===>
                     enq: CF - contention [lgwr]
                                         ===>
                                             RMAN

网上一个案例,因为应用调用sequence比较多，TOP 1 是enq: SQ – contention, 可能给人的错觉是seq出了问题，比如cache不足等，但那不是问题的根源。

@dash_wait_chains username||':'||event  1=1 1596 1596

%This     SECONDS WAIT_CHAIN
------ ---------- ----------------------------------------------------------------------------------------------------------------------------------------------------
  70%   105023520 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:control file sequential read
   9%    13698720 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:Disk file Mirror Read

另一案例

SQL> select * from (
2  with sf_sql as
3   (select distinct sql_id, plan_hash_value from v$sql_plan where operation = 'TEMP TABLE TRANSFORMATION'
4    union
5    select distinct sql_id, plan_hash_value from dba_hist_sql_plan where operation = 'TEMP TABLE TRANSFORMATION')
6  select PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities') AS CF_ENQUEUE_REASON, count(*)
7    from v$active_session_history ash, sf_sql p
8   where EVENT in ('enq: CF - contention',
9                   'control file sequential read')
10     and SESSION_STATE = 'WAITING'
11     and ash.SQL_ID = p.sql_id(+)
12     and ash.SQL_PLAN_HASH_VALUE = p.plan_hash_value(+)
13   group by PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities')
14   order by count(*) desc
15  ) where rownum <= 5
16  /
 
PROGRAM                                          EVENT                         SQL_ID        CF_ENQUEUE_REASON           COUNT(*)
------------------------------------------------ ----------------------------- ------------- ------------------------- ----------
JDBC Thin Client                                 enq: CF - contention          0cufxa3h8bp1b TEMP TABLE TRANSFORMATION       1132
JDBC Thin Client                                 enq: CF - contention          60tftk50vu96h TEMP TABLE TRANSFORMATION        266

检查control file sequential read等待分布

@event_hist  control.*read session_id=1262 1596 1596
                                 Wait time    Num ASH   Estimated    Estimated    % Event  Estimated
Wait Event                      bucket ms+    Samples Total Waits    Total Sec       Time  Time Graph
------------------------------ ----------- ---------- ----------- ------------ ---------- ------------
control file sequential read           < 1         17     66149.3         66.1       88.3 |######### |
                                      < 16          2       171.8          2.7        3.6 |          |
                                      < 32          3       128.2          4.1        5.5 |#         |
                                     < 128          1        15.6          2.0        2.7 |          |

关于 ‘enq: CF – contention’的等待可以去我另1个之前另一篇《 Troubleshooting performance event ‘enq: CF – contention’》，CF lock通常发生在：

— checkpointing
— switching logfiles
— archiving redologs
— performing crash recovery
— logfile manipulation
— begin/end hot backup
— DML access for NOLOGGING objects

持有CF lock的常见进程

SQL> select p.program, s.event
 from v$lock l, v$session s, v$process p
 where l.sid = s.sid
 and s.paddr = p.addr
 and l.type='CF'
 and l.lmode >= 5;
 
oracle@hostname.ru (CKPT)    control file sequential read
oracle@hostname.ru (ARC2)    control file sequential read
oracle@hostname.ru (LGWR)    control file sequential read

这个案例继续可以排查control file 大小，和当时的备份进程状态，如是否因为RMAN备份磁带问题而挂起，是否当时ASM层有问题? 是否控制文件读取较慢，尝试asmcmd cp CF文件。