
Troubleshooting performance event ‘enq: CF – contention’

CF enqueues are control file enqueues, taken during parallel access to the control file. They are system enqueues and are generally held only for a very short time. CF locks serialize control file transactions and reads and writes on shared portions of the control file. Acquiring the control file enqueue is an expensive operation; it is taken, for example, when updating checkpoint SCNs for datafiles and for NOLOGGING (unrecoverable) operations.

P1: name|mode, e.g. 0x43460005: 0x4346 in ASCII is 'CF', and mode 5 is SSX

P2: always 0 for this event

In which modes can I hold or request an enqueue?

  • 0: Enqueue is not held or requested
  • 1: null (NULL)
  • 2: row-S (SS)
  • 3: row-X (SX)
  • 4: share (S)
  • 5: S/Row-X (SSX)
  • 6: exclusive (X)
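These mode values can be decoded straight from the wait event parameters. As a sketch using the standard P1 bit layout (the two high bytes are the ASCII lock name, the low bytes the requested mode), the lock name and mode of current 'enq: CF - contention' waiters can be extracted with:

```sql
-- Decode the name|mode value in P1 for sessions currently waiting on CF
select sid,
       chr(bitand(p1, -16777216) / 16777215) ||
       chr(bitand(p1,  16711680) / 65535)      as lock_name,   -- 'CF'
       bitand(p1, 65535)                       as lock_mode    -- e.g. 5 = SSX
from   v$session
where  event = 'enq: CF - contention';
```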

The CF enqueue can be seen during any action that requires reading the control file. Typically CF locks are held for a very brief time and are taken when:

  • checkpointing
  • switching logfiles
  • archiving redo logs
  • performing crash recovery
  • logfile manipulation
  • beginning/ending hot backup
  • DML access to NOLOGGING objects

Diag Method

1, Hanganalyze

2, Systemstate dump
Search for 'enq: CF - contention' in the systemstate dump to identify the holding process: which call is at the top of its stack? Which enqueue does it hold, and in which mode?

SO: 0x2341a50718, type: 8, owner: 0x2080dec768, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
         proc=0x23012ace68, name=enqueue, file=ksq1.h LINE:380 ID:, pg=0
        (enqueue) CF-00000000-00000000  DID: 0001-015C-00009318
        lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x3
        mode: SSX, lock_flag: 0x10, lock: 0x2341a50770, res: 0x23319cb488
        own: 0x230176cc20, sess: 0x230176cc20, proc: 0x23012ace68, prv: 0x23419b9838
        slk: 0x23128d3d90

— CASE 1: I/O issue

Current Wait Stack:
0: waiting for 'control file parallel write'
files=0x3, block#=0x4f, requests=0x3
wait_id=97366 seq_num=31832 snap_id=1
wait times: snap=20 min 13 sec, exc=20 min 13 sec, total=20 min 13 sec <<<<<<<<<<<<<<<< very high wait, probably a disk I/O issue
wait times: max=infinite, heur=20 min 13 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a0

— CASE 2: huge control file

Current Wait Stack:
0: waiting for 'control file sequential read'
file#=0x0, block#=0x3163, blocks=0x1
wait_id=845082 seq_num=58663 snap_id=1
wait times: snap=0.000072 sec, exc=0.000072 sec, total=0.000072 sec <<<<<<<<<<<<<<<< each I/O is fast, but there are very many of them
wait times: max=infinite, heur=0.000072 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a0
There are 223 sessions blocked by this session.
Dumping one waiter:
inst: 2, sid: 1369, ser: 1
wait event: 'enq: CF - contention'
p1: 'name|mode'=0x43460005
p2: '0'=0x0
p3: 'operation'=0x0
row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
min_blocked_time: 473 secs, waiter_cache_ver: 13578
Wait State:
fixed_waits=0 flags=0x22 boundary=0x0/-1
Session Wait History:
elapsed time of 0.000024 sec since current wait
0: waited for 'control file sequential read'
file#=0x0, block#=0x20a8, blocks=0x1
wait_id=845081 seq_num=58662 snap_id=1
wait times: snap=0.000331 sec, exc=0.000331 sec, total=0.000331 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000024 sec of elapsed time
1: waited for 'control file sequential read'
file#=0x0, block#=0x1fc2, blocks=0x1
wait_id=845080 seq_num=58661 snap_id=1
wait times: snap=0.000386 sec, exc=0.000386 sec, total=0.000386 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000023 sec of elapsed time

3, Create a table from gv$active_session_history

SQL> create table gvash tablespace users as select * from gv$active_session_history;
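With the ASH data preserved in a plain table, a quick aggregation shows which sessions most often blocked the CF waiters. This is a sketch assuming the gvash table created above and the blocking_* columns available in 11g and later:

```sql
-- Rank blockers by how many ASH samples caught them blocking a CF waiter
select blocking_inst_id, blocking_session, blocking_session_serial#,
       count(*) as samples
from   gvash
where  event = 'enq: CF - contention'
group  by blocking_inst_id, blocking_session, blocking_session_serial#
order  by samples desc;
```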

4, Find the wait chain (Tanel Poder's ash_wait_chains script)

SQL> @ash/ash_wait_chains program2||event2 1=1 sysdate-1/24/12 sysdate

5, Find the holder

select l.sid, p.program, p.pid, p.spid, s.username, s.terminal,
       s.module, s.action, s.event, s.wait_time, s.seconds_in_wait, s.state
from   v$lock l, v$session s, v$process p
where  l.sid   = s.sid
and    s.paddr = p.addr
and    l.type  = 'CF'
and    l.lmode >= 5;
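In a RAC environment it can help to see holders and waiters side by side. A minimal variant of the query above against gv$lock (lmode > 1 means holding, request > 0 means waiting) might look like:

```sql
-- CF holders and waiters across all instances
select l.inst_id, l.sid, l.lmode, l.request, s.program, s.event
from   gv$lock l, gv$session s
where  l.type    = 'CF'
and    l.sid     = s.sid
and    l.inst_id = s.inst_id
and    (l.lmode > 1 or l.request > 0)
order  by l.request desc;
```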

1. If the holder is a background process, typically LGWR, CKPT or ARCn, and it holds the enqueue for a long period of time:
Check whether the redo logs are sized adequately; as a rule of thumb, aim for a log switch roughly every 30 minutes. Also verify checkpointing parameters such as FAST_START_MTTR_TARGET.
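To check log switch frequency against that 30-minute rule of thumb, a simple aggregation over v$log_history works; this is a generic sketch, not tied to any particular version:

```sql
-- Log switches per hour over the last 24 hours;
-- much more than ~2 per hour suggests undersized redo logs
select to_char(first_time, 'YYYY-MM-DD HH24') as hour,
       count(*)                               as switches
from   v$log_history
where  first_time > sysdate - 1
group  by to_char(first_time, 'YYYY-MM-DD HH24')
order  by 1;
```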

2. If the holder is a user session (not a background process), the holder is constantly changing, and the holder's wait event is 'control file parallel write':
Then the contention for the CF enqueue is most likely caused by DML on a NOLOGGING object.
Solution: enable LOGGING on the object, or set event 10359 to level 1 to skip updating the unrecoverable SCNs in the control file.

NOTE! Be aware of bug 12360160, which affects 11.2.0.2: setting event 10359 has no effect if DB_UNRECOVERABLE_SCN_TRACKING is set to FALSE at startup.
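For reference, setting event 10359 dynamically looks like the following. Treat this as a sketch and test it first, since it changes how NOLOGGING operations are tracked in the control file:

```sql
-- Skip updating unrecoverable SCNs in the control file (event 10359, level 1)
alter system set events '10359 trace name context forever, level 1';

-- On 11g and later the documented parameter achieves the same effect
alter system set db_unrecoverable_scn_tracking = false;
```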

3. Check that the archive destinations (log_archive_dest_n) are accessible; you may need to involve system/storage admins.
If you are using an NFS filesystem for the archive destinations, make sure there is no issue with NFS, as this can cause log switches to hang, which in turn leads to CF enqueue contention with LGWR or ARCn as the lock holder.

A related case: the RMAN snapshot control file is stored on an NFS filesystem, and access to that filesystem is extremely slow. The control file then stays locked for so long that the instance can fail because it cannot get the control file enqueue, and the database instance may hang whenever RMAN is started.
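A common fix in that situation is to relocate the snapshot control file off NFS onto fast local storage; the path below is only an example:

```
RMAN> SHOW SNAPSHOT CONTROLFILE NAME;
RMAN> CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/u01/app/oracle/snapcf_ORCL.f';
```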


4. If the holder is MMON or M000:
and its wait event is 'control file parallel write': when flashback database is enabled, a large number of flashback logs in the FRA can cause MMON to hold the CF lock for a long time while it monitors FRA usage. Queries against v$flash_recovery_area_usage or v$recovery_file_dest can also take a significant amount of time to complete. Solution: clean up the flashback logs.
RMAN feature-usage tracking can also cause CF enqueue contention: the M000 process holds CF for a long time, and the stack will likely show kewfeus_execute_usage_sampling with krbc_rman_usage also present. Check bug 16485447.
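To confirm that flashback logs are what is filling the FRA, the view mentioned above can be queried directly (v$recovery_area_usage is the 11g+ name for the same view):

```sql
-- Space usage in the fast recovery area by file type;
-- a large FLASHBACK LOG share supports the MMON/FRA diagnosis
select file_type, percent_space_used, percent_space_reclaimable, number_of_files
from   v$flash_recovery_area_usage;
```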

5, If the holder is RMAN:
The RMAN process reads the control file sequentially for a long time during its operations, so the CF enqueue is held in shared mode and not released. You can kill the RMAN process. There are many causes of CF enqueue contention, and this is one real-world example of the CF enqueue being blocked by an RMAN process. Slow I/O makes it worse, since RMAN then has to wait even longer to read the control file.

6, The control file itself can grow large, holding a huge number of history records, and processes accessing it then take a long time to finish.

Query V$CONTROLFILE_RECORD_SECTION and check for record sections with a huge size, then try to reduce the space they occupy. Also check the CONTROL_FILE_RECORD_KEEP_TIME parameter and make sure it has not been set to a large value; the default is 7 (days).
When the control file has become huge, recreating the control file is part of the fix.
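A quick way to spot the oversized sections is to order them by total size:

```sql
-- Control file record sections ordered by total space;
-- look for sections with huge RECORDS_TOTAL relative to RECORDS_USED
select type, record_size, records_total, records_used,
       record_size * records_total as total_bytes
from   v$controlfile_record_section
order  by total_bytes desc;
```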

7, When the CF enqueue has been held for more than 900 seconds and not released, the following error appears in the database alert log:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds)
This may be accompanied by an instance crash caused by prolonged CF enqueue contention.

_controlfile_enqueue_timeout: sets the timeout limit for the CF enqueue; the default is 900 seconds.
_kill_controlfile_enqueue_blocker: enables killing the control file enqueue blocker on timeout; the default is TRUE. If the holder is a background process, for example LGWR, killing the holder crashes the instance; set it to FALSE to disable this mechanism.
_kill_enqueue_blocker: the default is 2; when set to 1, an enqueue holder that is a background process will not be killed, so the instance will not crash. If the enqueue holder is not a background process, the 10.2.0.4 kill mechanism will still try to kill it.
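As an illustration only (underscore parameters should be changed under Oracle Support guidance), preventing a background CF holder from being killed could look like:

```sql
-- Disable killing the control file enqueue blocker on timeout
alter system set "_kill_controlfile_enqueue_blocker" = false scope = spfile;
-- Or keep the kill mechanism but exempt background processes
alter system set "_kill_enqueue_blocker" = 1 scope = spfile;
```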

8, LGWR trace file messages can tell you that your I/O subsystem might not be performing as expected: for every LGWR write to disk that takes more than 500 ms, a warning is written to the LGWR trace file.
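Besides the trace file, the latency distribution of LGWR's writes can be read from v$event_histogram; buckets at 512 ms and above correspond to the writes that trigger those warnings:

```sql
-- Histogram of 'log file parallel write' latencies
select wait_time_milli, wait_count
from   v$event_histogram
where  event = 'log file parallel write'
order  by wait_time_milli;
```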
