
Troubleshooting Oracle 19c wait 'library cache load lock', blocker RECO waiting on 'gc cr block lost'

I recently ran into a case on a two-node 19c (19.4) RAC in which a large number of foreground (FG) processes were blocked waiting on 'library cache load lock'. The blocking session was RECO, and that process itself was waiting on 'gc cr block lost'.

Top 10 Foreground Events by Total Wait Time

Event                     Waits      Total Wait Time (sec)  Avg Wait   % DB time  Wait Class
library cache load lock   1,781      156.1K                 87.62 s    81.6       Concurrency
DB CPU                               30K                               15.7
db file sequential read   344,960    1953.4                 5.66ms     1.0        User I/O
gc buffer busy acquire    143,177    1100.5                 7.69ms     .6         Cluster
read by other session     67,249     655.8                  9.75ms     .3         User I/O
log file sync             593,781    618.7                  1.04ms     .3         Commit
db file parallel read     12,208     231.9                  19.00ms    .1         User I/O
row cache mutex           61,341     165.3                  2.69ms     .1         Concurrency
gc cr block 2-way         351,188    121.9                  347.17us   .1         Cluster
PGA memory operation      4,045,614  120.3                  29.75us    .1         Other

Time Model Statistics

  • DB Time represents total time in user calls
  • DB CPU represents CPU time of foreground processes
  • Total CPU Time represents foreground and background processes
  • Statistics including the word “background” measure background process time, therefore do not contribute to the DB time statistic
  • Ordered by % of DB time in descending order, followed by Statistic Name
Statistic Name                            Time (s)     % of DB Time  % of Total CPU Time
connection management call elapsed time   156,106.57   81.62
DB CPU                                    30,006.52    15.69         84.25
sql execute elapsed time                  24,763.90    12.95
parse time elapsed                        10,144.13    5.30
hard parse elapsed time                   9,306.45     4.87
RMAN cpu time (backup/restore)            3,508.68     1.83          9.85

Top Event P1/P2/P3 Values

  • Top Events by DB Time and the top P1/P2/P3 values for those events.
  • % Event is the percentage of DB Time due to the event
  • % Activity is the percentage of DB Time due to the event with the given P1,P2,P3 Values.
Event                    % Event  P1, P2, P3 Values                       % Activity  Parameter 1     Parameter 2   Parameter 3
library cache load lock  75.08    "15123211520","5803580152","4521987"    1.06        object address  lock address  100*mask+namespace
                                  "15123211520","9764512168","4521987"    1.06
                                  "15123211520","12033336408","4521987"   1.05
gc buffer busy acquire   1.43     "1","121","1"                           0.85        file#           block#        class
                                  Per Second   Per Transaction
Global Cache blocks received:     264.39       1.62
Global Cache blocks served:       302.74       1.85
GCS/GES messages received:        2,611.29     15.97
GCS/GES messages sent:            2,179.36     13.33
DBWR Fusion writes:               2.18         0.01
Estd Interconnect traffic (KB):   19,083.79

Global Cache and Enqueue Services – Workload Characteristics

Avg global cache cr block receive time (us): 427.7
Avg global cache current block receive time (us): 483.1
Avg LMS process busy %: 1.8
Avg global cache cr block build time (us): 26.9
Global cache log flushes for cr blocks served %: 0.8
Avg global cache cr block flush time (us): 1,045.1
Avg global cache current block pin time (us): 535.7
Global cache log flushes for current blocks served %: 0.7
Avg global cache current block flush time (us): 2,075.0
Avg global enqueue get time (us): 0.0

Global Cache and Enqueue Services – Messaging Statistics

Avg message sent queue time (us): 1,255.4
Avg message sent queue time on ksxp (us): 228.5
Avg message received kernel queue time (us): 2.0
Avg message received queue time (us): 3.4
Avg GCS message process time (us): 16.8
Avg GES message process time (us): 8.4
% of direct sent messages: 46.35
% of indirect sent messages: 53.60
% of flow controlled messages: 0.05


Note:
The blocker is shown as session 2107 on node 1.
Note:
Session 2107 itself is not blocked, but it is waiting on 'gc cr block lost', and the blocking process is RECO. The previous blog post happened to cover another RECO blocking incident; RECO is the background process that rolls back distributed transactions.

Library Cache Load Lock

A session tries to obtain a library cache load lock for a database object so that it can load the object into the library cache. When a session waits on this event, another session already holds the load lock for the same object. Because the load lock is always taken in exclusive mode, no other process can load that object into the library cache at the same time.

gc cr block lost

This is probably the most critical statistic for interconnect issues. A consistently high number of 'gc blocks lost' indicates a problem with the underlying network infrastructure (hardware, firmware, etc.). The ideal value is near zero, so it only becomes a concern when the value stays consistently high.

  • 'gc cr block lost' or 'gc current block lost' appears among the instance's top 5 wait events

Usually accompanied by:

  • SQL trace reports showing many 'gc cr request' and 'gc current request' waits
  • Long waits on 'gc cr multiblock request'
  • Poor application performance and throughput
  • ifconfig or other network tools showing large numbers of packet send/receive errors
  • netstat reporting errors/retransmits/reassembly failures
  • Failure of one or more nodes
  • Abnormal CPU usage caused by network processing
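The netstat symptoms above can be checked quickly from the shell. A minimal sketch, parsing a captured `netstat -s` snippet so it is self-contained; on a live node you would pipe `netstat -s` directly, and the counter values here are illustrative only:

```shell
# Pull the IP reassembly and TCP retransmit counters out of netstat -s
# output. Replace the sample variable with a live `netstat -s` pipe.
sample='Ip:
    1234567 total packets received
    144743 packet reassembles failed
Tcp:
    8910 segments retransmitted'
printf '%s\n' "$sample" | awk '
/packet reassembles failed/ {print "reassembly failures:", $1}
/segments retransmitted/    {print "tcp retransmits:", $1}'
```

Watching whether these counters keep increasing between OSWatcher snapshots is more telling than their absolute values.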

The blocked sessions were executing this SQL:

select host,userid,password,flag,authusr,authpwd, passwordx, authpwdx from link$ where owner#=:1 and name=:2

This is a distributed transaction involving a database link (dblink).

Checking the RECO trace file:

IPCLW:[0.0]{-}[WAIT]:PROTO: [1661188358080974]ipclw_data_chunk_process:1165 Discarding msg with seq # 875492546, expecting 2730385062
IPCLW:[0.0]{-}[WAIT]:PROTO: [1661188358080974]ipclw_data_chunk_process:1165 Discarding msg with seq # 875492587, expecting 2730385062

IPCLW:[0.0]{-}[WAIT]:PROTO: [1661188358080974]ipclw_data_chunk_process:1165 Discarding msg with seq # 875492621, expecting 2730385062
IPCLW:[0.0]{-}[WAIT]:PROTO: [1661188358080974]ipclw_data_chunk_process:1165 Discarding msg with seq # 875492656, expecting 2730385062

IPCLW:[0.0]{-}[WAIT]:PROTO: [1661188358080974]ipclw_data_chunk_process:1165 Discarding msg with seq # 875492743, expecting 2730385062
IPCLW:[0.0]{-}[WAIT]:PROTO: [1661188358080974]ipclw_data_chunk_process:1165 Discarding msg with seq # 875492820, expecting 2730385062
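The trace lines can be summarized to see how far each discarded sequence number is from the expected one. A sketch over one captured line (the field layout is assumed from the trace excerpt above; on a live system you would feed the whole reco trace file to awk):

```shell
# For each 'Discarding msg' line, print the discarded and expected
# IPCLW sequence numbers and the gap between them.
line='IPCLW:[0.0]{-}[WAIT]:PROTO: [1661188358080974]ipclw_data_chunk_process:1165 Discarding msg with seq # 875492546, expecting 2730385062'
printf '%s\n' "$line" | awk '
/Discarding msg/ {
    seq = $(NF-2); sub(/,/, "", seq)   # discarded sequence number
    want = $NF                          # expected sequence number
    print "discarded", seq, "expected", want, "gap", want - seq
}'
```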

Note:

The discarding happens at the IPCLW layer, and the sequence numbers of the discarded messages are far from the expected sequence number.

Related known bugs

Bug 33676842 – Invalid Message Discarding 'gc block lost' Cause Waiting For 'gc cr request' and Subsequent DB Open Time Out (Doc ID 33676842.8)

Bug 34482043 – Some "gc cr multi block mixed" wait outliers cause a performance regression in 19c; FG traces show IPCLW discarding messages with an older sequence number than expected

Bug 34473784 – INSERT statement hung by "gc current/cr block lost"

Bug 32685989 – PX slave session hangs while waiting for 'gc cr request', 'gc cr block lost' continuously

OSW
OSWatcher netstat output, checked for packet drops/reassembly failures:

	Line 29:     144743 packet reassembles failed
	Line 190:     144745 packet reassembles failed
	Line 351:     144745 packet reassembles failed
	Line 512:     144746 packet reassembles failed
	Line 673:     144749 packet reassembles failed
	Line 834:     144750 packet reassembles failed
	Line 995:     144752 packet reassembles failed
	Line 1156:     144754 packet reassembles failed
	Line 1317:     144755 packet reassembles failed

IP packet reassembly failures were indeed occurring at the time.

Related diagnostics

# IPCLW traces can be enabled for diagnosis using the command below:
 ALTER SYSTEM SET events='ipclw_trace level=5, flag=0xFFFFFFFF';

# check MTU SIZE
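For example (a sketch; 'eth1' is a placeholder for the private-interconnect NIC, and the captured `ip link` line is illustrative — on a live node run `ip link show eth1` on every node and compare):

```shell
# Extract the MTU from `ip link` output; mismatched MTUs across the
# interconnect NICs can cause fragmentation/reassembly problems.
line='3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT'
printf '%s\n' "$line" | awk '{for (i = 1; i <= NF; i++) if ($i == "mtu") print $(i+1)}'
```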

# TEST CASE
- _lm_comm_channel='msgq_udp'
- db_block_size = 16384
- db_file_multiblock_read_count = 64 (default)
# node1
create table test_gc (aaa varchar(100 char)) nocompress;
set timing on
insert into test_gc
select
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
from dual connect by level <= 1000000 ;

commit;

# node2
alter session set events '10046 trace name context forever, level 12';
-- alter session set db_file_multiblock_read_count=1;

set timing on
select /*+ monitor */ count(*) from test_gc; 

Possible solutions
1. For Bug 34482043, the cause is that discarding out-of-order messages in IPCLW leads to CR lost blocks; the block is only requested again after the side-channel message, which by default is sent after 500 ms, is received. The fix direction is to deliver out-of-order messages if they are received within a certain window size. Install the corresponding one-off patch.

2. Increase the IP reassembly buffers and reassembly timeout in the kernel.
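The Linux kernel tunables usually involved are net.ipv4.ipfrag_high_thresh, net.ipv4.ipfrag_low_thresh and net.ipv4.ipfrag_time. A sketch that stages the values in a file for review before applying them (the raised values are common starting points, not Oracle-certified settings):

```shell
# Stage larger IP reassembly buffers and a longer reassembly timeout.
# Defaults are typically 4MB / 3MB / 30s; these raised values are a
# common starting point when 'packet reassembles failed' keeps growing.
conf=$(mktemp)
cat > "$conf" <<'EOF'
net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
net.ipv4.ipfrag_time = 60
EOF
cat "$conf"
# To apply for real (as root): cp "$conf" /etc/sysctl.d/98-ipfrag.conf && sysctl --system
```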

3. Kill the RECO process (as a temporary workaround).
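Before killing it, identify the RECO background process for the instance. A sketch parsing a captured `ps` line (on a live node: `ps -ef | grep '[o]ra_reco'`; the SID 'orcl1' and PID 12345 below are placeholders):

```shell
# Find the PID of the RECO background process from ps output so it can
# be killed with `kill -9 <pid>`; RECO is an optional background
# process and is normally respawned automatically afterwards.
psline='oracle   12345     1  0 Jan01 ?        00:00:02 ora_reco_orcl1'
printf '%s\n' "$psline" | awk '/ora_reco/ {print $2}'
```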
