Troubleshooting High Private/Interconnect Network Latency Case

系统资源的最大利用是系统优化的目标，但是过度”优化”系统的容错性就相对较差，如果出现丝毫的卡顿就会“雪崩”，前段时间分析的一个案例，号称是亚洲最好的一套HP小机配置，有Oracle原厂操刀调试的一套Oracle 12c R1环境，不过确实效率很高，突然有一天因RAC 节点之间的private network 高latency导致频繁的 gc 相关等待，从而导致性能问题。

问题前半小时node2 AWR

Per Second	Per Transaction	Per Exec	Per Call
DB Time(s):	380.7	0.2	0.00	0.00
DB CPU(s):	68.2	0.0	0.00	0.00
Background CPU(s):	8.6	0.0	0.00	0.00
Redo size (bytes):	39,391,148.6	16,534.7
Logical read (blocks):	2,776,413.2	1,165.4
Block changes:	178,201.7	74.8
Physical read (blocks):	29,338.7	12.3
Physical write (blocks):	6,488.4	2.7
Read IO requests:	18,763.1	7.9
Write IO requests:	2,981.2	1.3
Read IO (MB):	229.2	0.1
Write IO (MB):	50.7	0.0
IM scan rows:	0.0	0.0
Session Logical Read IM:
Global Cache blocks received:	2,144.3	0.9
Global Cache blocks served:	3,382.7	1.4
User calls:	112,248.9	47.1
Parses (SQL):	4,135.5	1.7
Hard parses (SQL):	17.6	0.0
SQL Work Area (MB):	36.1	0.0
Logons:	15.4	0.0
Executes (SQL):	93,517.4	39.3
Rollbacks:	2.0	0.0
Transactions:	2,382.3

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:	99.92	Redo NoWait %:	100.00
Buffer Hit %:	98.98	In-memory Sort %:	100.00
Library Hit %:	100.26	Soft Parse %:	99.57
Execute to Parse %:	95.58	Latch Hit %:	99.77
Parse CPU to Parse Elapsd %:	54.18	% Non-Parse CPU:	99.66
Flash Cache Hit %:	0.00

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Wait Avg(ms)	% DB time	Wait Class
DB CPU		60.1K		17.9
db file sequential read	10,859,769	13.7K	1.27	4.1	User I/O
log file sync	2,211,105	13.4K	6.06	4.0	Commit
latch: cache buffers chains	379,825	3994.9	10.52	1.2	Concurrency
gc cr grant 2-way	4,700,313	3993.6	0.85	1.2	Cluster
enq: TX – row lock contention	732	1991.3	2720.37	.6	Application
latch free	378,211	1389.8	3.67	.4	Other
db file parallel read	534,612	1239.2	2.32	.4	User I/O
gc buffer busy acquire	308,207	1225.6	3.98	.4	Cluster
gc current grant 2-way	959,084	851.4	0.89	.3	Cluster

15分钟后node2 AWR

Load Profile

	Per Second	Per Transaction	Per Exec	Per Call
DB Time(s):	2,366.8	1.8	0.05	0.04
DB CPU(s):	48.4	0.0	0.00	0.00
Background CPU(s):	7.3	0.0	0.00	0.00
Redo size (bytes):	17,262,623.5	12,978.6
Logical read (blocks):	1,821,514.9	1,369.5
Block changes:	80,043.7	60.2
Physical read (blocks):	23,230.6	17.5
Physical write (blocks):	4,214.2	3.2
Read IO requests:	15,238.3	11.5
Write IO requests:	2,345.5	1.8
Read IO (MB):	181.5	0.1
Write IO (MB):	32.9	0.0
IM scan rows:	0.0	0.0
Session Logical Read IM:
Global Cache blocks received:	6,799.8	5.1
Global Cache blocks served:	7,779.9	5.9
User calls:	62,306.9	46.8
Parses (SQL):	2,077.9	1.6
Hard parses (SQL):	22.2	0.0
SQL Work Area (MB):	19.1	0.0
Logons:	12.6	0.0
Executes (SQL):	43,196.6	32.5
Rollbacks:	3.1	0.0
Transactions:	1,330.1

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:	99.66	Redo NoWait %:	100.00
Buffer Hit %:	98.72	In-memory Sort %:	100.00
Library Hit %:	100.14	Soft Parse %:	98.93
Execute to Parse %:	95.19	Latch Hit %:	105.08
Parse CPU to Parse Elapsd %:	70.98	% Non-Parse CPU:	99.62
Flash Cache Hit %:	0.00

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Wait Avg(ms)	% DB time	Wait Class
latch: row cache objects	5,655,292	477K	84.35	21.7	Concurrency
enq: TX – index contention	305,813	376.4K	1230.67	17.1	Concurrency
gc cr grant 2-way	4,342,796	211K	48.59	9.6	Cluster
buffer busy waits	723,245	202.2K	279.60	9.2	Concurrency
gc buffer busy release	364,833	198.8K	544.96	9.0	Cluster
gc buffer busy acquire	2,666,894	137K	51.36	6.2	Cluster
gc cr multi block request	625,425	55K	87.91	2.5	Cluster
DB CPU		45K		2.0
gc current grant 2-way	812,849	40.1K	49.34	1.8	Cluster
gc current grant busy	459,003	28.1K	61.16	1.3	Cluster

Host CPU

CPUs	Cores	Sockets	Load Average Begin	Load Average End	%User	%System	%WIO	%Idle
256	128	16	0.40	0.33	28.0	5.0	0.6	67.1

Instance CPU

%Total CPU	%Busy CPU	%DB time waiting for CPU (Resource Manager)
23.4	71.0	0.0

看问题时刻另一节点node1的AWR TOP EVENT

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Wait Avg(ms)	% DB time	Wait Class
gc buffer busy acquire	1,940,939	106.1K	54.67	16.7	Cluster
gc cr block 2-way	4,050,009	92.3K	22.78	14.5	Cluster
gc cr grant 2-way	3,445,569	70.5K	20.46	11.1	Cluster
enq: TX – index contention	48,310	59K	1222.08	9.3	Concurrency
DB CPU		38.4K		6.0
gc current block 2-way	1,636,339	35K	21.40	5.5	Cluster
gc buffer busy release	58,210	32.9K	565.78	5.2	Cluster
gc cr multi block request	822,890	32.6K	39.62	5.1	Cluster
gc cr grant congested	215,405	27.8K	129.05	4.4	Cluster
latch: row cache objects	6,116,171	23.5K	3.84	3.7	Concurrency

Host CPU

CPUs	Cores	Sockets	Load Average Begin	Load Average End	%User	%System	%WIO	%Idle
256	128	16	0.26	0.25	21.3	4.2	0.1	74.6

Instance CPU

%Total CPU	%Busy CPU	%DB time waiting for CPU (Resource Manager)
18.8	73.8	0.0

当时互相ping私网已经显示出不正常的network latency，正常应该是小于1ms. 两个节点的CPU看使用率并不高，当然需要看osw更精确的时间点。

分析方法

1，检查两节点系统资源利用率，如是否因为CPU使用率过导致lms进程慢

2，AWRDD对比

3，检查网络延迟和瓶颈

$ netstat -s
$ ss -s
$ sar -n DEV 1 5
$ sar -n TCP 1 5
# ip link show – check network interface status (up /down)
# ethtool eth0 – ethtool is used for querying settings of an ethernet device like network speed, packet drop info etc

other tools iftop/nload  speedometer  nethogs tcptrack -i eth0

4, ASH和OSW分钟或秒级分析查找问题最初的表现

5, 应用跨实例部署

等等

看RAC 部分AWR

NODE1

RAC Statistics

	Begin	End
Number of Instances:	2	2

Global Cache Load Profile

	Per Second	Per Transaction
Global Cache blocks received:	8,087.67	11.81
Global Cache blocks served:	6,770.28	9.89
GCS/GES messages received:	53,558.17	78.23
GCS/GES messages sent:	62,283.68	90.98
DBWR Fusion writes:	272.70	0.40
Estd Interconnect traffic (KB)	141,488.97

Global Cache and Enqueue Services – Workload Characteristics

Avg global enqueue get time (ms):	11.5
Avg global cache cr block receive time (ms):	23.8
Avg global cache current block receive time (ms):	25.7
Avg global cache cr block build time (ms):	0.0
Avg global cache cr block send time (ms):	0.0
Global cache log flushes for cr blocks served %:	20.4
Avg global cache cr block flush time (ms):	2.3
Avg global cache current block pin time (ms):	1.2
Avg global cache current block send time (ms):	0.0
Global cache log flushes for current blocks served %:	2.2
Avg global cache current block flush time (ms):	6.8

Global Cache and Enqueue Services – Messaging Statistics

Avg message sent queue time (ms):	0.2
Avg message sent queue time on ksxp (ms):	36.0
Avg message received queue time (ms):	0.0
Avg GCS message process time (ms):	0.0
Avg GES message process time (ms):	0.0
% of direct sent messages:	35.89
% of indirect sent messages:	63.86
% of flow controlled messages:	0.25

Interconnect Ping Latency Stats

Target Instance	500B Ping Count	Avg Latency 500B msg	Stddev 500B msg	8K Ping Count	Avg Latency 8K msg	Stddev 8K msg
1	71	0.20	0.06	71	0.20	0.06
2	71	4.60	7.01	71	4.71	7.03

Node2

RAC Statistics

	Begin	End
Number of Instances:	2	2

Global Cache Load Profile

	Per Second	Per Transaction
Global Cache blocks received:	6,799.76	5.11
Global Cache blocks served:	7,779.91	5.85
GCS/GES messages received:	57,138.25	42.96
GCS/GES messages sent:	51,467.70	38.70
DBWR Fusion writes:	563.76	0.42
Estd Interconnect traffic (KB)	137,849.40

Global Cache and Enqueue Services – Workload Characteristics

Avg global enqueue get time (ms):	53.1
Avg global cache cr block receive time (ms):	4.3
Avg global cache current block receive time (ms):	7.2
Avg global cache cr block build time (ms):	0.0
Avg global cache cr block send time (ms):	0.0
Global cache log flushes for cr blocks served %:	11.7
Avg global cache cr block flush time (ms):	5.0
Avg global cache current block pin time (ms):	3.8
Avg global cache current block send time (ms):	0.0
Global cache log flushes for current blocks served %:	2.1
Avg global cache current block flush time (ms):	14.7

Global Cache and Enqueue Services – Messaging Statistics

Avg message sent queue time (ms):	8.3
Avg message sent queue time on ksxp (ms):	9.8
Avg message received queue time (ms):	0.2
Avg GCS message process time (ms):	0.1
Avg GES message process time (ms):	0.0
% of direct sent messages:	44.90
% of indirect sent messages:	54.87
% of flow controlled messages:	0.24

这个问题和几年前Troubleshooting high “gc current grant 2-way” and “gc cr grant 2-way” 案例(二) 有些相似之处,Estd Interconnect traffic (KB) 我们可以看到几乎达到了千兆的上限141,488KB/s。

ksxp 层决定 IPC 协议如何执行的可以查询x$ksxpclient，dba_hist_ic_client_stats 视图显示了相同的信息，并且该视图确实收集了所有实例的统计信息。

select
name,
trunc(bytes_sent/1024/1024) as mb_sent,
trunc(bytes_rcv/1024/1024) as mb_recv,
trunc((bytes_sent+bytes_rcv)/1024/1024) as mb_total,
to_char(round(100*(((bytes_sent+bytes_rcv)/1024/1024)/
sum((bytes_sent+bytes_rcv)/1024/1024) over ()),2),'990.00') as total_perc
from
sys.x$ksxpclient
order by
total_perc desc;

Poorly sized UDP receive (rx) buffer sizes / UDP buffer socket overflows

‘netstat -s’ or ‘netstat -su’ and look for either “udpInOverflowsudpInOverflows”, “packet receive errors”, “fragments dropped” or “outgoing packet drop” depending on the platform.
If there is a significant increases in “outgoing packets dropped” in the TCP section of netstat -s output on the nodes remote to where the workload is running, increase the wmem_default and wmem_max both to 4MB (Linux) could resolve the issue.

Poor interconnect performance and high cpu utilization. `netstat -s` reports packet reassembly failures

Large UDP datagrams may be fragmented and sent in multiple frames based on Maximum Transmission Unit (MTU) size.
`netstat -s` reports a large number of Internet Protocol (IP) “reassembles failed” and “fragments dropped after timeout” in the “IP Statistics” section of the output on the receiving node.

`netstat -s` IP stat counters:
3104582 fragments dropped after timeout
34550600 reassemblies required
8961342 packets reassembled ok
3104582 packet reassembles failed.

LINUX 6
netstat -s |grep reass
-sleep 5sec
netstat -s |grep reass

LINUX 7 
nstat -az|grep IpReasmFails
-sleep 5sec
nstat -az|grep IpReasmFails

Action: Increase fragment reassembly buffers, allocating more space for reassembly. Increase the time to reassemble packet fragment., increase udp receive buffers to accommodate network processing latencies that aggravate reassembly failures and identify CPU utilization that negatively impacts network stack processing. Note, increase the following settings will also increase the memory usage.

On LINUX:
To modify reassembly buffer space, change the following thresholds:

/proc/sys/net/ipv4/ipfrag_low_thresh (default = 196608)
/proc/sys/net/ipv4/ipfrag_high_thresh (default = 262144)

To modify packet fragment reassembly times, modify:

/proc/sys/net/ipv4/ipfrag_time (default = 30)

ADDM report

Finding 2: Interconnect Latency
Impact is 421.58 active sessions, 61.59% of total activity.
———————————————————–
Higher than expected latency of the cluster interconnect was responsible for significant database time on this instance.
The instance was consuming 414762 kilo bits per second of interconnect bandwidth.
65% of this interconnect bandwidth was used for global cache messaging, 0% for parallel query messaging and 33% for database lock management.
The average latency for 8K interconnect messages was 4708 microseconds. The instance is using the private interconnect device “lan901:801” with IP address 169.254.141.243 and source “33”.

Finding 5: Global Cache Congestion
Impact is 80.49 active sessions, 11.76% of total activity.
———————————————————-
Global Cache Service Processes (LMSn) in other instances were not processing requests fast enough.

Recommendation 1: Database Configuration Estimated benefit is 80.49 active sessions, 11.76% of total activity.
———————————————————————
Action
Increase throughput of the Global Cache Service (LMSn) processes.
Increase the number of Global Cache Service processes by increasing the value of the parameter “gcs_server_processes”. Alternatively, if the
host is CPU bound consider increasing the OS priority of the Global Cache Service processes.
Rationale
The value of parameter “gcs_server_processes” was “20” during theanalysis period.

简单记录一下问题现象，这个case后来看是因为应用因为连接数满自动failover到了其它实例，跨实例的定时调度大量insert，加剧GC，当然期中还引起了latch undo争用和看到index root split, 本篇仅记录一次interconnect network网络带宽耗尽后的gc现象。