
Troubleshooting an Out-Of-Memory (OOM) killer db crash when memory is exhausted

# db alert log

Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Tue Apr 23 08:54:27 2019
WARNING: Heavy swapping observed on system in last 5 mins.
pct of memory swapped in [3.68%] pct of memory swapped out [13.12%].
Please make sure there is no memory pressure and the SGA and PGA 
are configured correctly. Look at DBRM trace file for more details.
Tue Apr 23 08:56:27 2019
Thread 1 cannot allocate new log, sequence 10395
Private strand flush not complete
  Current log# 2 seq# 10394 mem# 0: /hescms/oradata/anbob/redo02a.log
  Current log# 2 seq# 10394 mem# 1: /hescms/oradata/scms/redo02b.log
Thread 1 advanced to log sequence 10395 (LGWR switch)
  Current log# 3 seq# 10395 mem# 0: /hescms/oradata/scms/redo03a.log
  Current log# 3 seq# 10395 mem# 1: /hescms/oradata/scms/redo03b.log
Tue Apr 23 08:56:41 2019
Archived Log entry 10505 added for thread 1 sequence 10394 ID 0xaef43455 dest 1:
Tue Apr 23 09:08:37 2019
System state dump requested by (instance=1, osid=8886 (PMON)), summary=[abnormal instance termination].
Tue Apr 23 09:08:37 2019
PMON (ospid: 8886): terminating the instance due to error 471
System State dumped to trace file /ora/diag/rdbms/scms/scms/trace/scms_diag_8896_20190423090837.trc
Tue Apr 23 09:08:37 2019
opiodr aborting process unknown ospid (22614) as a result of ORA-1092
Tue Apr 23 09:08:38 2019
opiodr aborting process unknown ospid (27627) as a result of ORA-1092
Instance terminated by PMON, pid = 8886
Tue Apr 23 09:18:18 2019
Starting ORACLE instance (normal)

# OS log /var/log/messages

Apr 23 08:52:18 anbobdb kernel: NET: Unregistered protocol family 36
Apr 23 09:07:28 anbobdb kernel: oracle invoked oom-killer: gfp_mask=0x84d0, order=0, oom_adj=0, oom_score_adj=0
Apr 23 09:07:32 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb kernel: oracle cpuset=/ mems_allowed=0-4
Apr 23 09:07:47 anbobdb kernel: Pid: 22753, comm: oracle Not tainted 2.6.32-431.el6.x86_64 #1
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:47 anbobdb kernel: Call Trace:
Apr 23 09:07:47 anbobdb kernel: [] ? dump_header+0x90/0x1b0
Apr 23 09:07:47 anbobdb kernel: [] ? security_real_capable_noaudit+0x3c/0x70
Apr 23 09:07:47 anbobdb kernel: [] ? oom_kill_process+0x82/0x2a0
Apr 23 09:07:47 anbobdb kernel: [] ? select_bad_process+0xe1/0x120
Apr 23 09:07:47 anbobdb kernel: [] ? out_of_memory+0x220/0x3c0
Apr 23 09:07:47 anbobdb kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:48 anbobdb kernel: [] ? alloc_pages_current+0xaa/0x110
Apr 23 09:07:52 anbobdb kernel: [] ? pte_alloc_one+0x1b/0x50
Apr 23 09:07:52 anbobdb kernel: [] ? __pte_alloc+0x32/0x160
Apr 23 09:07:52 anbobdb kernel: [] ? handle_mm_fault+0x1c0/0x300
Apr 23 09:07:52 anbobdb kernel: [] ? down_read_trylock+0x1a/0x30

Note: the OS messages indicate a resource shortage and OOM killer activity (TFA will collect these logs).

What is the OOM Killer?
The OOM killer, enabled by default in the Linux kernel, is a self-protection mechanism employed by the kernel when it is under severe memory pressure.
If the kernel cannot find free memory when an allocation is needed, it puts in-use user data pages on the swap-out queue to be swapped out. If the Virtual Memory (VM) subsystem cannot allocate memory and cannot swap out in-use pages, the Out-of-Memory killer may begin killing userspace processes: it sacrifices one or more processes in order to free up memory for the system when all else fails.

In principle, the OOM killer tries to:
Lose the minimum amount of work done
Recover as much memory as it can
Not kill any process that is not by itself using a lot of memory
Kill the minimum number of processes (ideally one)
Kill the process the user would expect it to kill
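The selection heuristic above is reflected in each process's badness score, which the kernel exposes through procfs. As a minimal sketch (assuming a Linux system with the standard /proc interface), you can list the processes most at risk of being chosen as the next OOM victim:

```shell
# List the five processes with the highest OOM badness score;
# a higher /proc/<pid>/oom_score means a more likely OOM victim.
for pid in /proc/[0-9]*; do
  score=$(cat "$pid/oom_score" 2>/dev/null) || continue
  comm=$(cat "$pid/comm" 2>/dev/null)
  printf '%8s  %6s  %s\n' "$score" "${pid##*/}" "$comm"
done | sort -rn | head -5
```

A critical process can be exempted by writing -1000 to /proc/&lt;pid&gt;/oom_score_adj (older kernels also honor oom_adj = -17); note that this only protects that process, it does not fix the underlying memory shortage.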

Probable causes:

1 A spike in memory usage driven by a load event (additional processes are needed for the increased load).
2 A spike in memory usage because additional services were added or migrated to the system (another application or a new service was started).
3 A spike in memory usage due to failed hardware, such as a DIMM memory module.
4 A spike in memory usage because hardware resources are undersized for the running application(s).
5 A memory leak in a running application.

If an application uses mlock() or HugeTLB pages (HugePages), swap space cannot be used for that application's pages, because locked pages and HugePages are not swappable. When this is the case, SwapFree may still show a very large value at the moment the OOM occurs. However, overusing locked memory or HugePages may exhaust system memory and leave the kernel with no recourse other than the OOM killer.
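A quick way to see whether HugePages (and remaining swap) are a factor is to read the relevant /proc/meminfo fields; this sketch assumes the standard Linux field names:

```shell
# HugePages_Total/Free/Rsvd show how much memory is pinned in huge
# pages (never swappable); SwapFree can still be large during an OOM.
grep -E '^(HugePages_(Total|Free|Rsvd)|Hugepagesize|SwapTotal|SwapFree):' /proc/meminfo
```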

Troubleshooting
Check how often the Out-of-Memory (OOM) killer has fired:
$ egrep 'Out of memory:' /var/log/messages

Check how much memory the killed processes were consuming:
$ egrep 'total-vm' /var/log/messages
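The kernel's "Killed process" lines report total-vm and RSS figures in kB. As a sketch, the awk below converts total-vm to MB from a sample line (the message format is typical of RHEL 6 kernels, but the PID and sizes here are made up; in practice pipe the egrep output above into the same awk):

```shell
# Sample OOM kill line with hypothetical values.
line='Apr 23 09:07:52 anbobdb kernel: Killed process 22753, UID 500, (oracle) total-vm:33558004kB, anon-rss:8191996kB, file-rss:4kB'
# Extract the number after "total-vm:" and convert kB to MB.
echo "$line" | awk -F'total-vm:' '{ split($2, a, "kB"); printf "total-vm = %d MB\n", a[1]/1024 }'
# → total-vm = 32771 MB
```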

For further analysis, check the system activity reporter (SAR) data to see what it has captured about the OS over time.

Check swap statistics with the -S flag: a high %swpused indicates swapping and a memory shortage
$ sar -S -f /var/log/sa/sa2

Check CPU and I/O-wait statistics: high %user or %system indicates a busy system; a high %iowait means the system is spending significant time waiting on the underlying storage
$ sar -f /var/log/sa/sa31

Check memory statistics: high %memused and %commit values tell us the system is using nearly all of its memory; a high %commit (memory committed to processes) is the more concerning of the two
$ sar -r -f /var/log/sa/sa

Lastly, check the amount of memory on the system, and how much is free/available:

$ free -m
$ cat /proc/meminfo
$ dmidecode -t memory
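The key /proc/meminfo fields can be condensed into one summary; the sketch below assumes the standard Linux field names, and Committed_AS approaching CommitLimit mirrors the high %commit that sar -r would report:

```shell
# One-shot memory summary in MB from /proc/meminfo (values are in kB).
awk '/^(MemTotal|MemFree|SwapTotal|SwapFree|CommitLimit|Committed_AS):/ { printf "%-13s %9.0f MB\n", $1, $2/1024 }' /proc/meminfo
```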

In an Oracle environment, first check whether the SGA and PGA are sized reasonably. In this case, we later reduced these memory areas, reserved more available memory for the operating system, and configured HugePages. The benefits of HugePages are not repeated here. BTW, if you increase the HugePages count, check whether you have reached the kernel.shmall upper limit, and also check for application process memory leaks, or even a PGA leak.   config hugepage linux
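As a sketch of the HugePages sizing step (the 12 GB SGA here is an assumed example; Hugepagesize is read from /proc/meminfo and is typically 2048 kB on x86_64):

```shell
# Estimate vm.nr_hugepages needed to back the SGA, plus one page of
# headroom (a rough rule of thumb; Oracle's hugepages_settings.sh
# script computes this per shared memory segment).
SGA_GB=12                                        # assumed target SGA size
HPG_KB=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo)
NR=$(( SGA_GB * 1024 * 1024 / HPG_KB + 1 ))
echo "vm.nr_hugepages = $NR"
# Also confirm kernel.shmall (counted in 4 kB pages) covers the SGA:
echo "kernel.shmall >= $(( SGA_GB * 1024 * 1024 / 4 ))"
```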

 

References: Linux: Out-of-Memory (OOM) Killer (Doc ID 452000.1) and the RHEL online documentation.
