首页 » ORACLE, 系统相关 » Troubleshooting Oracle RAC node crash frequently on OEL 6.9 (Kernel panic)

Troubleshooting Oracle RAC node crash frequently on OEL 6.9 (Kernel panic)

有个客户的多套RAC节点总是频繁的重启, 故障并没有时间规律,环境Oracle RAC 11G r2 ON OEL 6.9 , 当然首先建议要排除一些硬件共性的配置,如是否电源电压不稳?机器所在区域空调是否差? 是否同一个宿主机上的VM? 硬件是否有报错? 一线同学说都排查过也做了些调整并未解决。 下面分享一些处理建议。

DB ALERT LOG

# db1 alert log

Wed Mar 16 05:00:17 2022
Archived Log entry 44608 added for thread 1 sequence 20901 ID 0x332f72cc dest 1:
Wed Mar 16 09:00:11 2022
Thread 1 advanced to log sequence 20903 (LGWR switch)
  Current log# 6 seq# 20903 mem# 0: +DATA/anbob/onlinelog/group_6.275.1079106239
  Current log# 6 seq# 20903 mem# 1: +FRA/anbob/onlinelog/group_6.11781.1079106243
Wed Mar 16 09:00:13 2022
Archived Log entry 44611 added for thread 1 sequence 20902 ID 0x332f72cc dest 1:
Wed Mar 16 09:59:43 2022
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
 
Total Shared Global Region in Large Pages = 0 KB (0%)
 
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
 
RECOMMENDATION:
  Total System Global Area size is 76 GB. For optimal performance,
  prior to the next instance restart:
  1. Increase the number of unused large pages by 
 at least 38657 (page size 2048 KB, total size 76 GB) system wide to
  get 100% of the System Global Area allocated with large pages
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 64
Number of processor cores in the system is 32
Number of processor sockets in the system is 4

#  db2
Fri Feb 18 04:30:11 2022
Archived Log entry 3688 added for thread 1 sequence 2011 ID 0x4f4f7234 dest 1:
Fri Feb 18 08:30:08 2022
Thread 1 advanced to log sequence 2013 (LGWR switch)
  Current log# 1 seq# 2013 mem# 0: +DATA/weejar/onlinelog/group_1.261.1073243451
  Current log# 1 seq# 2013 mem# 1: +FRA/weejar/onlinelog/group_1.257.1073243453
Fri Feb 18 08:30:09 2022
Archived Log entry 3689 added for thread 1 sequence 2012 ID 0x4f4f7234 dest 1:
Fri Feb 18 09:39:23 2022
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
 
Total Shared Global Region in Large Pages = 49 GB (98%)
 
Large Pages used by this instance: 25280 (49 GB)
Large Pages unused system wide = 320 (640 MB)
Large Pages configured system wide = 25600 (50 GB)

Note:
找了2套库看了一下时间和重启前现象,数据库1是没有配置hugepage,不符合最佳实践,但不是问题原因,共同的现象是重启前DB 层都没有任何报错,日志切换不频繁说明DML事务并不多。

GI ALERT LOG

 [ctssd(15405)]CRS-2412:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /u01/app/11.2.0/grid/log/anbob1/ctssd/octssd.log.
2022-02-11 10:43:45.541: 
[ctssd(15405)]CRS-2412:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /u01/app/11.2.0/grid/log/anbob1/ctssd/octssd.log.
2022-02-11 11:13:45.941: 
[ctssd(15405)]CRS-2412:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /u01/app/11.2.0/grid/log/anbob1/ctssd/octssd.log.
2022-02-11 11:25:32.172: 
[ohasd(13944)]CRS-2112:The OLR service started on node anbob1.
2022-02-11 11:25:32.248: 
[ohasd(13944)]CRS-1301:Oracle High Availability Service started on node anbob1.

...

 [ctssd(15468)]CRS-2409:The clock on host anbob1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2022-02-14 02:19:51.430: 
[ctssd(15468)]CRS-2409:The clock on host anbob1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2022-02-14 02:49:52.268: 
[ctssd(15468)]CRS-2409:The clock on host anbob1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2022-02-14 03:19:53.127: 
[ctssd(15468)]CRS-2409:The clock on host anbob1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2022-02-14 03:50:01.978: 
[ctssd(15468)]CRS-2409:The clock on host anbob1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2022-02-14 04:02:24.282: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 7940 msecs.
2022-02-14 04:02:24.282: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmSendingThread not scheduled for 7740 msecs.
2022-02-14 04:02:25.283: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 7900 msecs.
2022-02-14 04:02:25.283: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 7900 msecs.
2022-02-14 04:02:25.283: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 7930 msecs.
2022-02-14 04:02:25.283: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 7930 msecs.
2022-02-14 04:02:25.283: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 7960 msecs.
2022-02-14 04:02:26.284: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 8310 msecs.
2022-02-14 04:02:26.284: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvKillBlockThread not scheduled for 9210 msecs.
2022-02-14 04:02:31.870: 
[cssd(14459)]CRS-1612:Network communication with node yngxsky2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.620 seconds
2022-02-14 04:02:32.287: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 15940 msecs.
2022-02-14 04:02:32.287: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmSendingThread not scheduled for 15740 msecs.
2022-02-14 04:02:33.287: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 15900 msecs.
2022-02-14 04:02:33.287: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 15900 msecs.
2022-02-14 04:02:33.287: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 15930 msecs.
2022-02-14 04:02:33.287: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 15930 msecs.
2022-02-14 04:02:33.288: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 15960 msecs.
2022-02-14 04:02:34.288: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 16310 msecs.
2022-02-14 04:02:34.288: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvKillBlockThread not scheduled for 17210 msecs.
2022-02-14 04:02:39.289: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvKillBlockThread not scheduled for 8000 msecs.
2022-02-14 04:02:39.290: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvKillBlockThread not scheduled for 8000 msecs.
2022-02-14 04:02:39.872: 
[cssd(14459)]CRS-1611:Network communication with node yngxsky2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.620 seconds
2022-02-14 04:02:40.290: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 23940 msecs.
2022-02-14 04:02:40.290: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmSendingThread not scheduled for 23750 msecs.
2022-02-14 04:02:41.291: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 23910 msecs.
2022-02-14 04:02:41.291: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 23910 msecs.
2022-02-14 04:02:41.291: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 23940 msecs.
2022-02-14 04:02:41.291: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 23940 msecs.
2022-02-14 04:02:41.291: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 23970 msecs.
2022-02-14 04:02:42.291: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 24320 msecs.
2022-02-14 04:02:42.291: 
[cssd(14459)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvKillBlockThread not scheduled for 25220 msecs.
2022-02-14 04:08:16.481: 
[ohasd(14069)]CRS-2112:The OLR service started on node anbob1.
2022-02-14 04:08:16.554: 
[ohasd(14069)]CRS-1301:Oracle High Availability Service started on node anbob1.
2022-02-14 04:08:16.560: 
[ohasd(14069)]CRS-8011:reboot advisory message from host: anbob1, component: cssmonit, with time stamp: L-2022-02-14-04:02:45.004
[ohasd(14069)]CRS-8013:reboot advisory message text: oracssdmonitor is rebooting this node due to network timeout (no network activity for 28450 milliseconds).
2022-02-14 04:08:16.567: 
[ohasd(14069)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 1 were announced and 0 errors occurred

...

 [ctssd(80383)]CRS-2409:The clock on host anbob1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2022-02-18 09:12:06.232: 
[ctssd(80383)]CRS-2409:The clock on host anbob1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2022-02-18 09:37:39.125: 
[ohasd(13946)]CRS-2112:The OLR service started on node anbob1.
2022-02-18 09:37:39.185: 
[ohasd(13946)]CRS-1301:Oracle High Availability Service started on node anbob1.
2022-02-18 09:37:39.205: 
[ohasd(13946)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2022-02-18 09:37:39.975: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14066)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/anbob1/agent/ohasd/oraagent_grid//oraagent_grid.log"
2022-02-18 09:37:43.200: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(14070)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 
2022-02-18 09:37:45.622: 
[gpnpd(14231)]CRS-2328:GPNPD started on node anbob1. 

Note:
抽了3次重启日志,有1次是重启前提示disk和network heartbeat心跳都丢失,可能是部分进程HANG或是存储和心跳网络同时故障,另外2次同样是无报错,直接日志断, 当然这种要么是断电也可能是OS hang本地文件系统都没响应。大多数也是GI层没有错误日志

OS messages log

# OS MESSAGE 
Feb 18 08:58:12 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 08:58:15 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation

Feb 18 09:03:11 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:08:01 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:08:06 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:08:11 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:08:16 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation

Feb 18 09:28:07 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:28:08 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:28:12 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:28:19 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:30:16 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
Feb 18 09:30:17 anbob1 sshd[15443]: error: session_signal_req: session signalling requires privilege separation
<<<<<<<<                                                                                                                                                                                Feb 18 09:37:06 anbob1 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Feb 18 09:37:06 anbob1 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="10712" x-info="http://www.rsyslog.com"] start
Feb 18 09:37:06 anbob1 kernel: [    0.000000] Initializing cgroup subsys cpuset
Feb 18 09:37:06 anbob1 kernel: [    0.000000] Initializing cgroup subsys cpu
Feb 18 09:37:06 anbob1 kernel: [    0.000000] Initializing cgroup subsys cpuacct
Feb 18 09:37:06 anbob1 kernel: [    0.000000] Linux version 4.1.12-61.1.28.el6uek.x86_64 (mockbuild@x86-ol6-builder-06) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #2 SMP Thu Feb 23 20:03:53 PST 2017
Feb 18 09:37:06 anbob1 kernel: [    0.000000] Command line: ro root=/dev/mapper/vg-root rd_NO_LUKS  KEYBOARDTYPE=pc KEYTABLE=us LANG=en_US.UTF-8 rd_LVM_LV=vg/sawp rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_LVM_LV=vg/root rd_NO_DM quiet transparent_hugepage=never
Feb 18 09:37:06 anbob1 kernel: [    0.000000] e820: BIOS-provided physical RAM map:
Feb 18 09:37:06 anbob1 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009cbff] usable
Feb 18 09:37:06 anbob1 kernel: [    0.000000] BIOS-e820: [mem 0x000000000009cc00-0x000000000009ffff] reserved
Feb 18 09:37:06 anbob1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved

Note:
最近一次重启前OS 一样没有致命性报错信息。

OS release

Oracle Linux Server release 6.9

OSW

# VMSTAT
zzz ***Fri Feb 18 09:29:23 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 140329936 545852 53463124    0    0    28    11    0    0  1  0 99  0  0	
 1  0      0 140321344 545852 53463336    0    0   937  1195 8208 16256  1  1 98  0  0	
 0  0      0 140323120 545852 53463520    0    0   889   432 7031 15666  0  1 99  0  0	
zzz ***Fri Feb 18 09:29:54 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 140333120 545852 53463764    0    0    28    11    0    0  1  0 99  0  0	
 0  0      0 140324464 545852 53463768    0    0   941   208 8253 17067  1  1 98  0  0	
 1  0      0 140325232 545852 53463840    0    0  1966   197 7511 15787  0  1 99  0  0	
zzz ***Fri Feb 18 09:30:24 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  0      0 140293360 545860 53464768    0    0    28    11    0    0  1  0 99  0  0	
 3  0      0 140281392 545860 53465060    0    0  2950   446 12225 18983  4  2 94  0  0	
 3  0      0 140279648 545860 53465244    0    0  1967   290 9693 16894  2  1 97  0  0	
zzz ***Fri Feb 18 09:30:55 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 140313696 545860 53465420    0    0    28    11    0    0  1  0 99  0  0	
 1  0      0 140308144 545860 53465436    0    0  1998  1254 8888 16500  2  1 97  0  0	
 0  0      0 140307984 545860 53465820    0    0  1815  1161 7762 15977  1  1 98  0  0	
zzz ***Fri Feb 18 09:31:25 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 140326928 545860 53466056    0    0    28    11    0    0  1  0 99  0  0	
 3  0      0 140318560 545860 53466284    0    0  1872   114 8055 16400  1  1 98  0  0	
 1  0      0 140317616 545860 53466468    0    0  1826    92 7514 15862  1  1 99  0  0	
                                                                  <<<< snapshot miss                                                                                                                                                                                                                                                                                                                                                                                    Linux OSWbb v7.3.3 anbob1
SNAP_INTERVAL 30
CPU_COUNT 64
OSWBB_ARCHIVE_DEST /u01/app/grid/tfa/repository/suptools/anbob1/oswbb/grid/archive
zzz ***Fri Feb 18 09:38:15 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  1      0 209582320  59836 381792    0    0    62     4   47   77  1  1 98  0  0	
 1  0      0 209570784  59840 382068    0    0    59    17 1563 4010  0  0 99  0  0	
 2  0      0 209571616  59840 382024    0    0    55     2  847 2878  0  0 100  0  0	


# mpstat

09:31:26     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
09:31:27     all    0.57    0.00    0.77    0.05    0.00    0.00    0.00    0.00   98.62
09:31:27       0    1.02    0.00    1.02    0.00    0.00    1.02    0.00    0.00   96.94
09:31:27       1    2.02    0.00    1.01    0.00    0.00    0.00    0.00    0.00   96.97
09:31:27       2    1.01    0.00    2.02    0.00    0.00    0.00    0.00    0.00   96.97
09:31:27       3    2.04    0.00    1.02    0.00    0.00    0.00    0.00    0.00   96.94
09:31:27       4    2.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.98
09:31:27       5    1.01    0.00    1.01    0.00    0.00    0.00    0.00    0.00   97.98
09:31:27       6    3.03    0.00    0.00    0.00    0.00    0.00    0.00    0.00   96.97
09:31:27       7    1.01    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.99
09:31:27       8    1.01    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.99
09:31:27       9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      12    1.01    0.00    1.01    0.00    0.00    0.00    0.00    0.00   97.98
09:31:27      13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      14    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      15    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      16    2.08    0.00   12.50    0.00    0.00    0.00    0.00    0.00   85.42
09:31:27      17    0.00    0.00    1.02    0.00    0.00    0.00    0.00    0.00   98.98
09:31:27      18    1.03    0.00    1.03    0.00    0.00    0.00    0.00    0.00   97.94
09:31:27      19    1.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.98
09:31:27      20    0.00    0.00    1.02    0.00    0.00    0.00    0.00    0.00   98.98
09:31:27      21    1.00    0.00    1.00    2.00    0.00    0.00    0.00    0.00   96.00
09:31:27      22    1.01    0.00    0.00    1.01    0.00    0.00    0.00    0.00   97.98
09:31:27      23    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      24    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      25    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      26    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      27    0.98    0.00    0.98    0.00    0.00    0.00    0.00    0.00   98.04
09:31:27      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      33    0.00    0.00    1.03    0.00    0.00    0.00    0.00    0.00   98.97
09:31:27      34    6.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00   92.00
09:31:27      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      36    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      37    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      38    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
09:31:27      39    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
09:31:27      40    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      41    0.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00   98.00
09:31:27      42    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      43    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      45    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      46    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
09:31:27      47    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      48    2.04    0.00    7.14    0.00    0.00    0.00    0.00    0.00   90.82
09:31:27      49    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      50    0.00    0.00    3.00    0.00    0.00    0.00    0.00    0.00   97.00
09:31:27      51    0.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   99.00
09:31:27      52    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      53    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      54    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      55    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      56    0.00    0.00    1.01    0.00    0.00    0.00    0.00    0.00   98.99
09:31:27      57    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      58    4.00    0.00    5.00    0.00    0.00    0.00    0.00    0.00   91.00
09:31:27      59    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      60    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
09:31:27      61    0.99    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.01
09:31:27      62    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
09:31:27      63    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.78    0.00    0.88    0.05    0.00    0.01    0.00    0.00   98.29
Average:       0    3.09    0.00    0.52    0.00    0.00    0.52    0.00    0.00   95.88
Average:       1    1.52    0.00    1.02    0.00    0.00    0.00    0.00    0.00   97.46
Average:       2    1.01    0.00    1.52    0.00    0.00    0.00    0.00    0.00   97.47
Average:       3    2.02    0.00    0.51    0.00    0.00    0.00    0.00    0.00   97.47
Average:       4    3.05    0.00    0.51    0.00    0.00    0.00    0.00    0.00   96.45
Average:       5    1.02    0.00    0.51    0.00    0.00    0.00    0.00    0.00   98.48
Average:       6    1.52    0.00    0.51    0.00    0.00    0.00    0.00    0.00   97.98
Average:       7    1.01    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.99
Average:       8    0.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.50
Average:       9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      11    0.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00   98.99
Average:      12    1.01    0.00    0.50    0.00    0.00    0.00    0.00    0.00   98.49
Average:      13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      14    0.50    0.00    1.01    0.00    0.00    0.00    0.00    0.00   98.49
Average:      15    0.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.50
Average:      16    2.05    0.00    6.67    0.00    0.00    0.00    0.00    0.00   91.28
Average:      17    0.52    0.00    1.03    0.00    0.00    0.00    0.00    0.00   98.45
Average:      18    7.77    0.00    2.07    0.00    0.00    0.00    0.00    0.00   90.16
Average:      19    1.02    0.00    0.51    0.00    0.00    0.00    0.00    0.00   98.48
Average:      20    0.51    0.00    0.51    0.51    0.00    0.00    0.00    0.00   98.47
Average:      21    1.00    0.00    1.00    1.50    0.00    0.00    0.00    0.00   96.50
Average:      22    0.51    0.00    0.00    1.02    0.00    0.00    0.00    0.00   98.47
Average:      23    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      24    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      25    0.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.00
Average:      26    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      27    0.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.00
Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    2.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00   97.00
Average:      33    0.00    0.00    1.03    0.00    0.00    0.00    0.00    0.00   98.97
Average:      34    3.50    0.00    1.50    0.00    0.00    0.00    0.00    0.00   95.00
Average:      35    3.02    0.00    5.53    0.00    0.00    0.00    0.00    0.00   91.46
Average:      36    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      37    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      38    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
Average:      39    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
Average:      40    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      41    0.50    0.00    1.00    0.00    0.00    0.00    0.00    0.00   98.50
Average:      42    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      43    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      45    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      46    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
Average:      47    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
Average:      48    3.63    0.00   12.44    0.00    0.00    0.00    0.00    0.00   83.94
Average:      49    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      50    0.00    0.00    1.50    0.00    0.00    0.00    0.00    0.00   98.50
Average:      51    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
Average:      52    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
Average:      53    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      54    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      55    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      56    0.00    0.00    1.01    0.00    0.00    0.00    0.00    0.00   98.99
Average:      57    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      58    5.00    0.00    7.50    0.00    0.00    0.00    0.00    0.00   87.50
Average:      59    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      60    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      61    0.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.50
Average:      62    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
Average:      63    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
                                            <<<<<<<<  snapshot   miss                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Linux OSWbb v7.3.3
zzz ***Fri Feb 18 09:38:15 CST 2022
Linux 4.1.12-61.1.28.el6uek.x86_64 (anbob1) 	02/18/22 	_x86_64_	(64 CPU)

NOTE:
从09:31 开始OS 未能取到VMSTAT 说明OS 已HANG. 并且系统负载极低。

关于IPC问题 网络信息常见的内核参数

net.ipv4.ipfrag_high_thresh = 8388608
net.ipv4.ipfrag_low_thresh = 6291456
net.ipv4.ipfrag_max_dist = 64
net.ipv4.ipfrag_secret_interval = 0
net.ipv4.ipfrag_time = 60

Note:
可见是调整过比默认增加了1倍,但是还是再建议增加1倍做为最佳实践,但当前的系统负载,这问题也不大。

对于OS的突然hang像WINDOWS的蓝屏,LINUX的panic ,通常要看OS CRASH时的转储信息,配置了panic或KDUMP会把问题时的内存信息存下来,如生成OS Debug的vmcore或core文件,可以使用crash进一步分析。

找了了最近3次的vmcore_demsg信息

[2379924.043290] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[2510402.232207] sched: RT throttling activated
[2590470.081517] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 41
[2590470.082069] CPU: 41 PID: 0 Comm: swapper/41 Tainted: G OE 4.1.12-61.1.28.el6uek.x86_64 #2
[2590470.082445] Hardware name: HP ProLiant DL580 G7, BIOS P65 10/01/2013
[2590470.082545] 0000000000000000 ffff881fffa46a10 ffffffff816c6e40 ffffffff8198f0d8
[2590470.082887] 0000000000000000 ffff881fffa46a90 ffffffff816c6b9c 0000000000000010
[2590470.083237] ffff881fffa46aa0 ffff881fffa46a40 0000000000000000 0000000000000000
[2590470.083363] Call Trace:
[2590470.083761] <NMI> [<ffffffff816c6e40>] dump_stack+0x63/0x83
[2590470.084074] [<ffffffff816c6b9c>] panic+0xc1/0x210
[2590470.084366] [<ffffffff81135abe>] watchdog_overflow_callback+0xae/0xb0
[2590470.085259] [<ffffffff8117b337>] __perf_event_overflow+0x97/0x230
[2590470.085828] [<ffffffff8102e600>] ? x86_perf_event_set_period+0xf0/0x180
[2590470.086205] [<ffffffff8117bc14>] perf_event_overflow+0x14/0x20
[2590470.086635] [<ffffffff81037924>] intel_pmu_handle_irq+0x224/0x4c0
[2590470.086872] [<ffffffff81020a85>] ? native_sched_clock+0x35/0xc0
[2590470.087266] [<ffffffff810209e9>] ? sched_clock+0x9/0x10
[2590470.087500] [<ffffffff8102dc94>] perf_event_nmi_handler+0x34/0x60
[2590470.087740] [<ffffffff8101afe7>] nmi_handle+0x87/0x150
[2590470.088039] [<ffffffff81056d50>] ? hw_nmi_get_sample_period+0x20/0x20
[2590470.088142] [<ffffffff8101b2ce>] default_do_nmi+0x5e/0x130
[2590470.088389] [<ffffffff8101b495>] do_nmi+0xf5/0x170
[2590470.088793] [<ffffffff816cdec5>] end_repeat_nmi+0x1a/0x1e
[2590470.089048] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[2590470.089338] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[2590470.089792] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[2590470.089875] <<EOE>> [<ffffffff8159bd1d>] cpuidle_enter_state+0x8d/0x230
[2590470.089998] [<ffffffff8159bed7>] cpuidle_enter+0x17/0x20
[2590470.090090] [<ffffffff810c8ed4>] cpuidle_idle_call+0xe4/0x1e0
[2590470.090393] [<ffffffff810c91c5>] cpu_idle_loop+0x1f5/0x220
[2590470.090481] [<ffffffff810c920b>] ? cpu_startup_entry+0x1b/0x70
[2590470.090574] [<ffffffff810c924f>] cpu_startup_entry+0x5f/0x70
[2590470.090668] [<ffffffff81052d3b>] start_secondary+0xbb/0xc0

----

75.587647] UltraPath started successfully.
[ 167.110162] perf interrupt took too long (5797 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 193.801534] hugetlbfs: oracle (17463): Using mlock ulimits for SHM_HUGETLB is deprecated
[ 237.641653] perf interrupt took too long (10426 > 9615), lowering kernel.perf_event_max_sample_rate to 13000
[ 406.494152] nr_pdflush_threads exported in /proc is scheduled for removal
[ 416.460360] hrtimer: interrupt took 2147246 ns
[ 788.112662] perf interrupt took too long (21138 > 17857), lowering kernel.perf_event_max_sample_rate to 7000
[25467.877127] perf interrupt took too long (36178 > 31250), lowering kernel.perf_event_max_sample_rate to 4000
[928925.575289] ------------[ cut here ]------------
[928925.575676] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x240/0x250()
[928925.575684] NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
[928925.575686] Modules linked in: bonding ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ipmi_ssif ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support pcspkr serio_raw lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core igb dca tg3 e1000e ptp pps_core pcc_cpufreq acpi_cpufreq nxupext_a(OE) nxup(OE) sg ext4 jbd2 mbcache sr_mod cdrom sd_mod pata_acpi ata_generic ata_piix hpsa scsi_transport_sas lpfc scsi_transport_fc wmi radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
[928925.575779] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 4.1.12-61.1.28.el6uek.x86_64 #2
[928925.575783] Hardware name: HP ProLiant DL580 G7, BIOS P65 10/01/2013
[928925.575786] 0000000000000000 ffff880fff843d38 ffffffff816c6e40 ffff880fff843d88
[928925.575801] 000000000000012f ffff880fff843d78 ffffffff810845e5 ffff880fff850b80
[928925.575807] ffff880ff3600000 ffff880ff36003e0 ffff880ff770b200 0000000000000001
[928925.575812] Call Trace:
[928925.575815] <IRQ> [<ffffffff816c6e40>] dump_stack+0x63/0x83
[928925.575836] [<ffffffff810845e5>] warn_slowpath_common+0x95/0xe0
[928925.575841] [<ffffffff810846e6>] warn_slowpath_fmt+0x46/0x70
[928925.575869] [<ffffffffa0a64f60>] ? opt.61719+0x40/0xffffffffffff71d2 [e1000e]
[928925.575882] [<ffffffff81615d60>] dev_watchdog+0x240/0x250
[928925.575892] [<ffffffff8109df80>] ? __queue_work+0x3f0/0x3f0
[928925.575897] [<ffffffff81615b20>] ? __netdev_watchdog_up+0x80/0x80
[928925.575903] [<ffffffff810ecfa7>] call_timer_fn+0x47/0x160
[928925.575907] [<ffffffff810ed2c0>] run_timer_softirq+0x200/0x380
[928925.575912] [<ffffffff81615b20>] ? __netdev_watchdog_up+0x80/0x80
[928925.575919] [<ffffffff8108879a>] __do_softirq+0x10a/0x350
[928925.575925] [<ffffffff81088b55>] irq_exit+0x125/0x130
[928925.575931] [<ffffffff8105b964>] ? native_apic_msr_eoi_write+0x14/0x20
[928925.575937] [<ffffffff816ceab6>] smp_apic_timer_interrupt+0x46/0x60
[928925.575942] [<ffffffff816cc9fe>] apic_timer_interrupt+0x6e/0x80
[928925.575945] <EOI> [<ffffffff8159bd52>] ? cpuidle_enter_state+0xc2/0x230
[928925.575958] [<ffffffff8159bd25>] ? cpuidle_enter_state+0x95/0x230
[928925.575964] [<ffffffff8159bed7>] cpuidle_enter+0x17/0x20
[928925.575981] [<ffffffff810c8ed4>] cpuidle_idle_call+0xe4/0x1e0
[928925.575991] [<ffffffff810c91c5>] cpu_idle_loop+0x1f5/0x220
[928925.576001] [<ffffffff810c920b>] ? cpu_startup_entry+0x1b/0x70
[928925.576009] [<ffffffff810c924f>] cpu_startup_entry+0x5f/0x70
[928925.576014] [<ffffffff81052d3b>] start_secondary+0xbb/0xc0
[928925.576018] ---[ end trace eca7ca9111afe041 ]---
[928925.576399] e1000e 0000:0e:00.1 eth1: Reset adapter unexpectedly
[928928.988375] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[1177095.290528] RPC: Registered named UNIX socket transport module.
[1177095.290533] RPC: Registered udp transport module.
[1177095.290535] RPC: Registered tcp transport module.
[1177095.290536] RPC: Registered tcp NFSv4.1 backchannel transport module.
[1177095.330292] FS-Cache: Loaded
[1177095.352991] FS-Cache: Netfs 'nfs' registered for caching
[1177095.354053] NFS: server address does not match proto= option
[2454185.729236] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 54
[2454185.730340] CPU: 54 PID: 0 Comm: swapper/54 Tainted: G W OE 4.1.12-61.1.28.el6uek.x86_64 #2
[2454185.730482] Hardware name: HP ProLiant DL580 G7, BIOS P65 10/01/2013
[2454185.730579] 0000000000000000 ffff882fffb86a10 ffffffff816c6e40 ffffffff8198f0d8
[2454185.730706] 0000000000000000 ffff882fffb86a90 ffffffff816c6b9c 0000000000000010
[2454185.731099] ffff882fffb86aa0 ffff882fffb86a40 0000000000000000 0000000000000000
[2454185.731227] Call Trace:
[2454185.731266] <NMI> [<ffffffff816c6e40>] dump_stack+0x63/0x83
[2454185.731368] [<ffffffff816c6b9c>] panic+0xc1/0x210
[2454185.731716] [<ffffffff81135abe>] watchdog_overflow_callback+0xae/0xb0
[2454185.731819] [<ffffffff8117b337>] __perf_event_overflow+0x97/0x230
[2454185.731920] [<ffffffff8102e600>] ? x86_perf_event_set_period+0xf0/0x180
[2454185.732021] [<ffffffff8117bc14>] perf_event_overflow+0x14/0x20
[2454185.732115] [<ffffffff81037924>] intel_pmu_handle_irq+0x224/0x4c0
[2454185.732216] [<ffffffff81020a85>] ? native_sched_clock+0x35/0xc0
[2454185.732570] [<ffffffff810209e9>] ? sched_clock+0x9/0x10
[2454185.732653] [<ffffffff8102dc94>] perf_event_nmi_handler+0x34/0x60
[2454185.732745] [<ffffffff8101afe7>] nmi_handle+0x87/0x150
[2454185.732828] [<ffffffff81056d50>] ? hw_nmi_get_sample_period+0x20/0x20
[2454185.732925] [<ffffffff8101b2ce>] default_do_nmi+0x5e/0x130
[2454185.733006] [<ffffffff8101b495>] do_nmi+0xf5/0x170
[2454185.733081] [<ffffffff816cdec5>] end_repeat_nmi+0x1a/0x1e
[2454185.733166] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[2454185.733246] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[2454185.733325] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[2454185.733404] <<EOE>> [<ffffffff8159bd1d>] cpuidle_enter_state+0x8d/0x230
[2454185.733516] [<ffffffff8159bed7>] cpuidle_enter+0x17/0x20
[2454185.733600] [<ffffffff810c8ed4>] cpuidle_idle_call+0xe4/0x1e0
[2454185.733687] [<ffffffff810c91c5>] cpu_idle_loop+0x1f5/0x220
[2454185.733768] [<ffffffff810c920b>] ? cpu_startup_entry+0x1b/0x70
[2454185.734098] [<ffffffff810c924f>] cpu_startup_entry+0x5f/0x70
[2454185.734186] [<ffffffff81052d3b>] start_secondary+0xbb/0xc0

--------------------
[ 75.291118] UltraPath started successfully.
[ 75.929766] RPC: Registered named UNIX socket transport module.
[ 75.929773] RPC: Registered udp transport module.
[ 75.929776] RPC: Registered tcp transport module.
[ 75.929778] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 91.447029] FS-Cache: Loaded
[ 91.460755] FS-Cache: Netfs 'nfs' registered for caching
[ 205.842823] hugetlbfs: oracle (17664): Using mlock ulimits for SHM_HUGETLB is deprecated
[ 210.899295] perf interrupt took too long (5015 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 245.017290] perf interrupt took too long (11059 > 9615), lowering kernel.perf_event_max_sample_rate to 13000
[ 422.138190] nr_pdflush_threads exported in /proc is scheduled for removal
[ 686.266909] perf interrupt took too long (20079 > 17857), lowering kernel.perf_event_max_sample_rate to 7000
[12199.986042] perf interrupt took too long (36188 > 31250), lowering kernel.perf_event_max_sample_rate to 4000
[244199.960717] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 19
[244199.960858] CPU: 19 PID: 0 Comm: swapper/19 Tainted: G OE 4.1.12-61.1.28.el6uek.x86_64 #2
[244199.961005] Hardware name: HP ProLiant DL580 G7, BIOS P65 10/01/2013
[244199.961107] 0000000000000000 ffff882fff8c6a10 ffffffff816c6e40 ffffffff8198f0d8
[244199.961240] 0000000000000000 ffff882fff8c6a90 ffffffff816c6b9c 0000000000000010
[244199.961371] ffff882fff8c6aa0 ffff882fff8c6a40 0000000000000000 0000000000000000
[244199.961503] Call Trace:
[244199.961546] <NMI> [<ffffffff816c6e40>] dump_stack+0x63/0x83
[244199.961723] [<ffffffff816c6b9c>] panic+0xc1/0x210
[244199.961807] [<ffffffff81135abe>] watchdog_overflow_callback+0xae/0xb0
[244199.961915] [<ffffffff8117b337>] __perf_event_overflow+0x97/0x230
[244199.962020] [<ffffffff8102e600>] ? x86_perf_event_set_period+0xf0/0x180
[244199.962127] [<ffffffff8117bc14>] perf_event_overflow+0x14/0x20
[244199.962229] [<ffffffff81037924>] intel_pmu_handle_irq+0x224/0x4c0
[244199.962337] [<ffffffff81020a85>] ? native_sched_clock+0x35/0xc0
[244199.962430] [<ffffffff810209e9>] ? sched_clock+0x9/0x10
[244199.962513] [<ffffffff8102dc94>] perf_event_nmi_handler+0x34/0x60
[244199.962607] [<ffffffff8101afe7>] nmi_handle+0x87/0x150
[244199.962692] [<ffffffff81056d50>] ? hw_nmi_get_sample_period+0x20/0x20
[244199.962792] [<ffffffff8101b2ce>] default_do_nmi+0x5e/0x130
[244199.962876] [<ffffffff8101b495>] do_nmi+0xf5/0x170
[244199.962954] [<ffffffff816cdec5>] end_repeat_nmi+0x1a/0x1e
[244199.963041] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[244199.963123] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[244199.963205] [<ffffffff8139e5a4>] ? intel_idle+0xb4/0x120
[244199.963287] <<EOE>> [<ffffffff8159bd1d>] cpuidle_enter_state+0x8d/0x230
[244199.963404] [<ffffffff8159bed7>] cpuidle_enter+0x17/0x20
[244199.963490] [<ffffffff810c8ed4>] cpuidle_idle_call+0xe4/0x1e0
[244199.963580] [<ffffffff810c91c5>] cpu_idle_loop+0x1f5/0x220
[244199.963664] [<ffffffff810c920b>] ? cpu_startup_entry+0x1b/0x70
[244199.963756] [<ffffffff810c924f>] cpu_startup_entry+0x5f/0x70
[244199.963846] [<ffffffff81052d3b>] start_secondary+0xbb/0xc0 

Note:
可见3次的调用基本一样, 判断是OS 某个kernel bug导致的异常重启在Watchdog detected hard LOCKUP“硬锁定时的恐慌” 后(简单理解为硬死锁)。

关于调用的后半部分是
watchdog_timer_fn
这是一个高分辨率定时器功能,它在中断上下文中执行时定期更新特定于 CPU 的中断计数器。

watchdog_overflow_callback
此函数从不可屏蔽中断 (NMI) 上下文中调用。如果 CPU 繁忙,此函数会定期执行,并检查watchdog_timer_fn在过去的时间间隔内是否增加了特定于 CPU 的计数器。如果计数器没有增加,watchdog_overflow_callback则假定 CPU 在中断被禁用的内核代码部分中“锁定”,除非通过nmi_watchdog=nopanic内核命令行上的参数显式禁用“硬锁定时的恐慌”,否则会发生恐慌panic.

watchdog_timer_fn和的区间watchdog_overflow_callback是从/proc/sys/kernel/watchdog_thresh参数导出的。该参数的值默认为 10。

调用的前半部分是
cpu_idel cpu_idel_loop, cpu_startup
猜测是内核中的一种CPU调度器,如把CPU变会IDEL或从唤醒,达到CPU节省功耗性能目的。于与相关配置的如BIOS中CPU是否是最大性能模式还是自动节能模式和CPUspeed service.

查看当时的CPU 主频(cpuinfo)

egrep "processor|cpu MHz" cpuinfo
processor       : 0
cpu MHz         : 1149.520
processor       : 1
cpu MHz         : 1684.020
processor       : 2
cpu MHz         : 1074.690
processor       : 3
cpu MHz         : 1106.760
processor       : 4
cpu MHz         : 1074.690
processor       : 5
cpu MHz         : 1074.690
processor       : 6
cpu MHz         : 1170.900
processor       : 7
cpu MHz         : 1170.900
processor       : 8
cpu MHz         : 1758.850
processor       : 9
cpu MHz         : 2133.000
processor       : 10
cpu MHz         : 1416.770
processor       : 11
cpu MHz         : 1085.380
processor       : 12
cpu MHz         : 1149.520
processor       : 13
cpu MHz         : 1085.380
processor       : 14
cpu MHz         : 1064.000
processor       : 15
cpu MHz         : 1085.380
processor       : 16
cpu MHz         : 1074.690
processor       : 17
cpu MHz         : 1106.760
processor       : 18
cpu MHz         : 1117.450
processor       : 19
cpu MHz         : 1064.000
processor       : 20
cpu MHz         : 1085.380
processor       : 21
cpu MHz         : 1074.690
processor       : 22
cpu MHz         : 1074.690
processor       : 23
cpu MHz         : 1085.380
processor       : 24
cpu MHz         : 1064.000
processor       : 25
cpu MHz         : 1064.000
processor       : 26
cpu MHz         : 1085.380
processor       : 27
cpu MHz         : 1074.690
processor       : 28
cpu MHz         : 1074.690
processor       : 29
cpu MHz         : 1064.000
processor       : 30
cpu MHz         : 1374.010
processor       : 31
cpu MHz         : 1074.690
processor       : 32
cpu MHz         : 1523.670
processor       : 33
cpu MHz         : 1138.830
processor       : 34
cpu MHz         : 1074.690
processor       : 35
cpu MHz         : 1064.000
processor       : 36
cpu MHz         : 1064.000
processor       : 37
cpu MHz         : 1064.000
processor       : 38
cpu MHz         : 1064.000
processor       : 39
cpu MHz         : 1085.380
processor       : 40
cpu MHz         : 1074.690
processor       : 41
cpu MHz         : 1064.000
processor       : 42
cpu MHz         : 1064.000
processor       : 43
cpu MHz         : 1064.000
processor       : 44
cpu MHz         : 1064.000
processor       : 45
cpu MHz         : 1064.000
processor       : 46
cpu MHz         : 1064.000
processor       : 47
cpu MHz         : 1064.000
processor       : 48
cpu MHz         : 2133.000
processor       : 49
cpu MHz         : 2133.000
processor       : 50
cpu MHz         : 2131.024
processor       : 51
cpu MHz         : 1064.000
processor       : 52
cpu MHz         : 2133.000
processor       : 53
cpu MHz         : 2133.000
processor       : 54
cpu MHz         : 1064.000
processor       : 55
cpu MHz         : 1064.000
processor       : 56
cpu MHz         : 2133.000
processor       : 57
cpu MHz         : 1074.690
processor       : 58
cpu MHz         : 1064.000
processor       : 59
cpu MHz         : 1064.000
processor       : 60
cpu MHz         : 1074.690
processor       : 61
cpu MHz         : 1064.000
processor       : 62
cpu MHz         : 1064.000
processor       : 63
cpu MHz         : 2133.000

NOTE:
当前存在CPU降频现象, CPU主频都不一致,应该是节能模式。

Red Hat Enterprise Linux 6 带来了许多省电功能。在 RHEL 6 中,intel_idle 驱动程序可以覆盖 BIOS 默认值,并且必须通过“intel_idle.max_cstate=0”禁用,以确保在主机上完全禁用更高的 C 状态。

方案建议
对于其它类型数据库环境#2 #3通常也都是建议调整。
1,/proc/sys/kernel/watchdog_thresh 从10 到20
2, BIOS 选择Performance Per Watt Optimized(DAPC)模式,发挥CPU最大性能
3, 关闭C1E和C States等选项,检查当前 cat /proc/cmdline|grep max_cstate

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll intel_pstate=disable

也可以使用tuned profile 切到throughput-performance 模式.

— vm环境不要配置idle=poll

— over —

打赏

,

对不起,这篇文章暂时关闭评论。