首页 » ORACLE » MMNL生成非常大的trace文件,包含ASH dump, MMON进程不存在

MMNL生成非常大的trace文件,包含ASH dump, MMON进程不存在

前几日一套11.2.0.4.7 on suse 11 2nodes2 RAC中节点2的oracle软件目录使用率高,发现mmnl后台进程生成了64GB的trace文件, 包含大量的ash dump信息, 下面简单的记录一下该问题。

Mmnl trace file

=========================

*** 2018-08-22 20:33:13.512
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(74) (!= 73) at 22-AUG-18 08.53.28.796 PM

*** 2018-08-22 20:53:28.796
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(75) (!= 74) at 22-AUG-18 09.03.37.184 PM

*** 2018-08-22 21:03:37.184
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(76) (!= 75) at 22-AUG-18 09.13.36.996 PM

*** 2018-08-22 21:13:36.996
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453

*** 2018-08-22 21:23:37.692
kewa_sampler: Re-signalling MMON(77) (!= 76) at 22-AUG-18 09.23.37.692 PM
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kgskmetricupd: could not get change-schema latch
kgskmetricupd: could not get change-schema latch
kewa_sampler: Re-signalling MMON(78) (!= 77) at 23-AUG-18 12.23.40.741 AM

*** 2018-08-23 00:23:40.741
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kgskmetricupd: could not get change-schema latch
kgskmetricupd: could not get change-schema latch

*** 2018-08-23 01:02:31.559
kgskmetricupd: could not get change-schema latch

*** 2018-08-23 01:04:30.886
kgskmetricupd: could not get change-schema latch
kewa_sampler: Re-signalling MMON(79) (!= 78) at 23-AUG-18 01.23.57.348 AM

*** 2018-08-23 01:23:57.349
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
ASH dump

*** 2018-08-23 01:38:45.614
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP HEADER BEGIN>>>
****************
SCRIPT TO IMPORT
****************
------------------------------------------
Step 1: Create destination table <ashdump>
------------------------------------------
CREATE TABLE ashdump AS
SELECT * FROM SYS.WRH$_ACTIVE_SESSION_HISTORY WHERE rownum < 0
----------------------------------------------------------------
Step 2: Create the SQL*Loader control file <ashldr.ctl> as below
----------------------------------------------------------------
load data
infile * "str '\n####\n'"
append
into table ashdump
fields terminated by ',' optionally enclosed by '"'
(
SNAP_ID CONSTANT 0 ,
DBID ,
INSTANCE_NUMBER ,

...
...
...
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP END>>>

Note:

Trace文件中包含大旱的ASH DUMP信息,MMNL后台进程负责当前活动会话历史记录(ASH)进行采样,存储在循环SGA内存缓冲区中的信息将通过快照或紧急刷新刷新到AWR。
如果循环缓冲区已满三分之二,则ASH紧急冲洗将启动。如果由于某种原因无法将信息写入AWR,则ASH会将其内容(如果没有重新启动,则每12小时一次)转储到MMNL跟踪文件。

将信息写入跟踪而不是AWR的可能原因包括:

ASH缓冲区太小<===首先检查ASH缓冲区是否太小
MMON进程没有响应<====然后,检查MMON进程没有响应的原因

1, 首先确认当前的ASH 大小是否紧张?

如果因为ash buffer size小而紧急刷新,可以从db alert log确认,也可以使用下面的sql确认:

SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info;

TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT
---------- --------------- -------------------------
 134217728           20529                      6828

Note:
这里的紧急刷新(AWR_FLUSH_EMERGENCY_COUNT)次数不少,可以调大该参数(可在线调整),确认ash buffer从v$ash_info.total_size, 默认“_ash_size”并非当前值,但是修改该内存大小是修改”_ash_size”参数值,可以每次尝试增长50%观察使用情况。

SQL> alter system set "_ash_size"=100m;
alter system set "_ash_size"=100m
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-02097: parameter cannot be modified because specified value is invalid

SQL> @pd ash_size
Show all parameters and session values from x$ksppi/x$ksppcv...

      INDX I_HEX NAME                                               VALUE                          DESCRIPTION
---------- ----- -------------------------------------------------- ------------------------------ ----------------------------------------------------------------------
      2647   A57 _olap_dimension_corehash_size                      30                             OLAP Dimension In-Core Hash Table Maximum Memory Use
      2731   AAB _kffmap_hash_size                                  1024                           size of kffmap_hash table
      2732   AAC _kffmop_hash_size                                  2048                           size of kffmop_hash table
      2840   B18 _ash_size                                          1048618                        To set the size of the in-memory Active Session History buffers

SQL> alter system set "_ash_size"=200m;
System altered.

SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info;

TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT
---------- --------------- -------------------------
 209715200           20531                      6829

Note:
如果调整的值小于当前值会提示ora-2097错误,但是在12C之前的版本超过了“_ash_size”的最大上限254Mb,同样会提示ora-2097错误。 但是从12c以后版本_ash_size值可以调整到大于254MB.

2. 检查MMON

既然写了这么多的ASH dump说明很多次是flush AWR失败的,  负责AWR Snapshot收集的是MMON进程, 然后检查MMON trace无日志文件。

oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmon
grid 8576 1 0 2016 ? 03:29:29 asm_mmon_+ASM2

oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmnl
grid 8578 1 0 2016 ? 2-17:07:26 asm_mmnl_+ASM2
oracle 9176 1 1 2016 ? 11-20:41:46 ora_mmnl_ANBOB2
 
SQL> col systimestamp form a35
SQL> col most_recent_snap_time form a25
SQL> col snap_interval form a17
SQL> select systimestamp, most_recent_snap_time, snap_interval from wrm$_wr_control where dbid = (select dbid from v$database);

SYSTIMESTAMP MOST_RECENT_SNAP_TIME SNAP_INTERVAL
----------------------------------- ------------------------- -----------------
24-AUG-18 09.36.27.638363 AM +08:00 24-AUG-18 09.00.33.122 AM +00000 01:00:00.0

SQL> select thread#,open_time from v$thread;

THREAD# OPEN_TIME
---------- -----------------
1 20161225 17:46:00
2 20160226 14:31:03

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ WRITE

SQL> col instance_number form 999 head INST
SQL>  col begin_interval_time form a25
SQL>  col flush_elapsed form a17
SQL>  col status form 999
SQL>  col error_count form 999 head ERR
SQL>  col snap_flag form 999 head SNAP
SQL> select * from
      (select snap_id,
      instance_number,
      begin_interval_time,
      flush_elapsed,
      status,
      error_count,
      snap_flag
      from wrm$_snapshot
      where dbid = (select dbid from v$database)
      and instance_number=2   --# instance 2
      order by snap_id desc)
      where rownum <= 10
      order by snap_id ;
 -- none --

Note:
该DB实例的MMON进程都不存在,且近段时间均为未生成AWR snapshot.

AWR没有生成SNAPSHOT的原因:
1,数据库配置问题和当前状态

1.1 实例参数STATISTICS_LEVEL
1.2 数据库没打开
1.3 数据库打开没有在READ WRITE模式
1.4 配置的间隔时间问题

2,数据库MMON后台进程问题

2.1 MMON进程未启动
2.2 MMON进程挂起
2.3 MMON在收集某项时时间异常久,没有完成
2.4 SYSAUX表空间可用空间紧张

实例1的AWR正常,只有实例2的存在这现象,此时可以尝试手动生成AWR Snapshot,检查是否有报错或是某项收集时间长。

SQL> exec dbms_workload_repository.create_snapshot;
SQL> exec dbms_workload_repository.create_snapshot;
SQL> 

SQL> select * from
  2        (select snap_id,
  3        instance_number,
  4        begin_interval_time,
  5        flush_elapsed,
  6        status,
  7        error_count,
  8        snap_flag
  9        from wrm$_snapshot
 10        where dbid = (select dbid from v$database)
 11        and instance_number=2   --# instance 2
 12        order by snap_id desc)
 13        where rownum <= 10
 14        order by snap_id ;

  Snap Id INST BEGIN_INTERVAL_TIME       FLUSH_ELAPSED     STATUS  ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
    22765    2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2      0    0    1
    22766    2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3      0    0    1

Note:
手动是可以生成,说明在AWR实际收集是没有问题的,只是调度进程出了问题, 当然如果手动收集时间都异常久可以配置AWR debug trace查看具体日志

Gathering MMON action trace and snapshot flush trace:
SQL> alter session set "_swrf_test_action" = 28;
Session altered. 

SQL> alter session set "_swrf_test_action" = 10; 
Session altered.

等待AWR snapshot生成后,关闭TRACE对应_swrf_test_action值使用29和11, 同时还有awr_test dump event level(1,2,3,4)可以trace AWR的过程。该事件虽然是alter session, 但影响是实例级的。

对于本案例主要是因为MMON进程不存在,配置EVENT已没有意义, 下一步查看db alert log 查找MMON进程没有启动或停止的原因。

# alert log
Sat Sep 16 05:56:38 2017
VKRM started with pid=87, OS id=44418
Sat Sep 16 07:24:10 2017
Thread 2 advanced to log sequence 3377 (LGWR switch)
  Current log# 3 seq# 3377 mem# 0: +DATADG/yygk/redo03.log
Sat Sep 16 09:14:04 2017
ORA-25455 encountered when generating server alert SMG-3500
Sat Sep 16 09:15:04 2017
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1]
Errors in file /oracle/app/oracle/diag/rdbms/yygk/yygk2/trace/yygk2_mmon_9174.trc  (incident=288273):
ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] []
Incident details in: /oracle/app/oracle/diag/rdbms/yygk/yygk2/incident/incdir_288273/yygk2_mmon_9174_i288273.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sat Sep 16 09:15:06 2017
Dumping diagnostic data in directory=[cdmp_20170916091506], requested by (instance=2, osid=9174 (MMON)), summary=[incident=288273].
Sat Sep 16 14:58:12 2017
Thread 2 advanced to log sequence 3378 (LGWR switch)
  Current log# 4 seq# 3378 mem# 0: +DATADG/yygk/redo04.log
Sun Sep 17 01:56:39 2017
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Sun Sep 17 05:56:39 2017
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sun Sep 17 05:56:39 2017
Starting background process VKRM
Sun Sep 17 05:56:39 2017


oracle@anbob-2:/home/oracle>vi /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc
Dump file /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/db
System name:    Linux
Node name:      anbob-2
Release:        3.0.76-0.11-default
Version:        #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)
Machine:        x86_64
Instance name: anbob2
Redo thread mounted by this instance: 2
Oracle process number: 34
Unix process pid: 9174, image: oracle@anbob-2 (MMON)


*** 2017-09-16 09:15:05.098
*** SESSION ID:(4795.1) 2017-09-16 09:15:05.098
*** CLIENT ID:() 2017-09-16 09:15:05.098
*** SERVICE NAME:(SYS$BACKGROUND) 2017-09-16 09:15:05.098
*** MODULE NAME:() 2017-09-16 09:15:05.098
*** ACTION NAME:() 2017-09-16 09:15:05.098

Dump continued from file: /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_mmon_9174.trc
ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] []

========= Dump for incident 288273 (ORA 7445 [kocdsgt()+368]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1]
Registers:
%rax: 0x00007f14f45e3640 %rbx: 0x0000000000000000 %rcx: 0x0000000000000008
%rdx: 0x0140000700000100 %rdi: 0x00007f14f4325568 %rsi: 0x0000000000000000
%rsp: 0x00007fffbec8d8c0 %rbp: 0x00007fffbec8db60  %r8: 0x0000000000000000
 %r9: 0x00007f14f4325d70 %r10: 0x00007f14f4326570 %r11: 0x0000000000000000
%r12: 0x00007f14f4344e20 %r13: 0x0000000000002010 %r14: 0x0000000000000000
%r15: 0x00007fffbec8dca8 %rip: 0x0000000009929640 %efl: 0x0000000000010203
  kocdsgt()+352 (0x9929630) add $-48,%rdx
  kocdsgt()+356 (0x9929634) mov %rdx,-0x40(%rbp)
  kocdsgt()+360 (0x9929638) mov -0x40(%rbp),%rax
  kocdsgt()+364 (0x992963c) mov 0x8(%rax),%rdx
> kocdsgt()+368 (0x9929640) movzwl (%rdx),%ebx
  kocdsgt()+371 (0x9929643) mov -0x38(%rbp),%rcx
  kocdsgt()+375 (0x9929647) movzwl 0xc(%rcx),%esi
  kocdsgt()+379 (0x992964b) cmp %esi,%ebx
  kocdsgt()+381 (0x992964d) jnz 0x992982b
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----

location             type  
-------------------- ------
skdstdst()+41        <  ksedst1()+103        <  ksedst()+39          <  dbkedDefDump()+2746  <  
ksedmp()+41          <  ssexhd()+2462        <  __sighandler()       <  kocdsgt()+368        signal
kocioc()+648         <  kokdn2p()+520        <  kodin2p()+176        <  kocapm()+1051        <  koioapm()+525     <  
OCITypeByName()+816  <  keltgtyp()+75        <  kpotcuexec()+254     <  rpiswu2()+1776       <  
OCIKSwitchUserWithPriv()+200    <      keltenq()+583        <  
keltpost()+1812      <  kelr_issue_alert()309   kelr_monitor_metrics()+3631   <              
ksb_run_managed_action()+384  <                ksbcti()+2490        <  ksbabs()+1735        <  
kebm_mmon_main()+209 <                         ksbrdp()+1045        <  opirip()+623         <  opidrv()+603         <  
sou2o()+103          <  opimai_real()+250    <  ssthrdmain()+265     <  
main()+201           

Note:
根据DB ALERT LOG可以发现该实例启动后,过了一段时间MMON进程因ORA-7445内部错误挂掉后就一直没有再启动,AWR SNAPSHOT也一直没有再生成。 MOS中有个相似的CASE。 Bug 22494766 : ORA-7445[KOCCNGT+74] ON MMON PROCESS AFTER ORA-25455

解决方法
1,重启实例
或者
2. 临时启动并禁用限制模式
ALTER SYSTEM enable restricted session;
ALTER SYSTEM disable restricted session;

因为这是一个新交维的边缘库,平时没有什么新连接和活动,但是重启实例目前是不可以的。 启用限制模式只影响新连接请求,已有连接无影响,我们采用方案2.
— db alert log

2018-08-24 09:01:23.236000 +08:00
Thread 2 advanced to log sequence 6820 (LGWR switch)
  Current log# 4 seq# 6820 mem# 0: +DATADG/anbob/redo04.log
2018-08-24 10:25:59.102000 +08:00
Stopping background process MMNL
2018-08-24 10:26:00.101000 +08:00
Starting background process MMON
MMON started with pid=35, OS id=76908
Starting background process MMNL
MMNL started with pid=120, OS id=76910
ALTER SYSTEM enable restricted session;
ALTER SYSTEM disable restricted session;
2018-08-24 10:26:01.811000 +08:00
minact-scn: Inst 2 is a slave inc#:40 mmon proc-id:76908 status:0x2
minact-scn status: grec-scn:0x0ccf.d5bb5cdf gmin-scn:0x0ccf.d5b7d696 gcalc-scn:0x0ccf.d5bb3a01
Some DDE async actions failed or were cancelled

oracle@anbob-2:/home/oracle>ps -ef|grep ora_m|grep -v grep
oracle    9144     1  0  2016 ?        02:07:06 ora_mman_anbob2
oracle    9184     1  0  2016 ?        01:55:18 ora_mark_anbob2
oracle   76908     1  0 10:25 ?        00:01:35 ora_mmon_anbob2
oracle   76910     1  1 10:25 ?        00:05:52 ora_mmnl_anbob2

SQL> select * from
  2        (select snap_id,
  3        instance_number,
  4        begin_interval_time,
  5        flush_elapsed,
  6        status,
  7        error_count,
  8        snap_flag
  9        from wrm$_snapshot
 10        where dbid = (select dbid from v$database)
 11        and instance_number=2   --# instance 2
 12        order by snap_id desc)
 13        where rownum <= 10 14 order by snap_id ; 
Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP
 --------- ---- ------------------------- ----------------- ------ ---- ---- 
22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1 
22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1 
SQL> /

  Snap Id INST BEGIN_INTERVAL_TIME       FLUSH_ELAPSED     STATUS  ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
    22765    2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2      0    0    1
    22766    2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3      0    0    1
    22767    2 24-AUG-18 09.52.24.661 AM +00000 00:00:05.1      0    0    0
    22768    2 24-AUG-18 11.00.09.946 AM +00000 00:00:04.0      0    0    0
    22769    2 24-AUG-18 12.00.22.008 PM +00000 00:00:03.2      0    0    0
    22770    2 24-AUG-18 01.00.33.819 PM +00000 00:00:05.1      0    0    0
    22771    2 24-AUG-18 02.00.45.857 PM +00000 00:00:04.0      0    0    0
    22772    2 24-AUG-18 03.00.57.960 PM +00000 00:00:04.0      0    0    0
    22773    2 24-AUG-18 04.00.09.952 PM +00000 00:00:03.9      0    0    0
    22774    2 24-AUG-18 05.00.22.065 PM +00000 00:00:04.1      0    0    0

后期确认数据库实例已经可以正常收集AWR,MMON记启动,同时MMNL进程已重启,(KILL两进程对实例可用性无影响,该进程可以自动重启)。MMNL TRACE没有再生成那么大量的日志。

— enjoy it —

打赏

,

对不起,这篇文章暂时关闭评论。