首页 » ORACLE 9i-23c » The Agent orarootagent.bin process Memory Leak(内存溢出) case

The Agent orarootagent.bin process Memory Leak(内存溢出) case

昨天有套数据库主机内存使用告警, 环境为11201 2NODES RAC on AIX,  发现orarootagent.bin进程占用了20G+的内存, 符合在该版本一个典型的ORACLE 内存溢出的BUG. 这里只是记录一下 分析思路.

1, 使用TOPas查看当时的主机资源, Cpu比较空闲, 但内存已耗尽.

Topas Monitor for host:    anbob1                 EVENTS/QUEUES    FILE/TTY
Tue Dec  6 15:23:20 2016   Interval:  2         Cswitch    8432  Readch  5265.7K
                                                Syscall    7824  Writech   19515
CPU  User%  Kern%  Wait%  Idle%                 Reads      1320  Rawin         0
ALL   12.5    0.7    0.1   86.7                 Writes       72  Ttyout      732
                                                Forks         0  Igets         0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Execs         0  Namei        41
en6      17.7     16.0    17.0     8.1     9.6  Runqueue    1.5  Dirblk        0
lo0      11.2      7.0     7.0     5.6     5.6  Waitqueue   0.0
en8       0.9      1.5     1.0     0.1     0.8
                                                PAGING           MEMORY
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  Faults      233  Real,MB   47360
hdisk53   0.0    64.0      2.0    0.0    64.0   Steals        0  % Comp     82
hdisk71   0.0    64.0      3.0   40.0    24.0   PgspIn        0  % Noncomp  11
hdisk49   0.0    56.0      2.0    0.0    56.0   PgspOut       0  % Client   11
hdisk31   0.0    56.0      1.0    0.0    56.0   PageIn        0
hdisk90   0.0    40.0      1.0    0.0    40.0   PageOut       0  PAGING SPACE
hdisk88   0.0    40.0      2.0   24.0    16.0   Sios          0  Size,MB   35072
hdisk24   0.0    40.0      1.0    0.0    40.0                    % Used     26
hdisk22   0.0    40.0      1.0    0.0    40.0   NFS (calls/sec)  % Free     74

2,查看最耗内存的进程 as root

nmon  t  4

+-topas_nmon--N=NFS--------------Host=anbob1-----------Refresh=2 secs---15:52.45-------------------------------
| Top-Processes-(468) -----Mode=4  [1=Basic 2=CPU 3=Perf 4=Size 5=I/O 6=Cmds]----------------------------------
|  PID       %CPU     Size      Res     Res      Res     Char    RAM      Paging         Command               
|            Used       KB      Set     Text     Data     I/O     Use   io   other repage                      
|  299462     2.5 22360156 16291884    14236 16277648        1   35%      0      0      0 orarootagent.bin     
|  328362     1.0   247728   104912    14564    90348        4    0%      0      1      0 crsd.bin             
|  307472     4.4   208200   140048    91332    48716        0    0%      0      2      0 oracle               
|   86254     1.2   197656    68504    18780    49724        0    0%      0      0      0 ohasd.bin            
|  389652     1.3   189132   113528    91332    22196        0    0%      0      0      0 oracle               
|  480024     0.0   187892    98168    91332     6836        0    0%      0      0      0 oracle               
|  475924     0.2   183080    95780    91332     4448        0    0%      0      0      0 oracle               
|  336138     1.2   182964   110904    91332    19572        0    0%      0      5      0 oracle               
|  332038     1.4   182936   110900    91332    19568        0    0%      0      3      0 oracle               
|  393750     0.1   175092   106608    91332    15276        7    0%      0      0      0 oracle               
|  377352     0.2   174912   108380    91332    17048        0    0%      0      2      0 oracle               
|  459764     1.2   172716   104708    91332    13376        0    0%      0      0      0 oracle               

-- OR
anbob1:/> svmon -Pt15 | perl -e 'while(<>){print if($.==2||$&&&!$s++);$.=0 if(/^-+$/)}'
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  299462 orarootagent.  4173207    80289  1506768  5674447      Y     Y     N
  307472 oracle         1926122    80288     3764  1906650      Y     N     N
  389652 oracle         1919492    80288     5628  1901883      Y     N     N
  336138 oracle         1918836    80288     4704  1900341      Y     N     N
  332038 oracle         1918835    80288     4696  1900334      Y     N     N
  377352 oracle         1918338    80288     3348  1898328      Y     N     N
  393750 oracle         1917762    80288     3811  1898373      Y     N     N
  459764 oracle         1917287    80288     3729  1897779      Y     N     N
  562164 oracle         1916288    80288     3210  1896255      Y     N     N
  377342 oracle         1916213    80288     3123  1896093      Y     N     N
  413810 oracle         1916196    80288     2857  1895778      Y     N     N
  516942 oracle         1916083    80288     3121  1895954      Y     N     N
  446530 oracle         1916008    80288     3256  1896046      Y     N     N
  426534 oracle         1915885    80288     3219  1895887      Y     N     N
   49524 oracle         1915874    80288     3187  1895838      Y     N     N

NOTE:
最耗内存的是orarootagent进程,进程号为299462,占用了20G的内存. 已非常异常.

[oracle@anbob1:/home/oracle]# ps auxw|grep orarootagent.bin |grep -v grep
root     299462  0.1 35.0 22360388 16292116      - A      Jun 01 4977:12 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 

TIP:
orarootagent.bin => A specialized oraagent process that helps crsd manages resources owned by root, such as the network, and the Grid virtual IP address.The above 2 process are actually threads which looks like processes, One is spawned by ohasd and the other by crsd. more

[oracle@anbob1:/home/oracle]# while sleep 4; do ps auxw|grep orarootagent.bin |grep -v grep; done;
root     299462  0.1 35.0 22360428 16292156      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     299462  0.1 35.0 22360436 16292164      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     299462  0.1 35.0 22360440 16292168      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     299462  0.1 35.0 22360444 16292172      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     299462  0.1 35.0 22360452 16292180      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     299462  0.1 35.0 22360456 16292184      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     299462  0.1 35.0 22360460 16292188      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     299462  0.1 35.0 22360468 16292196      - A      Jun 01 4977:13 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126912 29916      - A      Jun 01 228:55 /oracle/app/11.2.0/grid/bin/orarootagent.bin 

Note:
可以发现orarootagent.bin pid=299462 还在以每4秒4KB的速度增加.

解决方法:
升级到11.2.0.2以后
OR
临时的解决 通过KILL orarootagent.bin 进程的方法释放内存.orarootagent.bin KILL后不会影响RAC, 该进程会立即自动启动.

anbob1:/> ps auxw|grep orarootagent.bin |grep -v grep
root     299462  0.1 36.0 22362572 16294300      - A      Jun 01 4977:42 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root      33390  0.0  0.0 126916 29920      - A      Jun 01 228:56 /oracle/app/11.2.0/grid/bin/orarootagent.bin 

anbob1:/> kill -9 299462

anbob1:/> crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

anbob1:/> ps auxw|grep orarootagent.bin |grep -v grep
root      33390  0.0  0.0 126916 36264      - A      Jun 01 228:56 /oracle/app/11.2.0/grid/bin/orarootagent.bin 
root     528664  0.0  0.0 94888 49692      - A    16:23:13  0:00 /oracle/app/11.2.0/grid/bin/orarootagent.bin 

AGENT log /oracle/app/11.2.0/grid/log/anbob1/agent/crsd/orarootagent_root/

2016-12-06 16:23:18.038: [   AGENT][1] ENV Logging level for Module: SuiteTes  1
2016-12-06 16:23:18.038: [   AGENT][1] ENV Logging level for Module: UiServer  0
2016-12-06 16:23:18.038: [    AGFW][1] Starting the agent: /oracle/app/11.2.0/grid/log/anbob1/agent/crsd/orarootagent_root/
2016-12-06 16:23:18.117: [   AGENT][1] Agent framework initialized, Process Id = 528664
2016-12-06 16:23:18.132: [ USRTHRD][1] Utils::getCrsHome crsHome /oracle/app/11.2.0/grid
2016-12-06 16:23:18.136: [ USRTHRD][1] Process::convertPidToString pid = 528664
2016-12-06 16:23:18.194: [CLSFRAME][1] Inited lsf context 110a6f6d0
2016-12-06 16:23:18.231: [CLSFRAME][1] Initing CLS Framework messaging
2016-12-06 16:23:18.266: [CLSFRAME][1] New Framework state: 2
2016-12-06 16:23:18.271: [CLSFRAME][1] M2M is starting...
2016-12-06 16:23:18.281: [ CRSCOMM][1] Ipc: Starting send thread
2016-12-06 16:23:18.287: [ CRSCOMM][515] Ipc: sendWork thread started.
2016-12-06 16:23:18.317: [ CRSCOMM][772] IpcC: IPC Client thread started listening
2016-12-06 16:23:18.317: [ CRSCOMM][772] IpcC: Received member number of 6
2016-12-06 16:23:18.323: [CLSFRAME][1] New IPC Member:{Relative|Node:0|Process:0|Type:1}:CRSD:anbob1
2016-12-06 16:23:18.356: [CLSFRAME][1] New process connected to us ID:{Relative|Node:0|Process:0|Type:1} Info:CRSD:anbob1
2016-12-06 16:23:18.372: [CLSFRAME][1] Starting thread model named: MultiThread
2016-12-06 16:23:18.372: [CLSFRAME][1] Starting thread model named: SingleThread
2016-12-06 16:23:18.377: [CLSFRAME][1] New Framework state: 3
2016-12-06 16:23:18.377: [    AGFW][1] Agent Framework started successfully
2016-12-06 16:23:18.377: [    AGFW][2057] Agfw engine module has enabled...
2016-12-06 16:23:18.377: [CLSFRAME][2057] Module Enabling is complete
2016-12-06 16:23:18.377: [CLSFRAME][2057] New Framework state: 6
2016-12-06 16:23:18.424: [    AGFW][2057] Agent is started with userid: root , expected user: root
2016-12-06 16:23:18.449: [    AGFW][2057] Agent sending message to PE: AGENT_HANDSHAKE[Proxy] ID 20484:14

确认内存已释放

Topas Monitor for host:    anbob1                 EVENTS/QUEUES    FILE/TTY
Tue Dec  6 16:30:46 2016   Interval:  2         Cswitch    7362  Readch  1340.7K
                                                Syscall    7555  Writech  854.7K
CPU  User%  Kern%  Wait%  Idle%                 Reads       233  Rawin         0
ALL   23.3    0.9    0.7   75.0                 Writes      218  Ttyout      699
                                                Forks         0  Igets         0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Execs         1  Namei       573
lo0    1667.2    127.6   127.6   833.6   833.6  Runqueue    1.5  Dirblk        0
en6     136.1    149.1   109.1    77.1    59.0  Waitqueue   0.0
en8       0.8      1.0     1.0     0.0     0.8
                                                PAGING           MEMORY
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  Faults      476  Real,MB   47360
hdisk18   0.0   520.0      2.0    8.0   512.0   Steals        0  % Comp     49
hdisk17   0.0   512.0      1.0    0.0   512.0   PgspIn        0  % Noncomp  13
hdisk73   0.0   512.0      1.0    0.0   512.0   PgspOut       0  % Client   13

如果短期内没有升级计划, 也可以在root 下增加crontab 根据大小(比如大于1GB)重启orarootagent.bin

ps auxw|grep rootagent|grep -v grep |awk '{if($5>1000000) print $2}'|xargs kill -9
打赏

对不起,这篇文章暂时关闭评论。