Oracle 12c/19c ADR trace dest disk busy (100%) when ‘ls’ trace files

最近遇到几次故障升级oracle 12c后，相同的硬件有几次instance crash同时伴有LGWR 核心进程N seconds not move现象，OSW中vmstat ‘B’列会伴有突然大量的blocked（通常是I/O）问题，mpstat/iostat 显示$ORACLE_BASE所在本地文件系统出现90-100% busy现象， ps 显示LGWR和一些FG进程同时在等待相同事OS Kernel function address。

环境Oracle 12C r2 2nodes-RAC on HPUX 11.31, 从12c开始确实从ADR trace目录中比11G前trace文件数有所增加，尤其还有些潜在BUG如《Oracle12c R2注意事项：又一个BUG 生成大量的trace 含 kjshash() kjqghd()》中记录过1天产生近90万trace，如文件系统inode使用率过，过多的小文件在同一目录中是会导致文件检索效率变差，Veritas建议将单个VxFS目录中的文件数限制为100,000个文件。

当前出现oracle ADR所在本地disk busy的inode使用并不高，前且统计在DB trace目录1600个文件时ls 也有近4分钟时间， ls同时会出现 disk busy 100%, tusc ls进程也基本是花在getdents调用上。当前是找出哪个进程导致阶段性disk busy? 如何解决disk busy?

racle@anbob2:/home/oracle> sar -d 1 3                                                                                                                                                 
HP-UX anbob2 B.11.31 U ia64    12/03/20

18:07:55   device   %busy   avque   r+w/s  blks/s  avwait  avserv
18:07:56    disk4   98.02    0.50     240    3784    0.00    4.11   # local_vg oracle lv
            disk5    0.00    0.50       1      16    0.00    0.21
         disk3054    0.00    0.50       4      14    0.00    0.07
         disk3022    0.00    0.50      20    1576    0.00    0.17

Glance查看

#按文件系统
Glance 12.02.008                14:51:35   anbob2      ia64                                   Current Avg  High
---------------------------------------------------------------------------------------------------------------
CPU  Util   S    SU           UW                                                              | 16%   18%   23%
Disk Util   F                                                                           F     | 97%   99%  100%
Mem  Util   S                  SU                                    UF   F                   | 53%   53%   53%
Networkil   U                                                          UR                     | 51%   51%   51%
---------------------------------------------------------------------------------------------------------------
                                                                  IO BY FILE SYSTEM                Users=   15
Idx   File System            Device               Type     Logl IO     Phys IO
---------------------------------------------------------------------------------------------------------------
   1 /                       /dev/vg00/lvol3      vxfs      na/   na   0.1/  1.0
   2 /stand                  /dev/vg00/lvol1      vxfs      na/   na   0.1/  0.7
   3 /var                    /dev/vg00/lvol8      vxfs      na/   na   0.1/  5.7
   4 /usr                    /dev/vg00/lvol7      vxfs      na/   na   0.5/  0.8
   5 /topteact               /dev/.../lv_topteact vxfs      na/   na   0.0/  0.0
   6 /tmp                    /dev/vg00/lvol6      vxfs      na/   na   0.7/  0.7
   9 /ptfs                   /dev/.../fslv_ptfs   vxfs      na/   na   0.0/  0.0
  10 /patrol                 /dev/.../fslv_patrol vxfs      na/   na   0.0/  1.2
  11 /oracle                 /dev/.../fslv_oracle vxfs      na/   na 246.1/275.5


#按disk排序
glance u
  Glance 12.02.008                14:51:51   anbob2      ia64                                   Current Avg  High
---------------------------------------------------------------------------------------------------------------
CPU  Util   S     SU               UW                                                         | 20%   18%   23%
Disk Util   F                                                                           F     | 97%   99%  100%
Mem  Util   S                  SU                                    UF   F                   | 53%   53%   53%
Networkil   U                                                          UR                     | 51%   51%   51%
---------------------------------------------------------------------------------------------------------------
                                                                      IO BY DISK                   Users=   15
                                                      Logl       Phys       Serv
Idx   Device        Util    Qlen       KB/Sec          IO         IO        Time
---------------------------------------------------------------------------------------------------------------
   1 disk5_p2    0.0/0.1     0.0    13.0/  108.2    na/   na   1.1/  9.6    0.00
   2 disk4       7.1/100     0.0    2087/   2367    na/   na 243.4/277.3    4.03
   3 disk3059    1.5/1.3     0.0   14155/  14388    na/   na 123.0/121.6    0.45
   4 disk3054    0.1/0.0     0.0     4.4/   12.6    na/   na   3.2/  3.8    0.58
   5 disk3020    0.9/0.8     0.0   150.7/  700.9    na/   na  17.6/ 38.3    0.54
# 按IO进程排序
glance o 
  Glance 12.02.008                17:53:19   anbob2      ia64                                   Current Avg  High
---------------------------------------------------------------------------------------------------------------
CPU  Util   S    SU         UW                                                                | 14%   14%   14%
Disk Util   F                                                                             F   | 98%   99%   99%
Mem  Util   S                  SU                                      UF   F                 | 54%   54%   54%
Networkil   U                                                          UR                     | 51%   51%   51%
---------------------------------------------------------------------------------------------------------------
                      INTERESTING PROCESS THRESHOLD OPTIONS          
                         
Display processes with resource usage:              Current Thresholds:

      CPU Utilization             >                 (1.0                %      )
      Disk I/O Rate               >                 (1.0                IOs/sec)
      Resident Set Size           >                 (20                 Mbytes )
      Virtual Set Size            >                 (500                Mbytes )
      User name                   =                 (all                       )
      Program name                =                 (all                       )
      TTY path name               =                 (all                       )
      Use match logic (and/or)    :                 (or                        )
      Sort key (name/cpu/disk/rss): disk            (disk                      )

      Glance started or last reset: 12/03/2020  17:53:12  
      Current refresh interval        : 5 seconds  


Glance 12.02.008                18:04:55   anbob2      ia64                                   Current Avg  High
---------------------------------------------------------------------------------------------------------------
CPU  Util   S   SU      UW                                                                    | 11%   13%   18%
Disk Util   F                                                                              F  | 99%   78%  100%
Mem  Util   S                  SU                                      UF   F                 | 54%   54%   54%
Networkil   U                                                          UR                     | 51%   51%   52%
---------------------------------------------------------------------------------------------------------------
                                                                     PROCESS LIST                  Users=   12
                         User       CPU %   Thrd   Disk        Memory     Block
Process Name         PID Name    (9600% max) Cnt IO rate    RSS      VSS   On
---------------------------------------------------------------------------------------------------------------
pmgreader          20491 itmuser       0.3     1  225.1    4.6mb    8.5mb    IO    <<<<<<<
ora_lg00_tbc        6511 oracle        1.3     1   63.8   34.2mb   47.9mb SLEEP 
replicat           15014 oracle        0.1     4   20.7   81.6mb    135mb SLEEP 
oracletbcse2        9258 oracle        0.5     1   18.0   42.2mb   72.6mb  PIPE 
oracletbcse2       15113 oracle        0.1     1   16.5   34.5mb   51.3mb  PIPE

本地oracle_base在lv_oracle上/oracle ，下层为disk4, 2块物理机械盘, 所以在IOPS在240时 disk busy 超过了90%. 当然使用glance也找到了导致间断性disk busy的进程是本地部署的监控应用pmgreader在读oracle ADR trace目录中的alert日志。停止该监控后不再出现disk busy. 但是问题的原根本原因在为什么ls 1600多个文件的目录会出现这么大的性能影响？

去年在《Troubleshooting Slower IO Performance on Veritas for 11.2.0.4 compared 10gR2 on RAW device after RMAN migrate》案例中遇到过一个因为数据库迁移后出现的IO慢性能问题案例，当时就是因为使用的是veritas的Vxfs, 是因为vxfs中的碎片太多导致。

当文件系统<2TB的文件系统，Vxfs默认使用1K block size，这也是oracle binary directory建议的块大小， VxFS 有点像oracle的segment 结构也是block > extents. 有个经验值当<8k的extents 占用total extents 5%时说明当前的VxFS是糟糕的。

oracle@anbob2:/home/oracle> df -o s /oracle/app/oracle/diag/rdbms/prianbob/anbob2/trace/
/oracle                (/dev/localvg/fslv_oracle) :
Free Extents by Size
          1:      67999            2:      81451            4:      83073  
          8:     135547           16:     121548           32:     105307  
         64:      87917          128:      84479          256:      56530  
        512:      32626         1024:      15477         2048:       4959  
       4096:        715         8192:         79        16384:          3  
      32768:          3        65536:          1       131072:          0  
     262144:          1       524288:          1      1048576:          1  
    2097152:          2      4194304:          2      8388608:          1  
   16777216:          0     33554432:          0     67108864:          0  
  134217728:          0    268435456:          0    536870912:          0  
 1073741824:          0   2147483648:          0

检查文件系统碎片

 oracle@anbob2:/home/oracle>  vxfsstat -b /oracle

buffer cache statistics
   208512 Kbyte current    4221440 maximum
88821047312 lookups            95.42% hit rate
     5501 sec recycle age [not limited by maximum]
	 
oracle@anbob2:/opt/VRTS/bin> ls -l /oracle/app/oracle/diag/rdbms/prianbob/anbob2/trace/alert*.log
-rw-r-----   1 oracle     asmadmin   18217303 Dec 10 16:31 /oracle/app/oracle/diag/rdbms/prianbob/anbob2/trace/alert_anbob2.log
 
oracle@anbob2:/opt/VRTS/bin> ./fsmap -a /oracle/app/oracle/diag/rdbms/prianbob/anbob2/trace/alert*.log|grep -v Volume |awk '{print  $4}'|sort |uniq -c| sort -nk2
1055 1024
  76 2048
  14 3072
  14 4096
   5 5120
   4 6144
   2 7168
  15 8192
   2 9216
   3 11264
   1 13312
   1 15360
   7 16384
   1 22528
   1 31744
   3 32768
   1 35840
   1 65536
   1 75776
   1 81920
   1 197632
   1 206848
   1 262144
   1 312320
   1 314368
   1 393216
   1 524288
   1 676864
   1 698368
   1 779264
   1 929792
   2 1048576
   1 1319936
   3 2097152
   1 3179520

如果这个目录过去可能因为oracle的bug出现过几十万的trace file,导致该目录的VxFS metadata很大，如果后来及时删了，但未整理碎片，可能metadata依旧不会回收那ls 时依旧会需要加载这么大的metadata（no sure），类似oracle 的Full table scan，但是oracle 有buffer cache可以减少物理读， VxFS 有吗？后来咨询了HP原厂，有的那就是vxfs_bc_bufhwm，发现这个主机确实过去的集成商不知道出于什么原因调整过，而且调的参数值是不合理的，默认应该是0，表示自动调整，vxfs_bc_bufhwm 显示指定了值（下面是恢复后的默认值）。

 oracle@anbob2:/home/oracle> kctune |grep vxfs
vxfs_bc_bufhwm                               0  0             Immed
vxfs_ifree_timelag                          -1  -1            Immed

NAME vxfs_bc_bufhwm – VxFS buffer cache high water mark(determines the VxFS
buffer cache size)
SYNOPSIS
/usr/bin/kctune -s vxfs_bc_bufhwm={maximum size of buffer cache}
VALUE
Specify an integer value.
Minimum
6144 KB
Maximum
90% of kernel memory.
Default
0

DESCRIPTION
VxFS maintains a buffer cache in the kernel for frequently accessed
file system metadata in addition to the HP-UX kernel buffer cache that
caches the file data. vxfs_bc_bufhwm tunable parameter determines the
size of the VxFS buffer cache (the maximum amount of memory that can
be used to cache VxFS metadata).
// The maximum size of the metadata buffer cache is set (auto-tuned) at boot time based on system memory size, provided that the value of vxfs_bc_bufhwm is set to zero (default).
Like with the tunable vx_ninode, a large metadata buffer cache can help improve file system performance, especially during metadata-intensive loads (stat, create, remove, link, lookup operations).

NOTES
Use the vxfsstat command to monitor buffer cache statistics and inode cache usage. See the vxfsstat(1M) manual page.
Setting the vxfs_bc_bufhwm value too low can result in a system hang. Set the value of vxfs_bc_bufhwm to 5% or more of the system’s total physical memory if the system has 8 GB or less physical memory. Set the value of vxfs_bc_bufhwm to 2% or more of the system’s total physical memory if the system has more than 8 GB of physical memory. The higher the physical memory of the system, the lower you can set vxfs_bc_bufhwm. You can set vxfs_bc_bufhwm to as low as 0.5% of the system’s total physical memory if the system has much more than 8 GB of memory.
EXAMPLES
The following command sets the maximum of size of buffer cache at 300000:

# kctune -s vxfs_bc_bufhwm=300000
WARNINGS
Incorrectly tunning a parameter may adversely affect system performance. See the Storage Foundation Administrator’s Guide for more information about tuning parameters.
VxFS kernel tunable parameters are release specific. This parameter may be removed or the default value and usage may change in future releases. See the Storage Foundation Release Notes for information about changes to parameters, if any.

解决方案
尝试调整内核参数

/usr/bin/kctune -s vxfs_bc_bufhwm=0

oracle@anbob2:/home/oracle>  vxfsstat -b /oracle
buffer cache statistics
   688512 Kbyte current    4221440 maximum
88824047314 lookups            95.42% hit rate
     5501 sec recycle age [not limited by maximum]

再1次ls 做物理读，但第2次就恢复了正常，秒出结果。

另外一种解决方法是把trace目录rename，新创建trace目录再把文件复制回原目录。注意需要实例停止情况下做。