Troubleshooting the Oracle wait event “reliable message”

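“reliable message” is a generic wait event that tracks many different kinds of channel communication inside Oracle Database. It is usually benign and can be ignored, but it needs diagnosing when it accounts for a large share of DB time: first establish whether the long total wait time comes from how frequently the wait occurs, and which SQL statements and packages are involved. It is seen fairly often on Oracle 11g, mainly because of two bugs. Troubleshooting High Waits for ‘Reliable Message’ (Doc ID 2017390.1) states: “If there is no performance issue, these waits can be ignored.”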

Top 10 Foreground Events by Total Wait Time

Event                           Waits      Total Wait Time (sec)  Wait Avg(ms)  % DB time  Wait Class
reliable message                2,762,866  13.6K                  5             63.8       Other
DB CPU                                     7181.1                               33.7
latch: ges resource hash list   19,384     372.3                  19            1.7        Other
enq: TX - index contention      20,936     278.9                  13            1.3        Concurrency
log file sync                   634,241    201.4                  0             .9         Commit
buffer busy waits               56,629     117.4                  2             .6         Concurrency
log file sequential read        152,460    44.1                   0             .2         System I/O

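To interpret these waits, look at the P1 value recorded with each wait. P1 is the channel context, which identifies the channel involved.

Getting the P1 value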

select
  to_char(p1, 'XXXXXXXXXXXXXXXX') event_param,
  count(*), sum(time_waited/1000000) time_waited
from gv$active_session_history
where event = 'reliable message'
group by to_char(p1, 'XXXXXXXXXXXXXXXX')
order by time_waited desc;
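
ASH covers history. For sessions that are waiting right now, a similar breakdown can be sketched against GV$SESSION, which exposes P1 as P1RAW (already in hexadecimal) along with the SQL_ID of the waiting statement. This is only a sketch and returns rows only while sessions are actually in the wait.

```sql
-- Sessions currently waiting on 'reliable message', grouped by channel
-- context (P1RAW) and SQL_ID. STATE = 'WAITING' excludes sessions whose
-- EVENT column still shows their last completed wait.
select
  inst_id, p1raw channel_context, sql_id,
  count(*) sessions, max(seconds_in_wait) max_seconds_in_wait
from gv$session
where event = 'reliable message'
  and state = 'WAITING'
group by inst_id, p1raw, sql_id
order by sessions desc;
```

The SQL_ID column addresses the "which SQL is involved" question directly, and the P1RAW value can be used as-is for the channel lookup.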

SQL> @sed reliable
Show wait event descriptions matching %reliable%..

EVENT# EVENT_NAME                                              WAIT_CLASS           PARAMETER1                PARAMETER2                PARAMETER3                ENQUEUE_NAME                   REQ_REASON                       REQ_DESCRIPTION
------ ------------------------------------------------------- -------------------- ------------------------- ------------------------- ------------------------- ------------------------------ -------------------------------- ----------------------------------------------------------------------------------------------------
   458 reliable message                                        Other                channel context           channel handle            broadcast message
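
If the @sed helper script is not installed in your environment, the same parameter names can be read from V$EVENT_NAME. A minimal equivalent:

```sql
-- Parameter meanings (P1/P2/P3) for the 'reliable message' wait event
select event#, name, wait_class, parameter1, parameter2, parameter3
from v$event_name
where name = 'reliable message';
```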

-- Query the base tables of V$CHANNEL_WAITS (run as SYS); &P1RAW is the hexadecimal P1 value obtained above

col NAME_KSRCDES format a60
SELECT b.addr "Channel context",
       b.totpub_ksrcctx,
       a.name_ksrcdes
  FROM X$KSRCDES a, 
       X$KSRCCTX b
 WHERE b.name_ksrcctx=a.indx
   AND b.addr='&P1RAW'
;
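
The lookup above takes one P1 value at a time. As a sketch, the two steps can be combined so that ASH samples are mapped straight to channel names. It assumes that ADDR in X$KSRCCTX is RAW(8), as on 64-bit platforms, so P1 is zero-padded to 16 hex digits before conversion. It also assumes that channel context addresses are local to each instance, so it reads the local V$ACTIVE_SESSION_HISTORY and should be run as SYS on each instance of interest.

```sql
-- Map ASH samples for 'reliable message' to channel names (local instance only).
-- Assumption: X$KSRCCTX.ADDR is RAW(8), so P1 is left-padded to 16 hex digits.
col channel_name format a60
select
  d.name_ksrcdes channel_name,
  count(*)       ash_samples
from v$active_session_history h,
     x$ksrcctx c,
     x$ksrcdes d
where h.event = 'reliable message'
  and c.addr = hextoraw(lpad(trim(to_char(h.p1, 'XXXXXXXXXXXXXXXX')), 16, '0'))
  and c.name_ksrcctx = d.indx
group by d.name_ksrcdes
order by ash_samples desc;
```

The same zero-padding applies when a value is pasted into &P1RAW by hand: TO_CHAR with the 'XXXXXXXXXXXXXXXX' mask pads with blanks, not zeros.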

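In this case the channel is readily identified as “Result Cache: Channel”, which falls under Bug 18416368. The bug is fixed in 12c; the workaround is to disable the result cache and perform a rolling restart of the instances.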

SQL> alter system set result_cache_max_size=0;
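
After the change, it is worth confirming that the setting took effect on every instance. A minimal check, assuming a RAC database and a user with access to GV$PARAMETER and the DBMS_RESULT_CACHE package:

```sql
-- Current setting on each instance; 0 means the result cache has no memory
select inst_id, name, value
from gv$parameter
where name = 'result_cache_max_size';

-- Status as reported by the result cache itself on the local instance
select dbms_result_cache.status from dual;
```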

V$CHANNEL_WAITS can also be queried directly:

select
  inst_id, channel, messages_published, wait_count,
  WAIT_TIME_USEC/1000000 wait_time_sec
from GV$CHANNEL_WAITS
order by inst_id, wait_time_sec desc;
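
Total wait time alone does not show whether a channel is slow per wait or simply waited on very often. A small variation on the query above, offered as a sketch, adds the average wait per channel. The counters in this view are assumed to be cumulative since instance startup, so a busy channel can rank high without being the current problem.

```sql
-- Average wait per channel; NULLIF avoids division by zero for idle channels
select
  inst_id, channel, wait_count,
  round(wait_time_usec / nullif(wait_count, 0) / 1000, 2) avg_wait_ms
from gv$channel_waits
order by wait_time_usec desc;
```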

Using HANGANALYZE to help find blockers

...
    which is waiting for 'reliable message' with wait info:
    {
                      p1: 'channel context'=0xcaa634a50
                      p2: 'channel handle'=0xcda5759d8
                      p3: 'broadcast message'=0xcda65b300
...
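
An excerpt like the one above comes from a hang analysis dump. One common way to produce it, sketched here for a RAC database from a SQL*Plus session connected as SYSDBA, is shown below. Level 3 is an assumed choice, not something the original case specifies. With -g all the output is typically written to the DIAG process trace file on each instance, not to the session's own trace file.

```sql
REM Run in SQL*Plus as SYSDBA
REM Attach oradebug to the current session and lift the trace file size limit
oradebug setmypid
oradebug unlimit
REM Cluster-wide hang analysis at level 3 (assumed level)
oradebug -g all hanganalyze 3
```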

Known issues

Document 1951729.1 Very High Waits for ‘reliable message’ After Upgrade to 11.2.0.4 When Using Result Cache
Document 1377830.1 Slow Performance Due to Waits for: “Enq: KO - Fast Object Checkpoint” and “Reliable Message”

Document 14589750.8 Bug 14589750 – TRUNCATE table hangs in RAC with “reliable message” wait if fix 14144283 present
Document 15826962.8 Bug 15826962 – High “reliable message” wait due to “RBR channel”
Document 13879664.8 Bug 13879664 – ASM hang due to ASM disk online waiting for “reliable message”
Document 9367132.8 Bug 9367132 – Process hang waiting on ‘reliable message’