Troubleshooting ORA-07445 [__lwp_kill()+48] [SIGIOT] error and instance crash

Node 1 of an 11.2.0.3 RAC on HP-UX (ia31) restarted recently. Before the restart the database reported an ORA-7445 [__lwp_kill()+48] error.

# db alert

Sun Jul 26 03:05:26 2015
Thread 1 advanced to log sequence 44945 (LGWR switch)
  Current log# 5 seq# 44945 mem# 0: /dev/yyb_oravg02/ryyb_redo05
Sun Jul 26 03:05:29 2015
Archived Log entry 38703 added for thread 1 sequence 44944 ID 0x1474c95c dest 1:
Sun Jul 26 03:05:45 2015
Exception [type: SIGIOT, unknown code] [ADDR:0x6CA9] [PC:0xC0000000003125F0, __lwp_kill()+48] [exception issued by pid: 27817, uid: 1024] [flags: 0x0, count: 1]
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms3_27817.trc  (incident=704134):
ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT] [ADDR:0x6CA9] [PC:0xC0000000003125F0] [unknown code] []
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_704134/anbob1_lms3_27817_i704134.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Jul 26 03:05:47 2015
Dumping diagnostic data in directory=[cdmp_20150726030547], requested by (instance=1, osid=27817 (LMS3)), summary=[incident=704134].
Sun Jul 26 03:05:49 2015
PMON (ospid: 27789): terminating the instance due to error 484

# anbob1_lms3_27817.trc trace

*** 2015-07-26 03:05:45.259
SKGXP:[9ffffffffd5834d8.5]{0}: (27817 25925) ach 9ffffffffd599770 : RDMA: No Active: No State: PASSIVE OPEN (40)
SKGXP:[9ffffffffd5834d8.41]{0}:    sconno: 0x434e3dda aconno: 0x5a0ee48b sadmno: 0x6459330c aadmno: 0x630e0d86 creqtime: 212728
SKGXP:[9ffffffffd5834d8.42]{0}:    fragsz: 32768 cdt_bits: 3 tot_cdts: 8 cdts: max_sends: 8
SKGXP:[9ffffffffd5834d8.43]{0}:    seqnxt: 58072 last_ack: 58072 lmseqn: 0 transform_side: No inactive_time: 5048206018
SKGXP:[9ffffffffd5834d8.44]{0}: 
SKGXP:[9ffffffffd5834d8.45]{0}:    Dumping Sliding Window
SKGXP:[9ffffffffd5834d8.46]{0}:    Slot: 0 State: FREE 
SKGXP:[9ffffffffd5834d8.47]{0}:      ftype: 2 seqn: 58064 first_seqn: 58064 flen: 1520 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.48]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.49]{0}:    Slot: 1 State: FREE 
SKGXP:[9ffffffffd5834d8.50]{0}:      ftype: 2 seqn: 58065 first_seqn: 58065 flen: 1520 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.51]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.52]{0}:    Slot: 2 State: FREE 
SKGXP:[9ffffffffd5834d8.53]{0}:      ftype: 2 seqn: 58066 first_seqn: 58066 flen: 2672 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.54]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.55]{0}:    Slot: 3 State: FREE 
SKGXP:[9ffffffffd5834d8.56]{0}:      ftype: 2 seqn: 58067 first_seqn: 58067 flen: 944 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.57]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.58]{0}:    Slot: 4 State: FREE 
SKGXP:[9ffffffffd5834d8.59]{0}:      ftype: 2 seqn: 58068 first_seqn: 58068 flen: 1376 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.60]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.61]{0}:    Slot: 5 State: FREE 
SKGXP:[9ffffffffd5834d8.62]{0}:      ftype: 2 seqn: 58069 first_seqn: 58069 flen: 368 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.63]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.64]{0}:    Slot: 6 State: FREE 
SKGXP:[9ffffffffd5834d8.65]{0}:      ftype: 2 seqn: 58070 first_seqn: 58070 flen: 3248 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.66]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.67]{0}:    Slot: 7 State: FREE 
SKGXP:[9ffffffffd5834d8.68]{0}:      ftype: 2 seqn: 58071 first_seqn: 58071 flen: 224 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.69]{0}:      rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.70]{0}: SKGXP Assertion FALSE Failed at location skgxp_slide_recv_window:gen-oow line_num: 19499  <<<<<<<<<<<<<<<

*** 2015-07-26 03:05:45.291
Exception [type: SIGIOT, unknown code] [ADDR:0x6CA9] [PC:0xC0000000003125F0, __lwp_kill()+48] [exception issued by pid: 27817, uid: 1024] [flags: 0x0, count: 1]
Incident 704134 created, dump file: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_704134/anbob1_lms3_27817_i704134.trc
ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT] [ADDR:0x6CA9] [PC:0xC0000000003125F0] [unknown code] []

ssexhd: crashing the process...
Background_Core_Dump = partial
ksdbgcra: writing core file to directory '/oracle/app/oracle/diag/rdbms/anbob/anbob1/cdump'

# /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms3_27817.trc

ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT] [ADDR:0x6CA9] [PC:0xC0000000003125F0] [unknown code] []

*** 2015-07-26 03:05:45.416
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0)
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----
    skdstdst 
<-__lwp_kill()+48<-__pthread_kill()+2512<-_raise()+224<-abort()+544<-_assert()+608<-skgxp_assert()+784
<-skgxp_assert_recv()+1488<-skgxp_window_land_recv()+624
<-skgxpprcrcv()+160<-skgxp_recv_next_fragment()+3504<-skgxprusr()+1008<-skgxpiwait()+9728 
----- End of Abridged Call Stack Trace -----  

skgxpwait(): part of SKGXP, the OSD (Operating System Dependent) API used by the TCP/IP version of IPC
skgxp_recv_next_fragment(): receives the next fragment on the wire
skgxp_window_land_recv(): does sliding-window processing for the received fragment
_assert() / abort(): the SKGXP assertion failed, so the process called abort(), which raises SIGABRT (reported here as SIGIOT) against itself and dumps core
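The abridged call stack is printed innermost frame first, joined by `<-`, which makes the calling order easy to misread. A minimal sketch, assuming only that `name()+offset<-` layout, splits the stack into frames and prints it caller-first. The helper name `parse_call_stack` is hypothetical and not part of any Oracle tool.

```python
def parse_call_stack(text):
    """Split an 'a()+N<-b()+M' call stack into (function, offset) frames.

    Frames are returned in the order printed in the trace: innermost first.
    """
    frames = []
    for token in text.replace("\n", "").split("<-"):
        token = token.strip()
        if not token:
            continue
        name, _, offset = token.partition("()+")
        frames.append((name, int(offset) if offset else None))
    return frames


# Copied from the trace above (the leading "skdstdst" line is omitted).
STACK = """<-__lwp_kill()+48<-__pthread_kill()+2512<-_raise()+224<-abort()+544<-_assert()+608<-skgxp_assert()+784
<-skgxp_assert_recv()+1488<-skgxp_window_land_recv()+624
<-skgxpprcrcv()+160<-skgxp_recv_next_fragment()+3504<-skgxprusr()+1008<-skgxpiwait()+9728"""

frames = parse_call_stack(STACK)

# Print the outermost caller first so the path reads top-down.
for depth, (name, offset) in enumerate(reversed(frames)):
    print("  " * depth + f"{name}()+{offset}")
```

Read this way, the path runs from the IPC wait, through fragment receive and sliding-window handling, into the assertion and finally `abort()`.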

Note:
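I found no documentation for skgxp_slide_recv_window, but it presumably belongs to a sliding-window network protocol: packet-based data is received in fragments, held in a buffer, handed to the application layer above, and finally reassembled in order. A failure in this function most likely occurs during the final reassembly, or because some of the intermediate packets were not delivered.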
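To make the guess above concrete, here is a toy receive window in Python. It is an illustration of the general sliding-window idea only, not Oracle's SKGXP code: the class name, the 8-slot size (chosen to mirror the eight slots in the dump) and the reading of "oow" in `gen-oow` as "out of window" are all assumptions.

```python
class ReceiveWindow:
    """Toy in-order receive window. NOT Oracle's SKGXP implementation."""

    def __init__(self, first_seqn, size=8):
        self.next_seqn = first_seqn  # next sequence number to hand upward
        self.size = size             # number of slots in the window
        self.slots = {}              # seqn -> payload, buffered out of order

    def receive(self, seqn, payload):
        """Accept one fragment and return the payloads now deliverable in order."""
        if seqn < self.next_seqn:
            return []  # duplicate of a fragment that was already delivered
        if seqn >= self.next_seqn + self.size:
            # Out-of-window fragment. This sketch raises; whether the real
            # code asserts at a comparable check is an assumption.
            raise ValueError(
                f"seqn {seqn} outside window "
                f"[{self.next_seqn}, {self.next_seqn + self.size})"
            )
        self.slots[seqn] = payload
        delivered = []
        # Slide the window forward over every contiguous buffered fragment.
        while self.next_seqn in self.slots:
            delivered.append(self.slots.pop(self.next_seqn))
            self.next_seqn += 1
        return delivered


window = ReceiveWindow(first_seqn=58072)
print(window.receive(58073, b"second"))  # buffered, nothing deliverable yet
print(window.receive(58072, b"first"))   # gap filled, both delivered in order
```

A fragment that arrives corrupted, duplicated, or far out of order is exactly what pushes a receiver like this into its error path, which is why the network counters are worth checking next.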

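The main suspects are the network, the host, or the application. A stable network is a precondition for an efficient RAC. Checking the bad checksums counter with netstat -s showed 0 on node 2 and 21473 on node 1 (the node that restarted).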

#node 1
udp:
       211615 incomplete headers
       21473 bad checksums    #<============= 
       190142 socket overflows
ip:
       242745393114 total packets received
       227417 bad IP headers
       10347787171 fragments received
       64410 fragments dropped (dup or out of space) #<===============
       28719 fragments dropped after timeout
       0 packets forwarded
       17 packets not forwardable 
#node2
udp:
648577 incomplete headers
0 bad checksums   #<<<<<<<<<<<<<<
648577 socket overflows
ip:
230910024388 total packets received
323248 bad IP headers
9016546368 fragments received
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
0 packets forwarded
0 packets not forwardable 
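Comparing the two nodes by eye gets tedious once more counters are involved. The sketch below parses `netstat -s`-style text into a dictionary and lines up the counters of interest side by side. It assumes the `<number> <label>` layout shown above; the sample text is abbreviated from this post, and `parse_netstat_s`, `compare` and `WATCH` are hypothetical names.

```python
import re


def parse_netstat_s(text):
    """Parse 'netstat -s'-style output into {protocol: {label: value}}."""
    stats, proto = {}, None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing '#<===' notes
        if not line:
            continue
        if line.endswith(":"):                # protocol header such as 'udp:'
            proto = line[:-1]
            stats[proto] = {}
            continue
        match = re.match(r"(\d+)\s+(.+)", line)
        if match and proto:
            stats[proto][match.group(2).strip()] = int(match.group(1))
    return stats


NODE1 = """
udp:
       211615 incomplete headers
       21473 bad checksums    #<=============
       190142 socket overflows
ip:
       64410 fragments dropped (dup or out of space)
       28719 fragments dropped after timeout
"""

NODE2 = """
udp:
648577 incomplete headers
0 bad checksums
648577 socket overflows
ip:
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
"""

# Counters that point at interconnect trouble in this case.
WATCH = [
    ("udp", "bad checksums"),
    ("udp", "socket overflows"),
    ("ip", "fragments dropped (dup or out of space)"),
    ("ip", "fragments dropped after timeout"),
]


def compare(node1, node2):
    """Return (proto, label, node1_value, node2_value) for each watched counter."""
    return [
        (proto, label,
         node1.get(proto, {}).get(label),
         node2.get(proto, {}).get(label))
        for proto, label in WATCH
    ]


for proto, label, a, b in compare(parse_netstat_s(NODE1), parse_netstat_s(NODE2)):
    print(f"{proto:4} {label:42} node1={str(a):<8} node2={b}")
```

These counters are cumulative since boot, so two samples taken some minutes apart and subtracted say more about current behaviour than a single snapshot.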

A search of MOS (My Oracle Support) turned up several similar bugs:

Bug 19520489 : ORA-7445 [__LWP_KILL()+8] FOLLOWED BY INSTANCE CRASH
Bug 18518529 : ORA-7445 [__LWP_KILL()+8] [SIGIOT] AND INSTANCE CRASH
Bug 18011512 : LMON: TERMINATING THE INSTANCE, NOT ABLE TO START IT AGAIN 
Bug 12753779 - LMS PROCESSES DIE WITH ORA-07445: EXCEPTION ENCOUNTERED: CORE DUMP [_KILL()+48] 
Bug 14119119 : ORA-7445 [__LWP_KILL+48]

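The bug related to hostname length does not match this case, and the number of LMS processes on this instance is consistent. After several rounds of discussion on the SR, the following changes were agreed: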

1. Cut ip_fragment_timeout to 100 (1 second). (Default is 60 seconds.)
2. Increase ip_reass_mem_limit to 10000000 (10 MB). (Default is 2 MB.)
3. Increase socket_udp_rcvbuf_default and socket_udp_sndbuf_default:
udp_sendspace >= max((DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4096, 65536)
udp_recvspace >= udp_sendspace * 2 (on HP-UX)
4. Increase the _lm_tickets and gcs_server_processes parameter values (according to the actual situation).
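The buffer sizing rule in item 3 is simple arithmetic, so a small helper can turn it into concrete minimums for a given database. This is a sketch under stated assumptions: the floor defaults to 65536 bytes (64 KB), which is my reading of the intended lower bound, and the function name `udp_buffer_sizes` and the sample parameter values are hypothetical.

```python
def udp_buffer_sizes(db_block_size, multiblock_read_count, floor=65536):
    """Return (min_send_bytes, min_recv_bytes) per the sizing rule in item 3.

    send >= max(DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT + 4096, floor)
    recv >= 2 * send
    The 64 KB floor is an assumption; pass another value if your SR says so.
    """
    send = max(db_block_size * multiblock_read_count + 4096, floor)
    return send, 2 * send


# Hypothetical settings: 8 KB blocks with two different multiblock read counts.
for mbrc in (4, 16):
    send, recv = udp_buffer_sizes(8192, mbrc)
    print(f"MBRC={mbrc:<3} send >= {send:>7} bytes, recv >= {recv:>7} bytes")
```

With a small multiblock read count the floor dominates; with larger counts the block-size term takes over, so it is worth recomputing whenever either database parameter changes.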

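Update 20170329
A friend recently ran into the same problem, so here is an update. Our case was somewhat complicated at the time, but the problem has since been resolved.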

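For this kind of problem I suggest adjusting the OS parameters above first and then observing. In our case ORA-7445 did not recur after the parameters were changed, but several other environments with the same configuration also crashed. They showed bad checksums and overflows as well, but no ORA-7445, and in the end we could not establish whether the network or the database was at fault.

Because this database is a core system, the patch below was installed on it while the problems on the other databases were still being worked on. That happened after the parameter change, and before ORA-7445 had either recurred or been confirmed as fixed by the parameters alone, so we cannot say which of the two measures actually resolved it.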

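At the customer's strong insistence, Oracle development provided a merge patch specifically for this case, explaining that it improves the robustness of LMS and other background processes. In more than a year since it was installed, the problem has not recurred.
Patch 21252795: MERGE REQUEST ON TOP OF DATABASE PSU 11.2.0.3.7 FOR BUGS 18719357 16088176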

Prerequisite patch
16619892 DATABASE PATCH SET UPDATE 11.2.0.3.7 (INCLUDES CPUJUL2013)
Bugs fixed by this patch
16088176 LNX64-12.1-RAC-CDB: LMD PROC HIT ORA-600 [KJMSCNDSCQ:TIMEOUT] AND INST CRASH
16819962 CDB_RAC : INSTANCE TERMINATED BY LMD0 – LMON RECEIVED AN INSTANCE EVICTION
17452841 LMS HIT ORA-600 [KJCTSRW:1]
17801017 INSTANCES ARE EVICTED FROM CLUSTER WHEN INTERNAL DLM MESSAGING STALLS
17847764 FA + INDEX COMP: ORA-481 LMON INST EVICT – ABNORMAL INSTANCE TERMINATION BY LMD0
