Troubleshooting ORA-600 [4187] Undo segment sequence number is exhausted
每个事务都有一个唯一的 ID,称为 XID,基于Undo segment number 、Undo segment slot number and Undo segment sequence number (wrap#),为每个事务在第一个 DML语句期间分配一个事务 ID(XID), 例如:0002.028.000004DA. 要查明transaction是否真的被提交,需要查看transaction table的描述符。transaction table在每个undo 段头可以转储如alter system dump undo header ‘xxx’, 每次transaction table的slot重用时, slot wrap(sequence number) number就会增加.当slot重用(或wrap#)达到maximum value 4,294,967,295 (2^32 – 1)(0xffffffff)次数时,当再次增加 wrap number 1或多值时,通常会提示ORA-1558 或ORA-600[4187]出现,由于撤销段分配的 AUM 算法,这也应该是非常罕见的。但是执行量非常大事务的库,在undo段非常少时还是有发生的概率,最近这个客户就遇到了(ORACLE RAC 11.2.0.4)。
现象
Fri Nov 19 15:10:12 2021
Thread 1 advanced to log sequence 54671 (LGWR switch)
Current log# 5 seq# 54671 mem# 0: +DATA/anbob/onlinelog/group_5.287.904064243
Fri Nov 19 15:10:16 2021
Archived Log entry 52856 added for thread 1 sequence 54670 ID 0xfe89eeec dest 1:
Fri Nov 19 15:12:09 2021
Errors in file /u01/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_23451.trc (incident=189693):
ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_189693/anbob1_ora_23451_i189693.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Alert中出现的ora-600 [4187],部分事务无法继续,随后系统会生成较大的trace日志,有可能会导致ORACLE_HOME磁盘空间耗尽, 通常这种情况重启实例也无法彻底解决,客户在反复重启后,可能会出现下面其它的ORA-600 错误。
行 5958: ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], [] 行 5963: ORA-00600: internal error code, arguments: [kfncSlaveMsgFree1], [0], [0], [0], [], [], [], [], [], [], [], [] 行 5965: ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], [] 行 5972: ORA-00600: internal error code, arguments: [kfgRecoverCreation_0], [], [], [], [], [], [], [], [], [], [], [] 行 5973: ORA-00600: internal error code, arguments: [kfncSlaveMsgFree1], [0], [0], [0], [], [], [], [], [], [], [], [] 行 5975: ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], [] 行 5984: ORA-00600: internal error code, arguments: [kfioTranslateIO03], [], [], [], [], [], [], [], [], [], [], [] 行 5985: ORA-00600: internal error code, arguments: [kfgRecoverCreation_0], [], [], [], [], [], [], [], [], [], [], [] 行 5986: ORA-00600: internal error code, arguments: [kfncSlaveMsgFree1], [0], [0], [0], [], [], [], [], [], [], [], []
因为事务无法继续,可能在数据库层分析性能SQL就容易被误解,毕竟事务挂起无法继续。事务中断回滚清理、UNDO header Block recovery等,甚至可能存在SMON实例回滚时的内部错误,如果smon遇到100次内部错误则会重启实例(参数_smon_internal_errlimit控制)。在实例重启后出现ORA-15221 raised in ASM I/O path等ora-600 KF 相关的ASM事务报错,PMON会终止实例。
PMON TRACE
*** 2021-11-19 16:27:44.257 *** SESSION ID:(567.1) 2021-11-19 16:27:44.257 *** CLIENT ID:() 2021-11-19 16:27:44.257 *** SERVICE NAME:(SYS$BACKGROUND) 2021-11-19 16:27:44.257 *** MODULE NAME:() 2021-11-19 16:27:44.257 *** ACTION NAME:() 2021-11-19 16:27:44.257 Doing block recovery for file 3 block 272 Block header before block recovery: buffer tsn: 2 rdba: 0x00c00110 (3/272) scn: 0x0f9b.ed0bc9e7 seq: 0x01 flg: 0x04 tail: 0xc9e72601 frmt: 0x02 chkval: 0x5ec8 type: 0x26=KTU SMU HEADER BLOCK Resuming block recovery (PMON) for file 3 block 272 Block recovery from logseq 54683, block 996 to scn 17162371681671 *** 2021-11-19 16:27:44.270 Recovery of Online Redo Log: Thread 1 Group 5 Seq 54683 Reading mem 0 Block recovery completed at rba 54683.998.16, scn 3995.3977334373 ==== Redo read statistics for thread 1 ==== Total physical reads (from disk and memory): 1541Kb -- Redo read_disk statistics -- Read rate (ASYNC): 0Kb in 0.01s => 0.00 Mb/sec -- Redo read_memory statistics -- Read disk 0Kb and read memory 1541Kb, hit-ratio=1.00 Precise schema: 1541Kb valid, 0Kb invalid, valid-ratio 1.00 Longest record: 0Kb, moves: 0/4 (0%) Longest LWN: 1Kb, moves: 0/3 (0%), moved: 0Mb Last redo scn: 0x0f9b.ed114d64 (17162371681636) ---------------------------------------------- Block image after block recovery: buffer tsn: 2 rdba: 0x00c00110 (3/272) scn: 0x0f9b.ed0bc9e7 seq: 0x01 flg: 0x04 tail: 0xc9e72601 frmt: 0x02 chkval: 0x5ec8 type: 0x26=KTU SMU HEADER BLOCK Retention Table ----------------------------------------------------------- Extent Number:0 Commit Time: 1637305903 Extent Number:1 Commit Time: 1637305905 Extent Number:2 Commit Time: 1637305905 Extent Number:3 Commit Time: 1637303989 Extent Number:4 Commit Time: 1637304038 ... ... Extent Number:28 Commit Time: 1637305842 Extent Number:29 Commit Time: 1637305901 Extent Number:30 Commit Time: 1637305901 TRN CTL:: seq: 0xd14f chd: 0x0009 ctl: 0x0004 inc: 0x00000000 nfb: 0x0000 mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe) uba: 0x00c06434.d14f.2c scn: 0x0f9b.ed0bc826 Version: 0x01 FREE BLOCK POOL:: uba: 0x00000000.d14f.2b ext: 0x2 spc: 0x1dc uba: 0x00000000.d14f.2a ext: 0x2 spc: 0x70e uba: 0x00000000.d14b.02 ext: 0x1e spc: 0x1f02 uba: 0x00000000.ce3f.02 ext: 0x12 spc: 0x14da uba: 0x00000000.3226.02 ext: 0x32 spc: 0x14ae TRN TBL:: index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt ------------------------------------------------------------------------------------------------ 0x00 9 0x00 0xfffffa0c 0x000c 0x0f9b.ed0bc8f1 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x01 9 0x00 0xfffff4ab 0x0016 0x0f9b.ed0bc84e 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305927 0x02 9 0x00 0xfffff3aa 0x0008 0x0f9b.ed0bc934 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x03 9 0x00 0xfffff8d9 0x000e 0x0f9b.ed0bc985 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x04 9 0x00 0xfffffce8 0xffff 0x0f9b.ed0bc9e7 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x05 9 0x00 0xfffff627 0x001a 0x0f9b.ed0bc833 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305927 0x06 9 0x00 0xfffff4e6 0x0004 0x0f9b.ed0bc9cb 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x07 9 0x00 0xffffece5 0x000b 0x0f9b.ed0bc85c 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305927 0x08 9 0x00 0xfffff724 0x0021 0x0f9b.ed0bc93a 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x09 9 0x00 0xfffffff3 0x0015 0x0f9b.ed0bc828 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305927 0x0a 9 0x00 0xfffffaf2 0x0018 0x0f9b.ed0bc90c 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x0b 9 0x00 0xfffff671 0x0010 0x0f9b.ed0bc867 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305927 0x0c 9 0x00 0xfffffec0 0x001e 0x0f9b.ed0bc900 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x0d 9 0x00 0xfffff8bf 0x0020 0x0f9b.ed0bc889 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x0e 9 0x00 0xfffff4ce 0x0013 0x0f9b.ed0bc9ab 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x0f 9 0x00 0xfffff64d 0x000d 0x0f9b.ed0bc875 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305927 0x10 9 0x00 0xfffff5ec 0x000f 0x0f9b.ed0bc86b 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305927 ... 0x1d 9 0x00 0xfffff23f 0x0019 0x0f9b.ed0bc91e 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x1e 9 0x00 0xfffff67e 0x000a 0x0f9b.ed0bc908 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x1f 9 0x00 0xfffff1ad 0x0003 0x0f9b.ed0bc982 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x20 9 0x00 0xfffffb0c 0x001b 0x0f9b.ed0bc8ba 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 0x21 9 0x00 0xfffff1eb 0x0011 0x0f9b.ed0bc943 0x00c06434 0x0000.000.00000000 0x00000001 0x00000000 1637305928 EXT TRN CTL:: Incident 328031 created, dump file: /u01/oracle/app/oracle/diag/rdbms/lcfa/LCFA1/incident/incdir_328031/LCFA1_pmon_11307_i328031.trc ORA-00600: internal error code, arguments: [kfgRecoverCreation_0], [], [], [], [], [], [], [], [], [], [], [] PMON: fatal error while deleting s.o. 0x6ae9d8a18 in this tree: ---------------------------------------- SO: 0x6d91e2ea0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x6d91e2ea0, name=process, file=ksu.h LINE:12721, pg=0 (process) Oracle pid:50, ser:3, calls cur/top: 0x65863abc0/0x65863abc0 flags : (0x1) DEAD flags2: (0x8000), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 Cleanup details: Found dead = 1 sec ago Total Cleanup attempts = 1, Cleanup time = 0 sec, Cleanup timer = 0.000 sec Last attempt (full) started 1 sec ago, Length = in progress (post info) last post received: 0 0 51 last post received-location: ksv2.h LINE:1679 ID:ksvpst: workdone last process to post me: 0x6d51f8310 1 2 last post sent: 0 0 26 last post sent-location: ksa2.h LINE:285 ID:ksasnd last process posted by me: 0x6ed26d1a0 1 6 (latch info) wait_event=0 bits=0x0 Process Group: DEFAULT, pseudo proc: 0x6ed5819c8 O/S info: user: grid, term: UNKNOWN, ospid: 12107 (DEAD) OSD pid info: Unix process pid: 12107, image: oracle@ ...
Kernel Transaction Undo tranXaction Control (KTUXC): Maintains control information to rollback changes to the transaction table itself.
从上面transaction table header(ktuxc)显示chd 9# slot, 而slot 9的wrap#为 0xfffffff3已经非常接近上限0xffffffff,并且其它slot同样也非常高。
wrap#增长当前采用的是在执行ktubnd函数为事务绑定undo段时,会调用kjqghd去计算出一个重用slot递增值delta,这个delta限制必须小于16(由KTU_MAX_KSQN_DELTA定义),所以就可能会出现 0xfffffff3 +delta的值超过KSQNMAXVAL,显示为负值。
从MOS中可以确认存在该问题Doc ID 19700135.8
ORA-600 [4187] can occur for undo segments where wrap# is close to the max value of 0xffffffff (KSQNMAXVAL).
This normally affects databases with high transaction rate that have existed for a relatively long time.To identify undo segments causing the above error and others that may potentially cause it
in the future, run the next query:select b.segment_name, b.tablespace_name
,a.ktuxeusn “Undo Segment Number”
,a.ktuxeslt “Slot”
,a.ktuxesqn “Wrap#”
from x$ktuxe a, dba_rollback_segs b
where a.ktuxesqn > -429496730 and a.ktuxesqn < 0
and a.ktuxeusn = b.segment_id;Then drop the undo segments or the undo tablespace from the output above.
数据库中查询以上SQL
SQL> set lines 999
SQL> select b.segment_name, b.tablespace_name
2 ,a.ktuxeusn "Undo Segment Number"
3 ,a.ktuxeslt "Slot"
4 ,a.ktuxesqn "Wrap#"
5 from x$ktuxe a, dba_rollback_segs b
6 where a.ktuxesqn > -429496730 and a.ktuxesqn < 0
7 and a.ktuxeusn = b.segment_id;
SEGMENT_NAME TABLESPACE_NAME Undo Segment Number Slot Wrap#
-------------------------------- ------------------------ ------------------- ---------- ----------
_SYSSMU10_562384330$ UNDOTBS1 10 0 -1524
_SYSSMU10_562384330$ UNDOTBS1 10 1 -2901
_SYSSMU10_562384330$ UNDOTBS1 10 2 -3158
_SYSSMU10_562384330$ UNDOTBS1 10 3 -1831
_SYSSMU10_562384330$ UNDOTBS1 10 4 -792
_SYSSMU10_562384330$ UNDOTBS1 10 5 -2521
_SYSSMU10_562384330$ UNDOTBS1 10 6 -2842
_SYSSMU10_562384330$ UNDOTBS1 10 7 -4891
_SYSSMU10_562384330$ UNDOTBS1 10 8 -2268
_SYSSMU10_562384330$ UNDOTBS1 10 9 -13
_SYSSMU10_562384330$ UNDOTBS1 10 10 -1294
...
...
_SYSSMU10_562384330$ UNDOTBS1 10 22 -666
_SYSSMU10_562384330$ UNDOTBS1 10 23 -555
_SYSSMU10_562384330$ UNDOTBS1 10 24 -3596
_SYSSMU10_562384330$ UNDOTBS1 10 25 -3325
_SYSSMU10_562384330$ UNDOTBS1 10 26 -2670
_SYSSMU10_562384330$ UNDOTBS1 10 27 -1551
_SYSSMU10_562384330$ UNDOTBS1 10 28 -4384
_SYSSMU10_562384330$ UNDOTBS1 10 29 -3521
_SYSSMU10_562384330$ UNDOTBS1 10 30 -2434
_SYSSMU10_562384330$ UNDOTBS1 10 31 -3667
_SYSSMU10_562384330$ UNDOTBS1 10 32 -1268
'''
Note:
大量slot的wrap# 已经出现的溢出为负值。
transaction table的说明
To find out whether the transaction has really been committed, it is necessary to look at its descriptor in the transaction table.
The transaction ID (or XID) enables you to find the transaction descriptor:
• usn: Number of the undo segment which was assigned to the transaction
• slot: Number of the entry in the transaction table (contained in the header of the undo segment) corresponding to the transaction descriptor.
This is the index column in the transaction table.
• seq: Sequence number indicating reuse of the slot. This number is incremented as each slot is reused. It corresponds to the wrap# column in the transaction table.
All this information is collected in a memory area called the cleanout information area. This information is then used to actually apply the changes.
The transaction table holds a pointer to the head and tail of the committed transaction table slots. Each inactive entry in the transaction table points to the next committed (inactive) slot.
They are ordered by commit SCN, starting at the oldest transaction to have committed. If the committed list is empty after several attempts at traversing it, the transaction table remains full
and an error is reported. This indicates that there are not enough undo segments for the amount of concurrent transactions in the database. However, since AUM tries to allocate each concurrent transaction to their
own undo segment, this should be extremely rare to see. If this error does occur, the solution is to increase the size of the undo tablespace to allow more undo segments to be created.
At instance startup each undo tablespace will contain 10 undo segments. The number of additional segments brought online during instance startup is based on the SESSIONS parameter.
Oracle allocates approximately one undo segment for every five sessions. These are sized according to the auto-allocate algorithm for locally managed tablespaces. The basic algorithm
is that the first 16 extents are 64 KB in size. During subsequent allocation the next 63 extents are 1 MB, the next 120 extents are 8 MB and all additional extents are 64 MB.
事务绑定UNDO段的原则是尽可能的将活动事务平均的各个undo段上,具体算法为:
1,在当前undo tablespace中的online undo segment中寻找事务表中没有活动事务的undo segment
2,如果没有找到则尝试在当前undo tablespace online 那些处于offline状态的undo segment
3,如果没有找到则尝试在当前undo tablespace创建undo segment并online
4,如果无法创建则会寻找最近最少使用的undo segment
到这一步既然确认了是业务特性高频事务导致undo 段的slot wrap#耗尽,检查一下UNDO段个数,发现当前的undo 表空间为大小1.5G (数据文件自动扩展),判断应该是大量的小事务,如loop 中insert value ommit等, 对于这么高频率的事务的数据库节点,同时因为UNDO段个数少,即使是undo段使用平衡,也会存在耗尽的风险, 如果个别segment的slot wrap#高,可以drop undo segment,但是如果普遍undo段wrap#使用都高,可以先创建新的UNDO表空间,修改数据库默认表空间解决,后期删聊原UNDO表空间。
预防
1, 初始化更多的UNDO段,使用_rollback_segment_count参数
2, 增加巡检项
select b.segment_name, b.tablespace_name ,a.ktuxeusn "Undo Segment Number" ,a.ktuxeslt "Slot" ,a.ktuxesqn "Wrap#" from x$ktuxe a, dba_rollback_segs b where a.ktuxeusn = b.segment_id and ktuxesqn>power(2,20); -- change
对不起,这篇文章暂时关闭评论。