首页 » ORACLE 9i-23c » Troubleshooting ORA-600 [4187] Undo segment sequence number is exhausted

Troubleshooting ORA-600 [4187] Undo segment sequence number is exhausted

每个事务都有一个唯一的 ID,称为 XID,基于Undo segment number 、Undo segment slot number and Undo segment sequence number (wrap#),为每个事务在第一个 DML语句期间分配一个事务 ID(XID), 例如:0002.028.000004DA. 要查明transaction是否真的被提交,需要查看transaction table的描述符。transaction table在每个undo 段头可以转储如alter system dump undo header ‘xxx’, 每次transaction table的slot重用时, slot wrap(sequence number) number就会增加.当slot重用(或wrap#)达到maximum value 4,294,967,295 (2^32 – 1)(0xffffffff)次数时,当再次增加 wrap number 1或多值时,通常会提示ORA-1558 或ORA-600[4187]出现,由于撤销段分配的 AUM 算法,这也应该是非常罕见的。但是执行量非常大事务的库,在undo段非常少时还是有发生的概率,最近这个客户就遇到了(ORACLE RAC 11.2.0.4)。

现象

Fri Nov 19 15:10:12 2021
Thread 1 advanced to log sequence 54671 (LGWR switch)
  Current log# 5 seq# 54671 mem# 0: +DATA/anbob/onlinelog/group_5.287.904064243
Fri Nov 19 15:10:16 2021
Archived Log entry 52856 added for thread 1 sequence 54670 ID 0xfe89eeec dest 1:
Fri Nov 19 15:12:09 2021
Errors in file /u01/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_23451.trc  (incident=189693):
ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_189693/anbob1_ora_23451_i189693.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

Alert中出现的ora-600 [4187],部分事务无法继续,随后系统会生成较大的trace日志,有可能会导致ORACLE_HOME磁盘空间耗尽, 通常这种情况重启实例也无法彻底解决,客户在反复重启后,可能会出现下面其它的ORA-600 错误。

	行 5958: ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], []
	行 5963: ORA-00600: internal error code, arguments: [kfncSlaveMsgFree1], [0], [0], [0], [], [], [], [], [], [], [], []
	行 5965: ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], []
	行 5972: ORA-00600: internal error code, arguments: [kfgRecoverCreation_0], [], [], [], [], [], [], [], [], [], [], []
	行 5973: ORA-00600: internal error code, arguments: [kfncSlaveMsgFree1], [0], [0], [0], [], [], [], [], [], [], [], []
	行 5975: ORA-00600: internal error code, arguments: [4187], [], [], [], [], [], [], [], [], [], [], []
	行 5984: ORA-00600: internal error code, arguments: [kfioTranslateIO03], [], [], [], [], [], [], [], [], [], [], []
	行 5985: ORA-00600: internal error code, arguments: [kfgRecoverCreation_0], [], [], [], [], [], [], [], [], [], [], []
	行 5986: ORA-00600: internal error code, arguments: [kfncSlaveMsgFree1], [0], [0], [0], [], [], [], [], [], [], [], []

因为事务无法继续,可能在数据库层分析性能SQL就容易被误解,毕竟事务挂起无法继续。事务中断回滚清理、UNDO header Block recovery等,甚至可能存在SMON实例回滚时的内部错误,如果smon遇到100次内部错误则会重启实例(参数_smon_internal_errlimit控制)。在实例重启后出现ORA-15221 raised in ASM I/O path等ora-600 KF 相关的ASM事务报错,PMON会终止实例。

PMON TRACE

*** 2021-11-19 16:27:44.257
*** SESSION ID:(567.1) 2021-11-19 16:27:44.257
*** CLIENT ID:() 2021-11-19 16:27:44.257
*** SERVICE NAME:(SYS$BACKGROUND) 2021-11-19 16:27:44.257
*** MODULE NAME:() 2021-11-19 16:27:44.257
*** ACTION NAME:() 2021-11-19 16:27:44.257
 
Doing block recovery for file 3 block 272
Block header before block recovery:
buffer tsn: 2 rdba: 0x00c00110 (3/272)
scn: 0x0f9b.ed0bc9e7 seq: 0x01 flg: 0x04 tail: 0xc9e72601
frmt: 0x02 chkval: 0x5ec8 type: 0x26=KTU SMU HEADER BLOCK
Resuming block recovery (PMON) for file 3 block 272
Block recovery from logseq 54683, block 996 to scn 17162371681671

*** 2021-11-19 16:27:44.270
Recovery of Online Redo Log: Thread 1 Group 5 Seq 54683 Reading mem 0
Block recovery completed at rba 54683.998.16, scn 3995.3977334373
==== Redo read statistics for thread 1 ====
Total physical reads (from disk and memory): 1541Kb
-- Redo read_disk statistics --
Read rate (ASYNC): 0Kb in 0.01s => 0.00 Mb/sec
-- Redo read_memory statistics --
Read disk 0Kb and read memory 1541Kb, hit-ratio=1.00
Precise schema: 1541Kb valid, 0Kb invalid, valid-ratio 1.00
Longest record: 0Kb, moves: 0/4 (0%)
Longest LWN: 1Kb, moves: 0/3 (0%), moved: 0Mb
Last redo scn: 0x0f9b.ed114d64 (17162371681636)
----------------------------------------------
Block image after block recovery:
buffer tsn: 2 rdba: 0x00c00110 (3/272)
scn: 0x0f9b.ed0bc9e7 seq: 0x01 flg: 0x04 tail: 0xc9e72601
frmt: 0x02 chkval: 0x5ec8 type: 0x26=KTU SMU HEADER BLOCK

 Retention Table 
  -----------------------------------------------------------
 Extent Number:0  Commit Time: 1637305903
 Extent Number:1  Commit Time: 1637305905
 Extent Number:2  Commit Time: 1637305905
 Extent Number:3  Commit Time: 1637303989
 Extent Number:4  Commit Time: 1637304038
...
...
 Extent Number:28  Commit Time: 1637305842
 Extent Number:29  Commit Time: 1637305901
 Extent Number:30  Commit Time: 1637305901
  
  TRN CTL:: seq: 0xd14f chd: 0x0009 ctl: 0x0004 inc: 0x00000000 nfb: 0x0000
            mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
            uba: 0x00c06434.d14f.2c scn: 0x0f9b.ed0bc826
Version: 0x01
  FREE BLOCK POOL::
    uba: 0x00000000.d14f.2b ext: 0x2  spc: 0x1dc   
    uba: 0x00000000.d14f.2a ext: 0x2  spc: 0x70e   
    uba: 0x00000000.d14b.02 ext: 0x1e spc: 0x1f02  
    uba: 0x00000000.ce3f.02 ext: 0x12 spc: 0x14da  
    uba: 0x00000000.3226.02 ext: 0x32 spc: 0x14ae  
  TRN TBL::
 
  index  state cflags  wrap#    uel         scn            dba            parent-xid    nub     stmt_num    cmt
  ------------------------------------------------------------------------------------------------
   0x00    9    0x00  0xfffffa0c  0x000c  0x0f9b.ed0bc8f1  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x01    9    0x00  0xfffff4ab  0x0016  0x0f9b.ed0bc84e  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305927
   0x02    9    0x00  0xfffff3aa  0x0008  0x0f9b.ed0bc934  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x03    9    0x00  0xfffff8d9  0x000e  0x0f9b.ed0bc985  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x04    9    0x00  0xfffffce8  0xffff  0x0f9b.ed0bc9e7  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x05    9    0x00  0xfffff627  0x001a  0x0f9b.ed0bc833  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305927
   0x06    9    0x00  0xfffff4e6  0x0004  0x0f9b.ed0bc9cb  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x07    9    0x00  0xffffece5  0x000b  0x0f9b.ed0bc85c  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305927
   0x08    9    0x00  0xfffff724  0x0021  0x0f9b.ed0bc93a  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x09    9    0x00  0xfffffff3  0x0015  0x0f9b.ed0bc828  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305927
   0x0a    9    0x00  0xfffffaf2  0x0018  0x0f9b.ed0bc90c  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x0b    9    0x00  0xfffff671  0x0010  0x0f9b.ed0bc867  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305927
   0x0c    9    0x00  0xfffffec0  0x001e  0x0f9b.ed0bc900  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x0d    9    0x00  0xfffff8bf  0x0020  0x0f9b.ed0bc889  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x0e    9    0x00  0xfffff4ce  0x0013  0x0f9b.ed0bc9ab  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x0f    9    0x00  0xfffff64d  0x000d  0x0f9b.ed0bc875  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305927
   0x10    9    0x00  0xfffff5ec  0x000f  0x0f9b.ed0bc86b  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305927
...
   0x1d    9    0x00  0xfffff23f  0x0019  0x0f9b.ed0bc91e  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x1e    9    0x00  0xfffff67e  0x000a  0x0f9b.ed0bc908  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x1f    9    0x00  0xfffff1ad  0x0003  0x0f9b.ed0bc982  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x20    9    0x00  0xfffffb0c  0x001b  0x0f9b.ed0bc8ba  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
   0x21    9    0x00  0xfffff1eb  0x0011  0x0f9b.ed0bc943  0x00c06434  0x0000.000.00000000  0x00000001   0x00000000  1637305928
  EXT TRN CTL::

Incident 328031 created, dump file: /u01/oracle/app/oracle/diag/rdbms/lcfa/LCFA1/incident/incdir_328031/LCFA1_pmon_11307_i328031.trc
ORA-00600: internal error code, arguments: [kfgRecoverCreation_0], [], [], [], [], [], [], [], [], [], [], []

PMON: fatal error while deleting s.o. 0x6ae9d8a18 in this tree:
----------------------------------------
SO: 0x6d91e2ea0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x6d91e2ea0, name=process, file=ksu.h LINE:12721, pg=0
(process) Oracle pid:50, ser:3, calls cur/top: 0x65863abc0/0x65863abc0
          flags : (0x1) DEAD
          flags2: (0x8000),  flags3: (0x10) 
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
  ksudlp FALSE at location: 0
  Cleanup details:
    Found dead = 1 sec ago
    Total Cleanup attempts = 1, Cleanup time = 0 sec, Cleanup timer = 0.000 sec
    Last attempt (full) started 1 sec ago, Length = in progress
  (post info) last post received: 0 0 51
              last post received-location: ksv2.h LINE:1679 ID:ksvpst: workdone
              last process to post me: 0x6d51f8310 1 2
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:285 ID:ksasnd
              last process posted by me: 0x6ed26d1a0 1 6
  (latch info) wait_event=0 bits=0x0
  Process Group: DEFAULT, pseudo proc: 0x6ed5819c8
  O/S info: user: grid, term: UNKNOWN, ospid: 12107 (DEAD)
  OSD pid info: Unix process pid: 12107, image: oracle@
...

Kernel Transaction Undo tranXaction Control (KTUXC): Maintains control information to rollback changes to the transaction table itself.

从上面transaction table header(ktuxc)显示chd 9# slot, 而slot 9的wrap#为 0xfffffff3已经非常接近上限0xffffffff,并且其它slot同样也非常高。
wrap#增长当前采用的是在执行ktubnd函数为事务绑定undo段时,会调用kjqghd去计算出一个重用slot递增值delta,这个delta限制必须小于16(由KTU_MAX_KSQN_DELTA定义),所以就可能会出现 0xfffffff3 +delta的值超过KSQNMAXVAL,显示为负值。

 

从MOS中可以确认存在该问题Doc ID 19700135.8

ORA-600 [4187] can occur for undo segments where wrap# is close to the max value of 0xffffffff (KSQNMAXVAL).
This normally affects databases with high transaction rate that have existed for a relatively long time.

To identify undo segments causing the above error and others that may potentially cause it
in the future, run the next query:

select b.segment_name, b.tablespace_name
,a.ktuxeusn “Undo Segment Number”
,a.ktuxeslt “Slot”
,a.ktuxesqn “Wrap#”
from x$ktuxe a, dba_rollback_segs b
where a.ktuxesqn > -429496730 and a.ktuxesqn < 0
and a.ktuxeusn = b.segment_id;

Then drop the undo segments or the undo tablespace from the output above.

数据库中查询以上SQL

SQL> set lines 999
SQL> select b.segment_name, b.tablespace_name 
  2           ,a.ktuxeusn "Undo Segment Number"
  3           ,a.ktuxeslt "Slot"
  4           ,a.ktuxesqn "Wrap#"
  5     from  x$ktuxe a, dba_rollback_segs b
  6     where a.ktuxesqn > -429496730 and a.ktuxesqn < 0
  7         and a.ktuxeusn = b.segment_id;

SEGMENT_NAME                     TABLESPACE_NAME          Undo Segment Number       Slot      Wrap#
-------------------------------- ------------------------ ------------------- ---------- ----------
_SYSSMU10_562384330$             UNDOTBS1                                  10          0      -1524
_SYSSMU10_562384330$             UNDOTBS1                                  10          1      -2901
_SYSSMU10_562384330$             UNDOTBS1                                  10          2      -3158
_SYSSMU10_562384330$             UNDOTBS1                                  10          3      -1831
_SYSSMU10_562384330$             UNDOTBS1                                  10          4       -792
_SYSSMU10_562384330$             UNDOTBS1                                  10          5      -2521
_SYSSMU10_562384330$             UNDOTBS1                                  10          6      -2842
_SYSSMU10_562384330$             UNDOTBS1                                  10          7      -4891
_SYSSMU10_562384330$             UNDOTBS1                                  10          8      -2268
_SYSSMU10_562384330$             UNDOTBS1                                  10          9        -13
_SYSSMU10_562384330$             UNDOTBS1                                  10         10      -1294
...
...
_SYSSMU10_562384330$             UNDOTBS1                                  10         22       -666
_SYSSMU10_562384330$             UNDOTBS1                                  10         23       -555
_SYSSMU10_562384330$             UNDOTBS1                                  10         24      -3596
_SYSSMU10_562384330$             UNDOTBS1                                  10         25      -3325
_SYSSMU10_562384330$             UNDOTBS1                                  10         26      -2670
_SYSSMU10_562384330$             UNDOTBS1                                  10         27      -1551
_SYSSMU10_562384330$             UNDOTBS1                                  10         28      -4384
_SYSSMU10_562384330$             UNDOTBS1                                  10         29      -3521
_SYSSMU10_562384330$             UNDOTBS1                                  10         30      -2434
_SYSSMU10_562384330$             UNDOTBS1                                  10         31      -3667
_SYSSMU10_562384330$             UNDOTBS1                                  10         32      -1268
'''

Note:
大量slot的wrap# 已经出现的溢出为负值。

transaction table的说明

To find out whether the transaction has really been committed, it is necessary to look at its descriptor in the transaction table.
The transaction ID (or XID) enables you to find the transaction descriptor:
• usn: Number of the undo segment which was assigned to the transaction
• slot: Number of the entry in the transaction table (contained in the header of the undo segment) corresponding to the transaction descriptor. 
This is the index column in the transaction table.
• seq: Sequence number indicating reuse of the slot. This number is incremented as each slot is reused. It corresponds to the wrap# column in the transaction table.
All this information is collected in a memory area called the cleanout information area. This information is then used to actually apply the changes.

The transaction table holds a pointer to the head and tail of the committed transaction table slots. Each inactive entry in the transaction table points to the next committed (inactive) slot. 
They are ordered by commit SCN, starting at the oldest transaction to have committed. If the committed list is empty after several attempts at traversing it, the transaction table remains full 
and an error is reported. This indicates that there are not enough undo segments for the amount of concurrent transactions in the database. However, since AUM tries to allocate each concurrent transaction to their 
own undo segment, this should be extremely rare to see. If this error does occur, the solution is to increase the size of the undo tablespace to allow more undo segments to be created.

At instance startup each undo tablespace will contain 10 undo segments. The number of additional segments brought online during instance startup is based on the SESSIONS parameter.
 Oracle allocates approximately one undo segment for every five sessions. These are sized according to the auto-allocate algorithm for locally managed tablespaces. The basic algorithm 
is that the first 16 extents are 64 KB in size. During subsequent allocation the next 63 extents are 1 MB, the next 120 extents are 8 MB and all additional extents are 64 MB.

事务绑定UNDO段的原则是尽可能的将活动事务平均的各个undo段上,具体算法为:
1,在当前undo tablespace中的online undo segment中寻找事务表中没有活动事务的undo segment
2,如果没有找到则尝试在当前undo tablespace online 那些处于offline状态的undo segment
3,如果没有找到则尝试在当前undo tablespace创建undo segment并online
4,如果无法创建则会寻找最近最少使用的undo segment

到这一步既然确认了是业务特性高频事务导致undo 段的slot wrap#耗尽,检查一下UNDO段个数,发现当前的undo 表空间为大小1.5G (数据文件自动扩展),判断应该是大量的小事务,如loop 中insert value ommit等, 对于这么高频率的事务的数据库节点,同时因为UNDO段个数少,即使是undo段使用平衡,也会存在耗尽的风险, 如果个别segment的slot wrap#高,可以drop undo segment,但是如果普遍undo段wrap#使用都高,可以先创建新的UNDO表空间,修改数据库默认表空间解决,后期删聊原UNDO表空间。

预防
1, 初始化更多的UNDO段,使用_rollback_segment_count参数

2, 增加巡检项

select b.segment_name, b.tablespace_name 
         ,a.ktuxeusn "Undo Segment Number"
         ,a.ktuxeslt "Slot"
         ,a.ktuxesqn "Wrap#"
   from  x$ktuxe a, dba_rollback_segs b
   where  a.ktuxeusn = b.segment_id 
   and ktuxesqn>power(2,20); -- change
打赏

, ,

对不起,这篇文章暂时关闭评论。