Exadata (21.2.0) Automatic Recovery from Disk Controller Cache Failure

Recently, a cell node on one of our customers' Exadata systems restarted automatically. The cause was a disk controller problem. After the restart, however, the cell alert log showed ASM disks being dropped automatically, and this was not an intentional human action. This post is a short note on the feature behind that behavior.

A multiple-bit ECC (Error-Correcting Code) error in Oracle Exadata indicates a severe memory corruption event in which more bits are corrupted than the ECC mechanism can correct. It triggers automatic hardware protection mechanisms to prevent data corruption.

Cell Service Impact

  • If the DIMM hosts Exadata processes:

    • cellsrv may crash/restart if using corrupted memory.

    • Grid disks on the cell become temporarily inaccessible (the ASM disk group remains online through partner cells).

  • Automatic Recovery:
    Exadata Storage Server software (cellsrv) auto-restarts and reconnects healthy disks.

Hardware Deactivation

  • The faulty DIMM remains disabled until it is physically replaced.

  • Check status via:

    # On cell node:
    sudo ipmitool sel list  # Check for "Memory" errors
    cellcli -e "LIST PHYSICALDISK DETAIL" | grep -i status  # Affected cache disks (run from the cell's shell; the CellCLI> prompt cannot pipe to grep)

The system may report error messages similar to the following and abort the automatic boot sequence:

“Multibit ECC errors were detected on the RAID controller.
 The DIMM on the controller needs replacement.
 Please contact technical support to resolve this issue.
 If you continue, data corruption can occur.
 Press ‘X’ to continue or else power off the system and replace the DIMM module and reboot. If you have replaced the DIMM press ‘X’ to continue.”

The error messages shown above point to a faulty DIMM (memory module) on the RAID host bus adapter (HBA) installed in a PCIe slot. The DIMM is not a replaceable unit on its own, so the faulty RAID HBA itself must be replaced.

If you press ‘X’ at startup, the system boots into the operating system. However, the cell alert log then shows automatic “Drop griddisk” and “create GRIDDISK” operations:

[MS] Disk controller was hung. Cell was power cycled to restore access to the cell. Timestamp: Fri May 30 06:10:55 CST 2025
2025-05-30T07:58:21.735339+08:00
Drop griddisk DATAC1_CD_00_xxxxxxxxxxxxx (options: force no-erase) - begin 
Drop griddisk DATAC1_CD_00_xxxxxxxxxxxxx - end
2025-05-30T07:58:22.181918+08:00
create GRIDDISK DATAC1_CD_00_xxxxxxxxxxxxx on CELLDISK CD_00_xxxxxxxxxxxxx type 0
GridDisk name=DATAC1_CD_00_xxxxxxxxxxxxx      guid=6bff9df1-9cc4-4163-a606-47574fb82201 ( 516350460) status=GDISK_ACTIVE  
Requesting ASM to do ASM DROP ADD disk for griddisks:
DATAC1_CD_00_xxxxxxxxxxxxx
Published: 1 events ASM DROP ADD disk of opcode 4 for diskgroup DATAC1 to: 
ClientHostName = xxxxx,  ClientPID = 75991
2025-05-30T07:58:22.832645+08:00
Drop griddisk DATAC1_CD_01_xxxxxxxxxxxxx (options: force no-erase) - begin 
Drop griddisk DATAC1_CD_01_xxxxxxxxxxxxx - end
2025-05-30T07:58:23.296279+08:00
create GRIDDISK DATAC1_CD_01_xxxxxxxxxxxxx on CELLDISK CD_01_xxxxxxxxxxxxx type 0
GridDisk name=DATAC1_CD_01_xxxxxxxxxxxxx      guid=9569a9db-e3fd-4946-a65b-398c618c3a11 (3775014332) status=GDISK_ACTIVE  
Requesting ASM to do ASM DROP ADD disk for griddisks:
DATAC1_CD_01_xxxxxxxxxxxxx
Published: 1 events ASM DROP ADD disk of opcode 4 for diskgroup DATAC1 to: 
ClientHostName = xxxxx,  ClientPID = 76635
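When many grid disks are recreated, it helps to condense entries like those above into one record per grid disk. The following is a minimal Python sketch, assuming the alert log lines are worded exactly as in this excerpt; the function name `summarize_recovery` and the regular expressions are illustrative, not part of any Oracle tool, and other releases may format these lines differently.

```python
import re

# Patterns derived from the alert log excerpt in this post (an assumption:
# other Exadata releases may word these lines differently).
DROP_RE = re.compile(r"^Drop griddisk (\S+) \(options: ([^)]*)\) - begin")
CREATE_RE = re.compile(r"^create GRIDDISK (\S+) on CELLDISK (\S+)")
PUBLISHED_RE = re.compile(
    r"^Published: \d+ events ASM DROP ADD disk of opcode \d+ for diskgroup (\S+) to:"
)
REQUEST_PREFIX = "Requesting ASM to do ASM DROP ADD disk for griddisks:"

# Lines copied from the excerpt above (names masked as in the original).
SAMPLE = """\
Drop griddisk DATAC1_CD_00_xxxxxxxxxxxxx (options: force no-erase) - begin
Drop griddisk DATAC1_CD_00_xxxxxxxxxxxxx - end
2025-05-30T07:58:22.181918+08:00
create GRIDDISK DATAC1_CD_00_xxxxxxxxxxxxx on CELLDISK CD_00_xxxxxxxxxxxxx type 0
Requesting ASM to do ASM DROP ADD disk for griddisks:
DATAC1_CD_00_xxxxxxxxxxxxx
Published: 1 events ASM DROP ADD disk of opcode 4 for diskgroup DATAC1 to:
ClientHostName = xxxxx,  ClientPID = 75991
"""


def summarize_recovery(lines):
    """Return {griddisk: {"dropped", "recreated_on", "diskgroup"}}."""
    summary = {}
    requested = []      # grid disks listed after the "Requesting ASM..." line
    collecting = False  # True while reading that list of grid disk names

    def entry(name):
        return summary.setdefault(
            name, {"dropped": False, "recreated_on": None, "diskgroup": None}
        )

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = DROP_RE.match(line)
        if match:
            entry(match.group(1))["dropped"] = True
            collecting = False
            continue
        match = CREATE_RE.match(line)
        if match:
            entry(match.group(1))["recreated_on"] = match.group(2)
            collecting = False
            continue
        if line.startswith(REQUEST_PREFIX):
            requested, collecting = [], True
            continue
        match = PUBLISHED_RE.match(line)
        if match:
            # The "Published" line names the disk group for the listed disks.
            for name in requested:
                entry(name)["diskgroup"] = match.group(1)
            requested, collecting = [], False
            continue
        if collecting:
            requested.append(line)
    return summary


if __name__ == "__main__":
    for name, info in summarize_recovery(SAMPLE.splitlines()).items():
        print(name, info)
```

To run it against a real log, pass the open file object instead of `SAMPLE.splitlines()`. A grid disk that shows `dropped` without a `recreated_on` value or a `diskgroup` would be worth checking by hand.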

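This behavior comes from a feature introduced in Oracle Exadata System Software release 21.2, called Automatic Recovery from Disk Controller Cache Failure. It is not affected by the disk_repair_time parameter: the log shows ASM being asked to drop and re-add the disks immediately, rather than the disks being left offline until a repair timer expires. The Oracle documentation describes the feature as follows: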

6.1.11 Automatic Recovery from Disk Controller Cache Failure

High Capacity (HC) and Extended (XT) Oracle Exadata Storage Server models have a disk controller that contains a write-back cache, which is separate and distinct from the write-back flash cache.

A multiple-bit ECC error on the disk controller cache can crash the server, resulting in the loss of unflushed data in the disk controller cache and leaving stale data on the disks. If not handled carefully, such errors are severe and can result in data loss.

Oracle Exadata System Software release 21.2.0 supports automatic recovery following a disk controller cache failure. Where possible, this feature automatically detects the problem and copies data from other mirrors to recover all of the disks in the failed server.

Minimum requirements:

  • Oracle Exadata System Software release 21.2.0
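To check whether a given cell meets this minimum, the software version can be compared against 21.2.0. The sketch below assumes the version is available as a dot-separated numeric string (for example, as reported by `imageinfo -ver` on the cell); the helper names and the sample version strings are illustrative only.

```python
MINIMUM = (21, 2, 0)  # release that introduced the feature, per the text above


def parse_version(text):
    """Turn a dot-separated version such as '21.2.0.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_auto_recovery(version_text):
    """True if the first three version components are at least 21.2.0."""
    parts = parse_version(version_text)
    # Pad short versions such as '21.2' so the tuple comparison is fair.
    major_minor_patch = (parts + (0, 0, 0))[:3]
    return major_minor_patch >= MINIMUM


if __name__ == "__main__":
    # Illustrative version strings, not taken from a real system.
    for version in ("21.2.0.0.0", "20.1.9.0.0", "21.2"):
        print(version, supports_auto_recovery(version))
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating "9.x" as newer than "21.x".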
