首页 » ORACLE » Troubleshooting database crash and HPUX-ia64 Error: 11 caused by IO error( SFP 光衰 )

Troubleshooting database crash and HPUX-ia64 Error: 11 caused by IO error( SFP 光衰 )

SFP

一个2节点RAC on HPUX平台的数据库实例1突然因为I/O 错误,数据文件读写失败,自动重启后不久再次因为Vating Disk不可用crs crash, 另一节点正常。后暂时停止问题节点,排查硬件的底层环境问题,确认数据库主机节点的SFP(Small Form-factor Pluggable)光衰减严重,导致数据传输丢帧,频繁重发数据帧导致数据库存储I/O延迟过高(58ms-100s)或间断,数据库IO错误。
Db alert log

# db alert log
2019-08-19 11:11:28.886000 +08:00
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:2 disk:41 AU:273629 offset:393216 size:8192
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:2 disk:39 AU:254967 offset:98304 size:8192
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:2 disk:29 AU:241881 offset:229376 size:8192
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:2 disk:5 AU:418940 offset:753664 size:8192
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 4540 logical extent 0 of file 908 in group 2 on disk 5 allocation unit 418940
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 29563 logical extent 0 of file 573 in group 2 on disk 29 allocation unit 241881
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 29658 logical extent 0 of file 590 in group 2 on disk 39 allocation unit 254967
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 27522 logical extent 0 of file 650 in group 2 on disk 41 allocation unit 273629
KCF: read, write or open error, block=0x35c130 online=1
        file=426 '+DATADG/prianbob/datafile/ryya_lv30_249'
        error=15081 txt: ''
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc:
ORA-63999: data file suffered media failure
ORA-01114: IO error writing block to file 426 (block # 3522864)
ORA-01114: data file 426: '+DATADG/prianbob/datafile/ryya_lv30_249'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
System state dump requested by (instance=1, osid=9393 (DBW9)), summary=[abnormal instance termination].


Mon Aug 19 11:38:20 2019
ORA-1092 : opitsk aborting process
Mon Aug 19 11:38:25 2019
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 16652
Mon Aug 19 11:39:49 2019
Starting ORACLE instance (normal)
Mon Aug 19 11:40:01 2019

# gi alert log

2019-08-19 11:34:51.282
[cssd(27396)]CRS-1649:An I/O error occured for voting file: /dev/rdisk/disk446; details at (:CSSNM00060:) in /oracle/app/11.2.0.3/grid/log/anbob1/cssd/ocssd.log.
2019-08-19 11:34:51.282
[cssd(27396)]CRS-1649:An I/O error occured for voting file: /dev/rdisk/disk444; details at (:CSSNM00060:) in /oracle/app/11.2.0.3/grid/log/anbob1/cssd/ocssd.log.
2019-08-19 11:34:51.282
[cssd(27396)]CRS-1649:An I/O error occured for voting file: /dev/rdisk/disk445; details at (:CSSNM00060:) in /oracle/app/11.2.0.3/grid/log/anbob1/cssd/ocssd.log.
2019-08-19 11:36:32.114
[cssd(27396)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/disk444 will be considered not functional in 99149 milliseconds
2019-08-19 11:36:32.114
[cssd(27396)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/disk445 will be considered not functional in 99052 milliseconds
2019-08-19 11:36:32.114
[cssd(27396)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/disk446 will be considered not functional in 99148 milliseconds
2019-08-19 11:37:21.604
[cssd(27396)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rdisk/disk444 will be considered not functional in 49659 milliseconds
2019-08-19 11:37:21.605
[cssd(27396)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rdisk/disk445 will be considered not functional in 49562 milliseconds
2019-08-19 11:37:21.605
[cssd(27396)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rdisk/disk446 will be considered not functional in 49657 milliseconds
2019-08-19 11:37:51.904
[cssd(27396)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rdisk/disk444 will be considered not functional in 19359 milliseconds
2019-08-19 11:37:51.905
[cssd(27396)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rdisk/disk445 will be considered not functional in 19262 milliseconds
2019-08-19 11:37:51.905
[cssd(27396)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rdisk/disk446 will be considered not functional in 19357 milliseconds

Note:
数据库报 ORA-27063 、 HPUX-ia64 Error: 11: Resource temporarily unavailable、 ORA-15080  、ORA-63999 、ORA-01114 、 ORA-15081 等IO错误, 确认是的OS Layer一定存在IO异常, 后来VD 访问也出现了200 秒的 timeout 访问超时。

在LINUX 平台,多路径软件可能存在BUG, 如OEL平台的multipath bug 1344381 ,这个案例是HPUX平台,使用的是HP 3PAR存储。主机上未做任何调整,因此需要排查存储链路, 因为共享存储的另一个RAC 实例正常,很有可能出在节点1主机到存储之间。

存储链路上的故障点有, 如果有级联交换机会更复杂。

HBA SFP — cable –Switch SFP — Switch SFP — cable — Storage SFP

如果从存储和操作系统日志无法定位故障点,可以试试supportshow的信息定位。

ANBOB_1:FID128:admin> porterrshow -i 65
           frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout    pcs
        tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                  tx    rx     err
 65:  823.5m   1.8g   1.4k   1.4k   1.4k   0      0     16    521.0k  10.9k   0      0      0      0      0     10.9k   0      0   
 
ANBOB_1:FID128:admin> porterrshow -i 79
           frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout    pcs
        tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                  tx    rx     err
 79:    1.8g 823.6m   0      0      0      0      0      0      0      2.8k   0      0      0      0      0      0      2.8k   0   
 
 
ANBOB_1:FID128:admin> supportshow

 portshow 65
portDisableReason: None
portCFlags: 0x0
portFlags: 0x4021        PRESENT U_PORT DISABLED LED
LocalSwcFlags: 0x0
portType:  17.0
portState: 2    Offline  
Protocol: FC
portPhys:  5    No_Sync         portScn:   2    Offline   
port generation number:    22400
state transition count:    35         

portId:    014100
portIfId:    43220039
portWwn:   20:41:00:27:f8:33:95:00
portWwn of device(s) connected:
        None
Distance:  normal
portSpeed: N8Gbps

Credit Recovery: Inactive
LE domain: 0
Peer beacon: Off
FC Fastwrite: OFF
Interrupts:        0          Link_failure: 0          Frjt:         0          
Unknown:           0          Loss_of_sync: 0          Fbsy:         0          
Lli:               80         Loss_of_sig:  0          
Proc_rqrd:         267        Protocol_err: 0          
Timed_out:         0          Invalid_word: 522566     
Rx_flushed:        0          Invalid_crc:  1414       
Tx_unavail:        0          Delim_err:    0          
Free_buffer:       0          Address_err:  0          
Overrun:           0          Lr_in:        0          
Suspended:         0          Lr_out:       40         
Parity_err:        0          Ols_in:       0          
2_parity_err:      0          Ols_out:      0          
CMI_bus_err:       0          

下面从操作系统上检查光纤通信, 这里记录一下不同OS检查光纤设备的方法

 OS               列出HP机上连接的光纤卡设备             查看光纤卡的WWN号 
----------       ------------------------             ---------------------------
HPUX             ioscan -kfnNCfc                        fcmsutil /dev/fcd0
Solars	         luxadm probe ** or **luxadm -e port			luxadm -e dump_map /devices/pci@0,0/pci1022,7450@2/pci1077,101@1/fp@0,0:devctl
AIX	         lsdev -Cc adapter -S a | grep fcs		lscfg -vpl fcs0
Linux	         lspci -nn| grep "Fibre Channel"** or ** systool -c fc_host    systool -c fc_host -v host3 ** or **more /sys/class/fc_host/host?/port_name

光纤收发器最重要的两个因素:Tx功率和Rx功率。光学Tx功率是离开该设备的信号电平,它应该在发射机功率范围内。Rx功率是从远端设备接收的输入信号电平,它应该落在接收功率范围内。
dBm(毫瓦分贝)与mW(毫瓦)都是光功率的单位。
两者之间的换算关系:dBm=10xlgP,其中P为功率,单位为mW。

# ioscan -kfnNCfc
Class     I  H/W Path        Driver S/W State   H/W Type     Description
======================================================================
fc        0  43/0/0/2/0/0/0  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
                            /dev/fcd0
fc        1  43/0/0/2/0/0/1  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)
                            /dev/fcd1
fc        2  43/0/2/0/0/0/0  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
                            /dev/fcd2
fc        3  43/0/2/0/0/0/1  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)
                            /dev/fcd3
fc        4  43/0/2/2/0/0/0  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
                            /dev/fcd4
fc        5  43/0/2/2/0/0/1  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)
                            /dev/fcd5
fc        6  45/0/0/2/0/0/0  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
                            /dev/fcd6
fc        7  45/0/0/2/0/0/1  fcd   CLAIMED     INTERFACE    HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)

# fcmsutil /dev/fcd7 sfp
ADDRESS 0xA0:
                Identifier: SFP
           Ext. Identifier: GBIC/SFP function is defined by serial ID only
                 Connector: LC
               Transceiver: 0x0 0x0 0x0 0x0 0x40 0x40 0xc 0x54
                  Encoding: 8B10B
               BR, Nominal: 8500 Mbits/sec.
              Length (9um): 0 km
              Length (9um): 0 m
             Length (50um): 50 m
           Length (62.5um): 20 m
           Length (Copper): 19 m
               Vendor Name: HP-F            
                Vendor OUI: 0x009065
                 Vendor PN: AJ718A          
                Vendor rev: A   
                Wavelength: 0x0352
                   CC_BASE: 0x92
                   Options: -RATE SELECT is implemented: 1
                            -TX_DISABLE implemented & disables serial output: 1
                            -TX_FAULT is implemented. (See SFP MSA): 1
                            -Loss of Signal, inverted from standard in SFP MSA:0
                            -Loss of Signal, signal as defined in SFP MSA: 1
                   BR, max: 0%
                   BR, min: 0%
                 Vendor SN: MY84250LUJ      
        Date Code (yymmdd): 140620  
Diagnostic Monitoring Type: -Digital diagnostic monitoring implemented: 1
                            -Internally Calibrated: 1
                            -Externally Calibrated: 0
                            -Received power measurement type: 1 (Average Power)
                            -Address change required: 0
          Enhanced Options: -Alarm/warning flags implemented: 1
                            -Soft TX_DISABLE control and monitoring: 1
                            -Soft TX_DISABLE monitoring: 1
                            -Soft RX_LOS monitoring: 1
                            -Soft RATE_SELECT control and monitoring: 1
                            -Application Select control per SFF-8079: 0
       SFF-8472 Compliance: 0x3
                    CC_EXT: 0x60
           Vendor Specific: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                            0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                            0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                            0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                           
ADDRESS 0xA2:
           Temp High Alarm: 90.000 C
            Temp Low Alarm: -10.000 C
         Temp High Warning: 85.000 C
          Temp Low Warning: -5.000 C
        Voltage High Alarm: 3.70 Volts
         Voltage Low Alarm: 2.90 Volts
      Voltage High Warning: 3.60 Volts
       Voltage Low Warning: 3.00 Volts
           Bias High Alarm: 17.00 mA
            Bias Low Alarm: 1.00 mA
         Bias High Warning: 14.00 mA
          Bias Low Warning: 2.00 mA
       TX Power High Alarm: 0.63 mW
        TX Power Low Alarm: 0.13 mW
     TX Power High Warning: 0.56 mW
      TX Power Low Warning: 0.16 mW
       RX Power High Alarm: 1.26 mW
        RX Power Low Alarm: 0.01 mW
     RX Power High Warning: 1.00 mW
      RX Power Low Warning: 0.02 mW
     Checksum (bytes 0-94): Valid
               Temperature: 31.254 C
                       Vcc: 3.34 Volts
                   TX Bias: 8.04 mA
                  TX Power: 0.48 mW
                  RX Power: 0.36 mW
       Status/Control Bits: -TX Disable State: 0
                            -Soft TX Disable: 0
                            -Rate Select State: 0
                            -Soft Rate Select: 0
                            -TX Fault: 0
                            -LOS: 0
                            -Data_Ready_Bar: 0
                            -Temp High Alarm: 0
                            -Temp Low Alarm: 0
                            -Vcc High Alarm: 0
                            -Vcc Low Alarm: 0
                            -TX Bias High Alarm: 0
                            -TX Bias Low Alarm: 0
                            -TX Power High Alarm: 0
                            -TX Power Low Alarm: 0
                            -RX Power High Alarm: 0
                            -RX Power Low Alarm: 0
                            -Temp High Warning: 0
                            -Temp Low Warning: 0
                            -Vcc High Warning: 0
                            -Vcc Low Warning: 0
                            -TX Bias High Warning: 0
                            -TX Bias Low Warning: 0
                            -TX Power High Warning: 0
                            -TX Power Low Warning: 0
                            -RX Power High Warning: 0
                            -RX Power Low Warning: 0
          Vendor Specific::  0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x01

Tip:

问题时间点主机测的SFP 光功率低到了0.2mW, 因为存储有多路径,可以暂时禁用该链路,更换SFP 后,恢复正常。

打赏

, , , , , ,

对不起,这篇文章暂时关闭评论。