Exadata OS reboot: dev_watchdog call trace shows "dev_deactivate_queue run_timer_softirq cpuidle_enter_state"

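Recently, a node of a customer's Exadata Machine kept rebooting at irregular intervals. There were no error logs at the DB or GI layer; the behavior looked like a sudden power loss or a hang followed by a reboot. The OS messages log showed the following.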

Apr  7 12:49:37 xd08anbob03 kernel: NETDEV WATCHDOG: eth1 (bnxt_en): transmit queue 6 timed out
Apr  7 12:49:37 xd08anbob03 kernel: ------------[ cut here ]------------
Apr  7 12:49:37 xd08anbob03 kernel: WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:334 dev_watchdog+0x228/0x22c
Apr  7 12:49:37 xd08anbob03 kernel: Modules linked in: oracleacfs(PO) oracleadvm(PO) oracleoks(PO) ipmi_poweroff scsi_transport_iscsi ipmi_ssif bonding ib_umad mlx4_en mlx4_ib vfat fat skx_edac intel_powerclamp iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd mlx4_core ipmi_si ipmi_devintf ipmi_msghandler ioatdma pcc_cpufreq i2c_i801 lpc_ich shpchp wmi rds_rdma resilient_rdmaip ib_ipoib dm_multipath rds rdma_ucm rdma_cm iw_cm ib_cm ib_uverbs ib_core binfmt_misc sg fuse ip_tables ext4 mbcache jbd2 fscrypto sd_mod ahci libahci igb bnxt_en megaraid_sas libata crc32c_intel ptp pps_core i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
Apr  7 12:49:37 xd08anbob03 kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O    4.14.35-1902.5.1.2.el7uek.x86_64 #2
Apr  7 12:49:37 xd08anbob03 kernel: Hardware name: Oracle Corporation ORACLE SERVER X8-2/ASM,MB,X8-2, BIOS 51020500 08/26/2019
Apr  7 12:49:37 xd08anbob03 kernel: task: ffff93ec61559ec0 task.stack: ffffaf408026c000
Apr  7 12:49:37 xd08anbob03 kernel: RIP: 0010:dev_watchdog+0x228/0x22c
Apr  7 12:49:37 xd08anbob03 kernel: RSP: 0018:ffff942a00683e58 EFLAGS: 00010246
Apr  7 12:49:37 xd08anbob03 kernel: RAX: 000000000000003b RBX: 0000000000000006 RCX: 0000000000000000
Apr  7 12:49:37 xd08anbob03 kernel: RDX: 0000000000000000 RSI: ffff942a006969c8 RDI: ffff942a006969c8
Apr  7 12:49:37 xd08anbob03 kernel: RBP: ffff942a00683e88 R08: 0000000000000000 R09: 0000000000002daa
Apr  7 12:49:37 xd08anbob03 kernel: R10: 0000000000000004 R11: 0000000000002da9 R12: ffff9409b6705bc0
Apr  7 12:49:37 xd08anbob03 kernel: R13: ffff9409b66f8000 R14: 0000000000000002 R15: 000000000000004a
Apr  7 12:49:37 xd08anbob03 kernel: FS:  0000000000000000(0000) GS:ffff942a00680000(0000) knlGS:0000000000000000
Apr  7 12:49:37 xd08anbob03 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  7 12:49:37 xd08anbob03 kernel: CR2: 00007ffff5ea7060 CR3: 0000007c7540a005 CR4: 00000000007606e0
Apr  7 12:49:37 xd08anbob03 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr  7 12:49:37 xd08anbob03 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr  7 12:49:37 xd08anbob03 kernel: PKRU: 55555554
Apr  7 12:49:37 xd08anbob03 kernel: Call Trace:
Apr  7 12:49:37 xd08anbob03 kernel: 
Apr  7 12:49:37 xd08anbob03 kernel: ? dev_deactivate_queue.constprop.31+0x60/0x59
Apr  7 12:49:37 xd08anbob03 kernel: ? dev_deactivate_queue.constprop.31+0x60/0x59
Apr  7 12:49:37 xd08anbob03 kernel: call_timer_fn+0x3c/0x148
Apr  7 12:49:37 xd08anbob03 kernel: ? dev_deactivate_queue.constprop.31+0x60/0x59
Apr  7 12:49:37 xd08anbob03 kernel: run_timer_softirq+0x20d/0x494
Apr  7 12:49:37 xd08anbob03 kernel: ? timerqueue_add+0x59/0x82
Apr  7 12:49:37 xd08anbob03 kernel: ? ktime_get+0x3e/0x95
Apr  7 12:49:37 xd08anbob03 kernel: __do_softirq+0xd9/0x28d
Apr  7 12:49:37 xd08anbob03 kernel: irq_exit+0xdf/0xe5
Apr  7 12:49:37 xd08anbob03 kernel: smp_apic_timer_interrupt+0x91/0x155
Apr  7 12:49:37 xd08anbob03 kernel: apic_timer_interrupt+0x1c2/0x1c7
Apr  7 12:49:37 xd08anbob03 kernel: 
Apr  7 12:49:37 xd08anbob03 kernel: RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
Apr  7 12:49:37 xd08anbob03 kernel: RSP: 0018:ffffaf408026fe68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Apr  7 12:49:37 xd08anbob03 kernel: RAX: ffff942a006a2c00 RBX: ffffcf0080e85200 RCX: 000000000000001f
Apr  7 12:49:37 xd08anbob03 kernel: RDX: 0000000000000000 RSI: ffffffdb4e228751 RDI: 0000000000000000
Apr  7 12:49:37 xd08anbob03 kernel: RBP: ffffaf408026fea0 R08: 000000000000026a R09: 000000000000035c
Apr  7 12:49:37 xd08anbob03 kernel: R10: 000000000000035c R11: 0000000000000014 R12: 0000000000000003
Apr  7 12:49:37 xd08anbob03 kernel: R13: 0000000000000002 R14: ffffffff8f56f5c0 R15: 001c30a045075215
Apr  7 12:49:37 xd08anbob03 kernel: ? cpuidle_enter_state+0xcc/0x2a5
Apr  7 12:49:37 xd08anbob03 kernel: cpuidle_enter+0x17/0x19
Apr  7 12:49:37 xd08anbob03 kernel: call_cpuidle+0x23/0x3a
Apr  7 12:49:37 xd08anbob03 kernel: do_idle+0x172/0x1d5
Apr  7 12:49:37 xd08anbob03 kernel: cpu_startup_entry+0x73/0x75
Apr  7 12:49:37 xd08anbob03 kernel: start_secondary+0x1b9/0x208
Apr  7 12:49:37 xd08anbob03 kernel: secondary_startup_64+0xa5/0xa5
Apr  7 12:49:37 xd08anbob03 kernel: Code: 60 04 00 00 eb 8f 4c 89 ef c6 05 89 37 eb 00 01 e8 4e ee fc ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 50 ae 26 8f 31 c0 e8 0b 10 9b ff <0f> 0b eb bb 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 49 89 
Apr  7 12:49:37 xd08anbob03 kernel: ---[ end trace f148d0a35db90ed5 ]---
Apr  7 12:49:37 xd08anbob03 kernel: bnxt_en 0000:18:00.0 eth1: TX timeout detected, starting reset task!
Apr  7 12:49:37 xd08anbob03 kernel: bondeth0: link status down for active interface eth1, disabling it in 2000 ms
Apr  7 12:49:37 xd08anbob03 kernel: bondeth0: link status down for active interface eth1, disabling it in 2000 ms
Apr  7 12:49:37 xd08anbob03 kernel: bondeth0: link status down for active interface eth1, disabling it in 2000 ms
Apr  7 12:49:37 xd08anbob03 kernel: bondeth0: link status down for active interface eth1, disabling it in 2000 ms
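
To see whether the transmit-queue timeout above is a one-off or a recurring event, the messages log can be scanned for `NETDEV WATCHDOG` lines. The following is only a rough sketch: the function name `count_tx_timeouts` is made up for illustration, and it assumes the log lines have the same layout as the excerpt above.

```shell
# Count NETDEV WATCHDOG transmit-queue timeouts per interface and driver.
# $1 is a syslog-format file such as /var/log/messages.
# count_tx_timeouts is a hypothetical helper name, not an Exadata tool.
count_tx_timeouts() {
    awk '
        /NETDEV WATCHDOG:/ {
            for (i = 1; i < NF; i++) {
                if ($i == "WATCHDOG:") {
                    dev = $(i + 1)          # e.g. eth1
                    drv = $(i + 2)          # e.g. (bnxt_en):
                    gsub(/[():]/, "", drv)  # strip parentheses and colon
                    count[dev " " drv]++
                }
            }
        }
        END { for (k in count) print k, count[k] }
    ' "$1"
}
```

Called as `count_tx_timeouts /var/log/messages`, it prints one line per interface in the form `interface driver count`.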

Check eth1:

ethtool -m eth1

cat /proc/net/bonding/bondeth0
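
The bonding status file can be long, so a small filter helps when comparing the two slaves. This is a minimal sketch, assuming the usual bonding driver layout with `Slave Interface:`, `MII Status:` and `Link Failure Count:` lines; `bond_slave_summary` is a hypothetical helper name.

```shell
# Print "slave mii_status link_failure_count" for each slave of a bond.
# $1 is a bonding status file such as /proc/net/bonding/bondeth0.
# bond_slave_summary is a hypothetical helper name.
bond_slave_summary() {
    awk -F': ' '
        $1 == "Slave Interface"                   { slave = $2 }
        # Ignore the bond-level "MII Status" that appears before any slave.
        slave != "" && $1 == "MII Status"         { status[slave] = $2 }
        slave != "" && $1 == "Link Failure Count" { print slave, status[slave], $2 }
    ' "$1"
}
```

Run as `bond_slave_summary /proc/net/bonding/bondeth0`. A slave whose link failure count keeps growing is worth checking with `ethtool` first.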

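It appears that a timeout occurred while the bonded NIC was being disabled, which affected IB/iSCSI calls. This is a Linux kernel bug. The recommendation is to patch the kernel to 4.14.35-1902.302.2.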
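
To check whether a node is still running a kernel older than the recommended one, the running release can be compared against 4.14.35-1902.302.2. This is a sketch only: `kernel_older_than` is a hypothetical helper, it relies on GNU `sort -V`, and it assumes that version-sort order matches UEK release order.

```shell
# Succeed (exit 0) when kernel release $1 sorts strictly before release $2.
# kernel_older_than is a hypothetical helper; requires GNU sort with -V.
kernel_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example use on a live node:
#   if kernel_older_than "$(uname -r)" 4.14.35-1902.302.2; then
#       echo "kernel is older than the recommended release"
#   fi
```

For the kernel in the log above, 4.14.35-1902.5.1.2.el7uek.x86_64, the component 5 sorts before 302, so the check succeeds and the node is a candidate for the patch.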
