Hardware tsc bug and core dump analyzation

Recently one of the exadata cell node experienced reboot multiple times without any load. This is blocking us releasing the env to customer.

Kernel is 4.1.12-61.47.1.el6uek.x86_64

I analyzed all 6 vmcores and found some common feature, expired timer jiffies are quite small than current jiffies, like below:

TVEC_BASES[2]: ffff881ffda8eb40
JIFFIES
15605264118
EXPIRES TIMER_LIST FUNCTION
4320897356 ffff88003786a408 ffffffff8157d810 <intel_pstate_timer_func>
4320898348 ffff8810af00fc48 ffffffff810ed490 <process_timeout>
...
4320898348 ffff8810c7e63c48 ffffffff810ed490 <process_timeout>
4321197994 ffff881ffda8cee0 ffffffff81045770 <mce_timer_fn>

At first, I can't think of how jiffies jump like this, even with nohz enabled, due to the limit of max idle time, jiffies delta is quite small.

Luckily, I found below log on 2 vmcores, it showed TSC has large jump from its last value and lead to current jiffies jumped large. This print isn't in the other 4 of vmcores, because watchdog timer of clocksource cycle through all cpus(40 cpus in this env), it may not always have opportunity to check the buggy cpu yet before the softlockup.

[11277703.341052] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
[11277703.341054] 'hpet' wd_now: 63896768 wd_last: 636c76a3 mask: ffffffff
[11277703.341056] 'tsc' cs_now: 5a341a9f1c6ea0 cs_last: 2341a8dc79b74 mask: ffffffffffffffff

In __run_timers, there is a loop to process all the expired timers until current jiffies, with a delta of 0x5800001154D32C jiffies the loop time centainly exceed softlockup's timeout. So we see below output:

[ 5062.020370] NMI watchdog: BUG: soft lockup - CPU#22 stuck for 205s! [swapper/22:0]
[ 5062.028896] task: ffff881ff17bd400 ti: ffff881ff17e8000 task.ti: ffff881ff17e8000
[ 5062.028897] RIP: 0010:[<ffffffff810ed290>] [<ffffffff810ed290>] run_timer_softirq+0x1f0/0x380
[ 5062.028923] Call Trace:
[ 5062.028924] <IRQ>
[ 5062.028930] [<ffffffff8157d810>] ? pid_param_set+0x120/0x120
[ 5062.028935] [<ffffffff8108871a>] __do_softirq+0x10a/0x350
[ 5062.028938] [<ffffffff81088ad5>] irq_exit+0x125/0x130
[ 5062.028941] [<ffffffff8105b9d4>] ? native_apic_msr_eoi_write+0x14/0x20
[ 5062.028946] [<ffffffff816a0af6>] smp_apic_timer_interrupt+0x46/0x60
[ 5062.028948] [<ffffffff8169ea3e>] apic_timer_interrupt+0x6e/0x80
[ 5062.028949] <EOI>
[ 5062.028952] [<ffffffff8157e695>] ? cpuidle_enter_state+0xc5/0x230
[ 5062.028954] [<ffffffff8157e665>] ? cpuidle_enter_state+0x95/0x230
[ 5062.028956] [<ffffffff8157e817>] cpuidle_enter+0x17/0x20
[ 5062.028959] [<ffffffff810c8eb4>] cpuidle_idle_call+0xe4/0x1e0
[ 5062.028961] [<ffffffff810c91a5>] cpu_idle_loop+0x1f5/0x220
[ 5062.028962] [<ffffffff810c91eb>] ? cpu_startup_entry+0x1b/0x70
[ 5062.028964] [<ffffffff810c922f>] cpu_startup_entry+0x5f/0x70
[ 5062.028966] [<ffffffff81052dab>] start_secondary+0xbb/0xc0

Finally issue disapper after they replace the buggy processor. This issue should not exist with 4.12 or above as there is a NOHZ optimized implementation for expired timer to survive and drop down TSC source. I also suggest they add 'notsc lapic=notscdeadline' to see if workaround the issue, unluckily they will not try it.

更多精彩内容