Recently one of the exadata cell node experienced reboot multiple times without any load. This is blocking us releasing the env to customer.
Kernel is 4.1.12-61.47.1.el6uek.x86_64
I analyzed all 6 vmcores and found some common feature, expired timer jiffies are quite small than current jiffies, like below:
TVEC_BASES[2]: ffff881ffda8eb40
JIFFIES
15605264118
EXPIRES TIMER_LIST FUNCTION
4320897356 ffff88003786a408 ffffffff8157d810 <intel_pstate_timer_func>
4320898348 ffff8810af00fc48 ffffffff810ed490 <process_timeout>
...
4320898348 ffff8810c7e63c48 ffffffff810ed490 <process_timeout>
4321197994 ffff881ffda8cee0 ffffffff81045770 <mce_timer_fn>
At first, I can't think of how jiffies jump like this, even with nohz enabled, due to the limit of max idle time, jiffies delta is quite small.
Luckily, I found below log on 2 vmcores, it showed TSC has large jump from its last value and lead to current jiffies jumped large. This print isn't in the other 4 of vmcores, because watchdog timer of clocksource cycle through all cpus(40 cpus in this env), it may not always have opportunity to check the buggy cpu yet before the softlockup.
[11277703.341052] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
[11277703.341054] 'hpet' wd_now: 63896768 wd_last: 636c76a3 mask: ffffffff
[11277703.341056] 'tsc' cs_now: 5a341a9f1c6ea0 cs_last: 2341a8dc79b74 mask: ffffffffffffffff
In __run_timers, there is a loop to process all the expired timers until current jiffies, with a delta of 0x5800001154D32C jiffies the loop time centainly exceed softlockup's timeout. So we see below output:
[ 5062.020370] NMI watchdog: BUG: soft lockup - CPU#22 stuck for 205s! [swapper/22:0]
[ 5062.028896] task: ffff881ff17bd400 ti: ffff881ff17e8000 task.ti: ffff881ff17e8000
[ 5062.028897] RIP: 0010:[<ffffffff810ed290>] [<ffffffff810ed290>] run_timer_softirq+0x1f0/0x380
[ 5062.028923] Call Trace:
[ 5062.028924] <IRQ>
[ 5062.028930] [<ffffffff8157d810>] ? pid_param_set+0x120/0x120
[ 5062.028935] [<ffffffff8108871a>] __do_softirq+0x10a/0x350
[ 5062.028938] [<ffffffff81088ad5>] irq_exit+0x125/0x130
[ 5062.028941] [<ffffffff8105b9d4>] ? native_apic_msr_eoi_write+0x14/0x20
[ 5062.028946] [<ffffffff816a0af6>] smp_apic_timer_interrupt+0x46/0x60
[ 5062.028948] [<ffffffff8169ea3e>] apic_timer_interrupt+0x6e/0x80
[ 5062.028949] <EOI>
[ 5062.028952] [<ffffffff8157e695>] ? cpuidle_enter_state+0xc5/0x230
[ 5062.028954] [<ffffffff8157e665>] ? cpuidle_enter_state+0x95/0x230
[ 5062.028956] [<ffffffff8157e817>] cpuidle_enter+0x17/0x20
[ 5062.028959] [<ffffffff810c8eb4>] cpuidle_idle_call+0xe4/0x1e0
[ 5062.028961] [<ffffffff810c91a5>] cpu_idle_loop+0x1f5/0x220
[ 5062.028962] [<ffffffff810c91eb>] ? cpu_startup_entry+0x1b/0x70
[ 5062.028964] [<ffffffff810c922f>] cpu_startup_entry+0x5f/0x70
[ 5062.028966] [<ffffffff81052dab>] start_secondary+0xbb/0xc0
Finally issue disapper after they replace the buggy processor. This issue should not exist with 4.12 or above as there is a NOHZ optimized implementation for expired timer to survive and drop down TSC source. I also suggest they add 'notsc lapic=notscdeadline' to see if workaround the issue, unluckily they will not try it.