2019-07-26 20:14:17

by Chris Friesen

Subject: [RT] hit recently-fixed PREEMPT_RT CFS-bandwidth timer locking issue in the wild

Hi all,

I thought people might be interested to hear that we recently hit the
bug fixed by git commit c0ad4aa4d8 on multiple lab systems running the
RHEL 7 "kernel-rt" kernel. (But I think other versions are at risk as
well.)

Interestingly, when the bug hit, the system just hung completely. Nothing
was emitted on netconsole or the serial console, neither the hung task
timer nor the NMI watchdog triggered, CONFIG_DEBUG_SPINLOCK didn't output
anything, and magic sysrq didn't work on the serial console. As you can
imagine, this was a bit frustrating. I was finally able to cause a panic
by sending an NMI from the BMC, and that allowed kdump to store the core
file so I could get stack traces.

Given how annoying it was to debug, I'd recommend backporting this fix
as far back as it applies. HRTIMER_MODE_SOFT was introduced in mainline
in 4.16, but at least in the RHEL 7 kernel-rt package (and I think in
the vanilla PREEMPT_RT patches as well) hrtimers are run in softirq
context by default, so the fix might apply to all supported PREEMPT_RT
versions.

Chris