From: Joern Engel <[email protected]>
We have observed cases where the soft lockup detector triggered, but no
kernel bug existed. Instead we had a buggy realtime thread that
monopolized a cpu. So let's kill the responsible party and not panic
the entire system.
Signed-off-by: Joern Engel <[email protected]>
Signed-off-by: Spencer Baugh <[email protected]>
---
kernel/watchdog.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index a6ffa43..2355bd5 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -428,7 +428,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
}
add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
- if (softlockup_panic)
+ if (rt_prio(current->prio)) {
+ pr_emerg("killing realtime thread\n");
+ send_sig(SIGILL, current, 0);
+ } else if (softlockup_panic)
panic("softlockup: hung tasks");
__this_cpu_write(soft_watchdog_warn, true);
} else
--
2.4.3
On Tue, 2015-07-21 at 15:07 -0700, Spencer Baugh wrote:
> We have observed cases where the soft lockup detector triggered, but no
> kernel bug existed. Instead we had a buggy realtime thread that
> monopolized a cpu. So let's kill the responsible party and not panic
> the entire system.
If you don't tell the kernel to panic, it won't, and if you don't remove
its leash (the throttle), your not so tame rt beast won't maul you.
-Mike
On Wed, Jul 22, 2015 at 06:36:30AM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-21 at 15:07 -0700, Spencer Baugh wrote:
>
> > We have observed cases where the soft lockup detector triggered, but no
> > kernel bug existed. Instead we had a buggy realtime thread that
> > monopolized a cpu. So let's kill the responsible party and not panic
> > the entire system.
>
> If you don't tell the kernel to panic, it won't, and if you don't remove
> its leash (the throttle), your not so tame rt beast won't maul you.
Not sure if this patch is something for mainline, but those two
alternatives have problems of their own. Not panicking on lockups can
leave a system disabled until some human come around. In many cases
that human will do no better than power-cycle. A panic reduces the
downtime.
And the realtime throttling gives non-realtime threads some minimum
runtime, but does nothing to help low-priority realtime threads. It
also introduces latencies, often when workloads are high and you would
like any available cpu to get through that rough spot.
I don't think we have a good answer to this problem in the mainline
kernel yet.
J?rn
--
To announce that there must be no criticism of the President, or that we
are to stand by the President, right or wrong, is not only unpatriotic
and servile, but is morally treasonable to the American public.
-- Theodore Roosevelt, Kansas City Star, 1918
On Tue, 2015-07-21 at 22:18 -0700, Jörn Engel wrote:
> On Wed, Jul 22, 2015 at 06:36:30AM +0200, Mike Galbraith wrote:
> > On Tue, 2015-07-21 at 15:07 -0700, Spencer Baugh wrote:
> >
> > > We have observed cases where the soft lockup detector triggered, but no
> > > kernel bug existed. Instead we had a buggy realtime thread that
> > > monopolized a cpu. So let's kill the responsible party and not panic
> > > the entire system.
> >
> > If you don't tell the kernel to panic, it won't, and if you don't remove
> > its leash (the throttle), your not so tame rt beast won't maul you.
>
> Not sure if this patch is something for mainline, but those two
> alternatives have problems of their own. Not panicking on lockups can
> leave a system disabled until some human come around. In many cases
> that human will do no better than power-cycle. A panic reduces the
> downtime.
If a realtime task goes bonkers, the realtime game is over, you're down.
> And the realtime throttling gives non-realtime threads some minimum
> runtime, but does nothing to help low-priority realtime threads. It
> also introduces latencies, often when workloads are high and you would
> like any available cpu to get through that rough spot.
You can use group scheduling as a debug crutch until the little beasts
are housebroken.
> I don't think we have a good answer to this problem in the mainline
> kernel yet.
IMHO, there's no point in trying to make rt warm/fuzzy/cuddly. Just
don't stuff a Hells Angel into a super-suit, that gets real ugly ;-)
-Mike
On Wed, Jul 22, 2015 at 07:41:48AM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-21 at 22:18 -0700, J?rn Engel wrote:
> >
> > Not sure if this patch is something for mainline, but those two
> > alternatives have problems of their own. Not panicking on lockups can
> > leave a system disabled until some human come around. In many cases
> > that human will do no better than power-cycle. A panic reduces the
> > downtime.
>
> If a realtime task goes bonkers, the realtime game is over, you're down.
Agreed. But a reboot will often solve the issue. So the automatic
panic will repair the system within minutes, while no panic will leave
the system broken for days, depending on human response time. Automatic
panic is a great way to minimize downtime - or vulnerable time if you
have HA.
One could argue that killing the realtime thread is even better than
panic, as things can restart with a blank slate even faster. But the
real benefit is that we get better debug data for the failing component.
If we had a kernel bug, the backtrace would usually be sufficient to
point fingers. With a bonkers realtime thread, not so much.
Anyway, this patch has been useful to us once. If someone deems it
merge-worthy, great. If not, I won't lose any sleep either.
J?rn
--
The key to performance is elegance, not battalions of special cases.
-- Jon Bentley and Doug McIlroy
> On Jul 22, 2015, at 06:07, Spencer Baugh <[email protected]> wrote:
>
> From: Joern Engel <[email protected]>
>
> We have observed cases where the soft lockup detector triggered, but no
> kernel bug existed. Instead we had a buggy realtime thread that
> monopolized a cpu. So let's kill the responsible party and not panic
> the entire system.
>
> Signed-off-by: Joern Engel <[email protected]>
> Signed-off-by: Spencer Baugh <[email protected]>
> ---
> kernel/watchdog.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index a6ffa43..2355bd5 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -428,7 +428,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> }
>
> add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> - if (softlockup_panic)
> + if (rt_prio(current->prio)) {
> + pr_emerg("killing realtime thread\n");
> + send_sig(SIGILL, current, 0);
> + } else if (softlockup_panic)
> panic("softlockup: hung tasks");
> __this_cpu_write(soft_watchdog_warn, true);
> } else
> --
just my advice about this patch,
i think should add PF_KTHREAD condition like this:
if (rt_prio(current->prio) && !(current->flags & PF_KTHREAD)) {
+ pr_emerg("killing realtime thread\n");
+ send_sig(SIGILL, current, 0);
+ } else if (softlockup_panic)
if soft lockup is caused by kthread, should still panic .
Thanks
On Tue, 2015-07-21 at 23:33 -0700, Jörn Engel wrote:
> One could argue that killing the realtime thread is even better than
> panic, as things can restart with a blank slate even faster. But the
> real benefit is that we get better debug data for the failing component.
> If we had a kernel bug, the backtrace would usually be sufficient to
> point fingers. With a bonkers realtime thread, not so much.
If userspace wants a watchdog, it should train a userspace dog, not turn
the kernel watchdog into a userspace attack dog.
-Mike
On Wed, Jul 22, 2015 at 09:35:28AM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-21 at 23:33 -0700, J?rn Engel wrote:
>
> > One could argue that killing the realtime thread is even better than
> > panic, as things can restart with a blank slate even faster. But the
> > real benefit is that we get better debug data for the failing component.
> > If we had a kernel bug, the backtrace would usually be sufficient to
> > point fingers. With a bonkers realtime thread, not so much.
>
> If userspace wants a watchdog, it should train a userspace dog, not turn
> the kernel watchdog into a userspace attack dog.
I agree. The spirit of the watchdog was detection and panic (if configured
that way). I don't think adding policy like this works well in the long
run.
Cheers,
Don
On Wed, Jul 22, 2015 at 09:35:28AM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-21 at 23:33 -0700, J?rn Engel wrote:
>
> > One could argue that killing the realtime thread is even better than
> > panic, as things can restart with a blank slate even faster. But the
> > real benefit is that we get better debug data for the failing component.
> > If we had a kernel bug, the backtrace would usually be sufficient to
> > point fingers. With a bonkers realtime thread, not so much.
>
> If userspace wants a watchdog, it should train a userspace dog, not turn
> the kernel watchdog into a userspace attack dog.
Fair point. Let's drop this patch then.
J?rn
--
Money can buy bandwidth, but latency is forever.
-- John R. Mashey
On Tue, 21 Jul 2015 15:07:57 -0700 Spencer Baugh <[email protected]> wrote:
> From: Joern Engel <[email protected]>
>
> We have observed cases where the soft lockup detector triggered, but no
> kernel bug existed. Instead we had a buggy realtime thread that
> monopolized a cpu. So let's kill the responsible party and not panic
> the entire system.
>
> ...
>
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -428,7 +428,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> }
>
> add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> - if (softlockup_panic)
> + if (rt_prio(current->prio)) {
> + pr_emerg("killing realtime thread\n");
> + send_sig(SIGILL, current, 0);
Why choose SIGILL?
> + } else if (softlockup_panic)
> panic("softlockup: hung tasks");
> __this_cpu_write(soft_watchdog_warn, true);
But what about a non-buggy realtime thread which happens to
occasionally spend 15 seconds doing stuff?
Old behaviour: kernel blurts a softlockup message, everything keeps running.
New behaviour: thread gets killed, plane crashes.
Possibly a better approach would be to only kill the thread if
softlockup_panic was set, because the system is going down anyway.
Also, perhaps some users would prefer that the kernel simply suppress
the softlockup warning in this situation, rather than killing stuff!
Really, what you're trying to implement here is a watchdog for runaway
realtime threads. And that sounds a worthy project but it's a rather
separate thing from the softlockup detector. A realtime thread
watchdog feature might have things as
- timeout duration separately configurable from softlockup
- enabled independently from sotflockup: people might want one and
not the other.
- configurable signal, perhaps?
Now, the *implementation* of the realtime thread watchdog may well
share code with the softlockup detector. But from a
conceptual/configuration/documentation point of view, it's a separate
thing, no?
On Wed, Jul 22, 2015 at 03:54:36PM -0700, Andrew Morton wrote:
> On Tue, 21 Jul 2015 15:07:57 -0700 Spencer Baugh <[email protected]> wrote:
>
> > From: Joern Engel <[email protected]>
> >
> > We have observed cases where the soft lockup detector triggered, but no
> > kernel bug existed. Instead we had a buggy realtime thread that
> > monopolized a cpu. So let's kill the responsible party and not panic
> > the entire system.
> >
> > ...
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -428,7 +428,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> > }
> >
> > add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> > - if (softlockup_panic)
> > + if (rt_prio(current->prio)) {
> > + pr_emerg("killing realtime thread\n");
> > + send_sig(SIGILL, current, 0);
>
> Why choose SIGILL?
It is a random signal that happens to generate a stacktrace in
userspace.
> > + } else if (softlockup_panic)
> > panic("softlockup: hung tasks");
> > __this_cpu_write(soft_watchdog_warn, true);
>
> But what about a non-buggy realtime thread which happens to
> occasionally spend 15 seconds doing stuff?
>
> Old behaviour: kernel blurts a softlockup message, everything keeps running.
>
> New behaviour: thread gets killed, plane crashes.
>
>
> Possibly a better approach would be to only kill the thread if
> softlockup_panic was set, because the system is going down anyway.
>
> Also, perhaps some users would prefer that the kernel simply suppress
> the softlockup warning in this situation, rather than killing stuff!
>
>
>
>
> Really, what you're trying to implement here is a watchdog for runaway
> realtime threads. And that sounds a worthy project but it's a rather
> separate thing from the softlockup detector. A realtime thread
> watchdog feature might have things as
>
> - timeout duration separately configurable from softlockup
>
> - enabled independently from sotflockup: people might want one and
> not the other.
>
> - configurable signal, perhaps?
>
> Now, the *implementation* of the realtime thread watchdog may well
> share code with the softlockup detector. But from a
> conceptual/configuration/documentation point of view, it's a separate
> thing, no?
Agreed. We needed this patch exactly once and it is a rather quick hack
that yielded the necessary results. Realtime threads were well-behaved
since and the patch has seen zero polish as a result.
I think it is better to drop the patch for now. If someone else keeps
running into the same issue, it might be a starting point for a better
implementation. They will find it in list archives.
J?rn
--
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000