Date: Wed, 22 Jul 2015 15:54:36 -0700
From: Andrew Morton
To: Spencer Baugh
Cc: Don Zickus, Ulrich Obergfell, Ingo Molnar, Andrew Jones, chai wen, Chris Metcalf, Stephane Eranian, linux-kernel@vger.kernel.org (open list), Joern Engel, Spencer Baugh, Joern Engel
Subject: Re: [PATCH] soft lockup: kill realtime threads before panic
Message-Id: <20150722155436.04d66934cd423107b810f2b1@linux-foundation.org>
In-Reply-To: <1437516477-30554-5-git-send-email-sbaugh@catern.com>
References: <1437516477-30554-5-git-send-email-sbaugh@catern.com>

On Tue, 21 Jul 2015 15:07:57 -0700 Spencer Baugh wrote:

> From: Joern Engel
>
> We have observed cases where the soft lockup detector triggered, but no
> kernel bug existed.  Instead we had a buggy realtime thread that
> monopolized a cpu.  So let's kill the responsible party and not panic
> the entire system.
>
> ...
>
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -428,7 +428,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  	}
>
>  	add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> -	if (softlockup_panic)
> +	if (rt_prio(current->prio)) {
> +		pr_emerg("killing realtime thread\n");
> +		send_sig(SIGILL, current, 0);

Why choose SIGILL?

> +	} else if (softlockup_panic)
>  		panic("softlockup: hung tasks");
>  	__this_cpu_write(soft_watchdog_warn, true);

But what about a non-buggy realtime thread which happens to occasionally
spend 15 seconds doing stuff?  Old behaviour: kernel blurts a softlockup
message, everything keeps running.  New behaviour: thread gets killed,
plane crashes.

Possibly a better approach would be to only kill the thread if
softlockup_panic was set, because the system is going down anyway.

Also, perhaps some users would prefer that the kernel simply suppress
the softlockup warning in this situation, rather than killing stuff!

Really, what you're trying to implement here is a watchdog for runaway
realtime threads.  And that sounds a worthy project but it's a rather
separate thing from the softlockup detector.  A realtime thread
watchdog feature might have things such as

- timeout duration separately configurable from softlockup

- enabled independently from softlockup: people might want one and not
  the other.

- configurable signal, perhaps?

Now, the *implementation* of the realtime thread watchdog may well share
code with the softlockup detector.  But from a
conceptual/configuration/documentation point of view, it's a separate
thing, no?
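
To make the two suggestions above concrete, here is a rough userspace
model of the decision logic only.  It is a sketch, not kernel code:
struct rt_watchdog_cfg, rt_watchdog_check() and softlockup_decide() are
hypothetical names that do not exist in kernel/watchdog.c, and the
booleans stand in for rt_prio(current->prio) and the softlockup_panic
setting.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical knobs for a realtime thread watchdog, kept separate
 * from the softlockup detector's own settings.  One field per bullet
 * point in the mail above.
 */
struct rt_watchdog_cfg {
	bool enabled;           /* on/off independently of softlockup */
	unsigned int timeout_s; /* own timeout, not the softlockup one */
	int signo;              /* signal to deliver, e.g. SIGKILL */
};

enum rt_watchdog_action {
	RT_WD_NONE,   /* leave the task alone */
	RT_WD_SIGNAL, /* deliver cfg->signo to the task */
};

/* Decide what to do about a task that has hogged a CPU for stuck_s seconds. */
enum rt_watchdog_action rt_watchdog_check(const struct rt_watchdog_cfg *cfg,
					  bool task_is_rt, unsigned int stuck_s)
{
	assert(cfg != NULL);

	if (!cfg->enabled || !task_is_rt)
		return RT_WD_NONE;
	if (stuck_s < cfg->timeout_s)
		return RT_WD_NONE;
	return RT_WD_SIGNAL;
}

enum softlockup_action {
	SL_WARN,    /* old behaviour: print the warning, keep running */
	SL_KILL_RT, /* kill the realtime hog instead of panicking */
	SL_PANIC,   /* panic as before */
};

/*
 * Model of the "only kill if softlockup_panic was set" variant: a
 * realtime task is only killed in the case where the system would
 * otherwise have gone down anyway.
 */
enum softlockup_action softlockup_decide(bool task_is_rt, bool softlockup_panic)
{
	if (!softlockup_panic)
		return SL_WARN;
	return task_is_rt ? SL_KILL_RT : SL_PANIC;
}
```

With softlockup_panic clear, softlockup_decide() never kills anything,
so the occasional 15-second realtime thread keeps the old behaviour.
The separate rt_watchdog_cfg is what lets people enable one mechanism
and not the other.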