Date: Wed, 16 Mar 2016 10:09:46 -0400
From: Don Zickus
To: lizf@kernel.org
Cc: stable@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Zhang, Andrew Morton, Linus Torvalds, Zefan Li
Subject: Re: [PATCH 3.4 098/107] kernel/watchdog.c: touch_nmi_watchdog should only touch local cpu not every one
Message-ID: <20160316140946.GR194535@redhat.com>
References: <1458115541-5712-1-git-send-email-lizf@kernel.org> <1458115601-5762-98-git-send-email-lizf@kernel.org>
In-Reply-To: <1458115601-5762-98-git-send-email-lizf@kernel.org>

On Wed, Mar 16, 2016 at 04:06:32PM +0800, lizf@kernel.org wrote:
> From: Ben Zhang
>
> 3.4.111-rc1 review patch. If anyone has any objections, please let me know.

Just an FYI: this patch won't work the way it was integrated. Comments below.

>
> ------------------
>
>
> commit 62572e29bc530b38921ef6059088b4788a9832a5 upstream.
>
> I ran into a scenario where while one cpu was stuck and should have
> panic'd because of the NMI watchdog, it didn't. The reason was another
> cpu was spewing stack dumps on to the console. Upon investigation, I
> noticed that when writing to the console and also when dumping the
> stack, the watchdog is touched.
>
> This causes all the cpus to reset their NMI watchdog flags and the
> 'stuck' cpu just spins forever.
>
> This change causes the semantics of touch_nmi_watchdog to be changed
> slightly.
> Previously, I accidentally changed the semantics and we
> noticed there was a codepath in which touch_nmi_watchdog could be
> touched from a preemptible area. That caused a BUG() to happen when
> CONFIG_DEBUG_PREEMPT was enabled. I believe it was the acpi code.
>
> My attempt here re-introduces the change to have the
> touch_nmi_watchdog() code only touch the local cpu instead of all of the
> cpus. But instead of using __get_cpu_var(), I use the
> __raw_get_cpu_var() version.
>
> This avoids the preemption problem. However my reasoning wasn't because
> I was trying to be lazy. Instead I rationalized it as, well if
> preemption is enabled then interrupts should be enabled too and the NMI
> watchdog will have no reason to trigger. So it won't matter if the
> wrong cpu is touched because the percpu interrupt counters the NMI
> watchdog uses should still be incrementing.
>
> Don said:
>
> : I'm ok with this patch, though it does alter the behaviour of how
> : touch_nmi_watchdog works. For the most part I don't think most callers
> : need to touch all of the watchdogs (on each cpu). Perhaps a corner case
> : will pop up (the scheduler?? to mimic touch_all_softlockup_watchdogs() ).
> :
> : But this does address an issue where if a system is locked up and one cpu
> : is spewing out useful debug messages (or error messages), the hard lockup
> : will fail to go off. We have seen this on RHEL also.
>
> Signed-off-by: Don Zickus
> Signed-off-by: Ben Zhang
> Signed-off-by: Andrew Morton
> Signed-off-by: Linus Torvalds
> [lizf: Backported to 3.4: adjust context]
> Signed-off-by: Zefan Li
> ---
>  kernel/watchdog.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 991aa93..7527c8c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -162,6 +162,14 @@ void touch_nmi_watchdog(void)
>  				per_cpu(watchdog_nmi_touch, cpu) = true;
>  		}
>  	}

The above for-loop was to be replaced by the non-for-loop below.
The above for-loop is the problem this patch was solving, so keeping it around does not solve anything. :-)

> +	/*
> +	 * Using __raw here because some code paths have
> +	 * preemption enabled. If preemption is enabled
> +	 * then interrupts should be enabled too, in which
> +	 * case we shouldn't have to worry about the watchdog
> +	 * going off.
> +	 */
> +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
>  	touch_softlockup_watchdog();
>  }
>  EXPORT_SYMBOL(touch_nmi_watchdog);

Cheers,
Don
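
To illustrate the objection above, here is a minimal userspace sketch in C. It is not kernel code. It only models the per-cpu touch flag with a plain array, and every name in it (NR_CPUS as a fixed count of 4, nmi_touch, touch_all_cpus, touch_local_cpu, hard_lockup_reported, and the two scenario helpers) is hypothetical. It assumes a simplified watchdog check that skips a cpu whose flag is set and clears the flag, which is roughly what the thread describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical cpu count for this model */

/* Stand-in for the per-cpu watchdog_nmi_touch flag. */
static bool nmi_touch[NR_CPUS];

static void reset_flags(void)
{
	for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
		nmi_touch[cpu] = false;
}

/* Old behaviour: a touch from any cpu marks every cpu as touched. */
void touch_all_cpus(void)
{
	for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
		nmi_touch[cpu] = true;
}

/* Behaviour the upstream commit intends: mark only the calling cpu. */
void touch_local_cpu(size_t this_cpu)
{
	assert(this_cpu < NR_CPUS);
	nmi_touch[this_cpu] = true;
}

/*
 * Simplified watchdog check: a touched cpu has its flag cleared and is
 * skipped for this period; an untouched, stuck cpu is reported.
 */
bool hard_lockup_reported(size_t cpu, bool cpu_is_stuck)
{
	assert(cpu < NR_CPUS);
	if (nmi_touch[cpu]) {
		nmi_touch[cpu] = false;
		return false;
	}
	return cpu_is_stuck;
}

/* Backport as integrated: the loop is kept and the local touch is added. */
bool backport_detects_stuck_cpu(void)
{
	reset_flags();
	touch_all_cpus();	/* cpu 1 prints to the console ... */
	touch_local_cpu(1);	/* ... and also touches itself */
	return hard_lockup_reported(0, true);	/* cpu 0 is stuck */
}

/* Upstream intent: the loop is gone, only the local cpu is touched. */
bool upstream_detects_stuck_cpu(void)
{
	reset_flags();
	touch_local_cpu(1);
	return hard_lockup_reported(0, true);
}
```

In this model, backport_detects_stuck_cpu() returns false because the retained loop still sets the flag of the stuck cpu 0, while upstream_detects_stuck_cpu() returns true. That is the sense in which keeping the for-loop leaves the original problem in place.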