Date: Tue, 20 Jun 2017 15:03:59 -0700
From: Andrew Morton
To: kan.liang@intel.com
Cc: linux-kernel@vger.kernel.org, dzickus@redhat.com, mingo@kernel.org,
    babu.moger@oracle.com, atomlin@redhat.com, prarit@redhat.com,
    torvalds@linux-foundation.org, peterz@infradead.org, tglx@linutronix.de,
    eranian@google.com, acme@redhat.com, ak@linux.intel.com, Kan Liang,
    stable@vger.kernel.org
Subject: Re: [PATCH] kernel/watchdog: fix spurious hard lockups
Message-Id: <20170620150359.0fbb417aed72c84ac6ad8498@linux-foundation.org>
In-Reply-To: <20170620213309.30051-1-kan.liang@intel.com>
References: <20170620213309.30051-1-kan.liang@intel.com>

On Tue, 20 Jun 2017 14:33:09 -0700 kan.liang@intel.com wrote:

> From: Kan Liang
>
> Some users reported spurious NMI watchdog timeouts.
>
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
>
> This problem was originally added by commit 58687acba592
> ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> Previously the NMI watchdog would always check jiffies, which were
> ticking fast enough. But now the backing is quite slow so the expire
> time becomes more sensitive.
>
> For mainline the right fix is to switch the NMI watchdog to reference
> cycles, which tick always at the same rate independent of turbo mode.
> But this requires some complicated changes in perf, which are too
> difficult to backport. Since we need a stable fix too, just increase the
> NMI watchdog rate here to avoid the spurious timeouts. This is not an
> ideal fix because a 3x as large Turbo range could still fail, but for
> now that's not likely.
>
> ...
>
> The right fix for mainline can be found here.
> perf/x86/intel: enable CPU ref_cycles for GP counter
> perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
> https://patchwork.kernel.org/patch/9779087/
> https://patchwork.kernel.org/patch/9779089/

Presumably the "right fix" will later be altered to revert this one-line
workaround?
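
For illustration, the failure mode described in the quoted changelog can
be checked with simple arithmetic. The sketch below is not kernel code.
It assumes the NMI watchdog period is programmed as a fixed number of CPU
cycles computed at the nominal frequency, so that in wall-clock terms the
period shrinks by the turbo ratio. The function names, the 10-second
threshold and the 4-second soft timer interval are hypothetical values
chosen for the example and do not come from this thread.

```python
def nmi_period_seconds(watchdog_thresh_s, nominal_hz, actual_hz):
    """Wall-clock time between NMI watchdog checks.

    Assumption: the period is programmed as a cycle count derived from
    the nominal frequency, while the counter advances at actual_hz.
    """
    programmed_cycles = watchdog_thresh_s * nominal_hz
    return programmed_cycles / actual_hz


def is_spurious_risk(watchdog_thresh_s, timer_interval_s, turbo_ratio):
    """True if the NMI check can fire before the soft timer has ticked.

    turbo_ratio is actual frequency divided by nominal frequency.
    """
    nmi_s = nmi_period_seconds(watchdog_thresh_s, 1.0, turbo_ratio)
    return nmi_s < timer_interval_s


# Illustrative values only (assumed, not taken from the patch).
THRESH_S = 10.0         # NMI watchdog threshold, in seconds
TIMER_INTERVAL_S = 4.0  # soft watchdog timer interval, in seconds

for ratio in (1.0, 2.0, 3.0):
    period = nmi_period_seconds(THRESH_S, 1.0, ratio)
    risky = is_spurious_risk(THRESH_S, TIMER_INTERVAL_S, ratio)
    print(f"turbo x{ratio}: NMI every {period:.2f}s, spurious risk: {risky}")
```

Under these assumed numbers the NMI check outruns the soft timer once the
turbo ratio exceeds 10 / 4 = 2.5. Shortening the soft timer interval
raises that break-even ratio, which is the direction the workaround
takes, and counting reference cycles removes the dependence on the turbo
ratio altogether.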