Date: Wed, 28 Jun 2017 13:14:04 -0700
From: Andi Kleen
To: Don Zickus
Cc: "Liang, Kan", Thomas Gleixner, linux-kernel@vger.kernel.org, mingo@kernel.org, akpm@linux-foundation.org, babu.moger@oracle.com, atomlin@redhat.com, prarit@redhat.com, torvalds@linux-foundation.org, peterz@infradead.org, eranian@google.com, acme@redhat.com, stable@vger.kernel.org
Subject: Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups
Message-ID: <20170628201404.GM23705@tassilo.jf.intel.com>
In-Reply-To: <20170628190008.3ftqq75evhn2hozp@redhat.com>

On Wed, Jun 28, 2017 at 03:00:08PM -0400, Don Zickus wrote:
> On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > > I haven't heard back any test result yet.
> > >
> > > The above patch looks good to me.
> >
> > This needs performance testing. It may slow down performance or latency sensitive workloads.
>
> More motivation to work through the issues with the proposed real fix? :-)
>
> >
> > > Which workaround do you prefer, the above one or the one checking timestamp?
> >
> > I prefer the earlier patch, it has far less risk of performance issues.
>
> But now you are slowing down the nmi_watchdog so much that the
> watchdog_thresh hold becomes meaningless, no? (granted the turbo-mode blows
> it out of the water too) So now folks who depend on the 10/5/1/whatever second
> reliability lose that. I think that might be unfair too.

What do you mean with reliability?

If you need guarantees of resetting you always need
another separate hardware watchdog (like the TCO watchdog), as the
CPU could be hung up enough that even the NMI watchdog
is not functional anymore.

So relying solely on the NMI watchdog doesn't make any sense.

It can be a useful debugging tool for a specific class of bugs:
when kernel software is looping forever. But if that happens
does it really matter how many iterations the loop does
before it is stopped? Even the current timeout is essentially
eternity in CPU time, and 3x eternity is still eternity.

> The hrtimer increase maintains that and just adds a few more
> interrupts/second.

Interruptions are a big deal for many people.

-Andi
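
To put rough numbers on the trade-off debated in this thread, a small Python sketch may help. It is only an illustration under stated assumptions: a watchdog_thresh of 10 seconds, a base hrtimer period of 2 * watchdog_thresh / 5 seconds, and the "3x" figure read as a stretch of the NMI period. The function names and the 2.5x hrtimer speed-up are hypothetical and do not come from either patch.

```python
def nmi_timeout_s(watchdog_thresh_s, stretch=1):
    """Nominal seconds between NMI watchdog checks.

    `stretch` is an assumed multiplier on the NMI period
    (3 corresponds to the "3x eternity" remark in the thread).
    """
    return watchdog_thresh_s * stretch


def hrtimer_irqs_per_s(watchdog_thresh_s, speedup=1.0):
    """Per-CPU hrtimer interrupts per second.

    Assumes a base hrtimer period of 2 * watchdog_thresh / 5 seconds;
    `speedup` is a hypothetical factor by which the timer is made faster.
    """
    base_period_s = 2 * watchdog_thresh_s / 5
    return speedup / base_period_s


thresh = 10  # assumed watchdog_thresh, in seconds

# Workaround A: stretch the NMI period, detection takes longer.
print(nmi_timeout_s(thresh), "->", nmi_timeout_s(thresh, stretch=3))

# Workaround B: keep the NMI period, fire the hrtimer more often.
print(hrtimer_irqs_per_s(thresh), "->", hrtimer_irqs_per_s(thresh, speedup=2.5))
```

Under these assumptions the first option moves the nominal detection time from 10 to 30 seconds with no extra interrupts, while the second keeps the 10 seconds and raises the per-CPU timer rate from 0.25 to 0.625 interrupts per second. Which cost matters more is exactly the disagreement above.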