From: "Liang, Kan"
To: Michal Hocko
CC: "linux-kernel@vger.kernel.org", "dzickus@redhat.com", "mingo@kernel.org", "akpm@linux-foundation.org", "babu.moger@oracle.com", "atomlin@redhat.com", "prarit@redhat.com", "torvalds@linux-foundation.org", "peterz@infradead.org", "tglx@linutronix.de", "eranian@google.com", "acme@redhat.com", "ak@linux.intel.com", "stable@vger.kernel.org"
Subject: RE: [PATCH] kernel/watchdog: fix spurious hard lockups
Date: Wed, 28 Jun 2017 13:24:08 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07753713A22@SHSMSX103.ccr.corp.intel.com>
References: <20170620213309.30051-1-kan.liang@intel.com> <20170628114058.GB5234@dhcp22.suse.cz>
In-Reply-To: <20170628114058.GB5234@dhcp22.suse.cz>

> > From: Kan Liang
> >
> > Some users reported spurious NMI watchdog timeouts.
> >
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick the NMI watchdog relies on.
>
> AFAIR the watchdog doesn't rely on deferred timers so this would suggest
> that a standard hrtimer can expire much later than programmed, right?

The softlockup watchdog relies on hrtimers.
The hardlockup watchdog (NMI watchdog) relies on the perf subsystem and
counts unhalted CPU cycles.

When the softlockup watchdog expires, it updates hrtimer_interrupts.
When the NMI watchdog expires, it checks hrtimer_interrupts to determine
whether there is a hard lockup.

The design is for the softlockup watchdog to run at 2.5 times the rate of
the NMI watchdog, which guarantees that hrtimer_interrupts is updated
before the NMI watchdog expires.

That works well if Turbo-Mode is disabled.
However, when Turbo-Mode is enabled, unhalted CPU cycles can accumulate
much faster than expected, so the NMI watchdog can expire even sooner than
the softlockup watchdog. The softlockup watchdog then gets no chance to
update hrtimer_interrupts, which triggers false positives.

Thanks,
Kan

> If that is the case how come other parts of the system do not break. We do
> rely on hrtimers on many other places?
> --
> Michal Hocko
> SUSE Labs
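
The timing argument in the reply can be sketched as a small simulation. This is not kernel code, only an illustration under stated assumptions: the 10-second nominal NMI period and the turbo ratios passed in are assumed values, the function name count_false_lockups is hypothetical, and only the 2.5x rate relationship and the hrtimer_interrupts check come from the message. The hrtimer fires on wall-clock time, while the NMI period is a fixed number of cycles, so it shrinks in wall-clock terms as the CPU runs above its nominal frequency.

```python
from fractions import Fraction


def count_false_lockups(turbo_ratio, nmi_nominal_s=10, rate_factor=2.5,
                        duration_s=600):
    """Count NMI checks that see no hrtimer progress on a healthy CPU.

    turbo_ratio   -- actual frequency / nominal frequency (assumed input)
    nmi_nominal_s -- NMI period at nominal frequency (assumed 10 s)
    rate_factor   -- hrtimer rate relative to the NMI rate (2.5 per the email)
    """
    # Exact rational arithmetic avoids float rounding at coinciding events.
    nmi_nominal = Fraction(nmi_nominal_s)
    hrtimer_period = nmi_nominal / Fraction(rate_factor)  # wall-clock based
    nmi_period = nmi_nominal / Fraction(turbo_ratio)      # cycle based

    false_positives = 0
    saved = 0  # counter value seen at the previous NMI check
    n = 1
    while n * nmi_period <= duration_s:
        now = n * nmi_period
        # Number of hrtimer ticks delivered up to and including 'now'.
        hrtimer_interrupts = now // hrtimer_period
        # No progress since the last check looks like a hard lockup.
        if hrtimer_interrupts == saved:
            false_positives += 1
        saved = hrtimer_interrupts
        n += 1
    return false_positives


if __name__ == "__main__":
    for ratio in (1, 2, 2.5, 3):
        print(ratio, count_false_lockups(ratio))
```

In this model the check stays quiet as long as the turbo ratio does not exceed the 2.5x margin; beyond it, some NMI checks land between two hrtimer ticks and are misreported even though the hrtimer itself is never late.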