Subject: Re: [PATCH] kernel/watchdog: fix spurious hard lockups
To: Andi Kleen
References: <20170620213309.30051-1-kan.liang@intel.com> <4718a252-9515-626e-a69f-565f1c2bc589@redhat.com> <20170620230002.GE23705@tassilo.jf.intel.com>
Cc: kan.liang@intel.com, linux-kernel@vger.kernel.org, dzickus@redhat.com, mingo@kernel.org, akpm@linux-foundation.org, babu.moger@oracle.com, atomlin@redhat.com, torvalds@linux-foundation.org, peterz@infradead.org, tglx@linutronix.de, eranian@google.com, acme@redhat.com, stable@vger.kernel.org
From: Prarit Bhargava
Message-ID: <9320cd00-88f4-49c5-aaa5-4bb4a80c8813@redhat.com>
Date: Tue, 20 Jun 2017 20:12:04 -0400
In-Reply-To: <20170620230002.GE23705@tassilo.jf.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/20/2017 07:00 PM, Andi Kleen wrote:
> On Tue, Jun 20, 2017 at 06:34:23PM -0400, Prarit Bhargava wrote:
>>
>>
>> On 06/20/2017 05:33 PM, kan.liang@intel.com wrote:
>>> From: Kan Liang
>>>
>>> Some users reported spurious NMI watchdog timeouts.
>>>
>>> We now have more and more systems where the Turbo range is wide enough
>>> that the NMI watchdog expires faster than the soft watchdog timer that
>>> updates the interrupt tick the NMI watchdog relies on.
>>>
>>
>> Hmm ... odd that I haven't seen this.  We're running a pretty wide
>> variety of systems here.  Do you have a reproducer?  I'd like to see
>> this occur on production HW.
>
> It only happens on a few specific CPU SKUs with a very wide Turbo range.

Which ones?

> Reproducer is typically some stress workload that turbos very high.

So stress the single Turbo Max core?  Or any core?

P.

>
> -Andi
>
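
For anyone following the arithmetic behind the quoted patch description, here is a rough sketch of the timing race, not the kernel's code. It assumes (none of this is stated in the thread) that the NMI watchdog counter is programmed with a cycle count derived from the base frequency and watchdog_thresh, that watchdog_thresh is 10 seconds, and that the soft watchdog hrtimer fires every 2/5 of watchdog_thresh. The function names and the example frequencies are hypothetical.

```python
# Assumed values, not taken from the thread: a watchdog_thresh of 10 seconds
# and an hrtimer that fires every 2/5 of that threshold.
WATCHDOG_THRESH_S = 10.0
HRTIMER_PERIOD_S = WATCHDOG_THRESH_S * 2 / 5  # 4 seconds


def nmi_period_s(base_khz, actual_khz, thresh_s=WATCHDOG_THRESH_S):
    """Wall-clock seconds until a cycle-based NMI watchdog fires.

    The counter is assumed to be programmed with base_khz * 1000 * thresh_s
    cycles, but it counts at the actual (possibly Turbo) frequency.
    """
    cycles = base_khz * 1000 * thresh_s
    return cycles / (actual_khz * 1000)


def can_fire_spuriously(base_khz, actual_khz):
    """True if the NMI can arrive before the hrtimer has ticked even once."""
    return nmi_period_s(base_khz, actual_khz) < HRTIMER_PERIOD_S


if __name__ == "__main__":
    # Hypothetical SKU: 1.2 GHz base, 3.6 GHz Turbo (ratio 3.0).
    print(can_fire_spuriously(1_200_000, 3_600_000))
    # Hypothetical SKU: 2.0 GHz base, 3.0 GHz Turbo (ratio 1.5).
    print(can_fire_spuriously(2_000_000, 3_000_000))
```

Under these assumptions the race only opens up once the Turbo-to-base ratio exceeds 2.5, which would be consistent with Andi's remark that only a few SKUs with a very wide Turbo range are affected.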