From: "Liang, Kan"
To: "'Don Zickus'", "'Thomas Gleixner'"
CC: "linux-kernel@vger.kernel.org", "mingo@kernel.org", "akpm@linux-foundation.org", "babu.moger@oracle.com", "atomlin@redhat.com", "prarit@redhat.com", "torvalds@linux-foundation.org", "peterz@infradead.org", "eranian@google.com", "acme@redhat.com", "ak@linux.intel.com", "stable@vger.kernel.org"
Subject: RE: [PATCH V2] kernel/watchdog: fix spurious hard lockups
Date: Tue, 15 Aug 2017 01:16:51 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07753784A2B@SHSMSX103.ccr.corp.intel.com>
In-Reply-To: <20170717144637.34umykrccvjma3fl@redhat.com>

> On Mon, Jul 17, 2017 at 01:24:23AM +0000, Liang, Kan wrote:
> > Hi Don & Thomas,
> >
> > Sorry for the late response. We just finished the tests for all proposed
> > patches.
> >
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2: Thomas's first proposal.
> > https://patchwork.kernel.org/patch/9803033/
> > https://patchwork.kernel.org/patch/9805903/
> > Patch 3: my original proposal which increase the NMI watchdog timeout
> > by 3X https://patchwork.kernel.org/patch/9802053/
> >
> > According to our test, only patch 3 works well.
> > The other two patches will hang the system eventually.
> > For patch 1, the system hang after running our test case for ~1 hour.
> > For patch 2, the system hang in running the overnight test.
> > There is no error message shown when the system hang. So I don't know
> > the root cause yet.
>
> Hi Kan,
>
> Thanks for the feedback. Odd that the different patches had different results.
> What is more odd to me is the hang. I thought these were all false lockups
> that prematurely panic'd and rebooted the box.
>
> Is the machine configured to panic on hardlockup and reboot?
> Perhaps kdump is enabled to store the console log for review upon reboot?
>
> It almost implies that a hardlockup did happen but isn't being detected until
> later??
>
> > BTW: We set 1 to watchdog_thresh when we did the test.
> > It's believed that can speed up the failure.
>
> Sure, you/they look for 1 second hangs instead of 10 second ones. But with
> patch3 it is more like 3 seconds'ish vs 30 second'ish.
>
> As Thomas asked, I would also be interested in the way the test works. The
> hang doesn't make sense.
>

Hi Don and Thomas,

Sorry for the late response.

We have confirmed that the hard lockup seen with the "speed up the hrtimer"
patch is actually a separate issue. Tim has already proposed a patch to fix it:
https://lkml.org/lkml/2017/8/14/1000

The patch that speeds up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
is a sound fix for the spurious hard lockups.

Tested-by: Kan Liang

Please consider merging it into both the mainline and stable trees.

Thanks,
Kan