From: "Liang, Kan"
To: Don Zickus, Thomas Gleixner
CC: "linux-kernel@vger.kernel.org", "mingo@kernel.org", "akpm@linux-foundation.org", "babu.moger@oracle.com", "atomlin@redhat.com", "prarit@redhat.com", "torvalds@linux-foundation.org", "peterz@infradead.org", "eranian@google.com", "acme@redhat.com", "ak@linux.intel.com", "stable@vger.kernel.org"
Subject: RE: [PATCH V2] kernel/watchdog: fix spurious hard lockups
Date: Mon, 17 Jul 2017 01:24:23 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F0775371D43E@SHSMSX103.ccr.corp.intel.com>
References: <20170621144118.5939-1-kan.liang@intel.com> <20170622154450.2lua7fdmigcixldw@redhat.com> <20170623162907.l6inpxgztwwkeaoi@redhat.com> <20170626201927.3ak7fk3yvdzbb4ay@redhat.com> <20170627201249.ll34ecwhpme3vh2u@redhat.com>
In-Reply-To: <20170627201249.ll34ecwhpme3vh2u@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

> On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > Hmm, all this work for a temp fix. Kan, how much longer until the
> > > > real fix of having perf count the right cycles?
> > >
> > > Quite a while. The approach is wilfully breaking the user space ABI,
> > > which is not going to happen.
> > >
> > > And there is a simpler solution as well, as I said here:
> > >
> > > http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos
> >
> > Hi Thomas,
> >
> > So, you are saying instead of slowing down the perf counter, speed up
> > the hrtimer to sample more frequently like so:
> >
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > index 03e0b69..8ff49de 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -160,7 +160,7 @@ static void set_sample_period(void)
> >  	 * and hard thresholds) to increment before the
> >  	 * hardlockup detector generates a warning
> >  	 */
> > -	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> > +	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 10);
> >  }
>
> Hi Kan,
>
> Will the above patch work for you?

Hi Don & Thomas,

Sorry for the late response. We just finished the tests for all proposed
patches.

There are three proposed patches so far.
Patch 1: The patch above, which speeds up the hrtimer.
Patch 2: Thomas's first proposal.
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/
Patch 3: My original proposal, which increases the NMI watchdog timeout by 3X.
https://patchwork.kernel.org/patch/9802053/

According to our tests, only patch 3 works well. The other two patches
eventually hang the system.
For patch 1, the system hung after running our test case for ~1 hour.
For patch 2, the system hung during the overnight test.

No error message was shown when the system hung, so I don't know the
root cause yet.

BTW: We set watchdog_thresh to 1 for these tests. We believe that makes
the failure show up sooner.

Thanks,
Kan
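
For reference, the arithmetic behind patch 1 can be sketched in user space as below. This is only an illustration, not kernel code: softlockup_thresh() and sample_period_ns() are hypothetical helper names, and the sketch assumes get_softlockup_thresh() returns watchdog_thresh * 2, as in kernel/watchdog.c at the time of this thread.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Assumption: mirrors get_softlockup_thresh(), taken here to be
 * twice watchdog_thresh (in seconds).
 */
static uint64_t softlockup_thresh(uint64_t watchdog_thresh)
{
	return watchdog_thresh * 2;
}

/*
 * hrtimer sample period in nanoseconds.
 * divisor is 5 before patch 1 and 10 with patch 1 applied.
 */
static uint64_t sample_period_ns(uint64_t watchdog_thresh, uint64_t divisor)
{
	assert(divisor > 0);
	return softlockup_thresh(watchdog_thresh) * (NSEC_PER_SEC / divisor);
}
```

Under that assumption, with watchdog_thresh set to 1 as in our tests, sample_period_ns(1, 5) is 400 ms and sample_period_ns(1, 10) is 200 ms, so patch 1 makes the hrtimer fire twice as often.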