Date: Mon, 17 Jul 2017 17:00:40 +0200 (CEST)
From: Thomas Gleixner
To: "Liang, Kan"
Cc: Don Zickus, "linux-kernel@vger.kernel.org", "mingo@kernel.org",
    "akpm@linux-foundation.org", "babu.moger@oracle.com",
    "atomlin@redhat.com", "prarit@redhat.com",
    "torvalds@linux-foundation.org", "peterz@infradead.org",
    "eranian@google.com", "acme@redhat.com", "ak@linux.intel.com",
    "stable@vger.kernel.org"
Subject: RE: [PATCH V2] kernel/watchdog: fix spurious hard lockups
In-Reply-To: <37D7C6CF3E00A74B8858931C1DB2F0775371D9AE@SHSMSX103.ccr.corp.intel.com>

On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > According to our test, only patch 3 works well.
> > > The other two patches will hang the system eventually.
> >
> > Hang the system eventually? Does that mean that the system stops working
> > and the watchdog does not catch the problem?
>
> Right, the system stops working and the watchdog does not catch the problem.

What exactly does "stops working" mean? Just that you observe that the
system does not make progress, or that it is not reacting to key strokes,
or what?

And what is the lockup that is detected in the other case? Which code
path causes the lockup?

> I personally didn't compare the difference between 1 and default 10 for this
> test case.
> Before we had the test case from customer, we developed other micro
> which can reproduce the similar issue.
> For that micro, 1 can speed up the failure.
> (BTW: all the three patches can fix the issue which was reproduced by that micro.)
>
> If you think it's meaningful to verify 10 as well, I can do the compare.

It might be worth a try, but I can't tell anything unless we can either
get our hands on the test scenario or at least get a proper explanation
of what it is doing, including the expected outcome, i.e. what the
'system is locked up' failure is that the watchdog should detect.

Thanks,

	tglx