From: "Liang, Kan"
To: Thomas Gleixner
CC: Don Zickus, "linux-kernel@vger.kernel.org", "mingo@kernel.org",
    "akpm@linux-foundation.org", "babu.moger@oracle.com", "atomlin@redhat.com",
    "prarit@redhat.com", "torvalds@linux-foundation.org", "peterz@infradead.org",
    "eranian@google.com", "acme@redhat.com", "ak@linux.intel.com",
    "stable@vger.kernel.org"
Subject: RE: [PATCH V2] kernel/watchdog: fix spurious hard lockups
Date: Mon, 17 Jul 2017 14:46:53 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F0775371D9AE@SHSMSX103.ccr.corp.intel.com>
References: <20170621144118.5939-1-kan.liang@intel.com>
    <20170622154450.2lua7fdmigcixldw@redhat.com>
    <20170623162907.l6inpxgztwwkeaoi@redhat.com>
    <20170626201927.3ak7fk3yvdzbb4ay@redhat.com>
    <20170627201249.ll34ecwhpme3vh2u@redhat.com>
    <37D7C6CF3E00A74B8858931C1DB2F0775371D43E@SHSMSX103.ccr.corp.intel.com>
    <37D7C6CF3E00A74B8858931C1DB2F0775371D8AA@SHSMSX103.ccr.corp.intel.com>

> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > That doesn't make sense. What's the exact test procedure?
> >
> > I don't know the exact test procedure. The test case is from our customer.
> > I only know that the test case makes calls into the x11 libs.
>
> Sigh. This starts to be silly. You test something and have no idea what it does?

As I said, the test case is from our customer. They only share binaries with us.
Actually, it is more accurate to call it a test suite. It includes dozens of
small tests.
I just reproduced the issue and verified all three patches in our lab,
then reported the results here immediately, as requested.
So I know little about the test case for now. I will share more when I learn more.
Sorry for that.

>
> > > > According to our test, only patch 3 works well.
> > > > The other two patches will hang the system eventually.
>
> Hang the system eventually? Does that mean that the system stops working
> and the watchdog does not catch the problem?

Right, the system stops working and the watchdog does not catch the problem.

>
> > > > BTW: We set 1 to watchdog_thresh when we did the test.
> > > > It's believed that can speed up the failure.
> > >
> > > Believe is not really a technical measure....
> > >
> >
> > 1 is a valid value for watchdog_thresh.
> > It was set through the standard proc interface.
> > /proc/sys/kernel/watchdog_thresh
> > It should not impacts the final test result.
>
> I know that 1 is a valid value and I know how that can be set. Still, it does not
> help if you believe that setting the threshold to 1 can speed up the failure.
> Either you know it for sure or not. You can believe in god or whatever, but
> here we talk about facts.

I personally didn't compare 1 against the default of 10 for this test case.
Before we had the test case from the customer, we developed another
micro-benchmark which can reproduce a similar issue. For that micro-benchmark,
1 can speed up the failure.
(BTW: all three patches fix the issue reproduced by that micro-benchmark.)
If you think it's meaningful to verify 10 as well, I can do the comparison.

Thanks,
Kan
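
For reference, a minimal sketch of reading and changing watchdog_thresh through
the proc file mentioned above. The helper names (read_thresh, write_thresh) are
only illustrative and not part of any existing tool. The 0 to 60 range check
reflects my understanding of the sysctl limits and should be treated as an
assumption. The path is a parameter so the helpers can be tried on a scratch
file; writing to the real file needs root.

```python
from pathlib import Path

# Standard sysctl file named in this thread; writing to it requires root.
WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")


def read_thresh(path: Path = WATCHDOG_THRESH) -> int:
    """Return the current threshold in seconds."""
    return int(path.read_text().strip())


def write_thresh(seconds: int, path: Path = WATCHDOG_THRESH) -> None:
    """Set the threshold in seconds.

    Assumption: the kernel accepts 0..60, and 0 disables lockup detection.
    """
    if not 0 <= seconds <= 60:
        raise ValueError("watchdog_thresh must be between 0 and 60")
    path.write_text(f"{seconds}\n")
```

To compare the two settings discussed here, one would call write_thresh(1) for
one run and write_thresh(10) for the other, checking read_thresh() before each
run.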