From: "Kani, Toshimitsu"
To: "mchehab@s-opensource.com"
CC: "linux-kernel@vger.kernel.org", "tglx@linutronix.de", "mchehab@kernel.org", "rjw@rjwysocki.net", "srinivas.pandruvada@linux.intel.com", "bp@alien8.de", "tony.luck@intel.com", "lenb@kernel.org", "linux-acpi@vger.kernel.org", "linux-edac@vger.kernel.org"
Subject: Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
Date: Fri, 21 Jul 2017 17:21:31 +0000
Message-ID: <1500657133.2042.51.camel@hpe.com>
In-Reply-To: <20170721140131.40079805@vento.lan>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2017-07-21 at 14:01 -0300, Mauro Carvalho Chehab wrote:
> On Fri, 21 Jul 2017 16:40:20 +0000
> "Kani, Toshimitsu" wrote:
>
> > On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > > On Fri, 21 Jul 2017 15:34:50 +0000
> > > "Kani, Toshimitsu" wrote:
> > >
> > > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > > > On Fri, Jul 21, 2017 at 03:08:41PM +0000, Kani, Toshimitsu
> > > > > wrote:
> > > > > > Yes, that is correct.  Corrected errors are reported to the
> > > > > > OS when they exceed the platform's threshold.
> > > > >
> > > > > Are those thresholds user-configurable?
> > > >
> > > > I suppose it'd depend on vendors, but I do not think users can
> > > > do it properly unless they have in-depth knowledge about the
> > > > hardware.
> > > >
> > > > > If not, what are you telling users who want to see *every*
> > > > > corrected error for measuring DIMM wear and so on...?
> > > >
> > > > Corrected errors are normal and expected to occur on healthy
> > > > hardware.  They do not need the user's attention until they
> > > > occur repeatedly at the same place.
> > >
> > > Yes, they're expected to happen. Still, some sys admins have
> > > their own measurements about what's "normal" for their scenario,
> > > and want to monitor every single corrected error, running their
> > > own algorithm to warn if the number of corrected errors is above
> > > their "normal" rate.
> >
> > I suppose these admins had to do it because their platforms
> > reported all corrected errors.  It addresses such administrators'
> > burden.
>
> I see the value of having a threshold in BIOS, provided that it is
> well documented and that its value can be adjusted, if needed.
>
> One of the things I wanted to implement in ras-daemon was an
> algorithm that would do such thresholding in software.
> The problem is that it would require field experience. So,
> I talked with a few vendors, to see if they could help doing
> it, but, at that time, none raised their hands :-)

I think it'd be very hard to keep it up to date.

> The thing with a BIOS threshold is that the user has no way to
> audit the algorithm. So, when the BIOS starts reporting such errors,
> it may already be too late: the systems may be on the verge of
> losing data (or some data was already lost).
>
> That's critical on cluster systems with thousands of machines:
> while the impact of disabling a cluster node to do some maintenance
> is marginal, the impact of an uncorrected error on a single
> machine may compromise weeks of expensive processing.
>
> That's why some users prefer to monitor every single corrected
> error, and compare the rate with the probability distribution for
> which they know that the risk of uncorrected errors is acceptable.

Right, I do not think all platforms need to be firmware-first.  I do
not want to talk like a salesperson, but we also offer lower-cost
platforms that do not come with built-in RAS.  Users can choose the
right model for their needs.

Thanks,
-Toshi