From: "Kani, Toshimitsu"
To: "mchehab@s-opensource.com"
CC: "linux-kernel@vger.kernel.org", "tglx@linutronix.de",
    "mchehab@kernel.org", "rjw@rjwysocki.net",
    "srinivas.pandruvada@linux.intel.com", "bp@alien8.de",
    "tony.luck@intel.com", "lenb@kernel.org",
    "linux-acpi@vger.kernel.org", "linux-edac@vger.kernel.org"
Subject: Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
Date: Wed, 19 Jul 2017 16:40:25 +0000
Message-ID: <1500481869.2042.29.camel@hpe.com>
References: <20170717215912.26070-1-toshi.kani@hpe.com>
    <20170717215912.26070-4-toshi.kani@hpe.com>
    <20170718060007.GB8736@nazgul.tnic>
    <1500407379.2042.21.camel@hpe.com>
    <20170718181545.32bd9181@vento.lan>
In-Reply-To: <20170718181545.32bd9181@vento.lan>
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2017-07-18 at 18:15 -0300, Mauro Carvalho Chehab wrote:
> Em Tue, 18 Jul 2017 19:58:54 +0000
 :
> We had a similar discussion several years ago when I wrote this
> driver. On that time, I talked with Red Hat, HP, Dell, Intel people
> and with some customers with large clusters.
>
> The way it is, ghes_edac is a poor man's driver. What it hopefully
> provide is a detection that an error happened, without really telling
> the user what component should be replaced.

"poor man's driver" is a bit misleading, but yes, firmware-first
platforms have RAS features built into the platform, and they do not
need intelligence in EDAC drivers, which may conflict with the
platform's RAS features.

I cannot speak for other vendors, but HPE platforms log errors and
provide FRU info. ghes_edac allows errors to be reported to OS
management tools like rasdaemon, in addition to the platform-specific
management tools.

> Ok, on machines with their own error reporting mechanism (like
> HP servers), a sys admin can look on some proprietary software
> (or bios), in order to identify what happened.
>
> Yet, BIOS doesn't provide any glue about what's the memory
> architecture, as it maps memory as if it was a single DIMM memory:
>
> (from ghes_edac_register)
>
> 	layers[0].type = EDAC_MC_LAYER_ALL_MEM;
> 	layers[0].size = num_dimm;
> 	layers[0].is_virt_csrow = true;
>
> So, even on systems where the BIOS actually knows how the memory
> cards are wired, it will mask the memory controller data.
>
> Now, the EDAC driver can also be used to identify what
> channels are used. That helps the sys admin to know if the
> memories are connected in a way that it will be using multiple
> channels, or not, helping to setup the machine to obtain
> the maximum possible performance.
>
> So, for example, on my Intel-based HP server, I can check
> such info with:
>
> $ ras-mc-ctl --mainboard
> ras-mc-ctl: mainboard: HP model ProLiant ML350 Gen9
> $ ras-mc-ctl --layout
>        +-----------------------------------------------------------------------+
>        |                mc0                |                mc1                |
>        | channel0  | channel1  | channel2  | channel0  | channel1  | channel2  |
> -------+-----------------------------------------------------------------------+
> slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
> slot1: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
> slot0: |  16384 MB  |     0 MB  |  16384 MB  |  16384 MB  |     0 MB  |  16384 MB  |
> -------+---------------------------------------------------------------------------+
>
> So, I know that both CPUs will be connected to my memories, and,
> on both, it is using 2 channels.
>
> If I was using the ghes driver, that information would be hidden.
>
> So, due to all problems with ghes, it is enabled only if there are no
> better solution, e. g. on systems where there's no way to talk
> directly to the hardware (like on E7 Xeon machines, where the memory
> controller is actually on a separate chip that are controlled only by
> the BIOS).

Thanks for the info! That's very helpful. I will check to see if
ghes_edac provides enough of the info we need.

-Toshi