Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966017AbcJ1SO4 (ORCPT ); Fri, 28 Oct 2016 14:14:56 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:45927 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S942356AbcJ1SOH (ORCPT ); Fri, 28 Oct 2016 14:14:07 -0400 From: Aaron Miller To: Borislav Petkov , Mauro Carvalho Chehab CC: "linux-edac@vger.kernel.org" , "linux-kernel@vger.kernel.org" Subject: Re: [PATCH v2] EDAC: expose per-dimm error counts in sysfs Thread-Topic: [PATCH v2] EDAC: expose per-dimm error counts in sysfs Thread-Index: AQHSMJneC2CaAoPU1kGmg5UPtafE7KC91fMA///hXoA= Date: Fri, 28 Oct 2016 18:13:08 +0000 Message-ID: <7EB49132-0CB0-46AE-8F2C-1AF334DE4494@fb.com> References: <20161025232551.3270769-1-aaronmiller@fb.com> <20161027213332.581284-1-aaronmiller@fb.com> <20161028130246.xcdzjy4mdrthgzol@pd.tnic> In-Reply-To: <20161028130246.xcdzjy4mdrthgzol@pd.tnic> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-ms-exchange-messagesentrepresentingtype: 1 x-originating-ip: [69.181.210.47] x-ms-office365-filtering-correlation-id: 04641f00-7f1e-4a96-cd3e-08d3ff5e11ed x-microsoft-exchange-diagnostics: 1;DM5PR15MB1260;7:6cJsY17oWd39pONONZRDPfFHdBTBaqPvop829+kDCjYS+IKs+ad3Qviw+dT3m4Lc0tM4vKrW0IrvvxhLGdRnmD6en4fkr9MbY6UW9YfocQJOV2cEaKtYInrd2uKxsXwhEGREVwWI1MCO6xT+69OtXLjWIHQfH2AbLnULTHkZyt5Hr65OTndynyjQjQKNccFvDt1LX7V5fYYkHIOOKvSf40m0oa5R1kbMXDd+QmJPKEXOjjzVz5zA8KZye9RLZXNLE8L1OO6/yZUWZULM+Rf/v1gN61gAmr3fVMnOb7KTxJpLNaCc6Qe5twKpVud7iV/cB1ZuyC1kw9YZMZ/3JZrbPc7m5TqQ/O2V0b4Z6kdVZ9U=;20:64dOK00VR+UpfJyVM56lfIQWaNcZSbULSpx2OkA9WpsyY0nqaOQWDinkukKmMYaUMd7F4/o1xcDVi3wDBtKX944ng1rDFCKw6WmuBXrqGCMwpJk7xGOodcRQf3mXv9DKb3VVKu+cpY2pggbD+WxK+ZYFLYQ0DcLdAxvgvzlkZck= x-microsoft-antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:DM5PR15MB1260; x-microsoft-antispam-prvs: x-exchange-antispam-report-test: UriScan:(20558992708506)(67672495146484)(211171220733660); x-exchange-antispam-report-cfa-test: BCL:0;PCL:0;RULEID:(6040176)(601004)(2401047)(5005006)(8121501046)(10201501046)(3002001);SRVR:DM5PR15MB1260;BCL:0;PCL:0;RULEID:;SRVR:DM5PR15MB1260; x-forefront-prvs: 0109D382B0 x-forefront-antispam-report: SFV:NSPM;SFS:(10019020)(6009001)(7916002)(199003)(24454002)(377454003)(189002)(81166006)(77096005)(7846002)(11100500001)(97736004)(122556002)(305945005)(7736002)(8936002)(3280700002)(5001770100001)(4326007)(76176999)(101416001)(106356001)(106116001)(54356999)(92566002)(81156014)(2906002)(36756003)(99286002)(105586002)(68736007)(82746002)(5660300001)(83716003)(8676002)(10400500002)(19580395003)(50986999)(87936001)(6116002)(3846002)(102836003)(586003)(86362001)(3660700001)(2900100001)(66066001)(19580405001)(33656002)(2950100002)(189998001)(5002640100001)(104396002);DIR:OUT;SFP:1102;SCL:1;SRVR:DM5PR15MB1260;H:DM5PR15MB1259.namprd15.prod.outlook.com;FPR:;SPF:None;PTR:InfoNoRecords;A:1;MX:1;LANG:en; spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: text/plain; charset="utf-8" Content-ID: MIME-Version: 1.0 X-MS-Exchange-CrossTenant-originalarrivaltime: 28 Oct 2016 18:13:08.0346 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: 8ae927fe-1255-47a7-a2af-5f3a069daaa2 X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM5PR15MB1260 X-OriginatorOrg: fb.com X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-10-28_11:,, signatures=0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id u9SIGZbN016991 Content-Length: 3410 Lines: 63 The EDAC drivers themselves are reporting the channel and slot counts and dimm config and also doing the decoding of MCEs to map them to a node/channel/slot. If one of those was reporting that an MCE came from a position it had reported earlier did not exist I would say that driver was buggy, since the MCE doesn’t contain those as raw values from h/w, they have to be determined by the driver using what it knows about the current memory configuration. Adding some bound checks might not be a bad idea in this does happen, but you can cause an oops like this by injecting out of bounds without this patch – only added reads it there shouldn’t be a way to cause an out-of-bounds read with it. On 10/28/16, 6:02 AM, "Borislav Petkov" wrote: On Thu, Oct 27, 2016 at 02:33:32PM -0700, Aaron Miller wrote: > The old 'csrowX' sysfs directories had per-csrow error counters, but the > new 'dimmX' directories do not currently expose error counts. > > EDAC already keeps these counts, add them to sysfs so per-dimm counts > are still available when CONFIG_EDAC_LEGACY_SYSFS=n > > Signed-off-by: Aaron Miller > --- Hmm, so something's still broken in the adding of the error count, especially that dancing around with layers. Mauro, I think you should take a look. Here's what I did: EDAC_DFS=/sys/kernel/debug/edac/mc0/ echo 24 > $EDAC_DFS/fake_inject_channel echo 0 > $EDAC_DFS/fake_inject_slot echo 3 > $EDAC_DFS/fake_inject_count echo 1 > $EDAC_DFS/fake_inject 24 is the max channels the system reports during start: [ 4864.030901] EDAC MC: Removed device 0 for sbridge_edac.c Sandy Bridge Socket#0: DEV 0000:3f:0e.0 [ 4865.867873] EDAC MC: Ver: 3.0.0 [ 4866.081102] edac_mc_alloc: errcount layer 0 size 8 [ 4866.086081] edac_mc_alloc: errcount layer 1 size 24 [ 4866.091133] edac_mc_alloc: allocating 64 error counters [ 4866.096529] edac_mc_alloc: allocating 3320 bytes for mci data (24 dimms, 24 csrows/channels) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [ 4866.105146] edac_mc_alloc: ce_per_layer[0]: ffff8928b7e918f8 [ 4866.110979] edac_mc_alloc: ce_per_layer[1]: ffff8928b7e91938 [ 4866.619521] slot 0 <7>[ 4866.621530] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm0 [ 4866.637405] slot 0 <7>[ 4866.639420] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm3 [ 4866.655282] slot 0 <7>[ 4866.657288] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm6 [ 4866.673168] slot 0 <7>[ 4866.675190] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm9 [ 4866.691114] EDAC MC0: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#0: DEV 0000:3f:0e.0 (INTERRUPT) and when I do that injection, there's boom, see below. It looks to me that indexing in edac_inc_ce_error() goes out of bounds: mci->ce_per_layer[i][index] += count; if (i < mci->n_layers - 1) index *= mci->layers[i + 1].size; and it shouldn't. We need to be prepared to handle crap data coming from the hardware so those error count adding functions should check array lengths and scream if something's overflowing...