References: <1531762009-15112-1-git-send-email-tbaicar@codeaurora.org> <20180719140102.GB25185@nazgul.tnic> <94e3a0fb-9b7d-045f-733b-9f063dcb39e4@arm.com> <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org>
From: Tyler Baicar
Date: Thu, 23 Aug 2018 11:46:59 -0400
Subject: Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
To: James Morse
Cc: Tyler Baicar, wufan@codeaurora.org, Linux Kernel Mailing List, harba@qti.qualcomm.com, Borislav Petkov, mchehab@kernel.org, arm-mail-list, linux-edac@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hello James,

On Thu, Aug 23, 2018 at 5:29 AM James Morse wrote:
> On 19/07/18 19:36, Tyler Baicar wrote:
> > On 7/19/2018 10:46 AM, James Morse wrote:
> >> On 19/07/18 15:01, Borislav Petkov wrote:
> >>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
> >>>> Enable per-layer error reporting for ARM systems so that the error
> >>>> counters are incremented per-DIMM.
>
> This 'layer' term seems to be EDAC's artificial view of memory.
>

Yes, it's just the terminology that EDAC uses for locating a DIMM. "Layer"
can mean several things here:
https://elixir.bootlin.com/linux/latest/source/include/linux/edac.h#L318

We should be able to avoid the layer definitions with the SMBIOS handles.
>
> >> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
> >
> > There are CPU specific EDAC drivers for a lot of x86 folks and those drivers
> > populate the layer information in a custom way.
>
> Not for GHES surely?
>

Correct, the x86 drivers that properly increment the DIMM error counters are
not tied to the ghes_edac driver.

>
> (DPC == DIMM per Channel?)
>

Yes.

> > The goal is to be able to enable the per layer error reporting in the ghes_edac
> > driver so that the per dimm counters exposed in the EDAC sysfs nodes are properly
> > updated.
>
> What do you mean by layer? I can't find anything in the ACPI/UEFI/SMBIOS specs
> that uses this term...
>
> If its just 'per dimm counters' you're after, this looks straightforward.
>

Yes, we just need a way to increment the per DIMM counters that are exposed
by the EDAC sysfs nodes.

> [re-ordered hunk:]
> > This seems pretty hacky to me, so if anyone has other suggestions please share
> > them.
>
> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide
> the information necessary to identify the failing FRU". As EDAC has three
> 'levels', these are what they should correspond to for ghes-edac.
>
> I assume NODE means rack/chassis in some distributed system. Lets ignore it as
> it doesn't seem to map to anything in the SMBIOS table.

I believe NODE should map to socket number for multi-socket systems.

> The CPER record's card and module numbers are useless to us, as we need to know
> how many there will be in advance. (does this version of firmware count from 0
> or 1?)
>
> ... but CPER also gives us a 'Card Handle' and 'Module Handle'.
> 'Module Handle' maps to SMBIOS:17 Memory Device (aka, a DIMM). The Handle is a
> word-value in the structure, so it doesn't depend on the layout/parse-order of
> the SMBIOS tables. When we count the DIMMs in edac-ghes we can give them some
> level-idx, then use the handle to find which level-idx to use for this DIMM.
>
> ghes_edac_report_mem_error() already picks up the module-handle, but only uses
> it to print the bank/device.
>
> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
> Structure", which the Memory Device structure also points to.
> Card then must mean "a collection of memory devices (DIMMs) that operate
> together to form an address space".
>
> This might be what I think of as a memory-controller, or it might be something
> more complicated. Regardless, the CPER records think its relevant.
>
> For the edac:layers, we could walk the DMI table to find these structures, and
> build the layers from them. If the Memory-array-structures are missing, we can
> use the existing 1:NUM_DIMMS approach.
>

I think the proper way to get this working would be to use these handles. We
can avoid populating this layer information and instead have a mapping of
type 17 index number (how edac is numbering the DIMMs today) to the handle
number. Then we will need a new function to increment the counter based on
the handle number rather than this layer information. Is that how you are
envisioning it?

Thanks,
Tyler
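
For illustration only, here is a rough userspace sketch of the handle-based
mapping described above. It is not the ghes_edac code and does not use any
kernel API: every name in it (dimm_map, dimm_map_add, dimm_map_report,
MAX_DIMMS) is hypothetical. It assumes each SMBIOS type 17 (Memory Device)
structure is recorded in discovery order together with its 16-bit handle,
and that an error report carries the CPER Module Handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DIMMS 32 /* hypothetical upper bound, chosen for this sketch */

/* One entry per SMBIOS type 17 structure, stored in discovery order. */
struct dimm_entry {
	uint16_t handle;   /* SMBIOS handle of the Memory Device structure */
	unsigned long ce;  /* corrected errors seen on this DIMM */
	unsigned long ue;  /* uncorrected errors seen on this DIMM */
};

struct dimm_map {
	struct dimm_entry dimms[MAX_DIMMS];
	size_t count;
};

/* Record a DIMM while counting type 17 entries; returns its index, or -1 if full. */
static int dimm_map_add(struct dimm_map *map, uint16_t handle)
{
	assert(map != NULL);
	if (map->count >= MAX_DIMMS)
		return -1;
	map->dimms[map->count].handle = handle;
	map->dimms[map->count].ce = 0;
	map->dimms[map->count].ue = 0;
	return (int)map->count++;
}

/* Linear search from handle to index; returns -1 for an unknown handle. */
static int dimm_map_find(const struct dimm_map *map, uint16_t handle)
{
	assert(map != NULL);
	for (size_t i = 0; i < map->count; i++)
		if (map->dimms[i].handle == handle)
			return (int)i;
	return -1;
}

/*
 * Bump the counter of the DIMM named by the CPER Module Handle.
 * Returns the DIMM index, or -1 if the handle matches no known DIMM.
 */
static int dimm_map_report(struct dimm_map *map, uint16_t module_handle,
			   int corrected)
{
	int idx = dimm_map_find(map, module_handle);

	if (idx < 0)
		return -1;
	if (corrected)
		map->dimms[idx].ce++;
	else
		map->dimms[idx].ue++;
	return idx;
}
```

The lookup keys only on the handle, so it does not depend on how firmware
numbers cards or modules. A linear search is used on the assumption that the
DIMM count is small; an unknown handle returns -1 so the caller can fall back
to whatever it does today.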