Received: by 2002:ac0:a5a6:0:0:0:0:0 with SMTP id m35-v6csp2641798imm; Fri, 24 Aug 2018 02:49:48 -0700 (PDT) X-Google-Smtp-Source: ANB0Vdbj22PGZPI3gvSmJtV0bQh5BS5EgquacsmsrYMjQ2homL8uOmyvanLaH94K3unIodZIZitj X-Received: by 2002:a62:404e:: with SMTP id n75-v6mr1145054pfa.232.1535104188738; Fri, 24 Aug 2018 02:49:48 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1535104188; cv=none; d=google.com; s=arc-20160816; b=iqI/euBYhzxBkofqlyGTeaHKq5h7FFWpV85I4dV5455dGvc+aT4KxzXF3BRwFX847E L0k1Mj+/82vhc8if8D9uDA/mbPo9df8VrnVzMlRsDfIIpNZ+MXjqqE0A+qmh7Del1mrv P1jMncHljx//HVfVm5gEoB2gAPTVcVDgYMHUKLBL9vFY8FvG1kOBZ75/YpyT4ITf7fVg ryFtXP52ulQridy6aT6xj+JkM1jKobLC9dY7edfKnPwfJu9kxxD1oCFBQ54iQHwVtf99 XowzYbkXzwyaQSDo592VF/42RJFae3BBAa+923YimGfLRKp9Y1OfQhAXjWNU2ChfEmbi lOqQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:arc-authentication-results; bh=uGD2hSUxO/Y+GVeJD/e9AC767t/oT/kLrD+8OEAH084=; b=hCk/1w7NxRLodzR4w9VMXrYxyVQCGDtswfomhosbxLmlPQ1aQZedmkdav4q5PrrwSH 0SxtAlky9Kl4i/SZXVEWFBmfTpD1BKkcC6ht495l5jNk+XvWsPCtnkFMfRhBNVsphZ5e W0qIZpA/B3bVnywKtKeBGC/+BoUwXApZ92GNyrSkKjNW9kzlkKHcCUYO5pnRw1vCpzZw w1av0Khcn7F4Y0Ai+D+DrV5iCVkCuK+fECZo++a0nKiwfxP10l5xoD3sVvYBd6asB8Df 85o2GmjFZ3PSKndKMNQ+vFWXa7kNMUo6UjkDcESFEg7syJSH4IKXRYkInCGu5CLocbSt HfDg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c12-v6si8039584pfk.145.2018.08.24.02.49.32; Fri, 24 Aug 2018 02:49:48 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726860AbeHXNWT (ORCPT + 99 others); Fri, 24 Aug 2018 09:22:19 -0400 Received: from foss.arm.com ([217.140.101.70]:55008 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726243AbeHXNWT (ORCPT ); Fri, 24 Aug 2018 09:22:19 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 58F257A9; Fri, 24 Aug 2018 02:48:27 -0700 (PDT) Received: from [10.4.12.81] (melchizedek.emea.arm.com [10.4.12.81]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id D1DFE3F5A0; Fri, 24 Aug 2018 02:48:25 -0700 (PDT) Subject: Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM To: Tyler Baicar Cc: Tyler Baicar , wufan@codeaurora.org, Linux Kernel Mailing List , harba@qti.qualcomm.com, Borislav Petkov , mchehab@kernel.org, arm-mail-list , linux-edac@vger.kernel.org References: <1531762009-15112-1-git-send-email-tbaicar@codeaurora.org> <20180719140102.GB25185@nazgul.tnic> <94e3a0fb-9b7d-045f-733b-9f063dcb39e4@arm.com> <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org> From: James Morse Message-ID: <68a800c7-446e-9b6b-1847-6e45a1d17262@arm.com> Date: Fri, 24 Aug 2018 10:48:24 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Tyler, On 23/08/18 16:46, Tyler Baicar wrote: > On Thu, Aug 23, 2018 at 5:29 AM James Morse wrote: >> On 19/07/18 19:36, Tyler Baicar wrote: >>> On 7/19/2018 10:46 AM, James Morse wrote: >>>> On 19/07/18 15:01, Borislav Petkov wrote: >>>>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote: >>>>>> Enable per-layer error reporting for ARM systems so that the error >>>>>> counters are incremented per-DIMM. >> >> This 'layer' term seems to be EDAC's artificial view of memory. >> > > Yes, it's just the terminology that EDAC uses for locating a DIMM. > > "Layer" can mean several things here: > > https://elixir.bootlin.com/linux/latest/source/include/linux/edac.h#L318 Aha, its an enum. I thought it was an upper/middle/lower mapping at the whim of the edac driver. [...] >> [re-ordered hunk:] >>> This seems pretty hacky to me, so if anyone has other suggestions please share >>> them. >> >> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide >> the information necessary to identify the failing FRU". As EDAC has three >> 'levels', these are what they should correspond to for ghes-edac. >> >> I assume NODE means rack/chassis in some distributed system. Lets ignore it as >> it doesn't seem to map to anything in the SMBIOS table. > > I believe NODE should map to socket number for multi-socket systems. Isn't the Memory Array Structure still unique in a multi-socket system? If so the node isn't telling us anything new. Do sockets show up in the SMBIOS table? We would need to know how many there are in advance. For arm systems the cpu topology from PPTT is the best bet for this information, but what do we do if that table is missing? (also, does firmware count from 1 or 0?) I suspect we can't use this field unless we know what the range of values is going to be in advance. I assumed this node must be a level of information above Card/Memory-Array's address-space. Somehow the Card handle isn't no long unique, we need the node number too. If the CPER records were all being pumped at a single agent, (shared BMC in a blade/chassis thing) then this might matter. I suspect we can ignore it in linux. >> The CPER record's card and module numbers are useless to us, as we need to know >> how many there will be in advance. (does this version of firmware count from 0 >> or 1?) >> >> ... but CPER also gives us a 'Card Handle' and 'Module Handle'. >> 'Module Handle' maps to SMBIOS:17 Memory Device (aka, a DIMM). The Handle is a >> word-value in the structure, so it doesn't depend on the layout/parse-order of >> the SMBIOS tables. When we count the DIMMs in edac-ghes we can give them some >> level-idx, then use the handle to find which level-idx to use for this DIMM. >> >> ghes_edac_report_mem_error() already picks up the module-handle, but only uses >> it to print the bank/device. >> >> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array >> Structure", which the Memory Device structure also points to. >> Card then must mean "a collection of memory devices (DIMMs) that operate >> together to form an address space". >> >> This might be what I think of as a memory-controller, or it might be something >> more complicated. Regardless, the CPER records think its relevant. >> >> For the edac:layers, we could walk the DMI table to find these structures, and >> build the layers from them. If the Memory-array-structures are missing, we can >> use the existing 1:NUM_DIMMS approach. > I think the proper way to get this working would be to use these handles. We can > avoid populating this layer information and instead have a mapping of type 17 > index number (how edac is numbering the DIMMs today) to the handle number. Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what EDAC_MC_LAYER_SLOT is for? > Then we will need a new function to increment the counter based on the handle > number rather than this layer information. Is that how you are envisioning it? I'm not familiar with edac's internals, so I didn't have any particular vision! Isn't the problem that ghes_edac_report_mem_error() does this: | e->top_layer = -1; | e->mid_layer = -1; | e->low_layer = -1; so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't read what it does with this information yet). ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if its set, it uses the handle to find the bank/device strings and prints them out. Naively I thought we could generate some index during ghes_edac_count_dimms(), and use this as e->${whichever}_layer. I hoped there would be something we could already use as the index, but I can't spot it, so this will be more than the one-liner I was hoping for! Thanks, James