Date: Mon, 16 Mar 2020 10:51:49 +0100
From: Borislav Petkov
To: Robert Richter
Cc: Mauro Carvalho Chehab, Tony Luck, James Morse, Aristeu
    Rozanski, linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org, Toshi Kani, John Garry
Subject: Re: [PATCH 11/11] EDAC/ghes: Create one memory controller per physical memory array
Message-ID: <20200316095149.GE26126@zn.tnic>
References: <20200306151318.17422-1-rrichter@marvell.com> <20200306151318.17422-12-rrichter@marvell.com>
In-Reply-To: <20200306151318.17422-12-rrichter@marvell.com>

On Fri, Mar 06, 2020 at 04:13:18PM +0100, Robert Richter wrote:
> The ghes driver only creates one memory controller for the whole
> system. This does not reflect memory topology, especially in multi-node
> systems. E.g. a Marvell ThunderX2 system shows:
>
>   /sys/devices/system/edac/mc/mc0/dimm0
>   /sys/devices/system/edac/mc/mc0/dimm1
>   /sys/devices/system/edac/mc/mc0/dimm2
>   /sys/devices/system/edac/mc/mc0/dimm3
>   /sys/devices/system/edac/mc/mc0/dimm4
>   /sys/devices/system/edac/mc/mc0/dimm5
>   /sys/devices/system/edac/mc/mc0/dimm6
>   /sys/devices/system/edac/mc/mc0/dimm7
>   /sys/devices/system/edac/mc/mc0/dimm8
>   /sys/devices/system/edac/mc/mc0/dimm9
>   /sys/devices/system/edac/mc/mc0/dimm10
>   /sys/devices/system/edac/mc/mc0/dimm11
>   /sys/devices/system/edac/mc/mc0/dimm12
>   /sys/devices/system/edac/mc/mc0/dimm13
>   /sys/devices/system/edac/mc/mc0/dimm14
>   /sys/devices/system/edac/mc/mc0/dimm15
>
> The DIMMs 8-15 are located on the 2nd node of the system. On
> comparable x86 systems there is one memory controller per node. The
> ghes driver should also group DIMMs depending on the topology and
> create one MC per node.
>
> There are several options to detect the topology. ARM64 systems
> retrieve the (NUMA) node information from the ACPI SRAT table (see
> acpi_table_parse_srat()). The node id is later stored in the physical
> address page.
> The pfn_to_nid() macro could be used for a DIMM after
> determining its physical address. The drawback of this approach is
> that it depends on too many subsystems. It could easily break and
> makes the implementation complex. E.g. pfn_to_nid() can only be used
> reliably on mapped address ranges, which is not always guaranteed,
> there are various firmware instances involved which could be broken,
> or results may vary depending on NUMA settings.
>
> Another approach, suggested by James, is to use the DIMM's
> physical memory array handle to group DIMMs [1]. The advantage is to
> only use the information on memory devices from the SMBIOS table that
> contains a reference to the physical memory array it belongs to. This
> information is mandatory, same as the use of DIMM handle references by
> GHES to provide the DIMM location of an error. There is only a single
> table to parse, which eases implementation. This patch uses this
> approach for DIMM grouping.
>
> Modify the DMI decoder to also detect the physical memory array a DIMM
> is linked to and create one memory controller per array to group
> DIMMs. With the change DIMMs are grouped, e.g. a ThunderX2 system
> shows one MC per node now:
>
>   # grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label
>   /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
>   /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
>   /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
>   /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
>   /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
>   /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
>   /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
>   /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
>   /sys/devices/system/edac/mc/mc1/dimm0/dimm_label:N1 DIMM_I0
>   /sys/devices/system/edac/mc/mc1/dimm1/dimm_label:N1 DIMM_J0
>   /sys/devices/system/edac/mc/mc1/dimm2/dimm_label:N1 DIMM_K0
>   /sys/devices/system/edac/mc/mc1/dimm3/dimm_label:N1 DIMM_L0
>   /sys/devices/system/edac/mc/mc1/dimm4/dimm_label:N1 DIMM_M0
>   /sys/devices/system/edac/mc/mc1/dimm5/dimm_label:N1 DIMM_N0
>   /sys/devices/system/edac/mc/mc1/dimm6/dimm_label:N1 DIMM_O0
>   /sys/devices/system/edac/mc/mc1/dimm7/dimm_label:N1 DIMM_P0
>
> [1] https://lkml.kernel.org/r/f878201f-f8fd-0f2a-5072-ba60c64eefaf@arm.com
>
> Suggested-by: James Morse
> Signed-off-by: Robert Richter
> ---
>  drivers/edac/ghes_edac.c | 137 ++++++++++++++++++++++++++++++---------
>  1 file changed, 107 insertions(+), 30 deletions(-)

This is all fine and good, but that change affects the one x86 platform
we support, so the whole patchset should be tested there too. Adding
Toshi.

As a matter of fact, the final version of this set should be tested on
all platforms which are using this thing. Adding John Garry too, who
reported issues with this driver recently on his platform.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
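
For anyone testing on other platforms, the grouping described in the quoted commit message can be sketched in a few lines of userspace Python. This is only an illustration under stated assumptions, not the kernel code from the patch: `MemoryDevice`, `group_by_array()`, `sysfs_lines()`, `example_devices()` and the handle values are hypothetical. Only the DIMM labels and the sysfs path layout are taken from the output quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryDevice:
    """Hypothetical, simplified stand-in for an SMBIOS memory device entry."""
    handle: int        # handle of the DIMM itself
    array_handle: int  # handle of the physical memory array it belongs to
    label: str         # e.g. "N0 DIMM_A0"


def group_by_array(devices):
    """Group DIMMs into memory controllers, one per physical memory array.

    MCs are numbered in order of first appearance of an array handle and
    DIMMs keep their enumeration order within each MC.
    Returns {mc_index: [MemoryDevice, ...]}.
    """
    mc_of_array = {}  # array handle -> MC index
    groups = {}       # MC index -> list of DIMMs
    for dev in devices:
        mc = mc_of_array.setdefault(dev.array_handle, len(mc_of_array))
        groups.setdefault(mc, []).append(dev)
    return groups


def sysfs_lines(groups):
    """Render the grouping in the form of the grep output quoted above."""
    return [
        f"/sys/devices/system/edac/mc/mc{mc}/dimm{i}/dimm_label:{dev.label}"
        for mc, dimms in sorted(groups.items())
        for i, dev in enumerate(dimms)
    ]


def example_devices():
    """16 DIMMs, 8 per array; all handle values are made up for illustration."""
    return [
        MemoryDevice(
            handle=0x2000 + i,
            array_handle=0x1000 + i // 8,
            label=f"N{i // 8} DIMM_{chr(ord('A') + i)}0",
        )
        for i in range(16)
    ]


if __name__ == "__main__":
    print("\n".join(sysfs_lines(group_by_array(example_devices()))))
```

With the made-up input, the sketch yields two groups of eight DIMMs each, rendered as mc0/dimm0-7 and mc1/dimm0-7. On real hardware the result depends entirely on what the firmware puts into the SMBIOS table, which is why testing on each platform matters.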