From: Aaron Miller
To: Borislav Petkov, Mauro Carvalho Chehab
CC: Aaron Miller
Subject: [PATCH v3] EDAC: expose per-dimm error counts in sysfs
Date: Thu, 3 Nov 2016 15:01:53 -0700
Message-ID: <20161103220153.3997328-1-aaronmiller@fb.com>
In-Reply-To: <20161025232551.3270769-1-aaronmiller@fb.com>
References: <20161025232551.3270769-1-aaronmiller@fb.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The old 'csrowX' sysfs directories had per-csrow error counters, but the
new 'dimmX' directories do not currently expose error counts.
EDAC already keeps these counts; add them to sysfs so that per-dimm counts
are still available when CONFIG_EDAC_LEGACY_SYSFS=n.

Signed-off-by: Aaron Miller
---

Notes:
    v2: Add commit message and documentation
    v3: Add ReST documentation on top of Mauro's patchset

 Documentation/ABI/testing/sysfs-devices-edac | 17 +++++++++++++
 Documentation/admin-guide/ras.rst            | 20 +++++++++++++++
 drivers/edac/edac_mc_sysfs.c                 | 38 ++++++++++++++++++++++++++++
 3 files changed, 75 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-edac b/Documentation/ABI/testing/sysfs-devices-edac
index 6568e0010e1a..46ff929fd52a 100644
--- a/Documentation/ABI/testing/sysfs-devices-edac
+++ b/Documentation/ABI/testing/sysfs-devices-edac
@@ -138,3 +138,20 @@ Contact:	Mauro Carvalho Chehab
 Description:	This attribute file will display what type of memory is
 		currently on this csrow. Normally, either buffered or
 		unbuffered memory (for example, Unbuffered-DDR3).
+
+What:		/sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_ce_count
+Date:		October 2016
+Contact:	linux-edac@vger.kernel.org
+Description:	This attribute file displays the total count of correctable
+		errors that have occurred on this DIMM. This count is very important
+		to examine. CEs provide early indications that a DIMM is beginning
+		to fail. This count field should be monitored for non-zero values,
+		which should be reported to the system administrator.
+
+What:		/sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_ue_count
+Date:		October 2016
+Contact:	linux-edac@vger.kernel.org
+Description:	This attribute file displays the total count of uncorrectable
+		errors that have occurred on this DIMM. If panic_on_ue is set, this
+		counter will not have a chance to increment, since EDAC will panic the
+		system.
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst
index d71340e86c27..9939348bd4a3 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/ras.rst
@@ -438,11 +438,13 @@ A typical EDAC system has the following structure under
 	│   │   ├── ce_count
 	│   │   ├── ce_noinfo_count
 	│   │   ├── dimm0
+	│   │   │   ├── dimm_ce_count
 	│   │   │   ├── dimm_dev_type
 	│   │   │   ├── dimm_edac_mode
 	│   │   │   ├── dimm_label
 	│   │   │   ├── dimm_location
 	│   │   │   ├── dimm_mem_type
+	│   │   │   ├── dimm_ue_count
 	│   │   │   ├── size
 	│   │   │   └── uevent
 	│   │   ├── max_location
@@ -457,11 +459,13 @@ A typical EDAC system has the following structure under
 	│   │   ├── ce_count
 	│   │   ├── ce_noinfo_count
 	│   │   ├── dimm0
+	│   │   │   ├── dimm_ce_count
 	│   │   │   ├── dimm_dev_type
 	│   │   │   ├── dimm_edac_mode
 	│   │   │   ├── dimm_label
 	│   │   │   ├── dimm_location
 	│   │   │   ├── dimm_mem_type
+	│   │   │   ├── dimm_ue_count
 	│   │   │   ├── size
 	│   │   │   └── uevent
 	│   │   ├── max_location
@@ -483,6 +487,22 @@ this ``X`` memory module:
 	This attribute file displays, in count of megabytes, the memory
 	that this csrow contains.
 
+- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
+
+	This attribute file displays the total count of uncorrectable
+	errors that have occurred on this DIMM. If panic_on_ue is set,
+	this counter will not have a chance to increment, since EDAC
+	will panic the system.
+
+- ``dimm_ce_count`` - Correctable Errors count attribute file
+
+	This attribute file displays the total count of correctable
+	errors that have occurred on this DIMM. This count is very
+	important to examine. CEs provide early indications that a
+	DIMM is beginning to fail. This count field should be
+	monitored for non-zero values, which should be reported
+	to the system administrator.
+
 - ``dimm_dev_type`` - Device type attribute file
 
 	This attribute file will display what type of DRAM device is
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 39dbab7d62f1..184fed2b005d 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -569,6 +569,40 @@ static ssize_t dimmdev_edac_mode_show(struct device *dev,
 	return sprintf(data, "%s\n", edac_caps[dimm->edac_mode]);
 }
 
+static ssize_t dimmdev_ce_count_show(struct device *dev,
+				      struct device_attribute *mattr,
+				      char *data)
+{
+	struct dimm_info *dimm = to_dimm(dev);
+	u32 count;
+	int off;
+
+	off = EDAC_DIMM_OFF(dimm->mci->layers,
+			    dimm->mci->n_layers,
+			    dimm->location[0],
+			    dimm->location[1],
+			    dimm->location[2]);
+	count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][off];
+	return sprintf(data, "%u\n", count);
+}
+
+static ssize_t dimmdev_ue_count_show(struct device *dev,
+				      struct device_attribute *mattr,
+				      char *data)
+{
+	struct dimm_info *dimm = to_dimm(dev);
+	u32 count;
+	int off;
+
+	off = EDAC_DIMM_OFF(dimm->mci->layers,
+			    dimm->mci->n_layers,
+			    dimm->location[0],
+			    dimm->location[1],
+			    dimm->location[2]);
+	count = dimm->mci->ue_per_layer[dimm->mci->n_layers-1][off];
+	return sprintf(data, "%u\n", count);
+}
+
 /* dimm/rank attribute files */
 static DEVICE_ATTR(dimm_label, S_IRUGO | S_IWUSR,
 		   dimmdev_label_show, dimmdev_label_store);
@@ -577,6 +611,8 @@ static DEVICE_ATTR(size, S_IRUGO, dimmdev_size_show, NULL);
 static DEVICE_ATTR(dimm_mem_type, S_IRUGO, dimmdev_mem_type_show, NULL);
 static DEVICE_ATTR(dimm_dev_type, S_IRUGO, dimmdev_dev_type_show, NULL);
 static DEVICE_ATTR(dimm_edac_mode, S_IRUGO, dimmdev_edac_mode_show, NULL);
+static DEVICE_ATTR(dimm_ce_count, S_IRUGO, dimmdev_ce_count_show, NULL);
+static DEVICE_ATTR(dimm_ue_count, S_IRUGO, dimmdev_ue_count_show, NULL);
 
 /* attributes of the dimm/rank object */
 static struct attribute *dimm_attrs[] = {
@@ -586,6 +622,8 @@ static struct attribute *dimm_attrs[] = {
 	&dev_attr_dimm_mem_type.attr,
 	&dev_attr_dimm_dev_type.attr,
 	&dev_attr_dimm_edac_mode.attr,
+	&dev_attr_dimm_ce_count.attr,
+	&dev_attr_dimm_ue_count.attr,
 	NULL,
 };
 
-- 
2.9.3
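
For illustration only (not part of the patch): a minimal userspace sketch of how a monitoring tool might consume the new attributes, assuming each file holds a single unsigned decimal value followed by a newline, as the "%u\n" format in the show routines suggests. The helper names read_edac_count() and warn_on_ce() are hypothetical and exist only for this example.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal counter from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 * (read_edac_count is a hypothetical helper name, not a kernel API.)
 */
static int read_edac_count(const char *path, unsigned int *count)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	/* Assumption: the attribute contains a single "%u\n" value. */
	ok = (fscanf(f, "%u", count) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Print a warning when a DIMM's correctable-error count is non-zero.
 * Returns 1 if a warning was printed, 0 if the count is zero,
 * and -1 if the counter could not be read.
 */
static int warn_on_ce(const char *path)
{
	unsigned int ce;

	if (read_edac_count(path, &ce) != 0)
		return -1;
	if (ce == 0)
		return 0;
	printf("%s: %u correctable errors\n", path, ce);
	return 1;
}
```

A caller would pass a concrete instance of the documented path pattern, for example /sys/devices/system/edac/mc/mc0/dimm0/dimm_ce_count (mc0 and dimm0 are example instance names). The same reader should work for dimm_ue_count, although with panic_on_ue set that counter is not expected to increment.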