2019-12-05 09:39:55

by Robert Richter

[permalink] [raw]
Subject: [PATCH v2 00/10] EDAC: Rework core and ghes drivers, part two

This patch set is part two of a rework of the ghes_edac and edac_mc
driver. It addresses issues found during code review and while working
with the code. Part one has been included to v5.5, see:

https://lore.kernel.org/patchwork/cover/1093488/

The changes of this series include:

* add helper functions and factor out code (#1, #2, #5)

* improve function interfaces and data structures to decrease
complexity such as number of function arguments, unused data, etc.
(#3, #4, #7, #8, #9, #10),

* minor functional fixes (#6)

* improve code readability (#9)

V2:
* fixed documentation issue in #3 found by "kbuild test robot
<[email protected]>"

Changes compared to part one:
* rebased onto 5781823fd0d3 ("EDAC/altera: Use the Altera System
Manager driver")
* reworded patch subjects
* reordered patches
* collected Mauro's Reviewed-by-tags (note: I kept them though there
has been small conflicts but dropped it when reworked)
* dropped: "EDAC/mc: Rework edac_raw_mc_handle_error() to use struct
dimm_info"
* split "EDAC/mc: Remove per layer counters" into smaller changes
* added:
"EDAC/mc: Report "unknown memory" on too many DIMM labels found"
"EDAC/mc: Remove enable_per_layer_report function arguments"
"EDAC/mc: Pass the error descriptor to error reporting functions"
"EDAC/mc: Remove detail[] string and cleanup error string
generation"
* moved to the end:
"EDAC/mc: Remove per layer counters"


Robert Richter (10):
EDAC/mc: Split edac_mc_alloc() into smaller functions
EDAC/mc: Reorder functions edac_mc_alloc*()
EDAC: Store error type in struct edac_raw_error_desc
EDAC/mc: Determine mci pointer from the error descriptor
EDAC/mc: Create new function edac_inc_csrow()
EDAC/mc: Report "unknown memory" on too many DIMM labels found
EDAC/mc: Remove enable_per_layer_report function arguments
EDAC/mc: Pass the error descriptor to error reporting functions
EDAC/mc: Remove detail[] string and cleanup error string generation
EDAC/mc: Remove per layer counters

drivers/edac/edac_mc.c | 496 ++++++++++++++++-------------------
drivers/edac/edac_mc.h | 6 +-
drivers/edac/edac_mc_sysfs.c | 20 +-
drivers/edac/ghes_edac.c | 16 +-
include/linux/edac.h | 9 +-
5 files changed, 249 insertions(+), 298 deletions(-)

--
2.20.1


2019-12-05 09:40:21

by Robert Richter

[permalink] [raw]
Subject: [PATCH v2 10/10] EDAC/mc: Remove per layer counters

Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
turns out that only the leaves in the memory hierarchy are consumed
(in sysfs), but not the intermediate layers, e.g.:

count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];

These unused counters only add complexity, remove them. The error
counter values are directly stored in struct dimm_info now.

Signed-off-by: Robert Richter <[email protected]>
---
drivers/edac/edac_mc.c | 65 +++++++++---------------------------
drivers/edac/edac_mc_sysfs.c | 20 +++++------
include/linux/edac.h | 4 ++-
3 files changed, 26 insertions(+), 63 deletions(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 5ea834fceb50..3d43b9fe8171 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -445,11 +445,9 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
{
struct mem_ctl_info *mci;
struct edac_mc_layer *layer;
- u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS];
- unsigned int idx, size, tot_dimms = 1, count = 1;
- unsigned int tot_csrows = 1, tot_channels = 1, tot_errcount = 0;
+ unsigned int idx, size, tot_dimms = 1;
+ unsigned int tot_csrows = 1, tot_channels = 1;
void *pvt, *ptr = NULL;
- int i;
bool per_rank = false;

if (WARN_ON(n_layers > EDAC_MAX_LAYERS || n_layers == 0))
@@ -476,19 +474,10 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
* stringent as what the compiler would provide if we could simply
* hardcode everything into a single struct.
*/
- mci = edac_align_ptr(&ptr, sizeof(*mci), 1);
- layer = edac_align_ptr(&ptr, sizeof(*layer), n_layers);
- for (i = 0; i < n_layers; i++) {
- count *= layers[i].size;
- edac_dbg(4, "errcount layer %d size %d\n", i, count);
- ce_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
- ue_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
- tot_errcount += 2 * count;
- }
-
- edac_dbg(4, "allocating %d error counters\n", tot_errcount);
- pvt = edac_align_ptr(&ptr, sz_pvt, 1);
- size = ((unsigned long)pvt) + sz_pvt;
+ mci = edac_align_ptr(&ptr, sizeof(*mci), 1);
+ layer = edac_align_ptr(&ptr, sizeof(*layer), n_layers);
+ pvt = edac_align_ptr(&ptr, sz_pvt, 1);
+ size = ((unsigned long)pvt) + sz_pvt;

edac_dbg(1, "allocating %u bytes for mci data (%d %s, %d csrows/channels)\n",
size,
@@ -504,10 +493,6 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
* rather than an imaginary chunk of memory located at address 0.
*/
layer = (struct edac_mc_layer *)(((char *)mci) + ((unsigned long)layer));
- for (i = 0; i < n_layers; i++) {
- mci->ce_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ce_per_layer[i]));
- mci->ue_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ue_per_layer[i]));
- }
pvt = sz_pvt ? (((char *)mci) + ((unsigned long)pvt)) : NULL;

/* setup index and various internal pointers */
@@ -949,48 +934,28 @@ static void edac_inc_ce_error(struct edac_raw_error_desc *e)
{
struct mem_ctl_info *mci = error_desc_to_mci(e);
int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
- int i, index = 0;
+ struct dimm_info *dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);

mci->ce_mc += e->error_count;

- if (pos[0] < 0) {
+ if (dimm)
+ dimm->ce_count += e->error_count;
+ else
mci->ce_noinfo_count += e->error_count;
- return;
- }
-
- for (i = 0; i < mci->n_layers; i++) {
- if (pos[i] < 0)
- break;
- index += pos[i];
- mci->ce_per_layer[i][index] += e->error_count;
-
- if (i < mci->n_layers - 1)
- index *= mci->layers[i + 1].size;
- }
}

static void edac_inc_ue_error(struct edac_raw_error_desc *e)
{
struct mem_ctl_info *mci = error_desc_to_mci(e);
int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
- int i, index = 0;
+ struct dimm_info *dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);

mci->ue_mc += e->error_count;

- if (pos[0] < 0) {
+ if (dimm)
+ dimm->ue_count += e->error_count;
+ else
mci->ue_noinfo_count += e->error_count;
- return;
- }
-
- for (i = 0; i < mci->n_layers; i++) {
- if (pos[i] < 0)
- break;
- index += pos[i];
- mci->ue_per_layer[i][index] += e->error_count;
-
- if (i < mci->n_layers - 1)
- index *= mci->layers[i + 1].size;
- }
}

static void edac_ce_error(struct edac_raw_error_desc *e)
@@ -1137,7 +1102,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
/*
* Check if the event report is consistent and if the memory
* location is known. If it is known, the DIMM(s) label info
- * will be filled and the per-layer error counters will be
+ * will be filled and the DIMM's error counters will be
* incremented.
*/
for (i = 0; i < mci->n_layers; i++) {
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 0367554e7437..8682df2f7f4f 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -556,10 +556,8 @@ static ssize_t dimmdev_ce_count_show(struct device *dev,
char *data)
{
struct dimm_info *dimm = to_dimm(dev);
- u32 count;

- count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
- return sprintf(data, "%u\n", count);
+ return sprintf(data, "%u\n", dimm->ce_count);
}

static ssize_t dimmdev_ue_count_show(struct device *dev,
@@ -567,10 +565,8 @@ static ssize_t dimmdev_ue_count_show(struct device *dev,
char *data)
{
struct dimm_info *dimm = to_dimm(dev);
- u32 count;

- count = dimm->mci->ue_per_layer[dimm->mci->n_layers-1][dimm->idx];
- return sprintf(data, "%u\n", count);
+ return sprintf(data, "%u\n", dimm->ue_count);
}

/* dimm/rank attribute files */
@@ -666,7 +662,9 @@ static ssize_t mci_reset_counters_store(struct device *dev,
const char *data, size_t count)
{
struct mem_ctl_info *mci = to_mci(dev);
- int cnt, row, chan, i;
+ struct dimm_info *dimm;
+ int row, chan;
+
mci->ue_mc = 0;
mci->ce_mc = 0;
mci->ue_noinfo_count = 0;
@@ -682,11 +680,9 @@ static ssize_t mci_reset_counters_store(struct device *dev,
ri->channels[chan]->ce_count = 0;
}

- cnt = 1;
- for (i = 0; i < mci->n_layers; i++) {
- cnt *= mci->layers[i].size;
- memset(mci->ce_per_layer[i], 0, cnt * sizeof(u32));
- memset(mci->ue_per_layer[i], 0, cnt * sizeof(u32));
+ mci_for_each_dimm(mci, dimm) {
+ dimm->ue_count = 0;
+ dimm->ce_count = 0;
}

mci->start_time = jiffies;
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 815f246e0abd..0f20b986b0ab 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -383,6 +383,9 @@ struct dimm_info {
unsigned int csrow, cschannel; /* Points to the old API data */

u16 smbios_handle; /* Handle for SMBIOS type 17 */
+
+ u32 ce_count;
+ u32 ue_count;
};

/**
@@ -559,7 +562,6 @@ struct mem_ctl_info {
*/
u32 ce_noinfo_count, ue_noinfo_count;
u32 ue_mc, ce_mc;
- u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS];

struct completion complete;

--
2.20.1

2019-12-05 09:40:42

by Robert Richter

[permalink] [raw]
Subject: [PATCH v2 03/10] EDAC: Store error type in struct edac_raw_error_desc

Store the error type in struct edac_raw_error_desc. This makes the
type parameter of edac_raw_mc_handle_error() obsolete.

Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Reviewed-by: Mauro Carvalho Chehab <[email protected]>
---
drivers/edac/edac_mc.c | 10 +++++-----
drivers/edac/edac_mc.h | 4 +---
drivers/edac/ghes_edac.c | 11 +++++------
include/linux/edac.h | 2 ++
4 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index f2dee4e8ba85..ecab08032b4a 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1084,8 +1084,7 @@ static void edac_ue_error(struct mem_ctl_info *mci,
edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
}

-void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
+void edac_raw_mc_handle_error(struct mem_ctl_info *mci,
struct edac_raw_error_desc *e)
{
char detail[80];
@@ -1100,14 +1099,14 @@ void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,

/* Report the error via the trace interface */
if (IS_ENABLED(CONFIG_RAS))
- trace_mc_event(type, e->msg, e->label, e->error_count,
+ trace_mc_event(e->type, e->msg, e->label, e->error_count,
mci->mc_idx, e->top_layer, e->mid_layer,
e->low_layer,
(e->page_frame_number << PAGE_SHIFT) | e->offset_in_page,
grain_bits, e->syndrome, e->other_detail);

/* Memory type dependent details about the error */
- if (type == HW_EVENT_ERR_CORRECTED) {
+ if (e->type == HW_EVENT_ERR_CORRECTED) {
snprintf(detail, sizeof(detail),
"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
e->page_frame_number, e->offset_in_page,
@@ -1152,6 +1151,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
/* Fills the error report buffer */
memset(e, 0, sizeof (*e));
e->error_count = error_count;
+ e->type = type;
e->top_layer = top_layer;
e->mid_layer = mid_layer;
e->low_layer = low_layer;
@@ -1282,6 +1282,6 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
if (p > e->location)
*(p - 1) = '\0';

- edac_raw_mc_handle_error(type, mci, e);
+ edac_raw_mc_handle_error(mci, e);
}
EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index 02aac5c61d00..5d78be774f9e 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -212,7 +212,6 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
* edac_raw_mc_handle_error() - Reports a memory event to userspace without
* doing anything to discover the error location.
*
- * @type: severity of the error (CE/UE/Fatal)
* @mci: a struct mem_ctl_info pointer
* @e: error description
*
@@ -220,8 +219,7 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
* only be called directly when the hardware error come directly from BIOS,
* like in the case of APEI GHES driver.
*/
-void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
+void edac_raw_mc_handle_error(struct mem_ctl_info *mci,
struct edac_raw_error_desc *e);

/**
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index b99080d8a10c..7c3e5264a41e 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -201,7 +201,6 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)

void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
- enum hw_event_mc_err_type type;
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
struct ghes_edac_pvt *pvt;
@@ -240,17 +239,17 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

switch (sev) {
case GHES_SEV_CORRECTED:
- type = HW_EVENT_ERR_CORRECTED;
+ e->type = HW_EVENT_ERR_CORRECTED;
break;
case GHES_SEV_RECOVERABLE:
- type = HW_EVENT_ERR_UNCORRECTED;
+ e->type = HW_EVENT_ERR_UNCORRECTED;
break;
case GHES_SEV_PANIC:
- type = HW_EVENT_ERR_FATAL;
+ e->type = HW_EVENT_ERR_FATAL;
break;
default:
case GHES_SEV_NO:
- type = HW_EVENT_ERR_INFO;
+ e->type = HW_EVENT_ERR_INFO;
}

edac_dbg(1, "error validation_bits: 0x%08llx\n",
@@ -442,7 +441,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (p > pvt->other_detail)
*(p - 1) = '\0';

- edac_raw_mc_handle_error(type, mci, e);
+ edac_raw_mc_handle_error(mci, e);

unlock:
spin_unlock_irqrestore(&ghes_lock, flags);
diff --git a/include/linux/edac.h b/include/linux/edac.h
index cc31b9742684..6703eb492cd2 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -442,6 +442,7 @@ struct errcount_attribute_data {
* struct edac_raw_error_desc - Raw error report structure
* @grain: minimum granularity for an error report, in bytes
* @error_count: number of errors of the same type
+ * @type: severity of the error (CE/UE/Fatal)
* @top_layer: top layer of the error (layer[0])
* @mid_layer: middle layer of the error (layer[1])
* @low_layer: low layer of the error (layer[2])
@@ -462,6 +463,7 @@ struct edac_raw_error_desc {
long grain;

u16 error_count;
+ enum hw_event_mc_err_type type;
int top_layer;
int mid_layer;
int low_layer;
--
2.20.1

2020-01-22 17:05:00

by Aristeu Rozanski

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] EDAC: Rework core and ghes drivers, part two

On Thu, Dec 05, 2019 at 09:37:55AM +0000, Robert Richter wrote:
> This patch set is part two of a rework of the ghes_edac and edac_mc
> driver. It addresses issues found during code review and while working
> with the code. Part one has been included to v5.5, see:
>
> https://lore.kernel.org/patchwork/cover/1093488/
>
> The changes of this series include:
>
> * add helper functions and factor out code (#1, #2, #5)
>
> * improve function interfaces and data structures to decrease
> complexity such as number of function arguments, unused data, etc.
> (#3, #4, #7, #8, #9, #10),
>
> * minor functional fixes (#6)
>
> * improve code readability (#9)
>
> V2:
> * fixed documentation issue in #3 found by "kbuild test robot
> <[email protected]>"
>
> Changes compared to part one:
> * rebased onto 5781823fd0d3 ("EDAC/altera: Use the Altera System
> Manager driver")
> * reworded patch subjects
> * reordered patches
> * collected Mauro's Reviewed-by-tags (note: I kept them though there
> has been small conflicts but dropped it when reworked)
> * dropped: "EDAC/mc: Rework edac_raw_mc_handle_error() to use struct
> dimm_info"
> * split "EDAC/mc: Remove per layer counters" into smaller changes
> * added:
> "EDAC/mc: Report "unknown memory" on too many DIMM labels found"
> "EDAC/mc: Remove enable_per_layer_report function arguments"
> "EDAC/mc: Pass the error descriptor to error reporting functions"
> "EDAC/mc: Remove detail[] string and cleanup error string
> generation"
> * moved to the end:
> "EDAC/mc: Remove per layer counters"
>
>
> Robert Richter (10):
> EDAC/mc: Split edac_mc_alloc() into smaller functions
> EDAC/mc: Reorder functions edac_mc_alloc*()
> EDAC: Store error type in struct edac_raw_error_desc
> EDAC/mc: Determine mci pointer from the error descriptor
> EDAC/mc: Create new function edac_inc_csrow()
> EDAC/mc: Report "unknown memory" on too many DIMM labels found
> EDAC/mc: Remove enable_per_layer_report function arguments
> EDAC/mc: Pass the error descriptor to error reporting functions
> EDAC/mc: Remove detail[] string and cleanup error string generation
> EDAC/mc: Remove per layer counters
>
> drivers/edac/edac_mc.c | 496 ++++++++++++++++-------------------
> drivers/edac/edac_mc.h | 6 +-
> drivers/edac/edac_mc_sysfs.c | 20 +-
> drivers/edac/ghes_edac.c | 16 +-
> include/linux/edac.h | 9 +-
> 5 files changed, 249 insertions(+), 298 deletions(-)

Acked-by: Aristeu Rozanski <[email protected]>

--
Aristeu