This patch set is part two of a rework of the ghes_edac and edac_mc
driver. It addresses issues found during code review and while working
with the code. Part one has been included to v5.5, see:
https://lore.kernel.org/patchwork/cover/1093488/
The changes of this series include:
* add helper functions and factor out code (#1, #2, #5)
* improve function interfaces and data structures to decrease
complexity such as number of function arguments, unused data, etc.
(#3, #4, #7, #8, #9, #10),
* minor functional fixes (#6)
* improve code readability (#9)
V3:
* rebased onto edac-for-next + "EDAC/mc: Fix use-after-free and
memleaks during device removal", no code changes:
7e5d6cf35329 ("EDAC/amd64: Do not warn when removing instances")
https://lore.kernel.org/patchwork/patch/1169444/
* added Aristeu's ACKs
V2:
* fixed documentation issue in #3 found by "kbuild test robot
<[email protected]>"
Changes compared to part one:
* rebased onto 5781823fd0d3 ("EDAC/altera: Use the Altera System
Manager driver")
* reworded patch subjects
* reordered patches
* collected Mauro's Reviewed-by-tags (note: I kept them though there
has been small conflicts but dropped it when reworked)
* dropped: "EDAC/mc: Rework edac_raw_mc_handle_error() to use struct
dimm_info"
* split "EDAC/mc: Remove per layer counters" into smaller changes
* added:
"EDAC/mc: Report "unknown memory" on too many DIMM labels found"
"EDAC/mc: Remove enable_per_layer_report function arguments"
"EDAC/mc: Pass the error descriptor to error reporting functions"
"EDAC/mc: Remove detail[] string and cleanup error string
generation"
* moved to the end:
"EDAC/mc: Remove per layer counters"
Robert Richter (10):
EDAC/mc: Split edac_mc_alloc() into smaller functions
EDAC/mc: Reorder functions edac_mc_alloc*()
EDAC: Store error type in struct edac_raw_error_desc
EDAC/mc: Determine mci pointer from the error descriptor
EDAC/mc: Create new function edac_inc_csrow()
EDAC/mc: Report "unknown memory" on too many DIMM labels found
EDAC/mc: Remove enable_per_layer_report function arguments
EDAC/mc: Pass the error descriptor to error reporting functions
EDAC/mc: Remove detail[] string and cleanup error string generation
EDAC/mc: Remove per layer counters
drivers/edac/edac_mc.c | 502 ++++++++++++++++-------------------
drivers/edac/edac_mc.h | 6 +-
drivers/edac/edac_mc_sysfs.c | 20 +-
drivers/edac/ghes_edac.c | 16 +-
include/linux/edac.h | 9 +-
5 files changed, 252 insertions(+), 301 deletions(-)
--
2.20.1
Have a separate function to count errors in csrow/channel. This better
separates code and reduces the indentation level. No functional
changes.
Signed-off-by: Robert Richter <[email protected]>
Reviewed-by: Mauro Carvalho Chehab <[email protected]>
Acked-by: Aristeu Rozanski <[email protected]>
---
drivers/edac/edac_mc.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 3c00c046acc9..e75cb7a9c454 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1091,6 +1091,26 @@ static void edac_ue_error(struct mem_ctl_info *mci,
edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
}
+static void edac_inc_csrow(struct edac_raw_error_desc *e, int row, int chan)
+{
+ struct mem_ctl_info *mci = error_desc_to_mci(e);
+ u16 count = e->error_count;
+ enum hw_event_mc_err_type type = e->type;
+
+ if (row < 0)
+ return;
+
+ edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
+
+ if (type == HW_EVENT_ERR_CORRECTED) {
+ mci->csrows[row]->ce_count += count;
+ if (chan >= 0)
+ mci->csrows[row]->channels[chan]->ce_count += count;
+ } else {
+ mci->csrows[row]->ue_count += count;
+ }
+}
+
void edac_raw_mc_handle_error(struct edac_raw_error_desc *e)
{
struct mem_ctl_info *mci = error_desc_to_mci(e);
@@ -1258,22 +1278,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
chan = -2;
}
- if (!e->enable_per_layer_report) {
+ if (!e->enable_per_layer_report)
strcpy(e->label, "any memory");
- } else {
- edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
- if (p == e->label)
- strcpy(e->label, "unknown memory");
- if (type == HW_EVENT_ERR_CORRECTED) {
- if (row >= 0) {
- mci->csrows[row]->ce_count += error_count;
- if (chan >= 0)
- mci->csrows[row]->channels[chan]->ce_count += error_count;
- }
- } else
- if (row >= 0)
- mci->csrows[row]->ue_count += error_count;
- }
+ else if (!*e->label)
+ strcpy(e->label, "unknown memory");
+
+ edac_inc_csrow(e, row, chan);
/* Fill the RAM location data */
p = e->location;
--
2.20.1
Each struct mci has its own error descriptor. Create a function
error_desc_to_mci() to determine the corresponding mci from an error
descriptor. This eases the parameter list of edac_raw_mc_handle_
error() as the mci pointer do not need to be passed any longer.
Signed-off-by: Robert Richter <[email protected]>
Reviewed-by: Mauro Carvalho Chehab <[email protected]>
Acked-by: Aristeu Rozanski <[email protected]>
---
drivers/edac/edac_mc.c | 11 ++++++++---
drivers/edac/edac_mc.h | 4 +---
drivers/edac/ghes_edac.c | 2 +-
3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8ef69d24297d..3c00c046acc9 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -55,6 +55,11 @@ static LIST_HEAD(mc_devices);
*/
static const char *edac_mc_owner;
+static struct mem_ctl_info *error_desc_to_mci(struct edac_raw_error_desc *e)
+{
+ return container_of(e, struct mem_ctl_info, error_desc);
+}
+
int edac_get_report_status(void)
{
return edac_report;
@@ -1086,9 +1091,9 @@ static void edac_ue_error(struct mem_ctl_info *mci,
edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
}
-void edac_raw_mc_handle_error(struct mem_ctl_info *mci,
- struct edac_raw_error_desc *e)
+void edac_raw_mc_handle_error(struct edac_raw_error_desc *e)
{
+ struct mem_ctl_info *mci = error_desc_to_mci(e);
char detail[80];
int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
u8 grain_bits;
@@ -1284,6 +1289,6 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
if (p > e->location)
*(p - 1) = '\0';
- edac_raw_mc_handle_error(mci, e);
+ edac_raw_mc_handle_error(e);
}
EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index 5d78be774f9e..881b00eadf7a 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -212,15 +212,13 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
* edac_raw_mc_handle_error() - Reports a memory event to userspace without
* doing anything to discover the error location.
*
- * @mci: a struct mem_ctl_info pointer
* @e: error description
*
* This raw function is used internally by edac_mc_handle_error(). It should
* only be called directly when the hardware error come directly from BIOS,
* like in the case of APEI GHES driver.
*/
-void edac_raw_mc_handle_error(struct mem_ctl_info *mci,
- struct edac_raw_error_desc *e);
+void edac_raw_mc_handle_error(struct edac_raw_error_desc *e);
/**
* edac_mc_handle_error() - Reports a memory event to userspace.
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 7c3e5264a41e..bef8a428c429 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -441,7 +441,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (p > pvt->other_detail)
*(p - 1) = '\0';
- edac_raw_mc_handle_error(mci, e);
+ edac_raw_mc_handle_error(e);
unlock:
spin_unlock_irqrestore(&ghes_lock, flags);
--
2.20.1
There is a limitation to report only EDAC_MAX_LABELS in e->label of
the error descriptor. This is to prevent a possible string overflow.
Current implementation falls back to "any memory" in this case and
also stops all further processing to find a unique row and channel of
the possible error location. Reporting "any memory" is wrong as the
memory controller reported an error location for one of the layers.
Instead, report "unknown memory" and also do not break early in the
loop to further check row and channel for uniqueness.
Signed-off-by: Robert Richter <[email protected]>
Acked-by: Aristeu Rozanski <[email protected]>
---
drivers/edac/edac_mc.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index e75cb7a9c454..aa94152777fe 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1245,20 +1245,21 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
* channel/memory controller/... may be affected.
* Also, don't show errors for empty DIMM slots.
*/
- if (!e->enable_per_layer_report || !dimm->nr_pages)
+ if (!dimm->nr_pages)
continue;
- if (n_labels >= EDAC_MAX_LABELS) {
- e->enable_per_layer_report = false;
- break;
- }
n_labels++;
- if (p != e->label) {
- strcpy(p, OTHER_LABEL);
- p += strlen(OTHER_LABEL);
+ if (n_labels > EDAC_MAX_LABELS) {
+ p = e->label;
+ *p = '\0';
+ } else {
+ if (p != e->label) {
+ strcpy(p, OTHER_LABEL);
+ p += strlen(OTHER_LABEL);
+ }
+ strcpy(p, dimm->label);
+ p += strlen(p);
}
- strcpy(p, dimm->label);
- p += strlen(p);
/*
* get csrow/channel of the DIMM, in order to allow
--
2.20.1
Many functions carry the enable_per_layer_report argument. This is a
bool value indicating the error information contains some location
data where the error occurred. This can easily being determined by
checking the pos[] array for values. Negative values indicate there is
no location available. So if the top layer is negative, the error
location is unknown.
Just check if the top layer is negative and remove
enable_per_layer_report as function argument and also from struct
edac_raw_error_desc.
Signed-off-by: Robert Richter <[email protected]>
Acked-by: Aristeu Rozanski <[email protected]>
---
drivers/edac/edac_mc.c | 42 +++++++++++++++++++---------------------
drivers/edac/ghes_edac.c | 5 +----
include/linux/edac.h | 3 ---
3 files changed, 21 insertions(+), 29 deletions(-)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index aa94152777fe..35e427f89949 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -948,7 +948,6 @@ const char *edac_layer_name[] = {
EXPORT_SYMBOL_GPL(edac_layer_name);
static void edac_inc_ce_error(struct mem_ctl_info *mci,
- bool enable_per_layer_report,
const int pos[EDAC_MAX_LAYERS],
const u16 count)
{
@@ -956,7 +955,7 @@ static void edac_inc_ce_error(struct mem_ctl_info *mci,
mci->ce_mc += count;
- if (!enable_per_layer_report) {
+ if (pos[0] < 0) {
mci->ce_noinfo_count += count;
return;
}
@@ -973,7 +972,6 @@ static void edac_inc_ce_error(struct mem_ctl_info *mci,
}
static void edac_inc_ue_error(struct mem_ctl_info *mci,
- bool enable_per_layer_report,
const int pos[EDAC_MAX_LAYERS],
const u16 count)
{
@@ -981,7 +979,7 @@ static void edac_inc_ue_error(struct mem_ctl_info *mci,
mci->ue_mc += count;
- if (!enable_per_layer_report) {
+ if (pos[0] < 0) {
mci->ue_noinfo_count += count;
return;
}
@@ -1005,7 +1003,6 @@ static void edac_ce_error(struct mem_ctl_info *mci,
const char *label,
const char *detail,
const char *other_detail,
- const bool enable_per_layer_report,
const unsigned long page_frame_number,
const unsigned long offset_in_page,
long grain)
@@ -1028,7 +1025,7 @@ static void edac_ce_error(struct mem_ctl_info *mci,
error_count, msg, msg_aux, label,
location, detail);
}
- edac_inc_ce_error(mci, enable_per_layer_report, pos, error_count);
+ edac_inc_ce_error(mci, pos, error_count);
if (mci->scrub_mode == SCRUB_SW_SRC) {
/*
@@ -1058,8 +1055,7 @@ static void edac_ue_error(struct mem_ctl_info *mci,
const char *location,
const char *label,
const char *detail,
- const char *other_detail,
- const bool enable_per_layer_report)
+ const char *other_detail)
{
char *msg_aux = "";
@@ -1088,7 +1084,7 @@ static void edac_ue_error(struct mem_ctl_info *mci,
msg, msg_aux, label, location, detail);
}
- edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
+ edac_inc_ue_error(mci, pos, error_count);
}
static void edac_inc_csrow(struct edac_raw_error_desc *e, int row, int chan)
@@ -1138,16 +1134,16 @@ void edac_raw_mc_handle_error(struct edac_raw_error_desc *e)
"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
e->page_frame_number, e->offset_in_page,
e->grain, e->syndrome);
- edac_ce_error(mci, e->error_count, pos, e->msg, e->location, e->label,
- detail, e->other_detail, e->enable_per_layer_report,
+ edac_ce_error(mci, e->error_count, pos, e->msg, e->location,
+ e->label, detail, e->other_detail,
e->page_frame_number, e->offset_in_page, e->grain);
} else {
snprintf(detail, sizeof(detail),
"page:0x%lx offset:0x%lx grain:%ld",
e->page_frame_number, e->offset_in_page, e->grain);
- edac_ue_error(mci, e->error_count, pos, e->msg, e->location, e->label,
- detail, e->other_detail, e->enable_per_layer_report);
+ edac_ue_error(mci, e->error_count, pos, e->msg, e->location,
+ e->label, detail, e->other_detail);
}
@@ -1172,6 +1168,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
int i, n_labels = 0;
struct edac_raw_error_desc *e = &mci->error_desc;
+ bool any_memory = true;
edac_dbg(3, "MC%d\n", mci->mc_idx);
@@ -1190,9 +1187,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
/*
* Check if the event report is consistent and if the memory
- * location is known. If it is known, enable_per_layer_report will be
- * true, the DIMM(s) label info will be filled and the per-layer
- * error counters will be incremented.
+ * location is known. If it is known, the DIMM(s) label info
+ * will be filled and the per-layer error counters will be
+ * incremented.
*/
for (i = 0; i < mci->n_layers; i++) {
if (pos[i] >= (int)mci->layers[i].size) {
@@ -1210,7 +1207,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
pos[i] = -1;
}
if (pos[i] >= 0)
- e->enable_per_layer_report = true;
+ any_memory = false;
}
/*
@@ -1240,10 +1237,11 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
e->grain = dimm->grain;
/*
- * If the error is memory-controller wide, there's no need to
- * seek for the affected DIMMs because the whole
- * channel/memory controller/... may be affected.
- * Also, don't show errors for empty DIMM slots.
+ * If the error is memory-controller wide, there's no
+ * need to seek for the affected DIMMs because the
+ * whole channel/memory controller/... may be
+ * affected. Also, don't show errors for empty DIMM
+ * slots.
*/
if (!dimm->nr_pages)
continue;
@@ -1279,7 +1277,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
chan = -2;
}
- if (!e->enable_per_layer_report)
+ if (any_memory)
strcpy(e->label, "any memory");
else if (!*e->label)
strcpy(e->label, "unknown memory");
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index bef8a428c429..cb3dab56a875 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -355,11 +355,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
mem_err->mem_dev_handle);
index = get_dimm_smbios_index(mci, mem_err->mem_dev_handle);
- if (index >= 0) {
+ if (index >= 0)
e->top_layer = index;
- e->enable_per_layer_report = true;
- }
-
}
if (p > e->location)
*(p - 1) = '\0';
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 6703eb492cd2..815f246e0abd 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -454,8 +454,6 @@ struct errcount_attribute_data {
* @location: location of the error
* @label: label of the affected DIMM(s)
* @other_detail: other driver-specific detail about the error
- * @enable_per_layer_report: if false, the error affects all layers
- * (typically, a memory controller error)
*/
struct edac_raw_error_desc {
char location[LOCATION_SIZE];
@@ -472,7 +470,6 @@ struct edac_raw_error_desc {
unsigned long syndrome;
const char *msg;
const char *other_detail;
- bool enable_per_layer_report;
};
/* MEMORY controller information structure
--
2.20.1
Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
turns out that only the leaves in the memory hierarchy are consumed
(in sysfs), but not the intermediate layers, e.g.:
count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
These unused counters only add complexity, remove them. The error
counter values are directly stored in struct dimm_info now.
Signed-off-by: Robert Richter <[email protected]>
Acked-by: Aristeu Rozanski <[email protected]>
---
drivers/edac/edac_mc.c | 65 +++++++++---------------------------
drivers/edac/edac_mc_sysfs.c | 20 +++++------
include/linux/edac.h | 4 ++-
3 files changed, 26 insertions(+), 63 deletions(-)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 60639def8697..fbd9faa5c0f9 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -451,11 +451,9 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
{
struct mem_ctl_info *mci;
struct edac_mc_layer *layer;
- u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS];
- unsigned int idx, size, tot_dimms = 1, count = 1;
- unsigned int tot_csrows = 1, tot_channels = 1, tot_errcount = 0;
+ unsigned int idx, size, tot_dimms = 1;
+ unsigned int tot_csrows = 1, tot_channels = 1;
void *pvt, *ptr = NULL;
- int i;
bool per_rank = false;
if (WARN_ON(n_layers > EDAC_MAX_LAYERS || n_layers == 0))
@@ -482,19 +480,10 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
* stringent as what the compiler would provide if we could simply
* hardcode everything into a single struct.
*/
- mci = edac_align_ptr(&ptr, sizeof(*mci), 1);
- layer = edac_align_ptr(&ptr, sizeof(*layer), n_layers);
- for (i = 0; i < n_layers; i++) {
- count *= layers[i].size;
- edac_dbg(4, "errcount layer %d size %d\n", i, count);
- ce_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
- ue_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
- tot_errcount += 2 * count;
- }
-
- edac_dbg(4, "allocating %d error counters\n", tot_errcount);
- pvt = edac_align_ptr(&ptr, sz_pvt, 1);
- size = ((unsigned long)pvt) + sz_pvt;
+ mci = edac_align_ptr(&ptr, sizeof(*mci), 1);
+ layer = edac_align_ptr(&ptr, sizeof(*layer), n_layers);
+ pvt = edac_align_ptr(&ptr, sz_pvt, 1);
+ size = ((unsigned long)pvt) + sz_pvt;
edac_dbg(1, "allocating %u bytes for mci data (%d %s, %d csrows/channels)\n",
size,
@@ -513,10 +502,6 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
* rather than an imaginary chunk of memory located at address 0.
*/
layer = (struct edac_mc_layer *)(((char *)mci) + ((unsigned long)layer));
- for (i = 0; i < n_layers; i++) {
- mci->ce_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ce_per_layer[i]));
- mci->ue_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ue_per_layer[i]));
- }
pvt = sz_pvt ? (((char *)mci) + ((unsigned long)pvt)) : NULL;
/* setup index and various internal pointers */
@@ -951,48 +936,28 @@ static void edac_inc_ce_error(struct edac_raw_error_desc *e)
{
struct mem_ctl_info *mci = error_desc_to_mci(e);
int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
- int i, index = 0;
+ struct dimm_info *dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);
mci->ce_mc += e->error_count;
- if (pos[0] < 0) {
+ if (dimm)
+ dimm->ce_count += e->error_count;
+ else
mci->ce_noinfo_count += e->error_count;
- return;
- }
-
- for (i = 0; i < mci->n_layers; i++) {
- if (pos[i] < 0)
- break;
- index += pos[i];
- mci->ce_per_layer[i][index] += e->error_count;
-
- if (i < mci->n_layers - 1)
- index *= mci->layers[i + 1].size;
- }
}
static void edac_inc_ue_error(struct edac_raw_error_desc *e)
{
struct mem_ctl_info *mci = error_desc_to_mci(e);
int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
- int i, index = 0;
+ struct dimm_info *dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);
mci->ue_mc += e->error_count;
- if (pos[0] < 0) {
+ if (dimm)
+ dimm->ue_count += e->error_count;
+ else
mci->ue_noinfo_count += e->error_count;
- return;
- }
-
- for (i = 0; i < mci->n_layers; i++) {
- if (pos[i] < 0)
- break;
- index += pos[i];
- mci->ue_per_layer[i][index] += e->error_count;
-
- if (i < mci->n_layers - 1)
- index *= mci->layers[i + 1].size;
- }
}
static void edac_ce_error(struct edac_raw_error_desc *e)
@@ -1139,7 +1104,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
/*
* Check if the event report is consistent and if the memory
* location is known. If it is known, the DIMM(s) label info
- * will be filled and the per-layer error counters will be
+ * will be filled and the DIMM's error counters will be
* incremented.
*/
for (i = 0; i < mci->n_layers; i++) {
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 408bace699dc..20657530a108 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -551,10 +551,8 @@ static ssize_t dimmdev_ce_count_show(struct device *dev,
char *data)
{
struct dimm_info *dimm = to_dimm(dev);
- u32 count;
- count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
- return sprintf(data, "%u\n", count);
+ return sprintf(data, "%u\n", dimm->ce_count);
}
static ssize_t dimmdev_ue_count_show(struct device *dev,
@@ -562,10 +560,8 @@ static ssize_t dimmdev_ue_count_show(struct device *dev,
char *data)
{
struct dimm_info *dimm = to_dimm(dev);
- u32 count;
- count = dimm->mci->ue_per_layer[dimm->mci->n_layers-1][dimm->idx];
- return sprintf(data, "%u\n", count);
+ return sprintf(data, "%u\n", dimm->ue_count);
}
/* dimm/rank attribute files */
@@ -661,7 +657,9 @@ static ssize_t mci_reset_counters_store(struct device *dev,
const char *data, size_t count)
{
struct mem_ctl_info *mci = to_mci(dev);
- int cnt, row, chan, i;
+ struct dimm_info *dimm;
+ int row, chan;
+
mci->ue_mc = 0;
mci->ce_mc = 0;
mci->ue_noinfo_count = 0;
@@ -677,11 +675,9 @@ static ssize_t mci_reset_counters_store(struct device *dev,
ri->channels[chan]->ce_count = 0;
}
- cnt = 1;
- for (i = 0; i < mci->n_layers; i++) {
- cnt *= mci->layers[i].size;
- memset(mci->ce_per_layer[i], 0, cnt * sizeof(u32));
- memset(mci->ue_per_layer[i], 0, cnt * sizeof(u32));
+ mci_for_each_dimm(mci, dimm) {
+ dimm->ue_count = 0;
+ dimm->ce_count = 0;
}
mci->start_time = jiffies;
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 815f246e0abd..0f20b986b0ab 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -383,6 +383,9 @@ struct dimm_info {
unsigned int csrow, cschannel; /* Points to the old API data */
u16 smbios_handle; /* Handle for SMBIOS type 17 */
+
+ u32 ce_count;
+ u32 ue_count;
};
/**
@@ -559,7 +562,6 @@ struct mem_ctl_info {
*/
u32 ce_noinfo_count, ue_noinfo_count;
u32 ue_mc, ce_mc;
- u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS];
struct completion complete;
--
2.20.1
Store the error type in struct edac_raw_error_desc. This makes the
type parameter of edac_raw_mc_handle_error() obsolete.
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Reviewed-by: Mauro Carvalho Chehab <[email protected]>
Acked-by: Aristeu Rozanski <[email protected]>
---
drivers/edac/edac_mc.c | 10 +++++-----
drivers/edac/edac_mc.h | 4 +---
drivers/edac/ghes_edac.c | 11 +++++------
include/linux/edac.h | 2 ++
4 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 1e227e69e216..8ef69d24297d 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1086,8 +1086,7 @@ static void edac_ue_error(struct mem_ctl_info *mci,
edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
}
-void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
+void edac_raw_mc_handle_error(struct mem_ctl_info *mci,
struct edac_raw_error_desc *e)
{
char detail[80];
@@ -1102,14 +1101,14 @@ void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
/* Report the error via the trace interface */
if (IS_ENABLED(CONFIG_RAS))
- trace_mc_event(type, e->msg, e->label, e->error_count,
+ trace_mc_event(e->type, e->msg, e->label, e->error_count,
mci->mc_idx, e->top_layer, e->mid_layer,
e->low_layer,
(e->page_frame_number << PAGE_SHIFT) | e->offset_in_page,
grain_bits, e->syndrome, e->other_detail);
/* Memory type dependent details about the error */
- if (type == HW_EVENT_ERR_CORRECTED) {
+ if (e->type == HW_EVENT_ERR_CORRECTED) {
snprintf(detail, sizeof(detail),
"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
e->page_frame_number, e->offset_in_page,
@@ -1154,6 +1153,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
/* Fills the error report buffer */
memset(e, 0, sizeof (*e));
e->error_count = error_count;
+ e->type = type;
e->top_layer = top_layer;
e->mid_layer = mid_layer;
e->low_layer = low_layer;
@@ -1284,6 +1284,6 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
if (p > e->location)
*(p - 1) = '\0';
- edac_raw_mc_handle_error(type, mci, e);
+ edac_raw_mc_handle_error(mci, e);
}
EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index 02aac5c61d00..5d78be774f9e 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -212,7 +212,6 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
* edac_raw_mc_handle_error() - Reports a memory event to userspace without
* doing anything to discover the error location.
*
- * @type: severity of the error (CE/UE/Fatal)
* @mci: a struct mem_ctl_info pointer
* @e: error description
*
@@ -220,8 +219,7 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
* only be called directly when the hardware error come directly from BIOS,
* like in the case of APEI GHES driver.
*/
-void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
+void edac_raw_mc_handle_error(struct mem_ctl_info *mci,
struct edac_raw_error_desc *e);
/**
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index b99080d8a10c..7c3e5264a41e 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -201,7 +201,6 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
- enum hw_event_mc_err_type type;
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
struct ghes_edac_pvt *pvt;
@@ -240,17 +239,17 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
switch (sev) {
case GHES_SEV_CORRECTED:
- type = HW_EVENT_ERR_CORRECTED;
+ e->type = HW_EVENT_ERR_CORRECTED;
break;
case GHES_SEV_RECOVERABLE:
- type = HW_EVENT_ERR_UNCORRECTED;
+ e->type = HW_EVENT_ERR_UNCORRECTED;
break;
case GHES_SEV_PANIC:
- type = HW_EVENT_ERR_FATAL;
+ e->type = HW_EVENT_ERR_FATAL;
break;
default:
case GHES_SEV_NO:
- type = HW_EVENT_ERR_INFO;
+ e->type = HW_EVENT_ERR_INFO;
}
edac_dbg(1, "error validation_bits: 0x%08llx\n",
@@ -442,7 +441,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (p > pvt->other_detail)
*(p - 1) = '\0';
- edac_raw_mc_handle_error(type, mci, e);
+ edac_raw_mc_handle_error(mci, e);
unlock:
spin_unlock_irqrestore(&ghes_lock, flags);
diff --git a/include/linux/edac.h b/include/linux/edac.h
index cc31b9742684..6703eb492cd2 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -442,6 +442,7 @@ struct errcount_attribute_data {
* struct edac_raw_error_desc - Raw error report structure
* @grain: minimum granularity for an error report, in bytes
* @error_count: number of errors of the same type
+ * @type: severity of the error (CE/UE/Fatal)
* @top_layer: top layer of the error (layer[0])
* @mid_layer: middle layer of the error (layer[1])
* @low_layer: low layer of the error (layer[2])
@@ -462,6 +463,7 @@ struct edac_raw_error_desc {
long grain;
u16 error_count;
+ enum hw_event_mc_err_type type;
int top_layer;
int mid_layer;
int low_layer;
--
2.20.1
On Thu, Jan 23, 2020 at 09:02:58AM +0000, Robert Richter wrote:
> Have a separate function to count errors in csrow/channel. This better
> separates code and reduces the indentation level. No functional
> changes.
>
> Signed-off-by: Robert Richter <[email protected]>
> Reviewed-by: Mauro Carvalho Chehab <[email protected]>
> Acked-by: Aristeu Rozanski <[email protected]>
> ---
> drivers/edac/edac_mc.c | 40 +++++++++++++++++++++++++---------------
> 1 file changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> index 3c00c046acc9..e75cb7a9c454 100644
> --- a/drivers/edac/edac_mc.c
> +++ b/drivers/edac/edac_mc.c
> @@ -1091,6 +1091,26 @@ static void edac_ue_error(struct mem_ctl_info *mci,
> edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
> }
>
> +static void edac_inc_csrow(struct edac_raw_error_desc *e, int row, int chan)
> +{
> + struct mem_ctl_info *mci = error_desc_to_mci(e);
> + u16 count = e->error_count;
> + enum hw_event_mc_err_type type = e->type;
Please sort function local variables declaration in a reverse christmas
tree order:
<type A> longest_variable_name;
<type B> shorter_var_name;
<type C> even_shorter;
<type D> i;
> +
> + if (row < 0)
> + return;
> +
> + edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
> +
> + if (type == HW_EVENT_ERR_CORRECTED) {
> + mci->csrows[row]->ce_count += count;
> + if (chan >= 0)
> + mci->csrows[row]->channels[chan]->ce_count += count;
> + } else {
> + mci->csrows[row]->ue_count += count;
> + }
> +}
> +
> void edac_raw_mc_handle_error(struct edac_raw_error_desc *e)
> {
> struct mem_ctl_info *mci = error_desc_to_mci(e);
> @@ -1258,22 +1278,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> chan = -2;
> }
>
> - if (!e->enable_per_layer_report) {
> + if (!e->enable_per_layer_report)
> strcpy(e->label, "any memory");
> - } else {
> - edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
> - if (p == e->label)
> - strcpy(e->label, "unknown memory");
> - if (type == HW_EVENT_ERR_CORRECTED) {
> - if (row >= 0) {
> - mci->csrows[row]->ce_count += error_count;
> - if (chan >= 0)
> - mci->csrows[row]->channels[chan]->ce_count += error_count;
> - }
> - } else
> - if (row >= 0)
> - mci->csrows[row]->ue_count += error_count;
> - }
> + else if (!*e->label)
> + strcpy(e->label, "unknown memory");
> +
> + edac_inc_csrow(e, row, chan);
Err, but this has functional changes: the !e->enable_per_layer_report
case sets only the e->label and the else branch only does increment
->ce_count.
Your change gets ->ce_count incremented in both cases.
Why?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 13.02.20 13:47:08, Borislav Petkov wrote:
> On Thu, Jan 23, 2020 at 09:02:58AM +0000, Robert Richter wrote:
> > Have a separate function to count errors in csrow/channel. This better
> > separates code and reduces the indentation level. No functional
> > changes.
> >
> > Signed-off-by: Robert Richter <[email protected]>
> > Reviewed-by: Mauro Carvalho Chehab <[email protected]>
> > Acked-by: Aristeu Rozanski <[email protected]>
> > ---
> > drivers/edac/edac_mc.c | 40 +++++++++++++++++++++++++---------------
> > 1 file changed, 25 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> > index 3c00c046acc9..e75cb7a9c454 100644
> > --- a/drivers/edac/edac_mc.c
> > +++ b/drivers/edac/edac_mc.c
> > @@ -1091,6 +1091,26 @@ static void edac_ue_error(struct mem_ctl_info *mci,
> > edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
> > }
> >
> > +static void edac_inc_csrow(struct edac_raw_error_desc *e, int row, int chan)
> > +{
> > + struct mem_ctl_info *mci = error_desc_to_mci(e);
> > + u16 count = e->error_count;
> > + enum hw_event_mc_err_type type = e->type;
>
> Please sort function local variables declaration in a reverse christmas
> tree order:
>
> <type A> longest_variable_name;
> <type B> shorter_var_name;
> <type C> even_shorter;
> <type D> i;
I can change this. Does variable name length include the assignment?
Generally I prefer to sort it by size_of() to avoid holes due to
padding, though the compiler uses probably registers here anyway.
It's just a flavor.
If it's just this change, could you edit the patch to avoid respin?
>
> > +
> > + if (row < 0)
> > + return;
> > +
> > + edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
> > +
> > + if (type == HW_EVENT_ERR_CORRECTED) {
> > + mci->csrows[row]->ce_count += count;
> > + if (chan >= 0)
> > + mci->csrows[row]->channels[chan]->ce_count += count;
> > + } else {
> > + mci->csrows[row]->ue_count += count;
> > + }
> > +}
> > +
> > void edac_raw_mc_handle_error(struct edac_raw_error_desc *e)
> > {
> > struct mem_ctl_info *mci = error_desc_to_mci(e);
> > @@ -1258,22 +1278,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> > chan = -2;
> > }
> >
> > - if (!e->enable_per_layer_report) {
> > + if (!e->enable_per_layer_report)
> > strcpy(e->label, "any memory");
> > - } else {
> > - edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
> > - if (p == e->label)
> > - strcpy(e->label, "unknown memory");
> > - if (type == HW_EVENT_ERR_CORRECTED) {
> > - if (row >= 0) {
> > - mci->csrows[row]->ce_count += error_count;
> > - if (chan >= 0)
> > - mci->csrows[row]->channels[chan]->ce_count += error_count;
> > - }
> > - } else
> > - if (row >= 0)
> > - mci->csrows[row]->ue_count += error_count;
> > - }
> > + else if (!*e->label)
> > + strcpy(e->label, "unknown memory");
> > +
> > + edac_inc_csrow(e, row, chan);
>
> Err, but this has functional changes: the !e->enable_per_layer_report
> case sets only the e->label and the else branch only does increment
> ->ce_count.
>
> Your change gets ->ce_count incremented in both cases.
No, there is a check in edac_inc_csrow(): if (row < 0) ... In the case
of "any memory", row is also < 0, so nothing is counted. This is
reasonable since no dimm is found and row/channel is still set to the
setup value of -1.
-Robert
Have a separate function to count errors in csrow/channel. This better
separates code and reduces the indentation level.
Implementation note: Function edac_inc_csrow() counts the same as
before, ->ce_count is only incremented if row >= 0. This is esp. true
for the case of (!e->enable_per_layer_report). Here, a DIMM was not
found, variable row still has a value of -1 and ->ce_count is not
incremented.
Signed-off-by: Robert Richter <[email protected]>
Reviewed-by: Mauro Carvalho Chehab <[email protected]>
Acked-by: Aristeu Rozanski <[email protected]>
---
v2:
* updated patch description to address the case fir
(!e->enable_per_layer_report),
* reordered variable declarations
---
drivers/edac/edac_mc.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index e817a710739f..31ba988359d2 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1089,6 +1089,26 @@ static void edac_ue_error(struct mem_ctl_info *mci,
edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
}
+static void edac_inc_csrow(struct edac_raw_error_desc *e, int row, int chan)
+{
+ struct mem_ctl_info *mci = error_desc_to_mci(e);
+ enum hw_event_mc_err_type type = e->type;
+ u16 count = e->error_count;
+
+ if (row < 0)
+ return;
+
+ edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
+
+ if (type == HW_EVENT_ERR_CORRECTED) {
+ mci->csrows[row]->ce_count += count;
+ if (chan >= 0)
+ mci->csrows[row]->channels[chan]->ce_count += count;
+ } else {
+ mci->csrows[row]->ue_count += count;
+ }
+}
+
void edac_raw_mc_handle_error(struct edac_raw_error_desc *e)
{
struct mem_ctl_info *mci = error_desc_to_mci(e);
@@ -1256,22 +1276,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
chan = -2;
}
- if (!e->enable_per_layer_report) {
+ if (!e->enable_per_layer_report)
strcpy(e->label, "any memory");
- } else {
- edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
- if (p == e->label)
- strcpy(e->label, "unknown memory");
- if (type == HW_EVENT_ERR_CORRECTED) {
- if (row >= 0) {
- mci->csrows[row]->ce_count += error_count;
- if (chan >= 0)
- mci->csrows[row]->channels[chan]->ce_count += error_count;
- }
- } else
- if (row >= 0)
- mci->csrows[row]->ue_count += error_count;
- }
+ else if (!*e->label)
+ strcpy(e->label, "unknown memory");
+
+ edac_inc_csrow(e, row, chan);
/* Fill the RAM location data */
p = e->location;
--
2.20.1
On Thu, Jan 23, 2020 at 09:02:47AM +0000, Robert Richter wrote:
> This patch set is part two of a rework of the ghes_edac and edac_mc
> driver. It addresses issues found during code review and while working
> with the code. Part one has been included to v5.5, see:
>
> https://lore.kernel.org/patchwork/cover/1093488/
>
> The changes of this series include:
>
> * add helper functions and factor out code (#1, #2, #5)
>
> * improve function interfaces and data structures to decrease
> complexity such as number of function arguments, unused data, etc.
> (#3, #4, #7, #8, #9, #10),
>
> * minor functional fixes (#6)
>
> * improve code readability (#9)
>
> V3:
> * rebased onto edac-for-next + "EDAC/mc: Fix use-after-free and
> memleaks during device removal", no code changes:
> 7e5d6cf35329 ("EDAC/amd64: Do not warn when removing instances")
> https://lore.kernel.org/patchwork/patch/1169444/
> * added Aristeu's ACKs
>
> V2:
> * fixed documentation issue in #3 found by "kbuild test robot
> <[email protected]>"
>
> Changes compared to part one:
> * rebased onto 5781823fd0d3 ("EDAC/altera: Use the Altera System
> Manager driver")
> * reworded patch subjects
> * reordered patches
> * collected Mauro's Reviewed-by-tags (note: I kept them though there
> has been small conflicts but dropped it when reworked)
> * dropped: "EDAC/mc: Rework edac_raw_mc_handle_error() to use struct
> dimm_info"
> * split "EDAC/mc: Remove per layer counters" into smaller changes
> * added:
> "EDAC/mc: Report "unknown memory" on too many DIMM labels found"
> "EDAC/mc: Remove enable_per_layer_report function arguments"
> "EDAC/mc: Pass the error descriptor to error reporting functions"
> "EDAC/mc: Remove detail[] string and cleanup error string
> generation"
> * moved to the end:
> "EDAC/mc: Remove per layer counters"
>
>
> Robert Richter (10):
> EDAC/mc: Split edac_mc_alloc() into smaller functions
> EDAC/mc: Reorder functions edac_mc_alloc*()
> EDAC: Store error type in struct edac_raw_error_desc
> EDAC/mc: Determine mci pointer from the error descriptor
> EDAC/mc: Create new function edac_inc_csrow()
> EDAC/mc: Report "unknown memory" on too many DIMM labels found
> EDAC/mc: Remove enable_per_layer_report function arguments
> EDAC/mc: Pass the error descriptor to error reporting functions
> EDAC/mc: Remove detail[] string and cleanup error string generation
> EDAC/mc: Remove per layer counters
>
> drivers/edac/edac_mc.c | 502 ++++++++++++++++-------------------
> drivers/edac/edac_mc.h | 6 +-
> drivers/edac/edac_mc_sysfs.c | 20 +-
> drivers/edac/ghes_edac.c | 16 +-
> include/linux/edac.h | 9 +-
> 5 files changed, 252 insertions(+), 301 deletions(-)
Ok, all queued and will appear in linux-next soon. Let's see what falls
out.
Thx for the nice cleanup!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette