2021-12-07 03:19:19

by Shuai Xue

[permalink] [raw]
Subject: [PATCH 1/2] ghes_edac: refactor memory error location processing

The memory error location processing in ghes_edac_report_mem_error() have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c.

To avoid the duplicated code, this patch introduces the above cper_*() into
ghes_edac_report_mem_error().

The EDAC error log is now properly reporting the error as follows (all
Validation Bits are enabled):

[ 375.938411] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x898b86 offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
[ 375.938416] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 375.938417] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 375.938418] {2}[Hardware Error]: event severity: corrected
[ 375.938419] {2}[Hardware Error]: Error 0, type: corrected
[ 375.938420] {2}[Hardware Error]: section_type: memory error
[ 375.938421] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 375.938422] {2}[Hardware Error]: physical_address: 0x0000000898b86020
[ 375.938422] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 375.938426] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
[ 375.938426] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 375.938428] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/ghes_edac.c | 108 ++++--------------------------------
drivers/firmware/efi/cper.c | 4 +-
include/linux/cper.h | 2 +
3 files changed, 14 insertions(+), 100 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..7b25c3207c11 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -239,6 +239,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
+ struct cper_mem_err_compact cmem;
struct ghes_pvt *pvt;
unsigned long flags;
char *p;
@@ -292,60 +293,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,42 +313,13 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
+ p += cper_dimm_err_location(&cmem, p);

dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
@@ -405,9 +327,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
+
if (p > e->location)
*(p - 1) = '\0';

@@ -479,15 +399,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
break;
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..08eabb2e23f8 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -265,7 +265,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
return n;
}

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..918e7efffb60 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -568,5 +568,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb



2021-12-07 03:19:20

by Shuai Xue

[permalink] [raw]
Subject: [PATCH 2/2] ghes_edac: refactor error status fields decoding

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method in which the
error status fields decoding could be refactored for reuse. On the other
hand, the cper_print_mem() only reports the error status and misses its
description.

This patch introduces a new helper function cper_mem_err_status_str() which
is used to wrap up the decoding logics, and the cper_print_mem() will call
it and report the details of error status description.

The cper error log is now properly reporting the error as follows (all
Validation Bits are enabled):

[37863.026267] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 1282 bank_group: 5 bank_address: 2 device: 0 row: 11387 column: 1544 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x963d9b offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 1282 bank_group: 5 bank_address: 2 device: 0 row: 11387 column: 1544 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
[37863.026272] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[37863.026273] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[37863.026275] {2}[Hardware Error]: event severity: corrected
[37863.026276] {2}[Hardware Error]: Error 0, type: corrected
[37863.026278] {2}[Hardware Error]: section_type: memory error
[37863.026279] {2}[Hardware Error]: error_status: 0x0000000000000000, reserved
[37863.026279] {2}[Hardware Error]: physical_address: 0x0000000963d9b020
[37863.026280] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[37863.026282] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1282 bank_group: 5 bank_address: 2 device: 0 row: 11387 column: 1544 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
[37863.026283] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[37863.026284] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/ghes_edac.c | 89 +++++++++++--------------------------
drivers/firmware/efi/cper.c | 49 ++++++++++++++++++--
include/linux/cper.h | 5 ++-
3 files changed, 75 insertions(+), 68 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 7b25c3207c11..b74ffd82f56b 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -235,6 +235,31 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int ghes_edac_mem_err_other_detail(const struct cper_sec_mem_err *mem,
+ char *msg, const char *location)
+{
+ u32 len, n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len = 2 * CPER_REC_LEN - 1;
+
+ n += snprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+ u64 status = mem->error_status;
+
+ n += snprintf(msg + n, len - n, "status(0x%016llx): ",
+ (long long)status);
+ n += snprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(status));
+ }
+
+ msg[n] = '\0';
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
@@ -336,69 +361,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
+ p += ghes_edac_mem_err_other_detail(mem_err, p, e->location);

if (p > pvt->other_detail)
*(p - 1) = '\0';
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 08eabb2e23f8..6bf3e293c006 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

-int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(const struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -265,7 +265,7 @@ int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
return n;
}

-int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(const struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
@@ -285,6 +285,48 @@ int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
return n;
}

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1:
+ return "Error detected internal to the component";
+ case 16:
+ return "Error detected in the bus";
+ case 4:
+ return "Storage error in DRAM memory";
+ case 5:
+ return "Storage error in TLB";
+ case 6:
+ return "Storage error in cache";
+ case 7:
+ return "Error in one or more functional units";
+ case 8:
+ return "component failed self test";
+ case 9:
+ return "Overflow or undervalue of internal queue";
+ case 17:
+ return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18:
+ return "Improper access error";
+ case 19:
+ return "Access to a memory address which is not mapped to any component";
+ case 20:
+ return "Loss of Lockstep";
+ case 21:
+ return "Response not associated with a request";
+ case 22:
+ return "Bus parity error - must also set the A, C, or D Bits";
+ case 23:
+ return "Detection of a PATH_ERROR ";
+ case 25:
+ return "Bus operation timeout";
+ case 26:
+ return "A read was issued to data that has been poisoned";
+ default:
+ return "reserved";
+ }
+}
+
void cper_mem_err_pack(const struct cper_sec_mem_err *mem,
struct cper_mem_err_compact *cmem)
{
@@ -334,7 +376,8 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s""error_status: 0x%016llx, %s\n", pfx, mem->error_status,
+ cper_mem_err_status_str(mem->error_status));
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 918e7efffb60..a45fb7ceacf8 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -568,7 +568,8 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
-int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
-int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_mem_err_location(const struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(const struct cper_mem_err_compact *mem, char *msg);
+const char *cper_mem_err_status_str(u64 status);

#endif
--
2.20.1.12.g72788fdb


2021-12-07 11:31:08

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH 1/2] ghes_edac: refactor memory error location processing

On 07.12.21 11:19:04, Shuai Xue wrote:
> The memory error location processing in ghes_edac_report_mem_error() have
> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
> cper_mem_err_type_str() in drivers/firmware/efi/cper.c.
>
> To avoid the duplicated code, this patch introduces the above cper_*() into
> ghes_edac_report_mem_error().

It is not really duplicate yet, changes are slightly different which
could trigger problems in some parsers. At least those differences
should be listed in the patch description. I would rather remove the
'space' delimiter after the colon and take the ghes version of it as
logs become harder to read. So ideally there is a unification patch
before the "duplication" is removed with changes in both files as
necessary for review and to document the change.

>
> The EDAC error log is now properly reporting the error as follows (all
> Validation Bits are enabled):
>
> [ 375.938411] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x898b86 offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
> [ 375.938416] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [ 375.938417] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
> [ 375.938418] {2}[Hardware Error]: event severity: corrected
> [ 375.938419] {2}[Hardware Error]: Error 0, type: corrected
> [ 375.938420] {2}[Hardware Error]: section_type: memory error
> [ 375.938421] {2}[Hardware Error]: error_status: 0x0000000000000000
> [ 375.938422] {2}[Hardware Error]: physical_address: 0x0000000898b86020
> [ 375.938422] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
> [ 375.938426] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
> [ 375.938426] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
> [ 375.938428] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
>
> Signed-off-by: Shuai Xue <[email protected]>


> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 6ec8edec6329..08eabb2e23f8 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype)
> }
> EXPORT_SYMBOL_GPL(cper_mem_err_type_str);
>
> -static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> {
> u32 len, n;
>
> @@ -265,7 +265,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> return n;
> }
>
> -static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
> +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
> {
> u32 len, n;
> const char *bank = NULL, *device = NULL;

Even though the ghes driver cannot be built as module,
EXPORT_SYMBOL_GPL()s should be added for both.

It would be good to add a note to the description that the
UEFI_CPER/EDAC_GHES dependency is always solved through
ACPI_APEI_GHES/ACPI_APEI. But we should make the UEFI_CPER dependency
explicit for EDAC_GHES in Kconfig anyway.

-Robert

2021-12-07 11:47:29

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH 2/2] ghes_edac: refactor error status fields decoding

On 07.12.21 11:19:05, Shuai Xue wrote:

> @@ -285,6 +285,48 @@ int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
> return n;
> }
>
> +const char *cper_mem_err_status_str(u64 status)

[...]

Same here, add an EXPORT_SYMBOL_GPL for the function.

> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -568,7 +568,8 @@ void cper_print_proc_arm(const char *pfx,
> const struct cper_sec_proc_arm *proc);
> void cper_print_proc_ia(const char *pfx,
> const struct cper_sec_proc_ia *proc);
> -int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
> -int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);
> +int cper_mem_err_location(const struct cper_mem_err_compact *mem, char *msg);
> +int cper_dimm_err_location(const struct cper_mem_err_compact *mem, char *msg);

Do we really need that 'const' here?

> +const char *cper_mem_err_status_str(u64 status);

The function i/f is different compared to the others, though the
purpose is the same. Let's use same style:

int cper_mem_err_status(const struct cper_mem_err_compact *mem, char *msg);

-Robert

2021-12-07 12:59:50

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH 1/2] ghes_edac: refactor memory error location processing

Hi, Robert,

Thank you for your quick comments! I will change these in the next version.

On 2021/12/7 PM7:30, Robert Richter wrote:
> On 07.12.21 11:19:04, Shuai Xue wrote:
>> The memory error location processing in ghes_edac_report_mem_error() have
>> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
>> cper_mem_err_type_str() in drivers/firmware/efi/cper.c.
>>
>> To avoid the duplicated code, this patch introduces the above cper_*() into
>> ghes_edac_report_mem_error().
>
> It is not really duplicate yet, changes are slightly different which
> could trigger problems in some parsers. At least those differences
> should be listed in the patch description.
I see your concerns. I will document the changes in patch description.

> I would rather remove the
> 'space' delimiter after the colon and take the ghes version of it as
> logs become harder to read. So ideally there is a unification patch
> before the "duplication" is removed with changes in both files as
> necessary for review and to document the change.
I will add a new patch to unify ghes and cper before removing duplication.


>> The EDAC error log is now properly reporting the error as follows (all
>> Validation Bits are enabled):
>>
>> [ 375.938411] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x898b86 offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
>> [ 375.938416] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
>> [ 375.938417] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
>> [ 375.938418] {2}[Hardware Error]: event severity: corrected
>> [ 375.938419] {2}[Hardware Error]: Error 0, type: corrected
>> [ 375.938420] {2}[Hardware Error]: section_type: memory error
>> [ 375.938421] {2}[Hardware Error]: error_status: 0x0000000000000000
>> [ 375.938422] {2}[Hardware Error]: physical_address: 0x0000000898b86020
>> [ 375.938422] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
>> [ 375.938426] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
>> [ 375.938426] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
>> [ 375.938428] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>
>
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index 6ec8edec6329..08eabb2e23f8 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype)
>> }
>> EXPORT_SYMBOL_GPL(cper_mem_err_type_str);
>>
>> -static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
>> +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
>> {
>> u32 len, n;
>>
>> @@ -265,7 +265,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
>> return n;
>> }
>>
>> -static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
>> +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
>> {
>> u32 len, n;
>> const char *bank = NULL, *device = NULL;
>
> Even though the ghes driver cannot be built as module,
> EXPORT_SYMBOL_GPL()s should be added for both.
I will add EXPORT_SYMBOL_GPL()s.

> It would be good to add a note to the description that the
> UEFI_CPER/EDAC_GHES dependency is always solved through
> ACPI_APEI_GHES/ACPI_APEI. But we should make the UEFI_CPER dependency
> explicit for EDAC_GHES in Kconfig anyway.
Will document the dependency and add UEFI_CPER dependency explicit for
EDAC_GHES in Kconfig in next version.

Cheers,
Shuai

2021-12-07 13:20:33

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH 2/2] ghes_edac: refactor error status fields decoding

Hi, Robert,

Thank you for your quick comments!

On 2021/12/7 PM7:47, Robert Richter wrote:
> On 07.12.21 11:19:05, Shuai Xue wrote:
>
>> @@ -285,6 +285,48 @@ int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
>> return n;
>> }
>>
>> +const char *cper_mem_err_status_str(u64 status)
>
> [...]
>
> Same here, add an EXPORT_SYMBOL_GPL for the function.
Will add it in next version.


>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -568,7 +568,8 @@ void cper_print_proc_arm(const char *pfx,
>> const struct cper_sec_proc_arm *proc);
>> void cper_print_proc_ia(const char *pfx,
>> const struct cper_sec_proc_ia *proc);
>> -int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
>> -int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);
>> +int cper_mem_err_location(const struct cper_mem_err_compact *mem, char *msg);
>> +int cper_dimm_err_location(const struct cper_mem_err_compact *mem, char *msg);
>
> Do we really need that 'const' here?
I think we do. It is read only and should not be modified in these functions,
just as cper_print_proc_arm' style.


>> +const char *cper_mem_err_status_str(u64 status);
>
> The function i/f is different compared to the others, though the
> purpose is the same. Let's use same style:
>
> int cper_mem_err_status(const struct cper_mem_err_compact *mem, char *msg);
Sorry, I don't catch it. cper_mem_err_status_str() decodes the error status and return
a string, the same style as cper_severity_str and cper_mem_err_type_str do. May
we need to move the declaration ahead with cper_severity_str?

Best Regards,
Shuai


2021-12-08 10:50:21

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH 2/2] ghes_edac: refactor error status fields decoding

On 07.12.21 21:20:25, Shuai Xue wrote:

> >> --- a/include/linux/cper.h
> >> +++ b/include/linux/cper.h
> >> @@ -568,7 +568,8 @@ void cper_print_proc_arm(const char *pfx,
> >> const struct cper_sec_proc_arm *proc);
> >> void cper_print_proc_ia(const char *pfx,
> >> const struct cper_sec_proc_ia *proc);
> >> -int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
> >> -int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);
> >> +int cper_mem_err_location(const struct cper_mem_err_compact *mem, char *msg);
> >> +int cper_dimm_err_location(const struct cper_mem_err_compact *mem, char *msg);
> >
> > Do we really need that 'const' here?
> I think we do. It is read only and should not be modified in these functions,
> just as cper_print_proc_arm' style.

Even if it is used read-only I don't see a real need for const here.
So let's change this only if there is a reason such as avoiding
unnecessary casts.

> >> +const char *cper_mem_err_status_str(u64 status);
> >
> > The function i/f is different compared to the others, though the
> > purpose is the same. Let's use same style:
> >
> > int cper_mem_err_status(const struct cper_mem_err_compact *mem, char *msg);
> Sorry, I don't catch it. cper_mem_err_status_str() decodes the error status and return
> a string, the same style as cper_severity_str and cper_mem_err_type_str do. May
> we need to move the declaration ahead with cper_severity_str?

Right, move it after cper_mem_err_type_str(). Looks good then.

Thanks,

-Robert

2021-12-08 11:27:51

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH 2/2] ghes_edac: refactor error status fields decoding

Dear, Robert,

Thank you for your reply.

On 2021/12/8 PM6:50, Robert Richter wrote:
> On 07.12.21 21:20:25, Shuai Xue wrote:
>
>>>> --- a/include/linux/cper.h
>>>> +++ b/include/linux/cper.h
>>>> @@ -568,7 +568,8 @@ void cper_print_proc_arm(const char *pfx,
>>>> const struct cper_sec_proc_arm *proc);
>>>> void cper_print_proc_ia(const char *pfx,
>>>> const struct cper_sec_proc_ia *proc);
>>>> -int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
>>>> -int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);
>>>> +int cper_mem_err_location(const struct cper_mem_err_compact *mem, char *msg);
>>>> +int cper_dimm_err_location(const struct cper_mem_err_compact *mem, char *msg);
>>>
>>> Do we really need that 'const' here?
>> I think we do. It is read only and should not be modified in these functions,
>> just as cper_print_proc_arm' style.
>
> Even if it is used read-only I don't see a real need for const here.
> So let's change this only if there is a reason such as avoiding
> unnecessary casts.
I will change it back to the original.


>>>> +const char *cper_mem_err_status_str(u64 status);
>>>
>>> The function i/f is different compared to the others, though the
>>> purpose is the same. Let's use same style:
>>>
>>> int cper_mem_err_status(const struct cper_mem_err_compact *mem, char *msg);
>> Sorry, I don't catch it. cper_mem_err_status_str() decodes the error status and return
>> a string, the same style as cper_severity_str and cper_mem_err_type_str do. May
>> we need to move the declaration ahead with cper_severity_str?
>
> Right, move it after cper_mem_err_type_str(). Looks good then.
OK, will change it in next send.

Thanks.

Shuai