2021-12-10 13:40:27

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 0/3] ghes_edac: refactor memory error reporting to avoid code duplication

ghes_edac_report_mem_error() in ghes_edac.c is Long Method and have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
status and misses its description.

This patch set is to refactor ghes_edac_report_mem_error with:

- Patch 01 is to unify memory error report format with cper;
- Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
this can avoid bunch of duplicate code lines;
- Patch 02 is to wrap up error status decoding logics and reuse it in
cper_print_mem().

Changes since v1:
https://lore.kernel.org/all/[email protected]/

- add a new patch to unify ghes and cper before removing duplication.
- document the changes in patch description
- add EXPORT_SYMBOL_GPL()s for cper_*()
- document and the dependency and add UEFI_CPER dependency explicitly
Thanks Robert Richter for review comments.

Shuai Xue (3):
ghes_edac: unify memory error report format with cper
ghes_edac: refactor memory error location processing
ghes_edac: refactor error status fields decoding

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 86 ++++++++++++----
include/linux/cper.h | 3 +
4 files changed, 105 insertions(+), 181 deletions(-)

--
2.20.1.12.g72788fdb



2021-12-10 13:40:30

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 1/3] ghes_edac: unify memory error report format with cper

The changes are to:

- add device info into ghes_edac
- change bit_pos to bit_position, col to column, requestorID to
requestor_id, etc in ghes_edac
- move requestor_id, responder_id, target_id and chip_id into memory error
location in ghes_edac
- add "DIMM location: not present." for DIMM location in ghes_edac
- remove the 'space' delimiter after the colon in ghes_edac and cper

The original EDAC and cper error log are as follows (all Validation Bits
are enabled):

[31940.060454] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x93724c offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[31940.060459] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[31940.060460] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[31940.060462] {3}[Hardware Error]: event severity: corrected
[31940.060463] {3}[Hardware Error]: Error 0, type: corrected
[31940.060464] {3}[Hardware Error]: section_type: memory error
[31940.060465] {3}[Hardware Error]: error_status: 0x0000000000000000
[31940.060466] {3}[Hardware Error]: physical_address: 0x000000093724c020
[31940.060466] {3}[Hardware Error]: physical_address_mask: 0x0000000000000000
[31940.060469] {3}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 257 bank_group: 1 bank_address: 1 device: 0 row: 75492 column: 8 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
[31940.060470] {3}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[31940.060471] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000

Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):

[ 117.973657] EDAC MC0: 1 CE Single-symbol ChipKill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM DMI handle:0x0000 page:0x8d2ef4 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM DMI handle:0x0000 status(0x0000000000000000):reserved)
[ 117.973663] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 117.973664] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 117.973665] {2}[Hardware Error]: event severity: corrected
[ 117.973666] {2}[Hardware Error]: Error 0, type: corrected
[ 117.973667] {2}[Hardware Error]: section_type: memory error
[ 117.973668] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 117.973669] {2}[Hardware Error]: physical_address: 0x00000008d2ef4020
[ 117.973670] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 117.973672] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[ 117.973673] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 117.973674] {2}[Hardware Error]: DIMM location: not present. DMI handle:0x0000

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/ghes_edac.c | 35 +++++++++++++++++++----------------
drivers/firmware/efi/cper.c | 34 +++++++++++++++++-----------------
2 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..526a28cbb19b 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -378,6 +378,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
p += sprintf(p, "bank_address:%d ",
mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
+ if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
+ p += sprintf(p, "device:%d ", mem_err->device);
if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem_err->row;

@@ -385,9 +387,21 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
p += sprintf(p, "row:%d ", row);
}
if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
+ p += sprintf(p, "column:%d ", mem_err->column);
if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);
+ if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
+ p += sprintf(p, "requestor_id:0x%016llx ",
+ (long long)mem_err->requestor_id);
+ if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
+ p += sprintf(p, "responder_id:0x%016llx ",
+ (long long)mem_err->responder_id);
+ if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
+ p += sprintf(p, "target_id:0x%016llx ",
+ (long long)mem_err->responder_id);
+ if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
+ p += sprintf(p, "chip_id:%d ",
+ mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;
@@ -396,7 +410,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (bank != NULL && device != NULL)
p += sprintf(p, "DIMM location:%s %s ", bank, device);
else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
+ p += sprintf(p, "DIMM location:not present. DIMM DMI handle:0x%.4x ",
mem_err->mem_dev_handle);

dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
@@ -405,9 +419,6 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -421,7 +432,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
u64 status = mem_err->error_status;

- p += sprintf(p, "status(0x%016llx): ", (long long)status);
+ p += sprintf(p, "status(0x%016llx):", (long long)status);
switch ((status >> 8) & 0xff) {
case 1:
p += sprintf(p, "Error detected internal to the component ");
@@ -479,15 +490,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
break;
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..77b39b058924 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -221,45 +221,45 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;

row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);

return n;
@@ -276,10 +276,10 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
len = CPER_REC_LEN;
dmi_memdev_name(mem->mem_dev_handle, &bank, &device);
if (bank && device)
- n = snprintf(msg, len, "DIMM location: %s %s ", bank, device);
+ n = snprintf(msg, len, "DIMM location:%s %s ", bank, device);
else
n = snprintf(msg, len,
- "DIMM location: not present. DMI handle: 0x%.4x ",
+ "DIMM location: not present. DMI handle:0x%.4x ",
mem->mem_dev_handle);

return n;
--
2.20.1.12.g72788fdb


2021-12-10 13:40:38

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 2/3] ghes_edac: refactor memory error location processing

The memory error location processing in ghes_edac_report_mem_error() have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c.

To avoid the duplicated code, this patch introduces the above cper_*() into
ghes_edac_report_mem_error(). Although the UEFI_CPER/EDAC_GHES dependency
is always solved through ACPI_APEI_GHES/ACPI_APEI, add the UEFI_CPER
dependency explicitly for EDAC_GHES in Kconfig.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 108 +++---------------------------------
drivers/firmware/efi/cper.c | 6 +-
include/linux/cper.h | 2 +
4 files changed, 15 insertions(+), 102 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 2fc4c3f91fd5..7f1a2e019ede 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 526a28cbb19b..103ad5b3a018 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -239,6 +239,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
+ struct cper_mem_err_compact cmem;
struct ghes_pvt *pvt;
unsigned long flags;
char *p;
@@ -292,60 +293,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,56 +313,13 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
- p += sprintf(p, "device:%d ", mem_err->device);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "column:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestor_id:0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responder_id:0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "target_id:0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chip_id:%d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM location:not present. DIMM DMI handle:0x%.4x ",
- mem_err->mem_dev_handle);
+ p += cper_dimm_err_location(&cmem, p);

dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 77b39b058924..7553fecf2819 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -264,8 +264,9 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)

return n;
}
+EXPORT_SYMBOL_GPL(cper_mem_err_location);

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
@@ -284,6 +285,7 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)

return n;
}
+EXPORT_SYMBOL_GPL(cper_dimm_err_location);

void cper_mem_err_pack(const struct cper_sec_mem_err *mem,
struct cper_mem_err_compact *cmem)
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..918e7efffb60 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -568,5 +568,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb


2021-12-10 13:40:40

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 3/3] ghes_edac: refactor error status fields decoding

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method in which the
error status fields decoding could be refactored for reuse. On the other
hand, the cper_print_mem() only reports the error status and misses its
description.

This patch introduces a new helper function cper_mem_err_status_str() which
is used to wrap up the decoding logics, and the cper_print_mem() will call
it and report the details of error status description.

The cper error log is now properly reporting the error as follows (all
Validation Bits are enabled):

[37863.026267] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 1282 bank_group: 5 bank_address: 2 device: 0 row: 11387 column: 1544 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x963d9b offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 1282 bank_group: 5 bank_address: 2 device: 0 row: 11387 column: 1544 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
[37863.026272] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[37863.026273] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[37863.026275] {2}[Hardware Error]: event severity: corrected
[37863.026276] {2}[Hardware Error]: Error 0, type: corrected
[37863.026278] {2}[Hardware Error]: section_type: memory error
[37863.026279] {2}[Hardware Error]: error_status: 0x0000000000000000, reserved
[37863.026279] {2}[Hardware Error]: physical_address: 0x0000000963d9b020
[37863.026280] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[37863.026282] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1282 bank_group: 5 bank_address: 2 device: 0 row: 11387 column: 1544 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
[37863.026283] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[37863.026284] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/ghes_edac.c | 89 +++++++++++--------------------------
drivers/firmware/efi/cper.c | 46 ++++++++++++++++++-
include/linux/cper.h | 1 +
3 files changed, 72 insertions(+), 64 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 103ad5b3a018..986276557c93 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -235,6 +235,31 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int ghes_edac_mem_err_other_detail(const struct cper_sec_mem_err *mem,
+ char *msg, const char *location)
+{
+ u32 len, n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len = 2 * CPER_REC_LEN - 1;
+
+ n += snprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+ u64 status = mem->error_status;
+
+ n += snprintf(msg + n, len - n, "status(0x%016llx): ",
+ (long long)status);
+ n += snprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(status));
+ }
+
+ msg[n] = '\0';
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
@@ -335,69 +360,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx):", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
+ p += ghes_edac_mem_err_other_detail(mem_err, p, e->location);

if (p > pvt->other_detail)
*(p - 1) = '\0';
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 7553fecf2819..f604cb38da7e 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,6 +211,49 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1:
+ return "Error detected internal to the component";
+ case 16:
+ return "Error detected in the bus";
+ case 4:
+ return "Storage error in DRAM memory";
+ case 5:
+ return "Storage error in TLB";
+ case 6:
+ return "Storage error in cache";
+ case 7:
+ return "Error in one or more functional units";
+ case 8:
+ return "component failed self test";
+ case 9:
+ return "Overflow or undervalue of internal queue";
+ case 17:
+ return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18:
+ return "Improper access error";
+ case 19:
+ return "Access to a memory address which is not mapped to any component";
+ case 20:
+ return "Loss of Lockstep";
+ case 21:
+ return "Response not associated with a request";
+ case 22:
+ return "Bus parity error - must also set the A, C, or D Bits";
+ case 23:
+ return "Detection of a PATH_ERROR ";
+ case 25:
+ return "Bus operation timeout";
+ case 26:
+ return "A read was issued to data that has been poisoned";
+ default:
+ return "reserved";
+ }
+}
+EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
+
int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
@@ -336,7 +379,8 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s""error_status: 0x%016llx, %s\n", pfx, mem->error_status,
+ cper_mem_err_status_str(mem->error_status));
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 918e7efffb60..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -558,6 +558,7 @@ extern const char *const cper_proc_error_type_strs[4];
u64 cper_next_record_id(void);
const char *cper_severity_str(unsigned int);
const char *cper_mem_err_type_str(unsigned int);
+const char *cper_mem_err_status_str(u64 status);
void cper_print_bits(const char *prefix, unsigned int bits,
const char * const strs[], unsigned int strs_size);
void cper_mem_err_pack(const struct cper_sec_mem_err *,
--
2.20.1.12.g72788fdb


2021-12-16 12:54:23

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] ghes_edac: refactor memory error reporting to avoid code duplication

Hi folks,

On 2021/12/10 PM9:40, Shuai Xue wrote:
> ghes_edac_report_mem_error() in ghes_edac.c is Long Method and have
> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
> cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
> cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
> status and misses its description.
>
> This patch set is to refactor ghes_edac_report_mem_error with:
>
> - Patch 01 is to unify memory error report format with cper;
> - Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
> this can avoid bunch of duplicate code lines;
> - Patch 02 is to wrap up error status decoding logics and reuse it in
> cper_print_mem().
>
> Changes since v1:
> https://lore.kernel.org/all/[email protected]/
>
> - add a new patch to unify ghes and cper before removing duplication.
> - document the changes in patch description
> - add EXPORT_SYMBOL_GPL()s for cper_*()
> - document and the dependency and add UEFI_CPER dependency explicitly
> Thanks Robert Richter for review comments.
>
> Shuai Xue (3):
> ghes_edac: unify memory error report format with cper
> ghes_edac: refactor memory error location processing
> ghes_edac: refactor error status fields decoding
>
> drivers/edac/Kconfig | 1 +
> drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
> drivers/firmware/efi/cper.c | 86 ++++++++++++----
> include/linux/cper.h | 3 +
> 4 files changed, 105 insertions(+), 181 deletions(-)
>

I am wondering if you have any comments on this series of patches?

Best Regards,
Shuai

2021-12-28 17:13:00

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with cper

On Fri, Dec 10, 2021 at 09:40:17PM +0800, Shuai Xue wrote:
> Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with

"EDAC/ghes: Unify..."

is the format we use in the EDAC tree.

> The changes are to:
>
> - add device info into ghes_edac
> - change bit_pos to bit_position, col to column, requestorID to
> requestor_id, etc in ghes_edac
> - move requestor_id, responder_id, target_id and chip_id into memory error
> location in ghes_edac
> - add "DIMM location: not present." for DIMM location in ghes_edac
> - remove the 'space' delimiter after the colon in ghes_edac and cper

This commit message is useless: it should not talk about what your patch
does - that should hopefully be visible in the diff itself. Rather, it
should talk about *why* you're doing what you're doing.

Also, your patch does a bunch of things at once.

From: Documentation/process/submitting-patches.rst

"Solve only one problem per patch. If your description starts to get
long, that's a sign that you probably need to split up your patch.
See :ref:`split_changes`."

You should have a look at that file.

Also, avoid having "This patch" or "This commit" in the commit message.
It is tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> The original EDAC and cper error log are as follows (all Validation Bits
> are enabled):
>
> [31940.060454] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x93724c offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
> [31940.060459] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [31940.060460] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
> [31940.060462] {3}[Hardware Error]: event severity: corrected
> [31940.060463] {3}[Hardware Error]: Error 0, type: corrected
> [31940.060464] {3}[Hardware Error]: section_type: memory error
> [31940.060465] {3}[Hardware Error]: error_status: 0x0000000000000000
> [31940.060466] {3}[Hardware Error]: physical_address: 0x000000093724c020
> [31940.060466] {3}[Hardware Error]: physical_address_mask: 0x0000000000000000
> [31940.060469] {3}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 257 bank_group: 1 bank_address: 1 device: 0 row: 75492 column: 8 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
> [31940.060470] {3}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
> [31940.060471] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
>
> Now, the EDAC and cper error log are properly reporting the error as
> follows (all Validation Bits are enabled):
>
> [ 117.973657] EDAC MC0: 1 CE Single-symbol ChipKill ECC on 0x0000

What does "ECC on 0x0000" mean?

> (node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
> bank_address:2 device:0 row:6749 column:8 bit_position:0

> requestor_id:0x0000000000000000
> responder_id:0x0000000000000000
> target_id:0x0000000000000000

those look useless to me too. Probably invalid/unpopulated by BIOS...

> chip_id:0 DIMM location:not present. DIMM
> DMI handle:0x0000 page:0x8d2ef4 offset:0x20 grain:1 syndrome:0x0 -
> APEI location: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
> bank_address:2 device:0 row:6749 column:8 bit_position:0
> requestor_id:0x0000000000000000 responder_id:0x0000000000000000
> target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM
> DMI handle:0x0000 status(0x0000000000000000):reserved)

Sorry but I fail to see how this changed error record is an improvement.

> [ 117.973663] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [ 117.973664] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
> [ 117.973665] {2}[Hardware Error]: event severity: corrected
> [ 117.973666] {2}[Hardware Error]: Error 0, type: corrected
> [ 117.973667] {2}[Hardware Error]: section_type: memory error
> [ 117.973668] {2}[Hardware Error]: error_status: 0x0000000000000000
> [ 117.973669] {2}[Hardware Error]: physical_address: 0x00000008d2ef4020
> [ 117.973670] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
> [ 117.973672] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
> [ 117.973673] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
> [ 117.973674] {2}[Hardware Error]: DIMM location: not present. DMI handle:0x0000
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/edac/ghes_edac.c | 35 +++++++++++++++++++----------------
> drivers/firmware/efi/cper.c | 34 +++++++++++++++++-----------------
> 2 files changed, 36 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 6d1ddecbf0da..526a28cbb19b 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -378,6 +378,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
> p += sprintf(p, "bank_address:%d ",
> mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
> + if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
> + p += sprintf(p, "device:%d ", mem_err->device);
> if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
> u32 row = mem_err->row;
>
> @@ -385,9 +387,21 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> p += sprintf(p, "row:%d ", row);
> }
> if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
> - p += sprintf(p, "col:%d ", mem_err->column);
> + p += sprintf(p, "column:%d ", mem_err->column);
> if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
> - p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
> + p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);

What for?

> + if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
> + p += sprintf(p, "requestor_id:0x%016llx ",
> + (long long)mem_err->requestor_id);
> + if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
> + p += sprintf(p, "responder_id:0x%016llx ",
> + (long long)mem_err->responder_id);
^^^^^^^^^^^^^^^^^^^^^^

> + if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
> + p += sprintf(p, "target_id:0x%016llx ",
> + (long long)mem_err->responder_id);
^^^^^^^^^^^^^^^^^^^^^^

mem_err->responder_id for both responder and target?!

> + if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
> + p += sprintf(p, "chip_id:%d ",
> + mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);

I don't know if some dumb BIOS simply sets those valid bits regardless
of whether those fields are populated or not... It looks like it...

Right, so first this needs to explain *why* you're doing what you're
doing. And then with each reason why, you should do a patch, one by one,
explaining the rationale.

/me goes and looks at the next patches.

Aha, and you add all that crap here just to remove it in patch 2. But
this is all useless churn.

If you want to use cper_mem_err_location(), why don't you simply use it
directly?

And so on and so on...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-12-29 03:22:19

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with cper

Hi, Borislav,

Thank you very much for your valuable comments.


在 2021/12/29 AM1:12, Borislav Petkov 写道:
> On Fri, Dec 10, 2021 at 09:40:17PM +0800, Shuai Xue wrote:
>> Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with
>
> "EDAC/ghes: Unify..."
>
> is the format we use in the EDAC tree.
Got it, thank you. Will fix in next version.


>> The changes are to:
>>
>> - add device info into ghes_edac
>> - change bit_pos to bit_position, col to column, requestorID to
>> requestor_id, etc in ghes_edac
>> - move requestor_id, responder_id, target_id and chip_id into memory error
>> location in ghes_edac
>> - add "DIMM location: not present." for DIMM location in ghes_edac
>> - remove the 'space' delimiter after the colon in ghes_edac and cper
>
> This commit message is useless: it should not talk about what your patch
> does - that should hopefully be visible in the diff itself. Rather, it
> should talk about *why* you're doing what you're doing.

The comments from Robert in v1[1] ask me to explicitly list those differences,

It is not really duplicate yet, changes are slightly different which
could trigger problems in some parsers. At least those differences
should be listed in the patch description.

Should I keep it here? Or delete it and just add why.


> Also, your patch does a bunch of things at once.
>
> From: Documentation/process/submitting-patches.rst
>
> "Solve only one problem per patch. If your description starts to get
> long, that's a sign that you probably need to split up your patch.
> See :ref:`split_changes`."
>
> You should have a look at that file.
>
> Also, avoid having "This patch" or "This commit" in the commit message.
> It is tautologically useless.
>
> Also, do
>
> $ git grep 'This patch' Documentation/process
>
> for more details.

Got it, thank you very much. But as you mentioned bellow, the purpose of this
patch is to unify memory error report format with cper so that we can use use
cper_mem_err_location() directly. Actually, this patch series is originally
suggested by Tony[2] to share the same code with cper. Should I split this
patch by function?


>> The original EDAC and cper error log are as follows (all Validation Bits
>> are enabled):
>>
>> [31940.060454] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x93724c offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
>> [31940.060459] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
>> [31940.060460] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
>> [31940.060462] {3}[Hardware Error]: event severity: corrected
>> [31940.060463] {3}[Hardware Error]: Error 0, type: corrected
>> [31940.060464] {3}[Hardware Error]: section_type: memory error
>> [31940.060465] {3}[Hardware Error]: error_status: 0x0000000000000000
>> [31940.060466] {3}[Hardware Error]: physical_address: 0x000000093724c020
>> [31940.060466] {3}[Hardware Error]: physical_address_mask: 0x0000000000000000
>> [31940.060469] {3}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 257 bank_group: 1 bank_address: 1 device: 0 row: 75492 column: 8 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
>> [31940.060470] {3}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
>> [31940.060471] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
>>
>> Now, the EDAC and cper error log are properly reporting the error as
>> follows (all Validation Bits are enabled):
>>
>> [ 117.973657] EDAC MC0: 1 CE Single-symbol ChipKill ECC on 0x0000
>
> What does "ECC on 0x0000" mean?
>
>> (node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
>> bank_address:2 device:0 row:6749 column:8 bit_position:0
>
>> requestor_id:0x0000000000000000
>> responder_id:0x0000000000000000
>> target_id:0x0000000000000000
>
> those look useless to me too. Probably invalid/unpopulated by BIOS...

Yep, these fields are unpopulated by BIOS, I manually enable all Validation
Bits for debug so that we see the difference more clearly. I will declare it
in next version.


>> chip_id:0 DIMM location:not present. DIMM
>> DMI handle:0x0000 page:0x8d2ef4 offset:0x20 grain:1 syndrome:0x0 -
>> APEI location: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
>> bank_address:2 device:0 row:6749 column:8 bit_position:0
>> requestor_id:0x0000000000000000 responder_id:0x0000000000000000
>> target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM
>> DMI handle:0x0000 status(0x0000000000000000):reserved)
>
> Sorry but I fail to see how this changed error record is an improvement.

Well, the purpose is not to improve but unify.


>> [ 117.973663] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
>> [ 117.973664] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
>> [ 117.973665] {2}[Hardware Error]: event severity: corrected
>> [ 117.973666] {2}[Hardware Error]: Error 0, type: corrected
>> [ 117.973667] {2}[Hardware Error]: section_type: memory error
>> [ 117.973668] {2}[Hardware Error]: error_status: 0x0000000000000000
>> [ 117.973669] {2}[Hardware Error]: physical_address: 0x00000008d2ef4020
>> [ 117.973670] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
>> [ 117.973672] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
>> [ 117.973673] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
>> [ 117.973674] {2}[Hardware Error]: DIMM location: not present. DMI handle:0x0000
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> drivers/edac/ghes_edac.c | 35 +++++++++++++++++++----------------
>> drivers/firmware/efi/cper.c | 34 +++++++++++++++++-----------------
>> 2 files changed, 36 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
>> index 6d1ddecbf0da..526a28cbb19b 100644
>> --- a/drivers/edac/ghes_edac.c
>> +++ b/drivers/edac/ghes_edac.c
>> @@ -378,6 +378,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>> if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
>> p += sprintf(p, "bank_address:%d ",
>> mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
>> + if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
>> + p += sprintf(p, "device:%d ", mem_err->device);
>> if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
>> u32 row = mem_err->row;
>>
>> @@ -385,9 +387,21 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>> p += sprintf(p, "row:%d ", row);
>> }
>> if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
>> - p += sprintf(p, "col:%d ", mem_err->column);
>> + p += sprintf(p, "column:%d ", mem_err->column);
>> if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
>> - p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
>> + p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);
>
> What for?

To unify memory error report format with cper. Should we use cper
or ghes_edac version?


>> + if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
>> + p += sprintf(p, "requestor_id:0x%016llx ",
>> + (long long)mem_err->requestor_id);
>> + if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
>> + p += sprintf(p, "responder_id:0x%016llx ",
>> + (long long)mem_err->responder_id);
> ^^^^^^^^^^^^^^^^^^^^^^
>
>> + if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
>> + p += sprintf(p, "target_id:0x%016llx ",
>> + (long long)mem_err->responder_id);
> ^^^^^^^^^^^^^^^^^^^^^^
>
> mem_err->responder_id for both responder and target?!

Sorry for this bug. Will fix in next version.

>
>> + if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
>> + p += sprintf(p, "chip_id:%d ",
>> + mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
>
> I don't know if some dumb BIOS simply sets those valid bits regardless
> of whether those fields are populated or not... It looks like it...
>
> Right, so first this needs to explain *why* you're doing what you're
> doing. And then with each reason why, you should do a patch, one by one,
> explaining the rationale.
>
> /me goes and looks at the next patches.
>
> Aha, and you add all that crap here just to remove it in patch 2. But
> this is all useless churn.
>
> If you want to use cper_mem_err_location(), why don't you simply use it
> directly?
>
> And so on and so on...

Well, Robert suggested me add a unification patch[1] so that we could review
the changes more clearly. I think it makes sense.

It is not really duplicate yet, changes are slightly different which
could trigger problems in some parsers. At least those differences
should be listed in the patch description. I would rather remove the
'space' delimiter after the colon and take the ghes version of it as
logs become harder to read. So ideally there is a unification patch
before the "duplication" is removed with changes in both files as
necessary for review and to document the change.

[1]https://lore.kernel.org/all/Ya9F75xWt%[email protected]/
[2]https://lore.kernel.org/lkml/[email protected]/T/#m8cdf77bee7d20ba094130707363359e765e8459f


Best Regards,
Shuai

2021-12-30 17:57:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with cper

On Wed, Dec 29, 2021 at 11:22:11AM +0800, Shuai Xue wrote:
> Yep, these fields are unpopulated by BIOS, I manually enable all Validation
> Bits for debug so that we see the difference more clearly. I will declare it
> in next version.

Declare what? I can't parse your statement.

> Well, the purpose is not to improve but unify.

The most importang goal with kernel code is improvement and less bugs.
Unification is second. We should not jump through hoops and unify at
every price just because there's a duplicated function somewhere.
Remember that when doing your changes.

> Well, Robert suggested me add a unification patch[1] so that we could review
> the changes more clearly. I think it makes sense.

Not really. I can imagine why Robert suggested that but this strategy is
causing unnecessary churn. What one usually does in such cases is:

1. Add changes to the target functionality - the one in cper.c - by
explaining *why* those changes are needed.

2. Switch ghes_edac.c to that functionality and remove the redundant one
there.

Simple and clean diffstat and easy review.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-12-31 03:04:16

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with cper

Hi, Borislav,

Thank you for your comments.

在 2021/12/31 AM1:57, Borislav Petkov 写道:
> On Wed, Dec 29, 2021 at 11:22:11AM +0800, Shuai Xue wrote:
>> Yep, these fields are unpopulated by BIOS, I manually enable all Validation
>> Bits for debug so that we see the difference more clearly. I will declare it
>> in next version.
>
> Declare what? I can't parse your statement.

The ghes_edac log message is printed only when a validation bit is set, e.g.:

if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
p += sprintf(p, "node:%d ", mem_err->node);

Not all bits are populated by BIOS in my platform, I manually enable all
validation bits during test so that we can see log message and differences of all
fields more clearly.

+ mem_err->validation_bits = 0xfffffffffffffff;

>> Well, the purpose is not to improve but unify.
>
> The most importang goal with kernel code is improvement and less bugs.
> Unification is second. We should not jump through hoops and unify at
> every price just because there's a duplicated function somewhere.
> Remember that when doing your changes.

I see. Thank you.

>> Well, Robert suggested me add a unification patch[1] so that we could review
>> the changes more clearly. I think it makes sense.
>
> Not really. I can imagine why Robert suggested that but this strategy is
> causing unnecessary churn. What one usually does in such cases is:
>
> 1. Add changes to the target functionality - the one in cper.c - by
> explaining *why* those changes are needed.
>
> 2. Switch ghes_edac.c to that functionality and remove the redundant one
> there.
>
> Simple and clean diffstat and easy review.
>
> Thx.

Got it. I will send next version latter.

Merry Christmas and happy New Year.

Best Regards,
Shuai

2022-01-17 05:19:08

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 0/2] EDAC/ghes: refactor memory error reporting to avoid

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
cper_print_mem() in drivers/firmware/efi/cper.c only reports the error status
and misses its description.

This patch set is to refactor ghes_edac_report_mem_error with:

- Patch 01 is to wrap up error status decoding logics and reuse it in
cper_print_mem().
- Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
this can avoid bunch of duplicate code lines;


Changes since v2:

- delete the unified patch
- adjust the order of patches
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v1:

- add a new patch to unify ghes and cper before removing duplication.
- document the changes in patch description
- add EXPORT_SYMBOL_GPL()s for cper_*()
- document and the dependency and add UEFI_CPER dependency explicitly
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Robert Richter for review comments.

Shuai Xue (2):
efi/cper: add cper_mem_err_status_str to decode error description
EDAC/ghes: use cper functions to avoid code duplication

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 82 +++++++++++----
include/linux/cper.h | 3 +
4 files changed, 102 insertions(+), 180 deletions(-)

--
2.20.1.12.g72788fdb

2022-01-17 05:19:44

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Introduce a new helper function cper_mem_err_status_str() which is used to
decode the description of error status, and the cper_print_mem() will call
it and report the details of error status.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/firmware/efi/cper.c | 46 ++++++++++++++++++++++++++++++++++++-
include/linux/cper.h | 1 +
2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..addafccecd84 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,6 +211,49 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1:
+ return "Error detected internal to the component";
+ case 16:
+ return "Error detected in the bus";
+ case 4:
+ return "Storage error in DRAM memory";
+ case 5:
+ return "Storage error in TLB";
+ case 6:
+ return "Storage error in cache";
+ case 7:
+ return "Error in one or more functional units";
+ case 8:
+ return "component failed self test";
+ case 9:
+ return "Overflow or undervalue of internal queue";
+ case 17:
+ return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18:
+ return "Improper access error";
+ case 19:
+ return "Access to a memory address which is not mapped to any component";
+ case 20:
+ return "Loss of Lockstep";
+ case 21:
+ return "Response not associated with a request";
+ case 22:
+ return "Bus parity error - must also set the A, C, or D Bits";
+ case 23:
+ return "Detection of a PATH_ERROR ";
+ case 25:
+ return "Bus operation timeout";
+ case 26:
+ return "A read was issued to data that has been poisoned";
+ default:
+ return "reserved";
+ }
+}
+EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
+
static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
@@ -334,7 +377,8 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s""error_status: 0x%016llx, %s\n", pfx, mem->error_status,
+ cper_mem_err_status_str(mem->error_status));
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..5b1dd27b317d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -558,6 +558,7 @@ extern const char *const cper_proc_error_type_strs[4];
u64 cper_next_record_id(void);
const char *cper_severity_str(unsigned int);
const char *cper_mem_err_type_str(unsigned int);
+const char *cper_mem_err_status_str(u64 status);
void cper_print_bits(const char *prefix, unsigned int bits,
const char * const strs[], unsigned int strs_size);
void cper_mem_err_pack(const struct cper_sec_mem_err *,
--
2.20.1.12.g72788fdb

2022-01-17 05:19:45

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 2/2] EDAC/ghes: use cper functions to avoid code duplication

The memory error location processing in ghes_edac_report_mem_error() have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. To avoid the
duplicated code, introduce the above cper_*() into
ghes_edac_report_mem_error().

A side effect is that the report format in ghes_edac is changed. The
changes are to:

- add device info into ghes_edac
- change bit_pos to bit_position, col to column, requestorID to
requestor_id, etc in ghes_edac
- move requestor_id, responder_id, target_id and chip_id into memory error
location in ghes_edac
- add "DIMM location: not present." for DIMM location in ghes_edac
- remove the 'space' delimiter after the colon in cper_mem_err_location(), suggested by Robert

The original EDAC and cper error log are as follows (all Validation Bits
are enabled):
--------------------------------------------------------------------
[ 281.759984] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x8a5f45 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[ 281.818273] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 281.828878] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 281.839687] {2}[Hardware Error]: event severity: corrected
[ 281.847522] {2}[Hardware Error]: Error 0, type: corrected
[ 281.855331] {2}[Hardware Error]: section_type: memory error
[ 281.863389] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 281.871983] {2}[Hardware Error]: physical_address: 0x00000008a5f45020
[ 281.880944] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 281.890361] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1793 bank_group: 7 bank_address: 1 device: 0 row: 5310 column: 520 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 target_id: 0x0000000000000000 chip_id: 0
[ 281.921361] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 281.931021] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):
--------------------------------------------------------------------
[ 2335.677847] EDAC MC0: 1 CE single-symbol chipkill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 page:0x8cb743 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
[ 2335.753234] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 2335.763930] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 2335.774828] {2}[Hardware Error]: event severity: corrected
[ 2335.782761] {2}[Hardware Error]: Error 0, type: corrected
[ 2335.790671] {2}[Hardware Error]: section_type: memory error
[ 2335.798838] {2}[Hardware Error]: error_status: 0x0000000000000000, reserved
[ 2335.808444] {2}[Hardware Error]: physical_address: 0x00000008cb743020
[ 2335.817552] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 2335.827098] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[ 2335.857212] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 2335.867001] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

NOTE: Not all bits are populated by BIOS in my platform, I manually enable all
validation bits during test so that we can see log message and differences of
all fields more clearly. Therefore, the values of some fields look strange.

+ mem_err->validation_bits = 0xfffffffffffffff;

Although the UEFI_CPER/EDAC_GHES dependency is always solved through
ACPI_APEI_GHES/ACPI_APEI, add the UEFI_CPER dependency explicitly for
EDAC_GHES in Kconfig.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 36 +++----
include/linux/cper.h | 2 +
4 files changed, 56 insertions(+), 179 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58ab63642e72..23f11554f400 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..94e9f6b45817 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -235,10 +235,36 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int ghes_edac_mem_err_other_detail(const struct cper_sec_mem_err *mem,
+ char *msg, const char *location)
+{
+ u32 len, n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len = 2 * CPER_REC_LEN - 1;
+
+ n += snprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+ u64 status = mem->error_status;
+
+ n += snprintf(msg + n, len - n, "status(0x%016llx): ",
+ (long long)status);
+ n += snprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(status));
+ }
+
+ msg[n] = '\0';
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
+ struct cper_mem_err_compact cmem;
struct ghes_pvt *pvt;
unsigned long flags;
char *p;
@@ -292,60 +318,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,52 +338,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
-
+ p += cper_dimm_err_location(&cmem, p);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
e->top_layer = dimm->idx;
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -416,78 +359,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+ p += ghes_edac_mem_err_other_detail(mem_err, p, e->location);
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index addafccecd84..3efec4aca212 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -254,7 +254,7 @@ const char *cper_mem_err_status_str(u64 status)
}
EXPORT_SYMBOL_GPL(cper_mem_err_status_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -264,51 +264,52 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;

row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);

return n;
}
+EXPORT_SYMBOL_GPL(cper_mem_err_location);

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
@@ -327,6 +328,7 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)

return n;
}
+EXPORT_SYMBOL_GPL(cper_dimm_err_location);

void cper_mem_err_pack(const struct cper_sec_mem_err *mem,
struct cper_mem_err_compact *cmem)
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5b1dd27b317d..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb

2022-01-24 16:01:42

by Shuai Xue

[permalink] [raw]
Subject: [RESEND PATCH v3 0/2] EDAC/ghes: refactor memory error reporting to avoid

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
cper_print_mem() in drivers/firmware/efi/cper.c only reports the error status
and misses its description.

This patch set is to refactor ghes_edac_report_mem_error with:

- Patch 01 is to wrap up error status decoding logics and reuse it in
cper_print_mem().
- Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
this can avoid bunch of duplicate code lines;


Changes since v2:

- delete the unified patch
- adjusted the order of patches
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v1:

- add a new patch to unify ghes and cper before removing duplication.
- document the changes in patch description
- add EXPORT_SYMBOL_GPL()s for cper_*()
- document and the dependency and add UEFI_CPER dependency explicitly
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Robert Richter for review comments.

Shuai Xue (2):
efi/cper: add cper_mem_err_status_str to decode error description
EDAC/ghes: use cper functions to avoid code duplication

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 82 +++++++++++----
include/linux/cper.h | 3 +
4 files changed, 102 insertions(+), 180 deletions(-)

--
2.20.1.12.g72788fdb

2022-01-24 16:01:47

by Shuai Xue

[permalink] [raw]
Subject: [RESEND PATCH v3 2/2] EDAC/ghes: use cper functions to avoid code duplication

The memory error location processing in ghes_edac_report_mem_error() have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. To avoid the
duplicated code, introduce the above cper_*() into
ghes_edac_report_mem_error().

A side effect is that the report format in ghes_edac is changed. The
changes are to:

- add device info into ghes_edac
- change bit_pos to bit_position, col to column, requestorID to
requestor_id, etc in ghes_edac
- move requestor_id, responder_id, target_id and chip_id into memory error
location in ghes_edac
- add "DIMM location: not present." for DIMM location in ghes_edac
- remove the 'space' delimiter after the colon in cper_mem_err_location(), suggested by Robert

The original EDAC and cper error log are as follows (all Validation Bits
are enabled):
--------------------------------------------------------------------
[ 281.759984] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x8a5f45 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[ 281.818273] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 281.828878] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 281.839687] {2}[Hardware Error]: event severity: corrected
[ 281.847522] {2}[Hardware Error]: Error 0, type: corrected
[ 281.855331] {2}[Hardware Error]: section_type: memory error
[ 281.863389] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 281.871983] {2}[Hardware Error]: physical_address: 0x00000008a5f45020
[ 281.880944] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 281.890361] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1793 bank_group: 7 bank_address: 1 device: 0 row: 5310 column: 520 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 target_id: 0x0000000000000000 chip_id: 0
[ 281.921361] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 281.931021] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):
--------------------------------------------------------------------
[ 2335.677847] EDAC MC0: 1 CE single-symbol chipkill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 page:0x8cb743 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
[ 2335.753234] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 2335.763930] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 2335.774828] {2}[Hardware Error]: event severity: corrected
[ 2335.782761] {2}[Hardware Error]: Error 0, type: corrected
[ 2335.790671] {2}[Hardware Error]: section_type: memory error
[ 2335.798838] {2}[Hardware Error]: error_status: 0x0000000000000000, reserved
[ 2335.808444] {2}[Hardware Error]: physical_address: 0x00000008cb743020
[ 2335.817552] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 2335.827098] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[ 2335.857212] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 2335.867001] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

NOTE: Not all bits are populated by BIOS in my platform, I manually enable all
validation bits during test so that we can see log message and differences of
all fields more clearly. Therefore, the values of some fields look strange.

+ mem_err->validation_bits = 0xfffffffffffffff;

Although the UEFI_CPER/EDAC_GHES dependency is always solved through
ACPI_APEI_GHES/ACPI_APEI, add the UEFI_CPER dependency explicitly for
EDAC_GHES in Kconfig.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 36 +++----
include/linux/cper.h | 2 +
4 files changed, 56 insertions(+), 179 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58ab63642e72..23f11554f400 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..94e9f6b45817 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -235,10 +235,36 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int ghes_edac_mem_err_other_detail(const struct cper_sec_mem_err *mem,
+ char *msg, const char *location)
+{
+ u32 len, n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len = 2 * CPER_REC_LEN - 1;
+
+ n += snprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+ u64 status = mem->error_status;
+
+ n += snprintf(msg + n, len - n, "status(0x%016llx): ",
+ (long long)status);
+ n += snprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(status));
+ }
+
+ msg[n] = '\0';
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
+ struct cper_mem_err_compact cmem;
struct ghes_pvt *pvt;
unsigned long flags;
char *p;
@@ -292,60 +318,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,52 +338,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
-
+ p += cper_dimm_err_location(&cmem, p);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
e->top_layer = dimm->idx;
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -416,78 +359,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+ p += ghes_edac_mem_err_other_detail(mem_err, p, e->location);
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index addafccecd84..3efec4aca212 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -254,7 +254,7 @@ const char *cper_mem_err_status_str(u64 status)
}
EXPORT_SYMBOL_GPL(cper_mem_err_status_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -264,51 +264,52 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;

row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);

return n;
}
+EXPORT_SYMBOL_GPL(cper_mem_err_location);

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
@@ -327,6 +328,7 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)

return n;
}
+EXPORT_SYMBOL_GPL(cper_dimm_err_location);

void cper_mem_err_pack(const struct cper_sec_mem_err *mem,
struct cper_mem_err_compact *cmem)
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5b1dd27b317d..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb

2022-01-24 16:01:51

by Shuai Xue

[permalink] [raw]
Subject: [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Introduce a new helper function cper_mem_err_status_str() which is used to
decode the description of error status, and the cper_print_mem() will call
it and report the details of error status.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/firmware/efi/cper.c | 46 ++++++++++++++++++++++++++++++++++++-
include/linux/cper.h | 1 +
2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..addafccecd84 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,6 +211,49 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1:
+ return "Error detected internal to the component";
+ case 16:
+ return "Error detected in the bus";
+ case 4:
+ return "Storage error in DRAM memory";
+ case 5:
+ return "Storage error in TLB";
+ case 6:
+ return "Storage error in cache";
+ case 7:
+ return "Error in one or more functional units";
+ case 8:
+ return "component failed self test";
+ case 9:
+ return "Overflow or undervalue of internal queue";
+ case 17:
+ return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18:
+ return "Improper access error";
+ case 19:
+ return "Access to a memory address which is not mapped to any component";
+ case 20:
+ return "Loss of Lockstep";
+ case 21:
+ return "Response not associated with a request";
+ case 22:
+ return "Bus parity error - must also set the A, C, or D Bits";
+ case 23:
+ return "Detection of a PATH_ERROR ";
+ case 25:
+ return "Bus operation timeout";
+ case 26:
+ return "A read was issued to data that has been poisoned";
+ default:
+ return "reserved";
+ }
+}
+EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
+
static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
@@ -334,7 +377,8 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s""error_status: 0x%016llx, %s\n", pfx, mem->error_status,
+ cper_mem_err_status_str(mem->error_status));
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..5b1dd27b317d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -558,6 +558,7 @@ extern const char *const cper_proc_error_type_strs[4];
u64 cper_next_record_id(void);
const char *cper_severity_str(unsigned int);
const char *cper_mem_err_type_str(unsigned int);
+const char *cper_mem_err_status_str(u64 status);
void cper_print_bits(const char *prefix, unsigned int bits,
const char * const strs[], unsigned int strs_size);
void cper_mem_err_pack(const struct cper_sec_mem_err *,
--
2.20.1.12.g72788fdb

2022-01-24 23:48:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

On Mon, Jan 24, 2022 at 10:47:58AM +0800, Shuai Xue wrote:
> Introduce a new helper function cper_mem_err_status_str() which is used to
> decode the description of error status, and the cper_print_mem() will call
> it and report the details of error status.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/firmware/efi/cper.c | 46 ++++++++++++++++++++++++++++++++++++-
> include/linux/cper.h | 1 +
> 2 files changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 6ec8edec6329..addafccecd84 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -211,6 +211,49 @@ const char *cper_mem_err_type_str(unsigned int etype)
> }
> EXPORT_SYMBOL_GPL(cper_mem_err_type_str);
>
> +const char *cper_mem_err_status_str(u64 status)
> +{
> + switch ((status >> 8) & 0xff) {
> + case 1:
> + return "Error detected internal to the component";

You can make that table a lot more compact:

switch ((status >> 8) & 0xff) {
case 1: return "Error detected internal to the component";
case 4: return "Storage error in DRAM memory";
case 5: return "Storage error in TLB";
case 6: return "Storage error in cache";
case 7: return "Error in one or more functional units";
case 8: return "component failed self test";
case 9: return "Overflow or undervalue of internal queue";
case 16: return "Error detected in the bus";
...

> + case 16:
> + return "Error detected in the bus";

And yes, that 16 needs to come before 17, ofc.

> @@ -334,7 +377,8 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
> return;
> }
> if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
> - printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
> + printk("%s""error_status: 0x%016llx, %s\n", pfx, mem->error_status,
> + cper_mem_err_status_str(mem->error_status));

Arguments need to be aligned at opening brace, i.e.:

printk("%s""error_status: 0x%016llx, %s\n",
pfx, mem->error_status, cper_mem_err_status_str(mem->error_status));


Also, the naked error status number is not as user-friendly when we have
the decoded string. So the format should be:

printk("%s error_status: %s (0x%016llx)\n",
pfx, cper_mem_err_status_str(mem->error_status), mem->error_status);

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-01-25 08:47:28

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 0/2] EDAC/ghes: refactor memory error reporting to avoid code duplication

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
cper_print_mem() in drivers/firmware/efi/cper.c only reports the error status
and misses its description.

This patch set is to refactor ghes_edac_report_mem_error with:

- Patch 01 is to wrap up error status decoding logics and reuse it in
cper_print_mem().
- Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
this can avoid bunch of duplicate code lines;

Changes since v3:

- make cper_mem_err_status_str table a lot more compact
- Fix alignment and format problem
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v2:

- delete the unified patch
- adjusted the order of patches
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v1:

- add a new patch to unify ghes and cper before removing duplication.
- document the changes in patch description
- add EXPORT_SYMBOL_GPL()s for cper_*()
- document and the dependency and add UEFI_CPER dependency explicitly
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Robert Richter for review comments.

Shuai Xue (2):
efi/cper: add cper_mem_err_status_str to decode error description
EDAC/ghes: use cper functions to avoid code duplication

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 65 ++++++++----
include/linux/cper.h | 3 +
4 files changed, 85 insertions(+), 180 deletions(-)

--
2.20.1.12.g72788fdb

2022-01-25 08:47:28

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 2/2] EDAC/ghes: use cper functions to avoid code duplication

The memory error location processing in ghes_edac_report_mem_error() have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. To avoid the
duplicated code, introduce the above cper_*() into
ghes_edac_report_mem_error().

A side effect is that the report format in ghes_edac is changed. The
changes are to:

- add device info into ghes_edac
- change bit_pos to bit_position, col to column, requestorID to
requestor_id, etc in ghes_edac
- move requestor_id, responder_id, target_id and chip_id into memory error
location in ghes_edac
- add "DIMM location: not present." for DIMM location in ghes_edac
- remove the 'space' delimiter after the colon in cper_mem_err_location(), suggested by Robert

The original EDAC and cper error log are as follows (all Validation Bits
are enabled):
--------------------------------------------------------------------
[ 281.759984] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x8a5f45 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[ 281.818273] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 281.828878] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 281.839687] {2}[Hardware Error]: event severity: corrected
[ 281.847522] {2}[Hardware Error]: Error 0, type: corrected
[ 281.855331] {2}[Hardware Error]: section_type: memory error
[ 281.863389] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 281.871983] {2}[Hardware Error]: physical_address: 0x00000008a5f45020
[ 281.880944] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 281.890361] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1793 bank_group: 7 bank_address: 1 device: 0 row: 5310 column: 520 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 target_id: 0x0000000000000000 chip_id: 0
[ 281.921361] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 281.931021] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):
--------------------------------------------------------------------
[ 2335.677847] EDAC MC0: 1 CE single-symbol chipkill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 page:0x8cb743 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
[ 2335.753234] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 2335.763930] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 2335.774828] {2}[Hardware Error]: event severity: corrected
[ 2335.782761] {2}[Hardware Error]: Error 0, type: corrected
[ 2335.790671] {2}[Hardware Error]: section_type: memory error
[ 2335.798838] {2}[Hardware Error]: error_status: reserved (0x0000000000000000)
[ 2335.808444] {2}[Hardware Error]: physical_address: 0x00000008cb743020
[ 2335.817552] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 2335.827098] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[ 2335.857212] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 2335.867001] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

NOTE: Not all bits are populated by BIOS in my platform, I manually enable all
validation bits during test so that we can see log message and differences of
all fields more clearly. Therefore, the values of some fields look strange.

+ mem_err->validation_bits = 0xfffffffffffffff;

Although the UEFI_CPER/EDAC_GHES dependency is always solved through
ACPI_APEI_GHES/ACPI_APEI, add the UEFI_CPER dependency explicitly for
EDAC_GHES in Kconfig.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 36 +++----
include/linux/cper.h | 2 +
4 files changed, 56 insertions(+), 179 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58ab63642e72..23f11554f400 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..94e9f6b45817 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -235,10 +235,36 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int ghes_edac_mem_err_other_detail(const struct cper_sec_mem_err *mem,
+ char *msg, const char *location)
+{
+ u32 len, n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len = 2 * CPER_REC_LEN - 1;
+
+ n += snprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+ u64 status = mem->error_status;
+
+ n += snprintf(msg + n, len - n, "status(0x%016llx): ",
+ (long long)status);
+ n += snprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(status));
+ }
+
+ msg[n] = '\0';
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
+ struct cper_mem_err_compact cmem;
struct ghes_pvt *pvt;
unsigned long flags;
char *p;
@@ -292,60 +318,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,52 +338,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
-
+ p += cper_dimm_err_location(&cmem, p);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
e->top_layer = dimm->idx;
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -416,78 +359,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+ p += ghes_edac_mem_err_other_detail(mem_err, p, e->location);
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 7f08d4ea906e..e49ccf131fad 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -236,7 +236,7 @@ const char *cper_mem_err_status_str(u64 status)
}
EXPORT_SYMBOL_GPL(cper_mem_err_status_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -246,51 +246,52 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;

row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);

return n;
}
+EXPORT_SYMBOL_GPL(cper_mem_err_location);

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
@@ -309,6 +310,7 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)

return n;
}
+EXPORT_SYMBOL_GPL(cper_dimm_err_location);

void cper_mem_err_pack(const struct cper_sec_mem_err *mem,
struct cper_mem_err_compact *cmem)
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5b1dd27b317d..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb

2022-01-25 08:47:44

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Introduce a new helper function cper_mem_err_status_str() which is used to
decode the description of error status, and the cper_print_mem() will call
it and report the details of error status.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/firmware/efi/cper.c | 29 ++++++++++++++++++++++++++++-
include/linux/cper.h | 1 +
2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..7f08d4ea906e 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,6 +211,31 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1: return "Error detected internal to the component";
+ case 4: return "Storage error in DRAM memory";
+ case 5: return "Storage error in TLB";
+ case 6: return "Storage error in cache";
+ case 7: return "Error in one or more functional units";
+ case 8: return "component failed self test";
+ case 9: return "Overflow or undervalue of internal queue";
+ case 16: return "Error detected in the bus";
+ case 17: return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18: return "Improper access error";
+ case 19: return "Access to a memory address which is not mapped to any component";
+ case 20: return "Loss of Lockstep";
+ case 21: return "Response not associated with a request";
+ case 22: return "Bus parity error - must also set the A, C, or D Bits";
+ case 23: return "Detection of a PATH_ERROR ";
+ case 25: return "Bus operation timeout";
+ case 26: return "A read was issued to data that has been poisoned";
+ default: return "reserved";
+ }
+}
+EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
+
static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
@@ -334,7 +359,9 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s""error_status: %s (0x%016llx)\n",
+ pfx, cper_mem_err_status_str(mem->error_status),
+ mem->error_status);
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..5b1dd27b317d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -558,6 +558,7 @@ extern const char *const cper_proc_error_type_strs[4];
u64 cper_next_record_id(void);
const char *cper_severity_str(unsigned int);
const char *cper_mem_err_type_str(unsigned int);
+const char *cper_mem_err_status_str(u64 status);
void cper_print_bits(const char *prefix, unsigned int bits,
const char * const strs[], unsigned int strs_size);
void cper_mem_err_pack(const struct cper_sec_mem_err *,
--
2.20.1.12.g72788fdb

2022-01-25 08:47:57

by Shuai Xue

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Hi, Borislav,

Thank you for your valuable comments.

在 2022/1/25 AM5:16, Borislav Petkov 写道:
> On Mon, Jan 24, 2022 at 10:47:58AM +0800, Shuai Xue wrote:
>> Introduce a new helper function cper_mem_err_status_str() which is used to
>> decode the description of error status, and the cper_print_mem() will call
>> it and report the details of error status.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> drivers/firmware/efi/cper.c | 46 ++++++++++++++++++++++++++++++++++++-
>> include/linux/cper.h | 1 +
>> 2 files changed, 46 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index 6ec8edec6329..addafccecd84 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -211,6 +211,49 @@ const char *cper_mem_err_type_str(unsigned int etype)
>> }
>> EXPORT_SYMBOL_GPL(cper_mem_err_type_str);
>>
>> +const char *cper_mem_err_status_str(u64 status)
>> +{
>> + switch ((status >> 8) & 0xff) {
>> + case 1:
>> + return "Error detected internal to the component";
>
> You can make that table a lot more compact:
>
> switch ((status >> 8) & 0xff) {
> case 1: return "Error detected internal to the component";
> case 4: return "Storage error in DRAM memory";
> case 5: return "Storage error in TLB";
> case 6: return "Storage error in cache";
> case 7: return "Error in one or more functional units";
> case 8: return "component failed self test";
> case 9: return "Overflow or undervalue of internal queue";
> case 16: return "Error detected in the bus";
> ...
>
>> + case 16:
>> + return "Error detected in the bus";
>
> And yes, that 16 needs to come before 17, ofc.

I will fix it in next version.

>> @@ -334,7 +377,8 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
>> return;
>> }
>> if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
>> - printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
>> + printk("%s""error_status: 0x%016llx, %s\n", pfx, mem->error_status,
>> + cper_mem_err_status_str(mem->error_status));
>
> Arguments need to be aligned at opening brace, i.e.:
>
> printk("%s""error_status: 0x%016llx, %s\n",
> pfx, mem->error_status, cper_mem_err_status_str(mem->error_status));
>
>
> Also, the naked error status number is not as user-friendly when we have
> the decoded string. So the format should be:
>
> printk("%s error_status: %s (0x%016llx)\n",
> pfx, cper_mem_err_status_str(mem->error_status), mem->error_status);
>

Good point. Will fix it.

Best Regard,
Shuai

2022-01-25 14:51:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

On Tue, Jan 25, 2022 at 10:45:31AM +0800, Shuai Xue wrote:
> I will fix it in next version.

Yes, thanks.

However, you don't have to resend immediately but wait instead until
people have had time to review the whole thing. And while you're
waiting, you can read through Documentation/process/...

There are passages like the following one, for example:

"Don't get discouraged - or impatient
------------------------------------

After you have submitted your change, be patient and wait. Reviewers are
busy people and may not get to your patch right away.

Once upon a time, patches used to disappear into the void without comment,
but the development process works more smoothly than that now. You should
receive comments within a week or so; if that does not happen, make sure
that you have sent your patches to the right place. Wait for a minimum of
one week before resubmitting or pinging reviewers - possibly longer during
busy times like merge windows.

It's also ok to resend the patch or the patch series after a couple of
weeks with the word "RESEND" added to the subject line::

[PATCH Vx RESEND] sub/sys: Condensed patch summary

Don't add "RESEND" when you are submitting a modified version of your
patch or patch series - "RESEND" only applies to resubmission of a
patch or patch series which have not been modified in any way from the
previous submission."

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-01-25 16:45:34

by Shuai Xue

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Hi, Borislav,

Thank you for your reply.

I am sorry if you feel the RESEND tag is pushing you.

Take your time, I will be more patient :)

Best Regards,
Shuai

在 2022/1/25 PM6:21, Borislav Petkov 写道:
> On Tue, Jan 25, 2022 at 10:45:31AM +0800, Shuai Xue wrote:
>> I will fix it in next version.
>
> Yes, thanks.
>
> However, you don't have to resend immediately but wait instead until
> people have had time to review the whole thing. And while you're
> waiting, you can read through Documentation/process/...
>
> There are passages like the following one, for example:
>
> "Don't get discouraged - or impatient
> ------------------------------------
>
> After you have submitted your change, be patient and wait. Reviewers are
> busy people and may not get to your patch right away.
>
> Once upon a time, patches used to disappear into the void without comment,
> but the development process works more smoothly than that now. You should
> receive comments within a week or so; if that does not happen, make sure
> that you have sent your patches to the right place. Wait for a minimum of
> one week before resubmitting or pinging reviewers - possibly longer during
> busy times like merge windows.
>
> It's also ok to resend the patch or the patch series after a couple of
> weeks with the word "RESEND" added to the subject line::
>
> [PATCH Vx RESEND] sub/sys: Condensed patch summary
>
> Don't add "RESEND" when you are submitting a modified version of your
> patch or patch series - "RESEND" only applies to resubmission of a
> patch or patch series which have not been modified in any way from the
> previous submission."
>

2022-01-25 17:00:00

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

On Tue, Jan 25, 2022 at 07:49:41PM +0800, Shuai Xue wrote:
> I am sorry if you feel the RESEND tag is pushing you.

It is not pushing me - there are rules, simply. Rules you should read
first before sending patches.

How about I start flooding you a patchset every day?

Also, please do not top-post. That's also explained in that
documentation directory I mentioned. Read about it too pls.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-01-25 22:47:36

by Shuai Xue

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Hi Borislav,

在 2022/1/25 PM8:37, Borislav Petkov 写道:
> On Tue, Jan 25, 2022 at 07:49:41PM +0800, Shuai Xue wrote:
>> I am sorry if you feel the RESEND tag is pushing you.
>
> It is not pushing me - there are rules, simply. Rules you should read
> first before sending patches.

Got it. I will learn rules first.

> How about I start flooding you a patchset every day?

Haha, I see. I am sorry to bother you and thank you very much for your patient
and valuable comments, just take your time. By the way, after sending patchset
v3 8 days, I resend it yesterday, and the patchset v4 sent today is to address
your comments, not a resend patchset. Anyway, I will be more patient.

> Also, please do not top-post. That's also explained in that
> documentation directory I mentioned. Read about it too pls.

I will, thank you.

Best Regards,
Shuai

2022-01-26 07:09:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] efi/cper: add cper_mem_err_status_str to decode error description

On Tue, Jan 25, 2022 at 10:49:38AM +0800, Shuai Xue wrote:
> Introduce a new helper function cper_mem_err_status_str() which is used to
> decode the description of error status, and the cper_print_mem() will call
> it and report the details of error status.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/firmware/efi/cper.c | 29 ++++++++++++++++++++++++++++-
> include/linux/cper.h | 1 +
> 2 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 6ec8edec6329..7f08d4ea906e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -211,6 +211,31 @@ const char *cper_mem_err_type_str(unsigned int etype)
> }
> EXPORT_SYMBOL_GPL(cper_mem_err_type_str);
>
> +const char *cper_mem_err_status_str(u64 status)
> +{
> + switch ((status >> 8) & 0xff) {
> + case 1: return "Error detected internal to the component";
> + case 4: return "Storage error in DRAM memory";
> + case 5: return "Storage error in TLB";
> + case 6: return "Storage error in cache";
> + case 7: return "Error in one or more functional units";
> + case 8: return "component failed self test";

Well, at least start them all with capital letters: "Component... " And
yes, I know this is how it is in the spec but the spec has typos and
other problems - doesn't mean we have to copy them too.

> + case 9: return "Overflow or undervalue of internal queue";
> + case 16: return "Error detected in the bus";
> + case 17: return "Virtual address not found on IO-TLB or IO-PDIR";
> + case 18: return "Improper access error";
> + case 19: return "Access to a memory address which is not mapped to any component";
> + case 20: return "Loss of Lockstep";
> + case 21: return "Response not associated with a request";
> + case 22: return "Bus parity error - must also set the A, C, or D Bits";
> + case 23: return "Detection of a PATH_ERROR ";

Trailing space here. Also what is PATH_ERROR?

That "PATH_ERROR" is nowhere else explained in that big fat UEFI spec.
2558 pages and they can't explain *that*. Geez.

> + case 25: return "Bus operation timeout";
> + case 26: return "A read was issued to data that has been poisoned";
> + default: return "reserved";
> + }
> +}
> +EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
> +
> static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> {
> u32 len, n;
> @@ -334,7 +359,9 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
> return;
> }
> if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
> - printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
> + printk("%s""error_status: %s (0x%016llx)\n",

Why do you insist on having two back-to-back strings instead of one
here?

(And don't tell me it is because the other function calls here do it
too.)

FWIW, even checkpatch complains here:

WARNING: Consecutive strings are generally better as a single string
#87: FILE: drivers/firmware/efi/cper.c:362:
+ printk("%s""error_status: %s (0x%016llx)\n",

Btw, please integrate scripts/checkpatch.pl into your patch creation
workflow. Some of the warnings/errors *actually* make sense.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-01-26 18:41:45

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Hi, Borislav,

Thank you for your comments.

在 2022/1/26 AM3:08, Borislav Petkov 写道:
> On Tue, Jan 25, 2022 at 10:49:38AM +0800, Shuai Xue wrote:
>> Introduce a new helper function cper_mem_err_status_str() which is used to
>> decode the description of error status, and the cper_print_mem() will call
>> it and report the details of error status.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> drivers/firmware/efi/cper.c | 29 ++++++++++++++++++++++++++++-
>> include/linux/cper.h | 1 +
>> 2 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index 6ec8edec6329..7f08d4ea906e 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -211,6 +211,31 @@ const char *cper_mem_err_type_str(unsigned int etype)
>> }
>> EXPORT_SYMBOL_GPL(cper_mem_err_type_str);
>>
>> +const char *cper_mem_err_status_str(u64 status)
>> +{
>> + switch ((status >> 8) & 0xff) {
>> + case 1: return "Error detected internal to the component";
>> + case 4: return "Storage error in DRAM memory";
>> + case 5: return "Storage error in TLB";
>> + case 6: return "Storage error in cache";
>> + case 7: return "Error in one or more functional units";
>> + case 8: return "component failed self test";
>
> Well, at least start them all with capital letters: "Component... " And
> yes, I know this is how it is in the spec but the spec has typos and
> other problems - doesn't mean we have to copy them too.

You are right, I will fix it in next version.

>> + case 9: return "Overflow or undervalue of internal queue";
>> + case 16: return "Error detected in the bus";
>> + case 17: return "Virtual address not found on IO-TLB or IO-PDIR";
>> + case 18: return "Improper access error";
>> + case 19: return "Access to a memory address which is not mapped to any component";
>> + case 20: return "Loss of Lockstep";
>> + case 21: return "Response not associated with a request";
>> + case 22: return "Bus parity error - must also set the A, C, or D Bits";
>> + case 23: return "Detection of a PATH_ERROR ";
>
> Trailing space here.

Sorry, will delete it.

> Also what is PATH_ERROR?
>
> That "PATH_ERROR" is nowhere else explained in that big fat UEFI spec.
> 2558 pages and they can't explain *that*. Geez.

I don't know either. A related item I found is "iSCSI Device Path error".
Section 10 defines the device path protocol, but I don't know if "PATH_ERROR"
means the path of the device is not found, or something else.

>> + case 25: return "Bus operation timeout";
>> + case 26: return "A read was issued to data that has been poisoned";
>> + default: return "reserved";
>> + }
>> +}
>> +EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
>> +
>> static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
>> {
>> u32 len, n;
>> @@ -334,7 +359,9 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
>> return;
>> }
>> if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
>> - printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
>> + printk("%s""error_status: %s (0x%016llx)\n",
>
> Why do you insist on having two back-to-back strings instead of one
> here?
>
> (And don't tell me it is because the other function calls here do it
> too.)
>
> FWIW, even checkpatch complains here:
>
> WARNING: Consecutive strings are generally better as a single string
> #87: FILE: drivers/firmware/efi/cper.c:362:
> + printk("%s""error_status: %s (0x%016llx)\n",
>
> Btw, please integrate scripts/checkpatch.pl into your patch creation
> workflow. Some of the warnings/errors *actually* make sense.

Sorry, I did integrate scripts/checkpatch.pl before sending patch. And as you see,
other function calls here do the same, so I ignored the warnings. I will change as
your comments in next version:

- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s error_status: %s (0x%016llx)\n",


Best regards,
Shuai

2022-01-26 20:30:52

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 0/2] EDAC/ghes: refactor memory error reporting to avoid code duplication

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
status and misses its description.

This patch set is to refactor ghes_edac_report_mem_error with:

- Patch 01 is to wrap up error status decoding logics and reuse it in
cper_print_mem().
- Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
this can avoid bunch of duplicate code lines;

Changes since v4:
- Fix alignment and format problem
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v3:

- make cper_mem_err_status_str table a lot more compact
- Fix alignment and format problem
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v2:

- delete the unified patch
- adjusted the order of patches
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v1:

- add a new patch to unify ghes and cper before removing duplication.
- document the changes in patch description
- add EXPORT_SYMBOL_GPL()s for cper_*()
- document and the dependency and add UEFI_CPER dependency explicitly
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Robert Richter for review comments.

Shuai Xue (2):
efi/cper: add cper_mem_err_status_str to decode error description
EDAC/ghes: use cper functions to avoid code duplication

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 66 ++++++++----
include/linux/cper.h | 3 +
4 files changed, 86 insertions(+), 180 deletions(-)

--
2.20.1.12.g72788fdb

2022-01-26 20:31:18

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 2/2] EDAC/ghes: use cper functions to avoid code duplication

The memory error location processing in ghes_edac_report_mem_error() have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. To avoid the
duplicated code, introduce the above cper_*() into
ghes_edac_report_mem_error().

A side effect is that the report format in ghes_edac is changed. The
changes are to:

- add device info into ghes_edac
- change bit_pos to bit_position, col to column, requestorID to
requestor_id, etc in ghes_edac
- move requestor_id, responder_id, target_id and chip_id into memory error
location in ghes_edac
- add "DIMM location: not present." for DIMM location in ghes_edac
- remove the 'space' delimiter after the colon in cper_mem_err_location(),
suggested by Robert

The original EDAC and cper error log are as follows (all Validation Bits
are enabled):
--------------------------------------------------------------------
[ 281.759984] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x8a5f45 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[ 281.818273] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 281.828878] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 281.839687] {2}[Hardware Error]: event severity: corrected
[ 281.847522] {2}[Hardware Error]: Error 0, type: corrected
[ 281.855331] {2}[Hardware Error]: section_type: memory error
[ 281.863389] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 281.871983] {2}[Hardware Error]: physical_address: 0x00000008a5f45020
[ 281.880944] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 281.890361] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1793 bank_group: 7 bank_address: 1 device: 0 row: 5310 column: 520 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 target_id: 0x0000000000000000 chip_id: 0
[ 281.921361] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 281.931021] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):
--------------------------------------------------------------------
[ 2335.677847] EDAC MC0: 1 CE single-symbol chipkill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 page:0x8cb743 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): Reserved)
[ 2335.753234] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 2335.763930] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 2335.774828] {2}[Hardware Error]: event severity: corrected
[ 2335.782761] {2}[Hardware Error]: Error 0, type: corrected
[ 2335.790671] {2}[Hardware Error]: section_type: memory error
[ 2335.798838] {2}[Hardware Error]: error_status: Reserved (0x0000000000000000)
[ 2335.808444] {2}[Hardware Error]: physical_address: 0x00000008cb743020
[ 2335.817552] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 2335.827098] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[ 2335.857212] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 2335.867001] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

NOTE: Not all bits are populated by BIOS in my platform, I manually enable
all validation bits during test so that we can see log message and
differences of all fields more clearly. Therefore, the values of some
fields look strange.

+ mem_err->validation_bits = 0xfffffffffffffff;

Although the UEFI_CPER/EDAC_GHES dependency is always solved through
ACPI_APEI_GHES/ACPI_APEI, add the UEFI_CPER dependency explicitly for
EDAC_GHES in Kconfig.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 36 +++----
include/linux/cper.h | 2 +
4 files changed, 56 insertions(+), 179 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58ab63642e72..23f11554f400 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..94e9f6b45817 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -235,10 +235,36 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int ghes_edac_mem_err_other_detail(const struct cper_sec_mem_err *mem,
+ char *msg, const char *location)
+{
+ u32 len, n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len = 2 * CPER_REC_LEN - 1;
+
+ n += snprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+ u64 status = mem->error_status;
+
+ n += snprintf(msg + n, len - n, "status(0x%016llx): ",
+ (long long)status);
+ n += snprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(status));
+ }
+
+ msg[n] = '\0';
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
+ struct cper_mem_err_compact cmem;
struct ghes_pvt *pvt;
unsigned long flags;
char *p;
@@ -292,60 +318,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,52 +338,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
-
+ p += cper_dimm_err_location(&cmem, p);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
e->top_layer = dimm->idx;
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -416,78 +359,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+ p += ghes_edac_mem_err_other_detail(mem_err, p, e->location);
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 34eeaa59f04a..c762b0ace627 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -237,7 +237,7 @@ const char *cper_mem_err_status_str(u64 status)
}
EXPORT_SYMBOL_GPL(cper_mem_err_status_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -247,51 +247,52 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;

row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);

return n;
}
+EXPORT_SYMBOL_GPL(cper_mem_err_location);

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
@@ -310,6 +311,7 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)

return n;
}
+EXPORT_SYMBOL_GPL(cper_dimm_err_location);

void cper_mem_err_pack(const struct cper_sec_mem_err *mem,
struct cper_mem_err_compact *cmem)
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5b1dd27b317d..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb

2022-01-26 20:31:59

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Introduce a new helper function cper_mem_err_status_str() which is used to
decode the description of error status, and the cper_print_mem() will call
it and report the details of error status.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/firmware/efi/cper.c | 30 +++++++++++++++++++++++++++++-
include/linux/cper.h | 1 +
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..34eeaa59f04a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,6 +211,32 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1: return "Error detected internal to the component";
+ case 4: return "Storage error in DRAM memory";
+ case 5: return "Storage error in TLB";
+ case 6: return "Storage error in cache";
+ case 7: return "Error in one or more functional units";
+ case 8: return "Component failed self test";
+ case 9: return "Overflow or undervalue of internal queue";
+ case 16: return "Error detected in the bus";
+ case 17: return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18: return "Improper access error";
+ case 19: return "Access to a memory address which is not mapped to any component";
+ case 20: return "Loss of Lockstep";
+ case 21: return "Response not associated with a request";
+ case 22: return "Bus parity error - must also set the A, C, or D Bits";
+ case 23: return "Detection of a protocol error";
+ case 24: return "Detection of a PATH_ERROR";
+ case 25: return "Bus operation timeout";
+ case 26: return "A read was issued to data that has been poisoned";
+ default: return "Reserved";
+ }
+}
+EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
+
static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
@@ -334,7 +360,9 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s error_status: %s (0x%016llx)\n",
+ pfx, cper_mem_err_status_str(mem->error_status),
+ mem->error_status);
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..5b1dd27b317d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -558,6 +558,7 @@ extern const char *const cper_proc_error_type_strs[4];
u64 cper_next_record_id(void);
const char *cper_severity_str(unsigned int);
const char *cper_mem_err_type_str(unsigned int);
+const char *cper_mem_err_status_str(u64 status);
void cper_print_bits(const char *prefix, unsigned int bits,
const char * const strs[], unsigned int strs_size);
void cper_mem_err_pack(const struct cper_sec_mem_err *,
--
2.20.1.12.g72788fdb

2022-01-26 20:33:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 0/2] EDAC/ghes: refactor memory error reporting to avoid code duplication

On Wed, Jan 26, 2022 at 04:17:00PM +0800, Shuai Xue wrote:
> ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
> cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
> cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
> status and misses its description.

Dude, what about

wait for a week or until the patchset has been fully reviewed

don't you understand?!

Please let me know what about the review process is not clear to you so
that we can document it better.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-01-26 20:43:56

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 0/2] EDAC/ghes: refactor memory error reporting to avoid code duplication

Hi, Borislav,

在 2022/1/26 PM4:20, Borislav Petkov 写道:
> On Wed, Jan 26, 2022 at 04:17:00PM +0800, Shuai Xue wrote:
>> ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
>> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
>> cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
>> cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
>> status and misses its description.
>
> Dude, what about
>
> wait for a week or until the patchset has been fully reviewed
>
> don't you understand?!
>
> Please let me know what about the review process is not clear to you so
> that we can document it better.

Emmm, when I received your replied email, I thought you had fully reviewed them. So
I work to address your comments and reply as soon as possible. Sorry, I misunderstood.

Of course, I can wait. As I said before, take your time.

By the way, I have a question about review process: after waiting for a period
of time, how can I tell whether you have no comments or are still in review process?

Best Regards,
Shuai

2022-01-26 20:59:30

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 0/2] EDAC/ghes: refactor memory error reporting to avoid code duplication

On January 26, 2022 9:26:01 AM UTC, Shuai Xue <[email protected]> wrote:
>By the way, I have a question about review process: after waiting for a period
>of time, how can I tell whether you have no comments or are still in review process?
>

A good sign for when review is done is to wait to see replies to every patch.

BUT, there are other people on CC too so they would need to get a chance to have a look too.

Regardless, you wait for a week and then you incorporate all review comments and resend - not before.

This constant spamming with the patchset is not productive. You're not the only one who sends patches and wants review - you should consider that there are others who would need to get reviewed too.

--
Sent from a small device: formatting sux and brevity is inevitable.

2022-01-27 09:16:21

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 0/2] EDAC/ghes: refactor memory error reporting to avoid code duplication

Hi, Borislav,

在 2022/1/26 PM6:16, Boris Petkov 写道:
> On January 26, 2022 9:26:01 AM UTC, Shuai Xue <[email protected]> wrote:
>> By the way, I have a question about review process: after waiting for a period
>> of time, how can I tell whether you have no comments or are still in review process?
>>
>
> A good sign for when review is done is to wait to see replies to every patch.
>
> BUT, there are other people on CC too so they would need to get a chance to have a look too.
>
> Regardless, you wait for a week and then you incorporate all review comments and resend - not before.
>
> This constant spamming with the patchset is not productive. You're not the only one who sends patches and wants review - you should consider that there are others who would need to get reviewed too.

Got it. Thank you very much.

Best Regards,
Shuai

2022-02-01 02:04:19

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] EDAC/ghes: use cper functions to avoid code duplication

On Wed, Jan 26, 2022 at 04:17:02PM +0800, Shuai Xue wrote:
> The memory error location processing in ghes_edac_report_mem_error() have

I will look at this patch again after you have incorporated in all
review comments from last time:

https://lore.kernel.org/r/[email protected]

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-02-14 13:39:48

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] EDAC/ghes: use cper functions to avoid code duplication

Hi, Borislav,

在 2022/1/29 PM9:09, Borislav Petkov 写道:
> On Wed, Jan 26, 2022 at 04:17:02PM +0800, Shuai Xue wrote:
>> The memory error location processing in ghes_edac_report_mem_error() have
>
> I will look at this patch again after you have incorporated in all
> review comments from last time:
>
> https://lore.kernel.org/r/[email protected]
>

Happy Chinese New Year. Sorry for getting back to you late. I was on holiday
last weak.

I have try to address your comments in this version. If I missed your comments,
please let me know, thank you.

Best Regards,
Shuai



2022-03-03 13:39:33

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v6 0/2] EDAC/ghes: refactor memory error reporting to avoid code duplication

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
status and misses its description.

This patch set is to refactor ghes_edac_report_mem_error with:

- Patch 01 is to wrap up error status decoding logics and reuse it in
cper_print_mem().
- Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
this can avoid bunch of duplicate code lines;

Changes since v5:
- Delete change summary in commit log
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v4:
- Fix alignment and format problem
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v3:

- make cper_mem_err_status_str table a lot more compact
- Fix alignment and format problem
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v2:

- delete the unified patch
- adjusted the order of patches
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v1:

- add a new patch to unify ghes and cper before removing duplication.
- document the changes in patch description
- add EXPORT_SYMBOL_GPL()s for cper_*()
- document and the dependency and add UEFI_CPER dependency explicitly
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Robert Richter for review comments.

Shuai Xue (2):
efi/cper: add cper_mem_err_status_str to decode error description
EDAC/ghes: use cper functions to avoid code duplication

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 66 ++++++++----
include/linux/cper.h | 3 +
4 files changed, 86 insertions(+), 180 deletions(-)

--
2.20.1.12.g72788fdb

2022-03-03 17:16:08

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v6 1/2] efi/cper: add cper_mem_err_status_str to decode error description

Introduce a new helper function cper_mem_err_status_str() which is used to
decode the description of error status, and the cper_print_mem() will call
it and report the details of error status.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/firmware/efi/cper.c | 30 +++++++++++++++++++++++++++++-
include/linux/cper.h | 1 +
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..34eeaa59f04a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,6 +211,32 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1: return "Error detected internal to the component";
+ case 4: return "Storage error in DRAM memory";
+ case 5: return "Storage error in TLB";
+ case 6: return "Storage error in cache";
+ case 7: return "Error in one or more functional units";
+ case 8: return "Component failed self test";
+ case 9: return "Overflow or undervalue of internal queue";
+ case 16: return "Error detected in the bus";
+ case 17: return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18: return "Improper access error";
+ case 19: return "Access to a memory address which is not mapped to any component";
+ case 20: return "Loss of Lockstep";
+ case 21: return "Response not associated with a request";
+ case 22: return "Bus parity error - must also set the A, C, or D Bits";
+ case 23: return "Detection of a protocol error";
+ case 24: return "Detection of a PATH_ERROR";
+ case 25: return "Bus operation timeout";
+ case 26: return "A read was issued to data that has been poisoned";
+ default: return "Reserved";
+ }
+}
+EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
+
static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
@@ -334,7 +360,9 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s error_status: %s (0x%016llx)\n",
+ pfx, cper_mem_err_status_str(mem->error_status),
+ mem->error_status);
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..5b1dd27b317d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -558,6 +558,7 @@ extern const char *const cper_proc_error_type_strs[4];
u64 cper_next_record_id(void);
const char *cper_severity_str(unsigned int);
const char *cper_mem_err_type_str(unsigned int);
+const char *cper_mem_err_status_str(u64 status);
void cper_print_bits(const char *prefix, unsigned int bits,
const char * const strs[], unsigned int strs_size);
void cper_mem_err_pack(const struct cper_sec_mem_err *,
--
2.20.1.12.g72788fdb

2022-03-03 19:18:56

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v6 2/2] EDAC/ghes: use cper functions to avoid code duplication

The memory error location processing in ghes_edac_report_mem_error() have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. To avoid the
duplicated code, introduce the above cper_*() into
ghes_edac_report_mem_error().

The original EDAC and cper error log are as follows (all Validation Bits
are enabled):
--------------------------------------------------------------------
[ 281.759984] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x8a5f45 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1793 bank_group:7 bank_address:1 row:5310 col:520 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[ 281.818273] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 281.828878] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 281.839687] {2}[Hardware Error]: event severity: corrected
[ 281.847522] {2}[Hardware Error]: Error 0, type: corrected
[ 281.855331] {2}[Hardware Error]: section_type: memory error
[ 281.863389] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 281.871983] {2}[Hardware Error]: physical_address: 0x00000008a5f45020
[ 281.880944] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 281.890361] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1793 bank_group: 7 bank_address: 1 device: 0 row: 5310 column: 520 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 target_id: 0x0000000000000000 chip_id: 0
[ 281.921361] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 281.931021] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):
--------------------------------------------------------------------
[ 2335.677847] EDAC MC0: 1 CE single-symbol chipkill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 page:0x8cb743 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): Reserved)
[ 2335.753234] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 2335.763930] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 2335.774828] {2}[Hardware Error]: event severity: corrected
[ 2335.782761] {2}[Hardware Error]: Error 0, type: corrected
[ 2335.790671] {2}[Hardware Error]: section_type: memory error
[ 2335.798838] {2}[Hardware Error]: error_status: Reserved (0x0000000000000000)
[ 2335.808444] {2}[Hardware Error]: physical_address: 0x00000008cb743020
[ 2335.817552] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 2335.827098] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:770 bank_group:3 bank_address:2 device:0 row:6510 column:1544 bit_position:0 requestor_id: 0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[ 2335.857212] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 2335.867001] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
--------------------------------------------------------------------

NOTE: Not all bits are populated by BIOS in my platform, I manually enable
all validation bits during test so that we can see log message and
differences of all fields more clearly. Therefore, the values of some
fields look strange.

+ mem_err->validation_bits = 0xfffffffffffffff;

Although the UEFI_CPER/EDAC_GHES dependency is always solved through
ACPI_APEI_GHES/ACPI_APEI, add the UEFI_CPER dependency explicitly for
EDAC_GHES in Kconfig.

Signed-off-by: Shuai Xue <[email protected]>
---
To Borislav: Sorry, I only delete the format change summary in this commit
log in this version. If I missed any comments, could you please point out
clearly? Thank you very much.

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 196 +++++++-----------------------------
drivers/firmware/efi/cper.c | 36 +++----
include/linux/cper.h | 2 +
4 files changed, 56 insertions(+), 179 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58ab63642e72..23f11554f400 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..94e9f6b45817 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -235,10 +235,36 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int ghes_edac_mem_err_other_detail(const struct cper_sec_mem_err *mem,
+ char *msg, const char *location)
+{
+ u32 len, n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len = 2 * CPER_REC_LEN - 1;
+
+ n += snprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+ u64 status = mem->error_status;
+
+ n += snprintf(msg + n, len - n, "status(0x%016llx): ",
+ (long long)status);
+ n += snprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(status));
+ }
+
+ msg[n] = '\0';
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
+ struct cper_mem_err_compact cmem;
struct ghes_pvt *pvt;
unsigned long flags;
char *p;
@@ -292,60 +318,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,52 +338,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
-
+ p += cper_dimm_err_location(&cmem, p);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
e->top_layer = dimm->idx;
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -416,78 +359,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+ p += ghes_edac_mem_err_other_detail(mem_err, p, e->location);
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 34eeaa59f04a..c762b0ace627 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -237,7 +237,7 @@ const char *cper_mem_err_status_str(u64 status)
}
EXPORT_SYMBOL_GPL(cper_mem_err_status_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -247,51 +247,52 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;

row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);

return n;
}
+EXPORT_SYMBOL_GPL(cper_mem_err_location);

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
@@ -310,6 +311,7 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)

return n;
}
+EXPORT_SYMBOL_GPL(cper_dimm_err_location);

void cper_mem_err_pack(const struct cper_sec_mem_err *mem,
struct cper_mem_err_compact *cmem)
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5b1dd27b317d..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb

2022-03-04 20:40:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 2/2] EDAC/ghes: use cper functions to avoid code duplication

On Thu, Mar 03, 2022 at 08:26:26PM +0800, Shuai Xue wrote:
> To Borislav: Sorry, I only delete the format change summary in this commit
> log in this version. If I missed any comments, could you please point out
> clearly? Thank you very much.

I can't be more clear than the below - I simply took your patch and
heavily massaged it because it is a lot quicker this way.

You can compare it with your version and see what needed to be changed.
And you can of course ask questios why, if it is not clear. But it
should be - a shortlog of what I did is also in the commit message.

Now, when you look at the resulting patch you can see what the change
does. Yours did more than one logical thing - which a patch should never
do.

Also, I'd really really urge you to read

Documentation/process/submitting-patches.rst

carefully before sending other patches. I won't do this heavy editing
in the future so you're going to have to read review comments and
*incorporate* them into your patches before submitting them again.

You can test the below on your machine now to make sure it still
performs as expected.

Thx.

---
From: Shuai Xue <[email protected]>
Date: Thu, 3 Mar 2022 20:26:26 +0800
Subject: [PATCH] EDAC/ghes: Unify CPER memory error location reporting

Switch the GHES EDAC memory error reporting functions to use the common
CPER ones and get rid of code duplication.

[ bp:
- rewrite commit message, remove useless text
- rip out useless reformatting
- align function params on the opening brace
- rename function to a more descriptive name
- drop useless function exports
- handle buffer lengths properly when printing other detail
- remove useless casting
]

Signed-off-by: Shuai Xue <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 200 +++++++-----------------------------
drivers/firmware/efi/cper.c | 4 +-
include/linux/cper.h | 2 +
4 files changed, 42 insertions(+), 165 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58ab63642e72..23f11554f400 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..2805d5610300 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -15,11 +15,13 @@
#include "edac_module.h"
#include <ras/ras_event.h>

+#define OTHER_DETAIL_LEN 400
+
struct ghes_pvt {
struct mem_ctl_info *mci;

/* Buffers for the error handling routine */
- char other_detail[400];
+ char other_detail[OTHER_DETAIL_LEN];
char msg[80];
};

@@ -235,8 +237,34 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int print_mem_error_other_detail(const struct cper_sec_mem_err *mem, char *msg,
+ const char *location, unsigned int len)
+{
+ u32 n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len -= 1;
+
+ n += scnprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (!(mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS))
+ goto out;
+
+ n += scnprintf(msg + n, len - n, "status(0x%016llx): ", mem->error_status);
+ n += scnprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(mem->error_status));
+
+out:
+ msg[n] = '\0';
+
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
+ struct cper_mem_err_compact cmem;
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
struct ghes_pvt *pvt;
@@ -292,60 +320,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,52 +340,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
-
+ p += cper_dimm_err_location(&cmem, p);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
e->top_layer = dimm->idx;
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -416,78 +361,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+ p += print_mem_error_other_detail(mem_err, p, e->location, OTHER_DETAIL_LEN);
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 34eeaa59f04a..215c778fb33c 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -237,7 +237,7 @@ const char *cper_mem_err_status_str(u64 status)
}
EXPORT_SYMBOL_GPL(cper_mem_err_status_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -291,7 +291,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
return n;
}

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5b1dd27b317d..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.23.0

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-03-07 09:08:10

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v6 2/2] EDAC/ghes: use cper functions to avoid code duplication

Hi, Borislav,

在 2022/3/4 PM6:16, Borislav Petkov 写道:
> On Thu, Mar 03, 2022 at 08:26:26PM +0800, Shuai Xue wrote:
>> To Borislav: Sorry, I only delete the format change summary in this commit
>> log in this version. If I missed any comments, could you please point out
>> clearly? Thank you very much.
>
> I can't be more clear than the below - I simply took your patch and
> heavily massaged it because it is a lot quicker this way.
>
> You can compare it with your version and see what needed to be changed.
> And you can of course ask questios why, if it is not clear. But it
> should be - a shortlog of what I did is also in the commit message.
>
> Now, when you look at the resulting patch you can see what the change
> does. Yours did more than one logical thing - which a patch should never
> do.
>
> Also, I'd really really urge you to read
>
> Documentation/process/submitting-patches.rst
>
> carefully before sending other patches. I won't do this heavy editing
> in the future so you're going to have to read review comments and
> *incorporate* them into your patches before submitting them again.

Thank you very much for your valuable reworking. I see your concerns.

> You can test the below on your machine now to make sure it still
> performs as expected.

I have test this patch, it works well.

> - rip out useless reformatting

I have a question about the format after this patch:

[ 4578.277035] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 rank: 0 bank: 256 row: 11098 column: 1032 page:0x95ad2e offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 rank: 0 bank: 256 row: 11098 column: 1032)

Should we add a new patch to remove the 'space' delimiter after the colon?

Best Regards,
Shuai


> ---
> From: Shuai Xue <[email protected]>
> Date: Thu, 3 Mar 2022 20:26:26 +0800
> Subject: [PATCH] EDAC/ghes: Unify CPER memory error location reporting
>
> Switch the GHES EDAC memory error reporting functions to use the common
> CPER ones and get rid of code duplication.
>
> [ bp:
> - rewrite commit message, remove useless text
> - rip out useless reformatting
> - align function params on the opening brace
> - rename function to a more descriptive name
> - drop useless function exports
> - handle buffer lengths properly when printing other detail
> - remove useless casting
> ]
>
> Signed-off-by: Shuai Xue <[email protected]>
> Signed-off-by: Borislav Petkov <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> ---
> drivers/edac/Kconfig | 1 +
> drivers/edac/ghes_edac.c | 200 +++++++-----------------------------
> drivers/firmware/efi/cper.c | 4 +-
> include/linux/cper.h | 2 +
> 4 files changed, 42 insertions(+), 165 deletions(-)
>
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 58ab63642e72..23f11554f400 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
> config EDAC_GHES
> bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
> depends on ACPI_APEI_GHES && (EDAC=y)
> + select UEFI_CPER
> help
> Not all machines support hardware-driven error report. Some of those
> provide a BIOS-driven error report mechanism via ACPI, using the
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 6d1ddecbf0da..2805d5610300 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -15,11 +15,13 @@
> #include "edac_module.h"
> #include <ras/ras_event.h>
>
> +#define OTHER_DETAIL_LEN 400
> +
> struct ghes_pvt {
> struct mem_ctl_info *mci;
>
> /* Buffers for the error handling routine */
> - char other_detail[400];
> + char other_detail[OTHER_DETAIL_LEN];
> char msg[80];
> };
>
> @@ -235,8 +237,34 @@ static void ghes_scan_system(void)
> system_scanned = true;
> }
>
> +static int print_mem_error_other_detail(const struct cper_sec_mem_err *mem, char *msg,
> + const char *location, unsigned int len)
> +{
> + u32 n;
> +
> + if (!msg)
> + return 0;
> +
> + n = 0;
> + len -= 1;
> +
> + n += scnprintf(msg + n, len - n, "APEI location: %s ", location);
> +
> + if (!(mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS))
> + goto out;
> +
> + n += scnprintf(msg + n, len - n, "status(0x%016llx): ", mem->error_status);
> + n += scnprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(mem->error_status));
> +
> +out:
> + msg[n] = '\0';
> +
> + return n;
> +}
> +
> void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> {
> + struct cper_mem_err_compact cmem;
> struct edac_raw_error_desc *e;
> struct mem_ctl_info *mci;
> struct ghes_pvt *pvt;
> @@ -292,60 +320,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>
> /* Error type, mapped on e->msg */
> if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
> + u8 etype = mem_err->error_type;
> +
> p = pvt->msg;
> - switch (mem_err->error_type) {
> - case 0:
> - p += sprintf(p, "Unknown");
> - break;
> - case 1:
> - p += sprintf(p, "No error");
> - break;
> - case 2:
> - p += sprintf(p, "Single-bit ECC");
> - break;
> - case 3:
> - p += sprintf(p, "Multi-bit ECC");
> - break;
> - case 4:
> - p += sprintf(p, "Single-symbol ChipKill ECC");
> - break;
> - case 5:
> - p += sprintf(p, "Multi-symbol ChipKill ECC");
> - break;
> - case 6:
> - p += sprintf(p, "Master abort");
> - break;
> - case 7:
> - p += sprintf(p, "Target abort");
> - break;
> - case 8:
> - p += sprintf(p, "Parity Error");
> - break;
> - case 9:
> - p += sprintf(p, "Watchdog timeout");
> - break;
> - case 10:
> - p += sprintf(p, "Invalid address");
> - break;
> - case 11:
> - p += sprintf(p, "Mirror Broken");
> - break;
> - case 12:
> - p += sprintf(p, "Memory Sparing");
> - break;
> - case 13:
> - p += sprintf(p, "Scrub corrected error");
> - break;
> - case 14:
> - p += sprintf(p, "Scrub uncorrected error");
> - break;
> - case 15:
> - p += sprintf(p, "Physical Memory Map-out event");
> - break;
> - default:
> - p += sprintf(p, "reserved error (%d)",
> - mem_err->error_type);
> - }
> + p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
> } else {
> strcpy(pvt->msg, "unknown error");
> }
> @@ -362,52 +340,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>
> /* Memory error location, mapped on e->location */
> p = e->location;
> - if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
> - p += sprintf(p, "node:%d ", mem_err->node);
> - if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
> - p += sprintf(p, "card:%d ", mem_err->card);
> - if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
> - p += sprintf(p, "module:%d ", mem_err->module);
> - if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
> - p += sprintf(p, "rank:%d ", mem_err->rank);
> - if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
> - p += sprintf(p, "bank:%d ", mem_err->bank);
> - if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
> - p += sprintf(p, "bank_group:%d ",
> - mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
> - if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
> - p += sprintf(p, "bank_address:%d ",
> - mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
> - if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
> - u32 row = mem_err->row;
> -
> - row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
> - p += sprintf(p, "row:%d ", row);
> - }
> - if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
> - p += sprintf(p, "col:%d ", mem_err->column);
> - if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
> - p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
> + cper_mem_err_pack(mem_err, &cmem);
> + p += cper_mem_err_location(&cmem, p);
> +
> if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
> - const char *bank = NULL, *device = NULL;
> struct dimm_info *dimm;
>
> - dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
> - if (bank != NULL && device != NULL)
> - p += sprintf(p, "DIMM location:%s %s ", bank, device);
> - else
> - p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
> - mem_err->mem_dev_handle);
> -
> + p += cper_dimm_err_location(&cmem, p);
> dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
> if (dimm) {
> e->top_layer = dimm->idx;
> strcpy(e->label, dimm->label);
> }
> }
> - if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
> - p += sprintf(p, "chipID: %d ",
> - mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
> if (p > e->location)
> *(p - 1) = '\0';
>
> @@ -416,78 +361,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>
> /* All other fields are mapped on e->other_detail */
> p = pvt->other_detail;
> - p += snprintf(p, sizeof(pvt->other_detail),
> - "APEI location: %s ", e->location);
> - if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
> - u64 status = mem_err->error_status;
> -
> - p += sprintf(p, "status(0x%016llx): ", (long long)status);
> - switch ((status >> 8) & 0xff) {
> - case 1:
> - p += sprintf(p, "Error detected internal to the component ");
> - break;
> - case 16:
> - p += sprintf(p, "Error detected in the bus ");
> - break;
> - case 4:
> - p += sprintf(p, "Storage error in DRAM memory ");
> - break;
> - case 5:
> - p += sprintf(p, "Storage error in TLB ");
> - break;
> - case 6:
> - p += sprintf(p, "Storage error in cache ");
> - break;
> - case 7:
> - p += sprintf(p, "Error in one or more functional units ");
> - break;
> - case 8:
> - p += sprintf(p, "component failed self test ");
> - break;
> - case 9:
> - p += sprintf(p, "Overflow or undervalue of internal queue ");
> - break;
> - case 17:
> - p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
> - break;
> - case 18:
> - p += sprintf(p, "Improper access error ");
> - break;
> - case 19:
> - p += sprintf(p, "Access to a memory address which is not mapped to any component ");
> - break;
> - case 20:
> - p += sprintf(p, "Loss of Lockstep ");
> - break;
> - case 21:
> - p += sprintf(p, "Response not associated with a request ");
> - break;
> - case 22:
> - p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
> - break;
> - case 23:
> - p += sprintf(p, "Detection of a PATH_ERROR ");
> - break;
> - case 25:
> - p += sprintf(p, "Bus operation timeout ");
> - break;
> - case 26:
> - p += sprintf(p, "A read was issued to data that has been poisoned ");
> - break;
> - default:
> - p += sprintf(p, "reserved ");
> - break;
> - }
> - }
> - if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
> - p += sprintf(p, "requestorID: 0x%016llx ",
> - (long long)mem_err->requestor_id);
> - if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
> - p += sprintf(p, "responderID: 0x%016llx ",
> - (long long)mem_err->responder_id);
> - if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
> - p += sprintf(p, "targetID: 0x%016llx ",
> - (long long)mem_err->responder_id);
> + p += print_mem_error_other_detail(mem_err, p, e->location, OTHER_DETAIL_LEN);
> if (p > pvt->other_detail)
> *(p - 1) = '\0';
>
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 34eeaa59f04a..215c778fb33c 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -237,7 +237,7 @@ const char *cper_mem_err_status_str(u64 status)
> }
> EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
>
> -static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> {
> u32 len, n;
>
> @@ -291,7 +291,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> return n;
> }
>
> -static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
> +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
> {
> u32 len, n;
> const char *bank = NULL, *device = NULL;
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index 5b1dd27b317d..eacb7dd7b3af 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
> const struct cper_sec_proc_arm *proc);
> void cper_print_proc_ia(const char *pfx,
> const struct cper_sec_proc_ia *proc);
> +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
> +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);
>
> #endif

2022-03-07 09:33:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 2/2] EDAC/ghes: use cper functions to avoid code duplication

On Mon, Mar 07, 2022 at 02:16:42PM +0800, Shuai Xue wrote:
> Should we add a new patch to remove the 'space' delimiter after the colon?

Yeah, it looks like it would be easier to parse this way - both by
humans and tools.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-03-08 15:42:34

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v7 0/3] EDAC/ghes: refactor memory error reporting to avoid code duplication

ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
status and misses its description.

This patch set is to refactor ghes_edac_report_mem_error with:

- Patch 01 is to wrap up error status decoding logics and reuse it in
cper_print_mem().
- Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
this can avoid bunch of duplicate code lines;
- Patch 03 is to improve report format

Changes since v6:
- Rework patch 02 by Borislav Petkov
- add patch 03 to improve format
- Link: https://lore.kernel.org/r/[email protected]

Changes since v5:
- Delete change summary in commit log
- Link: https://lore.kernel.org/all/[email protected]/
- Thanks Borislav Petkov for review comments.

Changes since v4:
- Fix alignment and format problem
- Link: https://lore.kernel.org/all/[email protected]/

Changes since v3:

- make cper_mem_err_status_str table a lot more compact
- Fix alignment and format problem
- Link: https://lore.kernel.org/all/[email protected]/

Changes since v2:

- delete the unified patch
- adjusted the order of patches
- Link: https://lore.kernel.org/all/[email protected]/

Changes since v1:

- add a new patch to unify ghes and cper before removing duplication.
- document the changes in patch description
- add EXPORT_SYMBOL_GPL()s for cper_*()
- document and the dependency and add UEFI_CPER dependency explicitly
- Link: https://lore.kernel.org/all/[email protected]/

Shuai Xue (3):
efi/cper: add cper_mem_err_status_str to decode error description
EDAC/ghes: Unify CPER memory error location reporting
efi/cper: reformat CPER memory error location to more readable

drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 200 +++++++-----------------------------
drivers/firmware/efi/cper.c | 64 ++++++++----
include/linux/cper.h | 3 +
4 files changed, 87 insertions(+), 181 deletions(-)

--
2.20.1.12.g72788fdb

2022-03-08 18:52:26

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v7 2/3] EDAC/ghes: Unify CPER memory error location reporting

Switch the GHES EDAC memory error reporting functions to use the common
CPER ones and get rid of code duplication.

[ bp:
- rewrite commit message, remove useless text
- rip out useless reformatting
- align function params on the opening brace
- rename function to a more descriptive name
- drop useless function exports
- handle buffer lengths properly when printing other detail
- remove useless casting
]

Signed-off-by: Shuai Xue <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
drivers/edac/Kconfig | 1 +
drivers/edac/ghes_edac.c | 200 +++++++-----------------------------
drivers/firmware/efi/cper.c | 4 +-
include/linux/cper.h | 2 +
4 files changed, 42 insertions(+), 165 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58ab63642e72..23f11554f400 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -55,6 +55,7 @@ config EDAC_DECODE_MCE
config EDAC_GHES
bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
depends on ACPI_APEI_GHES && (EDAC=y)
+ select UEFI_CPER
help
Not all machines support hardware-driven error report. Some of those
provide a BIOS-driven error report mechanism via ACPI, using the
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..2805d5610300 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -15,11 +15,13 @@
#include "edac_module.h"
#include <ras/ras_event.h>

+#define OTHER_DETAIL_LEN 400
+
struct ghes_pvt {
struct mem_ctl_info *mci;

/* Buffers for the error handling routine */
- char other_detail[400];
+ char other_detail[OTHER_DETAIL_LEN];
char msg[80];
};

@@ -235,8 +237,34 @@ static void ghes_scan_system(void)
system_scanned = true;
}

+static int print_mem_error_other_detail(const struct cper_sec_mem_err *mem, char *msg,
+ const char *location, unsigned int len)
+{
+ u32 n;
+
+ if (!msg)
+ return 0;
+
+ n = 0;
+ len -= 1;
+
+ n += scnprintf(msg + n, len - n, "APEI location: %s ", location);
+
+ if (!(mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS))
+ goto out;
+
+ n += scnprintf(msg + n, len - n, "status(0x%016llx): ", mem->error_status);
+ n += scnprintf(msg + n, len - n, "%s ", cper_mem_err_status_str(mem->error_status));
+
+out:
+ msg[n] = '\0';
+
+ return n;
+}
+
void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
+ struct cper_mem_err_compact cmem;
struct edac_raw_error_desc *e;
struct mem_ctl_info *mci;
struct ghes_pvt *pvt;
@@ -292,60 +320,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Error type, mapped on e->msg */
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+ u8 etype = mem_err->error_type;
+
p = pvt->msg;
- switch (mem_err->error_type) {
- case 0:
- p += sprintf(p, "Unknown");
- break;
- case 1:
- p += sprintf(p, "No error");
- break;
- case 2:
- p += sprintf(p, "Single-bit ECC");
- break;
- case 3:
- p += sprintf(p, "Multi-bit ECC");
- break;
- case 4:
- p += sprintf(p, "Single-symbol ChipKill ECC");
- break;
- case 5:
- p += sprintf(p, "Multi-symbol ChipKill ECC");
- break;
- case 6:
- p += sprintf(p, "Master abort");
- break;
- case 7:
- p += sprintf(p, "Target abort");
- break;
- case 8:
- p += sprintf(p, "Parity Error");
- break;
- case 9:
- p += sprintf(p, "Watchdog timeout");
- break;
- case 10:
- p += sprintf(p, "Invalid address");
- break;
- case 11:
- p += sprintf(p, "Mirror Broken");
- break;
- case 12:
- p += sprintf(p, "Memory Sparing");
- break;
- case 13:
- p += sprintf(p, "Scrub corrected error");
- break;
- case 14:
- p += sprintf(p, "Scrub uncorrected error");
- break;
- case 15:
- p += sprintf(p, "Physical Memory Map-out event");
- break;
- default:
- p += sprintf(p, "reserved error (%d)",
- mem_err->error_type);
- }
+ p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype));
} else {
strcpy(pvt->msg, "unknown error");
}
@@ -362,52 +340,19 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* Memory error location, mapped on e->location */
p = e->location;
- if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
- p += sprintf(p, "node:%d ", mem_err->node);
- if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
- p += sprintf(p, "card:%d ", mem_err->card);
- if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
- p += sprintf(p, "module:%d ", mem_err->module);
- if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- p += sprintf(p, "rank:%d ", mem_err->rank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
- p += sprintf(p, "bank:%d ", mem_err->bank);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- p += sprintf(p, "bank_group:%d ",
- mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT);
- if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- p += sprintf(p, "bank_address:%d ",
- mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
- if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
- u32 row = mem_err->row;
-
- row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended);
- p += sprintf(p, "row:%d ", row);
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
- if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ cper_mem_err_pack(mem_err, &cmem);
+ p += cper_mem_err_location(&cmem, p);
+
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
- const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;

- dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);
- if (bank != NULL && device != NULL)
- p += sprintf(p, "DIMM location:%s %s ", bank, device);
- else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
- mem_err->mem_dev_handle);
-
+ p += cper_dimm_err_location(&cmem, p);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
if (dimm) {
e->top_layer = dimm->idx;
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';

@@ -416,78 +361,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)

/* All other fields are mapped on e->other_detail */
p = pvt->other_detail;
- p += snprintf(p, sizeof(pvt->other_detail),
- "APEI location: %s ", e->location);
- if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
- u64 status = mem_err->error_status;
-
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
- switch ((status >> 8) & 0xff) {
- case 1:
- p += sprintf(p, "Error detected internal to the component ");
- break;
- case 16:
- p += sprintf(p, "Error detected in the bus ");
- break;
- case 4:
- p += sprintf(p, "Storage error in DRAM memory ");
- break;
- case 5:
- p += sprintf(p, "Storage error in TLB ");
- break;
- case 6:
- p += sprintf(p, "Storage error in cache ");
- break;
- case 7:
- p += sprintf(p, "Error in one or more functional units ");
- break;
- case 8:
- p += sprintf(p, "component failed self test ");
- break;
- case 9:
- p += sprintf(p, "Overflow or undervalue of internal queue ");
- break;
- case 17:
- p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
- break;
- case 18:
- p += sprintf(p, "Improper access error ");
- break;
- case 19:
- p += sprintf(p, "Access to a memory address which is not mapped to any component ");
- break;
- case 20:
- p += sprintf(p, "Loss of Lockstep ");
- break;
- case 21:
- p += sprintf(p, "Response not associated with a request ");
- break;
- case 22:
- p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
- break;
- case 23:
- p += sprintf(p, "Detection of a PATH_ERROR ");
- break;
- case 25:
- p += sprintf(p, "Bus operation timeout ");
- break;
- case 26:
- p += sprintf(p, "A read was issued to data that has been poisoned ");
- break;
- default:
- p += sprintf(p, "reserved ");
- break;
- }
- }
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+ p += print_mem_error_other_detail(mem_err, p, e->location, OTHER_DETAIL_LEN);
if (p > pvt->other_detail)
*(p - 1) = '\0';

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 34eeaa59f04a..215c778fb33c 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -237,7 +237,7 @@ const char *cper_mem_err_status_str(u64 status)
}
EXPORT_SYMBOL_GPL(cper_mem_err_status_str);

-static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;

@@ -291,7 +291,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
return n;
}

-static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
const char *bank = NULL, *device = NULL;
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5b1dd27b317d..eacb7dd7b3af 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -569,5 +569,7 @@ void cper_print_proc_arm(const char *pfx,
const struct cper_sec_proc_arm *proc);
void cper_print_proc_ia(const char *pfx,
const struct cper_sec_proc_ia *proc);
+int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg);
+int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg);

#endif
--
2.20.1.12.g72788fdb

2022-03-09 00:09:46

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v7 3/3] efi/cper: reformat CPER memory error location to more readable

Remove the 'space' delimiter after the colon in cper_mem_err_location() so
that it would be easier to parse this way - both by humans and tools.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/firmware/efi/cper.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 215c778fb33c..e4e5ea7ce910 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -247,45 +247,45 @@ int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;

row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);

return n;
--
2.20.1.12.g72788fdb

2022-03-09 00:43:09

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v7 1/3] efi/cper: add cper_mem_err_status_str to decode error description

Introduce a new helper function cper_mem_err_status_str() which is used to
decode the description of error status, and the cper_print_mem() will call
it and report the details of error status.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/firmware/efi/cper.c | 30 +++++++++++++++++++++++++++++-
include/linux/cper.h | 1 +
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..34eeaa59f04a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -211,6 +211,32 @@ const char *cper_mem_err_type_str(unsigned int etype)
}
EXPORT_SYMBOL_GPL(cper_mem_err_type_str);

+const char *cper_mem_err_status_str(u64 status)
+{
+ switch ((status >> 8) & 0xff) {
+ case 1: return "Error detected internal to the component";
+ case 4: return "Storage error in DRAM memory";
+ case 5: return "Storage error in TLB";
+ case 6: return "Storage error in cache";
+ case 7: return "Error in one or more functional units";
+ case 8: return "Component failed self test";
+ case 9: return "Overflow or undervalue of internal queue";
+ case 16: return "Error detected in the bus";
+ case 17: return "Virtual address not found on IO-TLB or IO-PDIR";
+ case 18: return "Improper access error";
+ case 19: return "Access to a memory address which is not mapped to any component";
+ case 20: return "Loss of Lockstep";
+ case 21: return "Response not associated with a request";
+ case 22: return "Bus parity error - must also set the A, C, or D Bits";
+ case 23: return "Detection of a protocol error";
+ case 24: return "Detection of a PATH_ERROR";
+ case 25: return "Bus operation timeout";
+ case 26: return "A read was issued to data that has been poisoned";
+ default: return "Reserved";
+ }
+}
+EXPORT_SYMBOL_GPL(cper_mem_err_status_str);
+
static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
{
u32 len, n;
@@ -334,7 +360,9 @@ static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
return;
}
if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
- printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+ printk("%s error_status: %s (0x%016llx)\n",
+ pfx, cper_mem_err_status_str(mem->error_status),
+ mem->error_status);
if (mem->validation_bits & CPER_MEM_VALID_PA)
printk("%s""physical_address: 0x%016llx\n",
pfx, mem->physical_addr);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..5b1dd27b317d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -558,6 +558,7 @@ extern const char *const cper_proc_error_type_strs[4];
u64 cper_next_record_id(void);
const char *cper_severity_str(unsigned int);
const char *cper_mem_err_type_str(unsigned int);
+const char *cper_mem_err_status_str(u64 status);
void cper_print_bits(const char *prefix, unsigned int bits,
const char * const strs[], unsigned int strs_size);
void cper_mem_err_pack(const struct cper_sec_mem_err *,
--
2.20.1.12.g72788fdb

2022-03-22 06:37:46

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v7 0/3] EDAC/ghes: refactor memory error reporting to avoid code duplication

Gentle ping.

在 2022/3/8 PM10:40, Shuai Xue 写道:
> ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
> cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
> cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
> status and misses its description.
>
> This patch set is to refactor ghes_edac_report_mem_error with:
>
> - Patch 01 is to wrap up error status decoding logics and reuse it in
> cper_print_mem().
> - Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
> this can avoid bunch of duplicate code lines;
> - Patch 03 is to improve report format
>
> Changes since v6:
> - Rework patch 02 by Borislav Petkov
> - add patch 03 to improve format
> - Link: https://lore.kernel.org/r/[email protected]
>
> Changes since v5:
> - Delete change summary in commit log
> - Link: https://lore.kernel.org/all/[email protected]/
> - Thanks Borislav Petkov for review comments.
>
> Changes since v4:
> - Fix alignment and format problem
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Changes since v3:
>
> - make cper_mem_err_status_str table a lot more compact
> - Fix alignment and format problem
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Changes since v2:
>
> - delete the unified patch
> - adjusted the order of patches
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Changes since v1:
>
> - add a new patch to unify ghes and cper before removing duplication.
> - document the changes in patch description
> - add EXPORT_SYMBOL_GPL()s for cper_*()
> - document and the dependency and add UEFI_CPER dependency explicitly
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Shuai Xue (3):
> efi/cper: add cper_mem_err_status_str to decode error description
> EDAC/ghes: Unify CPER memory error location reporting
> efi/cper: reformat CPER memory error location to more readable
>
> drivers/edac/Kconfig | 1 +
> drivers/edac/ghes_edac.c | 200 +++++++-----------------------------
> drivers/firmware/efi/cper.c | 64 ++++++++----
> include/linux/cper.h | 3 +
> 4 files changed, 87 insertions(+), 181 deletions(-)


2022-04-01 15:21:19

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v7 0/3] EDAC/ghes: refactor memory error reporting to avoid code duplication

Hi, Borislav,

在 2022/3/8 PM10:40, Shuai Xue 写道:
> ghes_edac_report_mem_error() in ghes_edac.c is a Long Method and have
> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
> cper_mem_err_type_str() in drivers/firmware/efi/cper.c. In addition, the
> cper_print_mem() in drivers/firmware/efi/cper.c only reports the error
> status and misses its description.
>
> This patch set is to refactor ghes_edac_report_mem_error with:
>
> - Patch 01 is to wrap up error status decoding logics and reuse it in
> cper_print_mem().
> - Patch 02 is to introduces cper_*(), into ghes_edac_report_mem_error(),
> this can avoid bunch of duplicate code lines;
> - Patch 03 is to improve report format
>
> Changes since v6:
> - Rework patch 02 by Borislav Petkov
> - add patch 03 to improve format
> - Link: https://lore.kernel.org/r/[email protected]
>
> Changes since v5:
> - Delete change summary in commit log
> - Link: https://lore.kernel.org/all/[email protected]/
> - Thanks Borislav Petkov for review comments.
>
> Changes since v4:
> - Fix alignment and format problem
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Changes since v3:
>
> - make cper_mem_err_status_str table a lot more compact
> - Fix alignment and format problem
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Changes since v2:
>
> - delete the unified patch
> - adjusted the order of patches
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Changes since v1:
>
> - add a new patch to unify ghes and cper before removing duplication.
> - document the changes in patch description
> - add EXPORT_SYMBOL_GPL()s for cper_*()
> - document and the dependency and add UEFI_CPER dependency explicitly
> - Link: https://lore.kernel.org/all/[email protected]/
>
> Shuai Xue (3):
> efi/cper: add cper_mem_err_status_str to decode error description
> EDAC/ghes: Unify CPER memory error location reporting
> efi/cper: reformat CPER memory error location to more readable
>
> drivers/edac/Kconfig | 1 +
> drivers/edac/ghes_edac.c | 200 +++++++-----------------------------
> drivers/firmware/efi/cper.c | 64 ++++++++----
> include/linux/cper.h | 3 +
> 4 files changed, 87 insertions(+), 181 deletions(-)


Ping. I am wondering if you have any comments on this series of patches?

Best Regards,
Shuai

2022-04-02 16:49:45

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/cper: add cper_mem_err_status_str to decode error description

On Tue, Mar 08, 2022 at 10:40:51PM +0800, Shuai Xue wrote:
> Introduce a new helper function cper_mem_err_status_str() which is used to
> decode the description of error status, and the cper_print_mem() will call
> it and report the details of error status.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/firmware/efi/cper.c | 30 +++++++++++++++++++++++++++++-
> include/linux/cper.h | 1 +
> 2 files changed, 30 insertions(+), 1 deletion(-)

Ard, ack to take this one and patch 3 through the EDAC tree?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-04-07 01:24:02

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/cper: add cper_mem_err_status_str to decode error description

On Fri, 1 Apr 2022 at 22:44, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 10:40:51PM +0800, Shuai Xue wrote:
> > Introduce a new helper function cper_mem_err_status_str() which is used to
> > decode the description of error status, and the cper_print_mem() will call
> > it and report the details of error status.
> >
> > Signed-off-by: Shuai Xue <[email protected]>
> > ---
> > drivers/firmware/efi/cper.c | 30 +++++++++++++++++++++++++++++-
> > include/linux/cper.h | 1 +
> > 2 files changed, 30 insertions(+), 1 deletion(-)
>
> Ard, ack to take this one and patch 3 through the EDAC tree?
>

Works for me.

Acked-by: Ard Biesheuvel <[email protected]>

2022-04-11 10:06:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 0/3] EDAC/ghes: refactor memory error reporting to avoid code duplication

On Tue, Mar 08, 2022 at 10:40:50PM +0800, Shuai Xue wrote:
> Shuai Xue (3):
> efi/cper: add cper_mem_err_status_str to decode error description
> EDAC/ghes: Unify CPER memory error location reporting
> efi/cper: reformat CPER memory error location to more readable
>
> drivers/edac/Kconfig | 1 +
> drivers/edac/ghes_edac.c | 200 +++++++-----------------------------
> drivers/firmware/efi/cper.c | 64 ++++++++----
> include/linux/cper.h | 3 +
> 4 files changed, 87 insertions(+), 181 deletions(-)

Applied, thanks.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette