2022-04-06 12:05:47

by Bilbao, Carlos

Subject: [PATCH 0/2] x86/mce: Simplify AMD MCEs severity grading and include messages for panic cases

This patchset simplifies AMD's MCE severity grading logic in
mce_severity_amd(), which helps the MCE handler determine what action to
take. If an error is graded as PANIC, the EDAC driver does not decode it,
so the patchset also adds error messages that describe the MCE and help
debug critical errors.

Carlos Bilbao (2):
  x86/mce: Simplify AMD severity grading logic
  x86/mce: Add messages for panic errors in AMD's MCE grading
---
arch/x86/kernel/cpu/mce/severity.c | 113 ++++++++++++-----------------
1 file changed, 48 insertions(+), 65 deletions(-)

--
2.31.1


2022-04-06 12:28:24

by Bilbao, Carlos

Subject: [PATCH 2/2] x86/mce: Add messages for panic errors in AMD's MCE grading

When a machine error is graded as PANIC by AMD grading logic, the MCE
handler calls mce_panic(). The notification chain does not come into effect
so the AMD EDAC driver does not decode the errors. In these cases, the
messages displayed to the user are more cryptic and miss information
that might be relevant, like the context in which the error took place.

Fix the above issue by adding messages to AMD's grading logic for machine
errors graded as PANIC.

Signed-off-by: Carlos Bilbao <[email protected]>
---
arch/x86/kernel/cpu/mce/severity.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 25aec5a27899..c09fa4f01616 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -306,6 +306,7 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
  */
 static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
+	char *panic_msg = NULL;
 	int ret;

 	/*
@@ -316,6 +317,7 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **

 	/* Processor Context Corrupt, no need to fumble too much, die! */
 	if (m->status & MCI_STATUS_PCC) {
+		panic_msg = "Processor Context Corrupt";
 		ret = MCE_PANIC_SEVERITY;
 		goto out_amd_severity;
 	}
@@ -339,20 +341,27 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
 	 * system will not be able to recover.
 	 */
 	if ((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov) {
+		panic_msg = "Overflowed uncorrected error without MCA Overflow Recovery";
 		ret = MCE_PANIC_SEVERITY;
 		goto out_amd_severity;
 	}

 	if (!mce_flags.succor) {
+		panic_msg = "Uncorrected error without MCA Recovery";
 		ret = MCE_PANIC_SEVERITY;
 		goto out_amd_severity;
 	}

-	if (error_context(m, regs) == IN_KERNEL)
+	if (error_context(m, regs) == IN_KERNEL) {
+		panic_msg = "Uncorrected unrecoverable error in kernel context";
 		ret = MCE_PANIC_SEVERITY;
+	}

 out_amd_severity:

+	if (msg && panic_msg)
+		*msg = panic_msg;
+
 	return ret;
 }

--
2.31.1
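
For readers who want to try the grading order outside the kernel, the
following is a minimal userspace sketch of the four PANIC checks touched by
the patch above. It is not the kernel function: struct fake_mce, struct
fake_flags, enum severity, enum context and grade() are hypothetical
stand-ins, SEV_OTHER is a placeholder for every non-panic grade, the STATUS_*
bit positions are assumed to match the kernel's MCI_STATUS_* definitions, and
the checks that sit between the diff hunks (not shown in the patch) are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions, modelled on MCI_STATUS_OVER and MCI_STATUS_PCC. */
#define STATUS_OVER (1ULL << 62)
#define STATUS_PCC  (1ULL << 57)

/* Hypothetical stand-ins for the kernel types used by mce_severity_amd(). */
enum severity { SEV_OTHER, SEV_PANIC };   /* SEV_OTHER: any non-panic grade */
enum context  { CTX_USER, CTX_KERNEL };   /* models error_context() result */

struct fake_mce   { uint64_t status; };
struct fake_flags { bool overflow_recov; bool succor; };

/*
 * Grade an error using only the checks visible in the diff. On a PANIC
 * grade, *msg (if msg is non-NULL) is set to a string naming the reason.
 */
static enum severity grade(const struct fake_mce *m,
			   const struct fake_flags *f,
			   enum context ctx, const char **msg)
{
	const char *panic_msg = NULL;
	enum severity ret = SEV_OTHER;

	assert(m && f);

	if (m->status & STATUS_PCC) {
		panic_msg = "Processor Context Corrupt";
		ret = SEV_PANIC;
		goto out;
	}

	/* The checks between the diff hunks are omitted in this sketch. */

	if ((m->status & STATUS_OVER) && !f->overflow_recov) {
		panic_msg = "Overflowed uncorrected error without MCA Overflow Recovery";
		ret = SEV_PANIC;
		goto out;
	}

	if (!f->succor) {
		panic_msg = "Uncorrected error without MCA Recovery";
		ret = SEV_PANIC;
		goto out;
	}

	if (ctx == CTX_KERNEL) {
		panic_msg = "Uncorrected unrecoverable error in kernel context";
		ret = SEV_PANIC;
	}

out:
	/* Only overwrite the caller's message when a panic reason was found. */
	if (msg && panic_msg)
		*msg = panic_msg;

	return ret;
}
```

A caller passes the address of a const char * initialized to NULL. After
grade() returns SEV_PANIC the pointer names the reason; for any other grade
it is left untouched, which mirrors the "if (msg && panic_msg)" guard in the
patch.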

2022-04-11 09:00:10

by Yazen Ghannam

Subject: Re: [PATCH 2/2] x86/mce: Add messages for panic errors in AMD's MCE grading

On Tue, Apr 05, 2022 at 01:32:14PM -0500, Carlos Bilbao wrote:
> When a machine error is graded as PANIC by AMD grading logic, the MCE
> handler calls mce_panic(). The notification chain does not come into effect
> so the AMD EDAC driver does not decode the errors. In these cases, the
> messages displayed to the user are more cryptic and miss information
> that might be relevant, like the context in which the error took place.
>
> Fix the above issue including messages on AMD's grading logic for machine
> errors graded as PANIC.
>
> Signed-off-by: Carlos Bilbao <[email protected]>
> ---

Reviewed-by: Yazen Ghannam <[email protected]>

Thanks!

-Yazen

Subject: [tip: ras/core] x86/mce: Add messages for panic errors in AMD's MCE grading

The following commit has been merged into the ras/core branch of tip:

Commit-ID: fa619f5156cf1ee3773edc6d756be262c9ef35de
Gitweb: https://git.kernel.org/tip/fa619f5156cf1ee3773edc6d756be262c9ef35de
Author: Carlos Bilbao <[email protected]>
AuthorDate: Tue, 05 Apr 2022 13:32:14 -05:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Mon, 25 Apr 2022 12:40:48 +02:00

x86/mce: Add messages for panic errors in AMD's MCE grading

When a machine error is graded as PANIC by the AMD grading logic, the
MCE handler calls mce_panic(). The notification chain does not come
into effect so the AMD EDAC driver does not decode the errors. In these
cases, the messages displayed to the user are more cryptic and miss
information that might be relevant, like the context in which the error
took place.

Add messages to the grading logic for machine errors so that it is clear
what error it was.

[ bp: Massage commit message. ]

Signed-off-by: Carlos Bilbao <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/mce/severity.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index d842148..00483d1 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -304,6 +304,7 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 /* See AMD PPR(s) section Machine Check Error Handling. */
 static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
+	char *panic_msg = NULL;
 	int ret;

 	/*
@@ -314,6 +315,7 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **

 	/* Processor Context Corrupt, no need to fumble too much, die! */
 	if (m->status & MCI_STATUS_PCC) {
+		panic_msg = "Processor Context Corrupt";
 		ret = MCE_PANIC_SEVERITY;
 		goto out;
 	}
@@ -337,19 +339,26 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
 	 * system will not be able to recover, panic.
 	 */
 	if ((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov) {
+		panic_msg = "Overflowed uncorrected error without MCA Overflow Recovery";
 		ret = MCE_PANIC_SEVERITY;
 		goto out;
 	}

 	if (!mce_flags.succor) {
+		panic_msg = "Uncorrected error without MCA Recovery";
 		ret = MCE_PANIC_SEVERITY;
 		goto out;
 	}

-	if (error_context(m, regs) == IN_KERNEL)
+	if (error_context(m, regs) == IN_KERNEL) {
+		panic_msg = "Uncorrected unrecoverable error in kernel context";
 		ret = MCE_PANIC_SEVERITY;
+	}

 out:
+	if (msg && panic_msg)
+		*msg = panic_msg;
+
 	return ret;
 }