Hello,
This patchset adds support for two new "syndrome" registers used in
future AMD Scalable MCA (SMCA) systems and also a new "FRU Text in MCA"
feature that uses these new registers.
Patch 1 adds support for the new registers. They are read/printed
wherever the existing MCA_SYND register is used.
Patch 2 updates the function that pulls MCA information from UEFI x86
Common Platform Error Records (CPERs) to handle systems that support the
new registers.
Patch 3 adds support to the AMD MCE decoder module to detect and use the
"FRU Text in MCA" feature which leverages the new registers.
Thanks,
Yazen
Yazen Ghannam (3):
x86/MCE, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers
x86/MCE/APEI: Handle variable register array size
EDAC/mce_amd: Add support for FRU Text in MCA
arch/x86/include/asm/mce.h | 5 +++
arch/x86/include/uapi/asm/mce.h | 2 +
arch/x86/kernel/cpu/mce/amd.c | 5 ++-
arch/x86/kernel/cpu/mce/apei.c | 73 ++++++++++++++++++++++++++-------
arch/x86/kernel/cpu/mce/core.c | 9 +++-
drivers/edac/mce_amd.c | 24 ++++++++---
include/trace/events/mce.h | 7 +++-
7 files changed, 103 insertions(+), 22 deletions(-)
--
2.25.1
A new "FRU Text in MCA" feature is defined where the Field Replaceable
Unit (FRU) Text for a device is represented by a string in the new
MCA_SYND1 and MCA_SYND2 registers. This feature is supported per MCA
bank, and it is advertised by the McaFruTextInMca bit (MCA_CONFIG[9]).
The FRU Text is populated dynamically for each individual error state
(MCA_STATUS, MCA_ADDR, et al.). This handles the case where an MCA bank
covers multiple devices, for example, a Unified Memory Controller (UMC)
bank that manages two DIMMs.
Print the FRU Text string, if available, when decoding an MCA error.
Signed-off-by: Yazen Ghannam <[email protected]>
---
drivers/edac/mce_amd.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 28b48c711fe0..3cacc3f22379 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1235,6 +1235,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
{
struct mce *m = (struct mce *)data;
unsigned int fam = x86_family(m->cpuid);
+ u64 mca_config = 0;
int ecc;
if (m->kflags & MCE_HANDLED_CEC)
@@ -1254,11 +1255,10 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
((m->status & MCI_STATUS_PCC) ? "PCC" : "-"));
if (boot_cpu_has(X86_FEATURE_SMCA)) {
- u32 low, high;
u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
- if (!rdmsr_safe(addr, &low, &high) &&
- (low & MCI_CONFIG_MCAX))
+ if (!rdmsrl_safe_on_cpu(m->extcpu, addr, &mca_config) &&
+ (mca_config & MCI_CONFIG_MCAX))
pr_cont("|%s", ((m->status & MCI_STATUS_TCC) ? "TCC" : "-"));
pr_cont("|%s", ((m->status & MCI_STATUS_SYNDV) ? "SyndV" : "-"));
@@ -1300,6 +1300,17 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
pr_cont("\n");
decode_smca_error(m);
+
+ if (mca_config & BIT(9)) {
+ char frutext[32];
+
+ memset(frutext, 0, sizeof(frutext));
+ memcpy(&frutext[0], &m->synd1, 8);
+ memcpy(&frutext[8], &m->synd2, 8);
+
+ pr_emerg(HW_ERR "FRU Text: %s\n", frutext);
+ }
+
goto err_code;
}
--
2.25.1
On Mon, Apr 18, 2022 at 05:44:37PM +0000, Yazen Ghannam wrote:
> Hello,
>
> This patchset adds support for two new "syndrome" registers used in
> future AMD Scalable MCA (SMCA) systems and also a new "FRU Text in MCA"
> feature that uses these new registers.
>
> Patch 1 adds support for the new registers. They are read/printed
> wherever the existing MCA_SYND register is used.
>
> Patch 2 updates the function that pulls MCA information from UEFI x86
> Common Platform Error Records (CPERs) to handle systems that support the
> new registers.
>
> Patch 3 adds support to the AMD MCE decoder module to detect and use the
> "FRU Text in MCA" feature which leverages the new registers.
>
> Thanks,
> Yazen
>
> Yazen Ghannam (3):
> x86/MCE, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers
> x86/MCE/APEI: Handle variable register array size
> EDAC/mce_amd: Add support for FRU Text in MCA
>
> arch/x86/include/asm/mce.h | 5 +++
> arch/x86/include/uapi/asm/mce.h | 2 +
> arch/x86/kernel/cpu/mce/amd.c | 5 ++-
> arch/x86/kernel/cpu/mce/apei.c | 73 ++++++++++++++++++++++++++-------
> arch/x86/kernel/cpu/mce/core.c | 9 +++-
> drivers/edac/mce_amd.c | 24 ++++++++---
> include/trace/events/mce.h | 7 +++-
> 7 files changed, 103 insertions(+), 22 deletions(-)
>
> --
Hi everyone,
Any comments on this set? Thanks!
-Yazen
On Mon, Apr 18, 2022 at 05:44:40PM +0000, Yazen Ghannam wrote:
> @@ -1254,11 +1255,10 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
> ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"));
>
> if (boot_cpu_has(X86_FEATURE_SMCA)) {
> - u32 low, high;
> u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
>
> - if (!rdmsr_safe(addr, &low, &high) &&
> - (low & MCI_CONFIG_MCAX))
> + if (!rdmsrl_safe_on_cpu(m->extcpu, addr, &mca_config) &&
This change needs to be mentioned in the commit message.
> + (mca_config & MCI_CONFIG_MCAX))
> pr_cont("|%s", ((m->status & MCI_STATUS_TCC) ? "TCC" : "-"));
>
> pr_cont("|%s", ((m->status & MCI_STATUS_SYNDV) ? "SyndV" : "-"));
> @@ -1300,6 +1300,17 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
> pr_cont("\n");
So here above the code prints SYND1 and SYND2.
If they contain FRU strings, then this printing should be an if-else by
checking bit 9.
if (BIT(9))
print fru text
else
print naked syndromes
> decode_smca_error(m);
> +
> + if (mca_config & BIT(9)) {
> + char frutext[32];
Why 32?
> + memset(frutext, 0, sizeof(frutext));
> + memcpy(&frutext[0], &m->synd1, 8);
> + memcpy(&frutext[8], &m->synd2, 8);
> +
> + pr_emerg(HW_ERR "FRU Text: %s\n", frutext);
> + }
> +
> goto err_code;
> }
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Jul 04, 2022 at 11:13:56AM +0200, Borislav Petkov wrote:
> On Mon, Apr 18, 2022 at 05:44:40PM +0000, Yazen Ghannam wrote:
> > @@ -1254,11 +1255,10 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
> > ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"));
> >
> > if (boot_cpu_has(X86_FEATURE_SMCA)) {
> > - u32 low, high;
> > u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
> >
> > - if (!rdmsr_safe(addr, &low, &high) &&
> > - (low & MCI_CONFIG_MCAX))
> > + if (!rdmsrl_safe_on_cpu(m->extcpu, addr, &mca_config) &&
>
> This change needs to be mentioned in the commit message.
>
Okay.
> > + (mca_config & MCI_CONFIG_MCAX))
> > pr_cont("|%s", ((m->status & MCI_STATUS_TCC) ? "TCC" : "-"));
> >
> > pr_cont("|%s", ((m->status & MCI_STATUS_SYNDV) ? "SyndV" : "-"));
> > @@ -1300,6 +1300,17 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
> > pr_cont("\n");
>
> So here above the code prints SYND1 and SYND2.
>
> If they contain FRU strings, then this printing should be an if-else by
> checking bit 9.
>
> if (BIT(9))
> print fru text
> else
> print naked syndromes
>
>
Okay, will change.
> > decode_smca_error(m);
> > +
> > + if (mca_config & BIT(9)) {
> > + char frutext[32];
>
> Why 32?
>
I picked the next power of 2 after 16 in order to have some space for
terminating NULL in the string. I can't think of a good reason it needs to be
a power of 2, so I can change this to 17.
> > + memset(frutext, 0, sizeof(frutext));
> > + memcpy(&frutext[0], &m->synd1, 8);
> > + memcpy(&frutext[8], &m->synd2, 8);
> > +
> > + pr_emerg(HW_ERR "FRU Text: %s\n", frutext);
> > + }
> > +
> > goto err_code;
> > }
>
Thanks,
Yazen