2022-04-19 13:37:11

by Yazen Ghannam

Subject: [PATCH 2/3] x86/MCE/APEI: Handle variable register array size

Recent AMD systems may provide an x86 Common Platform Error Record
(CPER) for errors reported in the ACPI Boot Error Record Table (BERT).
The x86 CPER may contain one or more Processor Context Information
Structures. The context structures may represent an x86 MSR range where
a starting address is given, and the data represents a contiguous set of
MSRs starting from, and including, the starting address.

It's common, for AMD systems that implement this behavior, that the MSR
range represents the MCAX register space used for the Scalable MCA
feature. The apei_smca_report_x86_error() function decodes and passes
this information through the MCE notifier chain. However, this function
assumes a fixed register size based on the original HW/FW
implementation.

This assumption breaks with the addition of two new MCAX registers:
MCA_SYND1 and MCA_SYND2. These registers are added at the end of the
MCAX register space, so they won't be included when decoding the CPER
data.

Rework apei_smca_report_x86_error() to support a variable register array
size. This covers any case where the MSR context information starts at
the MCAX address for MCA_STATUS and ends at any other register within
the MCAX register space.

Add code comments indicating the MCAX register at each offset.

Signed-off-by: Yazen Ghannam <[email protected]>
---
 arch/x86/kernel/cpu/mce/apei.c | 73 +++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 0e3ae64d3b76..7510cd88f7eb 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -55,7 +55,7 @@ EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
 {
 	const u64 *i_mce = ((const u64 *) (ctx_info + 1));
-	unsigned int cpu;
+	unsigned int cpu, num_registers;
 	struct mce m;

 	if (!boot_cpu_has(X86_FEATURE_SMCA))
@@ -74,16 +74,12 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
 		return -EINVAL;

 	/*
-	 * The register array size must be large enough to include all the
-	 * SMCA registers which need to be extracted.
-	 *
 	 * The number of registers in the register array is determined by
 	 * Register Array Size/8 as defined in UEFI spec v2.8, sec N.2.4.2.2.
-	 * The register layout is fixed and currently the raw data in the
-	 * register array includes 6 SMCA registers which the kernel can
-	 * extract.
+	 * Ensure that the array size includes at least 1 register.
 	 */
-	if (ctx_info->reg_arr_size < 48)
+	num_registers = ctx_info->reg_arr_size >> 3;
+	if (!num_registers)
 		return -EINVAL;

 	mce_setup(&m);
@@ -101,12 +97,61 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)

 	m.apicid = lapic_id;
 	m.bank = (ctx_info->msr_addr >> 4) & 0xFF;
-	m.status = *i_mce;
-	m.addr = *(i_mce + 1);
-	m.misc = *(i_mce + 2);
-	/* Skipping MCA_CONFIG */
-	m.ipid = *(i_mce + 4);
-	m.synd = *(i_mce + 5);
+
+	/*
+	 * The SMCA register layout is fixed and includes 16 registers.
+	 * The end of the array may be variable, but the beginning is known.
+	 * Switch on the number of registers. Cap the number of registers to
+	 * expected max (15).
+	 */
+	if (num_registers > 15)
+		num_registers = 15;
+
+	switch (num_registers) {
+	/* MCA_SYND2 */
+	case 15:
+		m.synd2 = *(i_mce + 14);
+		fallthrough;
+	/* MCA_SYND1 */
+	case 14:
+		m.synd1 = *(i_mce + 13);
+		fallthrough;
+	/* MCA_MISC4 */
+	case 13:
+	/* MCA_MISC3 */
+	case 12:
+	/* MCA_MISC2 */
+	case 11:
+	/* MCA_MISC1 */
+	case 10:
+	/* MCA_DEADDR */
+	case 9:
+	/* MCA_DESTAT */
+	case 8:
+	/* reserved */
+	case 7:
+	/* MCA_SYND */
+	case 6:
+		m.synd = *(i_mce + 5);
+		fallthrough;
+	/* MCA_IPID */
+	case 5:
+		m.ipid = *(i_mce + 4);
+		fallthrough;
+	/* MCA_CONFIG */
+	case 4:
+	/* MCA_MISC0 */
+	case 3:
+		m.misc = *(i_mce + 2);
+		fallthrough;
+	/* MCA_ADDR */
+	case 2:
+		m.addr = *(i_mce + 1);
+		fallthrough;
+	/* MCA_STATUS */
+	case 1:
+		m.status = *i_mce;
+	}

 	mce_log(&m);

--
2.25.1
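
For readers who want to try the decoding logic outside the kernel, here is a minimal userspace sketch of the fallthrough switch from the patch above. It is not the kernel code: struct mca_regs, decode_mca_regs() and MAX_REGS are hypothetical names made up for illustration, and the struct only stands in for the struct mce fields the patch touches. The array is assumed to start at MCA_STATUS, as the patch describes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the struct mce fields set by the patch. */
struct mca_regs {
	uint64_t status, addr, misc, ipid, synd, synd1, synd2;
};

/* MCA_STATUS through MCA_SYND2, matching the cap used in the patch. */
#define MAX_REGS 15

/*
 * Decode a register array that starts at MCA_STATUS and may end at any
 * later register. size_bytes plays the role of reg_arr_size.
 * Returns 0 on success, -1 if the array holds no complete register.
 */
static int decode_mca_regs(const uint64_t *regs, size_t size_bytes,
			   struct mca_regs *out)
{
	size_t n = size_bytes / 8;	/* 8 bytes per register */

	if (!n)
		return -1;
	if (n > MAX_REGS)
		n = MAX_REGS;

	memset(out, 0, sizeof(*out));

	/* Enter at the last register present and fall through to the first. */
	switch (n) {
	case 15:
		out->synd2 = regs[14];
		/* fall through */
	case 14:
		out->synd1 = regs[13];
		/* fall through */
	case 13: case 12: case 11: case 10: case 9: case 8: case 7:
	case 6:
		out->synd = regs[5];
		/* fall through */
	case 5:
		out->ipid = regs[4];
		/* fall through */
	case 4:
	case 3:
		out->misc = regs[2];
		/* fall through */
	case 2:
		out->addr = regs[1];
		/* fall through */
	case 1:
		out->status = regs[0];
	}

	return 0;
}
```

With a 48-byte array this fills the same five fields as the old fixed-size code; with 120 bytes or more it also picks up MCA_SYND1 and MCA_SYND2, and anything shorter simply leaves the later fields zeroed.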


2022-07-03 13:09:00

by Borislav Petkov

Subject: Re: [PATCH 2/3] x86/MCE/APEI: Handle variable register array size

On Mon, Apr 18, 2022 at 05:44:39PM +0000, Yazen Ghannam wrote:
> Recent AMD systems may provide an x86 Common Platform Error Record
> (CPER) for errors reported in the ACPI Boot Error Record Table (BERT).
> The x86 CPER may contain one or more Processor Context Information
> Structures. The context structures may represent an x86 MSR range where
> a starting address is given, and the data represents a contiguous set of
> MSRs starting from, and including, the starting address.

You're killing me with these "may" formulations. Just say it once and
then drop it. I mean, we know some future hw "may" support something
new - you can just as well drop the "may" thing because if it only may
and it turns out it might not, you don't even have to do the work of
enabling it and sending the patch.

So no need to do that - the patch commit message should talk purely
about functionality and not sound like some vendor doc - there are
enough of those.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-11 17:55:56

by Yazen Ghannam

Subject: Re: [PATCH 2/3] x86/MCE/APEI: Handle variable register array size

On Sun, Jul 03, 2022 at 02:30:24PM +0200, Borislav Petkov wrote:
> On Mon, Apr 18, 2022 at 05:44:39PM +0000, Yazen Ghannam wrote:
> > Recent AMD systems may provide an x86 Common Platform Error Record
> > (CPER) for errors reported in the ACPI Boot Error Record Table (BERT).
> > The x86 CPER may contain one or more Processor Context Information
> > Structures. The context structures may represent an x86 MSR range where
> > a starting address is given, and the data represents a contiguous set of
> > MSRs starting from, and including, the starting address.
>
> You're killing me with these "may" formulations. Just say it once and
> then drop it. I mean, we know some future hw "may" support something
> new - you can just as well drop the "may" thing because if it only may
> and it turns out it might not, you don't even have to do the work of
> enabling it and sending the patch.
>
> So no need to do that - the patch commit message should talk purely
> about functionality and not sound like some vendor doc - there are
> enough of those.
>

Understood.

Thanks,
Yazen
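
As a side note on the arithmetic in the patch earlier in this thread, the bank number and the register count both come from simple shifts. The sketch below only mirrors those two expressions. The helper names are hypothetical, the offset helper assumes 16 MSRs per bank (which is what the shift by 4 implies), and the example address in the comment is an illustrative value, not one taken from the patch.

```c
#include <stdint.h>

/* Bank number, as computed in the patch: (msr_addr >> 4) & 0xFF. */
static unsigned int smca_bank(uint32_t msr_addr)
{
	return (msr_addr >> 4) & 0xFF;
}

/*
 * Register offset within the bank. Assumption: 16 MSRs per bank, so the
 * low four bits of the address select the register.
 */
static unsigned int smca_offset(uint32_t msr_addr)
{
	return msr_addr & 0xF;
}

/* Number of complete 8-byte registers, as in the patch: size >> 3. */
static unsigned int reg_count(uint32_t reg_arr_size)
{
	return reg_arr_size >> 3;
}

/*
 * Illustrative values: an address of 0xC0002011 gives bank 1, offset 1.
 * A 48-byte array holds 6 registers and a 120-byte array holds 15.
 */
```

A size that is not a multiple of 8 is rounded down, so a trailing partial register is ignored and a size below 8 yields zero registers, which the patch rejects with -EINVAL.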