2024-04-04 15:15:22

by Yazen Ghannam

Subject: [PATCH v2 00/16] MCA Updates

Hi all,

This set is a collection of logically independent updates that make
changes to common code. I've collected them here to resolve conflicts
and ordering between them. Furthermore, this is the first half of a
larger set. The second half is focused on refactoring the AMD MCA
Thresholding feature support, so I decided to leave it out for now. It
will also include AMD interrupt storm handling support on top of the
refactored code. Please see the link below for a work-in-progress
branch with the remaining changes.

Patches 1-2 deal with BERT MCA decode and preemption.

Patches 3-8 are general refactoring in preparation for later patches in
this set and the second planned set. The overall theme is to simplify
the AMD MCA init flow and to remove unnecessary data caching in per-CPU
variables. The init flow refactor will be completed in the second patch
set, since much of the cached data is used to set up MCA Thresholding.

Patches 9-10 unify the AMD THR and DFR interrupt handlers with MCA
polling.

Patch 11 is a small cleanup for the MCA Thresholding init path.

Patch 12 adds support for a new Corrected Error Interrupt on Scalable
MCA systems.

Patches 13-16 add support for new Scalable MCA registers and FRU Text
decoding feature.

Thanks,
Yazen

Branch for this set:
https://github.com/AMDESE/linux/tree/mca-updates-v2

Branch for remaining changes (work-in-progress):
https://github.com/AMDESE/linux/tree/wip-mca

Link:
https://lkml.kernel.org/r/[email protected]

Avadhut Naik (2):
x86/mce: Add wrapper for struct mce to export vendor specific info
x86/mce, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers

Yazen Ghannam (14):
x86/mce: Define mce_setup() helpers for common and per-CPU fields
x86/mce: Use mce_setup() helpers for apei_smca_report_x86_error()
x86/mce/amd: Use fixed bank number for quirks
x86/mce/amd: Look up bank type by IPID
x86/mce/amd: Clean up SMCA configuration
x86/mce/amd: Prep DFR handler before enabling banks
x86/mce/amd: Simplify DFR handler setup
x86/mce/amd: Clean up enable_deferred_error_interrupt()
x86/mce: Unify AMD THR handler with MCA Polling
x86/mce: Unify AMD DFR handler with MCA Polling
x86/mce: Skip AMD threshold init if no threshold banks found
x86/mce/amd: Support SMCA Corrected Error Interrupt
x86/mce/apei: Handle variable register array size
EDAC/mce_amd: Add support for FRU Text in MCA

arch/x86/include/asm/mce.h | 24 +-
arch/x86/kernel/cpu/mce/amd.c | 461 ++++++++++++++----------
arch/x86/kernel/cpu/mce/apei.c | 124 +++++--
arch/x86/kernel/cpu/mce/core.c | 253 ++++++++-----
arch/x86/kernel/cpu/mce/dev-mcelog.c | 2 +-
arch/x86/kernel/cpu/mce/genpool.c | 20 +-
arch/x86/kernel/cpu/mce/inject.c | 4 +-
arch/x86/kernel/cpu/mce/internal.h | 13 +-
drivers/acpi/acpi_extlog.c | 2 +-
drivers/acpi/nfit/mce.c | 3 +-
drivers/edac/amd64_edac.c | 2 +-
drivers/edac/i7core_edac.c | 2 +-
drivers/edac/igen6_edac.c | 2 +-
drivers/edac/mce_amd.c | 29 +-
drivers/edac/pnd2_edac.c | 2 +-
drivers/edac/sb_edac.c | 2 +-
drivers/edac/skx_common.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +-
drivers/ras/amd/fmpm.c | 2 +-
drivers/ras/cec.c | 3 +-
include/trace/events/mce.h | 51 +--
21 files changed, 620 insertions(+), 387 deletions(-)


base-commit: f382ab1037497f49d290ce6ceb9cdb10b186682e
--
2.34.1



2024-04-04 15:15:58

by Yazen Ghannam

Subject: [PATCH v2 08/16] x86/mce/amd: Clean up enable_deferred_error_interrupt()

Switch to bitops to help with clarity. Also, avoid an unnecessary
wrmsr() for SMCA systems.

Use the updated name for MSR 0xC000_0410 to match the documentation for
Family 0x17 and later systems.

This MSR is used for setting up both the Deferred and MCA Thresholding
interrupts on current systems. So read it once during init and pass it
to the functions that need it. Start with the Deferred error interrupt
case. The MCA Thresholding interrupt case will be handled during the
later refactoring.
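
For reference, a condensed sketch of the resulting flow (trimmed from
the hunks below):

  /* Read the config MSR once during init; 0 means "not available". */
  static u64 get_mca_intr_cfg(void)
  {
          u64 mca_intr_cfg;

          if (!mce_flags.succor)
                  return 0;

          if (rdmsrl_safe(MSR_MCA_INTR_CFG, &mca_intr_cfg))
                  return 0;

          return mca_intr_cfg;
  }

  /* In enable_deferred_error_interrupt(), extract the LVT offset with bitops: */
  dfr_offset = FIELD_GET(INTR_CFG_DFR_LVT_OFFSET, mca_intr_cfg);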

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Remove invalid SMCA check in get_mca_intr_cfg(). (Yazen)

arch/x86/kernel/cpu/mce/amd.c | 46 +++++++++++++++++++++++------------
1 file changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 32628a30a5c1..f59f4a1c9b21 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -44,11 +44,11 @@
#define MASK_BLKPTR_LO 0xFF000000
#define MCG_XBLK_ADDR 0xC0000400

-/* Deferred error settings */
+/* MCA Interrupt Configuration register, one per CPU */
#define MSR_CU_DEF_ERR 0xC0000410
-#define MASK_DEF_LVTOFF 0x000000F0
-#define MASK_DEF_INT_TYPE 0x00000006
-#define DEF_INT_TYPE_APIC 0x2
+#define MSR_MCA_INTR_CFG 0xC0000410
+#define INTR_CFG_DFR_LVT_OFFSET GENMASK_ULL(7, 4)
+#define INTR_CFG_LEGACY_DFR_INTR_TYPE GENMASK_ULL(2, 1)
#define INTR_TYPE_APIC 0x1

/* Scalable MCA: */
@@ -574,30 +574,30 @@ static int setup_APIC_mce_threshold(int reserved, int new)
return reserved;
}

-static void enable_deferred_error_interrupt(void)
+static void enable_deferred_error_interrupt(u64 mca_intr_cfg)
{
- u32 low = 0, high = 0, def_new;
+ u8 dfr_offset;

- if (!mce_flags.succor)
- return;
-
- if (rdmsr_safe(MSR_CU_DEF_ERR, &low, &high))
+ if (!mca_intr_cfg)
return;

/*
* Trust the value from hardware.
* If there's a conflict, then setup_APIC_eilvt() will throw an error.
*/
- def_new = (low & MASK_DEF_LVTOFF) >> 4;
- if (setup_APIC_eilvt(def_new, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
+ dfr_offset = FIELD_GET(INTR_CFG_DFR_LVT_OFFSET, mca_intr_cfg);
+ if (setup_APIC_eilvt(dfr_offset, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
return;

deferred_error_int_vector = amd_deferred_error_interrupt;

- if (!mce_flags.smca)
- low = (low & ~MASK_DEF_INT_TYPE) | DEF_INT_TYPE_APIC;
+ if (mce_flags.smca)
+ return;

- wrmsr(MSR_CU_DEF_ERR, low, high);
+ mca_intr_cfg &= ~INTR_CFG_LEGACY_DFR_INTR_TYPE;
+ mca_intr_cfg |= FIELD_PREP(INTR_CFG_LEGACY_DFR_INTR_TYPE, INTR_TYPE_APIC);
+
+ wrmsrl(MSR_MCA_INTR_CFG, mca_intr_cfg);
}

static u32 smca_get_block_address(unsigned int bank, unsigned int block,
@@ -751,14 +751,28 @@ static void disable_err_thresholding(struct cpuinfo_x86 *c, unsigned int bank)
wrmsrl(MSR_K7_HWCR, hwcr);
}

+static u64 get_mca_intr_cfg(void)
+{
+ u64 mca_intr_cfg;
+
+ if (!mce_flags.succor)
+ return 0;
+
+ if (rdmsrl_safe(MSR_MCA_INTR_CFG, &mca_intr_cfg))
+ return 0;
+
+ return mca_intr_cfg;
+}
+
/* cpu init entry point, called from mce.c with preempt off */
void mce_amd_feature_init(struct cpuinfo_x86 *c)
{
unsigned int bank, block, cpu = smp_processor_id();
+ u64 mca_intr_cfg = get_mca_intr_cfg();
u32 low = 0, high = 0, address = 0;
int offset = -1;

- enable_deferred_error_interrupt();
+ enable_deferred_error_interrupt(mca_intr_cfg);

for (bank = 0; bank < this_cpu_read(mce_num_banks); ++bank) {
if (mce_flags.smca)
--
2.34.1


2024-04-04 15:16:41

by Yazen Ghannam

Subject: [PATCH v2 11/16] x86/mce: Skip AMD threshold init if no threshold banks found

AMD systems optionally support MCA Thresholding. This feature is
discovered by checking capability bits in the MCA_MISC* registers.

Currently, MCA Thresholding is set up in two passes. The first is during
CPU init where available banks are detected, and the "bank_map" variable
is updated. The second is during sysfs/device init when the thresholding
data structures are allocated and hardware is fully configured.

During device init, the "threshold_banks" array is allocated even if no
available banks were discovered. Furthermore, the thresholding reset
flow checks if the top-level "threshold_banks" array is non-NULL, but it
doesn't check if individual "threshold_bank" structures are non-NULL.
This has not been an issue so far because the hardware interrupt is not
enabled in this case. But it does become a problem if the interrupt is
enabled while the thresholding data structures are not initialized.

Check "bank_map" to determine if the thresholding structures should be
allocated and initialized. Also, remove "mce_flags.amd_threshold" which
is redundant when checking "bank_map".
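
The device init check is then just (condensed from the hunk below):

  /* Skip thresholding device setup if CPU init found no banks. */
  if (!this_cpu_read(bank_map))
          return 0;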

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Update mce_vendor_flags reserved bits. (Yazen)

arch/x86/kernel/cpu/mce/amd.c | 2 +-
arch/x86/kernel/cpu/mce/core.c | 1 -
arch/x86/kernel/cpu/mce/internal.h | 5 +----
3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 40912c5e35d1..08ee647cb6ce 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -1455,7 +1455,7 @@ int mce_threshold_create_device(unsigned int cpu)
struct threshold_bank **bp;
int err;

- if (!mce_flags.amd_threshold)
+ if (!this_cpu_read(bank_map))
return 0;

bp = this_cpu_read(threshold_banks);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 308766868f39..17cf5a9df3cd 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2024,7 +2024,6 @@ static void __mcheck_cpu_init_early(struct cpuinfo_x86 *c)
mce_flags.overflow_recov = !!cpu_has(c, X86_FEATURE_OVERFLOW_RECOV);
mce_flags.succor = !!cpu_has(c, X86_FEATURE_SUCCOR);
mce_flags.smca = !!cpu_has(c, X86_FEATURE_SMCA);
- mce_flags.amd_threshold = 1;
}
}

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 96b108175ca2..9802f7c6cc93 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -214,9 +214,6 @@ struct mce_vendor_flags {
/* Zen IFU quirk */
zen_ifu_quirk : 1,

- /* AMD-style error thresholding banks present. */
- amd_threshold : 1,
-
/* Pentium, family 5-style MCA */
p5 : 1,

@@ -229,7 +226,7 @@ struct mce_vendor_flags {
/* Skylake, Cascade Lake, Cooper Lake REP;MOVS* quirk */
skx_repmov_quirk : 1,

- __reserved_0 : 55;
+ __reserved_0 : 56;
};

extern struct mce_vendor_flags mce_flags;
--
2.34.1


2024-04-04 15:16:42

by Yazen Ghannam

Subject: [PATCH v2 12/16] x86/mce/amd: Support SMCA Corrected Error Interrupt

AMD systems optionally support MCA Thresholding which provides the
ability for hardware to send an interrupt when a set error threshold is
reached. This feature counts errors of all severities, but it is
commonly used to report correctable errors with an interrupt rather than
polling.

Scalable MCA systems allow the Platform to take control of this feature.
In this case, the OS will not see the feature configuration and control
bits in the MCA_MISC* registers. The OS will not receive the MCA
Thresholding interrupt, and it will need to poll for correctable errors.

A "corrected error interrupt" will be available on Scalable MCA systems.
This will be used in the same configuration where the Platform controls
MCA Thresholding. However, the Platform will now be able to send the
MCA Thresholding interrupt to the OS.

Check for the feature bit in the MCA_CONFIG register and attempt to set
up the MCA Thresholding interrupt handler. If successful, set the feature
enable bit in the MCA_CONFIG register to indicate to the Platform that
the OS is ready for the interrupt.
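
A condensed sketch of that check, pulled from the hunks below:

  static bool smca_thr_handler_enabled(u64 mca_intr_cfg)
  {
          u8 offset = FIELD_GET(INTR_CFG_THR_LVT_OFFSET, mca_intr_cfg);

          if (setup_APIC_eilvt(offset, THRESHOLD_APIC_VECTOR, APIC_EILVT_MSG_FIX, 0))
                  return false;

          mce_threshold_vector = amd_threshold_interrupt;
          return true;
  }

  /* In configure_smca(): advertise readiness only once the handler is set up. */
  if (FIELD_GET(CFG_CE_INT_PRESENT, mca_config) && smca_thr_handler_enabled(mca_intr_cfg))
          mca_config |= FIELD_PREP(CFG_CE_INT_EN, 0x1);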

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Rebase on earlier changes. (Yazen)

arch/x86/kernel/cpu/mce/amd.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 08ee647cb6ce..a81d911d608e 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -47,6 +47,7 @@
/* MCA Interrupt Configuration register, one per CPU */
#define MSR_CU_DEF_ERR 0xC0000410
#define MSR_MCA_INTR_CFG 0xC0000410
+#define INTR_CFG_THR_LVT_OFFSET GENMASK_ULL(15, 12)
#define INTR_CFG_DFR_LVT_OFFSET GENMASK_ULL(7, 4)
#define INTR_CFG_LEGACY_DFR_INTR_TYPE GENMASK_ULL(2, 1)
#define INTR_TYPE_APIC 0x1
@@ -58,8 +59,10 @@
#define MCI_IPID_HWID_OLD 0xFFF

/* MCA_CONFIG register, one per MCA bank */
+#define CFG_CE_INT_EN BIT_ULL(40)
#define CFG_DFR_INT_TYPE GENMASK_ULL(38, 37)
#define CFG_MCAX_EN BIT_ULL(32)
+#define CFG_CE_INT_PRESENT BIT_ULL(10)
#define CFG_LSB_IN_STATUS BIT_ULL(8)
#define CFG_DFR_INT_SUPP BIT_ULL(5)
#define CFG_DFR_LOG_SUPP BIT_ULL(2)
@@ -352,6 +355,17 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu)

}

+static bool smca_thr_handler_enabled(u64 mca_intr_cfg)
+{
+ u8 offset = FIELD_GET(INTR_CFG_THR_LVT_OFFSET, mca_intr_cfg);
+
+ if (setup_APIC_eilvt(offset, THRESHOLD_APIC_VECTOR, APIC_EILVT_MSG_FIX, 0))
+ return false;
+
+ mce_threshold_vector = amd_threshold_interrupt;
+ return true;
+}
+
/* SMCA sets the Deferred Error Interrupt type per bank. */
static void configure_smca_dfr(unsigned int bank, u64 *mca_config)
{
@@ -375,7 +389,7 @@ static void configure_smca_dfr(unsigned int bank, u64 *mca_config)
}

/* Set appropriate bits in MCA_CONFIG. */
-static void configure_smca(unsigned int bank)
+static void configure_smca(unsigned int bank, u64 mca_intr_cfg)
{
u64 mca_config;

@@ -399,6 +413,9 @@ static void configure_smca(unsigned int bank)
if (FIELD_GET(CFG_LSB_IN_STATUS, mca_config))
this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = true;

+ if (FIELD_GET(CFG_CE_INT_PRESENT, mca_config) && smca_thr_handler_enabled(mca_intr_cfg))
+ mca_config |= FIELD_PREP(CFG_CE_INT_EN, 0x1);
+
wrmsrl(MSR_AMD64_SMCA_MCx_CONFIG(bank), mca_config);
}

@@ -791,7 +808,7 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
if (mce_flags.smca)
smca_configure_old(bank, cpu);

- configure_smca(bank);
+ configure_smca(bank, mca_intr_cfg);
disable_err_thresholding(c, bank);

for (block = 0; block < NR_BLOCKS; ++block) {
--
2.34.1


2024-04-04 15:16:44

by Yazen Ghannam

Subject: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

AMD systems with the SUCCOR feature can send an APIC LVT interrupt for
deferred errors. The LVT offset is 0x2 by convention, i.e. this is the
default as listed in hardware documentation.

However, the MCA registers may list a different LVT offset for this
interrupt. The kernel should honor the value from the hardware.

Simplify the enable flow by using the hardware-provided value. Any
conflicts will be caught by setup_APIC_eilvt(). Conflicts on production
systems can be handled as quirks, if needed.

Also, rename the function using a "verb-first" style.
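
The simplified enable flow reduces to roughly this (condensed from the
hunk below):

  /* Trust the hardware-provided offset; setup_APIC_eilvt() reports any conflict. */
  def_new = (low & MASK_DEF_LVTOFF) >> 4;
  if (setup_APIC_eilvt(def_new, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
          return;

  deferred_error_int_vector = amd_deferred_error_interrupt;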

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* No change.

arch/x86/kernel/cpu/mce/amd.c | 33 ++++++++++-----------------------
1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index e8e78d91082b..32628a30a5c1 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -48,7 +48,6 @@
#define MSR_CU_DEF_ERR 0xC0000410
#define MASK_DEF_LVTOFF 0x000000F0
#define MASK_DEF_INT_TYPE 0x00000006
-#define DEF_LVT_OFF 0x2
#define DEF_INT_TYPE_APIC 0x2
#define INTR_TYPE_APIC 0x1

@@ -575,19 +574,9 @@ static int setup_APIC_mce_threshold(int reserved, int new)
return reserved;
}

-static int setup_APIC_deferred_error(int reserved, int new)
+static void enable_deferred_error_interrupt(void)
{
- if (reserved < 0 && !setup_APIC_eilvt(new, DEFERRED_ERROR_VECTOR,
- APIC_EILVT_MSG_FIX, 0))
- return new;
-
- return reserved;
-}
-
-static void deferred_error_interrupt_enable(struct cpuinfo_x86 *c)
-{
- u32 low = 0, high = 0;
- int def_offset = -1, def_new;
+ u32 low = 0, high = 0, def_new;

if (!mce_flags.succor)
return;
@@ -595,17 +584,15 @@ static void deferred_error_interrupt_enable(struct cpuinfo_x86 *c)
if (rdmsr_safe(MSR_CU_DEF_ERR, &low, &high))
return;

+ /*
+ * Trust the value from hardware.
+ * If there's a conflict, then setup_APIC_eilvt() will throw an error.
+ */
def_new = (low & MASK_DEF_LVTOFF) >> 4;
- if (!(low & MASK_DEF_LVTOFF)) {
- pr_err(FW_BUG "Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.\n");
- def_new = DEF_LVT_OFF;
- low = (low & ~MASK_DEF_LVTOFF) | (DEF_LVT_OFF << 4);
- }
+ if (setup_APIC_eilvt(def_new, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
+ return;

- def_offset = setup_APIC_deferred_error(def_offset, def_new);
- if ((def_offset == def_new) &&
- (deferred_error_int_vector != amd_deferred_error_interrupt))
- deferred_error_int_vector = amd_deferred_error_interrupt;
+ deferred_error_int_vector = amd_deferred_error_interrupt;

if (!mce_flags.smca)
low = (low & ~MASK_DEF_INT_TYPE) | DEF_INT_TYPE_APIC;
@@ -771,7 +758,7 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
u32 low = 0, high = 0, address = 0;
int offset = -1;

- deferred_error_interrupt_enable(c);
+ enable_deferred_error_interrupt();

for (bank = 0; bank < this_cpu_read(mce_num_banks); ++bank) {
if (mce_flags.smca)
--
2.34.1


2024-04-04 15:17:25

by Yazen Ghannam

Subject: [PATCH v2 14/16] x86/mce, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers

From: Avadhut Naik <[email protected]>

AMD's Scalable MCA systems viz. Genoa will include two new registers:
MCA_SYND1 and MCA_SYND2.

These registers will include supplemental error information in addition
to the existing MCA_SYND register. The data within the registers is
considered valid if MCA_STATUS[SyndV] is set.

Add fields for these registers as vendor-specific error information
in struct mce_hw_err. Save and print these registers wherever
MCA_STATUS[SyndV]/MCA_SYND is currently used.

Also, modify the mce_record tracepoint to export these new registers
through __dynamic_array. The sizeof() operator is currently used to
determine the size of this __dynamic_array. If needed in the future, it
can be replaced by caching the size of the vendor-specific error
information as part of struct mce_hw_err.
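
A condensed sketch of the new fields and how the tracepoint carries
them (taken from the hunks below):

  struct mce_hw_err {
          struct mce m;

          union vendor_info {
                  struct {
                          u64 synd1;      /* MCA_SYND1 */
                          u64 synd2;      /* MCA_SYND2 */
                  } amd;
          } vi;
  };

  /* The mce_record tracepoint exports the union as raw bytes: */
  __dynamic_array(u8, v_data, sizeof(err->vi))
  ...
  memcpy(__get_dynamic_array(v_data), &err->vi, sizeof(err->vi));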

Signed-off-by: Avadhut Naik <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Rebase on upstream changes for MCE trace event. (Avadhut)

arch/x86/include/asm/mce.h | 12 ++++++++++++
arch/x86/kernel/cpu/mce/core.c | 26 ++++++++++++++++++--------
drivers/edac/mce_amd.c | 10 +++++++---
include/trace/events/mce.h | 9 +++++++--
4 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index e4ad9807b3e3..a701290f80a1 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -118,6 +118,9 @@
#define MSR_AMD64_SMCA_MC0_DESTAT 0xc0002008
#define MSR_AMD64_SMCA_MC0_DEADDR 0xc0002009
#define MSR_AMD64_SMCA_MC0_MISC1 0xc000200a
+/* Registers MISC2 to MISC4 are at offsets B to D. */
+#define MSR_AMD64_SMCA_MC0_SYND1 0xc000200e
+#define MSR_AMD64_SMCA_MC0_SYND2 0xc000200f
#define MSR_AMD64_SMCA_MCx_CTL(x) (MSR_AMD64_SMCA_MC0_CTL + 0x10*(x))
#define MSR_AMD64_SMCA_MCx_STATUS(x) (MSR_AMD64_SMCA_MC0_STATUS + 0x10*(x))
#define MSR_AMD64_SMCA_MCx_ADDR(x) (MSR_AMD64_SMCA_MC0_ADDR + 0x10*(x))
@@ -128,6 +131,8 @@
#define MSR_AMD64_SMCA_MCx_DESTAT(x) (MSR_AMD64_SMCA_MC0_DESTAT + 0x10*(x))
#define MSR_AMD64_SMCA_MCx_DEADDR(x) (MSR_AMD64_SMCA_MC0_DEADDR + 0x10*(x))
#define MSR_AMD64_SMCA_MCx_MISCy(x, y) ((MSR_AMD64_SMCA_MC0_MISC1 + y) + (0x10*(x)))
+#define MSR_AMD64_SMCA_MCx_SYND1(x) (MSR_AMD64_SMCA_MC0_SYND1 + 0x10*(x))
+#define MSR_AMD64_SMCA_MCx_SYND2(x) (MSR_AMD64_SMCA_MC0_SYND2 + 0x10*(x))

#define XEC(x, mask) (((x) >> 16) & mask)

@@ -185,6 +190,13 @@ enum mce_notifier_prios {

struct mce_hw_err {
struct mce m;
+
+ union vendor_info {
+ struct {
+ u64 synd1;
+ u64 synd2;
+ } amd;
+ } vi;
};

struct notifier_block;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index fef025bda2af..aa27729f7899 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -201,6 +201,10 @@ static void __print_mce(struct mce_hw_err *err)
if (mce_flags.smca) {
if (m->synd)
pr_cont("SYND %llx ", m->synd);
+ if (err->vi.amd.synd1)
+ pr_cont("SYND1 %llx ", err->vi.amd.synd1);
+ if (err->vi.amd.synd2)
+ pr_cont("SYND2 %llx ", err->vi.amd.synd2);
if (m->ipid)
pr_cont("IPID %llx ", m->ipid);
}
@@ -651,8 +655,10 @@ static struct notifier_block mce_default_nb = {
/*
* Read ADDR and MISC registers.
*/
-static noinstr void mce_read_aux(struct mce *m, int i)
+static noinstr void mce_read_aux(struct mce_hw_err *err, int i)
{
+ struct mce *m = &err->m;
+
if (m->status & MCI_STATUS_MISCV)
m->misc = mce_rdmsrl(mca_msr_reg(i, MCA_MISC));

@@ -674,8 +680,11 @@ static noinstr void mce_read_aux(struct mce *m, int i)
if (mce_flags.smca) {
m->ipid = mce_rdmsrl(MSR_AMD64_SMCA_MCx_IPID(i));

- if (m->status & MCI_STATUS_SYNDV)
+ if (m->status & MCI_STATUS_SYNDV) {
m->synd = mce_rdmsrl(MSR_AMD64_SMCA_MCx_SYND(i));
+ err->vi.amd.synd1 = mce_rdmsrl(MSR_AMD64_SMCA_MCx_SYND1(i));
+ err->vi.amd.synd2 = mce_rdmsrl(MSR_AMD64_SMCA_MCx_SYND2(i));
+ }
}
}

@@ -751,7 +760,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
/* If this entry is not valid, ignore it */
if (!(m->status & MCI_STATUS_VAL)) {
if (smca_destat_is_valid(i)) {
- mce_read_aux(m, i);
+ mce_read_aux(&err, i);
goto clear_it;
}

@@ -801,7 +810,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
if (flags & MCP_DONTLOG)
goto clear_it;

- mce_read_aux(m, i);
+ mce_read_aux(&err, i);
m->severity = mce_severity(m, NULL, NULL, false);

/*
@@ -943,9 +952,10 @@ static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_reg
* Do a quick check if any of the events requires a panic.
* This decides if we keep the events around or clear them.
*/
-static __always_inline int mce_no_way_out(struct mce *m, char **msg, unsigned long *validp,
+static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, unsigned long *validp,
struct pt_regs *regs)
{
+ struct mce *m = &err->m;
char *tmp = *msg;
int i;

@@ -963,7 +973,7 @@ static __always_inline int mce_no_way_out(struct mce *m, char **msg, unsigned lo

m->bank = i;
if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
- mce_read_aux(m, i);
+ mce_read_aux(err, i);
*msg = tmp;
return 1;
}
@@ -1361,7 +1371,7 @@ __mc_scan_banks(struct mce_hw_err *err, struct pt_regs *regs, struct mce *final,
if (severity == MCE_NO_SEVERITY)
continue;

- mce_read_aux(m, i);
+ mce_read_aux(err, i);

/* assuming valid severity level != 0 */
m->severity = severity;
@@ -1562,7 +1572,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
final = this_cpu_ptr(&hw_errs_seen);
final->m = *m;

- no_way_out = mce_no_way_out(m, &msg, valid_banks, regs);
+ no_way_out = mce_no_way_out(&err, &msg, valid_banks, regs);

barrier();

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index e02af5da1ec2..32bf4cc564a3 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -792,7 +792,8 @@ static const char *decode_error_status(struct mce *m)
static int
amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
{
- struct mce *m = (struct mce *)data;
+ struct mce_hw_err *err = (struct mce_hw_err *)data;
+ struct mce *m = &err->m;
unsigned int fam = x86_family(m->cpuid);
int ecc;

@@ -850,8 +851,11 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
if (boot_cpu_has(X86_FEATURE_SMCA)) {
pr_emerg(HW_ERR "IPID: 0x%016llx", m->ipid);

- if (m->status & MCI_STATUS_SYNDV)
- pr_cont(", Syndrome: 0x%016llx", m->synd);
+ if (m->status & MCI_STATUS_SYNDV) {
+ pr_cont(", Syndrome: 0x%016llx\n", m->synd);
+ pr_emerg(HW_ERR "Syndrome1: 0x%016llx, Syndrome2: 0x%016llx",
+ err->vi.amd.synd1, err->vi.amd.synd2);
+ }

pr_cont("\n");

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 65aba1afcd07..43e8ecc11881 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -43,6 +43,8 @@ TRACE_EVENT(mce_record,
__field( u8, bank )
__field( u8, cpuvendor )
__field( u32, microcode )
+ __field( u8, len )
+ __dynamic_array(u8, v_data, sizeof(err->vi))
),

TP_fast_assign(
@@ -65,9 +67,11 @@ TRACE_EVENT(mce_record,
__entry->bank = err->m.bank;
__entry->cpuvendor = err->m.cpuvendor;
__entry->microcode = err->m.microcode;
+ __entry->len = sizeof(err->vi);
+ memcpy(__get_dynamic_array(v_data), &err->vi, sizeof(err->vi));
),

- TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x, microcode: %x",
+ TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016llx, IPID: %016llx, ADDR: %016llx, MISC: %016llx, SYND: %016llx, RIP: %02x:<%016llx>, TSC: %llx, PPIN: %llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x, microcode: %x, vendor data: %s",
__entry->cpu,
__entry->mcgcap, __entry->mcgstatus,
__entry->bank, __entry->status,
@@ -83,7 +87,8 @@ TRACE_EVENT(mce_record,
__entry->walltime,
__entry->socketid,
__entry->apicid,
- __entry->microcode)
+ __entry->microcode,
+ __print_array(__get_dynamic_array(v_data), __entry->len / 8, 8))
);

#endif /* _TRACE_MCE_H */
--
2.34.1


2024-04-04 15:18:25

by Yazen Ghannam

Subject: [PATCH v2 15/16] x86/mce/apei: Handle variable register array size

The ACPI Boot Error Record Table (BERT) is used by the kernel to
report errors that occurred in a previous boot. On some modern AMD
systems, these errors within the BERT are reported through the x86
Common Platform Error Record (CPER) format, which consists of one or
more Processor Context Information Structures. These context
structures provide a starting address and represent an x86 MSR range
in which the data is a contiguous set of MSRs starting from, and
including, the starting address.

It's common, for AMD systems that implement this behavior, that the
MSR range represents the MCAX register space used for the Scalable MCA
feature. The apei_smca_report_x86_error() function decodes and passes
this information through the MCE notifier chain. However, this function
assumes a fixed register size based on the original HW/FW implementation.

This assumption breaks with the addition of two new MCAX registers viz.
MCA_SYND1 and MCA_SYND2. These registers are added at the end of the
MCAX register space, so they won't be included when decoding the CPER
data.

Rework apei_smca_report_x86_error() to support a variable register array
size. This covers any case where the MSR context information starts at
the MCAX address for MCA_STATUS and ends at any other register within
the MCAX register space.

Add code comments indicating the MCAX register at each offset.
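
The size handling then reduces to roughly the following (condensed from
the hunk below; the offsets follow the fixed SMCA register layout):

  /* Register Array Size / 8 gives the number of 64-bit registers. */
  num_registers = ctx_info->reg_arr_size >> 3;
  if (!num_registers)
          return -EINVAL;

  /* Cap to the known layout and fall through from the last register down. */
  if (num_registers > 15)
          num_registers = 15;

  switch (num_registers) {
  case 15: err.vi.amd.synd2 = *(i_mce + 14); fallthrough;  /* MCA_SYND2 */
  case 14: err.vi.amd.synd1 = *(i_mce + 13); fallthrough;  /* MCA_SYND1 */
          ...
  case 1:  m->status = *i_mce;                             /* MCA_STATUS */
  }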

Co-developed-by: Avadhut Naik <[email protected]>
Signed-off-by: Avadhut Naik <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* No change.

arch/x86/kernel/cpu/mce/apei.c | 73 +++++++++++++++++++++++++++-------
1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 89a8ebac53ea..43622241c379 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -69,9 +69,9 @@ EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
{
const u64 *i_mce = ((const u64 *) (ctx_info + 1));
+ unsigned int cpu, num_registers;
struct mce_hw_err err;
struct mce *m = &err.m;
- unsigned int cpu;

memset(&err, 0, sizeof(struct mce_hw_err));

@@ -91,16 +91,12 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
return -EINVAL;

/*
- * The register array size must be large enough to include all the
- * SMCA registers which need to be extracted.
- *
* The number of registers in the register array is determined by
* Register Array Size/8 as defined in UEFI spec v2.8, sec N.2.4.2.2.
- * The register layout is fixed and currently the raw data in the
- * register array includes 6 SMCA registers which the kernel can
- * extract.
+ * Ensure that the array size includes at least 1 register.
*/
- if (ctx_info->reg_arr_size < 48)
+ num_registers = ctx_info->reg_arr_size >> 3;
+ if (!num_registers)
return -EINVAL;

for_each_possible_cpu(cpu) {
@@ -115,12 +111,61 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
mce_setup_for_cpu(cpu, m);

m->bank = (ctx_info->msr_addr >> 4) & 0xFF;
- m->status = *i_mce;
- m->addr = *(i_mce + 1);
- m->misc = *(i_mce + 2);
- /* Skipping MCA_CONFIG */
- m->ipid = *(i_mce + 4);
- m->synd = *(i_mce + 5);
+
+ /*
+ * The SMCA register layout is fixed and includes 16 registers.
+ * The end of the array may be variable, but the beginning is known.
+ * Switch on the number of registers. Cap the number of registers to
+ * expected max (15).
+ */
+ if (num_registers > 15)
+ num_registers = 15;
+
+ switch (num_registers) {
+ /* MCA_SYND2 */
+ case 15:
+ err.vi.amd.synd2 = *(i_mce + 14);
+ fallthrough;
+ /* MCA_SYND1 */
+ case 14:
+ err.vi.amd.synd1 = *(i_mce + 13);
+ fallthrough;
+ /* MCA_MISC4 */
+ case 13:
+ /* MCA_MISC3 */
+ case 12:
+ /* MCA_MISC2 */
+ case 11:
+ /* MCA_MISC1 */
+ case 10:
+ /* MCA_DEADDR */
+ case 9:
+ /* MCA_DESTAT */
+ case 8:
+ /* reserved */
+ case 7:
+ /* MCA_SYND */
+ case 6:
+ m->synd = *(i_mce + 5);
+ fallthrough;
+ /* MCA_IPID */
+ case 5:
+ m->ipid = *(i_mce + 4);
+ fallthrough;
+ /* MCA_CONFIG */
+ case 4:
+ /* MCA_MISC0 */
+ case 3:
+ m->misc = *(i_mce + 2);
+ fallthrough;
+ /* MCA_ADDR */
+ case 2:
+ m->addr = *(i_mce + 1);
+ fallthrough;
+ /* MCA_STATUS */
+ case 1:
+ m->status = *i_mce;
+ }

mce_log(&err);

--
2.34.1


2024-04-04 15:18:29

by Yazen Ghannam

Subject: [PATCH v2 05/16] x86/mce/amd: Clean up SMCA configuration

The current SMCA configuration function does more than just configure
SMCA features. It also detects and caches the SMCA bank types.

However, the bank type caching flow will be removed during the init path
clean up.

Define a new function that only configures SMCA features. This will
operate on the MCA_CONFIG MSR, so include updated register field
definitions using bitops.

Leave the old code in place until the init path is cleaned up.
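
The bitops-based configuration boils down to (condensed from the hunk
below):

  /* Acknowledge use of the MCAX MSR ranges and registers. */
  mca_config |= FIELD_PREP(CFG_MCAX_EN, 0x1);

  /* Request an APIC-based deferred error interrupt if none is set yet. */
  if (FIELD_GET(CFG_DFR_INT_SUPP, mca_config) && !FIELD_GET(CFG_DFR_INT_TYPE, mca_config))
          mca_config |= FIELD_PREP(CFG_DFR_INT_TYPE, INTR_TYPE_APIC);

  wrmsrl(MSR_AMD64_SMCA_MCx_CONFIG(bank), mca_config);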

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* No change.

arch/x86/kernel/cpu/mce/amd.c | 84 ++++++++++++++++++++---------------
1 file changed, 49 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index c76bc158b6b6..3093fed06194 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -50,6 +50,7 @@
#define MASK_DEF_INT_TYPE 0x00000006
#define DEF_LVT_OFF 0x2
#define DEF_INT_TYPE_APIC 0x2
+#define INTR_TYPE_APIC 0x1

/* Scalable MCA: */
#define MCI_IPID_MCATYPE GENMASK_ULL(47, 44)
@@ -57,6 +58,12 @@
#define MCI_IPID_MCATYPE_OLD 0xFFFF0000
#define MCI_IPID_HWID_OLD 0xFFF

+/* MCA_CONFIG register, one per MCA bank */
+#define CFG_DFR_INT_TYPE GENMASK_ULL(38, 37)
+#define CFG_MCAX_EN BIT_ULL(32)
+#define CFG_LSB_IN_STATUS BIT_ULL(8)
+#define CFG_DFR_INT_SUPP BIT_ULL(5)
+
/* Threshold LVT offset is at MSR0xC0000410[15:12] */
#define SMCA_THR_LVT_OFF 0xF000

@@ -344,45 +351,51 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu)

}

-static void smca_configure(unsigned int bank, unsigned int cpu)
+/* Set appropriate bits in MCA_CONFIG. */
+static void configure_smca(unsigned int bank)
{
- u8 *bank_counts = this_cpu_ptr(smca_bank_counts);
- const struct smca_hwid *s_hwid;
- unsigned int i, hwid_mcatype;
- u32 high, low;
- u32 smca_config = MSR_AMD64_SMCA_MCx_CONFIG(bank);
+ u64 mca_config;

- /* Set appropriate bits in MCA_CONFIG */
- if (!rdmsr_safe(smca_config, &low, &high)) {
- /*
- * OS is required to set the MCAX bit to acknowledge that it is
- * now using the new MSR ranges and new registers under each
- * bank. It also means that the OS will configure deferred
- * errors in the new MCx_CONFIG register. If the bit is not set,
- * uncorrectable errors will cause a system panic.
- *
- * MCA_CONFIG[MCAX] is bit 32 (0 in the high portion of the MSR.)
- */
- high |= BIT(0);
+ if (!mce_flags.smca)
+ return;

- /*
- * SMCA sets the Deferred Error Interrupt type per bank.
- *
- * MCA_CONFIG[DeferredIntTypeSupported] is bit 5, and tells us
- * if the DeferredIntType bit field is available.
- *
- * MCA_CONFIG[DeferredIntType] is bits [38:37] ([6:5] in the
- * high portion of the MSR). OS should set this to 0x1 to enable
- * APIC based interrupt. First, check that no interrupt has been
- * set.
- */
- if ((low & BIT(5)) && !((high >> 5) & 0x3))
- high |= BIT(5);
+ if (rdmsrl_safe(MSR_AMD64_SMCA_MCx_CONFIG(bank), &mca_config))
+ return;
+
+ /*
+ * OS is required to set the MCAX enable bit to acknowledge that it is
+ * now using the new MSR ranges and new registers under each
+ * bank. It also means that the OS will configure deferred
+ * errors in the new MCA_CONFIG register. If the bit is not set,
+ * uncorrectable errors will cause a system panic.
+ */
+ mca_config |= FIELD_PREP(CFG_MCAX_EN, 0x1);

- this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = !!(low & BIT(8));
+ /*
+ * SMCA sets the Deferred Error Interrupt type per bank.
+ *
+ * MCA_CONFIG[DeferredIntTypeSupported] is bit 5, and tells us
+ * if the DeferredIntType bit field is available.
+ *
+ * MCA_CONFIG[DeferredIntType] is bits [38:37]. OS should set
+ * this to 0x1 to enable APIC based interrupt. First, check that
+ * no interrupt has been set.
+ */
+ if (FIELD_GET(CFG_DFR_INT_SUPP, mca_config) && !FIELD_GET(CFG_DFR_INT_TYPE, mca_config))
+ mca_config |= FIELD_PREP(CFG_DFR_INT_TYPE, INTR_TYPE_APIC);

- wrmsr(smca_config, low, high);
- }
+ if (FIELD_GET(CFG_LSB_IN_STATUS, mca_config))
+ this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = true;
+
+ wrmsrl(MSR_AMD64_SMCA_MCx_CONFIG(bank), mca_config);
+}
+
+static void smca_configure_old(unsigned int bank, unsigned int cpu)
+{
+ u8 *bank_counts = this_cpu_ptr(smca_bank_counts);
+ const struct smca_hwid *s_hwid;
+ unsigned int i, hwid_mcatype;
+ u32 high, low;

smca_set_misc_banks_map(bank, cpu);

@@ -758,8 +771,9 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)

for (bank = 0; bank < this_cpu_read(mce_num_banks); ++bank) {
if (mce_flags.smca)
- smca_configure(bank, cpu);
+ smca_configure_old(bank, cpu);

+ configure_smca(bank);
disable_err_thresholding(c, bank);

for (block = 0; block < NR_BLOCKS; ++block) {
--
2.34.1


2024-04-04 15:22:34

by Yazen Ghannam

Subject: [PATCH v2 10/16] x86/mce: Unify AMD DFR handler with MCA Polling

AMD systems optionally support a Deferred error interrupt. The interrupt
should be used as another signal to trigger MCA polling. This is similar
to how other MCA interrupts are handled.

Deferred errors do not require any special handling related to the
interrupt, e.g. resetting or rearming the interrupt.

However, Scalable MCA systems include a pair of registers, MCA_DESTAT
and MCA_DEADDR, that should be checked for valid errors. This check
should be done whenever MCA registers are polled. Currently, the
Deferred error interrupt does this check, but the MCA polling function
does not.

Call the MCA polling function when handling the Deferred error
interrupt. This keeps all "polling" cases in a common function.

Call the polling function only for banks that have the Deferred error
interrupt enabled.

Add a "SMCA DFR handler" for Deferred errors to the AMD vendor-specific
error handler callback. This will do the same status check, register
clearing, and logging that the interrupt handler has done. And it
extends the common polling flow to find AMD Deferred errors.

Remove old code whose functionality is already covered in the common MCA
code.
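
Condensed, the new interrupt entry point and the polling hook look like
this (taken from the hunks below):

  /* DFR interrupt: poll only the banks that have the interrupt enabled. */
  static void amd_deferred_error_interrupt(void)
  {
          machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_dfr_int_banks));
  }

  /* In machine_check_poll(): also pick up errors logged only in MCA_DESTAT. */
  if (!(m.status & MCI_STATUS_VAL)) {
          if (smca_destat_is_valid(i)) {
                  mce_read_aux(&m, i);
                  goto clear_it;
          }

          continue;
  }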

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Keep separate interrupt entry points. (Yazen)
* Move DFR error setup for MCA_CONFIG to a helper. (Yazen)

arch/x86/kernel/cpu/mce/amd.c | 155 +++++++++++++--------------------
arch/x86/kernel/cpu/mce/core.c | 16 +++-
2 files changed, 76 insertions(+), 95 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 75195d6fe971..40912c5e35d1 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -62,11 +62,13 @@
#define CFG_MCAX_EN BIT_ULL(32)
#define CFG_LSB_IN_STATUS BIT_ULL(8)
#define CFG_DFR_INT_SUPP BIT_ULL(5)
+#define CFG_DFR_LOG_SUPP BIT_ULL(2)

/* Threshold LVT offset is at MSR0xC0000410[15:12] */
#define SMCA_THR_LVT_OFF 0xF000

static bool thresholding_irq_en;
+static DEFINE_PER_CPU(mce_banks_t, mce_dfr_int_banks);

static const char * const th_names[] = {
"load_store",
@@ -350,6 +352,28 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu)

}

+/* SMCA sets the Deferred Error Interrupt type per bank. */
+static void configure_smca_dfr(unsigned int bank, u64 *mca_config)
+{
+ /* Nothing to do if the bank doesn't support deferred error logging. */
+ if (!FIELD_GET(CFG_DFR_LOG_SUPP, *mca_config))
+ return;
+
+ /* Nothing to do if the bank doesn't support setting the interrupt type. */
+ if (!FIELD_GET(CFG_DFR_INT_SUPP, *mca_config))
+ return;
+
+ /*
+ * Nothing to do if the interrupt type is already set. Either it was set by
+ * the OS already. Or it was set by firmware, and the OS should leave it as-is.
+ */
+ if (FIELD_GET(CFG_DFR_INT_TYPE, *mca_config))
+ return;
+
+ *mca_config |= FIELD_PREP(CFG_DFR_INT_TYPE, INTR_TYPE_APIC);
+ set_bit(bank, (void *)this_cpu_ptr(&mce_dfr_int_banks));
+}
+
/* Set appropriate bits in MCA_CONFIG. */
static void configure_smca(unsigned int bank)
{
@@ -370,18 +394,7 @@ static void configure_smca(unsigned int bank)
*/
mca_config |= FIELD_PREP(CFG_MCAX_EN, 0x1);

- /*
- * SMCA sets the Deferred Error Interrupt type per bank.
- *
- * MCA_CONFIG[DeferredIntTypeSupported] is bit 5, and tells us
- * if the DeferredIntType bit field is available.
- *
- * MCA_CONFIG[DeferredIntType] is bits [38:37]. OS should set
- * this to 0x1 to enable APIC based interrupt. First, check that
- * no interrupt has been set.
- */
- if (FIELD_GET(CFG_DFR_INT_SUPP, mca_config) && !FIELD_GET(CFG_DFR_INT_TYPE, mca_config))
- mca_config |= FIELD_PREP(CFG_DFR_INT_TYPE, INTR_TYPE_APIC);
+ configure_smca_dfr(bank, &mca_config);

if (FIELD_GET(CFG_LSB_IN_STATUS, mca_config))
this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = true;
@@ -872,33 +885,6 @@ bool amd_mce_usable_address(struct mce *m)
return false;
}

-static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
-{
- struct mce m;
-
- mce_setup(&m);
-
- m.status = status;
- m.misc = misc;
- m.bank = bank;
- m.tsc = rdtsc();
-
- if (m.status & MCI_STATUS_ADDRV) {
- m.addr = addr;
-
- smca_extract_err_addr(&m);
- }
-
- if (mce_flags.smca) {
- rdmsrl(MSR_AMD64_SMCA_MCx_IPID(bank), m.ipid);
-
- if (m.status & MCI_STATUS_SYNDV)
- rdmsrl(MSR_AMD64_SMCA_MCx_SYND(bank), m.synd);
- }
-
- mce_log(&m);
-}
-
DEFINE_IDTENTRY_SYSVEC(sysvec_deferred_error)
{
trace_deferred_error_apic_entry(DEFERRED_ERROR_VECTOR);
@@ -908,75 +894,46 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_deferred_error)
apic_eoi();
}

-/*
- * Returns true if the logged error is deferred. False, otherwise.
- */
-static inline bool
-_log_error_bank(unsigned int bank, u32 msr_stat, u32 msr_addr, u64 misc)
-{
- u64 status, addr = 0;
-
- rdmsrl(msr_stat, status);
- if (!(status & MCI_STATUS_VAL))
- return false;
-
- if (status & MCI_STATUS_ADDRV)
- rdmsrl(msr_addr, addr);
-
- __log_error(bank, status, addr, misc);
-
- wrmsrl(msr_stat, 0);
-
- return status & MCI_STATUS_DEFERRED;
-}
-
-static bool _log_error_deferred(unsigned int bank, u32 misc)
-{
- if (!_log_error_bank(bank, mca_msr_reg(bank, MCA_STATUS),
- mca_msr_reg(bank, MCA_ADDR), misc))
- return false;
-
- /*
- * Non-SMCA systems don't have MCA_DESTAT/MCA_DEADDR registers.
- * Return true here to avoid accessing these registers.
- */
- if (!mce_flags.smca)
- return true;
-
- /* Clear MCA_DESTAT if the deferred error was logged from MCA_STATUS. */
- wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), 0);
- return true;
-}
-
/*
* We have three scenarios for checking for Deferred errors:
*
* 1) Non-SMCA systems check MCA_STATUS and log error if found.
+ * This is already handled in machine_check_poll().
* 2) SMCA systems check MCA_STATUS. If error is found then log it and also
* clear MCA_DESTAT.
* 3) SMCA systems check MCA_DESTAT, if error was not found in MCA_STATUS, and
* log it.
*/
-static void log_error_deferred(unsigned int bank)
+static void handle_smca_dfr_error(struct mce *m)
{
- if (_log_error_deferred(bank, 0))
+ struct mce m_dfr;
+ u64 mca_destat;
+
+ /* Non-SMCA systems don't have MCA_DESTAT/MCA_DEADDR registers. */
+ if (!mce_flags.smca)
return;

- /*
- * Only deferred errors are logged in MCA_DE{STAT,ADDR} so just check
- * for a valid error.
- */
- _log_error_bank(bank, MSR_AMD64_SMCA_MCx_DESTAT(bank),
- MSR_AMD64_SMCA_MCx_DEADDR(bank), 0);
-}
+ /* Clear MCA_DESTAT if the deferred error was logged from MCA_STATUS. */
+ if (m->status & MCI_STATUS_DEFERRED)
+ goto out;

-/* APIC interrupt handler for deferred errors */
-static void amd_deferred_error_interrupt(void)
-{
- unsigned int bank;
+ /* MCA_STATUS didn't have a deferred error, so check MCA_DESTAT for one. */
+ mca_destat = mce_rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(m->bank));
+
+ if (!(mca_destat & MCI_STATUS_VAL))
+ return;
+
+ /* Reuse the same data collected from machine_check_poll(). */
+ memcpy(&m_dfr, m, sizeof(m_dfr));
+
+ /* Save the MCA_DE{STAT,ADDR} values. */
+ m_dfr.status = mca_destat;
+ m_dfr.addr = mce_rdmsrl(MSR_AMD64_SMCA_MCx_DEADDR(m_dfr.bank));

- for (bank = 0; bank < this_cpu_read(mce_num_banks); ++bank)
- log_error_deferred(bank);
+ mce_log(&m_dfr);
+
+out:
+ wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
}

static void reset_block(struct threshold_block *block)
@@ -1035,9 +992,19 @@ static void amd_threshold_interrupt(void)
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
}

+/*
+ * Deferred error interrupt handler will service DEFERRED_ERROR_VECTOR. The interrupt
+ * is triggered when a bank logs a deferred error.
+ */
+static void amd_deferred_error_interrupt(void)
+{
+ machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_dfr_int_banks));
+}
+
void amd_handle_error(struct mce *m)
{
reset_thr_blocks(m->bank);
+ handle_smca_dfr_error(m);
}

/*
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 75297e7eb980..308766868f39 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -680,6 +680,14 @@ static void vendor_handle_error(struct mce *m)

DEFINE_PER_CPU(unsigned, mce_poll_count);

+static bool smca_destat_is_valid(unsigned int bank)
+{
+ if (!mce_flags.smca)
+ return false;
+
+ return mce_rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank)) & MCI_STATUS_VAL;
+}
+
/*
* Poll for corrected events or events that happened before reset.
* Those are just logged through /dev/mcelog.
@@ -731,8 +739,14 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
mce_track_storm(&m);

/* If this entry is not valid, ignore it */
- if (!(m.status & MCI_STATUS_VAL))
+ if (!(m.status & MCI_STATUS_VAL)) {
+ if (smca_destat_is_valid(i)) {
+ mce_read_aux(&m, i);
+ goto clear_it;
+ }
+
continue;
+ }

/*
* If we are logging everything (at CPU online) or this
--
2.34.1


2024-04-04 15:39:46

by Yazen Ghannam

Subject: [PATCH v2 13/16] x86/mce: Add wrapper for struct mce to export vendor specific info

From: Avadhut Naik <[email protected]>

Currently, exporting new machine check error information involves
adding new fields at the end of struct mce. This additional information
can then be consumed through mcelog or the mce_record tracepoint.

However, as CPU vendors add new MSRs (and will add more in the future)
carrying additional machine check error information on their newer
CPUs, the size of struct mce will balloon unnecessarily on some CPUs,
since those fields are vendor-specific. Moreover, different CPU vendors
may export the additional information in varying sizes.

The problem is made worse because struct mce is exposed to userspace as
part of the UAPI. Its bloating through vendor-specific data should be
avoided to limit the information being sent out to userspace.

Add a new structure mce_hw_err to wrap the existing struct mce. This
prevents the ballooning since vendor-specific data, if any, can now be
exported through a union within the wrapper structure and through a
__dynamic_array in the mce_record tracepoint.

Furthermore, new internal kernel fields can be added to the wrapper
struct without impacting the user space API.
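
The wrapper itself is minimal at this point (from the hunk below);
vendor-specific data is added to it by a later patch:

  struct mce_hw_err {
          struct mce m;
  };

  /* Notifiers and mce_log() now take the wrapper instead of struct mce: */
  void mce_log(struct mce_hw_err *err);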

[Yazen: Add last commit message paragraph. Rebase on other MCA updates.]

Suggested-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Avadhut Naik <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Update all MCE notifier blocks. (Yazen)
* Rebase on upstream changes for MCE trace event. (Avadhut)

arch/x86/include/asm/mce.h | 6 +-
arch/x86/kernel/cpu/mce/amd.c | 24 ++--
arch/x86/kernel/cpu/mce/apei.c | 46 +++---
arch/x86/kernel/cpu/mce/core.c | 181 +++++++++++++-----------
arch/x86/kernel/cpu/mce/dev-mcelog.c | 2 +-
arch/x86/kernel/cpu/mce/genpool.c | 20 +--
arch/x86/kernel/cpu/mce/inject.c | 4 +-
arch/x86/kernel/cpu/mce/internal.h | 8 +-
drivers/acpi/acpi_extlog.c | 2 +-
drivers/acpi/nfit/mce.c | 3 +-
drivers/edac/i7core_edac.c | 2 +-
drivers/edac/igen6_edac.c | 2 +-
drivers/edac/pnd2_edac.c | 2 +-
drivers/edac/sb_edac.c | 2 +-
drivers/edac/skx_common.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
drivers/ras/amd/fmpm.c | 2 +-
drivers/ras/cec.c | 3 +-
include/trace/events/mce.h | 42 +++---
19 files changed, 196 insertions(+), 159 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index adad99bac567..e4ad9807b3e3 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -183,6 +183,10 @@ enum mce_notifier_prios {
MCE_PRIO_HIGHEST = MCE_PRIO_CEC
};

+struct mce_hw_err {
+ struct mce m;
+};
+
struct notifier_block;
extern void mce_register_decode_chain(struct notifier_block *nb);
extern void mce_unregister_decode_chain(struct notifier_block *nb);
@@ -218,7 +222,7 @@ static inline int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
#endif

void mce_setup(struct mce *m);
-void mce_log(struct mce *m);
+void mce_log(struct mce_hw_err *err);
DECLARE_PER_CPU(struct device *, mce_device);

/* Maximum number of MCA banks per CPU. */
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index a81d911d608e..40e6c5a98dce 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -921,9 +921,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_deferred_error)
* 3) SMCA systems check MCA_DESTAT, if error was not found in MCA_STATUS, and
* log it.
*/
-static void handle_smca_dfr_error(struct mce *m)
+static void handle_smca_dfr_error(struct mce_hw_err *err)
{
- struct mce m_dfr;
+ struct mce_hw_err err_dfr;
u64 mca_destat;

/* Non-SMCA systems don't have MCA_DESTAT/MCA_DEADDR registers. */
@@ -931,26 +931,26 @@ static void handle_smca_dfr_error(struct mce *m)
return;

/* Clear MCA_DESTAT if the deferred error was logged from MCA_STATUS. */
- if (m->status & MCI_STATUS_DEFERRED)
+ if (err->m.status & MCI_STATUS_DEFERRED)
goto out;

/* MCA_STATUS didn't have a deferred error, so check MCA_DESTAT for one. */
- mca_destat = mce_rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(m->bank));
+ mca_destat = mce_rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(err->m.bank));

if (!(mca_destat & MCI_STATUS_VAL))
return;

/* Reuse the same data collected from machine_check_poll(). */
- memcpy(&m_dfr, m, sizeof(m_dfr));
+ memcpy(&err_dfr, err, sizeof(err_dfr));

/* Save the MCA_DE{STAT,ADDR} values. */
- m_dfr.status = mca_destat;
- m_dfr.addr = mce_rdmsrl(MSR_AMD64_SMCA_MCx_DEADDR(m_dfr.bank));
+ err_dfr.m.status = mca_destat;
+ err_dfr.m.addr = mce_rdmsrl(MSR_AMD64_SMCA_MCx_DEADDR(err_dfr.m.bank));

- mce_log(&m_dfr);
+ mce_log(&err_dfr);

out:
- wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
+ wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(err->m.bank), 0);
}

static void reset_block(struct threshold_block *block)
@@ -1018,10 +1018,10 @@ static void amd_deferred_error_interrupt(void)
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_dfr_int_banks));
}

-void amd_handle_error(struct mce *m)
+void amd_handle_error(struct mce_hw_err *err)
{
- reset_thr_blocks(m->bank);
- handle_smca_dfr_error(m);
+ reset_thr_blocks(err->m.bank);
+ handle_smca_dfr_error(err);
}

/*
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index e4e32e337110..89a8ebac53ea 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -28,9 +28,12 @@

void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
{
- struct mce m;
+ struct mce_hw_err err;
+ struct mce *m = &err.m;
int lsb;

+ memset(&err, 0, sizeof(struct mce_hw_err));
+
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;

@@ -44,30 +47,33 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
else
lsb = PAGE_SHIFT;

- mce_setup(&m);
- m.bank = -1;
+ mce_setup(m);
+ m->bank = -1;
/* Fake a memory read error with unknown channel */
- m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
- m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
+ m->status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
+ m->misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;

if (severity >= GHES_SEV_RECOVERABLE)
- m.status |= MCI_STATUS_UC;
+ m->status |= MCI_STATUS_UC;

if (severity >= GHES_SEV_PANIC) {
- m.status |= MCI_STATUS_PCC;
- m.tsc = rdtsc();
+ m->status |= MCI_STATUS_PCC;
+ m->tsc = rdtsc();
}

- m.addr = mem_err->physical_addr;
- mce_log(&m);
+ m->addr = mem_err->physical_addr;
+ mce_log(&err);
}
EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);

int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
{
const u64 *i_mce = ((const u64 *) (ctx_info + 1));
+ struct mce_hw_err err;
+ struct mce *m = &err.m;
unsigned int cpu;
- struct mce m;
+
+ memset(&err, 0, sizeof(struct mce_hw_err));

if (!boot_cpu_has(X86_FEATURE_SMCA))
return -EINVAL;
@@ -105,18 +111,18 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
if (!cpu_possible(cpu))
return -EINVAL;

- mce_setup_common(&m);
- mce_setup_for_cpu(cpu, &m);
+ mce_setup_common(m);
+ mce_setup_for_cpu(cpu, m);

- m.bank = (ctx_info->msr_addr >> 4) & 0xFF;
- m.status = *i_mce;
- m.addr = *(i_mce + 1);
- m.misc = *(i_mce + 2);
+ m->bank = (ctx_info->msr_addr >> 4) & 0xFF;
+ m->status = *i_mce;
+ m->addr = *(i_mce + 1);
+ m->misc = *(i_mce + 2);
/* Skipping MCA_CONFIG */
- m.ipid = *(i_mce + 4);
- m.synd = *(i_mce + 5);
+ m->ipid = *(i_mce + 4);
+ m->synd = *(i_mce + 5);

- mce_log(&m);
+ mce_log(&err);

return 0;
}
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 17cf5a9df3cd..fef025bda2af 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -88,7 +88,7 @@ struct mca_config mca_cfg __read_mostly = {
.monarch_timeout = -1
};

-static DEFINE_PER_CPU(struct mce, mces_seen);
+static DEFINE_PER_CPU(struct mce_hw_err, hw_errs_seen);
static unsigned long mce_need_notify;

/*
@@ -148,9 +148,9 @@ void mce_setup(struct mce *m)
DEFINE_PER_CPU(struct mce, injectm);
EXPORT_PER_CPU_SYMBOL_GPL(injectm);

-void mce_log(struct mce *m)
+void mce_log(struct mce_hw_err *err)
{
- if (!mce_gen_pool_add(m))
+ if (!mce_gen_pool_add(err))
irq_work_queue(&mce_irq_work);
}
EXPORT_SYMBOL_GPL(mce_log);
@@ -171,8 +171,10 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);

-static void __print_mce(struct mce *m)
+static void __print_mce(struct mce_hw_err *err)
{
+ struct mce *m = &err->m;
+
pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
m->extcpu,
(m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
@@ -214,9 +216,11 @@ static void __print_mce(struct mce *m)
m->microcode);
}

-static void print_mce(struct mce *m)
+static void print_mce(struct mce_hw_err *err)
{
- __print_mce(m);
+ struct mce *m = &err->m;
+
+ __print_mce(err);

if (m->cpuvendor != X86_VENDOR_AMD && m->cpuvendor != X86_VENDOR_HYGON)
pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
@@ -251,7 +255,7 @@ static const char *mce_dump_aux_info(struct mce *m)
return NULL;
}

-static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
+static noinstr void mce_panic(const char *msg, struct mce_hw_err *final, char *exp)
{
struct llist_node *pending;
struct mce_evt_llist *l;
@@ -282,20 +286,22 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
pending = mce_gen_pool_prepare_records();
/* First print corrected ones that are still unlogged */
llist_for_each_entry(l, pending, llnode) {
- struct mce *m = &l->mce;
+ struct mce_hw_err *err = &l->err;
+ struct mce *m = &err->m;
if (!(m->status & MCI_STATUS_UC)) {
- print_mce(m);
+ print_mce(err);
if (!apei_err)
apei_err = apei_write_mce(m);
}
}
/* Now print uncorrected but with the final one last */
llist_for_each_entry(l, pending, llnode) {
- struct mce *m = &l->mce;
+ struct mce_hw_err *err = &l->err;
+ struct mce *m = &err->m;
if (!(m->status & MCI_STATUS_UC))
continue;
- if (!final || mce_cmp(m, final)) {
- print_mce(m);
+ if (!final || mce_cmp(m, &final->m)) {
+ print_mce(err);
if (!apei_err)
apei_err = apei_write_mce(m);
}
@@ -303,12 +309,12 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (final) {
print_mce(final);
if (!apei_err)
- apei_err = apei_write_mce(final);
+ apei_err = apei_write_mce(&final->m);
}
if (exp)
pr_emerg(HW_ERR "Machine check: %s\n", exp);

- memmsg = mce_dump_aux_info(final);
+ memmsg = mce_dump_aux_info(&final->m);
if (memmsg)
pr_emerg(HW_ERR "Machine check: %s\n", memmsg);

@@ -323,9 +329,9 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
* panic.
*/
if (kexec_crash_loaded()) {
- if (final && (final->status & MCI_STATUS_ADDRV)) {
+ if (final && (final->m.status & MCI_STATUS_ADDRV)) {
struct page *p;
- p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+ p = pfn_to_online_page(final->m.addr >> PAGE_SHIFT);
if (p)
SetPageHWPoison(p);
}
@@ -574,13 +580,13 @@ EXPORT_SYMBOL_GPL(mce_is_correctable);
static int mce_early_notifier(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *m = (struct mce *)data;
+ struct mce_hw_err *err = (struct mce_hw_err *)data;

- if (!m)
+ if (!err)
return NOTIFY_DONE;

/* Emit the trace record: */
- trace_mce_record(m);
+ trace_mce_record(err);

set_bit(0, &mce_need_notify);

@@ -597,7 +603,8 @@ static struct notifier_block early_nb = {
static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce_hw_err *err = (struct mce_hw_err *)data;
+ struct mce *mce = &err->m;
unsigned long pfn;

if (!mce || !mce_usable_address(mce))
@@ -624,13 +631,13 @@ static struct notifier_block mce_uc_nb = {
static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *m = (struct mce *)data;
+ struct mce_hw_err *err = (struct mce_hw_err *)data;

- if (!m)
+ if (!err)
return NOTIFY_DONE;

- if (mca_cfg.print_all || !m->kflags)
- __print_mce(m);
+ if (mca_cfg.print_all || !(err->m.kflags))
+ __print_mce(err);

return NOTIFY_DONE;
}
@@ -672,10 +679,10 @@ static noinstr void mce_read_aux(struct mce *m, int i)
}
}

-static void vendor_handle_error(struct mce *m)
+static void vendor_handle_error(struct mce_hw_err *err)
{
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
- return amd_handle_error(m);
+ return amd_handle_error(err);
}

DEFINE_PER_CPU(unsigned, mce_poll_count);
@@ -707,26 +714,29 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
{
struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
bool error_seen = false;
- struct mce m;
+ struct mce_hw_err err;
+ struct mce *m = &err.m;
int i;

+ memset(&err, 0, sizeof(struct mce_hw_err));
+
this_cpu_inc(mce_poll_count);

- mce_gather_info(&m, NULL);
+ mce_gather_info(m, NULL);

if (flags & MCP_TIMESTAMP)
- m.tsc = rdtsc();
+ m->tsc = rdtsc();

for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
if (!mce_banks[i].ctl || !test_bit(i, *b))
continue;

- m.misc = 0;
- m.addr = 0;
- m.bank = i;
+ m->misc = 0;
+ m->addr = 0;
+ m->bank = i;

barrier();
- m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
+ m->status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));

/*
* Update storm tracking here, before checking for the
@@ -736,12 +746,12 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
* storm status.
*/
if (!mca_cfg.cmci_disabled)
- mce_track_storm(&m);
+ mce_track_storm(m);

/* If this entry is not valid, ignore it */
- if (!(m.status & MCI_STATUS_VAL)) {
+ if (!(m->status & MCI_STATUS_VAL)) {
if (smca_destat_is_valid(i)) {
- mce_read_aux(&m, i);
+ mce_read_aux(m, i);
goto clear_it;
}

@@ -752,7 +762,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
* If we are logging everything (at CPU online) or this
* is a corrected error, then we must log it.
*/
- if ((flags & MCP_UC) || !(m.status & MCI_STATUS_UC))
+ if ((flags & MCP_UC) || !(m->status & MCI_STATUS_UC))
goto log_it;

/*
@@ -762,20 +772,20 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
* everything else.
*/
if (!mca_cfg.ser) {
- if (m.status & MCI_STATUS_UC)
+ if (m->status & MCI_STATUS_UC)
continue;
goto log_it;
}

/* Log "not enabled" (speculative) errors */
- if (!(m.status & MCI_STATUS_EN))
+ if (!(m->status & MCI_STATUS_EN))
goto log_it;

/*
* Log UCNA (SDM: 15.6.3 "UCR Error Classification")
* UC == 1 && PCC == 0 && S == 0
*/
- if (!(m.status & MCI_STATUS_PCC) && !(m.status & MCI_STATUS_S))
+ if (!(m->status & MCI_STATUS_PCC) && !(m->status & MCI_STATUS_S))
goto log_it;

/*
@@ -791,23 +801,24 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
if (flags & MCP_DONTLOG)
goto clear_it;

- mce_read_aux(&m, i);
- m.severity = mce_severity(&m, NULL, NULL, false);
+ mce_read_aux(m, i);
+ m->severity = mce_severity(m, NULL, NULL, false);
+
/*
* Don't get the IP here because it's unlikely to
* have anything to do with the actual error location.
*/

- if (mca_cfg.dont_log_ce && !mce_usable_address(&m))
+ if (mca_cfg.dont_log_ce && !mce_usable_address(m))
goto clear_it;

if (flags & MCP_QUEUE_LOG)
- mce_gen_pool_add(&m);
+ mce_gen_pool_add(&err);
else
- mce_log(&m);
+ mce_log(&err);

clear_it:
- vendor_handle_error(&m);
+ vendor_handle_error(&err);

/*
* Clear state for this bank.
@@ -1044,6 +1055,7 @@ static noinstr int mce_timed_out(u64 *t, const char *msg)
static void mce_reign(void)
{
int cpu;
+ struct mce_hw_err *err = NULL;
struct mce *m = NULL;
int global_worst = 0;
char *msg = NULL;
@@ -1054,11 +1066,13 @@ static void mce_reign(void)
* Grade the severity of the errors of all the CPUs.
*/
for_each_possible_cpu(cpu) {
- struct mce *mtmp = &per_cpu(mces_seen, cpu);
+ struct mce_hw_err *etmp = &per_cpu(hw_errs_seen, cpu);
+ struct mce *mtmp = &etmp->m;

if (mtmp->severity > global_worst) {
global_worst = mtmp->severity;
- m = &per_cpu(mces_seen, cpu);
+ err = &per_cpu(hw_errs_seen, cpu);
+ m = &err->m;
}
}

@@ -1070,7 +1084,7 @@ static void mce_reign(void)
if (m && global_worst >= MCE_PANIC_SEVERITY) {
/* call mce_severity() to get "msg" for panic */
mce_severity(m, NULL, &msg, true);
- mce_panic("Fatal machine check", m, msg);
+ mce_panic("Fatal machine check", err, msg);
}

/*
@@ -1087,11 +1101,11 @@ static void mce_reign(void)
mce_panic("Fatal machine check from unknown source", NULL, NULL);

/*
- * Now clear all the mces_seen so that they don't reappear on
+ * Now clear all the hw_errs_seen so that they don't reappear on
* the next mce.
*/
for_each_possible_cpu(cpu)
- memset(&per_cpu(mces_seen, cpu), 0, sizeof(struct mce));
+ memset(&per_cpu(hw_errs_seen, cpu), 0, sizeof(struct mce_hw_err));
}

static atomic_t global_nwo;
@@ -1295,12 +1309,13 @@ static noinstr bool mce_check_crashing_cpu(void)
}

static __always_inline int
-__mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *final,
+__mc_scan_banks(struct mce_hw_err *err, struct pt_regs *regs, struct mce *final,
unsigned long *toclear, unsigned long *valid_banks, int no_way_out,
int *worst)
{
struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
struct mca_config *cfg = &mca_cfg;
+ struct mce *m = &err->m;
int severity, i, taint = 0;

for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
@@ -1356,7 +1371,7 @@ __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *final,
* done in #MC context, where instrumentation is disabled.
*/
instrumentation_begin();
- mce_log(m);
+ mce_log(err);
instrumentation_end();

if (severity > *worst) {
@@ -1426,8 +1441,9 @@ static void kill_me_never(struct callback_head *cb)
set_mce_nospec(pfn);
}

-static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
+static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(struct callback_head *))
{
+ struct mce *m = &err->m;
int count = ++current->mce_count;

/* First call, save all the details */
@@ -1441,11 +1457,12 @@ static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callba

/* Ten is likely overkill. Don't expect more than two faults before task_work() */
if (count > 10)
- mce_panic("Too many consecutive machine checks while accessing user data", m, msg);
+ mce_panic("Too many consecutive machine checks while accessing user data",
+ err, msg);

/* Second or later call, make sure page address matches the one from first call */
if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
- mce_panic("Consecutive machine checks to different user pages", m, msg);
+ mce_panic("Consecutive machine checks to different user pages", err, msg);

/* Do not call task_work_add() more than once */
if (count > 1)
@@ -1494,8 +1511,14 @@ noinstr void do_machine_check(struct pt_regs *regs)
int worst = 0, order, no_way_out, kill_current_task, lmce, taint = 0;
DECLARE_BITMAP(valid_banks, MAX_NR_BANKS) = { 0 };
DECLARE_BITMAP(toclear, MAX_NR_BANKS) = { 0 };
- struct mce m, *final;
+ struct mce_hw_err *final;
+ struct mce_hw_err err;
char *msg = NULL;
+ struct mce *m;
+
+ memset(&err, 0, sizeof(struct mce_hw_err));
+
+ m = &err.m;

if (unlikely(mce_flags.p5))
return pentium_machine_check(regs);
@@ -1533,13 +1556,13 @@ noinstr void do_machine_check(struct pt_regs *regs)

this_cpu_inc(mce_exception_count);

- mce_gather_info(&m, regs);
- m.tsc = rdtsc();
+ mce_gather_info(m, regs);
+ m->tsc = rdtsc();

- final = this_cpu_ptr(&mces_seen);
- *final = m;
+ final = this_cpu_ptr(&hw_errs_seen);
+ final->m = *m;

- no_way_out = mce_no_way_out(&m, &msg, valid_banks, regs);
+ no_way_out = mce_no_way_out(m, &msg, valid_banks, regs);

barrier();

@@ -1548,15 +1571,15 @@ noinstr void do_machine_check(struct pt_regs *regs)
* Assume the worst for now, but if we find the
* severity is MCE_AR_SEVERITY we have other options.
*/
- if (!(m.mcgstatus & MCG_STATUS_RIPV))
+ if (!(m->mcgstatus & MCG_STATUS_RIPV))
kill_current_task = 1;
/*
* Check if this MCE is signaled to only this logical processor,
* on Intel, Zhaoxin only.
*/
- if (m.cpuvendor == X86_VENDOR_INTEL ||
- m.cpuvendor == X86_VENDOR_ZHAOXIN)
- lmce = m.mcgstatus & MCG_STATUS_LMCES;
+ if (m->cpuvendor == X86_VENDOR_INTEL ||
+ m->cpuvendor == X86_VENDOR_ZHAOXIN)
+ lmce = m->mcgstatus & MCG_STATUS_LMCES;

/*
* Local machine check may already know that we have to panic.
@@ -1567,12 +1590,12 @@ noinstr void do_machine_check(struct pt_regs *regs)
*/
if (lmce) {
if (no_way_out)
- mce_panic("Fatal local machine check", &m, msg);
+ mce_panic("Fatal local machine check", &err, msg);
} else {
order = mce_start(&no_way_out);
}

- taint = __mc_scan_banks(&m, regs, final, toclear, valid_banks, no_way_out, &worst);
+ taint = __mc_scan_banks(&err, regs, &final->m, toclear, valid_banks, no_way_out, &worst);

if (!no_way_out)
mce_clear_state(toclear);
@@ -1587,7 +1610,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
no_way_out = worst >= MCE_PANIC_SEVERITY;

if (no_way_out)
- mce_panic("Fatal machine check on current CPU", &m, msg);
+ mce_panic("Fatal machine check on current CPU", &err, msg);
}
} else {
/*
@@ -1599,8 +1622,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
* make sure we have the right "msg".
*/
if (worst >= MCE_PANIC_SEVERITY) {
- mce_severity(&m, regs, &msg, true);
- mce_panic("Local fatal machine check!", &m, msg);
+ mce_severity(m, regs, &msg, true);
+ mce_panic("Local fatal machine check!", &err, msg);
}
}

@@ -1618,14 +1641,14 @@ noinstr void do_machine_check(struct pt_regs *regs)
goto out;

/* Fault was in user mode and we need to take some action */
- if ((m.cs & 3) == 3) {
+ if ((m->cs & 3) == 3) {
/* If this triggers there is no way to recover. Die hard. */
BUG_ON(!on_thread_stack() || !user_mode(regs));

- if (!mce_usable_address(&m))
- queue_task_work(&m, msg, kill_me_now);
+ if (!mce_usable_address(m))
+ queue_task_work(&err, msg, kill_me_now);
else
- queue_task_work(&m, msg, kill_me_maybe);
+ queue_task_work(&err, msg, kill_me_maybe);

} else {
/*
@@ -1637,13 +1660,13 @@ noinstr void do_machine_check(struct pt_regs *regs)
* corresponding exception handler which would do that is the
* proper one.
*/
- if (m.kflags & MCE_IN_KERNEL_RECOV) {
+ if (m->kflags & MCE_IN_KERNEL_RECOV) {
if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
- mce_panic("Failed kernel mode recovery", &m, msg);
+ mce_panic("Failed kernel mode recovery", &err, msg);
}

- if (m.kflags & MCE_IN_KERNEL_COPYIN)
- queue_task_work(&m, msg, kill_me_never);
+ if (m->kflags & MCE_IN_KERNEL_COPYIN)
+ queue_task_work(&err, msg, kill_me_never);
}

out:
diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c b/arch/x86/kernel/cpu/mce/dev-mcelog.c
index a05ac0716ecf..4a0e3bb4a4fb 100644
--- a/arch/x86/kernel/cpu/mce/dev-mcelog.c
+++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c
@@ -36,7 +36,7 @@ static DECLARE_WAIT_QUEUE_HEAD(mce_chrdev_wait);
static int dev_mce_log(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce *mce = &((struct mce_hw_err *)data)->m;
unsigned int entry;

if (mce->kflags & MCE_HANDLED_CEC)
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index 4284749ec803..3337ea5c428d 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -31,15 +31,15 @@ static LLIST_HEAD(mce_event_llist);
*/
static bool is_duplicate_mce_record(struct mce_evt_llist *t, struct mce_evt_llist *l)
{
+ struct mce_hw_err *err1, *err2;
struct mce_evt_llist *node;
- struct mce *m1, *m2;

- m1 = &t->mce;
+ err1 = &t->err;

llist_for_each_entry(node, &l->llnode, llnode) {
- m2 = &node->mce;
+ err2 = &node->err;

- if (!mce_cmp(m1, m2))
+ if (!mce_cmp(&err1->m, &err2->m))
return true;
}
return false;
@@ -73,9 +73,9 @@ struct llist_node *mce_gen_pool_prepare_records(void)

void mce_gen_pool_process(struct work_struct *__unused)
{
+ struct mce_hw_err *err;
struct llist_node *head;
struct mce_evt_llist *node, *tmp;
- struct mce *mce;

head = llist_del_all(&mce_event_llist);
if (!head)
@@ -83,8 +83,8 @@ void mce_gen_pool_process(struct work_struct *__unused)

head = llist_reverse_order(head);
llist_for_each_entry_safe(node, tmp, head, llnode) {
- mce = &node->mce;
- blocking_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+ err = &node->err;
+ blocking_notifier_call_chain(&x86_mce_decoder_chain, 0, err);
gen_pool_free(mce_evt_pool, (unsigned long)node, sizeof(*node));
}
}
@@ -94,11 +94,11 @@ bool mce_gen_pool_empty(void)
return llist_empty(&mce_event_llist);
}

-int mce_gen_pool_add(struct mce *mce)
+int mce_gen_pool_add(struct mce_hw_err *err)
{
struct mce_evt_llist *node;

- if (filter_mce(mce))
+ if (filter_mce(&err->m))
return -EINVAL;

if (!mce_evt_pool)
@@ -110,7 +110,7 @@ int mce_gen_pool_add(struct mce *mce)
return -ENOMEM;
}

- memcpy(&node->mce, mce, sizeof(*mce));
+ memcpy(&node->err, err, sizeof(*err));
llist_add(&node->llnode, &mce_event_llist);

return 0;
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 94953d749475..1905938e2fd5 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -498,6 +498,7 @@ static void prepare_msrs(void *info)

static void do_inject(void)
{
+ struct mce_hw_err err;
u64 mcg_status = 0;
unsigned int cpu = i_mce.extcpu;
u8 b = i_mce.bank;
@@ -513,7 +514,8 @@ static void do_inject(void)
i_mce.status |= MCI_STATUS_SYNDV;

if (inj_type == SW_INJ) {
- mce_log(&i_mce);
+ err.m = i_mce;
+ mce_log(&err);
return;
}

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 9802f7c6cc93..c9db046e7124 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -26,12 +26,12 @@ extern struct blocking_notifier_head x86_mce_decoder_chain;

struct mce_evt_llist {
struct llist_node llnode;
- struct mce mce;
+ struct mce_hw_err err;
};

void mce_gen_pool_process(struct work_struct *__unused);
bool mce_gen_pool_empty(void);
-int mce_gen_pool_add(struct mce *mce);
+int mce_gen_pool_add(struct mce_hw_err *err);
int mce_gen_pool_init(void);
struct llist_node *mce_gen_pool_prepare_records(void);

@@ -264,7 +264,7 @@ void mce_setup_for_cpu(unsigned int cpu, struct mce *m);
#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
bool amd_mce_usable_address(struct mce *m);
-void amd_handle_error(struct mce *m);
+void amd_handle_error(struct mce_hw_err *err);

/*
* If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
@@ -293,7 +293,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)
#else
static inline bool amd_filter_mce(struct mce *m) { return false; }
static inline bool amd_mce_usable_address(struct mce *m) { return false; }
-static inline void amd_handle_error(struct mce *m) { }
+static inline void amd_handle_error(struct mce_hw_err *err) { }
static inline void smca_extract_err_addr(struct mce *m) { }
#endif

diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index ca87a0939135..4864191918db 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -134,7 +134,7 @@ static int print_extlog_rcd(const char *pfx,
static int extlog_print(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce *mce = &((struct mce_hw_err *)data)->m;
int bank = mce->bank;
int cpu = mce->extcpu;
struct acpi_hest_generic_status *estatus, *tmp;
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index d48a388b796e..18916a73a363 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -13,8 +13,9 @@
static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce_hw_err *err = (struct mce_hw_err *)data;
struct acpi_nfit_desc *acpi_desc;
+ struct mce *mce = &err->m;
struct nfit_spa *nfit_spa;

/* We only care about uncorrectable memory errors */
diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 91e0a88ef904..d1e47cba0ff2 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -1810,7 +1810,7 @@ static void i7core_check_error(struct mem_ctl_info *mci, struct mce *m)
static int i7core_mce_check_error(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce *mce = &((struct mce_hw_err *)data)->m;
struct i7core_dev *i7_dev;
struct mem_ctl_info *mci;

diff --git a/drivers/edac/igen6_edac.c b/drivers/edac/igen6_edac.c
index cdd8480e7368..2c112d4d842b 100644
--- a/drivers/edac/igen6_edac.c
+++ b/drivers/edac/igen6_edac.c
@@ -911,7 +911,7 @@ static int ecclog_nmi_handler(unsigned int cmd, struct pt_regs *regs)
static int ecclog_mce_handler(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce *mce = &((struct mce_hw_err *)data)->m;
char *type;

if (mce->kflags & MCE_HANDLED_CEC)
diff --git a/drivers/edac/pnd2_edac.c b/drivers/edac/pnd2_edac.c
index 2afcd148fcf8..e2fb2d75af04 100644
--- a/drivers/edac/pnd2_edac.c
+++ b/drivers/edac/pnd2_edac.c
@@ -1366,7 +1366,7 @@ static void pnd2_unregister_mci(struct mem_ctl_info *mci)
*/
static int pnd2_mce_check_error(struct notifier_block *nb, unsigned long val, void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce *mce = &((struct mce_hw_err *)data)->m;
struct mem_ctl_info *mci;
struct dram_addr daddr;
char *type;
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 26cca5a9322d..0c4e45245153 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3255,7 +3255,7 @@ static void sbridge_mce_output_error(struct mem_ctl_info *mci,
static int sbridge_mce_check_error(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce *mce = &((struct mce_hw_err *)data)->m;
struct mem_ctl_info *mci;
char *type;

diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 9c5b6f8bd8bd..e0a4a1ecd25e 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -633,7 +633,7 @@ static bool skx_error_in_mem(const struct mce *m)
int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *mce = (struct mce *)data;
+ struct mce *mce = &((struct mce_hw_err *)data)->m;
struct decoded_addr res;
struct mem_ctl_info *mci;
char *type;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index c543600b759b..7c3e7ce811ce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -3536,7 +3536,7 @@ static struct amdgpu_device *find_adev(uint32_t node_id)
static int amdgpu_bad_page_notifier(struct notifier_block *nb,
unsigned long val, void *data)
{
- struct mce *m = (struct mce *)data;
+ struct mce *m = &((struct mce_hw_err *)data)->m;
struct amdgpu_device *adev = NULL;
uint32_t gpu_id = 0;
uint32_t umc_inst = 0, ch_inst = 0;
diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 271dfad05d68..d3ce41a46ac4 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -400,7 +400,7 @@ static void retire_dram_row(u64 addr, u64 id, u32 cpu)

static int fru_handle_mem_poison(struct notifier_block *nb, unsigned long val, void *data)
{
- struct mce *m = (struct mce *)data;
+ struct mce *m = &((struct mce_hw_err *)data)->m;
struct fru_rec *rec;

if (!mce_is_memory_error(m))
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index e440b15fbabc..4940e97fbcdc 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -534,7 +534,8 @@ static int __init create_debugfs_nodes(void)
static int cec_notifier(struct notifier_block *nb, unsigned long val,
void *data)
{
- struct mce *m = (struct mce *)data;
+ struct mce_hw_err *err = (struct mce_hw_err *)data;
+ struct mce *m = &err->m;

if (!m)
return NOTIFY_DONE;
diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index f0f7b3cb2041..65aba1afcd07 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -19,9 +19,9 @@

TRACE_EVENT(mce_record,

- TP_PROTO(struct mce *m),
+ TP_PROTO(struct mce_hw_err *err),

- TP_ARGS(m),
+ TP_ARGS(err),

TP_STRUCT__entry(
__field( u64, mcgcap )
@@ -46,25 +46,25 @@ TRACE_EVENT(mce_record,
),

TP_fast_assign(
- __entry->mcgcap = m->mcgcap;
- __entry->mcgstatus = m->mcgstatus;
- __entry->status = m->status;
- __entry->addr = m->addr;
- __entry->misc = m->misc;
- __entry->synd = m->synd;
- __entry->ipid = m->ipid;
- __entry->ip = m->ip;
- __entry->tsc = m->tsc;
- __entry->ppin = m->ppin;
- __entry->walltime = m->time;
- __entry->cpu = m->extcpu;
- __entry->cpuid = m->cpuid;
- __entry->apicid = m->apicid;
- __entry->socketid = m->socketid;
- __entry->cs = m->cs;
- __entry->bank = m->bank;
- __entry->cpuvendor = m->cpuvendor;
- __entry->microcode = m->microcode;
+ __entry->mcgcap = err->m.mcgcap;
+ __entry->mcgstatus = err->m.mcgstatus;
+ __entry->status = err->m.status;
+ __entry->addr = err->m.addr;
+ __entry->misc = err->m.misc;
+ __entry->synd = err->m.synd;
+ __entry->ipid = err->m.ipid;
+ __entry->ip = err->m.ip;
+ __entry->tsc = err->m.tsc;
+ __entry->ppin = err->m.ppin;
+ __entry->walltime = err->m.time;
+ __entry->cpu = err->m.extcpu;
+ __entry->cpuid = err->m.cpuid;
+ __entry->apicid = err->m.apicid;
+ __entry->socketid = err->m.socketid;
+ __entry->cs = err->m.cs;
+ __entry->bank = err->m.bank;
+ __entry->cpuvendor = err->m.cpuvendor;
+ __entry->microcode = err->m.microcode;
),

TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x, microcode: %x",
--
2.34.1


2024-04-04 15:43:09

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH v2 09/16] x86/mce: Unify AMD THR handler with MCA Polling

AMD systems optionally support an MCA Thresholding interrupt. The
interrupt should be used as another signal to trigger MCA polling. This
is similar to how the Intel Corrected Machine Check interrupt (CMCI) is
handled.

AMD MCA Thresholding is managed using the MCA_MISC registers within an
MCA bank. The OS will need to modify the hardware error count field in
order to reset the threshold limit and rearm the interrupt. Management
of the MCA_MISC register should be done as a follow-up to the basic MCA
polling flow. It should not be the main focus of the interrupt handler.

Furthermore, future systems will have the ability to send an MCA
Thresholding interrupt to the OS even when the OS does not manage the
feature, i.e. MCA_MISC registers are Read-as-Zero/Locked.

Call the common MCA polling function when handling the MCA Thresholding
interrupt. This will allow the OS to find any valid errors whether or
not the MCA Thresholding feature is OS-managed. Also, this allows the
common MCA polling options and kernel parameters to apply to AMD
systems.

Add a callback to the MCA polling function to handle vendor-specific
operations. Start by handling the AMD MCA Thresholding "block reset"
flow.
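
For illustration, the interrupt handler then reduces to a call into the
common poller, and the "block reset" is done from the new vendor
callback (a minimal sketch using the names from the diff below):

        static void amd_threshold_interrupt(void)
        {
                /* Poll all banks; the common code logs any valid errors found. */
                machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
        }

        void amd_handle_error(struct mce *m)
        {
                /* Rearm the thresholding blocks of the bank that logged the error. */
                reset_thr_blocks(m->bank);
        }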

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* No change.

arch/x86/kernel/cpu/mce/amd.c | 57 ++++++++++++++----------------
arch/x86/kernel/cpu/mce/core.c | 8 +++++
arch/x86/kernel/cpu/mce/internal.h | 2 ++
3 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index f59f4a1c9b21..75195d6fe971 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -979,12 +979,7 @@ static void amd_deferred_error_interrupt(void)
log_error_deferred(bank);
}

-static void log_error_thresholding(unsigned int bank, u64 misc)
-{
- _log_error_deferred(bank, misc);
-}
-
-static void log_and_reset_block(struct threshold_block *block)
+static void reset_block(struct threshold_block *block)
{
struct thresh_restart tr;
u32 low = 0, high = 0;
@@ -998,49 +993,51 @@ static void log_and_reset_block(struct threshold_block *block)
if (!(high & MASK_OVERFLOW_HI))
return;

- /* Log the MCE which caused the threshold event. */
- log_error_thresholding(block->bank, ((u64)high << 32) | low);
-
/* Reset threshold block after logging error. */
memset(&tr, 0, sizeof(tr));
tr.b = block;
threshold_restart_bank(&tr);
}

-/*
- * Threshold interrupt handler will service THRESHOLD_APIC_VECTOR. The interrupt
- * goes off when error_count reaches threshold_limit.
- */
-static void amd_threshold_interrupt(void)
+static void reset_thr_blocks(unsigned int bank)
{
struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
struct threshold_bank **bp = this_cpu_read(threshold_banks);
- unsigned int bank, cpu = smp_processor_id();

/*
* Validate that the threshold bank has been initialized already. The
* handler is installed at boot time, but on a hotplug event the
* interrupt might fire before the data has been initialized.
*/
- if (!bp)
+ if (!bp || !bp[bank])
return;

- for (bank = 0; bank < this_cpu_read(mce_num_banks); ++bank) {
- if (!(per_cpu(bank_map, cpu) & BIT_ULL(bank)))
- continue;
+ first_block = bp[bank]->blocks;
+ if (!first_block)
+ return;

- first_block = bp[bank]->blocks;
- if (!first_block)
- continue;
+ /*
+ * The first block is also the head of the list. Check it first
+ * before iterating over the rest.
+ */
+ reset_block(first_block);
+ list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
+ reset_block(block);
+}

- /*
- * The first block is also the head of the list. Check it first
- * before iterating over the rest.
- */
- log_and_reset_block(first_block);
- list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
- log_and_reset_block(block);
- }
+/*
+ * Threshold interrupt handler will service THRESHOLD_APIC_VECTOR. The interrupt
+ * goes off when error_count reaches threshold_limit.
+ */
+static void amd_threshold_interrupt(void)
+{
+ /* Check all banks for now. This could be optimized in the future. */
+ machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
+}
+
+void amd_handle_error(struct mce *m)
+{
+ reset_thr_blocks(m->bank);
}

/*
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7a857b33f515..75297e7eb980 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -672,6 +672,12 @@ static noinstr void mce_read_aux(struct mce *m, int i)
}
}

+static void vendor_handle_error(struct mce *m)
+{
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return amd_handle_error(m);
+}
+
DEFINE_PER_CPU(unsigned, mce_poll_count);

/*
@@ -787,6 +793,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
mce_log(&m);

clear_it:
+ vendor_handle_error(&m);
+
/*
* Clear state for this bank.
*/
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index e86e53695828..96b108175ca2 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -267,6 +267,7 @@ void mce_setup_for_cpu(unsigned int cpu, struct mce *m);
#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
bool amd_mce_usable_address(struct mce *m);
+void amd_handle_error(struct mce *m);

/*
* If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
@@ -295,6 +296,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)
#else
static inline bool amd_filter_mce(struct mce *m) { return false; }
static inline bool amd_mce_usable_address(struct mce *m) { return false; }
+static inline void amd_handle_error(struct mce *m) { }
static inline void smca_extract_err_addr(struct mce *m) { }
#endif

--
2.34.1


2024-04-04 15:48:16

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH v2 16/16] EDAC/mce_amd: Add support for FRU Text in MCA

A new "FRU Text in MCA" feature is defined where the Field Replaceable
Unit (FRU) Text for a device is represented by a string in the new
MCA_SYND1 and MCA_SYND2 registers. This feature is supported per MCA
bank, and it is advertised by the McaFruTextInMca bit (MCA_CONFIG[9]).

The FRU Text is populated dynamically for each individual error state
(MCA_STATUS, MCA_ADDR, et al.). This handles the case where an MCA bank
covers multiple devices, for example, a Unified Memory Controller (UMC)
bank that manages two DIMMs.

Print the FRU Text string, if available, when decoding an MCA error.

Also, add a field for the MCA_CONFIG MSR to struct mce_hw_err as
vendor-specific error information and save the value of the MSR. This
value can then be exported through the tracepoint so that userspace
tools like rasdaemon can print the FRU Text, if available.
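
For illustration, the decoder builds the printable FRU Text by
concatenating the two 8-byte syndrome registers into a NUL-terminated
buffer, roughly as done in the mce_amd.c hunk below:

        char frutext[17];

        memset(frutext, 0, sizeof(frutext));
        memcpy(&frutext[0], &err->vi.amd.synd1, 8);
        memcpy(&frutext[8], &err->vi.amd.synd2, 8);

        pr_emerg(HW_ERR "FRU Text: %s", frutext);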

Co-developed-by: Avadhut Naik <[email protected]>
Signed-off-by: Avadhut Naik <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* No change.

arch/x86/include/asm/mce.h | 2 ++
arch/x86/kernel/cpu/mce/apei.c | 2 ++
arch/x86/kernel/cpu/mce/core.c | 3 +++
drivers/edac/mce_amd.c | 21 ++++++++++++++-------
4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index a701290f80a1..2a8997d7ba4d 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -59,6 +59,7 @@
* - TCC bit is present in MCx_STATUS.
*/
#define MCI_CONFIG_MCAX 0x1
+#define MCI_CONFIG_FRUTEXT BIT_ULL(9)

/*
* Note that the full MCACOD field of IA32_MCi_STATUS MSR is
@@ -195,6 +196,7 @@ struct mce_hw_err {
struct {
u64 synd1;
u64 synd2;
+ u64 config;
} amd;
} vi;
};
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 43622241c379..a9c28614530b 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -154,6 +154,8 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
fallthrough;
/* MCA_CONFIG */
case 4:
+ err.vi.amd.config = *(i_mce + 3);
+ fallthrough;
/* MCA_MISC0 */
case 3:
m->misc = *(i_mce + 2);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index aa27729f7899..a4d09dda5fae 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -207,6 +207,8 @@ static void __print_mce(struct mce_hw_err *err)
pr_cont("SYND2 %llx ", err->vi.amd.synd2);
if (m->ipid)
pr_cont("IPID %llx ", m->ipid);
+ if (err->vi.amd.config)
+ pr_cont("CONFIG %llx ", err->vi.amd.config);
}

pr_cont("\n");
@@ -679,6 +681,7 @@ static noinstr void mce_read_aux(struct mce_hw_err *err, int i)

if (mce_flags.smca) {
m->ipid = mce_rdmsrl(MSR_AMD64_SMCA_MCx_IPID(i));
+ err->vi.amd.config = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(i));

if (m->status & MCI_STATUS_SYNDV) {
m->synd = mce_rdmsrl(MSR_AMD64_SMCA_MCx_SYND(i));
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 32bf4cc564a3..f68b3d1b558e 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -795,6 +795,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
struct mce_hw_err *err = (struct mce_hw_err *)data;
struct mce *m = &err->m;
unsigned int fam = x86_family(m->cpuid);
+ u64 mca_config = err->vi.amd.config;
int ecc;

if (m->kflags & MCE_HANDLED_CEC)
@@ -814,11 +815,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
((m->status & MCI_STATUS_PCC) ? "PCC" : "-"));

if (boot_cpu_has(X86_FEATURE_SMCA)) {
- u32 low, high;
- u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
-
- if (!rdmsr_safe(addr, &low, &high) &&
- (low & MCI_CONFIG_MCAX))
+ if (mca_config & MCI_CONFIG_MCAX)
pr_cont("|%s", ((m->status & MCI_STATUS_TCC) ? "TCC" : "-"));

pr_cont("|%s", ((m->status & MCI_STATUS_SYNDV) ? "SyndV" : "-"));
@@ -853,8 +850,18 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)

if (m->status & MCI_STATUS_SYNDV) {
pr_cont(", Syndrome: 0x%016llx\n", m->synd);
- pr_emerg(HW_ERR "Syndrome1: 0x%016llx, Syndrome2: 0x%016llx",
- err->vi.amd.synd1, err->vi.amd.synd2);
+ if (mca_config & MCI_CONFIG_FRUTEXT) {
+ char frutext[17];
+
+ memset(frutext, 0, sizeof(frutext));
+ memcpy(&frutext[0], &err->vi.amd.synd1, 8);
+ memcpy(&frutext[8], &err->vi.amd.synd2, 8);
+
+ pr_emerg(HW_ERR "FRU Text: %s", frutext);
+ } else {
+ pr_emerg(HW_ERR "Syndrome1: 0x%016llx, Syndrome2: 0x%016llx",
+ err->vi.amd.synd1, err->vi.amd.synd2);
+ }
}

pr_cont("\n");
--
2.34.1


2024-04-04 15:49:45

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH v2 01/16] x86/mce: Define mce_setup() helpers for common and per-CPU fields

Generally, MCA information for an error is gathered on the CPU that
reported the error. In this case, CPU-specific information from the
running CPU will be correct.

However, this will be incorrect if the MCA information is gathered while
running on a CPU that didn't report the error. One example is creating
an MCA record using mce_setup() for errors reported from ACPI.

Split mce_setup() so that there is a helper function to gather common,
i.e. not CPU-specific, information and another helper for CPU-specific
information.

Leave mce_setup() defined as-is for the common case when running on the
reporting CPU.

Get MCG_CAP in the common helper even though the register is per-CPU.
This value is not already cached per-CPU like the other values, and it
does not assist with any per-CPU decoding or handling.

Signed-off-by: Yazen Ghannam <[email protected]>
---

Notes:
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Change helper names and pass-in CPU number (Boris)

arch/x86/kernel/cpu/mce/core.c | 34 ++++++++++++++++++++----------
arch/x86/kernel/cpu/mce/internal.h | 2 ++
2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index b5cc557cfc37..7a857b33f515 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -117,20 +117,32 @@ static struct irq_work mce_irq_work;
*/
BLOCKING_NOTIFIER_HEAD(x86_mce_decoder_chain);

-/* Do initial initialization of a struct mce */
-void mce_setup(struct mce *m)
+void mce_setup_common(struct mce *m)
{
memset(m, 0, sizeof(struct mce));
- m->cpu = m->extcpu = smp_processor_id();
+
+ m->cpuid = cpuid_eax(1);
+ m->cpuvendor = boot_cpu_data.x86_vendor;
+ m->mcgcap = __rdmsr(MSR_IA32_MCG_CAP);
/* need the internal __ version to avoid deadlocks */
- m->time = __ktime_get_real_seconds();
- m->cpuvendor = boot_cpu_data.x86_vendor;
- m->cpuid = cpuid_eax(1);
- m->socketid = cpu_data(m->extcpu).topo.pkg_id;
- m->apicid = cpu_data(m->extcpu).topo.initial_apicid;
- m->mcgcap = __rdmsr(MSR_IA32_MCG_CAP);
- m->ppin = cpu_data(m->extcpu).ppin;
- m->microcode = boot_cpu_data.microcode;
+ m->time = __ktime_get_real_seconds();
+}
+
+void mce_setup_for_cpu(unsigned int cpu, struct mce *m)
+{
+ m->cpu = cpu;
+ m->extcpu = cpu;
+ m->apicid = cpu_data(m->extcpu).topo.initial_apicid;
+ m->microcode = cpu_data(m->extcpu).microcode;
+ m->ppin = cpu_data(m->extcpu).ppin;
+ m->socketid = cpu_data(m->extcpu).topo.pkg_id;
+}
+
+/* Do initial initialization of a struct mce */
+void mce_setup(struct mce *m)
+{
+ mce_setup_common(m);
+ mce_setup_for_cpu(smp_processor_id(), m);
}

DEFINE_PER_CPU(struct mce, injectm);
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 01f8f03969e6..e86e53695828 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -261,6 +261,8 @@ enum mca_msr {

/* Decide whether to add MCE record to MCE event pool or filter it out. */
extern bool filter_mce(struct mce *m);
+void mce_setup_common(struct mce *m);
+void mce_setup_for_cpu(unsigned int cpu, struct mce *m);

#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
--
2.34.1


2024-04-05 16:06:49

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2 16/16] EDAC/mce_amd: Add support for FRU Text in MCA

> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index aa27729f7899..a4d09dda5fae 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -207,6 +207,8 @@ static void __print_mce(struct mce_hw_err *err)
> pr_cont("SYND2 %llx ", err->vi.amd.synd2);
> if (m->ipid)
> pr_cont("IPID %llx ", m->ipid);
> + if (err->vi.amd.config)

This is in common code. If other vendors start adding their own stuff to the
"vi" union you might incorrectly print this. Add a vendor check before looking
at values inside "m->vi".

> + pr_cont("CONFIG %llx ", err->vi.amd.config);
> }
>
> pr_cont("\n");

2024-04-07 13:42:23

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 16/16] EDAC/mce_amd: Add support for FRU Text in MCA



On 4/5/24 12:06, Luck, Tony wrote:
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index aa27729f7899..a4d09dda5fae 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -207,6 +207,8 @@ static void __print_mce(struct mce_hw_err *err)
>> pr_cont("SYND2 %llx ", err->vi.amd.synd2);
>> if (m->ipid)
>> pr_cont("IPID %llx ", m->ipid);
>> + if (err->vi.amd.config)
>
> This is in common code. If other vendors start adding their own stuff to the
> "vi" union you might incorrectly print this. Add a vendor check before looking
> at values inside "m->vi".
>

Yes, agreed. Will do.

Thanks,
Yazen

2024-04-08 19:48:11

by Naik, Avadhut

[permalink] [raw]
Subject: Re: [PATCH v2 16/16] EDAC/mce_amd: Add support for FRU Text in MCA



On 4/5/2024 11:06, Luck, Tony wrote:
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index aa27729f7899..a4d09dda5fae 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -207,6 +207,8 @@ static void __print_mce(struct mce_hw_err *err)
>> pr_cont("SYND2 %llx ", err->vi.amd.synd2);
>> if (m->ipid)
>> pr_cont("IPID %llx ", m->ipid);
>> + if (err->vi.amd.config)
>
> This is in common code. If other vendors start adding their own stuff to the
> "vi" union you might incorrectly print this. Add a vendor check before looking
> at values inside "m->vi".
>

Do we really need an explicit vendor check in this particular instance?

Below is a snippet from __print_mce() after applying this series:

if (mce_flags.smca) {
if (m->synd)
pr_cont("SYND %llx ", m->synd);
if (err->vi.amd.synd1)
pr_cont("SYND1 %llx ", err->vi.amd.synd1);
if (err->vi.amd.synd2)
pr_cont("SYND2 %llx ", err->vi.amd.synd2);
if (m->ipid)
pr_cont("IPID %llx ", m->ipid);
if (err->vi.amd.config)
pr_cont("CONFIG %llx ", err->vi.amd.config);
}

pr_cont("\n");

All of the above registers including the newly added config MSR will only
be logged if the smca flag is set in mce_flags.
Doesn't that already serve as a vendor check of sorts?
Something that I am missing here?

--
Thanks,
Avadhut Naik

2024-04-08 20:06:02

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2 16/16] EDAC/mce_amd: Add support for FRU Text in MCA

> All of the above registers including the newly added config MSR will only
> be logged if the smca flag is set in mce_flags.
> Doesn't that already serve as a vendor check of sorts?
> Something that I am missing here?


Avadhut,

Yes. That's a sufficient vendor check. I was looking at the bits in the patch,
not at the broader context. Sorry for the noise.

-Tony

2024-04-16 10:03:07

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] x86/mce: Define mce_setup() helpers for common and per-CPU fields

On Thu, Apr 04, 2024 at 10:13:44AM -0500, Yazen Ghannam wrote:
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index b5cc557cfc37..7a857b33f515 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -117,20 +117,32 @@ static struct irq_work mce_irq_work;
> */
> BLOCKING_NOTIFIER_HEAD(x86_mce_decoder_chain);
>
> -/* Do initial initialization of a struct mce */
> -void mce_setup(struct mce *m)
> +void mce_setup_common(struct mce *m)

Since we're touching this...

mce_setup() is a perfectly wrong name for what it does. So let's clean
it up. Diff ontop below.

* mce_prep_record() - the name says what the function does.

* mce_prep_record_per_cpu() - "per_cpu" as this is a common kernel
concept and we do use per_cpu data in there.

Please do this in two patches:

- the first one renames mce_setup() only without adding the additional
functionality

- the second one does the split

Thx.

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index dfd2e9699bd7..491f3d78c46a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -221,7 +221,7 @@ static inline int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
u64 lapic_id) { return -EINVAL; }
#endif

-void mce_setup(struct mce *m);
+void mce_prep_record(struct mce *m);
void mce_log(struct mce *m);
DECLARE_PER_CPU(struct device *, mce_device);

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 9a0133ef7e20..14bf8c232e45 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -780,7 +780,7 @@ static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
{
struct mce m;

- mce_setup(&m);
+ mce_prep_record(&m);

m.status = status;
m.misc = misc;
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 7f7309ff67d0..8f509c8a4e98 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -44,7 +44,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
else
lsb = PAGE_SHIFT;

- mce_setup(&m);
+ mce_prep_record(&m);
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
@@ -97,7 +97,7 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
if (ctx_info->reg_arr_size < 48)
return -EINVAL;

- mce_setup(&m);
+ mce_prep_record(&m);

m.extcpu = -1;
m.socketid = -1;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index c0ce2de7fb51..a89508327b0d 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -117,7 +117,7 @@ static struct irq_work mce_irq_work;
*/
BLOCKING_NOTIFIER_HEAD(x86_mce_decoder_chain);

-void mce_setup_common(struct mce *m)
+void mce_prep_record_common(struct mce *m)
{
memset(m, 0, sizeof(struct mce));

@@ -128,21 +128,21 @@ void mce_setup_common(struct mce *m)
m->time = __ktime_get_real_seconds();
}

-void mce_setup_for_cpu(unsigned int cpu, struct mce *m)
+void mce_prep_record_per_cpu(unsigned int cpu, struct mce *m)
{
- m->cpu = cpu;
- m->extcpu = cpu;
- m->apicid = cpu_data(m->extcpu).topo.initial_apicid;
- m->microcode = cpu_data(m->extcpu).microcode;
- m->ppin = cpu_data(m->extcpu).ppin;
- m->socketid = cpu_data(m->extcpu).topo.pkg_id;
+ m->cpu = cpu;
+ m->extcpu = cpu;
+ m->apicid = cpu_data(m->extcpu).topo.initial_apicid;
+ m->microcode = cpu_data(m->extcpu).microcode;
+ m->ppin = cpu_data(m->extcpu).ppin;
+ m->socketid = cpu_data(m->extcpu).topo.pkg_id;
}

/* Do initial initialization of a struct mce */
-void mce_setup(struct mce *m)
+void mce_prep_record(struct mce *m)
{
- mce_setup_common(m);
- mce_setup_for_cpu(smp_processor_id(), m);
+ mce_prep_record_common(m);
+ mce_prep_record_per_cpu(smp_processor_id(), m);
}

DEFINE_PER_CPU(struct mce, injectm);
@@ -448,11 +448,11 @@ static noinstr void mce_wrmsrl(u32 msr, u64 v)
static noinstr void mce_gather_info(struct mce *m, struct pt_regs *regs)
{
/*
- * Enable instrumentation around mce_setup() which calls external
+ * Enable instrumentation around mce_prep_record() which calls external
* facilities.
*/
instrumentation_begin();
- mce_setup(m);
+ mce_prep_record(m);
instrumentation_end();

m->mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index e86e53695828..43c7f3b71df5 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -261,8 +261,8 @@ enum mca_msr {

/* Decide whether to add MCE record to MCE event pool or filter it out. */
extern bool filter_mce(struct mce *m);
-void mce_setup_common(struct mce *m);
-void mce_setup_for_cpu(unsigned int cpu, struct mce *m);
+void mce_prep_record_common(struct mce *m);
+void mce_prep_record_per_cpu(unsigned int cpu, struct mce *m);

#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-17 14:04:37

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] x86/mce: Define mce_setup() helpers for common and per-CPU fields



On 4/16/24 06:02, Borislav Petkov wrote:
> On Thu, Apr 04, 2024 at 10:13:44AM -0500, Yazen Ghannam wrote:
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index b5cc557cfc37..7a857b33f515 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -117,20 +117,32 @@ static struct irq_work mce_irq_work;
>> */
>> BLOCKING_NOTIFIER_HEAD(x86_mce_decoder_chain);
>>
>> -/* Do initial initialization of a struct mce */
>> -void mce_setup(struct mce *m)
>> +void mce_setup_common(struct mce *m)
>
> Since we're touching this...
>
> mce_setup() is a perfectly wrong name for what it does. So let's clean
> it up. Diff ontop below.
>
> * mce_prep_record() - the name says what the function does.
>
> * mce_prep_record_per_cpu() - "per_cpu" as this is a common kernel
> concept and we do use per_cpu data in there.
>
> Please do this in two patches:
>
> - the first one renames mce_setup() only without adding the additional
> functionality
>
> - the second one does the split
>
> Thx.
>

Okay, will do.

Should I send another revision of this entire set? Or should I split out
the mce_setup() patches?

Thanks,
Yazen

2024-04-22 08:17:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] x86/mce: Define mce_setup() helpers for common and per-CPU fields

On Wed, Apr 17, 2024 at 09:50:58AM -0400, Yazen Ghannam wrote:
> Should I send another revision of this entire set? Or should I split out
> the mce_setup() patches?

I leave it up to you.

If it makes sense to keep it all together, then wait for me to go through
the rest first and send a whole new set; or if you want to break this
out, that's fine too.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-23 19:07:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 05/16] x86/mce/amd: Clean up SMCA configuration

On Thu, Apr 04, 2024 at 10:13:48AM -0500, Yazen Ghannam wrote:
> + /*
> + * OS is required to set the MCAX enable bit to acknowledge that it is
> + * now using the new MSR ranges and new registers under each
> + * bank. It also means that the OS will configure deferred
> + * errors in the new MCA_CONFIG register. If the bit is not set,
> + * uncorrectable errors will cause a system panic.
> + */
> + mca_config |= FIELD_PREP(CFG_MCAX_EN, 0x1);

Can we please drop this cryptic crap?

mca_config |= SMCA_MCI_CONFIG_MCAX_ENABLE;

if I trust the PPR.

> - this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = !!(low & BIT(8));
> + /*
> + * SMCA sets the Deferred Error Interrupt type per bank.
> + *
> + * MCA_CONFIG[DeferredIntTypeSupported] is bit 5, and tells us
> + * if the DeferredIntType bit field is available.
> + *
> + * MCA_CONFIG[DeferredIntType] is bits [38:37]. OS should set
> + * this to 0x1 to enable APIC based interrupt. First, check that
> + * no interrupt has been set.
> + */
> + if (FIELD_GET(CFG_DFR_INT_SUPP, mca_config) && !FIELD_GET(CFG_DFR_INT_TYPE, mca_config))
> + mca_config |= FIELD_PREP(CFG_DFR_INT_TYPE, INTR_TYPE_APIC);

Ditto:

if (mca_config & CFG_DFR_INT_SUPP &&
!(mca_config & CFG_DFR_INT_TYPE))
mca_config |= CFG_DFR_INT_TYPE | CFG_INTR_TYPE_APIC;

Plain and simple. Not this unreadable mess.

And use proper prefixes with the register name in them:

SMCA_MCI_CONFIG_

or so, for example.

>
> - wrmsr(smca_config, low, high);
> - }
> + if (FIELD_GET(CFG_LSB_IN_STATUS, mca_config))
> + this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = true;
> +
> + wrmsrl(MSR_AMD64_SMCA_MCx_CONFIG(bank), mca_config);
> +}
> +
> +static void smca_configure_old(unsigned int bank, unsigned int cpu)

Yah, at the end of the patchset there should be patches which remove all
the _old things. Not in a different patchset. You can split things
accordingly.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-23 19:33:35

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 05/16] x86/mce/amd: Clean up SMCA configuration

On 4/23/2024 3:06 PM, Borislav Petkov wrote:
> On Thu, Apr 04, 2024 at 10:13:48AM -0500, Yazen Ghannam wrote:
>> + /*
>> + * OS is required to set the MCAX enable bit to acknowledge that it is
>> + * now using the new MSR ranges and new registers under each
>> + * bank. It also means that the OS will configure deferred
>> + * errors in the new MCA_CONFIG register. If the bit is not set,
>> + * uncorrectable errors will cause a system panic.
>> + */
>> + mca_config |= FIELD_PREP(CFG_MCAX_EN, 0x1);
>
> Can we please drop this cryptic crap?
>
> mca_config |= SMCA_MCI_CONFIG_MCAX_ENABLE;
>
> if I trust the PPR.
>

Okay.

>> - this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = !!(low & BIT(8));
>> + /*
>> + * SMCA sets the Deferred Error Interrupt type per bank.
>> + *
>> + * MCA_CONFIG[DeferredIntTypeSupported] is bit 5, and tells us
>> + * if the DeferredIntType bit field is available.
>> + *
>> + * MCA_CONFIG[DeferredIntType] is bits [38:37]. OS should set
>> + * this to 0x1 to enable APIC based interrupt. First, check that
>> + * no interrupt has been set.
>> + */
>> + if (FIELD_GET(CFG_DFR_INT_SUPP, mca_config) && !FIELD_GET(CFG_DFR_INT_TYPE, mca_config))
>> + mca_config |= FIELD_PREP(CFG_DFR_INT_TYPE, INTR_TYPE_APIC);
>
> Ditto:
>
> if (mca_config & CFG_DFR_INT_SUPP &&
> !(mca_config & CFG_DFR_INT_TYPE))
> mca_config |= CFG_DFR_INT_TYPE | CFG_INTR_TYPE_APIC;
>
> Plain and simple. Not this unreadable mess.
>

This is not the same.

"CFG_DFR_INT_TYPE" is a register field.

"INTR_TYPE_APIC" is a value. And this same value can be used in other register
fields.

I think it's fair to just use logical AND for single bit checks instead of the
FIELD_GET() macro.

But the FIELD_PREP() macro does help for setting bitfields. I think it's
clearer than manually doing the proper shifts and masks.
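
For example, a minimal illustration (assuming the field masks are defined
with GENMASK_ULL and <linux/bitfield.h> is included):

        /* MCA_CONFIG[DeferredIntType] is bits [38:37]. */
        #define CFG_DFR_INT_TYPE        GENMASK_ULL(38, 37)
        #define INTR_TYPE_APIC          0x1

        /* With the bitfield helper: */
        mca_config |= FIELD_PREP(CFG_DFR_INT_TYPE, INTR_TYPE_APIC);

        /* Open-coded equivalent: */
        mca_config |= (u64)INTR_TYPE_APIC << 37;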

> And use proper prefixes with the register name in them:
>
> SMCA_MCI_CONFIG_
>
> or so, for example.
>

Okay. I was thinking to keep the names shorter since they are only used in
this file. But I'll change them.

>>
>> - wrmsr(smca_config, low, high);
>> - }
>> + if (FIELD_GET(CFG_LSB_IN_STATUS, mca_config))
>> + this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = true;
>> +
>> + wrmsrl(MSR_AMD64_SMCA_MCx_CONFIG(bank), mca_config);
>> +}
>> +
>> +static void smca_configure_old(unsigned int bank, unsigned int cpu)
>
> Yah, at the end of the patchset there should be patches which remove all
> the _old things. Not in a different patchset. You can split things
> accordingly.
>

Okay. I'll include the follow up patches in the next revision of this set.

Thanks,
Yazen

2024-04-24 08:59:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 05/16] x86/mce/amd: Clean up SMCA configuration

On Tue, Apr 23, 2024 at 03:32:00PM -0400, Yazen Ghannam wrote:
> This is not the same.
>
> "CFG_DFR_INT_TYPE" is a register field.
>
> "INTR_TYPE_APIC" is a value. And this same value can be used in other register
> fields.

I don't care - this was just an example of how it should look like. Like
the rest of the code around the kernel and not like an obfuscated
C contest mess.

> I think it's fair to just use logical AND for single bit checks instead of the
> FIELD_GET() macro.
>
> But the FIELD_PREP() macro does help for setting bitfields. I think it's
> clearer than manually doing the proper shifts and masks.

To you maybe.

Pls stick to how common code does masks generation and manipulation so
that this remains readable. This FIELD* crap is not helping.

> Okay. I was thinking to keep the names shorter since they are only used in
> this file. But I'll change them.

If you want to keep them shorter, then think of an overall shorter
scheme of how the register *and* the fields which belong to it, should
be named. But there's a point in having the same prefix for register and
bits which belong to it.

> Okay. I'll include the follow up patches in the next revision of this set.

Pls do.

Thanks.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-24 14:07:49

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 05/16] x86/mce/amd: Clean up SMCA configuration

On 4/23/2024 10:29 PM, Borislav Petkov wrote:
> On Tue, Apr 23, 2024 at 03:32:00PM -0400, Yazen Ghannam wrote:
>> This is not the same.
>>
>> "CFG_DFR_INT_TYPE" is a register field.
>>
>> "INTR_TYPE_APIC" is a value. And this same value can be used in other register
>> fields.
>
> I don't care - this was just an example of how it should look like. Like
> the rest of the code around the kernel and not like an obfuscated
> C contest mess.
>
>> I think it's fair to just use logical AND for single bit checks instead of the
>> FIELD_GET() macro.
>>
>> But the FIELD_PREP() macro does help for setting bitfields. I think it's
>> clearer than manually doing the proper shifts and masks.
>
> To you maybe.
>
> Pls stick to how common code does masks generation and manipulation so
> that this remains readable. This FIELD* crap is not helping.
>

Okay, will do. I'll drop all the bitfield stuff from the entire set.

>> Okay. I was thinking to keep the names shorter since they are only used in
>> this file. But I'll change them.
>
> If you want to keep them shorter, then think of an overall shorter
> scheme of how the register *and* the fields which belong to it, should
> be named. But there's a point in having the same prefix for register and
> bits which belong to it.
>

Okay, understood.

Thanks,
Yazen

2024-04-24 19:09:22

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On Thu, Apr 04, 2024 at 10:13:50AM -0500, Yazen Ghannam wrote:
> AMD systems with the SUCCOR feature can send an APIC LVT interrupt for
> deferred errors. The LVT offset is 0x2 by convention, i.e. this is the
> default as listed in hardware documentation.
>
> However, the MCA registers may list a different LVT offset for this
> interrupt. The kernel should honor the value from the hardware.

There's this "may" thing again.

Is this enablement for some future hw too or do you really trust the
value in MSR_CU_DEF_ERR is programmed correctly in all cases?

> Simplify the enable flow by using the hardware-provided value. Any
> conflicts will be caught by setup_APIC_eilvt(). Conflicts on production
> systems can be handled as quirks, if needed.

Well, which systems support succor?

I'd like to test this on them before we face all the quirkery. :)

That area has been plagued by hw snafus if you look at
setup_APIC_eilvt() and talk to uncle Robert. :-P

> @@ -595,17 +584,15 @@ static void deferred_error_interrupt_enable(struct cpuinfo_x86 *c)
> if (rdmsr_safe(MSR_CU_DEF_ERR, &low, &high))
> return;
>
> + /*
> + * Trust the value from hardware.
> + * If there's a conflict, then setup_APIC_eilvt() will throw an error.
> + */
> def_new = (low & MASK_DEF_LVTOFF) >> 4;
> - if (!(low & MASK_DEF_LVTOFF)) {
> - pr_err(FW_BUG "Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.\n");
> - def_new = DEF_LVT_OFF;
> - low = (low & ~MASK_DEF_LVTOFF) | (DEF_LVT_OFF << 4);
> - }
> + if (setup_APIC_eilvt(def_new, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
> + return;
>
> - def_offset = setup_APIC_deferred_error(def_offset, def_new);
> - if ((def_offset == def_new) &&
> - (deferred_error_int_vector != amd_deferred_error_interrupt))
> - deferred_error_int_vector = amd_deferred_error_interrupt;

There was a reason for that - deferred_error_int_vector is a global var
and you're calling enable_deferred_error_interrupt() on each CPU.

> + deferred_error_int_vector = amd_deferred_error_interrupt;
>
> if (!mce_flags.smca)
> low = (low & ~MASK_DEF_INT_TYPE) | DEF_INT_TYPE_APIC;

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-25 14:15:36

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On 4/24/2024 3:06 PM, Borislav Petkov wrote:
> On Thu, Apr 04, 2024 at 10:13:50AM -0500, Yazen Ghannam wrote:
>> AMD systems with the SUCCOR feature can send an APIC LVT interrupt for
>> deferred errors. The LVT offset is 0x2 by convention, i.e. this is the
>> default as listed in hardware documentation.
>>
>> However, the MCA registers may list a different LVT offset for this
>> interrupt. The kernel should honor the value from the hardware.
>
> There's this "may" thing again.
>

Right, I should say "the microarchitecture allows it". :)

> Is this enablement for some future hw too or do you really trust the
> value in MSR_CU_DEF_ERR is programmed correctly in all cases?
>

I trust the value from hardware.

The intention here is to simplify the code for general maintenance and to make
later patches easier.

>> Simplify the enable flow by using the hardware-provided value. Any
>> conflicts will be caught by setup_APIC_eilvt(). Conflicts on production
>> systems can be handled as quirks, if needed.
>
> Well, which systems support succor?
>
> I'd like to test this on them before we face all the quirkery. :)
>

All Zen/SMCA systems. I don't recall any issues in this area.

Some later Family 15h systems (Carrizo?) had it. But I don't know if it was
used in production. It was slightly before my time.

> That area has been plagued by hw snafus if you look at
> setup_APIC_eilvt() and talk to uncle Robert. :-P
>

Right, I found this:
27afdf2008da ("apic, x86: Use BIOS settings for IBS and MCE threshold
interrupt LVT offsets")

Which is basically the same idea: use what is in the register.

But it looks like there was an issue with IBS on Family 10h.

Is this the only case of a real issue? If so, then why apply this method to
the THR and DFR interrupts?

Robert, were there any other issues?

>> @@ -595,17 +584,15 @@ static void deferred_error_interrupt_enable(struct cpuinfo_x86 *c)
>> if (rdmsr_safe(MSR_CU_DEF_ERR, &low, &high))
>> return;
>>
>> + /*
>> + * Trust the value from hardware.
>> + * If there's a conflict, then setup_APIC_eilvt() will throw an error.
>> + */
>> def_new = (low & MASK_DEF_LVTOFF) >> 4;
>> - if (!(low & MASK_DEF_LVTOFF)) {
>> - pr_err(FW_BUG "Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.\n");
>> - def_new = DEF_LVT_OFF;
>> - low = (low & ~MASK_DEF_LVTOFF) | (DEF_LVT_OFF << 4);
>> - }
>> + if (setup_APIC_eilvt(def_new, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
>> + return;
>>
>> - def_offset = setup_APIC_deferred_error(def_offset, def_new);
>> - if ((def_offset == def_new) &&
>> - (deferred_error_int_vector != amd_deferred_error_interrupt))
>> - deferred_error_int_vector = amd_deferred_error_interrupt;
>
> There was a reason for that - deferred_error_int_vector is a global var
> and you're calling enable_deferred_error_interrupt() on each CPU.
>

Right, and all CPUs should use the same APIC LVT offset. If they differ, then
setup_APIC_eilvt() will fail above and return.

Why check "if X != Y, then X = Y"? Why not just unconditionally do "X = Y"?

>> + deferred_error_int_vector = amd_deferred_error_interrupt;
>>
>> if (!mce_flags.smca)
>> low = (low & ~MASK_DEF_INT_TYPE) | DEF_INT_TYPE_APIC;
>
> Thx.
>

Thanks,
Yazen

2024-04-29 13:00:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On Thu, Apr 25, 2024 at 10:12:44AM -0400, Yazen Ghannam wrote:
> I trust the value from hardware.
>
> The intention here is to simplify the code for general maintenance and to make
> later patches easier.

There's this BIOS thing which programs those and throws a wrench in all
our trusting in the hw.

> All Zen/SMCA systems. I don't recall any issues in this area.
>
> Some later Family 15h systems (Carrizo?) had it. But I don't know if it was
> used in production. It was slightly before my time.

Yeah, already solved in the previous mail.

> Right, I found this:
> 27afdf2008da ("apic, x86: Use BIOS settings for IBS and MCE threshold
> interrupt LVT offsets")
>
> Which is basically the same idea: use what is in the register.
>
> But it looks there was an issue with IBS on Family 10h.

Yap, and it was pretty blatant AFAIR.

> Is this the only case of a real issue?

I don't remember anything else, but I'm not excluding that there were
others.

> If so, then why apply this method to the THR and DFR interrupts?

Meaning what exactly? You want to trust the hw for THR and DFR and let
the others use this offset reservation we're doing?

> Right, and all CPUs should use the same APIC LVT offset. If they differ, then
> setup_APIC_eilvt() will fail above and return.
>
> Why check "if X != Y, then X = Y"? Why not just unconditionally do "X = Y"?

Why unconditionally do the assignment if it is already assigned?

I don't think x86 does store tearing, so that we'd get a deferred interrupt
on some core while some other core writes the same function pointer in
there, but why even risk it if it can be avoided with a simple test?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-29 13:43:03

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] x86/mce/amd: Clean up enable_deferred_error_interrupt()

On Thu, Apr 04, 2024 at 10:13:51AM -0500, Yazen Ghannam wrote:
> -/* Deferred error settings */
> +/* MCA Interrupt Configuration register, one per CPU */

SMCA?

> #define MSR_CU_DEF_ERR 0xC0000410
> -#define MASK_DEF_LVTOFF 0x000000F0
> -#define MASK_DEF_INT_TYPE 0x00000006
> -#define DEF_INT_TYPE_APIC 0x2
> +#define MSR_MCA_INTR_CFG 0xC0000410

You do see those other MSRs' prefixes, right?

MSR_AMD64_SMCA_...

Is this one not part of the SMCA arch?

> +#define INTR_CFG_DFR_LVT_OFFSET GENMASK_ULL(7, 4)
> +#define INTR_CFG_LEGACY_DFR_INTR_TYPE GENMASK_ULL(2, 1)
> #define INTR_TYPE_APIC 0x1

Ditto for its bit(s) names.

> +static u64 get_mca_intr_cfg(void)
> +{
> + u64 mca_intr_cfg;
> +
> + if (!mce_flags.succor)
> + return 0;
> +
> + if (rdmsrl_safe(MSR_MCA_INTR_CFG, &mca_intr_cfg))
> + return 0;
> +
> + return mca_intr_cfg;
> +}

This is overkill. If we add a function for every MSR we're reading...

Do this differently: prepare the value you're writing back into the
INTR_CFG MSR once, save it into mca_intr_cfg and then write it on each
core at the end of enable_deferred_error_interrupt().

And make u64 mca_intr_cfg static global to amd.c so that you can refer
to it from outside of the functions and then you don't have to pass it
around as a function param.
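
Roughly, a sketch of that flow (purely illustrative, not the actual patch:
it reuses the new defines quoted above, assumes mca_intr_cfg was filled in
once on the boot CPU, and uses FIELD_GET()/FIELD_PREP() from
<linux/bitfield.h>):

static u64 mca_intr_cfg;	/* prepared once on the boot CPU */

static void enable_deferred_error_interrupt(void)
{
	u8 dfr_offset;

	/* Nothing to do if SUCCOR is absent or the earlier MSR read failed. */
	if (!mca_intr_cfg)
		return;

	/* Use the LVT offset that firmware/hardware already programmed. */
	dfr_offset = FIELD_GET(INTR_CFG_DFR_LVT_OFFSET, mca_intr_cfg);
	if (setup_APIC_eilvt(dfr_offset, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
		return;

	deferred_error_int_vector = amd_deferred_error_interrupt;

	/* Legacy (non-SMCA) systems still need the interrupt type set to APIC. */
	if (!mce_flags.smca) {
		u64 cfg = mca_intr_cfg;

		cfg &= ~INTR_CFG_LEGACY_DFR_INTR_TYPE;
		cfg |= FIELD_PREP(INTR_CFG_LEGACY_DFR_INTR_TYPE, INTR_TYPE_APIC);
		wrmsrl(MSR_MCA_INTR_CFG, cfg);
	}
}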

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-29 13:58:26

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On 4/29/2024 8:59 AM, Borislav Petkov wrote:
> On Thu, Apr 25, 2024 at 10:12:44AM -0400, Yazen Ghannam wrote:
>> I trust the value from hardware.
>>
>> The intention here is to simplify the code for general maintenance and to make
>> later patches easier.
>
> There's this BIOS thing which programs those and throws a wrench in all
> our trusting in the hw.
>
>> All Zen/SMCA systems. I don't recall any issues in this area.
>>
>> Some later Family 15h systems (Carrizo?) had it. But I don't know if it was
>> used in production. It was slightly before my time.
>
> Yeah, already solved in the previous mail.
>
>> Right, I found this:
>> 27afdf2008da ("apic, x86: Use BIOS settings for IBS and MCE threshold
>> interrupt LVT offsets")
>>
>> Which is basically the same idea: use what is in the register.
>>
>> But it looks there was an issue with IBS on Family 10h.
>
> Yap, and it was pretty blatant AFAIR.
>
>> Is this the only case of a real issue?
>
> I don't remember anything else but I'm not excluding there not being
> others.
>
>> If so, then why apply this method to the THR and DFR interrupts?
>
> Meaning what exactly? You want to trust the hw for THR and DFR and let
> the others use this offset reservation we're doing?
>

Right, I mean we should do things the simpler way unless there's a real issue
to address.

>> Right, and all CPUs should use the same APIC LVT offset. If they differ, then
>> setup_APIC_eilvt() will fail above and return.
>>
>> Why check "if X != Y, then X = Y"? Why not just unconditionally do "X = Y"?
>
> Why unconditionally do the assignment if it is already assigned?
>
> I don't think x86 does store tearing so that we get deferred interrupt
> on some core while some other core writes the same function pointer in
> there but why even risk it if it can be avoided with a simple test?
>

I'm not opposed to this, but I don't understand what is at risk.

Is it that the function pointer may not be written atomically? So even if we
write it again with the same value, a concurrent interrupt on another core may
see a partially updated (corrupt) pointer?

intel_init_cmci() does not do this check. So is it more at risk, or is the AMD
code just more cautious?

Again I'm not against the current code. I just think we should simplify it, if
possible.

Thanks,
Yazen


2024-04-29 14:11:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 09/16] x86/mce: Unify AMD THR handler with MCA Polling

On Thu, Apr 04, 2024 at 10:13:52AM -0500, Yazen Ghannam wrote:
> @@ -787,6 +793,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> mce_log(&m);
>
> clear_it:
> + vendor_handle_error(&m);

Wait, whaaat?

The normal polling happens periodically (each 5 mins) and you want to
reset the thresholding blocks each 5 mins?

And the code has there now:

static void reset_block(struct threshold_block *block)
{

..

/* Reset threshold block after logging error. */
memset(&tr, 0, sizeof(tr));
tr.b = block;
threshold_restart_bank(&tr);
}

but no error has been logged.

Frankly, I don't see the point for this part: polling all banks on
a thresholding interrupt makes sense. But this resetting from within the
polling doesn't make any sense.

Especially if that polling interval is user-controllable.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-29 14:14:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On Mon, Apr 29, 2024 at 09:56:56AM -0400, Yazen Ghannam wrote:
> Right, I mean we should do things the simpler way unless there's a real issue
> to address.

You need to pay attention to past issues before you go and simplify it and
break it again.

> I'm not opposed to this, but I don't understand what is at risk.
>
> Is it that the function pointer may not be written atomically? So even if we
> write it again with the same value, a concurrent interrupt on another core may
> see a partially updated (corrupt) pointer?

Yes, it won't happen, they say, as it is guaranteed by the
architecture. But I've heard those "promises".

> intel_init_cmci() does not do this check. So is it more at risk, or is the AMD
> code just more cautious?
>
> Again I'm not against the current code. I just think we should simplify it, if
> possible.

So in looking at the INTR_CFG MSR, I think we should do a function which
does MCA init stuff only on the BSP exactly for things like that.

There you can set the interrupt handler pointer, the INTR_CFG MSR and so
on. And we don't have such a function, and I've needed a function like
that in the past.

And just for the general goal of not doing ugly code which should run
only once but is run per-CPU just because our infrastructure doesn't
allow it.

Wanna give that a try?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-29 14:19:20

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] x86/mce/amd: Clean up enable_deferred_error_interrupt()

On 4/29/2024 9:12 AM, Borislav Petkov wrote:
> On Thu, Apr 04, 2024 at 10:13:51AM -0500, Yazen Ghannam wrote:
>> -/* Deferred error settings */
>> +/* MCA Interrupt Configuration register, one per CPU */
>
> SMCA?
>
>> #define MSR_CU_DEF_ERR 0xC0000410
>> -#define MASK_DEF_LVTOFF 0x000000F0
>> -#define MASK_DEF_INT_TYPE 0x00000006
>> -#define DEF_INT_TYPE_APIC 0x2
>> +#define MSR_MCA_INTR_CFG 0xC0000410
>
> You do see those other MSRs' prefixes, right?
>
> MSR_AMD64_SMCA_...
>
> Is this one not part of the SMCA arch?
>

No, it is part of SUCCOR. The old define is above: MSR_CU_DEF_ERR.

This is how it is listed in the PPR:
MSRC000_0410 [MCA Interrupt Configuration] (Core::X86::Msr::McaIntrCfg)

>> +#define INTR_CFG_DFR_LVT_OFFSET GENMASK_ULL(7, 4)
>> +#define INTR_CFG_LEGACY_DFR_INTR_TYPE GENMASK_ULL(2, 1)
>> #define INTR_TYPE_APIC 0x1
>
> Ditto for its bit(s) names.
>

Okay.

>> +static u64 get_mca_intr_cfg(void)
>> +{
>> + u64 mca_intr_cfg;
>> +
>> + if (!mce_flags.succor)
>> + return 0;
>> +
>> + if (rdmsrl_safe(MSR_MCA_INTR_CFG, &mca_intr_cfg))
>> + return 0;
>> +
>> + return mca_intr_cfg;
>> +}
>
> This is an overkill. If we add a function for every MSR we're reading...
>
> Do this differently: prepare the value you're writing back into the
> INTR_CFG MSR once, save it into mca_intr_cfg and then write it on each
> core at the end of enable_deferred_error_interrupt().
>
> And make u64 mca_intr_cfg static global to amd.c so that you can refer
> to it from outside of the functions and then you don't have to pass it
> around as a function param.
>
> Thx.
>

Good idea. In fact, we can treat this register as read-only, since we will
only handle (SUCCOR && SMCA) systems. The only need to write this register
would be on !SMCA systems.

We need to assume that the register value will be identical for all CPUs. This
is the expectation, but I'll add a comment to highlight this.

Also, we don't need the entire register. We just need the LVT offset fields
which are 4 bits each.

Thanks,
Yazen

2024-04-29 14:25:17

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On 4/29/2024 10:12 AM, Borislav Petkov wrote:
> On Mon, Apr 29, 2024 at 09:56:56AM -0400, Yazen Ghannam wrote:
>> Right, I mean we should do things the simpler way unless there's a real issue
>> to address.
>
> You need to pay attention to past issues before you go, simplify it and
> break it again.
>

I completely agree. I haven't seen evidence of an issue yet for the DFR case
though, which is why I thought it'd be safe to do some cleanup.

>> I'm not opposed to this, but I don't understand what is at risk.
>>
>> Is it that the function pointer may not be written atomically? So even if we
>> write it again with the same value, a concurrent interrupt on another core may
>> see a partially updated (corrupt) pointer?
>
> Yes, it won't happen, they say as it is guaranteed by the
> architecture. But I've heard those "promises".
>
>> intel_init_cmci() does not do this check. So is it more at risk, or is the AMD
>> code just more cautious?
>>
>> Again I'm not against the current code. I just think we should simplify it, if
>> possible.
>
> So in looking at the INTR_CFG MSR, I think we should do a function which
> does MCA init stuff only on the BSP exactly for things like that.
>
> There you can set the interrupt handler pointer, the INTR_CFG MSR and so
> on. And we don't have such function and I've needed a function like that
> in the past.
>
> And just for the general goal of not doing ugly code which should run
> only once but is run per-CPU just because our infrastructure doesn't
> allow it.
>
> Wanna give that a try?
>
> Thx.
>

Yep, "MCA init cleanup" is another thing on my TODO list.

The BSP still completely finishes init before the APs, correct? I recall some
effort to make CPU init more parallel, but I haven't been following it.

Thanks,
Yazen

2024-04-29 14:37:25

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 09/16] x86/mce: Unify AMD THR handler with MCA Polling

On 4/29/2024 9:40 AM, Borislav Petkov wrote:
> On Thu, Apr 04, 2024 at 10:13:52AM -0500, Yazen Ghannam wrote:
>> @@ -787,6 +793,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>> mce_log(&m);
>>
>> clear_it:
>> + vendor_handle_error(&m);
>
> Wait, whaaat?
>
> The normal polling happens periodically (each 5 mins) and you want to
> reset the thresholding blocks each 5 mins?
>
> And the code has there now:
>
> static void reset_block(struct threshold_block *block)
> {
>
> ...
>
> /* Reset threshold block after logging error. */
> memset(&tr, 0, sizeof(tr));
> tr.b = block;
> threshold_restart_bank(&tr);
> }
>
> but no error has been logged.
>
> Frankly, I don't see the point for this part: polling all banks on
> a thresholding interrupt makes sense. But this resetting from within the
> polling doesn't make any sense.
>
> Especially if that polling interval is user-controllable.
>
> Thx.
>

The reset only happens on a threshold overflow event. There's a check above.

if (!(high & MASK_OVERFLOW_HI))
return;

Basically, all the cases in vendor_handle_error() would be conditional.

Related to this, I've been thinking that banks with thresholding enabled
should be removed from the list of polling banks. This is done on Intel but
not on AMD.

I wanted to give it more thought, because I think folks have come to expect
polling and thresholding to be independent on AMD.

Thanks,
Yazen

2024-04-29 18:35:06

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On 25.04.24 10:12:44, Yazen Ghannam wrote:
> On 4/24/2024 3:06 PM, Borislav Petkov wrote:
> > On Thu, Apr 04, 2024 at 10:13:50AM -0500, Yazen Ghannam wrote:
> >> AMD systems with the SUCCOR feature can send an APIC LVT interrupt for
> >> deferred errors. The LVT offset is 0x2 by convention, i.e. this is the
> >> default as listed in hardware documentation.
> >>
> >> However, the MCA registers may list a different LVT offset for this
> >> interrupt. The kernel should honor the value from the hardware.
> >
> > There's this "may" thing again.
> >
>
> Right, I should say "the microarchitecture allows it". :)
>
> > Is this enablement for some future hw too or do you really trust the
> > value in MSR_CU_DEF_ERR is programmed correctly in all cases?
> >
>
> I trust the value from hardware.
>
> The intention here is to simplify the code for general maintenance and to make
> later patches easier.
>
> >> Simplify the enable flow by using the hardware-provided value. Any
> >> conflicts will be caught by setup_APIC_eilvt(). Conflicts on production
> >> systems can be handled as quirks, if needed.
> >
> > Well, which systems support succor?
> >
> > I'd like to test this on them before we face all the quirkery. :)
> >
>
> All Zen/SMCA systems. I don't recall any issues in this area.
>
> Some later Family 15h systems (Carrizo?) had it. But I don't know if it was
> used in production. It was slightly before my time.
>
> > That area has been plagued by hw snafus if you look at
> > setup_APIC_eilvt() and talk to uncle Robert. :-P
> >
>
> Right, I found this:
> 27afdf2008da ("apic, x86: Use BIOS settings for IBS and MCE threshold
> interrupt LVT offsets")
>
> Which is basically the same idea: use what is in the register.
>
> But it looks there was an issue with IBS on Family 10h.

After looking a while into it I think the issue was the following:

IBS offset was not enabled by firmware, but MCE already was (due to
earlier setup). And mce was (maybe) not enabled on all cpus, only on one
cpu per socket. The IBS vector should be enabled on all cpus. Now
firmware allocated offset 1 for mce (instead of offset 0 as for
k8). This caused the hardcoded value (offset 1 for IBS) to be already
taken. Also, hardcoded values couldn't be used at all as this would
not have worked on k8 (for mce). Another issue was to find the
next free offset as you couldn't examine just the current cpu. So even
if the offset was available on the current cpu, another cpu might have
that offset already in use. Yet another problem was that the programmed
offsets for mce and ibs overlapped each other and the kernel had to
reassign them (the ibs offset).

I hope I remember all the details correctly here.

Thanks,

-Robert

2024-04-30 13:48:28

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On Mon, Apr 29, 2024 at 10:25:02AM -0400, Yazen Ghannam wrote:
> Yep, "MCA init cleanup" is another thing on my TODO list.

Right, so looking at the code, early_identify_cpu()/identify_boot_cpu()
could call a

mca_bsp_init()

or so which does the once, vendor-agnostic setup.

identify_cpu->mcheck_cpu_init()

already does both the BSP and AP per-CPU init.

With such a split, I think we'll have all the proper places to do setup
at the right time without hackery.
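
A rough sketch of that split (the function name and its contents are
placeholders from this discussion, not an actual patch; mca_intr_cfg is
the cached value mentioned in the other subthread, and vendor dispatch is
elided):

static u64 mca_intr_cfg;	/* shared, set up exactly once below */

/* Called once from early_identify_cpu()/identify_boot_cpu(). */
void __init mca_bsp_init(struct cpuinfo_x86 *c)
{
	if (!mce_available(c))
		return;

	/*
	 * System-wide, do-once decisions live here instead of being
	 * repeated in the per-CPU path.
	 */
	if (cpu_feature_enabled(X86_FEATURE_SUCCOR)) {
		rdmsrl_safe(MSR_MCA_INTR_CFG, &mca_intr_cfg);
		deferred_error_int_vector = amd_deferred_error_interrupt;
	}
}

/*
 * Per-CPU programming (LVT offsets, bank configuration) stays in the
 * existing identify_cpu() -> mcheck_cpu_init() path.
 */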

> The BSP still completely finishes init before the APs, correct?

Yes.

> I recall some effort to make CPU init more parallel, but I haven't
> been following it.

I think you mean parallel hotplug but that's for the APs only.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-30 18:08:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On Mon, Apr 29, 2024 at 08:34:37PM +0200, Robert Richter wrote:
> After looking a while into it I think the issue was the following:
>
> IBS offset was not enabled by firmware, but MCE already was (due to
> earlier setup). And mce was (maybe) not on all cpus and only one cpu
> per socket enabled. The IBS vector should be enabled on all cpus. Now
> firmware allocated offset 1 for mce (instead of offset 0 as for
> k8). This caused the hardcoded value (offset 1 for IBS) to be already
> taken. Also, hardcoded values couldn't be used at all as this would
> have not been worked on k8 (for mce). Another issue was to find the
> next free offset as you couldn't examine just the current cpu. So even
> if the offset on the current was available, another cpu might have
> that offset already in use. Yet another problem was that programmed
> offsets for mce and ibs overlapped each other and the kernel had to
> reassign them (the ibs offset).
>
> I hope a remember correctly here with all details.

I think you're remembering it correctly because after I read this, a very
very old and dormant brain cell did light up in my head and said, oh
yeah, that definitely rings a bell!

:-P

Yazen, this is the type of mess I was talking about.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-02 16:02:48

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On 4/30/24 2:06 PM, Borislav Petkov wrote:
> On Mon, Apr 29, 2024 at 08:34:37PM +0200, Robert Richter wrote:
>> After looking a while into it I think the issue was the following:
>>
>> IBS offset was not enabled by firmware, but MCE already was (due to
>> earlier setup). And mce was (maybe) not on all cpus and only one cpu
>> per socket enabled. The IBS vector should be enabled on all cpus. Now
>> firmware allocated offset 1 for mce (instead of offset 0 as for
>> k8). This caused the hardcoded value (offset 1 for IBS) to be already
>> taken. Also, hardcoded values couldn't be used at all as this would
>> have not been worked on k8 (for mce). Another issue was to find the
>> next free offset as you couldn't examine just the current cpu. So even
>> if the offset on the current was available, another cpu might have
>> that offset already in use. Yet another problem was that programmed
>> offsets for mce and ibs overlapped each other and the kernel had to
>> reassign them (the ibs offset).
>>
>> I hope a remember correctly here with all details.
>
> I think you're remembering it correct because after I read this, a very
> very old and dormant brain cell did light up in my head and said, oh
> yeah, that definitely rings a bell!
>
> :-P
>
> Yazen, this is the type of mess I was talking about.
>

Yep, I see what you mean. Definitely a pain :/

So is this the only known issue? And was it encountered in production
systems? Were/are people using IBS on K8 (Family 0Fh) systems? I know
that perf got support at this time, but do people still use it?

Just as an example, this project has Family 10h as the earliest supported.
https://github.com/jlgreathouse/AMD_IBS_Toolkit

My thinking is that we can simplify the code if there are no practical
issues. And we can address any reported issues as they come.

If you think that's okay, then I can continue with this particular clean
up. If not, then at least we have some more context here.

I'm sure there will be more topics like this when redoing the MCA init path.

:)

Thanks,
Yazen

2024-05-02 18:48:57

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On 02.05.24 12:02:02, Yazen Ghannam wrote:
> On 4/30/24 2:06 PM, Borislav Petkov wrote:
> > On Mon, Apr 29, 2024 at 08:34:37PM +0200, Robert Richter wrote:
> > > After looking a while into it I think the issue was the following:
> > >
> > > IBS offset was not enabled by firmware, but MCE already was (due to
> > > earlier setup). And mce was (maybe) not on all cpus and only one cpu
> > > per socket enabled. The IBS vector should be enabled on all cpus. Now
> > > firmware allocated offset 1 for mce (instead of offset 0 as for
> > > k8). This caused the hardcoded value (offset 1 for IBS) to be already
> > > taken. Also, hardcoded values couldn't be used at all as this would
> > > have not been worked on k8 (for mce). Another issue was to find the
> > > next free offset as you couldn't examine just the current cpu. So even
> > > if the offset on the current was available, another cpu might have
> > > that offset already in use. Yet another problem was that programmed
> > > offsets for mce and ibs overlapped each other and the kernel had to
> > > reassign them (the ibs offset).
> > >
> > > I hope a remember correctly here with all details.
> >
> > I think you're remembering it correct because after I read this, a very
> > very old and dormant brain cell did light up in my head and said, oh
> > yeah, that definitely rings a bell!
> >
> > :-P
> >
> > Yazen, this is the type of mess I was talking about.
> >
>
> Yep, I see what you mean. Definitely a pain :/
>
> So is this the only known issue? And was it encountered in production
> systems? Were/are people using IBS on K8 (Family Fh) systems? I know
> that perf got support at this time, but do people still use it?
>
> Just as an example, this project has Family 10h as the earliest supported.
> https://github.com/jlgreathouse/AMD_IBS_Toolkit

No, IBS was introduced with 10h, but the eilvt offset came already
with k8, though with only one entry. That affected production systems
and BIOSes.

>
> My thinking is that we can simplify the code if there are no practical
> issues. And we can address any reported issues as they come.
>
> If you think that's okay, then I can continue with this particular clean
> up. If not, then at least we have some more context here.

The general approach to use the preprogrammed offsets looks good to
me. Though, existing code [1] checks the offset and reapplies a
hardcoded value of 2 if it is zero. I don't know the history of
this. However, it would be great if that code could be simplified.

-Robert

[1] commit 24fd78a81f6d ("x86/mce/amd: Introduce deferred error interrupt handler")

>
> I'm sure there will be more topics like this when redoing the MCA init path.
>
> :)
>
> Thanks,
> Yazen

2024-05-04 14:40:14

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup

On Thu, May 02, 2024 at 08:48:34PM +0200, Robert Richter wrote:
> The general approach to use the preprogrammed offsets looks good to
> me. Though, existing code [1] checks the offset and reapplies a
> hardcoded value of 2 if it is zero. I don't know the history of
> this. However, it would be great if that code could be simplified.

Whatever we do, the best approach is to leave old hw as it is - you
don't want to "rework" all this and break some obscure old configuration
which only one user has and debugging on it is hard, to say the least.

So for everyone's sake, do a cutoff at Zen, leave the support for pre-Zen
machines as it is, and go wild with Zen and SMCA and so on, where we can
still test on all kinds of hw.

And as always, ask yourself whether any "simplification" you're thinking
of is even worth the effort and is not just code churn without any real
advantages.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-04 14:42:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] x86/mce/amd: Clean up enable_deferred_error_interrupt()

On Mon, Apr 29, 2024 at 10:18:59AM -0400, Yazen Ghannam wrote:
> Good idea. In fact, we can treat this register as read-only, since we will
> only handle (SUCCOR && SMCA) systems. The only need to write this register
> would be on !SMCA systems.
>
> We need to assume that the register value will be identical for all CPUs. This
> is the expectation, but I'll add a comment to highlight this.
>
> Also, we don't need the entire register. We just need the LVT offset fields
> which are 4 bits each.

Yes, you could read out and sanity-check only the LVT offsets and make
sure they're the same across all cores, to catch cases where BIOS has
dropped the ball in programming them.
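
For example, something along these lines (purely illustrative; the helper
name is made up, FIELD_GET() comes from <linux/bitfield.h>, and it relies
on CPU init being serialized as discussed earlier in the thread):

static bool dfr_lvt_offset_consistent(void)
{
	static int expected = -1;
	u64 cfg;
	int offset;

	if (rdmsrl_safe(MSR_MCA_INTR_CFG, &cfg))
		return false;

	offset = FIELD_GET(INTR_CFG_DFR_LVT_OFFSET, cfg);

	/* The first CPU records its offset; every later CPU must match it. */
	if (expected < 0) {
		expected = offset;
		return true;
	}

	if (offset != expected) {
		pr_err(FW_BUG "CPU%d: DFR LVT offset %d differs from %d\n",
		       smp_processor_id(), offset, expected);
		return false;
	}

	return true;
}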

But see my previous mail too: I think we should leave pre-Zen as it is
and do cleanups and simplifications only for Zen and newer.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-04 14:53:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 09/16] x86/mce: Unify AMD THR handler with MCA Polling

On Mon, Apr 29, 2024 at 10:36:57AM -0400, Yazen Ghannam wrote:
> Related to this, I've been thinking that banks with thresholding enabled
> should be removed from the list of polling banks. This is done on Intel but
> not on AMD.
>
> I wanted to give it more thought, because I think folks have come to expect
> polling and thresholding to be independent on AMD.

Yes, this whole thing sounds weird.

On the one hand, you have a special interrupt for errors which have
reached a threshold *just* *so* you don't have to poll. Because polling
is ok but getting a special interrupt is better, and such notification
systems always want to have a special interrupt and not have to poll.

On the other hand, you're marrying the two which sounds weird. Why?

What is wrong with getting thresholding interrupts?

Why can't we simply stop the polling and do THR only if available? That
would save a lot of energy.

So why can't we program the THR to raise an interrupt on a single error
and disable polling completely?

Because that would be a lot better as the hardware would be doing the
work for us.

In any case, I'm missing the strategy here so no cleanups without
a clear goal first please.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-07 18:27:55

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 09/16] x86/mce: Unify AMD THR handler with MCA Polling

On 5/4/24 10:52 AM, Borislav Petkov wrote:
> On Mon, Apr 29, 2024 at 10:36:57AM -0400, Yazen Ghannam wrote:
>> Related to this, I've been thinking that banks with thresholding enabled
>> should be removed from the list of polling banks. This is done on Intel but
>> not on AMD.
>>
>> I wanted to give it more thought, because I think folks have come to expect
>> polling and thresholding to be independent on AMD.
>
> Yes, this whole thing sounds weird.
>
> On the one hand, you have a special interrupt for errors which have
> reached a threshold *just* *so* you don't have to poll. Because polling
> is ok but getting a a special interrupt is better and such notification
> systems always want to have a special interrupt and not have to poll.
>
> On the other hand, you're marrying the two which sounds weird. Why?
>
> What is wrong with getting thresholding interrupts?
>

Nothing. This patch is not disabling the interrupt. The goal is to get
rid of duplicate code and have a single function that checks the MCA
banks.

This would be similar to intel_threshold_interrupt().

> Why can't we simply stop the polling and do THR only if available? That
> would save a lot of energy.
>
> So why can't we program the THR to raise an interrupt on a single error
> and disable polling completely?
>
> Because that would be a lot better as the hardware would be doing the
> work for us.
>
> In any case, I'm missing the strategy here so no cleanups without
> a clear goal first please.
>

We could do that. In fact, there's a request to use the threshold that
is pre-programmed in the hardware. And we could use some of the current
kernel parameters for overrides, if needed.

Thanks,
Yazen