2022-04-06 14:11:36

by Smita Koralahalli

Subject: [RFC PATCH 0/5] Handle corrected machine check interrupt storms

Extend the logic of handling Intel's corrected machine check interrupt
storms to AMD's threshold interrupts.

The first two patches are from Tony. They clean up the existing storm
handling for Intel and propose per-CPU, per-bank storm handling.

The third and fourth patches clean up and refactor the CMCI storm
handling so that a similar workaround can be extended to AMD's threshold
interrupt storms. These two patches could be merged into Tony's second
patch (the CMCI storm mitigation).

AMD's storm mitigation for threshold interrupts also takes a per-CPU,
per-bank approach, similar to Intel's. Unlike CMCI storm handling, it does
not set thresholds to reduce the interrupt rate during a storm. Instead,
it turns off the interrupt on the current CPU and bank when a storm is
detected and re-enables it when the storm subsides.

It is okay to turn off threshold interrupts on AMD systems because other
error severities continue to be handled while the threshold interrupts are
turned off: uncorrected errors generate a #MC, and deferred errors have
their own separate deferred error interrupt. The final patch adds
support for handling threshold interrupt storms on AMD systems.

Smita Koralahalli (3):
x86/mce: Introduce a function pointer mce_handle_storm
x86/mce: Move storm handling to core.
x86/mce: Handle AMD threshold interrupt storms

Tony Luck (2):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation

arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
arch/x86/kernel/cpu/mce/core.c | 129 +++++++++++++++++----
arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
arch/x86/kernel/cpu/mce/internal.h | 42 +++++--
4 files changed, 231 insertions(+), 168 deletions(-)

--
2.17.1


2022-04-06 14:13:10

by Smita Koralahalli

Subject: [RFC PATCH 5/5] x86/mce: Handle AMD threshold interrupt storms

Extend the logic of handling CMCI storms to AMD threshold interrupts.

Take the same approach as Intel's CMCI handling and mitigate storms per
CPU and per bank. Unlike CMCI, do not set thresholds to reduce the
interrupt rate during a storm. Instead, disable the interrupt on the
corresponding CPU and bank, and re-enable it once enough consecutive
polls of the bank show no corrected errors (30, as programmed by Intel).

Turning off the threshold interrupts is a better solution on AMD
systems because other error severities are still handled while the
threshold interrupts are disabled.

Signed-off-by: Smita Koralahalli <[email protected]>
---
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/core.c | 1 +
arch/x86/kernel/cpu/mce/internal.h | 4 +++
3 files changed, 54 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 1940d305db1c..941b09f4dac5 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -466,6 +466,47 @@ static void threshold_restart_bank(void *_tr)
wrmsr(tr->b->address, lo, hi);
}

+static void _reset_block(struct threshold_block *block)
+{
+ struct thresh_restart tr;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.b = block;
+ threshold_restart_bank(&tr);
+}
+
+static void toggle_interrupt_reset_block(struct threshold_block *block, bool on)
+{
+ if (!block)
+ return;
+
+ block->interrupt_enable = !!on;
+ _reset_block(block);
+}
+
+void mce_amd_handle_storm(int bank, bool on)
+{
+ struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
+ struct threshold_bank **bp = this_cpu_read(threshold_banks);
+ unsigned long flags;
+
+ if (!bp)
+ return;
+
+ local_irq_save(flags);
+
+ first_block = bp[bank]->blocks;
+ if (!first_block)
+ goto end;
+
+ toggle_interrupt_reset_block(first_block, on);
+
+ list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
+ toggle_interrupt_reset_block(block, on);
+end:
+ local_irq_restore(flags);
+}
+
static void mce_threshold_block_init(struct threshold_block *b, int offset)
{
struct thresh_restart tr = {
@@ -867,6 +908,7 @@ static void amd_threshold_interrupt(void)
struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
struct threshold_bank **bp = this_cpu_read(threshold_banks);
unsigned int bank, cpu = smp_processor_id();
+ u64 status;

/*
* Validate that the threshold bank has been initialized already. The
@@ -880,6 +922,13 @@ static void amd_threshold_interrupt(void)
if (!(per_cpu(bank_map, cpu) & (1 << bank)))
continue;

+ rdmsrl(mca_msr_reg(bank, MCA_STATUS), status);
+ track_cmci_storm(bank, status);
+
+ /* Return early on an interrupt storm */
+ if (this_cpu_read(bank_storm[bank]))
+ return;
+
first_block = bp[bank]->blocks;
if (!first_block)
continue;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6caee488bf7d..c510dd17f2c5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2078,6 +2078,7 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)

case X86_VENDOR_AMD: {
mce_amd_feature_init(c);
+ mce_handle_storm = mce_amd_handle_storm;
break;
}

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 49907cadf9ad..b9e8c8155c66 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -213,7 +213,11 @@ extern bool filter_mce(struct mce *m);

#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
+void track_cmci_storm(int bank, u64 status);
+void mce_amd_handle_storm(int bank, bool on);
#else
+static inline void track_cmci_storm(int bank, u64 status) { }
+# define mce_amd_handle_storm mce_handle_storm_default
static inline bool amd_filter_mce(struct mce *m) { return false; }
#endif

--
2.17.1
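
The toggle described in the commit message above can be modelled in
userspace. This is a minimal sketch under stated assumptions, not the
kernel code: each bank owns a small fixed array of blocks instead of the
kernel's miscj list, the MSR write done by threshold_restart_bank() is
left out, and struct model_block, struct model_bank,
model_handle_storm() and model_enabled_blocks() are hypothetical names
used only for illustration. It shows that on == false (storm detected)
disables the interrupt on every block of the bank and on == true (storm
subsided) enables it again.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BLOCKS_PER_BANK 3	/* hypothetical size, not from the patch */

struct model_block {
	bool present;		/* stands in for a non-NULL block pointer */
	bool interrupt_enable;	/* mirrors threshold_block::interrupt_enable */
};

struct model_bank {
	struct model_block blocks[MODEL_BLOCKS_PER_BANK];
};

/* Set the interrupt enable flag on every present block of one bank. */
static void model_handle_storm(struct model_bank *bank, bool on)
{
	if (!bank)
		return;

	for (size_t i = 0; i < MODEL_BLOCKS_PER_BANK; i++) {
		if (!bank->blocks[i].present)
			continue;
		bank->blocks[i].interrupt_enable = on;
	}
}

/* Count the present blocks that currently have the interrupt enabled. */
static int model_enabled_blocks(const struct model_bank *bank)
{
	int n = 0;

	for (size_t i = 0; i < MODEL_BLOCKS_PER_BANK; i++)
		if (bank->blocks[i].present && bank->blocks[i].interrupt_enable)
			n++;
	return n;
}
```

Calling model_handle_storm(&bank, false) followed by
model_handle_storm(&bank, true) on a bank with two present blocks takes
the enabled count from 2 to 0 and back to 2; absent blocks are never
touched.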

2022-04-06 14:13:27

by Smita Koralahalli

Subject: [RFC PATCH 4/5] x86/mce: Move storm handling to core.

AMD's storm handling for threshold interrupts is similar to Intel's CMCI
storm handling. Hence, make the storm handling code common by moving to
core and removing the vendor exclusivity.

By contrast, setting different thresholds in the IA32_MCi_CTL2 register
to reduce the interrupt rate remains Intel-specific, because AMD handles
storms slightly differently: it turns the interrupts off.

No functional changes.

Signed-off-by: Smita Koralahalli <[email protected]>
---
This is another patch which can be merged into Tony's per CPU per bank
CMCI storm mitigation.
---
arch/x86/kernel/cpu/mce/core.c | 81 +++++++++++++++++++++++
arch/x86/kernel/cpu/mce/intel.c | 100 +----------------------------
arch/x86/kernel/cpu/mce/internal.h | 25 ++++++++
3 files changed, 107 insertions(+), 99 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db6d60825e77..6caee488bf7d 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -611,6 +611,87 @@ static struct notifier_block mce_default_nb = {
.priority = MCE_PRIO_LOWEST,
};

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ */
+DEFINE_PER_CPU(int, stormy_bank_count);
+DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode, the history mask covers about
+ * one second of elapsed time. Check how long it has been since
+ * this bank was last polled, and compute a shift value to update
+ * the history bitmask. When not in storm mode, each consecutive
+ * poll of the bank is logged in the next history bit, so shift
+ * is kept at "1".
+ */
+ if (this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZBITS) / HZBITS;
+ }
+
+ /* If has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, true);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, false);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* Read ADDR and MISC registers.
*/
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 7edc31742fe0..6cc9aa97c092 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,17 +47,7 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

-/*
- * CMCI storm tracking state
- * stormy_bank_count: per-cpu count of MC banks in storm state
- * bank_history: bitmask tracking of corrected errors seen in each bank
- * bank_time_stamp: last time (in jiffies) that each bank was polled
- * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
- */
-static DEFINE_PER_CPU(int, stormy_bank_count);
-static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
-static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
-static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+/* MCi_CTL2 threshold for each bank when there is no storm */
static int cmci_threshold[MAX_NR_BANKS];

/* Linux non-storm CMCI threshold (may be overridden by BIOS */
@@ -70,24 +60,6 @@ static int cmci_threshold[MAX_NR_BANKS];
*/
#define CMCI_STORM_THRESHOLD 32749

-/*
- * How many errors within the history buffer mark the start of a storm
- */
-#define STORM_BEGIN_THRESHOLD 5
-
-/*
- * How many polls of machine check bank without an error before declaring
- * the storm is over
- */
-#define STORM_END_POLL_THRESHOLD 30
-
-/*
- * When there is no storm each "bit" in the history represents
- * this many jiffies. When there is a storm every poll() takes
- * one history bit.
- */
-#define HZBITS (HZ / 64)
-
static int cmci_supported(int *banks)
{
u64 cap;
@@ -167,76 +139,6 @@ void mce_intel_handle_storm(int bank, bool on)
cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
}

-static void cmci_storm_begin(int bank)
-{
- __set_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_storm[bank], true);
-
- /*
- * If this is the first bank on this CPU to enter storm mode
- * start polling
- */
- if (this_cpu_inc_return(stormy_bank_count) == 1)
- mce_timer_kick(true);
-}
-
-static void cmci_storm_end(int bank)
-{
- __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_history[bank], 0ull);
- this_cpu_write(bank_storm[bank], false);
-
- /* If no banks left in storm mode, stop polling */
- if (!this_cpu_dec_return(stormy_bank_count))
- mce_timer_kick(false);
-}
-
-void track_cmci_storm(int bank, u64 status)
-{
- unsigned long now = jiffies, delta;
- unsigned int shift = 1;
- u64 history;
-
- /*
- * When a bank is in storm mode, the history mask covers about
- * one second of elapsed time. Check how long it has been since
- * this bank was last polled, and compute a shift value to update
- * the history bitmask. When not in storm mode, each consecutive
- * poll of the bank is logged in the next history bit, so shift
- * is kept at "1".
- */
- if (this_cpu_read(bank_storm[bank])) {
- delta = now - this_cpu_read(bank_time_stamp[bank]);
- shift = (delta + HZBITS) / HZBITS;
- }
-
- /* If has been a long time since the last poll, clear history */
- if (shift >= 64)
- history = 0;
- else
- history = this_cpu_read(bank_history[bank]) << shift;
- this_cpu_write(bank_time_stamp[bank], now);
-
- /* History keeps track of corrected errors. VAL=1 && UC=0 */
- if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
- history |= 1;
- this_cpu_write(bank_history[bank], history);
-
- if (this_cpu_read(bank_storm[bank])) {
- if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
- return;
- pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- mce_handle_storm(bank, true);
- cmci_storm_end(bank);
- } else {
- if (hweight64(history) < STORM_BEGIN_THRESHOLD)
- return;
- pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- mce_handle_storm(bank, false);
- cmci_storm_begin(bank);
- }
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index c95802db9535..49907cadf9ad 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -60,6 +60,31 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }

void mce_timer_kick(bool storm);
extern void (*mce_handle_storm)(int bank, bool on);
+void cmci_storm_begin(int bank);
+void cmci_storm_end(int bank);
+
+DECLARE_PER_CPU(int, stormy_bank_count);
+DECLARE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DECLARE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DECLARE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30
+
+/*
+ * When there is no storm each "bit" in the history represents
+ * this many jiffies. When there is a storm every poll() takes
+ * one history bit.
+ */
+#define HZBITS (HZ / 64)

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
--
2.17.1
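
The begin and end tests in track_cmci_storm() above can be modelled in
userspace as follows. This is a sketch under stated assumptions: the
shift is passed in by the caller rather than derived from jiffies, the
per-CPU arrays are replaced by one hypothetical struct
model_bank_state, and model_hweight64() and model_track() are
illustrative names, not kernel functions. The two thresholds are the
values from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_BEGIN_THRESHOLD		5	/* from the patch */
#define STORM_END_POLL_THRESHOLD	30	/* from the patch */

/* Hypothetical per-bank state; the kernel keeps this in per-CPU arrays. */
struct model_bank_state {
	uint64_t history;	/* bit 0 is the most recent check */
	bool in_storm;
};

/* Userspace stand-in for hweight64(): the number of set bits. */
static int model_hweight64(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Record one check of the bank. 'shift' is the number of history slots
 * that elapsed since the previous check and 'corrected' says whether
 * this check found a corrected error. Returns true when the storm
 * state changed.
 */
static bool model_track(struct model_bank_state *s, unsigned int shift,
			bool corrected)
{
	const uint64_t end_mask =
		(UINT64_C(1) << STORM_END_POLL_THRESHOLD) - 1;

	s->history = (shift >= 64) ? 0 : s->history << shift;
	if (corrected)
		s->history |= 1;

	if (s->in_storm) {
		/* Any error in the last 30 slots keeps the storm going. */
		if (s->history & end_mask)
			return false;
		s->history = 0;	/* cmci_storm_end() clears the history */
		s->in_storm = false;
		return true;
	}

	if (model_hweight64(s->history) < STORM_BEGIN_THRESHOLD)
		return false;
	s->in_storm = true;
	return true;
}
```

With a shift of 1 on every call, the fifth corrected error starts a
storm, and the storm ends on the 30th consecutive check that finds no
corrected error.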

2022-04-07 00:33:45

by Luck, Tony

Subject: RE: [RFC PATCH 5/5] x86/mce: Handle AMD threshold interrupt storms

+ /* Return early on an interrupt storm */
+ if (this_cpu_read(bank_storm[bank]))
+ return;

Is your reasoning for early return that you already have plenty of
logged errors from this bank, so OK to skip additional processing
of this one?

-Tony

Subject: Re: [RFC PATCH 5/5] x86/mce: Handle AMD threshold interrupt storms

Hi,

On 4/6/22 5:44 PM, Luck, Tony wrote:

> + /* Return early on an interrupt storm */
> + if (this_cpu_read(bank_storm[bank]))
> + return;
>
> Is your reasoning for early return that you already have plenty of
> logged errors from this bank, so OK to skip additional processing
> of this one?

The idea behind this was: once track_cmci_storm() (which is called
before this "if" statement) has turned off the interrupts on a storm,
logging and handling of subsequent corrected errors are taken care of
by machine_check_poll(). Hence, there is no need to redo this in the
handler.

Let me know your thoughts on this.

>
> -Tony


2022-04-09 03:41:14

by Luck, Tony

Subject: Re: [RFC PATCH 5/5] x86/mce: Handle AMD threshold interrupt storms

On Fri, Apr 08, 2022 at 02:48:47AM -0500, Koralahalli Channabasappa, Smita wrote:
> Hi,
>
> On 4/6/22 5:44 PM, Luck, Tony wrote:
>
> > + /* Return early on an interrupt storm */
> > + if (this_cpu_read(bank_storm[bank]))
> > + return;
> >
> > Is your reasoning for early return that you already have plenty of
> > logged errors from this bank, so OK to skip additional processing
> > of this one?
>
> The idea behind this was: once track_cmci_storm() (which is called
> before this "if" statement) has turned off the interrupts on a storm,
> logging and handling of subsequent corrected errors are taken care of
> by machine_check_poll(). Hence, there is no need to redo this in the
> handler.
>
> Let me know your thoughts on this.

Makes sense. There's a storm, so picking up this error now,
or waiting for machine_check_poll() to get it makes little
difference.

-Tony

2022-06-21 05:17:59

by Luck, Tony

Subject: Re: [RFC PATCH 4/5] x86/mce: Move storm handling to core.

On Wed, Apr 06, 2022 at 01:35:41AM -0500, Smita Koralahalli wrote:
> + /*
> + * When a bank is in storm mode, the history mask covers about
> + * one second of elapsed time. Check how long it has been since
> + * this bank was last polled, and compute a shift value to update
> + * the history bitmask. When not in storm mode, each consecutive
> + * poll of the bank is logged in the next history bit, so shift
> + * is kept at "1".
> + */
> + if (this_cpu_read(bank_storm[bank])) {
> + delta = now - this_cpu_read(bank_time_stamp[bank]);
> + shift = (delta + HZBITS) / HZBITS;
> + }

Apologies for the long delay in following up on this.

I tested out your patches on an Intel system, and they "work"
in that storms are detected, mitigations applied, and then the
storm end is detected and the system returns to regular mode.

But the storm end happens far more quickly than I expected (in
just over a second). So I stared again at the code above, and
realized it doesn't do what I expected. Not your fault, you
just copied from my patches ... which means that my comment
didn't help explain what I was trying to do ... and so it wasn't
obvious that:
1) the test is backwards (need to adjust when the bank is NOT in
storm mode ... in storm mode we poll every second).
2) I can't even remember what I was trying to do with HZBITS, but
it seems wrong too. Just need to use HZ.

Patch below to be merged back into the series. This lets things
run for just over 30 seconds without finding a logged error while
polling in storm mode. Which is what I wanted.

[ 111.486306] mce: CPU48 BANK7 CMCI storm detected
[ 111.486394] mce: [Hardware Error]: Machine check events logged
[ 111.486401] mce: [Hardware Error]: Machine check events logged
[ 142.861874] mce: CPU48 BANK7 CMCI storm subsided

-Tony

---

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 74254f15f5db..8e6b77349911 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -655,16 +655,16 @@ void track_cmci_storm(int bank, u64 status)
u64 history;

/*
- * When a bank is in storm mode, the history mask covers about
- * one second of elapsed time. Check how long it has been since
- * this bank was last polled, and compute a shift value to update
- * the history bitmask. When not in storm mode, each consecutive
- * poll of the bank is logged in the next history bit, so shift
- * is kept at "1".
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
*/
- if (this_cpu_read(bank_storm[bank])) {
+ if (!this_cpu_read(bank_storm[bank])) {
delta = now - this_cpu_read(bank_time_stamp[bank]);
- shift = (delta + HZBITS) / HZBITS;
+ shift = (delta + HZ) / HZ;
}

/* If has been a long time since the last poll, clear history */
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index b9e8c8155c66..b88773a212cf 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -79,13 +79,6 @@ DECLARE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
*/
#define STORM_END_POLL_THRESHOLD 30

-/*
- * When there is no storm each "bit" in the history represents
- * this many jiffies. When there is a storm every poll() takes
- * one history bit.
- */
-#define HZBITS (HZ / 64)
-
#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
ssize_t apei_read_mce(struct mce *m, u64 *record_id);
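
The two shift computations in the diff above can be compared side by
side. This is a sketch under stated assumptions: shift_rfc() and
shift_fixed() are hypothetical names, HZ is passed in as a parameter,
and the example values (HZ of 1000 or 250, a gap of about HZ jiffies
between polls in storm mode) are illustrative rather than taken from
the thread.

```c
#include <stdbool.h>

/*
 * RFC computation: adjust only while in storm mode, in units of
 * HZBITS = HZ / 64. 'hz' must be at least 64 so the divisor is nonzero.
 */
static unsigned int shift_rfc(bool in_storm, unsigned long delta,
			      unsigned long hz)
{
	unsigned long hzbits = hz / 64;

	if (!in_storm)
		return 1;
	return (unsigned int)((delta + hzbits) / hzbits);
}

/*
 * Fixed computation: adjust only while NOT in storm mode, in units of
 * HZ, so one history bit corresponds to roughly one second.
 */
static unsigned int shift_fixed(bool in_storm, unsigned long delta,
				unsigned long hz)
{
	if (in_storm)
		return 1;
	return (unsigned int)((delta + hz) / hz);
}
```

With HZ = 1000, HZBITS is 15, and a one-second gap in storm mode gives
(1000 + 15) / 15 = 67 under the RFC computation. That is at least 64,
so the "clear history" branch runs on every poll and a single poll
without a corrected error is enough to end the storm. This is one
possible reading of why the storm ended after just over a second; the
thread does not spell it out. With the fix, the shift in storm mode is
always 1, so 30 consecutive clean polls are needed, which is consistent
with the roughly 31 seconds between the two log lines above.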

2022-06-27 17:51:07

by Luck, Tony

Subject: [PATCH v2 0/5] Handle corrected machine check interrupt storms

Extend the logic of handling Intel's corrected machine check interrupt
storms to AMD's threshold interrupts.

The first two patches are from Tony. They clean up the existing storm
handling for Intel and propose per-CPU, per-bank storm handling.

The third and fourth patches clean up and refactor the CMCI storm
handling so that a similar workaround can be extended to AMD's threshold
interrupt storms. These two patches could be merged into Tony's second
patch (the CMCI storm mitigation).

AMD's storm mitigation for threshold interrupts also takes a per-CPU,
per-bank approach, similar to Intel's. Unlike CMCI storm handling, it does
not set thresholds to reduce the interrupt rate during a storm. Instead,
it turns off the interrupt on the current CPU and bank when a storm is
detected and re-enables it when the storm subsides.

It is okay to turn off threshold interrupts on AMD systems because other
error severities continue to be handled while the threshold interrupts are
turned off: uncorrected errors generate a #MC, and deferred errors have
their own separate deferred error interrupt. The final patch adds
support for handling threshold interrupt storms on AMD systems.

Changes since v1:

1) Fix shift computation when keeping track of bank history. Shift
should be "1" when a storm is in progress (because polling once per
second). When a storm is not in progress shift should be based on
number of seconds since the bank was last checked.

2) Changed Smita's code in part 0003 to avoid use of a function pointer
(since the kernel is avoiding indirect branch points that might be
trainable for various Spectre-like issues).

Smita Koralahalli (3):
x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms
x86/mce: Handle AMD threshold interrupt storms
x86/mce: Move storm handling to core.

Tony Luck (2):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation

arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
arch/x86/kernel/cpu/mce/core.c | 139 +++++++++++++++++-----
arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
arch/x86/kernel/cpu/mce/internal.h | 33 ++++--
4 files changed, 230 insertions(+), 170 deletions(-)

--
2.35.3
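
Item 2 of the change list above replaces the function pointer with a
direct call. Patch 3 of v2 is not shown in this part of the thread, so
the following is only a sketch of one common way to do that: a switch
on the CPU vendor that calls the vendor handler directly. Every name in
it (enum model_vendor, model_cpu_vendor, model_last_handler,
model_intel_handle_storm(), model_amd_handle_storm(),
model_dispatch_storm()) is hypothetical and chosen for illustration.

```c
#include <stdbool.h>

enum model_vendor {
	MODEL_VENDOR_INTEL,
	MODEL_VENDOR_AMD,
	MODEL_VENDOR_OTHER,
};

/* Stand-in for the boot CPU's vendor; set by the caller in this model. */
static enum model_vendor model_cpu_vendor = MODEL_VENDOR_OTHER;

/* Records which handler ran last: 0 = none, 1 = Intel, 2 = AMD. */
static int model_last_handler;

static void model_intel_handle_storm(int bank, bool on)
{
	(void)bank;
	(void)on;
	model_last_handler = 1;
}

static void model_amd_handle_storm(int bank, bool on)
{
	(void)bank;
	(void)on;
	model_last_handler = 2;
}

/*
 * Direct dispatch: the compiler sees plain calls, so no indirect branch
 * through a writable pointer is involved.
 */
static void model_dispatch_storm(int bank, bool on)
{
	switch (model_cpu_vendor) {
	case MODEL_VENDOR_INTEL:
		model_intel_handle_storm(bank, on);
		break;
	case MODEL_VENDOR_AMD:
		model_amd_handle_storm(bank, on);
		break;
	default:
		break;	/* other vendors: nothing to do */
	}
}
```

The trade-off is that adding a vendor means editing the switch instead
of assigning a pointer at init time, in exchange for avoiding the
indirect call.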

2022-06-27 18:32:57

by Luck, Tony

Subject: [PATCH v2 4/5] x86/mce: Move storm handling to core.

From: Smita Koralahalli <[email protected]>

AMD's storm handling for threshold interrupts is similar to Intel's CMCI
storm handling. Hence, make the storm handling code common by moving to
core and removing the vendor exclusivity.

By contrast, setting different thresholds in the IA32_MCi_CTL2 register
to reduce the interrupt rate remains Intel-specific, because AMD handles
storms slightly differently: it turns the interrupts off.

No functional changes.

[Tony: Same as Smita's original, plus changes rolled in from prior patches]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 81 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/intel.c | 93 +-----------------------------
arch/x86/kernel/cpu/mce/internal.h | 18 ++++++
3 files changed, 100 insertions(+), 92 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f4d2a7ba29f7..d27daa199523 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -613,6 +613,87 @@ static struct notifier_block mce_default_nb = {
.priority = MCE_PRIO_LOWEST,
};

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ */
+DEFINE_PER_CPU(int, stormy_bank_count);
+DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, true);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, false);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* Read ADDR and MISC registers.
*/
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 4238b73c2143..6cc9aa97c092 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,17 +47,7 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

-/*
- * CMCI storm tracking state
- * stormy_bank_count: per-cpu count of MC banks in storm state
- * bank_history: bitmask tracking of corrected errors seen in each bank
- * bank_time_stamp: last time (in jiffies) that each bank was polled
- * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
- */
-static DEFINE_PER_CPU(int, stormy_bank_count);
-static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
-static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
-static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+/* MCi_CTL2 threshold for each bank when there is no storm */
static int cmci_threshold[MAX_NR_BANKS];

/* Linux non-storm CMCI threshold (may be overridden by BIOS */
@@ -70,17 +60,6 @@ static int cmci_threshold[MAX_NR_BANKS];
*/
#define CMCI_STORM_THRESHOLD 32749

-/*
- * How many errors within the history buffer mark the start of a storm
- */
-#define STORM_BEGIN_THRESHOLD 5
-
-/*
- * How many polls of machine check bank without an error before declaring
- * the storm is over
- */
-#define STORM_END_POLL_THRESHOLD 30
-
static int cmci_supported(int *banks)
{
u64 cap;
@@ -160,76 +139,6 @@ void mce_intel_handle_storm(int bank, bool on)
cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
}

-static void cmci_storm_begin(int bank)
-{
- __set_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_storm[bank], true);
-
- /*
- * If this is the first bank on this CPU to enter storm mode
- * start polling
- */
- if (this_cpu_inc_return(stormy_bank_count) == 1)
- mce_timer_kick(true);
-}
-
-static void cmci_storm_end(int bank)
-{
- __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_history[bank], 0ull);
- this_cpu_write(bank_storm[bank], false);
-
- /* If no banks left in storm mode, stop polling */
- if (!this_cpu_dec_return(stormy_bank_count))
- mce_timer_kick(false);
-}
-
-void track_cmci_storm(int bank, u64 status)
-{
- unsigned long now = jiffies, delta;
- unsigned int shift = 1;
- u64 history;
-
- /*
- * When a bank is in storm mode it is polled once per second and
- * the history mask will record about the last minute of poll results.
- * If it is not in storm mode, then the bank is only checked when
- * there is a CMCI interrupt. Check how long it has been since
- * this bank was last checked, and adjust the amount of "shift"
- * to apply to history.
- */
- if (!this_cpu_read(bank_storm[bank])) {
- delta = now - this_cpu_read(bank_time_stamp[bank]);
- shift = (delta + HZ) / HZ;
- }
-
- /* If has been a long time since the last poll, clear history */
- if (shift >= 64)
- history = 0;
- else
- history = this_cpu_read(bank_history[bank]) << shift;
- this_cpu_write(bank_time_stamp[bank], now);
-
- /* History keeps track of corrected errors. VAL=1 && UC=0 */
- if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
- history |= 1;
- this_cpu_write(bank_history[bank], history);
-
- if (this_cpu_read(bank_storm[bank])) {
- if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
- return;
- pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- mce_handle_storm(bank, true);
- cmci_storm_end(bank);
- } else {
- if (hweight64(history) < STORM_BEGIN_THRESHOLD)
- return;
- pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- mce_handle_storm(bank, false);
- cmci_storm_begin(bank);
- }
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 78467f6cdd04..d7cad839a6a9 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -60,6 +60,24 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }

void mce_timer_kick(bool storm);
void mce_handle_storm(int bank, bool on);
+void cmci_storm_begin(int bank);
+void cmci_storm_end(int bank);
+
+DECLARE_PER_CPU(int, stormy_bank_count);
+DECLARE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DECLARE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DECLARE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
--
2.35.3

2022-06-27 19:01:07

by Luck, Tony

Subject: [PATCH v2 1/5] x86/mce: Remove old CMCI storm mitigation code

When a "storm" of CMCI is detected this code mitigates by
disabling CMCI interrupt signalling from all of the banks
owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the
banks was generating the interrupts, but CMCI is disabled for all.
This means Linux may delay seeing and processing errors logged
from other banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it
is also used to signal when an uncorrected error is logged. This
is a problem because these errors should be handled in a timely
manner.

Delete all this code in preparation for a finer grained solution.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 20 +---
arch/x86/kernel/cpu/mce/intel.c | 145 -----------------------------
arch/x86/kernel/cpu/mce/internal.h | 6 --
3 files changed, 1 insertion(+), 170 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c8ec5c71712..92c2dee4bf43 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1598,13 +1598,6 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);

-static unsigned long mce_adjust_timer_default(unsigned long interval)
-{
- return interval;
-}
-
-static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
-
static void __start_timer(struct timer_list *t, unsigned long interval)
{
unsigned long when = jiffies + interval;
@@ -1627,15 +1620,9 @@ static void mce_timer_fn(struct timer_list *t)

iv = __this_cpu_read(mce_next_interval);

- if (mce_available(this_cpu_ptr(&cpu_info))) {
+ if (mce_available(this_cpu_ptr(&cpu_info)))
machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

- if (mce_intel_cmci_poll()) {
- iv = mce_adjust_timer(iv);
- goto done;
- }
- }
-
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
@@ -1645,7 +1632,6 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

-done:
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -1982,7 +1968,6 @@ static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)

intel_init_cmci();
intel_init_lmce();
- mce_adjust_timer = cmci_intel_adjust_timer;
}

static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
@@ -1995,7 +1980,6 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = cmci_intel_adjust_timer;
break;

case X86_VENDOR_AMD: {
@@ -2651,8 +2635,6 @@ static void mce_reenable_cpu(void)

static int mce_cpu_dead(unsigned int cpu)
{
- mce_intel_hcpu_update(cpu);
-
/* intentionally ignoring frozen here */
if (!cpuhp_tasks_frozen)
cmci_rediscover();
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..052bf2708391 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -41,15 +41,6 @@
*/
static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);

-/*
- * CMCI storm detection backoff counter
- *
- * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
- * encountered an error. If not, we decrement it by one. We signal the end of
- * the CMCI storm when it reaches 0.
- */
-static DEFINE_PER_CPU(int, cmci_backoff_cnt);
-
/*
* cmci_discover_lock protects against parallel discovery attempts
* which could race against each other.
@@ -57,21 +48,6 @@ static DEFINE_PER_CPU(int, cmci_backoff_cnt);
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

#define CMCI_THRESHOLD 1
-#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (HZ)
-#define CMCI_STORM_THRESHOLD 15
-
-static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
-
-enum {
- CMCI_STORM_NONE,
- CMCI_STORM_ACTIVE,
- CMCI_STORM_SUBSIDED,
-};
-
-static atomic_t cmci_storm_on_cpus;

static int cmci_supported(int *banks)
{
@@ -127,124 +103,6 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

-bool mce_intel_cmci_poll(void)
-{
- if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
- return false;
-
- /*
- * Reset the counter if we've logged an error in the last poll
- * during the storm.
- */
- if (machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)))
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
- else
- this_cpu_dec(cmci_backoff_cnt);
-
- return true;
-}
-
-void mce_intel_hcpu_update(unsigned long cpu)
-{
- if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
- atomic_dec(&cmci_storm_on_cpus);
-
- per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
-}
-
-static void cmci_toggle_interrupt_mode(bool on)
-{
- unsigned long flags, *owned;
- int bank;
- u64 val;
-
- raw_spin_lock_irqsave(&cmci_discover_lock, flags);
- owned = this_cpu_ptr(mce_banks_owned);
- for_each_set_bit(bank, owned, MAX_NR_BANKS) {
- rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-
- if (on)
- val |= MCI_CTL2_CMCI_EN;
- else
- val &= ~MCI_CTL2_CMCI_EN;
-
- wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
- }
- raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-unsigned long cmci_intel_adjust_timer(unsigned long interval)
-{
- if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
- (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
- mce_notify_irq();
- return CMCI_STORM_INTERVAL;
- }
-
- switch (__this_cpu_read(cmci_storm_state)) {
- case CMCI_STORM_ACTIVE:
-
- /*
- * We switch back to interrupt mode once the poll timer has
- * silenced itself. That means no events recorded and the timer
- * interval is back to our poll interval.
- */
- __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- if (!atomic_sub_return(1, &cmci_storm_on_cpus))
- pr_notice("CMCI storm subsided: switching to interrupt mode\n");
-
- fallthrough;
-
- case CMCI_STORM_SUBSIDED:
- /*
- * We wait for all CPUs to go back to SUBSIDED state. When that
- * happens we switch back to interrupt mode.
- */
- if (!atomic_read(&cmci_storm_on_cpus)) {
- __this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
- cmci_toggle_interrupt_mode(true);
- cmci_recheck();
- }
- return CMCI_POLL_INTERVAL;
- default:
-
- /* We have shiny weather. Let the poll do whatever it thinks. */
- return interval;
- }
-}
-
-static bool cmci_storm_detect(void)
-{
- unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
- unsigned long ts = __this_cpu_read(cmci_time_stamp);
- unsigned long now = jiffies;
- int r;
-
- if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
- return true;
-
- if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
- cnt++;
- } else {
- cnt = 1;
- __this_cpu_write(cmci_time_stamp, now);
- }
- __this_cpu_write(cmci_storm_cnt, cnt);
-
- if (cnt <= CMCI_STORM_THRESHOLD)
- return false;
-
- cmci_toggle_interrupt_mode(false);
- __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
- r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_STORM_INTERVAL);
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
-
- if (r == 1)
- pr_notice("CMCI storm detected: switching to poll mode\n");
- return true;
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -253,9 +111,6 @@ static bool cmci_storm_detect(void)
*/
static void intel_threshold_interrupt(void)
{
- if (cmci_storm_detect())
- return;
-
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 4ae0e603f7fa..17d313c9cc60 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,18 +41,12 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long cmci_intel_adjust_timer(unsigned long interval);
-bool mce_intel_cmci_poll(void);
-void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
-# define cmci_intel_adjust_timer mce_adjust_timer_default
-static inline bool mce_intel_cmci_poll(void) { return false; }
-static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
--
2.35.3

2023-03-17 14:50:16

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Handle corrected machine check interrupt storms

On Mon, Jun 27, 2022 at 10:36:00AM -0700, Tony Luck wrote:
> Extend the logic of handling Intel's corrected machine check interrupt
> storms to AMD's threshold interrupts.
>
> First two patches are from Tony which cleans up the existing storm
> handling for Intel and proposes per CPU per bank storm handling.
>
> Third and fourth patches do some cleanup and refactoring on the CMCI
> storm handling in order to extend similar workaround for AMD's threshold
> interrupt storms. These two patches could be merged into Tony's second
> patch of CMCI storm mitigation.
>
> AMD's storm mitigation for threshold interrupts also relies on per CPU
> per bank approach similar to Intel. But unlike CMCI storm handling it does
> not set thresholds to reduce rate of interrupts on a storm. Rather it
> turns off the interrupt on the current CPU and bank if there is a storm
> and re-enables back the interrupts when the storm subsides.
>
> It is okay to turn off threshold interrupts on AMD systems as other error
> severities continue to be handled even if the threshold interrupts are
> turned off. Uncorrected errors will generate a #MC and deferred errors
> have a unique separate deferred error interrupt. The final patch adds
> support for handling threshold interrupt storms on AMD systems.
>
> Changes since v1:
>
> 1) Fix shift computation when keeping track of bank history. Shift
> should be "1" when a storm is in progress (because polling once per
> second). When a storm is not in progress shift should be based on
> number of seconds since the bank was last checked.
>
> 2) Changed Smita's code in part 0003 to avoid use of a function pointer
> (since the kernel is avoiding indirect branch points that might be
> trainable for various Spectre-like issues).
>
> Smita Koralahalli (3):
> x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms
> x86/mce: Move storm handling to core.
> x86/mce: Handle AMD threshold interrupt storms
>
> Tony Luck (2):
> x86/mce: Remove old CMCI storm mitigation code
> x86/mce: Add per-bank CMCI storm mitigation
>
> arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
> arch/x86/kernel/cpu/mce/core.c | 139 +++++++++++++++++-----
> arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
> arch/x86/kernel/cpu/mce/internal.h | 33 ++++--
> 4 files changed, 230 insertions(+), 170 deletions(-)
>
> --

Hi Tony,

Is there an updated version of this set? I can help review and test. Smita is
focusing on other items at the moment.

Thanks!

-Yazen

2023-03-17 17:20:57

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3 0/5] Handle corrected machine check interrupt storms

This is the same as v2 (posted June 2022) rebased to v6.1-rc4. I meant to post
when I did that, but apparently got distracted.

Patches 1-4 still apply cleanly to upstream but there's a trivial fixup
needed to arch/x86/kernel/cpu/mce/internal.h to make patch 5 apply
to v6.3-rc2.

Smita Koralahalli (3):
x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms
x86/mce: Move storm handling to core.
x86/mce: Handle AMD threshold interrupt storms

Tony Luck (2):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation

arch/x86/kernel/cpu/mce/internal.h | 33 ++++--
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
arch/x86/kernel/cpu/mce/core.c | 139 +++++++++++++++++-----
arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
4 files changed, 230 insertions(+), 170 deletions(-)


base-commit: f0c4d9fc9cc9462659728d168387191387e903cc
--
2.39.2


2023-03-17 17:21:00

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3 2/5] x86/mce: Add per-bank CMCI storm mitigation

Add a hook into machine_check_poll() to keep track of per-CPU, per-bank
corrected error logs.

Maintain a bitmap history for each bank showing whether the bank
logged a corrected error or not each time it is polled.

In normal operation the interval between polls of this bank
determines how far to shift the history. The 64 bit width corresponds
to about one second.

When a storm is observed the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second.

During a storm each bit in the history indicates the status of the
bank each time it is polled. Thus the history covers just over a minute.

Declare a storm for that bank if the number of corrected interrupts
seen in that history is above some threshold (5 in this RFC code for
ease of testing, likely move to 15 for compatibility with previous
storm detection).

A storm on a bank ends if enough consecutive polls of the bank show
no corrected errors (currently 30, may also change). That resets the
threshold in IA32_MCi_CTL2 back to 1, removes the bank from the bitmap
for polling, and changes the polling rate back to the default.

If a CPU with banks in storm mode is taken offline, the new CPU
that inherits ownership of those banks takes over management of
storm(s) in the inherited bank(s).

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 4 +-
arch/x86/kernel/cpu/mce/core.c | 26 ++++--
arch/x86/kernel/cpu/mce/intel.c | 139 ++++++++++++++++++++++++++++-
3 files changed, 158 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 07fef4d74525..72fbec8f6c3c 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -40,6 +40,8 @@ struct dentry *mce_get_debugfs_dir(void);

extern mce_banks_t mce_banks_ce_disabled;

+void track_cmci_storm(int bank, u64 status);
+
#ifdef CONFIG_X86_MCE_INTEL
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
@@ -54,7 +56,7 @@ static inline void intel_clear_lmce(void) { }
static inline bool intel_filter_mce(struct mce *m) { return false; }
#endif

-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 92c2dee4bf43..776d4724b1e0 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -694,6 +694,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
barrier();
m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));

+ track_cmci_storm(i, m.status);
+
/* If this entry is not valid, ignore it */
if (!(m.status & MCI_STATUS_VAL))
continue;
@@ -1597,6 +1599,7 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;

static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);
+static DEFINE_PER_CPU(bool, storm_poll_mode);

static void __start_timer(struct timer_list *t, unsigned long interval)
{
@@ -1632,22 +1635,29 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

- __this_cpu_write(mce_next_interval, iv);
- __start_timer(t, iv);
+ if (__this_cpu_read(storm_poll_mode)) {
+ __start_timer(t, HZ);
+ } else {
+ __this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
}

/*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
*/
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- unsigned long iv = __this_cpu_read(mce_next_interval);

- __start_timer(t, interval);
+ __this_cpu_write(storm_poll_mode, storm);

- if (interval < iv)
- __this_cpu_write(mce_next_interval, interval);
+ if (storm)
+ __start_timer(t, HZ);
+ else
+ __this_cpu_write(mce_next_interval, check_interval * HZ);
}

/* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 052bf2708391..4106877de028 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,8 +47,40 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
+ */
+static DEFINE_PER_CPU(int, stormy_bank_count);
+static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+static int cmci_threshold[MAX_NR_BANKS];
+
+/* Linux non-storm CMCI threshold (may be overridden by BIOS) */
#define CMCI_THRESHOLD 1

+/*
+ * High threshold to limit CMCI rate during storms. Max supported is
+ * 0x7FFF. Use this slightly smaller value so it has a distinctive
+ * signature when someone asks "Why am I not seeing all corrected errors?"
+ */
+#define CMCI_STORM_THRESHOLD 32749
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30
+
static int cmci_supported(int *banks)
{
u64 cap;
@@ -103,6 +135,93 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

+/*
+ * Set a new CMCI threshold value. Preserve the state of the
+ * MCI_CTL2_CMCI_EN bit in case this happens during a
+ * cmci_rediscover() operation.
+ */
+static void cmci_set_threshold(int bank, int thresh)
+{
+ unsigned long flags;
+ u64 val;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val | thresh);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+static void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+static void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If it has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -147,6 +266,9 @@ static void cmci_discover(int banks)
continue;
}

+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ goto storm;
+
if (!mca_cfg.bios_cmci_threshold) {
val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
val |= CMCI_THRESHOLD;
@@ -159,7 +281,7 @@ static void cmci_discover(int banks)
bios_zero_thresh = 1;
val |= CMCI_THRESHOLD;
}
-
+storm:
val |= MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(i), val);
rdmsrl(MSR_IA32_MCx_CTL2(i), val);
@@ -167,7 +289,14 @@ static void cmci_discover(int banks)
/* Did the enable bit stick? -- the bank supports CMCI */
if (val & MCI_CTL2_CMCI_EN) {
set_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
+ pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), i);
+ this_cpu_write(bank_history[i], ~0ull);
+ this_cpu_write(bank_time_stamp[i], jiffies);
+ cmci_storm_begin(i);
+ } else {
+ __clear_bit(i, this_cpu_ptr(mce_poll_banks));
+ }
/*
* We are able to set thresholds for some banks that
* had a threshold of 0. This means the BIOS has not
@@ -177,6 +306,10 @@ static void cmci_discover(int banks)
if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
(val & MCI_CTL2_CMCI_THRESHOLD_MASK))
bios_wrong_thresh = 1;
+
+ /* Save default threshold for each bank */
+ if (cmci_threshold[i] == 0)
+ cmci_threshold[i] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
} else {
WARN_ON(!test_bit(i, this_cpu_ptr(mce_poll_banks)));
}
@@ -218,6 +351,8 @@ static void __cmci_disable_bank(int bank)
val &= ~MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
__clear_bit(bank, this_cpu_ptr(mce_banks_owned));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ cmci_storm_end(bank);
}

/*
--
2.39.2


2023-03-17 17:21:03

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3 1/5] x86/mce: Remove old CMCI storm mitigation code

When a "storm" of CMCI is detected this code mitigates by
disabling CMCI interrupt signalling from all of the banks
owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the
banks was generating the interrupts, but CMCI is disabled for all.
This means Linux may delay seeing and processing errors logged
from other banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it
is also used to signal when an uncorrected error is logged. This
is a problem because these errors should be handled in a timely
manner.

Delete all this code in preparation for a finer grained solution.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 6 --
arch/x86/kernel/cpu/mce/core.c | 20 +---
arch/x86/kernel/cpu/mce/intel.c | 145 -----------------------------
3 files changed, 1 insertion(+), 170 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 7e03f5b7f6bd..07fef4d74525 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,18 +41,12 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long cmci_intel_adjust_timer(unsigned long interval);
-bool mce_intel_cmci_poll(void);
-void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
-# define cmci_intel_adjust_timer mce_adjust_timer_default
-static inline bool mce_intel_cmci_poll(void) { return false; }
-static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c8ec5c71712..92c2dee4bf43 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1598,13 +1598,6 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);

-static unsigned long mce_adjust_timer_default(unsigned long interval)
-{
- return interval;
-}
-
-static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
-
static void __start_timer(struct timer_list *t, unsigned long interval)
{
unsigned long when = jiffies + interval;
@@ -1627,15 +1620,9 @@ static void mce_timer_fn(struct timer_list *t)

iv = __this_cpu_read(mce_next_interval);

- if (mce_available(this_cpu_ptr(&cpu_info))) {
+ if (mce_available(this_cpu_ptr(&cpu_info)))
machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

- if (mce_intel_cmci_poll()) {
- iv = mce_adjust_timer(iv);
- goto done;
- }
- }
-
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
@@ -1645,7 +1632,6 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

-done:
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -1982,7 +1968,6 @@ static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)

intel_init_cmci();
intel_init_lmce();
- mce_adjust_timer = cmci_intel_adjust_timer;
}

static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
@@ -1995,7 +1980,6 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = cmci_intel_adjust_timer;
break;

case X86_VENDOR_AMD: {
@@ -2651,8 +2635,6 @@ static void mce_reenable_cpu(void)

static int mce_cpu_dead(unsigned int cpu)
{
- mce_intel_hcpu_update(cpu);
-
/* intentionally ignoring frozen here */
if (!cpuhp_tasks_frozen)
cmci_rediscover();
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..052bf2708391 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -41,15 +41,6 @@
*/
static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);

-/*
- * CMCI storm detection backoff counter
- *
- * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
- * encountered an error. If not, we decrement it by one. We signal the end of
- * the CMCI storm when it reaches 0.
- */
-static DEFINE_PER_CPU(int, cmci_backoff_cnt);
-
/*
* cmci_discover_lock protects against parallel discovery attempts
* which could race against each other.
@@ -57,21 +48,6 @@ static DEFINE_PER_CPU(int, cmci_backoff_cnt);
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

#define CMCI_THRESHOLD 1
-#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (HZ)
-#define CMCI_STORM_THRESHOLD 15
-
-static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
-
-enum {
- CMCI_STORM_NONE,
- CMCI_STORM_ACTIVE,
- CMCI_STORM_SUBSIDED,
-};
-
-static atomic_t cmci_storm_on_cpus;

static int cmci_supported(int *banks)
{
@@ -127,124 +103,6 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

-bool mce_intel_cmci_poll(void)
-{
- if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
- return false;
-
- /*
- * Reset the counter if we've logged an error in the last poll
- * during the storm.
- */
- if (machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)))
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
- else
- this_cpu_dec(cmci_backoff_cnt);
-
- return true;
-}
-
-void mce_intel_hcpu_update(unsigned long cpu)
-{
- if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
- atomic_dec(&cmci_storm_on_cpus);
-
- per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
-}
-
-static void cmci_toggle_interrupt_mode(bool on)
-{
- unsigned long flags, *owned;
- int bank;
- u64 val;
-
- raw_spin_lock_irqsave(&cmci_discover_lock, flags);
- owned = this_cpu_ptr(mce_banks_owned);
- for_each_set_bit(bank, owned, MAX_NR_BANKS) {
- rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-
- if (on)
- val |= MCI_CTL2_CMCI_EN;
- else
- val &= ~MCI_CTL2_CMCI_EN;
-
- wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
- }
- raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-unsigned long cmci_intel_adjust_timer(unsigned long interval)
-{
- if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
- (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
- mce_notify_irq();
- return CMCI_STORM_INTERVAL;
- }
-
- switch (__this_cpu_read(cmci_storm_state)) {
- case CMCI_STORM_ACTIVE:
-
- /*
- * We switch back to interrupt mode once the poll timer has
- * silenced itself. That means no events recorded and the timer
- * interval is back to our poll interval.
- */
- __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- if (!atomic_sub_return(1, &cmci_storm_on_cpus))
- pr_notice("CMCI storm subsided: switching to interrupt mode\n");
-
- fallthrough;
-
- case CMCI_STORM_SUBSIDED:
- /*
- * We wait for all CPUs to go back to SUBSIDED state. When that
- * happens we switch back to interrupt mode.
- */
- if (!atomic_read(&cmci_storm_on_cpus)) {
- __this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
- cmci_toggle_interrupt_mode(true);
- cmci_recheck();
- }
- return CMCI_POLL_INTERVAL;
- default:
-
- /* We have shiny weather. Let the poll do whatever it thinks. */
- return interval;
- }
-}
-
-static bool cmci_storm_detect(void)
-{
- unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
- unsigned long ts = __this_cpu_read(cmci_time_stamp);
- unsigned long now = jiffies;
- int r;
-
- if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
- return true;
-
- if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
- cnt++;
- } else {
- cnt = 1;
- __this_cpu_write(cmci_time_stamp, now);
- }
- __this_cpu_write(cmci_storm_cnt, cnt);
-
- if (cnt <= CMCI_STORM_THRESHOLD)
- return false;
-
- cmci_toggle_interrupt_mode(false);
- __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
- r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_STORM_INTERVAL);
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
-
- if (r == 1)
- pr_notice("CMCI storm detected: switching to poll mode\n");
- return true;
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -253,9 +111,6 @@ static bool cmci_storm_detect(void)
*/
static void intel_threshold_interrupt(void)
{
- if (cmci_storm_detect())
- return;
-
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

--
2.39.2


2023-03-17 17:21:06

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms

From: Smita Koralahalli <[email protected]>

Intel and AMD need to take different actions when a storm begins or
ends. Prepare for the storm code moving from intel.c into core.c by
adding a function that checks CPU vendor to pick the right action.

No functional changes.

[Tony: Changed from function pointer to regular function]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 3 +++
arch/x86/kernel/cpu/mce/core.c | 9 +++++++++
arch/x86/kernel/cpu/mce/intel.c | 12 ++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 72fbec8f6c3c..f37816b4d4cf 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -43,12 +43,14 @@ extern mce_banks_t mce_banks_ce_disabled;
void track_cmci_storm(int bank, u64 status);

#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_handle_storm(int bank, bool on);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
+static inline void mce_intel_handle_storm(int bank, bool on) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
@@ -57,6 +59,7 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }
#endif

void mce_timer_kick(bool storm);
+void mce_handle_storm(int bank, bool on);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 776d4724b1e0..f4d2a7ba29f7 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1985,6 +1985,15 @@ static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
intel_clear_lmce();
}

+void mce_handle_storm(int bank, bool on)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_handle_storm(bank, on);
+ break;
+ }
+}
+
static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
{
switch (c->x86_vendor) {
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 4106877de028..4238b73c2143 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -152,6 +152,14 @@ static void cmci_set_threshold(int bank, int thresh)
raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
}

+void mce_intel_handle_storm(int bank, bool on)
+{
+ if (on)
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+ else
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+}
+
static void cmci_storm_begin(int bank)
{
__set_bit(bank, this_cpu_ptr(mce_poll_banks));
@@ -211,13 +219,13 @@ void track_cmci_storm(int bank, u64 status)
if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
return;
pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- cmci_set_threshold(bank, cmci_threshold[bank]);
+ mce_handle_storm(bank, true);
cmci_storm_end(bank);
} else {
if (hweight64(history) < STORM_BEGIN_THRESHOLD)
return;
pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ mce_handle_storm(bank, false);
cmci_storm_begin(bank);
}
}
--
2.39.2


2023-03-17 17:21:09

by Luck, Tony

Subject: [PATCH v3 4/5] x86/mce: Move storm handling to core.

From: Smita Koralahalli <[email protected]>

AMD's storm handling for threshold interrupts is similar to Intel's CMCI
storm handling. Hence, make the storm handling code common by moving to
core and removing the vendor exclusivity.

By contrast, setting different thresholds in the IA32_MCi_CTL2 register
to reduce the rate of interrupts is kept Intel-specific, because AMD's
storm handling differs slightly: it handles storms by turning off the
interrupts.
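
As a rough user-space illustration of the common part that moves to core,
the per-CPU bookkeeping can be sketched as below. This is a simplified
model, not the patch itself: struct cpu_storm_state, storm_begin(),
storm_end(), MAX_BANKS and the fast_poll flag are hypothetical names, and
the flag merely stands in for the effect of mce_timer_kick().

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BANKS 64	/* illustrative bound, not the kernel's MAX_NR_BANKS */

/* Hypothetical model of the storm state one CPU keeps for its banks. */
struct cpu_storm_state {
	bool bank_storm[MAX_BANKS];	/* which banks are in storm mode */
	int  stormy_bank_count;		/* how many banks are in storm mode */
	bool fast_poll;			/* true while polling once per second */
};

/* A bank enters storm mode; the first one switches the CPU to fast polling. */
static void storm_begin(struct cpu_storm_state *s, int bank)
{
	assert(bank >= 0 && bank < MAX_BANKS && !s->bank_storm[bank]);

	s->bank_storm[bank] = true;
	if (++s->stormy_bank_count == 1)
		s->fast_poll = true;
}

/* A bank leaves storm mode; the last one restores the default polling. */
static void storm_end(struct cpu_storm_state *s, int bank)
{
	assert(bank >= 0 && bank < MAX_BANKS && s->bank_storm[bank]);

	s->bank_storm[bank] = false;
	if (--s->stormy_bank_count == 0)
		s->fast_poll = false;
}
```

Nothing in this bookkeeping depends on the vendor, which is why only the
mitigation action itself needs a per-vendor hook.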

No functional changes.

[Tony: Same as Smita's original, plus changes rolled in from prior patches]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 18 ++++++
arch/x86/kernel/cpu/mce/core.c | 81 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/intel.c | 93 +-----------------------------
3 files changed, 100 insertions(+), 92 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index f37816b4d4cf..9b2c54f30fb9 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -60,6 +60,24 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }

void mce_timer_kick(bool storm);
void mce_handle_storm(int bank, bool on);
+void cmci_storm_begin(int bank);
+void cmci_storm_end(int bank);
+
+DECLARE_PER_CPU(int, stormy_bank_count);
+DECLARE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DECLARE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DECLARE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f4d2a7ba29f7..d27daa199523 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -613,6 +613,87 @@ static struct notifier_block mce_default_nb = {
.priority = MCE_PRIO_LOWEST,
};

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ */
+DEFINE_PER_CPU(int, stormy_bank_count);
+DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, true);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, false);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* Read ADDR and MISC registers.
*/
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 4238b73c2143..6cc9aa97c092 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,17 +47,7 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

-/*
- * CMCI storm tracking state
- * stormy_bank_count: per-cpu count of MC banks in storm state
- * bank_history: bitmask tracking of corrected errors seen in each bank
- * bank_time_stamp: last time (in jiffies) that each bank was polled
- * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
- */
-static DEFINE_PER_CPU(int, stormy_bank_count);
-static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
-static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
-static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+/* MCi_CTL2 threshold for each bank when there is no storm */
static int cmci_threshold[MAX_NR_BANKS];

/* Linux non-storm CMCI threshold (may be overridden by BIOS */
@@ -70,17 +60,6 @@ static int cmci_threshold[MAX_NR_BANKS];
*/
#define CMCI_STORM_THRESHOLD 32749

-/*
- * How many errors within the history buffer mark the start of a storm
- */
-#define STORM_BEGIN_THRESHOLD 5
-
-/*
- * How many polls of machine check bank without an error before declaring
- * the storm is over
- */
-#define STORM_END_POLL_THRESHOLD 30
-
static int cmci_supported(int *banks)
{
u64 cap;
@@ -160,76 +139,6 @@ void mce_intel_handle_storm(int bank, bool on)
cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
}

-static void cmci_storm_begin(int bank)
-{
- __set_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_storm[bank], true);
-
- /*
- * If this is the first bank on this CPU to enter storm mode
- * start polling
- */
- if (this_cpu_inc_return(stormy_bank_count) == 1)
- mce_timer_kick(true);
-}
-
-static void cmci_storm_end(int bank)
-{
- __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_history[bank], 0ull);
- this_cpu_write(bank_storm[bank], false);
-
- /* If no banks left in storm mode, stop polling */
- if (!this_cpu_dec_return(stormy_bank_count))
- mce_timer_kick(false);
-}
-
-void track_cmci_storm(int bank, u64 status)
-{
- unsigned long now = jiffies, delta;
- unsigned int shift = 1;
- u64 history;
-
- /*
- * When a bank is in storm mode it is polled once per second and
- * the history mask will record about the last minute of poll results.
- * If it is not in storm mode, then the bank is only checked when
- * there is a CMCI interrupt. Check how long it has been since
- * this bank was last checked, and adjust the amount of "shift"
- * to apply to history.
- */
- if (!this_cpu_read(bank_storm[bank])) {
- delta = now - this_cpu_read(bank_time_stamp[bank]);
- shift = (delta + HZ) / HZ;
- }
-
- /* If has been a long time since the last poll, clear history */
- if (shift >= 64)
- history = 0;
- else
- history = this_cpu_read(bank_history[bank]) << shift;
- this_cpu_write(bank_time_stamp[bank], now);
-
- /* History keeps track of corrected errors. VAL=1 && UC=0 */
- if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
- history |= 1;
- this_cpu_write(bank_history[bank], history);
-
- if (this_cpu_read(bank_storm[bank])) {
- if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
- return;
- pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- mce_handle_storm(bank, true);
- cmci_storm_end(bank);
- } else {
- if (hweight64(history) < STORM_BEGIN_THRESHOLD)
- return;
- pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- mce_handle_storm(bank, false);
- cmci_storm_begin(bank);
- }
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
--
2.39.2


2023-03-17 17:21:13

by Luck, Tony

Subject: [PATCH v3 5/5] x86/mce: Handle AMD threshold interrupt storms

From: Smita Koralahalli <[email protected]>

Extend the logic of handling CMCI storms to AMD threshold interrupts.

Use an approach similar to Intel's CMCI handling to mitigate storms per
CPU and per bank. But, unlike CMCI, do not set thresholds to reduce the
interrupt rate during a storm. Rather, disable the interrupt on the
corresponding CPU and bank. Re-enable the interrupts once enough
consecutive polls of the bank show no corrected errors (30, the same
value used for Intel).

Turning off the threshold interrupts would be a better solution on AMD
systems as other error severities will still be handled even if the
threshold interrupts are disabled.

[Tony: Small tweak because mce_handle_storm() isn't a pointer now]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 2 ++
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/core.c | 3 ++
3 files changed, 54 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 9b2c54f30fb9..b580bd609fdc 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -206,7 +206,9 @@ extern bool filter_mce(struct mce *m);

#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
+void mce_amd_handle_storm(int bank, bool on);
#else
+static inline void mce_amd_handle_storm(int bank, bool on) {}
static inline bool amd_filter_mce(struct mce *m) { return false; }
#endif

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 1c87501e0fa3..b7f92af065e1 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -466,6 +466,47 @@ static void threshold_restart_bank(void *_tr)
wrmsr(tr->b->address, lo, hi);
}

+static void _reset_block(struct threshold_block *block)
+{
+ struct thresh_restart tr;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.b = block;
+ threshold_restart_bank(&tr);
+}
+
+static void toggle_interrupt_reset_block(struct threshold_block *block, bool on)
+{
+ if (!block)
+ return;
+
+ block->interrupt_enable = !!on;
+ _reset_block(block);
+}
+
+void mce_amd_handle_storm(int bank, bool on)
+{
+ struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
+ struct threshold_bank **bp = this_cpu_read(threshold_banks);
+ unsigned long flags;
+
+ if (!bp)
+ return;
+
+ local_irq_save(flags);
+
+ first_block = bp[bank]->blocks;
+ if (!first_block)
+ goto end;
+
+ toggle_interrupt_reset_block(first_block, on);
+
+ list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
+ toggle_interrupt_reset_block(block, on);
+end:
+ local_irq_restore(flags);
+}
+
static void mce_threshold_block_init(struct threshold_block *b, int offset)
{
struct thresh_restart tr = {
@@ -867,6 +908,7 @@ static void amd_threshold_interrupt(void)
struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
struct threshold_bank **bp = this_cpu_read(threshold_banks);
unsigned int bank, cpu = smp_processor_id();
+ u64 status;

/*
* Validate that the threshold bank has been initialized already. The
@@ -880,6 +922,13 @@ static void amd_threshold_interrupt(void)
if (!(per_cpu(bank_map, cpu) & (1 << bank)))
continue;

+ rdmsrl(mca_msr_reg(bank, MCA_STATUS), status);
+ track_cmci_storm(bank, status);
+
+ /* Return early on an interrupt storm */
+ if (this_cpu_read(bank_storm[bank]))
+ return;
+
first_block = bp[bank]->blocks;
if (!first_block)
continue;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index d27daa199523..6121f0afe45a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2072,6 +2072,9 @@ void mce_handle_storm(int bank, bool on)
case X86_VENDOR_INTEL:
mce_intel_handle_storm(bank, on);
break;
+ case X86_VENDOR_AMD:
+ mce_amd_handle_storm(bank, on);
+ break;
}
}

--
2.39.2


2023-03-23 15:24:25

by Yazen Ghannam

Subject: Re: [PATCH v3 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms

On Fri, Mar 17, 2023 at 10:20:40AM -0700, Tony Luck wrote:
> From: Smita Koralahalli <[email protected]>
>
> Intel and AMD need to take different actions when a storm begins or
> ends. Prepare for the storm code moving from intel.c into core.c by
> adding a function that checks CPU vendor to pick the right action.
>
> No functional changes.
>
> [Tony: Changed from function pointer to regular function]
>
> Signed-off-by: Smita Koralahalli <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/internal.h | 3 +++
> arch/x86/kernel/cpu/mce/core.c | 9 +++++++++
> arch/x86/kernel/cpu/mce/intel.c | 12 ++++++++++--
> 3 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
> index 72fbec8f6c3c..f37816b4d4cf 100644
> --- a/arch/x86/kernel/cpu/mce/internal.h
> +++ b/arch/x86/kernel/cpu/mce/internal.h
> @@ -43,12 +43,14 @@ extern mce_banks_t mce_banks_ce_disabled;
> void track_cmci_storm(int bank, u64 status);
>
> #ifdef CONFIG_X86_MCE_INTEL
> +void mce_intel_handle_storm(int bank, bool on);
> void cmci_disable_bank(int bank);
> void intel_init_cmci(void);
> void intel_init_lmce(void);
> void intel_clear_lmce(void);
> bool intel_filter_mce(struct mce *m);
> #else
> +static inline void mce_intel_handle_storm(int bank, bool on) { }
> static inline void cmci_disable_bank(int bank) { }
> static inline void intel_init_cmci(void) { }
> static inline void intel_init_lmce(void) { }
> @@ -57,6 +59,7 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }
> #endif
>
> void mce_timer_kick(bool storm);
> +void mce_handle_storm(int bank, bool on);
>
> #ifdef CONFIG_ACPI_APEI
> int apei_write_mce(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 776d4724b1e0..f4d2a7ba29f7 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1985,6 +1985,15 @@ static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
> intel_clear_lmce();
> }
>
> +void mce_handle_storm(int bank, bool on)
> +{
> + switch (boot_cpu_data.x86_vendor) {
> + case X86_VENDOR_INTEL:
> + mce_intel_handle_storm(bank, on);
> + break;
> + }
> +}
> +
> static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
> {
> switch (c->x86_vendor) {
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index 4106877de028..4238b73c2143 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -152,6 +152,14 @@ static void cmci_set_threshold(int bank, int thresh)
> raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
> }
>
> +void mce_intel_handle_storm(int bank, bool on)
> +{
> + if (on)
> + cmci_set_threshold(bank, cmci_threshold[bank]);
> + else
> + cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);

I think these conditions are reversed. When storm handling is 'on' we should
use CMCI_STORM_THRESHOLD, and when off use the saved bank threshold.

> +}
> +
> static void cmci_storm_begin(int bank)
> {
> __set_bit(bank, this_cpu_ptr(mce_poll_banks));
> @@ -211,13 +219,13 @@ void track_cmci_storm(int bank, u64 status)
> if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
> return;
> pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
> - cmci_set_threshold(bank, cmci_threshold[bank]);
> + mce_handle_storm(bank, true);

Should be 'false' when the storm subsides.

> cmci_storm_end(bank);
> } else {
> if (hweight64(history) < STORM_BEGIN_THRESHOLD)
> return;
> pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
> - cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
> + mce_handle_storm(bank, false);

Should be 'true' when the storm starts.

> cmci_storm_begin(bank);
> }
> }
> --

Thanks,
Yazen

2023-03-23 15:48:43

by Yazen Ghannam

Subject: Re: [PATCH v3 4/5] x86/mce: Move storm handling to core.

On Fri, Mar 17, 2023 at 10:20:41AM -0700, Tony Luck wrote:
> From: Smita Koralahalli <[email protected]>
>
> AMD's storm handling for threshold interrupts is similar to Intel's CMCI
> storm handling. Hence, make the storm handling code common by moving to
> core and removing the vendor exclusivity.
>
> On the contrary, setting different thresholds to reduce rate of interrupts
> in IA32_MCi_CTL2 register is kept Intel intact as the storm handling for
> AMD slightly differs where in it handles the storms by turning off the
> interrupts.
>
> No functional changes.
>
> [Tony: Same as Smita's original, plus changes rolled in from prior patches]
>
> Signed-off-by: Smita Koralahalli <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---

Can this patch and the previous two be squashed together?

Like so?
Patch 1: Remove old code.
Patch 2: Add new common and Intel-specific code.
Patch 3: Add AMD-specific code.

Thanks,
Yazen

2023-03-23 18:04:40

by Luck, Tony

Subject: Re: [PATCH v3 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms

On Thu, Mar 23, 2023 at 11:22:22AM -0400, Yazen Ghannam wrote:
> On Fri, Mar 17, 2023 at 10:20:40AM -0700, Tony Luck wrote:
> > +void mce_intel_handle_storm(int bank, bool on)
> > +{
> > + if (on)
> > + cmci_set_threshold(bank, cmci_threshold[bank]);
> > + else
> > + cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
>
> I think these conditions are reversed. When storm handling is 'on' we should
> use CMCI_STORM_THRESHOLD, and when off use the saved bank threshold.
>
> > +}
> > +
> > static void cmci_storm_begin(int bank)
> > {
> > __set_bit(bank, this_cpu_ptr(mce_poll_banks));
> > @@ -211,13 +219,13 @@ void track_cmci_storm(int bank, u64 status)
> > if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
> > return;
> > pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
> > - cmci_set_threshold(bank, cmci_threshold[bank]);
> > + mce_handle_storm(bank, true);
>
> Should be 'false' when the storm subsides.
>
> > cmci_storm_end(bank);
> > } else {
> > if (hweight64(history) < STORM_BEGIN_THRESHOLD)
> > return;
> > pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
> > - cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
> > + mce_handle_storm(bank, false);
>
> Should be 'true' when the storm starts.
>
> > cmci_storm_begin(bank);
> > }
> > }

There's a saying that two wrongs do not make a right (but three lefts do).

My code was working, but only because the second mistake cancelled
out the first.

Changing them both as you suggest (diff below) and the code still
works, and makes sense too!
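
To make the cancellation concrete, here is a small user-space sketch. It
is only an illustration: the function names are hypothetical, and
NORMAL_THRESHOLD is an assumed stand-in for cmci_threshold[bank]; only
CMCI_STORM_THRESHOLD is taken from the series.

```c
#include <assert.h>
#include <stdbool.h>

#define NORMAL_THRESHOLD	1	/* assumed stand-in for cmci_threshold[bank] */
#define CMCI_STORM_THRESHOLD	32749	/* value used by the series */

/* v3 callee: the branches are swapped relative to the meaning of "on". */
static int v3_threshold(bool on)
{
	return on ? NORMAL_THRESHOLD : CMCI_STORM_THRESHOLD;
}

/* Fixed callee: "on" means a storm is in progress. */
static int fixed_threshold(bool on)
{
	return on ? CMCI_STORM_THRESHOLD : NORMAL_THRESHOLD;
}

/* v3 caller: passes the inverse of the storm state. */
static int v3_result(bool storm_started)
{
	return v3_threshold(!storm_started);
}

/* Fixed caller: passes the storm state as is. */
static int fixed_result(bool storm_started)
{
	return fixed_threshold(storm_started);
}

/* Both versions program the same threshold for either transition. */
static void check_equivalence(void)
{
	assert(v3_result(true) == fixed_result(true));
	assert(v3_result(false) == fixed_result(false));
}
```

The two versions agree only while caller and callee are both wrong or
both right, so fixing just one of the two places would have broken the
mitigation.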

Thanks

-Tony

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 74b560476424..c3e1bb790680 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -677,13 +677,13 @@ void track_cmci_storm(int bank, u64 status)
if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
return;
pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- mce_handle_storm(bank, true);
+ mce_handle_storm(bank, false);
cmci_storm_end(bank);
} else {
if (hweight64(history) < STORM_BEGIN_THRESHOLD)
return;
pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- mce_handle_storm(bank, false);
+ mce_handle_storm(bank, true);
cmci_storm_begin(bank);
}
}
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 6cc9aa97c092..20c2143a68c1 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -134,9 +134,9 @@ static void cmci_set_threshold(int bank, int thresh)
void mce_intel_handle_storm(int bank, bool on)
{
if (on)
- cmci_set_threshold(bank, cmci_threshold[bank]);
- else
cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ else
+ cmci_set_threshold(bank, cmci_threshold[bank]);
}

/*

2023-03-23 18:26:14

by Luck, Tony

Subject: RE: [PATCH v3 4/5] x86/mce: Move storm handling to core.

> Can this patch and the previous two be squashed together?
>
> Like so?
> Patch 1: Remove old code.
> Patch 2: Add new common and Intel-specific code.
> Patch 3: Add AMD-specific code.

Yazen,

Those three patches could be merged ... but they already seem big:

0002: 3 files changed, 158 insertions(+), 11 deletions(-)
0003: 3 files changed, 22 insertions(+), 2 deletions(-)
0004: 3 files changed, 100 insertions(+), 92 deletions(-)

Lumping them together wouldn't be the sum of those but would be worse (IMHO)

-Tony

2023-03-23 20:41:25

by Luck, Tony

Subject: RE: [PATCH v3 4/5] x86/mce: Move storm handling to core.

Yazen,

I folded the fixes for the issues you pointed to in patch 3/5 into the series and rebased to v6.3-rc3

Resulting tree pushed here:

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git try-storm-on6-3

Builds, boots, and passes my storm tests here.

How is testing going on the AMD side of this series?

-Tony

2023-03-24 20:45:57

by Yazen Ghannam

Subject: Re: [PATCH v3 4/5] x86/mce: Move storm handling to core.

On Thu, Mar 23, 2023 at 08:26:02PM +0000, Luck, Tony wrote:
> Yazen,
>
> I folded the fixes for the issues you pointed to in patch 3/5 into the series and rebased to v6.3-rc3
>
> Resulting tree pushed here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git try-storm-on6-3
>
> Builds, boots, and passes my storm tests here.
>
> How is testing going on the AMD side of this series?
>

Thanks Tony. I'll try to test by the middle of next week. Sorry for the delay,
I just got back from work travel.

I think the code looks good though.

Thanks,
Yazen

2023-03-29 15:27:05

by Yazen Ghannam

Subject: Re: [PATCH v3 4/5] x86/mce: Move storm handling to core.

On Thu, Mar 23, 2023 at 08:26:02PM +0000, Luck, Tony wrote:
> Yazen,
>
> I folded the fixes for the issues you pointed to in patch 3/5 into the series and rebased to v6.3-rc3
>
> Resulting tree pushed here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git try-storm-on6-3
>
> Builds, boots, and passes my storm tests here.
>
> How is testing going on the AMD side of this series?
>

Hi Tony,

Builds, boots, and passes my tests here too.

Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>

Thanks!

-Yazen

2023-04-03 19:23:05

by Luck, Tony

Subject: RE: [PATCH v3 4/5] x86/mce: Move storm handling to core.

> Hi Tony,
>
> Builds, boots, and passes my tests here too.
>
> Reviewed-by: Yazen Ghannam <[email protected]>
> Tested-by: Yazen Ghannam <[email protected]>

Yazen,

Thanks. I'll add your tags and post a v4 to the mailing list for Boris to review.

-Tony

2023-04-03 21:10:19

by Luck, Tony

Subject: [PATCH v4 0/5] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected on a bank, Intel systems mitigate it by
setting a very high threshold for corrected errors to signal CMCI. This
threshold does not affect signaling CMCI for uncorrected errors.

AMD's storm mitigation for threshold interrupts also relies on a per-CPU,
per-bank approach similar to Intel's. But unlike CMCI storm handling it
does not set thresholds to reduce the rate of interrupts during a storm.
Rather it turns off the interrupt on the current CPU and bank if there is
a storm and re-enables the interrupt when the storm subsides.

It is okay to turn off threshold interrupts on AMD systems as other error
severities continue to be handled even if the threshold interrupts are
turned off. Uncorrected errors will generate a #MC and deferred errors
have a unique separate deferred error interrupt. The final patch adds
support for handling threshold interrupt storms on AMD systems.
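
The difference between the two vendors' mitigation can be sketched in
user space as follows. This follows the prose above rather than the exact
code in the patches: enum vendor, struct bank_ctl and handle_storm() are
hypothetical names, and the struct fields only model the hardware
settings involved.

```c
#include <assert.h>
#include <stdbool.h>

#define STORM_THRESHOLD 32749	/* CMCI_STORM_THRESHOLD in the series */

enum vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/* Hypothetical model of the per-bank settings touched by the mitigation. */
struct bank_ctl {
	int  threshold;		/* corrected-error count that raises the interrupt */
	int  saved_threshold;	/* non-storm threshold to restore afterwards */
	bool interrupt_enable;	/* whether the threshold interrupt is enabled */
};

/* "on" is true when a storm begins and false when it subsides. */
static void handle_storm(enum vendor v, struct bank_ctl *b, bool on)
{
	assert(b);

	switch (v) {
	case VENDOR_INTEL:
		/* Keep the interrupt, but make corrected errors rarely trigger it. */
		b->threshold = on ? STORM_THRESHOLD : b->saved_threshold;
		break;
	case VENDOR_AMD:
		/* Turn the threshold interrupt off for the duration of the storm. */
		b->interrupt_enable = !on;
		break;
	default:
		/* Other vendors: no mitigation in this sketch. */
		break;
	}
}
```

In both cases the bank is still polled once per second during the storm,
so corrected errors continue to be logged.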

Changes since last version:

Yazen:
Reported inverted tests in two places that cancelled each other
out so the code worked. But the logic was backwards.
Provided Tested-by and Reviewed-by tags


Smita Koralahalli (3):
x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms
x86/mce: Move storm handling to core.
x86/mce: Handle AMD threshold interrupt storms

Tony Luck (2):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation

arch/x86/kernel/cpu/mce/internal.h | 33 ++++--
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
arch/x86/kernel/cpu/mce/core.c | 139 +++++++++++++++++-----
arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
4 files changed, 230 insertions(+), 170 deletions(-)


base-commit: 7e364e56293bb98cae1b55fd835f5991c4e96e7d
--
2.39.2

2023-04-03 21:10:30

by Luck, Tony

Subject: [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation

Add a hook into machine_check_poll() to keep track of per-CPU, per-bank
corrected error logs.

Maintain a bitmap history for each bank showing whether the bank
logged a corrected error or not each time it is polled.

In normal operation the interval between polls of this bank
determines how far to shift the history. The 64 bit width corresponds
to about one minute.

When a storm is observed the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second.
During a storm each bit in the history indicates the status of the
bank each time it is polled. Thus the history covers just over a minute.

Declare a storm for that bank if the number of corrected interrupts
seen in that history is above some threshold (5 in this RFC code for
ease of testing, likely move to 15 for compatibility with previous
storm detection).

A storm on a bank ends if enough consecutive polls of the bank show
no corrected errors (currently 30, may also change). That resets the
threshold in IA32_MCi_CTL2 back to 1, removes the bank from the bitmap
for polling, and changes the polling rate back to the default.

If a CPU with banks in storm mode is taken offline, the new CPU
that inherits ownership of those banks takes over management of
storm(s) in the inherited bank(s).
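
For illustration only, the history tracking described above can be
modelled in user space. This is a minimal sketch, not the patch's code:
struct bank_track, count_bits() and track_poll() are hypothetical names,
elapsed time is passed in as whole seconds instead of being derived from
jiffies, and only the two thresholds are taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STORM_BEGIN_THRESHOLD		5	/* errors in history that start a storm */
#define STORM_END_POLL_THRESHOLD	30	/* clean polls in a row that end a storm */

struct bank_track {
	uint64_t history;	/* bit 0 is the most recent poll */
	bool	 storm;		/* true while the bank is in storm mode */
};

/* Count set bits; stands in for the kernel's hweight64(). */
static unsigned int count_bits(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Record one poll of a bank. elapsed_secs is the number of whole seconds
 * since the previous poll; it is ignored during a storm, when polls are
 * one second apart. Returns true when the storm state changed.
 */
static bool track_poll(struct bank_track *b, unsigned int elapsed_secs,
		       bool corrected_error)
{
	const uint64_t end_mask = (UINT64_C(1) << STORM_END_POLL_THRESHOLD) - 1;
	unsigned int shift;

	assert(b);

	if (b->storm)
		shift = 1;
	else
		shift = elapsed_secs >= 63 ? 64 : elapsed_secs + 1;

	/* After a long gap the old history is no longer relevant. */
	b->history = shift >= 64 ? 0 : b->history << shift;
	if (corrected_error)
		b->history |= 1;

	if (b->storm) {
		/* Any error in the last 30 polls keeps the storm going. */
		if (b->history & end_mask)
			return false;
		b->storm = false;
		b->history = 0;
		return true;
	}

	if (count_bits(b->history) < STORM_BEGIN_THRESHOLD)
		return false;
	b->storm = true;
	return true;
}
```

With these thresholds, five corrected errors logged within the history
window start a storm, and thirty consecutive clean polls end it.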

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 4 +-
arch/x86/kernel/cpu/mce/core.c | 26 ++++--
arch/x86/kernel/cpu/mce/intel.c | 139 ++++++++++++++++++++++++++++-
3 files changed, 158 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index f9331c6229b4..8d3a740a66ff 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -40,6 +40,8 @@ struct dentry *mce_get_debugfs_dir(void);

extern mce_banks_t mce_banks_ce_disabled;

+void track_cmci_storm(int bank, u64 status);
+
#ifdef CONFIG_X86_MCE_INTEL
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
@@ -54,7 +56,7 @@ static inline void intel_clear_lmce(void) { }
static inline bool intel_filter_mce(struct mce *m) { return false; }
#endif

-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e7936be84204..20347eb65b8b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
barrier();
m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));

+ track_cmci_storm(i, m.status);
+
/* If this entry is not valid, ignore it */
if (!(m.status & MCI_STATUS_VAL))
continue;
@@ -1587,6 +1589,7 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;

static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);
+static DEFINE_PER_CPU(bool, storm_poll_mode);

static void __start_timer(struct timer_list *t, unsigned long interval)
{
@@ -1622,22 +1625,29 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

- __this_cpu_write(mce_next_interval, iv);
- __start_timer(t, iv);
+ if (__this_cpu_read(storm_poll_mode)) {
+ __start_timer(t, HZ);
+ } else {
+ __this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
}

/*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
*/
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- unsigned long iv = __this_cpu_read(mce_next_interval);

- __start_timer(t, interval);
+ __this_cpu_write(storm_poll_mode, storm);

- if (interval < iv)
- __this_cpu_write(mce_next_interval, interval);
+ if (storm)
+ __start_timer(t, HZ);
+ else
+ __this_cpu_write(mce_next_interval, check_interval * HZ);
}

/* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 052bf2708391..4106877de028 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,8 +47,40 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
+ */
+static DEFINE_PER_CPU(int, stormy_bank_count);
+static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+static int cmci_threshold[MAX_NR_BANKS];
+
+/* Linux non-storm CMCI threshold (may be overridden by BIOS */
#define CMCI_THRESHOLD 1

+/*
+ * High threshold to limit CMCI rate during storms. Max supported is
+ * 0x7FFF. Use this slightly smaller value so it has a distinctive
+ * signature when some asks "Why am I not seeing all corrected errors?"
+ */
+#define CMCI_STORM_THRESHOLD 32749
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30
+
static int cmci_supported(int *banks)
{
u64 cap;
@@ -103,6 +135,93 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

+/*
+ * Set a new CMCI threshold value. Preserve the state of the
+ * MCI_CTL2_CMCI_EN bit in case this happens during a
+ * cmci_rediscover() operation.
+ */
+static void cmci_set_threshold(int bank, int thresh)
+{
+ unsigned long flags;
+ u64 val;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val | thresh);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+static void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+static void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -147,6 +266,9 @@ static void cmci_discover(int banks)
continue;
}

+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ goto storm;
+
if (!mca_cfg.bios_cmci_threshold) {
val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
val |= CMCI_THRESHOLD;
@@ -159,7 +281,7 @@ static void cmci_discover(int banks)
bios_zero_thresh = 1;
val |= CMCI_THRESHOLD;
}
-
+storm:
val |= MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(i), val);
rdmsrl(MSR_IA32_MCx_CTL2(i), val);
@@ -167,7 +289,14 @@ static void cmci_discover(int banks)
/* Did the enable bit stick? -- the bank supports CMCI */
if (val & MCI_CTL2_CMCI_EN) {
set_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
+ pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), i);
+ this_cpu_write(bank_history[i], ~0ull);
+ this_cpu_write(bank_time_stamp[i], jiffies);
+ cmci_storm_begin(i);
+ } else {
+ __clear_bit(i, this_cpu_ptr(mce_poll_banks));
+ }
/*
* We are able to set thresholds for some banks that
* had a threshold of 0. This means the BIOS has not
@@ -177,6 +306,10 @@ static void cmci_discover(int banks)
if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
(val & MCI_CTL2_CMCI_THRESHOLD_MASK))
bios_wrong_thresh = 1;
+
+ /* Save default threshold for each bank */
+ if (cmci_threshold[i] == 0)
+ cmci_threshold[i] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
} else {
WARN_ON(!test_bit(i, this_cpu_ptr(mce_poll_banks)));
}
@@ -218,6 +351,8 @@ static void __cmci_disable_bank(int bank)
val &= ~MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
__clear_bit(bank, this_cpu_ptr(mce_banks_owned));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ cmci_storm_end(bank);
}

/*
--
2.39.2

2023-04-03 21:10:37

by Luck, Tony

Subject: [PATCH v4 4/5] x86/mce: Move storm handling to core.

From: Smita Koralahalli <[email protected]>

AMD's storm handling for threshold interrupts is similar to Intel's CMCI
storm handling. Hence, make the storm handling code common by moving it
to core and removing the vendor exclusivity.

However, the code that sets different thresholds in the IA32_MCi_CTL2
register to reduce the interrupt rate stays Intel-specific, because AMD
handles storms slightly differently: it turns the interrupts off.

No functional changes.

[Tony: Same as Smita's original, plus changes rolled in from prior patches]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 18 ++++++
arch/x86/kernel/cpu/mce/core.c | 81 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/intel.c | 93 +-----------------------------
3 files changed, 100 insertions(+), 92 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 68288099b125..d052d80cce7a 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -60,6 +60,24 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }

void mce_timer_kick(bool storm);
void mce_handle_storm(int bank, bool on);
+void cmci_storm_begin(int bank);
+void cmci_storm_end(int bank);
+
+DECLARE_PER_CPU(int, stormy_bank_count);
+DECLARE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DECLARE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DECLARE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 099d8444aca4..820b317b1b85 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -607,6 +607,87 @@ static struct notifier_block mce_default_nb = {
.priority = MCE_PRIO_LOWEST,
};

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ */
+DEFINE_PER_CPU(int, stormy_bank_count);
+DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, false);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, true);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* Read ADDR and MISC registers.
*/
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index a8248514a689..20c2143a68c1 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,17 +47,7 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

-/*
- * CMCI storm tracking state
- * stormy_bank_count: per-cpu count of MC banks in storm state
- * bank_history: bitmask tracking of corrected errors seen in each bank
- * bank_time_stamp: last time (in jiffies) that each bank was polled
- * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
- */
-static DEFINE_PER_CPU(int, stormy_bank_count);
-static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
-static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
-static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+/* MCi_CTL2 threshold for each bank when there is no storm */
static int cmci_threshold[MAX_NR_BANKS];

/* Linux non-storm CMCI threshold (may be overridden by BIOS */
@@ -70,17 +60,6 @@ static int cmci_threshold[MAX_NR_BANKS];
*/
#define CMCI_STORM_THRESHOLD 32749

-/*
- * How many errors within the history buffer mark the start of a storm
- */
-#define STORM_BEGIN_THRESHOLD 5
-
-/*
- * How many polls of machine check bank without an error before declaring
- * the storm is over
- */
-#define STORM_END_POLL_THRESHOLD 30
-
static int cmci_supported(int *banks)
{
u64 cap;
@@ -160,76 +139,6 @@ void mce_intel_handle_storm(int bank, bool on)
cmci_set_threshold(bank, cmci_threshold[bank]);
}

-static void cmci_storm_begin(int bank)
-{
- __set_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_storm[bank], true);
-
- /*
- * If this is the first bank on this CPU to enter storm mode
- * start polling
- */
- if (this_cpu_inc_return(stormy_bank_count) == 1)
- mce_timer_kick(true);
-}
-
-static void cmci_storm_end(int bank)
-{
- __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_history[bank], 0ull);
- this_cpu_write(bank_storm[bank], false);
-
- /* If no banks left in storm mode, stop polling */
- if (!this_cpu_dec_return(stormy_bank_count))
- mce_timer_kick(false);
-}
-
-void track_cmci_storm(int bank, u64 status)
-{
- unsigned long now = jiffies, delta;
- unsigned int shift = 1;
- u64 history;
-
- /*
- * When a bank is in storm mode it is polled once per second and
- * the history mask will record about the last minute of poll results.
- * If it is not in storm mode, then the bank is only checked when
- * there is a CMCI interrupt. Check how long it has been since
- * this bank was last checked, and adjust the amount of "shift"
- * to apply to history.
- */
- if (!this_cpu_read(bank_storm[bank])) {
- delta = now - this_cpu_read(bank_time_stamp[bank]);
- shift = (delta + HZ) / HZ;
- }
-
- /* If has been a long time since the last poll, clear history */
- if (shift >= 64)
- history = 0;
- else
- history = this_cpu_read(bank_history[bank]) << shift;
- this_cpu_write(bank_time_stamp[bank], now);
-
- /* History keeps track of corrected errors. VAL=1 && UC=0 */
- if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
- history |= 1;
- this_cpu_write(bank_history[bank], history);
-
- if (this_cpu_read(bank_storm[bank])) {
- if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
- return;
- pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- mce_handle_storm(bank, false);
- cmci_storm_end(bank);
- } else {
- if (hweight64(history) < STORM_BEGIN_THRESHOLD)
- return;
- pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- mce_handle_storm(bank, true);
- cmci_storm_begin(bank);
- }
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
--
2.39.2

2023-04-03 21:10:44

by Luck, Tony

Subject: [PATCH v4 5/5] x86/mce: Handle AMD threshold interrupt storms

From: Smita Koralahalli <[email protected]>

Extend the logic of handling CMCI storms to AMD threshold interrupts.

Use an approach similar to Intel's CMCI handling to mitigate storms per
CPU and per bank. But, unlike CMCI, do not raise thresholds to reduce
the interrupt rate during a storm. Rather, disable the interrupt on the
corresponding CPU and bank. Re-enable the interrupt once enough
consecutive polls of the bank show no corrected errors (30, the same
value used for Intel).

Turning off the threshold interrupts is the better solution on AMD
systems, as other error severities are still handled while the
threshold interrupts are disabled.
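
As a rough model of the behaviour this description states, assume a
hypothetical struct amd_bank with a single interrupt-enable flag per
bank. The names below are illustrative only, and the sketch follows the
text above rather than the exact code in the diff, which walks the
bank's threshold blocks and rewrites their MSRs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a bank's threshold blocks. */
struct amd_bank {
	bool interrupt_enable;  /* is the threshold interrupt armed? */
};

/*
 * Model of the AMD storm action as described in the commit message:
 * when a storm begins (storm == true) the threshold interrupt is
 * turned off; when it subsides (storm == false) it is turned back on.
 */
static void amd_handle_storm(struct amd_bank *bank, bool storm)
{
	bank->interrupt_enable = !storm;
}
```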

[Tony: Small tweak because mce_handle_storm() isn't a pointer now]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 2 ++
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/core.c | 3 ++
3 files changed, 54 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index d052d80cce7a..f1a48bc2e904 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -224,6 +224,7 @@ extern bool filter_mce(struct mce *m);

#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
+void mce_amd_handle_storm(int bank, bool on);

/*
* If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
@@ -251,6 +252,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)

#else
static inline bool amd_filter_mce(struct mce *m) { return false; }
+static inline void mce_amd_handle_storm(int bank, bool on) {}
static inline void smca_extract_err_addr(struct mce *m) { }
#endif

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 23c5072fbbb7..cd79295e2a0a 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -468,6 +468,47 @@ static void threshold_restart_bank(void *_tr)
wrmsr(tr->b->address, lo, hi);
}

+static void _reset_block(struct threshold_block *block)
+{
+ struct thresh_restart tr;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.b = block;
+ threshold_restart_bank(&tr);
+}
+
+static void toggle_interrupt_reset_block(struct threshold_block *block, bool on)
+{
+ if (!block)
+ return;
+
+ block->interrupt_enable = !!on;
+ _reset_block(block);
+}
+
+void mce_amd_handle_storm(int bank, bool on)
+{
+ struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
+ struct threshold_bank **bp = this_cpu_read(threshold_banks);
+ unsigned long flags;
+
+ if (!bp)
+ return;
+
+ local_irq_save(flags);
+
+ first_block = bp[bank]->blocks;
+ if (!first_block)
+ goto end;
+
+ toggle_interrupt_reset_block(first_block, on);
+
+ list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
+ toggle_interrupt_reset_block(block, on);
+end:
+ local_irq_restore(flags);
+}
+
static void mce_threshold_block_init(struct threshold_block *b, int offset)
{
struct thresh_restart tr = {
@@ -868,6 +909,7 @@ static void amd_threshold_interrupt(void)
struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
struct threshold_bank **bp = this_cpu_read(threshold_banks);
unsigned int bank, cpu = smp_processor_id();
+ u64 status;

/*
* Validate that the threshold bank has been initialized already. The
@@ -881,6 +923,13 @@ static void amd_threshold_interrupt(void)
if (!(per_cpu(bank_map, cpu) & (1 << bank)))
continue;

+ rdmsrl(mca_msr_reg(bank, MCA_STATUS), status);
+ track_cmci_storm(bank, status);
+
+ /* Return early on an interrupt storm */
+ if (this_cpu_read(bank_storm[bank]))
+ return;
+
first_block = bp[bank]->blocks;
if (!first_block)
continue;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 820b317b1b85..fac90625d8cb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2062,6 +2062,9 @@ void mce_handle_storm(int bank, bool on)
case X86_VENDOR_INTEL:
mce_intel_handle_storm(bank, on);
break;
+ case X86_VENDOR_AMD:
+ mce_amd_handle_storm(bank, on);
+ break;
}
}

--
2.39.2

2023-04-03 21:10:53

by Luck, Tony

Subject: [PATCH v4 1/5] x86/mce: Remove old CMCI storm mitigation code

When a "storm" of CMCI is detected this code mitigates by
disabling CMCI interrupt signalling from all of the banks
owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the
banks was generating the interrupts, but CMCI is disabled for all.
This means Linux may delay seeing and processing errors logged
from other banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it
is also used to signal when an uncorrected error is logged. This
is a problem because these errors should be handled in a timely
manner.

Delete all this code in preparation for a finer grained solution.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 6 --
arch/x86/kernel/cpu/mce/core.c | 20 +---
arch/x86/kernel/cpu/mce/intel.c | 145 -----------------------------
3 files changed, 1 insertion(+), 170 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 91a415553c27..f9331c6229b4 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,18 +41,12 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long cmci_intel_adjust_timer(unsigned long interval);
-bool mce_intel_cmci_poll(void);
-void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
-# define cmci_intel_adjust_timer mce_adjust_timer_default
-static inline bool mce_intel_cmci_poll(void) { return false; }
-static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..e7936be84204 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1588,13 +1588,6 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);

-static unsigned long mce_adjust_timer_default(unsigned long interval)
-{
- return interval;
-}
-
-static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
-
static void __start_timer(struct timer_list *t, unsigned long interval)
{
unsigned long when = jiffies + interval;
@@ -1617,15 +1610,9 @@ static void mce_timer_fn(struct timer_list *t)

iv = __this_cpu_read(mce_next_interval);

- if (mce_available(this_cpu_ptr(&cpu_info))) {
+ if (mce_available(this_cpu_ptr(&cpu_info)))
machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

- if (mce_intel_cmci_poll()) {
- iv = mce_adjust_timer(iv);
- goto done;
- }
- }
-
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
@@ -1635,7 +1622,6 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

-done:
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -1972,7 +1958,6 @@ static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)

intel_init_cmci();
intel_init_lmce();
- mce_adjust_timer = cmci_intel_adjust_timer;
}

static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
@@ -1985,7 +1970,6 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = cmci_intel_adjust_timer;
break;

case X86_VENDOR_AMD: {
@@ -2642,8 +2626,6 @@ static void mce_reenable_cpu(void)

static int mce_cpu_dead(unsigned int cpu)
{
- mce_intel_hcpu_update(cpu);
-
/* intentionally ignoring frozen here */
if (!cpuhp_tasks_frozen)
cmci_rediscover();
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..052bf2708391 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -41,15 +41,6 @@
*/
static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);

-/*
- * CMCI storm detection backoff counter
- *
- * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
- * encountered an error. If not, we decrement it by one. We signal the end of
- * the CMCI storm when it reaches 0.
- */
-static DEFINE_PER_CPU(int, cmci_backoff_cnt);
-
/*
* cmci_discover_lock protects against parallel discovery attempts
* which could race against each other.
@@ -57,21 +48,6 @@ static DEFINE_PER_CPU(int, cmci_backoff_cnt);
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

#define CMCI_THRESHOLD 1
-#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (HZ)
-#define CMCI_STORM_THRESHOLD 15
-
-static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
-
-enum {
- CMCI_STORM_NONE,
- CMCI_STORM_ACTIVE,
- CMCI_STORM_SUBSIDED,
-};
-
-static atomic_t cmci_storm_on_cpus;

static int cmci_supported(int *banks)
{
@@ -127,124 +103,6 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

-bool mce_intel_cmci_poll(void)
-{
- if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
- return false;
-
- /*
- * Reset the counter if we've logged an error in the last poll
- * during the storm.
- */
- if (machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)))
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
- else
- this_cpu_dec(cmci_backoff_cnt);
-
- return true;
-}
-
-void mce_intel_hcpu_update(unsigned long cpu)
-{
- if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
- atomic_dec(&cmci_storm_on_cpus);
-
- per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
-}
-
-static void cmci_toggle_interrupt_mode(bool on)
-{
- unsigned long flags, *owned;
- int bank;
- u64 val;
-
- raw_spin_lock_irqsave(&cmci_discover_lock, flags);
- owned = this_cpu_ptr(mce_banks_owned);
- for_each_set_bit(bank, owned, MAX_NR_BANKS) {
- rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-
- if (on)
- val |= MCI_CTL2_CMCI_EN;
- else
- val &= ~MCI_CTL2_CMCI_EN;
-
- wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
- }
- raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-unsigned long cmci_intel_adjust_timer(unsigned long interval)
-{
- if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
- (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
- mce_notify_irq();
- return CMCI_STORM_INTERVAL;
- }
-
- switch (__this_cpu_read(cmci_storm_state)) {
- case CMCI_STORM_ACTIVE:
-
- /*
- * We switch back to interrupt mode once the poll timer has
- * silenced itself. That means no events recorded and the timer
- * interval is back to our poll interval.
- */
- __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- if (!atomic_sub_return(1, &cmci_storm_on_cpus))
- pr_notice("CMCI storm subsided: switching to interrupt mode\n");
-
- fallthrough;
-
- case CMCI_STORM_SUBSIDED:
- /*
- * We wait for all CPUs to go back to SUBSIDED state. When that
- * happens we switch back to interrupt mode.
- */
- if (!atomic_read(&cmci_storm_on_cpus)) {
- __this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
- cmci_toggle_interrupt_mode(true);
- cmci_recheck();
- }
- return CMCI_POLL_INTERVAL;
- default:
-
- /* We have shiny weather. Let the poll do whatever it thinks. */
- return interval;
- }
-}
-
-static bool cmci_storm_detect(void)
-{
- unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
- unsigned long ts = __this_cpu_read(cmci_time_stamp);
- unsigned long now = jiffies;
- int r;
-
- if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
- return true;
-
- if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
- cnt++;
- } else {
- cnt = 1;
- __this_cpu_write(cmci_time_stamp, now);
- }
- __this_cpu_write(cmci_storm_cnt, cnt);
-
- if (cnt <= CMCI_STORM_THRESHOLD)
- return false;
-
- cmci_toggle_interrupt_mode(false);
- __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
- r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_STORM_INTERVAL);
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
-
- if (r == 1)
- pr_notice("CMCI storm detected: switching to poll mode\n");
- return true;
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -253,9 +111,6 @@ static bool cmci_storm_detect(void)
*/
static void intel_threshold_interrupt(void)
{
- if (cmci_storm_detect())
- return;
-
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

--
2.39.2

2023-04-03 21:11:22

by Luck, Tony

Subject: [PATCH v4 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms

From: Smita Koralahalli <[email protected]>

Intel and AMD need to take different actions when a storm begins or
ends. Prepare for the storm code moving from intel.c into core.c by
adding a function that checks CPU vendor to pick the right action.
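
The shape of such a dispatcher can be sketched outside the kernel as
follows. The enum, the call counter and the function names are
hypothetical stand-ins chosen so the example compiles on its own; the
real function in the diff switches on boot_cpu_data.x86_vendor.

```c
#include <stdbool.h>

/* Hypothetical vendor IDs standing in for X86_VENDOR_*. */
enum vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/* Counts calls so the effect of the dispatch is observable. */
static int intel_calls;

/* Stand-in for the Intel-specific storm action. */
static void intel_handle_storm(int bank, bool on)
{
	(void)bank;
	(void)on;
	intel_calls++;
}

/* Pick the vendor-specific action for the start or end of a storm. */
static void handle_storm(enum vendor v, int bank, bool on)
{
	switch (v) {
	case VENDOR_INTEL:
		intel_handle_storm(bank, on);
		break;
	default:
		/* No action for other vendors at this point in the series. */
		break;
	}
}
```

A plain switch keeps the call sites unchanged when another vendor case
is added later in the series.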

No functional changes.

[Tony: Changed from function pointer to regular function]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 3 +++
arch/x86/kernel/cpu/mce/core.c | 9 +++++++++
arch/x86/kernel/cpu/mce/intel.c | 12 ++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 8d3a740a66ff..68288099b125 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -43,12 +43,14 @@ extern mce_banks_t mce_banks_ce_disabled;
void track_cmci_storm(int bank, u64 status);

#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_handle_storm(int bank, bool on);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
+static inline void mce_intel_handle_storm(int bank, bool on) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
@@ -57,6 +59,7 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }
#endif

void mce_timer_kick(bool storm);
+void mce_handle_storm(int bank, bool on);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 20347eb65b8b..099d8444aca4 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1975,6 +1975,15 @@ static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
intel_clear_lmce();
}

+void mce_handle_storm(int bank, bool on)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_handle_storm(bank, on);
+ break;
+ }
+}
+
static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
{
switch (c->x86_vendor) {
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 4106877de028..a8248514a689 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -152,6 +152,14 @@ static void cmci_set_threshold(int bank, int thresh)
raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
}

+void mce_intel_handle_storm(int bank, bool on)
+{
+ if (on)
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ else
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+}
+
static void cmci_storm_begin(int bank)
{
__set_bit(bank, this_cpu_ptr(mce_poll_banks));
@@ -211,13 +219,13 @@ void track_cmci_storm(int bank, u64 status)
if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
return;
pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- cmci_set_threshold(bank, cmci_threshold[bank]);
+ mce_handle_storm(bank, false);
cmci_storm_end(bank);
} else {
if (hweight64(history) < STORM_BEGIN_THRESHOLD)
return;
pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ mce_handle_storm(bank, true);
cmci_storm_begin(bank);
}
}
--
2.39.2

2023-04-11 12:38:33

by Borislav Petkov

Subject: Re: [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation

On Mon, Apr 03, 2023 at 02:07:13PM -0700, Tony Luck wrote:
> Add a hook into machine_check_poll() to keep track of per-CPU, per-bank
> corrected error logs.
>
> Maintain a bitmap history for each bank showing whether the bank
> logged a corrected error or not each time it is polled.
>
> In normal operation the interval between polls of this bank
> determines how far to shift the history. The 64 bit width corresponds
> to about one second.
>
> When a storm is observed the rate of interrupts is reduced by setting
> a large threshold value for this bank in IA32_MCi_CTL2. This bank is
> added to the bitmap of banks for this CPU to poll. The polling rate
> is increased to once per second.
> During a storm each bit in the history indicates the status of the
> bank each time it is polled. Thus the history covers just over a minute.
>
> Declare a storm for that bank if the number of corrected interrupts
> seen in that history is above some threshold (5 in this RFC code for
> ease of testing, likely move to 15 for compatibility with previous
> storm detection).
>
> A storm on a bank ends if enough consecutive polls of the bank show
> no corrected errors (currently 30, may also change). That resets the
> threshold in IA32_MCi_CTL2 back to 1, removes the bank from the bitmap
> for polling, and changes the polling rate back to the default.
>
> If a CPU with banks in storm mode is taken offline, the new CPU
> that inherits ownership of those banks takes over management of
> storm(s) in the inherited bank(s).
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Yazen Ghannam <[email protected]>
> Tested-by: Yazen Ghannam <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/internal.h | 4 +-
> arch/x86/kernel/cpu/mce/core.c | 26 ++++--
> arch/x86/kernel/cpu/mce/intel.c | 139 ++++++++++++++++++++++++++++-
> 3 files changed, 158 insertions(+), 11 deletions(-)

ld: vmlinux.o: in function `machine_check_poll':
/home/boris/kernel/2nd/linux/arch/x86/kernel/cpu/mce/core.c:683: undefined reference to `track_cmci_storm'
make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 1
make: *** [Makefile:1249: vmlinux] Error 2
make: *** Waiting for unfinished jobs....

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-11 14:12:09

by Yazen Ghannam

Subject: Re: [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation

On 4/11/23 08:32, Borislav Petkov wrote:
> On Mon, Apr 03, 2023 at 02:07:13PM -0700, Tony Luck wrote:
>> Add a hook into machine_check_poll() to keep track of per-CPU, per-bank
>> corrected error logs.
>>
>> Maintain a bitmap history for each bank showing whether the bank
>> logged a corrected error or not each time it is polled.
>>
>> In normal operation the interval between polls of this bank
>> determines how far to shift the history. The 64 bit width corresponds
>> to about one second.
>>
>> When a storm is observed the rate of interrupts is reduced by setting
>> a large threshold value for this bank in IA32_MCi_CTL2. This bank is
>> added to the bitmap of banks for this CPU to poll. The polling rate
>> is increased to once per second.
>> During a storm each bit in the history indicates the status of the
>> bank each time it is polled. Thus the history covers just over a minute.
>>
>> Declare a storm for that bank if the number of corrected interrupts
>> seen in that history is above some threshold (5 in this RFC code for
>> ease of testing, likely move to 15 for compatibility with previous
>> storm detection).
>>
>> A storm on a bank ends if enough consecutive polls of the bank show
>> no corrected errors (currently 30, may also change). That resets the
>> threshold in IA32_MCi_CTL2 back to 1, removes the bank from the bitmap
>> for polling, and changes the polling rate back to the default.
>>
>> If a CPU with banks in storm mode is taken offline, the new CPU
>> that inherits ownership of those banks takes over management of
>> storm(s) in the inherited bank(s).
>>
>> Signed-off-by: Tony Luck <[email protected]>
>> Reviewed-by: Yazen Ghannam <[email protected]>
>> Tested-by: Yazen Ghannam <[email protected]>
>> ---
>> arch/x86/kernel/cpu/mce/internal.h | 4 +-
>> arch/x86/kernel/cpu/mce/core.c | 26 ++++--
>> arch/x86/kernel/cpu/mce/intel.c | 139 ++++++++++++++++++++++++++++-
>> 3 files changed, 158 insertions(+), 11 deletions(-)
>
> ld: vmlinux.o: in function `machine_check_poll':
> /home/boris/kernel/2nd/linux/arch/x86/kernel/cpu/mce/core.c:683: undefined reference to `track_cmci_storm'
> make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 1
> make: *** [Makefile:1249: vmlinux] Error 2
> make: *** Waiting for unfinished jobs....
>

Ah, this is with CONFIG_X86_MCE_INTEL=n and everything else =y. Is there an
automated way to test every config combination in a subsystem, not just random ones?

I'll try to add something like this to my flow. It seems allnoconfig,
defconfig, etc. aren't enough. And it's too easy to overlook during code review.

Thanks,
Yazen

2023-04-11 16:09:25

by Luck, Tony

Subject: RE: [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation

> > ld: vmlinux.o: in function `machine_check_poll':
> > /home/boris/kernel/2nd/linux/arch/x86/kernel/cpu/mce/core.c:683: undefined reference to `track_cmci_storm'
> > make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 1
> > make: *** [Makefile:1249: vmlinux] Error 2
> > make: *** Waiting for unfinished jobs....
> >
>
> Ah, this is with CONFIG_X86_MCE_INTEL=n and everything else =y. Is there an
> automated way to test every config combination in a subsystem, not just random ones?
>
> I'll try to add something like this to my flow. It seems allnoconfig,
> defconfig, etc. aren't enough. And it's too easy to overlook during code review.

I'm a bit surprised that lkp didn't complain. It used to do a zillion builds with combinations
of CONFIG options that were relevant to the patch series.

I'll spin a v5 with an inline stub function to fix this.
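
For anyone following along, the stub pattern looks roughly like the sketch below. It is a standalone userspace model, not the v5 code: `FEATURE_INTEL`, `track_storm()` and `poll_banks()` are hypothetical names standing in for the Kconfig symbol, the storm hook and the common poll loop.

```c
#include <stdint.h>

/*
 * FEATURE_INTEL is a hypothetical stand-in for the Kconfig symbol that
 * guards the vendor file. Leave it undefined to model the failing config.
 */
#ifdef FEATURE_INTEL
void track_storm(int bank, uint64_t status);	/* defined in the vendor file */
#else
/* Empty stub: callers still compile and link when the vendor file is absent. */
static inline void track_storm(int bank, uint64_t status)
{
	(void)bank;
	(void)status;
}
#endif

/* Models the common poll loop, which calls the hook unconditionally. */
static int poll_banks(int nr_banks)
{
	int polled = 0;

	for (int bank = 0; bank < nr_banks; bank++) {
		track_storm(bank, 0);
		polled++;
	}
	return polled;
}
```

Without the `#else` branch, the call in `poll_banks()` would reference a symbol that no object file defines, which is the kind of link error reported above.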

Boris: Have you seen anything else that needs fixing?

-Tony

2023-04-11 17:22:30

by Borislav Petkov

Subject: Re: [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation

On Tue, Apr 11, 2023 at 04:06:17PM +0000, Luck, Tony wrote:
> Boris: Have you seen anything else that needs fixing?

Not yet. I stopped looking at the build failure.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-11 17:40:33

by Luck, Tony

Subject: [PATCH v5 0/5] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank on Intel, the storm is
mitigated by setting a very high threshold for corrected errors to
signal CMCI. This threshold does not affect signaling CMCI for
uncorrected errors.
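
The per-bank bookkeeping can be modeled in userspace roughly as follows. This is a simplified sketch, not the code in the patches: `struct bank_state`, `popcount64()`, `history_shift()` and `track_storm()` are hypothetical names, and the model leaves out the per-CPU storage, timers and MSR writes. The two thresholds (5 and 30) and the `(delta + HZ) / HZ` shift are taken from patch 4/5.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_BEGIN_THRESHOLD		5	/* errors in history that start a storm */
#define STORM_END_POLL_THRESHOLD	30	/* clean polls that end a storm */

/* Hypothetical per-bank state: one history bit per poll, newest in bit 0. */
struct bank_state {
	uint64_t history;
	bool storm;
};

/* Portable population count (number of set bits). */
static int popcount64(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Outside a storm, shift one bit per elapsed second, rounding up. */
static unsigned int history_shift(unsigned long delta_ticks, unsigned long hz)
{
	return (unsigned int)((delta_ticks + hz) / hz);
}

/* Record one poll result; returns true when the storm state changes. */
static bool track_storm(struct bank_state *b, unsigned int shift, bool corrected)
{
	uint64_t end_mask = (UINT64_C(1) << STORM_END_POLL_THRESHOLD) - 1;

	/* A shift of 64 or more would be undefined, so just clear history. */
	b->history = shift >= 64 ? 0 : b->history << shift;
	if (corrected)
		b->history |= 1;

	if (b->storm) {
		/* Storm ends only when the newest 30 polls are all clean. */
		if (b->history & end_mask)
			return false;
		b->history = 0;
		b->storm = false;
		return true;
	}

	if (popcount64(b->history) < STORM_BEGIN_THRESHOLD)
		return false;
	b->storm = true;
	return true;
}
```

In this model a caller would pass `history_shift(now - last, HZ)` while the bank is quiet and a fixed shift of 1 while it is being polled once per second during a storm.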

AMD's storm mitigation for threshold interrupts also relies on a per-CPU,
per-bank approach similar to Intel's. But unlike CMCI storm handling it does
not set thresholds to reduce the rate of interrupts during a storm. Rather it
turns off the interrupt on the current CPU and bank if there is a storm
and re-enables the interrupt when the storm subsides.

It is okay to turn off threshold interrupts on AMD systems as other error
severities continue to be handled even if the threshold interrupts are
turned off. Uncorrected errors will generate a #MC and deferred errors
have their own separate deferred error interrupt. The final patch adds
support for handling threshold interrupt storms on AMD systems.

Changes since last version:
Boris: Build failure on part 2 with CONFIG_X86_MCE_INTEL=n
Fixed by adding necessary stub function for track_cmci_storm()

Smita Koralahalli (3):
x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms
x86/mce: Move storm handling to core.
x86/mce: Handle AMD threshold interrupt storms

Tony Luck (2):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation

arch/x86/kernel/cpu/mce/internal.h | 33 ++++--
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
arch/x86/kernel/cpu/mce/core.c | 139 +++++++++++++++++-----
arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
4 files changed, 230 insertions(+), 170 deletions(-)


base-commit: 09a9639e56c01c7a00d6c0ca63f4c7c41abe075d
--
2.39.2

2023-04-11 17:41:00

by Luck, Tony

Subject: [PATCH v5 4/5] x86/mce: Move storm handling to core.

From: Smita Koralahalli <[email protected]>

AMD's storm handling for threshold interrupts is similar to Intel's CMCI
storm handling. Hence, make the storm handling code common by moving to
core and removing the vendor exclusivity.

In contrast, setting different thresholds in the IA32_MCi_CTL2 register to
reduce the rate of interrupts is kept Intel-specific, as the storm handling
for AMD differs slightly: it handles storms by turning off the
interrupts.

No functional changes.

[Tony: Same as Smita's original, plus changes rolled in from prior patches]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 20 ++++++-
arch/x86/kernel/cpu/mce/core.c | 81 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/intel.c | 93 +-----------------------------
3 files changed, 100 insertions(+), 94 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index e0d76378c116..9a2d6e289b8d 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -47,7 +47,6 @@ void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
-void track_cmci_storm(int bank, u64 status);
#else
static inline void mce_intel_handle_storm(int bank, bool on) { }
static inline void cmci_disable_bank(int bank) { }
@@ -55,11 +54,28 @@ static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
static inline void intel_clear_lmce(void) { }
static inline bool intel_filter_mce(struct mce *m) { return false; }
-static inline void track_cmci_storm(int bank, u64 status) { }
#endif

void mce_timer_kick(bool storm);
void mce_handle_storm(int bank, bool on);
+void cmci_storm_begin(int bank);
+void cmci_storm_end(int bank);
+
+DECLARE_PER_CPU(int, stormy_bank_count);
+DECLARE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DECLARE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DECLARE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 099d8444aca4..820b317b1b85 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -607,6 +607,87 @@ static struct notifier_block mce_default_nb = {
.priority = MCE_PRIO_LOWEST,
};

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ */
+DEFINE_PER_CPU(int, stormy_bank_count);
+DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+
+void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If it has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, false);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ mce_handle_storm(bank, true);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* Read ADDR and MISC registers.
*/
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index a8248514a689..20c2143a68c1 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,17 +47,7 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

-/*
- * CMCI storm tracking state
- * stormy_bank_count: per-cpu count of MC banks in storm state
- * bank_history: bitmask tracking of corrected errors seen in each bank
- * bank_time_stamp: last time (in jiffies) that each bank was polled
- * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
- */
-static DEFINE_PER_CPU(int, stormy_bank_count);
-static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
-static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
-static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+/* MCi_CTL2 threshold for each bank when there is no storm */
static int cmci_threshold[MAX_NR_BANKS];

/* Linux non-storm CMCI threshold (may be overridden by BIOS */
@@ -70,17 +60,6 @@ static int cmci_threshold[MAX_NR_BANKS];
*/
#define CMCI_STORM_THRESHOLD 32749

-/*
- * How many errors within the history buffer mark the start of a storm
- */
-#define STORM_BEGIN_THRESHOLD 5
-
-/*
- * How many polls of machine check bank without an error before declaring
- * the storm is over
- */
-#define STORM_END_POLL_THRESHOLD 30
-
static int cmci_supported(int *banks)
{
u64 cap;
@@ -160,76 +139,6 @@ void mce_intel_handle_storm(int bank, bool on)
cmci_set_threshold(bank, cmci_threshold[bank]);
}

-static void cmci_storm_begin(int bank)
-{
- __set_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_storm[bank], true);
-
- /*
- * If this is the first bank on this CPU to enter storm mode
- * start polling
- */
- if (this_cpu_inc_return(stormy_bank_count) == 1)
- mce_timer_kick(true);
-}
-
-static void cmci_storm_end(int bank)
-{
- __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
- this_cpu_write(bank_history[bank], 0ull);
- this_cpu_write(bank_storm[bank], false);
-
- /* If no banks left in storm mode, stop polling */
- if (!this_cpu_dec_return(stormy_bank_count))
- mce_timer_kick(false);
-}
-
-void track_cmci_storm(int bank, u64 status)
-{
- unsigned long now = jiffies, delta;
- unsigned int shift = 1;
- u64 history;
-
- /*
- * When a bank is in storm mode it is polled once per second and
- * the history mask will record about the last minute of poll results.
- * If it is not in storm mode, then the bank is only checked when
- * there is a CMCI interrupt. Check how long it has been since
- * this bank was last checked, and adjust the amount of "shift"
- * to apply to history.
- */
- if (!this_cpu_read(bank_storm[bank])) {
- delta = now - this_cpu_read(bank_time_stamp[bank]);
- shift = (delta + HZ) / HZ;
- }
-
- /* If has been a long time since the last poll, clear history */
- if (shift >= 64)
- history = 0;
- else
- history = this_cpu_read(bank_history[bank]) << shift;
- this_cpu_write(bank_time_stamp[bank], now);
-
- /* History keeps track of corrected errors. VAL=1 && UC=0 */
- if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
- history |= 1;
- this_cpu_write(bank_history[bank], history);
-
- if (this_cpu_read(bank_storm[bank])) {
- if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
- return;
- pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- mce_handle_storm(bank, false);
- cmci_storm_end(bank);
- } else {
- if (hweight64(history) < STORM_BEGIN_THRESHOLD)
- return;
- pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- mce_handle_storm(bank, true);
- cmci_storm_begin(bank);
- }
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
--
2.39.2

2023-04-11 17:41:03

by Luck, Tony

Subject: [PATCH v5 1/5] x86/mce: Remove old CMCI storm mitigation code

When a "storm" of CMCI is detected this code mitigates by
disabling CMCI interrupt signalling from all of the banks
owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the
banks was generating the interrupts, but CMCI is disabled for all.
This means Linux may delay seeing and processing errors logged
from other banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it
is also used to signal when an uncorrected error is logged. This
is a problem because these errors should be handled in a timely
manner.

Delete all this code in preparation for a finer grained solution.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 6 --
arch/x86/kernel/cpu/mce/core.c | 20 +---
arch/x86/kernel/cpu/mce/intel.c | 145 -----------------------------
3 files changed, 1 insertion(+), 170 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 91a415553c27..f9331c6229b4 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,18 +41,12 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long cmci_intel_adjust_timer(unsigned long interval);
-bool mce_intel_cmci_poll(void);
-void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
-# define cmci_intel_adjust_timer mce_adjust_timer_default
-static inline bool mce_intel_cmci_poll(void) { return false; }
-static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..e7936be84204 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1588,13 +1588,6 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);

-static unsigned long mce_adjust_timer_default(unsigned long interval)
-{
- return interval;
-}
-
-static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
-
static void __start_timer(struct timer_list *t, unsigned long interval)
{
unsigned long when = jiffies + interval;
@@ -1617,15 +1610,9 @@ static void mce_timer_fn(struct timer_list *t)

iv = __this_cpu_read(mce_next_interval);

- if (mce_available(this_cpu_ptr(&cpu_info))) {
+ if (mce_available(this_cpu_ptr(&cpu_info)))
machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

- if (mce_intel_cmci_poll()) {
- iv = mce_adjust_timer(iv);
- goto done;
- }
- }
-
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
@@ -1635,7 +1622,6 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

-done:
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -1972,7 +1958,6 @@ static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)

intel_init_cmci();
intel_init_lmce();
- mce_adjust_timer = cmci_intel_adjust_timer;
}

static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
@@ -1985,7 +1970,6 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = cmci_intel_adjust_timer;
break;

case X86_VENDOR_AMD: {
@@ -2642,8 +2626,6 @@ static void mce_reenable_cpu(void)

static int mce_cpu_dead(unsigned int cpu)
{
- mce_intel_hcpu_update(cpu);
-
/* intentionally ignoring frozen here */
if (!cpuhp_tasks_frozen)
cmci_rediscover();
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..052bf2708391 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -41,15 +41,6 @@
*/
static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);

-/*
- * CMCI storm detection backoff counter
- *
- * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
- * encountered an error. If not, we decrement it by one. We signal the end of
- * the CMCI storm when it reaches 0.
- */
-static DEFINE_PER_CPU(int, cmci_backoff_cnt);
-
/*
* cmci_discover_lock protects against parallel discovery attempts
* which could race against each other.
@@ -57,21 +48,6 @@ static DEFINE_PER_CPU(int, cmci_backoff_cnt);
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

#define CMCI_THRESHOLD 1
-#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (HZ)
-#define CMCI_STORM_THRESHOLD 15
-
-static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
-
-enum {
- CMCI_STORM_NONE,
- CMCI_STORM_ACTIVE,
- CMCI_STORM_SUBSIDED,
-};
-
-static atomic_t cmci_storm_on_cpus;

static int cmci_supported(int *banks)
{
@@ -127,124 +103,6 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

-bool mce_intel_cmci_poll(void)
-{
- if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
- return false;
-
- /*
- * Reset the counter if we've logged an error in the last poll
- * during the storm.
- */
- if (machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)))
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
- else
- this_cpu_dec(cmci_backoff_cnt);
-
- return true;
-}
-
-void mce_intel_hcpu_update(unsigned long cpu)
-{
- if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
- atomic_dec(&cmci_storm_on_cpus);
-
- per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
-}
-
-static void cmci_toggle_interrupt_mode(bool on)
-{
- unsigned long flags, *owned;
- int bank;
- u64 val;
-
- raw_spin_lock_irqsave(&cmci_discover_lock, flags);
- owned = this_cpu_ptr(mce_banks_owned);
- for_each_set_bit(bank, owned, MAX_NR_BANKS) {
- rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-
- if (on)
- val |= MCI_CTL2_CMCI_EN;
- else
- val &= ~MCI_CTL2_CMCI_EN;
-
- wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
- }
- raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-unsigned long cmci_intel_adjust_timer(unsigned long interval)
-{
- if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
- (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
- mce_notify_irq();
- return CMCI_STORM_INTERVAL;
- }
-
- switch (__this_cpu_read(cmci_storm_state)) {
- case CMCI_STORM_ACTIVE:
-
- /*
- * We switch back to interrupt mode once the poll timer has
- * silenced itself. That means no events recorded and the timer
- * interval is back to our poll interval.
- */
- __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- if (!atomic_sub_return(1, &cmci_storm_on_cpus))
- pr_notice("CMCI storm subsided: switching to interrupt mode\n");
-
- fallthrough;
-
- case CMCI_STORM_SUBSIDED:
- /*
- * We wait for all CPUs to go back to SUBSIDED state. When that
- * happens we switch back to interrupt mode.
- */
- if (!atomic_read(&cmci_storm_on_cpus)) {
- __this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
- cmci_toggle_interrupt_mode(true);
- cmci_recheck();
- }
- return CMCI_POLL_INTERVAL;
- default:
-
- /* We have shiny weather. Let the poll do whatever it thinks. */
- return interval;
- }
-}
-
-static bool cmci_storm_detect(void)
-{
- unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
- unsigned long ts = __this_cpu_read(cmci_time_stamp);
- unsigned long now = jiffies;
- int r;
-
- if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
- return true;
-
- if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
- cnt++;
- } else {
- cnt = 1;
- __this_cpu_write(cmci_time_stamp, now);
- }
- __this_cpu_write(cmci_storm_cnt, cnt);
-
- if (cnt <= CMCI_STORM_THRESHOLD)
- return false;
-
- cmci_toggle_interrupt_mode(false);
- __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
- r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_STORM_INTERVAL);
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
-
- if (r == 1)
- pr_notice("CMCI storm detected: switching to poll mode\n");
- return true;
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -253,9 +111,6 @@ static bool cmci_storm_detect(void)
*/
static void intel_threshold_interrupt(void)
{
- if (cmci_storm_detect())
- return;
-
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

--
2.39.2

2023-04-11 17:41:06

by Luck, Tony

Subject: [PATCH v5 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms

From: Smita Koralahalli <[email protected]>

Intel and AMD need to take different actions when a storm begins or
ends. Prepare for the storm code moving from intel.c into core.c by
adding a function that checks CPU vendor to pick the right action.

No functional changes.

[Tony: Changed from function pointer to regular function]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 3 +++
arch/x86/kernel/cpu/mce/core.c | 9 +++++++++
arch/x86/kernel/cpu/mce/intel.c | 12 ++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 1e8e0706a4e8..e0d76378c116 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,6 +41,7 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_handle_storm(int bank, bool on);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
@@ -48,6 +49,7 @@ void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
void track_cmci_storm(int bank, u64 status);
#else
+static inline void mce_intel_handle_storm(int bank, bool on) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
@@ -57,6 +59,7 @@ static inline void track_cmci_storm(int bank, u64 status) { }
#endif

void mce_timer_kick(bool storm);
+void mce_handle_storm(int bank, bool on);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 20347eb65b8b..099d8444aca4 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1975,6 +1975,15 @@ static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
intel_clear_lmce();
}

+void mce_handle_storm(int bank, bool on)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_handle_storm(bank, on);
+ break;
+ }
+}
+
static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
{
switch (c->x86_vendor) {
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 4106877de028..a8248514a689 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -152,6 +152,14 @@ static void cmci_set_threshold(int bank, int thresh)
raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
}

+void mce_intel_handle_storm(int bank, bool on)
+{
+ if (on)
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ else
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+}
+
static void cmci_storm_begin(int bank)
{
__set_bit(bank, this_cpu_ptr(mce_poll_banks));
@@ -211,13 +219,13 @@ void track_cmci_storm(int bank, u64 status)
if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
return;
pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
- cmci_set_threshold(bank, cmci_threshold[bank]);
+ mce_handle_storm(bank, false);
cmci_storm_end(bank);
} else {
if (hweight64(history) < STORM_BEGIN_THRESHOLD)
return;
pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
- cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ mce_handle_storm(bank, true);
cmci_storm_begin(bank);
}
}
--
2.39.2

2023-04-11 17:41:07

by Luck, Tony

Subject: [PATCH v5 5/5] x86/mce: Handle AMD threshold interrupt storms

From: Smita Koralahalli <[email protected]>

Extend the logic of handling CMCI storms to AMD threshold interrupts.

Use the same approach as Intel's CMCI handling to mitigate storms per
CPU and per bank. Unlike CMCI, do not raise thresholds to reduce the
interrupt rate during a storm. Instead, disable the interrupt on the
corresponding CPU and bank. Re-enable the interrupt once enough
consecutive polls of the bank show no corrected errors (30, as
programmed by Intel).

Turning off the threshold interrupts would be a better solution on AMD
systems as other error severities will still be handled even if the
threshold interrupts are disabled.

[Tony: Small tweak because mce_handle_storm() isn't a pointer now]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 4 +++
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/core.c | 3 ++
3 files changed, 56 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 9a2d6e289b8d..f1a48bc2e904 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -40,6 +40,8 @@ struct dentry *mce_get_debugfs_dir(void);

extern mce_banks_t mce_banks_ce_disabled;

+void track_cmci_storm(int bank, u64 status);
+
#ifdef CONFIG_X86_MCE_INTEL
void mce_intel_handle_storm(int bank, bool on);
void cmci_disable_bank(int bank);
@@ -222,6 +224,7 @@ extern bool filter_mce(struct mce *m);

#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
+void mce_amd_handle_storm(int bank, bool on);

/*
* If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
@@ -249,6 +252,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)

#else
static inline bool amd_filter_mce(struct mce *m) { return false; }
+static inline void mce_amd_handle_storm(int bank, bool on) {}
static inline void smca_extract_err_addr(struct mce *m) { }
#endif

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 23c5072fbbb7..cd79295e2a0a 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -468,6 +468,47 @@ static void threshold_restart_bank(void *_tr)
wrmsr(tr->b->address, lo, hi);
}

+static void _reset_block(struct threshold_block *block)
+{
+ struct thresh_restart tr;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.b = block;
+ threshold_restart_bank(&tr);
+}
+
+static void toggle_interrupt_reset_block(struct threshold_block *block, bool on)
+{
+ if (!block)
+ return;
+
+ block->interrupt_enable = !!on;
+ _reset_block(block);
+}
+
+void mce_amd_handle_storm(int bank, bool on)
+{
+ struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
+ struct threshold_bank **bp = this_cpu_read(threshold_banks);
+ unsigned long flags;
+
+ if (!bp)
+ return;
+
+ local_irq_save(flags);
+
+ first_block = bp[bank]->blocks;
+ if (!first_block)
+ goto end;
+
+ toggle_interrupt_reset_block(first_block, on);
+
+ list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
+ toggle_interrupt_reset_block(block, on);
+end:
+ local_irq_restore(flags);
+}
+
static void mce_threshold_block_init(struct threshold_block *b, int offset)
{
struct thresh_restart tr = {
@@ -868,6 +909,7 @@ static void amd_threshold_interrupt(void)
struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
struct threshold_bank **bp = this_cpu_read(threshold_banks);
unsigned int bank, cpu = smp_processor_id();
+ u64 status;

/*
* Validate that the threshold bank has been initialized already. The
@@ -881,6 +923,13 @@ static void amd_threshold_interrupt(void)
if (!(per_cpu(bank_map, cpu) & (1 << bank)))
continue;

+ rdmsrl(mca_msr_reg(bank, MCA_STATUS), status);
+ track_cmci_storm(bank, status);
+
+ /* Return early on an interrupt storm */
+ if (this_cpu_read(bank_storm[bank]))
+ return;
+
first_block = bp[bank]->blocks;
if (!first_block)
continue;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 820b317b1b85..fac90625d8cb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2062,6 +2062,9 @@ void mce_handle_storm(int bank, bool on)
case X86_VENDOR_INTEL:
mce_intel_handle_storm(bank, on);
break;
+ case X86_VENDOR_AMD:
+ mce_amd_handle_storm(bank, on);
+ break;
}
}

--
2.39.2

2023-04-11 17:41:41

by Luck, Tony

Subject: [PATCH v5 2/5] x86/mce: Add per-bank CMCI storm mitigation

Add an Intel-specific hook into machine_check_poll() to keep track
of per-CPU, per-bank corrected error logs (with a stub for the
CONFIG_X86_MCE_INTEL=n case).

Maintain a bitmap history for each bank showing whether the bank
logged a corrected error or not each time it is polled.

In normal operation the interval between polls of this bank
determines how far to shift the history. The 64 bit width corresponds
to about one second.

When a storm is observed the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second.
During a storm each bit in the history indicates the status of the
bank each time it is polled. Thus the history covers just over a minute.

Declare a storm for that bank if the number of corrected interrupts
seen in that history is above some threshold (5 in this RFC code for
ease of testing, likely move to 15 for compatibility with previous
storm detection).

A storm on a bank ends if enough consecutive polls of the bank show
no corrected errors (currently 30, may also change). That resets the
threshold in IA32_MCi_CTL2 back to 1, removes the bank from the bitmap
for polling, and changes the polling rate back to the default.

If a CPU with banks in storm mode is taken offline, the new CPU
that inherits ownership of those banks takes over management of
storm(s) in the inherited bank(s).

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 4 +-
arch/x86/kernel/cpu/mce/core.c | 26 ++++--
arch/x86/kernel/cpu/mce/intel.c | 139 ++++++++++++++++++++++++++++-
3 files changed, 158 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index f9331c6229b4..1e8e0706a4e8 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -46,15 +46,17 @@ void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
+void track_cmci_storm(int bank, u64 status);
#else
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
static inline void intel_clear_lmce(void) { }
static inline bool intel_filter_mce(struct mce *m) { return false; }
+static inline void track_cmci_storm(int bank, u64 status) { }
#endif

-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e7936be84204..20347eb65b8b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
barrier();
m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));

+ track_cmci_storm(i, m.status);
+
/* If this entry is not valid, ignore it */
if (!(m.status & MCI_STATUS_VAL))
continue;
@@ -1587,6 +1589,7 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;

static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);
+static DEFINE_PER_CPU(bool, storm_poll_mode);

static void __start_timer(struct timer_list *t, unsigned long interval)
{
@@ -1622,22 +1625,29 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

- __this_cpu_write(mce_next_interval, iv);
- __start_timer(t, iv);
+ if (__this_cpu_read(storm_poll_mode)) {
+ __start_timer(t, HZ);
+ } else {
+ __this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
}

/*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
*/
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- unsigned long iv = __this_cpu_read(mce_next_interval);

- __start_timer(t, interval);
+ __this_cpu_write(storm_poll_mode, storm);

- if (interval < iv)
- __this_cpu_write(mce_next_interval, interval);
+ if (storm)
+ __start_timer(t, HZ);
+ else
+ __this_cpu_write(mce_next_interval, check_interval * HZ);
}

/* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 052bf2708391..4106877de028 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,8 +47,40 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

+/*
+ * CMCI storm tracking state
+ * stormy_bank_count: per-cpu count of MC banks in storm state
+ * bank_history: bitmask tracking of corrected errors seen in each bank
+ * bank_time_stamp: last time (in jiffies) that each bank was polled
+ * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
+ */
+static DEFINE_PER_CPU(int, stormy_bank_count);
+static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
+static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
+static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
+static int cmci_threshold[MAX_NR_BANKS];
+
+/* Linux non-storm CMCI threshold (may be overridden by BIOS */
#define CMCI_THRESHOLD 1

+/*
+ * High threshold to limit CMCI rate during storms. Max supported is
+ * 0x7FFF. Use this slightly smaller value so it has a distinctive
+ * signature when some asks "Why am I not seeing all corrected errors?"
+ */
+#define CMCI_STORM_THRESHOLD 32749
+
+/*
+ * How many errors within the history buffer mark the start of a storm
+ */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over
+ */
+#define STORM_END_POLL_THRESHOLD 30
+
static int cmci_supported(int *banks)
{
u64 cap;
@@ -103,6 +135,93 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

+/*
+ * Set a new CMCI threshold value. Preserve the state of the
+ * MCI_CTL2_CMCI_EN bit in case this happens during a
+ * cmci_rediscover() operation.
+ */
+static void cmci_set_threshold(int bank, int thresh)
+{
+ unsigned long flags;
+ u64 val;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val | thresh);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+static void cmci_storm_begin(int bank)
+{
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_storm[bank], true);
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling
+ */
+ if (this_cpu_inc_return(stormy_bank_count) == 1)
+ mce_timer_kick(true);
+}
+
+static void cmci_storm_end(int bank)
+{
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ this_cpu_write(bank_history[bank], 0ull);
+ this_cpu_write(bank_storm[bank], false);
+
+ /* If no banks left in storm mode, stop polling */
+ if (!this_cpu_dec_return(stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void track_cmci_storm(int bank, u64 status)
+{
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!this_cpu_read(bank_storm[bank])) {
+ delta = now - this_cpu_read(bank_time_stamp[bank]);
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If has been a long time since the last poll, clear history */
+ if (shift >= 64)
+ history = 0;
+ else
+ history = this_cpu_read(bank_history[bank]) << shift;
+ this_cpu_write(bank_time_stamp[bank], now);
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
+ history |= 1;
+ this_cpu_write(bank_history[bank], history);
+
+ if (this_cpu_read(bank_storm[bank])) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+ cmci_storm_end(bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ cmci_storm_begin(bank);
+ }
+}
+
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -147,6 +266,9 @@ static void cmci_discover(int banks)
continue;
}

+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ goto storm;
+
if (!mca_cfg.bios_cmci_threshold) {
val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
val |= CMCI_THRESHOLD;
@@ -159,7 +281,7 @@ static void cmci_discover(int banks)
bios_zero_thresh = 1;
val |= CMCI_THRESHOLD;
}
-
+storm:
val |= MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(i), val);
rdmsrl(MSR_IA32_MCx_CTL2(i), val);
@@ -167,7 +289,14 @@ static void cmci_discover(int banks)
/* Did the enable bit stick? -- the bank supports CMCI */
if (val & MCI_CTL2_CMCI_EN) {
set_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
+ pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), i);
+ this_cpu_write(bank_history[i], ~0ull);
+ this_cpu_write(bank_time_stamp[i], jiffies);
+ cmci_storm_begin(i);
+ } else {
+ __clear_bit(i, this_cpu_ptr(mce_poll_banks));
+ }
/*
* We are able to set thresholds for some banks that
* had a threshold of 0. This means the BIOS has not
@@ -177,6 +306,10 @@ static void cmci_discover(int banks)
if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
(val & MCI_CTL2_CMCI_THRESHOLD_MASK))
bios_wrong_thresh = 1;
+
+ /* Save default threshold for each bank */
+ if (cmci_threshold[i] == 0)
+ cmci_threshold[i] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
} else {
WARN_ON(!test_bit(i, this_cpu_ptr(mce_poll_banks)));
}
@@ -218,6 +351,8 @@ static void __cmci_disable_bank(int bank)
val &= ~MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
__clear_bit(bank, this_cpu_ptr(mce_banks_owned));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ cmci_storm_end(bank);
}

/*
--
2.39.2

2023-06-13 18:06:34

by Borislav Petkov

Subject: Re: [PATCH v5 2/5] x86/mce: Add per-bank CMCI storm mitigation

On Tue, Apr 11, 2023 at 10:38:38AM -0700, Tony Luck wrote:
> @@ -1587,6 +1589,7 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
>
> static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
> static DEFINE_PER_CPU(struct timer_list, mce_timer);
> +static DEFINE_PER_CPU(bool, storm_poll_mode);

See comment below about putting all those storm-related vars in a struct.

Also, there's another bool - bank_storm - which looks like it does the
same.

> static void __start_timer(struct timer_list *t, unsigned long interval)
> {
> @@ -1622,22 +1625,29 @@ static void mce_timer_fn(struct timer_list *t)
> else
> iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
>
> - __this_cpu_write(mce_next_interval, iv);
> - __start_timer(t, iv);
> + if (__this_cpu_read(storm_poll_mode)) {
> + __start_timer(t, HZ);
> + } else {
> + __this_cpu_write(mce_next_interval, iv);
> + __start_timer(t, iv);
> + }
> }
>
> /*
> - * Ensure that the timer is firing in @interval from now.
> + * When a storm starts on any bank on this CPU, switch to polling
> + * once per second. When the storm ends, revert to the default
> + * polling interval.
> */
> -void mce_timer_kick(unsigned long interval)
> +void mce_timer_kick(bool storm)
> {
> struct timer_list *t = this_cpu_ptr(&mce_timer);
> - unsigned long iv = __this_cpu_read(mce_next_interval);
>
> - __start_timer(t, interval);
> + __this_cpu_write(storm_poll_mode, storm);
>
> - if (interval < iv)
> - __this_cpu_write(mce_next_interval, interval);
> + if (storm)
> + __start_timer(t, HZ);
> + else
> + __this_cpu_write(mce_next_interval, check_interval * HZ);

This looks very familiar to what mce_timer_fn() above does. Add
a helper.

> /* Must not be called in IRQ context where del_timer_sync() can deadlock */
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index 052bf2708391..4106877de028 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -47,8 +47,40 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
> */
> static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
>
> +/*
> + * CMCI storm tracking state
> + * stormy_bank_count: per-cpu count of MC banks in storm state
> + * bank_history: bitmask tracking of corrected errors seen in each bank

bank_storm: determines whether the bank is in storm mode

> + * bank_time_stamp: last time (in jiffies) that each bank was polled
> + * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
> + */
> +static DEFINE_PER_CPU(int, stormy_bank_count);
> +static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
> +static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
> +static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);

All those are begging to be a

struct mca_storm_desc {

....

};

or so, so that they don't "dangle" randomly all over the place and one
doesn't know what they belong to.

Every time you then do storm management, you get the percpu pointer and
do

storm_desc->bank_history[bank] ...
storm_desc->bank_count
...

and so on.

> +static int cmci_threshold[MAX_NR_BANKS];

Why do we have to save per-bank thresholds instead of writing a default
non-storm value into all? Why are they each special?

> +
> +/* Linux non-storm CMCI threshold (may be overridden by BIOS */

Missing ")".

> #define CMCI_THRESHOLD 1
>
> +/*
> + * High threshold to limit CMCI rate during storms. Max supported is
> + * 0x7FFF. Use this slightly smaller value so it has a distinctive
> + * signature when some asks "Why am I not seeing all corrected errors?"
> + */
> +#define CMCI_STORM_THRESHOLD 32749

Why if you can simply clear CMCI_EN and disable CMCI for this bank while
the storm goes on?

And reenable it when it subsides?

> +void track_cmci_storm(int bank, u64 status)

cmci_track_storm

> +{
> + unsigned long now = jiffies, delta;
> + unsigned int shift = 1;
> + u64 history;
> +
> + /*
> + * When a bank is in storm mode it is polled once per second and
> + * the history mask will record about the last minute of poll results.
> + * If it is not in storm mode, then the bank is only checked when
> + * there is a CMCI interrupt. Check how long it has been since
> + * this bank was last checked, and adjust the amount of "shift"
> + * to apply to history.
> + */
> + if (!this_cpu_read(bank_storm[bank])) {
> + delta = now - this_cpu_read(bank_time_stamp[bank]);
> + shift = (delta + HZ) / HZ;
> + }
> +
> + /* If has been a long time since the last poll, clear history */
> + if (shift >= 64)
> + history = 0;
> + else
> + history = this_cpu_read(bank_history[bank]) << shift;

<---- newline here.

> + this_cpu_write(bank_time_stamp[bank], now);
> +
> + /* History keeps track of corrected errors. VAL=1 && UC=0 */
> + if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
> + history |= 1;

Ditto.

> + this_cpu_write(bank_history[bank], history);
> +
> + if (this_cpu_read(bank_storm[bank])) {

You just read bank_storm and now you're reading it again. Just do
a struct pls.

> + if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))

"- 1" because you start from 0? So define the STORM_END_POLL_THRESHOLD
thing above as (30 - 1) and explain why.

> + return;

<---- newline here.

> + pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
> + cmci_set_threshold(bank, cmci_threshold[bank]);
> + cmci_storm_end(bank);
> + } else {
> + if (hweight64(history) < STORM_BEGIN_THRESHOLD)

How am I to understand this? Is that the "5 in this RFC code for ease of
testing" thing from the commit message?

> + return;

<---- newline here.

> + pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
> + cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
> + cmci_storm_begin(bank);
> + }
> +}
> +
> /*
> * The interrupt handler. This is called on every event.
> * Just call the poller directly to log any events.
> @@ -147,6 +266,9 @@ static void cmci_discover(int banks)
> continue;
> }
>
> + if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)

This is silly: you have at least two per-cpu bools which record which
banks are in storm mode. Why don't you query them?

> + goto storm;
> +
> if (!mca_cfg.bios_cmci_threshold) {
> val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
> val |= CMCI_THRESHOLD;
> @@ -159,7 +281,7 @@ static void cmci_discover(int banks)
> bios_zero_thresh = 1;
> val |= CMCI_THRESHOLD;
> }
> -
> +storm:

That piece from here on wants to be a separate helper - that function is
becoming huge and unwieldy, doing a bunch of things.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-06-16 18:28:32

by Luck, Tony

Subject: Re: [PATCH v5 2/5] x86/mce: Add per-bank CMCI storm mitigation

On Tue, Jun 13, 2023 at 07:45:53PM +0200, Borislav Petkov wrote:
> On Tue, Apr 11, 2023 at 10:38:38AM -0700, Tony Luck wrote:
> > @@ -1587,6 +1589,7 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
> >
> > static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
> > static DEFINE_PER_CPU(struct timer_list, mce_timer);
> > +static DEFINE_PER_CPU(bool, storm_poll_mode);
>
> See comment below about putting all those storm-related vars in a struct.

Done. Looks much better without the forest of this_cpu*() operators. Thanks.

> Also, there's another bool - bank_storm - which looks like it does the
> same.

storm_poll_mode is a regular per-cpu variable that indicates a CPU is in
poll mode because one or more of the banks it owns has gone over the
storm threshold.

bank_storm - is a per-cpu per-bank indicator that a particular bank
on a particular CPU is in storm mode.
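
To illustrate the difference between the two flags, here is a minimal
userspace sketch of one per-CPU descriptor holding both. The struct name,
field names, and SKETCH_NR_BANKS are hypothetical and are not the layout
used in the posted patches; the begin/end logic mirrors cmci_storm_begin()
and cmci_storm_end() from the v5 patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_BANKS 8	/* hypothetical; the kernel uses MAX_NR_BANKS */

/* One instance per CPU (hypothetical layout, for illustration only). */
struct storm_desc_sketch {
	int stormy_bank_count;			/* banks on this CPU in storm state */
	bool poll_mode;				/* CPU-wide: poll once per second */
	bool bank_storm[SKETCH_NR_BANKS];	/* per bank: is this bank in a storm? */
	uint64_t bank_history[SKETCH_NR_BANKS];	/* per bank: error history bitmap */
};

static void sketch_storm_begin(struct storm_desc_sketch *d, int bank)
{
	assert(bank >= 0 && bank < SKETCH_NR_BANKS);
	d->bank_storm[bank] = true;

	/* The first stormy bank on this CPU switches the CPU to poll mode. */
	if (++d->stormy_bank_count == 1)
		d->poll_mode = true;
}

static void sketch_storm_end(struct storm_desc_sketch *d, int bank)
{
	assert(bank >= 0 && bank < SKETCH_NR_BANKS);
	d->bank_storm[bank] = false;
	d->bank_history[bank] = 0;

	/* The last stormy bank leaving storm state ends poll mode. */
	if (--d->stormy_bank_count == 0)
		d->poll_mode = false;
}
```

With two banks in a storm, ending the storm on one of them clears only that
bank's flag; poll_mode stays set until the count drops to zero.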

>
> > static void __start_timer(struct timer_list *t, unsigned long interval)
> > {
> > @@ -1622,22 +1625,29 @@ static void mce_timer_fn(struct timer_list *t)
> > else
> > iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
> >
> > - __this_cpu_write(mce_next_interval, iv);
> > - __start_timer(t, iv);
> > + if (__this_cpu_read(storm_poll_mode)) {
> > + __start_timer(t, HZ);
> > + } else {
> > + __this_cpu_write(mce_next_interval, iv);
> > + __start_timer(t, iv);
> > + }
> > }
> >
> > /*
> > - * Ensure that the timer is firing in @interval from now.
> > + * When a storm starts on any bank on this CPU, switch to polling
> > + * once per second. When the storm ends, revert to the default
> > + * polling interval.
> > */
> > -void mce_timer_kick(unsigned long interval)
> > +void mce_timer_kick(bool storm)
> > {
> > struct timer_list *t = this_cpu_ptr(&mce_timer);
> > - unsigned long iv = __this_cpu_read(mce_next_interval);
> >
> > - __start_timer(t, interval);
> > + __this_cpu_write(storm_poll_mode, storm);
> >
> > - if (interval < iv)
> > - __this_cpu_write(mce_next_interval, interval);
> > + if (storm)
> > + __start_timer(t, HZ);
> > + else
> > + __this_cpu_write(mce_next_interval, check_interval * HZ);
>
> This looks very familiar to what mce_timer_fn() above does. Add
> a helper.

Looking at the final versions of these functions with patches applied,
I'm not seeing the similarities.

> > /* Must not be called in IRQ context where del_timer_sync() can deadlock */
> > diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> > index 052bf2708391..4106877de028 100644
> > --- a/arch/x86/kernel/cpu/mce/intel.c
> > +++ b/arch/x86/kernel/cpu/mce/intel.c
> > @@ -47,8 +47,40 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
> > */
> > static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
> >
> > +/*
> > + * CMCI storm tracking state
> > + * stormy_bank_count: per-cpu count of MC banks in storm state
> > + * bank_history: bitmask tracking of corrected errors seen in each bank
>
> bank_storm: determines whether the bank is in storm mode

Good catch. Added.

>
> > + * bank_time_stamp: last time (in jiffies) that each bank was polled
> > + * cmci_threshold: MCi_CTL2 threshold for each bank when there is no storm
> > + */
> > +static DEFINE_PER_CPU(int, stormy_bank_count);
> > +static DEFINE_PER_CPU(u64 [MAX_NR_BANKS], bank_history);
> > +static DEFINE_PER_CPU(bool [MAX_NR_BANKS], bank_storm);
> > +static DEFINE_PER_CPU(unsigned long [MAX_NR_BANKS], bank_time_stamp);
>
> All those are begging to be a
>
> struct mca_storm_desc {
>
> ....
>
> };
>
> or so, so that they don't "dangle" randomly all over the place and one
> doesn't know what they belong to.
>
> Every time you then do storm management, you get the percpu pointer and
> do
>
> storm_desc->bank_history[bank] ...
> storm_desc->bank_count
> ...
>
> and so on.

Yup. Done.

> > +static int cmci_threshold[MAX_NR_BANKS];
>
> Why do we have to save per-bank thresholds instead of writing a default
> non-storm value into all? Why are they each special?

Because we have an option to use thresholds set by BIOS.

> > +
> > +/* Linux non-storm CMCI threshold (may be overridden by BIOS */
>
> Missing ")".

Fixed.

> > #define CMCI_THRESHOLD 1
> >
> > +/*
> > + * High threshold to limit CMCI rate during storms. Max supported is
> > + * 0x7FFF. Use this slightly smaller value so it has a distinctive
> > + * signature when some asks "Why am I not seeing all corrected errors?"
> > + */
> > +#define CMCI_STORM_THRESHOLD 32749
>
> Why if you can simply clear CMCI_EN and disable CMCI for this bank while
> the storm goes on?
>
> And reenable it when it subsides?

Because Intel reports both corrected and uncorrected errors in the same
bank and signals both with CMCI (that first "C" stands for "Corrected",
so this is now a misleading name). I want Linux to get notification of
uncorrected errors in a timely fashion, so CMCI has to stay enabled.

AMD doesn't have this problem, Smita's patch disables CMCI as you
suggest.
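
For illustration, the "raise the threshold but keep CMCI enabled" idea can
be sketched in userspace as a pure function on the MCi_CTL2 value. This
mirrors the mask-and-or step in cmci_set_threshold() from the patch, minus
the MSR access and locking. The helper name ctl2_with_threshold() is
hypothetical; the bit definitions are assumed to match
arch/x86/include/asm/mce.h.

```c
#include <assert.h>
#include <stdint.h>

#define MCI_CTL2_CMCI_EN		(1ULL << 30)	/* assumed: CMCI enable bit */
#define MCI_CTL2_CMCI_THRESHOLD_MASK	0x7fffULL	/* assumed: 15-bit threshold */
#define CMCI_STORM_THRESHOLD		32749

/* Replace only the threshold field; every other bit, including EN, is kept. */
static uint64_t ctl2_with_threshold(uint64_t ctl2, unsigned int thresh)
{
	assert(thresh <= MCI_CTL2_CMCI_THRESHOLD_MASK);
	ctl2 &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
	return ctl2 | thresh;
}
```

Because the enable bit is never touched, uncorrected errors signaled via
CMCI are still delivered while the corrected-error threshold is high.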

>
> > +void track_cmci_storm(int bank, u64 status)
>
> cmci_track_storm

Updated.

> > +{
> > + unsigned long now = jiffies, delta;
> > + unsigned int shift = 1;
> > + u64 history;
> > +
> > + /*
> > + * When a bank is in storm mode it is polled once per second and
> > + * the history mask will record about the last minute of poll results.
> > + * If it is not in storm mode, then the bank is only checked when
> > + * there is a CMCI interrupt. Check how long it has been since
> > + * this bank was last checked, and adjust the amount of "shift"
> > + * to apply to history.
> > + */
> > + if (!this_cpu_read(bank_storm[bank])) {
> > + delta = now - this_cpu_read(bank_time_stamp[bank]);
> > + shift = (delta + HZ) / HZ;
> > + }
> > +
> > + /* If has been a long time since the last poll, clear history */
> > + if (shift >= 64)
> > + history = 0;
> > + else
> > + history = this_cpu_read(bank_history[bank]) << shift;
>
> <---- newline here.

Added

> > + this_cpu_write(bank_time_stamp[bank], now);
> > +
> > + /* History keeps track of corrected errors. VAL=1 && UC=0 */
> > + if ((status & (MCI_STATUS_VAL | MCI_STATUS_UC)) == MCI_STATUS_VAL)
> > + history |= 1;
>
> Ditto.

Ditto.

> > + this_cpu_write(bank_history[bank], history);
> > +
> > + if (this_cpu_read(bank_storm[bank])) {
>
> You just read bank_storm and now you're reading it again. Just do
> a struct pls.
>
> > + if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0))
>
> "- 1" because you start from 0? So define the STORM_END_POLL_THRESHOLD
> thing above as (30 - 1) and explain why.

Because the low bit in a bitmap is named 0. I want to check if any of
the low 30 bits are set, so I need a bitmask with bits {29..0}
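
A small userspace sketch of that check, for illustration only.
SKETCH_GENMASK_ULL is a stand-in written for this example, assumed to
behave like the kernel's GENMASK_ULL(h, l); storm_subsided() is a
hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's GENMASK_ULL(h, l): bits h..l set, inclusive. */
#define SKETCH_GENMASK_ULL(h, l) \
	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

#define STORM_END_POLL_THRESHOLD 30

/*
 * Bit 0 is the most recent poll. The storm is over when none of the
 * last 30 polls (bits 29..0) recorded a corrected error.
 */
static bool storm_subsided(uint64_t history)
{
	return !(history & SKETCH_GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0));
}
```

An error recorded 31 polls ago sits in bit 30, outside the mask, so it no
longer keeps the storm alive; one recorded 30 polls ago (bit 29) still does.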

> > + return;
>
> <---- newline here.

Added.

> > + pr_notice("CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), bank);
> > + cmci_set_threshold(bank, cmci_threshold[bank]);
> > + cmci_storm_end(bank);
> > + } else {
> > + if (hweight64(history) < STORM_BEGIN_THRESHOLD)
>
> How am I to understand this? Is that the "5 in this RFC code for ease of
> testing" thing from the commit message?

Yes. I've fixed up the commit message to remove the "ease of testing". 5
seems (to me) to be a reasonable value. But it's #define so easy to
change if anyone has data to support a better choice.
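
As a rough userspace model of how the history and this threshold interact
(a sketch under simplifying assumptions, not the kernel code): time is
counted in whole seconds instead of jiffies, so the kernel's
(delta + HZ) / HZ becomes elapsed_secs + 1, and a portable loop stands in
for hweight64(). The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_BEGIN_THRESHOLD 5

/* Portable population count, standing in for the kernel's hweight64(). */
static unsigned int popcount64(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Age the history by the whole seconds since the last check (always at
 * least one bit), then record the current poll result in bit 0.
 */
static uint64_t history_update(uint64_t history, unsigned int elapsed_secs,
			       bool corrected_error)
{
	unsigned int shift = elapsed_secs + 1;

	history = (shift >= 64) ? 0 : history << shift;
	if (corrected_error)
		history |= 1;
	return history;
}

/* A storm starts once enough errors are present in the 64-bit window. */
static bool storm_begins(uint64_t history)
{
	return popcount64(history) >= STORM_BEGIN_THRESHOLD;
}
```

In this model five back-to-back errors produce the history 0x1f and trip
the threshold, while a single error after a gap of 63 seconds or more
starts again from an empty history.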

> > + return;
>
> <---- newline here.

Added.

> > + pr_notice("CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), bank);
> > + cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
> > + cmci_storm_begin(bank);
> > + }
> > +}
> > +
> > /*
> > * The interrupt handler. This is called on every event.
> > * Just call the poller directly to log any events.
> > @@ -147,6 +266,9 @@ static void cmci_discover(int banks)
> > continue;
> > }
> >
> > + if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
>
> This is silly: you have at least two per-cpu bools which record which
> banks are in storm mode. Why don't you query them?

This is the case where a CPU is taken offline while a storm is in
progress for one of its banks. So the bool would tell us the storm
was in progress if we knew which CPU was the previous owner of this
bank. But there's no way to know that. Which banks are shared by
which CPUs isn't enumerated anywhere. So the old CPU went offline
with a storm active, the new CPU picking up ownership of this bank
must carry on managing the storm.
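
The hand-over works because the storm state is visible in the bank's
register itself. A minimal userspace sketch of that test follows; the
helper name ctl2_shows_storm() is hypothetical, and the bit definitions
are assumed to match arch/x86/include/asm/mce.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_CTL2_CMCI_EN		(1ULL << 30)	/* assumed: CMCI enable bit */
#define MCI_CTL2_CMCI_THRESHOLD_MASK	0x7fffULL	/* assumed: 15-bit threshold */
#define CMCI_STORM_THRESHOLD		32749

/*
 * A new owner cannot read the previous owner's per-CPU flags, but it can
 * recognize the distinctive storm threshold left in MCi_CTL2.
 */
static bool ctl2_shows_storm(uint64_t ctl2)
{
	return (ctl2 & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD;
}
```

This is also why the storm threshold is a distinctive value rather than the
maximum 0x7FFF: the register contents alone identify a bank in storm mode.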

> > + goto storm;
> > +
> > if (!mca_cfg.bios_cmci_threshold) {
> > val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
> > val |= CMCI_THRESHOLD;
> > @@ -159,7 +281,7 @@ static void cmci_discover(int banks)
> > bios_zero_thresh = 1;
> > val |= CMCI_THRESHOLD;
> > }
> > -
> > +storm:
>
> That piece from here on wants to be a separate helper - that function is
> becoming huge and unwieldy, doing a bunch of things.

Agreed. I pulled out three helpers. The result looks (IMHO) far more
readable.
>
> Thx.

Thanks for the detailed review. Will post new version later today.

-Tony

2023-06-16 18:53:25

by Luck, Tony

Subject: [PATCH v6 0/4] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. On Intel, when a storm is detected from a bank it is mitigated
by setting a very high threshold for corrected errors to signal CMCI.
This threshold does not affect signaling CMCI for uncorrected errors.

AMD's storm mitigation for threshold interrupts also relies on a per-CPU,
per-bank approach similar to Intel's. But unlike CMCI storm handling it
does not set thresholds to reduce the rate of interrupts during a storm.
Rather it turns off the interrupt on the current CPU and bank if there is
a storm and re-enables the interrupt when the storm subsides.

It is okay to turn off threshold interrupts on AMD systems as other error
severities continue to be handled even if the threshold interrupts are
turned off. Uncorrected errors will generate a #MC and deferred errors
have their own separate deferred error interrupt. The third patch adds
support for handling threshold interrupt storms on AMD systems.
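
The difference between the two vendor policies can be sketched with a
toy model. Everything below is hypothetical (the enum, the struct and
handle_storm() are not kernel names, and the struct is not a real
register layout); the high threshold value is the one the Intel patch
later in this thread uses:

```c
#include <stdbool.h>

enum vendor { VENDOR_INTEL, VENDOR_AMD };	/* hypothetical, for this sketch */

/* Toy model of one bank's interrupt controls; not a real register layout. */
struct bank_model {
	unsigned int threshold;		/* corrected errors needed to signal */
	unsigned int saved_threshold;	/* value to restore when the storm ends */
	bool irq_enabled;		/* threshold interrupt enable */
};

#define STORM_THRESHOLD	32749u	/* high value used by the Intel patch */

static void handle_storm(enum vendor v, struct bank_model *b, bool on)
{
	switch (v) {
	case VENDOR_INTEL:
		/* Interrupt stays enabled; only the corrected-error threshold moves. */
		b->threshold = on ? STORM_THRESHOLD : b->saved_threshold;
		break;
	case VENDOR_AMD:
		/* Threshold untouched; the interrupt itself is switched off and on. */
		b->irq_enabled = !on;
		break;
	}
}
```

On Intel the interrupt must stay enabled because uncorrected errors
share it; on AMD those severities have their own signalling, so the
threshold interrupt can simply be turned off.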

Changes since last version:

1) Broke series into different steps. Previous series was based on
the sequence of development:
a) Remove old storm code
b) Add Intel specific storm mitigation
c) Small refactor to allow support for multiple vendors
d) Move all the generic parts from intel.c to core.c
e) Add support for AMD
but this resulted in a lot of code being added in step 'b' and then
moved in step 'd'. Preserving this history in Linux GIT commits doesn't
seem overly useful. So I squashed parts 'b' through 'e' together and
then split them into a more rational set:
a) Same as before, removes old storm code.
b) Add generic code for storm handling
c) Add AMD vendor specific code
d) Add Intel vendor specific code.

2) Boris suggested that all the storm tracking variables should be
bundled into a "storm_desc" structure, then declare a "storm" pointer
initialized using this_cpu_ptr() then replace the forest of this_cpu_*()
code with operation on "storm->field".

3) The Intel cmci_discover() function is far too large, with
many separate things going on in the inner loop. Per Boris' suggestion,
refactored with some helper functions.

4) Numerous other small cleanups, additional comments to explain
what is happening in areas where it wasn't obvious.

Smita Koralahalli (1):
x86/mce: Handle AMD threshold interrupt storms

Tony Luck (3):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation
x86/mce: Handle Intel threshold interrupt storms

arch/x86/kernel/cpu/mce/internal.h | 43 +++-
arch/x86/kernel/cpu/mce/amd.c | 49 +++++
arch/x86/kernel/cpu/mce/core.c | 135 +++++++++---
arch/x86/kernel/cpu/mce/intel.c | 333 +++++++++++++----------------
4 files changed, 337 insertions(+), 223 deletions(-)


base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
--
2.40.1


2023-06-16 18:55:34

by Luck, Tony

Subject: [PATCH v6 3/4] x86/mce: Handle AMD threshold interrupt storms

From: Smita Koralahalli <[email protected]>

Add hook into core storm handling code for AMD threshold interrupts.

Disable the interrupt on the corresponding CPU and bank. Re-enable
the interrupt if enough consecutive polls of the bank show no
corrected errors.

Turning off the threshold interrupts is the best solution on AMD systems
as other error severities will still be handled even if the threshold
interrupts are disabled.

[Tony: Updated places where storm tracking variables moved into a
structure]

Signed-off-by: Smita Koralahalli <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 2 ++
arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/core.c | 3 ++
3 files changed, 54 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index eae88a824d97..22899d28138f 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -232,6 +232,7 @@ extern bool filter_mce(struct mce *m);

#ifdef CONFIG_X86_MCE_AMD
extern bool amd_filter_mce(struct mce *m);
+void mce_amd_handle_storm(int bank, bool on);

/*
* If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
@@ -259,6 +260,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)

#else
static inline bool amd_filter_mce(struct mce *m) { return false; }
+static inline void mce_amd_handle_storm(int bank, bool on) {}
static inline void smca_extract_err_addr(struct mce *m) { }
#endif

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 0b971f974096..b19f3eb70187 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -468,6 +468,47 @@ static void threshold_restart_bank(void *_tr)
wrmsr(tr->b->address, lo, hi);
}

+static void _reset_block(struct threshold_block *block)
+{
+ struct thresh_restart tr;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.b = block;
+ threshold_restart_bank(&tr);
+}
+
+static void toggle_interrupt_reset_block(struct threshold_block *block, bool on)
+{
+ if (!block)
+ return;
+
+ block->interrupt_enable = !!on;
+ _reset_block(block);
+}
+
+void mce_amd_handle_storm(int bank, bool on)
+{
+ struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
+ struct threshold_bank **bp = this_cpu_read(threshold_banks);
+ unsigned long flags;
+
+ if (!bp)
+ return;
+
+ local_irq_save(flags);
+
+ first_block = bp[bank]->blocks;
+ if (!first_block)
+ goto end;
+
+ toggle_interrupt_reset_block(first_block, on);
+
+ list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
+ toggle_interrupt_reset_block(block, on);
+end:
+ local_irq_restore(flags);
+}
+
static void mce_threshold_block_init(struct threshold_block *b, int offset)
{
struct thresh_restart tr = {
@@ -868,6 +909,7 @@ static void amd_threshold_interrupt(void)
struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
struct threshold_bank **bp = this_cpu_read(threshold_banks);
unsigned int bank, cpu = smp_processor_id();
+ u64 status;

/*
* Validate that the threshold bank has been initialized already. The
@@ -881,6 +923,13 @@ static void amd_threshold_interrupt(void)
if (!(per_cpu(bank_map, cpu) & BIT_ULL(bank)))
continue;

+ rdmsrl(mca_msr_reg(bank, MCA_STATUS), status);
+ track_cmci_storm(bank, status);
+
+ /* Return early on an interrupt storm */
+ if (this_cpu_read(storm_desc.bank_storm[bank]))
+ return;
+
first_block = bp[bank]->blocks;
if (!first_block)
continue;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index cd9d9ea5bb0a..d4c9dc194d56 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2055,6 +2055,9 @@ static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
void mce_handle_storm(int bank, bool on)
{
switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_AMD:
+ mce_amd_handle_storm(bank, on);
+ break;
}
}

--
2.40.1


2023-06-16 18:55:42

by Luck, Tony

Subject: [PATCH v6 1/4] x86/mce: Remove old CMCI storm mitigation code

When a "storm" of CMCI is detected this code mitigates by
disabling CMCI interrupt signalling from all of the banks
owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the
banks was generating the interrupts, but CMCI is disabled for all.
This means Linux may delay seeing and processing errors logged
from other banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it
is also used to signal when an uncorrected error is logged. This
is a problem because these errors should be handled in a timely
manner.

Delete all this code in preparation for a finer grained solution.

Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 6 --
arch/x86/kernel/cpu/mce/core.c | 20 +---
arch/x86/kernel/cpu/mce/intel.c | 145 -----------------------------
3 files changed, 1 insertion(+), 170 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index d2412ce2d312..9dcad55835fa 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,18 +41,12 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long cmci_intel_adjust_timer(unsigned long interval);
-bool mce_intel_cmci_poll(void);
-void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
-# define cmci_intel_adjust_timer mce_adjust_timer_default
-static inline bool mce_intel_cmci_poll(void) { return false; }
-static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..e7936be84204 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1588,13 +1588,6 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);

-static unsigned long mce_adjust_timer_default(unsigned long interval)
-{
- return interval;
-}
-
-static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
-
static void __start_timer(struct timer_list *t, unsigned long interval)
{
unsigned long when = jiffies + interval;
@@ -1617,15 +1610,9 @@ static void mce_timer_fn(struct timer_list *t)

iv = __this_cpu_read(mce_next_interval);

- if (mce_available(this_cpu_ptr(&cpu_info))) {
+ if (mce_available(this_cpu_ptr(&cpu_info)))
machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

- if (mce_intel_cmci_poll()) {
- iv = mce_adjust_timer(iv);
- goto done;
- }
- }
-
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
@@ -1635,7 +1622,6 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

-done:
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -1972,7 +1958,6 @@ static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)

intel_init_cmci();
intel_init_lmce();
- mce_adjust_timer = cmci_intel_adjust_timer;
}

static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
@@ -1985,7 +1970,6 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = cmci_intel_adjust_timer;
break;

case X86_VENDOR_AMD: {
@@ -2642,8 +2626,6 @@ static void mce_reenable_cpu(void)

static int mce_cpu_dead(unsigned int cpu)
{
- mce_intel_hcpu_update(cpu);
-
/* intentionally ignoring frozen here */
if (!cpuhp_tasks_frozen)
cmci_rediscover();
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..052bf2708391 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -41,15 +41,6 @@
*/
static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);

-/*
- * CMCI storm detection backoff counter
- *
- * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
- * encountered an error. If not, we decrement it by one. We signal the end of
- * the CMCI storm when it reaches 0.
- */
-static DEFINE_PER_CPU(int, cmci_backoff_cnt);
-
/*
* cmci_discover_lock protects against parallel discovery attempts
* which could race against each other.
@@ -57,21 +48,6 @@ static DEFINE_PER_CPU(int, cmci_backoff_cnt);
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

#define CMCI_THRESHOLD 1
-#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (HZ)
-#define CMCI_STORM_THRESHOLD 15
-
-static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
-
-enum {
- CMCI_STORM_NONE,
- CMCI_STORM_ACTIVE,
- CMCI_STORM_SUBSIDED,
-};
-
-static atomic_t cmci_storm_on_cpus;

static int cmci_supported(int *banks)
{
@@ -127,124 +103,6 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

-bool mce_intel_cmci_poll(void)
-{
- if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
- return false;
-
- /*
- * Reset the counter if we've logged an error in the last poll
- * during the storm.
- */
- if (machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)))
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
- else
- this_cpu_dec(cmci_backoff_cnt);
-
- return true;
-}
-
-void mce_intel_hcpu_update(unsigned long cpu)
-{
- if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
- atomic_dec(&cmci_storm_on_cpus);
-
- per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
-}
-
-static void cmci_toggle_interrupt_mode(bool on)
-{
- unsigned long flags, *owned;
- int bank;
- u64 val;
-
- raw_spin_lock_irqsave(&cmci_discover_lock, flags);
- owned = this_cpu_ptr(mce_banks_owned);
- for_each_set_bit(bank, owned, MAX_NR_BANKS) {
- rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-
- if (on)
- val |= MCI_CTL2_CMCI_EN;
- else
- val &= ~MCI_CTL2_CMCI_EN;
-
- wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
- }
- raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-unsigned long cmci_intel_adjust_timer(unsigned long interval)
-{
- if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
- (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
- mce_notify_irq();
- return CMCI_STORM_INTERVAL;
- }
-
- switch (__this_cpu_read(cmci_storm_state)) {
- case CMCI_STORM_ACTIVE:
-
- /*
- * We switch back to interrupt mode once the poll timer has
- * silenced itself. That means no events recorded and the timer
- * interval is back to our poll interval.
- */
- __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- if (!atomic_sub_return(1, &cmci_storm_on_cpus))
- pr_notice("CMCI storm subsided: switching to interrupt mode\n");
-
- fallthrough;
-
- case CMCI_STORM_SUBSIDED:
- /*
- * We wait for all CPUs to go back to SUBSIDED state. When that
- * happens we switch back to interrupt mode.
- */
- if (!atomic_read(&cmci_storm_on_cpus)) {
- __this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
- cmci_toggle_interrupt_mode(true);
- cmci_recheck();
- }
- return CMCI_POLL_INTERVAL;
- default:
-
- /* We have shiny weather. Let the poll do whatever it thinks. */
- return interval;
- }
-}
-
-static bool cmci_storm_detect(void)
-{
- unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
- unsigned long ts = __this_cpu_read(cmci_time_stamp);
- unsigned long now = jiffies;
- int r;
-
- if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
- return true;
-
- if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
- cnt++;
- } else {
- cnt = 1;
- __this_cpu_write(cmci_time_stamp, now);
- }
- __this_cpu_write(cmci_storm_cnt, cnt);
-
- if (cnt <= CMCI_STORM_THRESHOLD)
- return false;
-
- cmci_toggle_interrupt_mode(false);
- __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
- r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_STORM_INTERVAL);
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
-
- if (r == 1)
- pr_notice("CMCI storm detected: switching to poll mode\n");
- return true;
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -253,9 +111,6 @@ static bool cmci_storm_detect(void)
*/
static void intel_threshold_interrupt(void)
{
- if (cmci_storm_detect())
- return;
-
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

--
2.40.1


2023-06-23 15:10:01

by Borislav Petkov

Subject: Re: [PATCH v6 3/4] x86/mce: Handle AMD threshold interrupt storms

On Fri, Jun 16, 2023 at 11:27:43AM -0700, Tony Luck wrote:
> +static void _reset_block(struct threshold_block *block)
> +{
> + struct thresh_restart tr;
> +
> + memset(&tr, 0, sizeof(tr));
> + tr.b = block;
> + threshold_restart_bank(&tr);
> +}

> +
> +static void toggle_interrupt_reset_block(struct threshold_block *block, bool on)
> +{
> + if (!block)
> + return;
> +
> + block->interrupt_enable = !!on;
> + _reset_block(block);
> +}
> +
> +void mce_amd_handle_storm(int bank, bool on)
> +{
> + struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
> + struct threshold_bank **bp = this_cpu_read(threshold_banks);
> + unsigned long flags;
> +
> + if (!bp)
> + return;
> +
> + local_irq_save(flags);
> +
> + first_block = bp[bank]->blocks;
> + if (!first_block)
> + goto end;
> +
> + toggle_interrupt_reset_block(first_block, on);
> +
> + list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
> + toggle_interrupt_reset_block(block, on);
> +end:
> + local_irq_restore(flags);
> +}

There's already other code which does this threshold block control. Pls
refactor and unify it instead of adding almost redundant similar functions.

> static void mce_threshold_block_init(struct threshold_block *b, int offset)
> {
> struct thresh_restart tr = {
> @@ -868,6 +909,7 @@ static void amd_threshold_interrupt(void)
> struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
> struct threshold_bank **bp = this_cpu_read(threshold_banks);
> unsigned int bank, cpu = smp_processor_id();
> + u64 status;
>
> /*
> * Validate that the threshold bank has been initialized already. The
> @@ -881,6 +923,13 @@ static void amd_threshold_interrupt(void)
> if (!(per_cpu(bank_map, cpu) & BIT_ULL(bank)))
> continue;
>
> + rdmsrl(mca_msr_reg(bank, MCA_STATUS), status);
> + track_cmci_storm(bank, status);

So this is called from interrupt context.

There's another track_cmci_storm() from machine_check_poll() which can
happen in process context.

And there's no sync (locking) between the two. Not good.

Why are even two calls needed on AMD?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-06-23 16:11:04

by Yazen Ghannam

Subject: Re: [PATCH v6 3/4] x86/mce: Handle AMD threshold interrupt storms

On 6/23/2023 10:45 AM, Borislav Petkov wrote:
> On Fri, Jun 16, 2023 at 11:27:43AM -0700, Tony Luck wrote:
>> +static void _reset_block(struct threshold_block *block)
>> +{
>> + struct thresh_restart tr;
>> +
>> + memset(&tr, 0, sizeof(tr));
>> + tr.b = block;
>> + threshold_restart_bank(&tr);
>> +}
>
>> +
>> +static void toggle_interrupt_reset_block(struct threshold_block *block, bool on)
>> +{
>> + if (!block)
>> + return;
>> +
>> + block->interrupt_enable = !!on;
>> + _reset_block(block);
>> +}
>> +
>> +void mce_amd_handle_storm(int bank, bool on)
>> +{
>> + struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
>> + struct threshold_bank **bp = this_cpu_read(threshold_banks);
>> + unsigned long flags;
>> +
>> + if (!bp)
>> + return;
>> +
>> + local_irq_save(flags);
>> +
>> + first_block = bp[bank]->blocks;
>> + if (!first_block)
>> + goto end;
>> +
>> + toggle_interrupt_reset_block(first_block, on);
>> +
>> + list_for_each_entry_safe(block, tmp, &first_block->miscj, miscj)
>> + toggle_interrupt_reset_block(block, on);
>> +end:
>> + local_irq_restore(flags);
>> +}
>
> There's already other code which does this threshold block control. Pls
> refactor and unify it instead of adding almost redundant similar functions.
>

Okay, will do.

>> static void mce_threshold_block_init(struct threshold_block *b, int offset)
>> {
>> struct thresh_restart tr = {
>> @@ -868,6 +909,7 @@ static void amd_threshold_interrupt(void)
>> struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
>> struct threshold_bank **bp = this_cpu_read(threshold_banks);
>> unsigned int bank, cpu = smp_processor_id();
>> + u64 status;
>>
>> /*
>> * Validate that the threshold bank has been initialized already. The
>> @@ -881,6 +923,13 @@ static void amd_threshold_interrupt(void)
>> if (!(per_cpu(bank_map, cpu) & BIT_ULL(bank)))
>> continue;
>>
>> + rdmsrl(mca_msr_reg(bank, MCA_STATUS), status);
>> + track_cmci_storm(bank, status);
>
> So this is called from interrupt context.
>
> There's another track_cmci_storm() from machine_check_poll() which can
> happen in process context.
>
> And there's no sync (locking) between the two. Not good.
>
> Why are even two calls needed on AMD?
>

I think because the AMD interrupt handlers don't call
machine_check_poll(). This is a good opportunity to unify the AMD
thresholding and deferred error interrupt handlers with
machine_check_poll().

Tony,
Please leave out this AMD patch for now. I'll work on refactoring it.

Thanks,
Yazen


2023-07-18 21:26:26

by Luck, Tony

Subject: [PATCH v7 0/3] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. On Intel, when a storm is detected from a bank it is mitigated
by setting a very high threshold for corrected errors to signal CMCI.
This threshold does not affect signaling CMCI for uncorrected errors.

Changes since last version:

0) Rebased to v6.5-rc2
1) Yazen & Boris - dropped AMD patch pending integration of AMD
machine check bank scanning with the core machine_check_poll()
function.
2) Boris - rename track_cmci_storm() as track_storm() in prep for
the day when AMD joins in - they don't call the interrupt "CMCI".
This function is now "static" and local to core.c.
3) Boris - Define new "struct storm_bank" for all the storm tracking
arrays.
4) Move the storm_poll_mode per-CPU tracker into the storm_desc
structure.
5) Define STORM_END_POLL_THRESHOLD as "29" instead of "30" with comment
that it is used as high end of a bitmask that counts from zero. Drop
the " - 1" where it is used.
6) Don't use kernel-doc format comments in mce/internal.h.
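
Items 3 and 5 can be illustrated with a simplified user-space sketch.
The struct name and the "history" field come from this series;
STORM_BEGIN_THRESHOLD = 5, the GENMASK_ULL() stand-in and the function
track_storm_sketch() are assumptions made for this example. It also
shifts history by exactly one bit per call, which ignores any
time-based adjustment the real code makes outside storm mode:

```c
#include <stdbool.h>
#include <stdint.h>

/* High bit of the "recent polls" mask; bits 29..0 cover 30 polls. */
#define STORM_END_POLL_THRESHOLD	29

/* Assumed value for this sketch; the real one lives in patch 2/3. */
#define STORM_BEGIN_THRESHOLD		5

/* User-space stand-in for the kernel's GENMASK_ULL(h, l). */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

struct storm_bank {
	uint64_t history;	/* bit 0 = most recent poll, 1 = error seen */
	bool in_storm_mode;
};

/* Portable population count (Kernighan's method). */
static unsigned int count_bits(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Record one poll result and update the storm state of this bank. */
static void track_storm_sketch(struct storm_bank *sb, bool error_seen)
{
	sb->history = (sb->history << 1) | (error_seen ? 1 : 0);

	if (sb->in_storm_mode) {
		/* Storm ends once the last 30 polls were all clean. */
		if (!(sb->history & GENMASK_ULL(STORM_END_POLL_THRESHOLD, 0))) {
			sb->history = 0;
			sb->in_storm_mode = false;
		}
	} else if (count_bits(sb->history) >= STORM_BEGIN_THRESHOLD) {
		sb->in_storm_mode = true;
	}
}
```

Because the mask counts from bit zero, a high bit of 29 covers 30 poll
results, which is why the constant is 29 and no " - 1" is needed at the
point of use.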

Suggested change NOT taken:
> + * If this is the first bank on this CPU to enter storm mode
> + * start polling
> + */
> + if (++storm->stormy_bank_count == 1)

if (++storm->stormy_bank_count)

> + mce_timer_kick(true);

As the comment above this code says, the timer should only be "kicked"
when the first bank on a CPU goes into storm mode. If another bank also
goes into storm mode while the first storm is active, there is no need
to "start polling": that is already happening for the first storm.
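
A small sketch of the difference between the two conditions. The
struct and both function names are hypothetical; timer_kicks stands in
for calls to mce_timer_kick(true):

```c
/* Hypothetical stand-in for the per-CPU descriptor; only the counter matters. */
struct storm_desc_sketch {
	unsigned int stormy_bank_count;	/* banks on this CPU in storm mode */
	unsigned int timer_kicks;	/* times polling was (re)started */
};

/* Variant kept in the series: kick only on the 0 -> 1 transition. */
static void storm_begin_kept(struct storm_desc_sketch *s)
{
	if (++s->stormy_bank_count == 1)
		s->timer_kicks++;
}

/* Suggested variant: true for every non-zero count, so every bank kicks. */
static void storm_begin_suggested(struct storm_desc_sketch *s)
{
	if (++s->stormy_bank_count)
		s->timer_kicks++;
}
```

With three banks entering storm mode on one CPU, the first variant
kicks the timer once and the second kicks it three times.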

Tony Luck (3):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation
x86/mce: Handle Intel threshold interrupt storms

arch/x86/kernel/cpu/mce/internal.h | 49 ++++-
arch/x86/kernel/cpu/mce/core.c | 131 +++++++++---
arch/x86/kernel/cpu/mce/intel.c | 333 +++++++++++++----------------
3 files changed, 290 insertions(+), 223 deletions(-)


base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
--
2.40.1


2023-07-18 21:37:12

by Luck, Tony

Subject: [PATCH v7 3/3] x86/mce: Handle Intel threshold interrupt storms

Add an Intel specific hook into machine_check_poll() to keep track
of per-CPU, per-bank corrected error logs (with a stub for the
CONFIG_X86_MCE_INTEL=n case).

When a storm is observed, the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second.

When a storm ends, reset the threshold in IA32_MCi_CTL2 back to 1,
remove the bank from the bitmap for polling, and change the polling
rate back to the default.
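
The register update described above can be sketched as a pure function
on a CTL2 value. This is an illustration only: ctl2_with_threshold() is
a hypothetical name, and it models the read/modify/write that the patch
performs with rdmsrl()/wrmsrl() under cmci_discover_lock:

```c
#include <stdint.h>

/* Field layout of IA32_MCi_CTL2 as used by the kernel's mce code. */
#define MCI_CTL2_CMCI_THRESHOLD_MASK	0x7fffULL
#define MCI_CTL2_CMCI_EN		(1ULL << 30)
#define CMCI_STORM_THRESHOLD		32749

/*
 * Replace only the threshold field. All other bits, including
 * MCI_CTL2_CMCI_EN, are carried over unchanged.
 */
static uint64_t ctl2_with_threshold(uint64_t ctl2, unsigned int thresh)
{
	ctl2 &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
	return ctl2 | (thresh & MCI_CTL2_CMCI_THRESHOLD_MASK);
}
```

Entering a storm corresponds to passing CMCI_STORM_THRESHOLD; leaving
it corresponds to passing the saved non-storm threshold for the bank.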

If a CPU with banks in storm mode is taken offline, the new CPU
that inherits ownership of those banks takes over management of
storm(s) in the inherited bank(s).

The cmci_discover() function was already very large. These changes
pushed it well over the top. Refactor with three helper functions
to bring it back under control.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 2 +
arch/x86/kernel/cpu/mce/core.c | 3 +
arch/x86/kernel/cpu/mce/intel.c | 202 +++++++++++++++++++++--------
3 files changed, 156 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index da790d13d010..e641c991beb1 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,12 +41,14 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_handle_storm(int bank, bool on);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
#else
+static inline void mce_intel_handle_storm(int bank, bool on) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6a44e15d74fe..0a287998e62f 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2054,6 +2054,9 @@ static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
void mce_handle_storm(int bank, bool on)
{
switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_handle_storm(bank, on);
+ break;
}
}

diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 052bf2708391..55643c5944e1 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -47,8 +47,27 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
*/
static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

+/* Linux non-storm CMCI threshold (may be overridden by BIOS) */
#define CMCI_THRESHOLD 1

+/*
+ * MCi_CTL2 threshold for each bank when there is no storm.
+ * Default value for each bank may have been set by BIOS.
+ */
+static int cmci_threshold[MAX_NR_BANKS];
+
+/*
+ * High threshold to limit CMCI rate during storms. Max supported is
+ * 0x7FFF. Use this slightly smaller value so it has a distinctive
+ * signature when someone asks "Why am I not seeing all corrected errors?"
+ * A high threshold is used instead of just disabling CMCI for a
+ * bank because both corrected and uncorrected errors may be logged
+ * in the same bank and signalled with CMCI. The threshold only applies
+ * to corrected errors, so keeping CMCI enabled means that uncorrected
+ * errors will still be processed in a timely fashion.
+ */
+#define CMCI_STORM_THRESHOLD 32749
+
static int cmci_supported(int *banks)
{
u64 cap;
@@ -103,6 +122,31 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

+/*
+ * Set a new CMCI threshold value. Preserve the state of the
+ * MCI_CTL2_CMCI_EN bit in case this happens during a
+ * cmci_rediscover() operation.
+ */
+static void cmci_set_threshold(int bank, int thresh)
+{
+ unsigned long flags;
+ u64 val;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val | thresh);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+void mce_intel_handle_storm(int bank, bool on)
+{
+ if (on)
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ else
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+}
+
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -114,72 +158,126 @@ static void intel_threshold_interrupt(void)
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

+/*
+ * Check all the reasons why current CPU cannot claim
+ * ownership of a bank.
+ * 1: CPU already owns this bank
+ * 2: BIOS owns this bank
+ * 3: Some other CPU owns this bank
+ */
+static bool cmci_skip_bank(int bank, u64 *val)
+{
+ unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
+
+ if (test_bit(bank, owned))
+ return true;
+
+ /* Skip banks in firmware first mode */
+ if (test_bit(bank, mce_banks_ce_disabled))
+ return true;
+
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), *val);
+
+ /* Already owned by someone else? */
+ if (*val & MCI_CTL2_CMCI_EN) {
+ clear_bit(bank, owned);
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Decide which CMCI interrupt threshold to use:
+ * 1: If this bank is in storm mode from whichever CPU was
+ * the previous owner, stay in storm mode.
+ * 2: If ignoring any threshold set by BIOS, set Linux default
+ * 3: Try to honor BIOS threshold (unless buggy BIOS set it at zero).
+ */
+static u64 cmci_pick_threshold(u64 val, int *bios_zero_thresh)
+{
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ return val;
+
+ if (!mca_cfg.bios_cmci_threshold) {
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ val |= CMCI_THRESHOLD;
+ } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
+ /*
+ * If bios_cmci_threshold boot option was specified
+ * but the threshold is zero, we'll try to initialize
+ * it to 1.
+ */
+ *bios_zero_thresh = 1;
+ val |= CMCI_THRESHOLD;
+ }
+
+ return val;
+}
+
+/*
+ * Try to claim ownership of a bank.
+ */
+static void cmci_claim_bank(int bank, u64 val, int bios_zero_thresh, int *bios_wrong_thresh)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ val |= MCI_CTL2_CMCI_EN;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+
+ /* Did the enable bit stick? -- the bank supports CMCI */
+ if (val & MCI_CTL2_CMCI_EN) {
+ set_bit(bank, (void *)this_cpu_ptr(&mce_banks_owned));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
+ pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), bank);
+ storm->banks[bank].history = ~0ull;
+ storm->banks[bank].timestamp = jiffies;
+ cmci_storm_begin(bank);
+ } else {
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ }
+ /*
+ * We are able to set thresholds for some banks that
+ * had a threshold of 0. This means the BIOS has not
+ * set the thresholds properly or does not work with
+ * this boot option. Note down now and report later.
+ */
+ if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
+ (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
+ *bios_wrong_thresh = 1;
+
+ /* Save default threshold for each bank */
+ if (cmci_threshold[bank] == 0)
+ cmci_threshold[bank] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
+ } else {
+ WARN_ON(!test_bit(bank, this_cpu_ptr(mce_poll_banks)));
+ }
+}
+
/*
* Enable CMCI (Corrected Machine Check Interrupt) for available MCE banks
* on this CPU. Use the algorithm recommended in the SDM to discover shared
- * banks.
+ * banks. Called during initial bootstrap, and also for hotplug CPU operations
+ * to rediscover/reassign machine check banks.
*/
static void cmci_discover(int banks)
{
- unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
- unsigned long flags;
- int i;
int bios_wrong_thresh = 0;
+ unsigned long flags;
+ int i;

raw_spin_lock_irqsave(&cmci_discover_lock, flags);
for (i = 0; i < banks; i++) {
u64 val;
int bios_zero_thresh = 0;

- if (test_bit(i, owned))
+ if (cmci_skip_bank(i, &val))
continue;

- /* Skip banks in firmware first mode */
- if (test_bit(i, mce_banks_ce_disabled))
- continue;
-
- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Already owned by someone else? */
- if (val & MCI_CTL2_CMCI_EN) {
- clear_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
- continue;
- }
-
- if (!mca_cfg.bios_cmci_threshold) {
- val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
- val |= CMCI_THRESHOLD;
- } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
- /*
- * If bios_cmci_threshold boot option was specified
- * but the threshold is zero, we'll try to initialize
- * it to 1.
- */
- bios_zero_thresh = 1;
- val |= CMCI_THRESHOLD;
- }
-
- val |= MCI_CTL2_CMCI_EN;
- wrmsrl(MSR_IA32_MCx_CTL2(i), val);
- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Did the enable bit stick? -- the bank supports CMCI */
- if (val & MCI_CTL2_CMCI_EN) {
- set_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
- /*
- * We are able to set thresholds for some banks that
- * had a threshold of 0. This means the BIOS has not
- * set the thresholds properly or does not work with
- * this boot option. Note down now and report later.
- */
- if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
- (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
- bios_wrong_thresh = 1;
- } else {
- WARN_ON(!test_bit(i, this_cpu_ptr(mce_poll_banks)));
- }
+ val = cmci_pick_threshold(val, &bios_zero_thresh);
+ cmci_claim_bank(i, val, bios_zero_thresh, &bios_wrong_thresh);
}
raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
if (mca_cfg.bios_cmci_threshold && bios_wrong_thresh) {
@@ -218,6 +316,8 @@ static void __cmci_disable_bank(int bank)
val &= ~MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
__clear_bit(bank, this_cpu_ptr(mce_banks_owned));
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ cmci_storm_end(bank);
}

/*
--
2.40.1


2023-09-20 02:22:33

by Yazen Ghannam

Subject: Re: [PATCH v7 3/3] x86/mce: Handle Intel threshold interrupt storms

On 7/18/23 5:08 PM, Tony Luck wrote:
> Add an Intel specific hook into machine_check_poll() to keep track
> of per-CPU, per-bank corrected error logs (with a stub for the
> CONFIG_MCE_INTEL=n case).
>
> When a storm is observed the Rate of interrupts is reduced by setting

Rate -> rate

> a large threshold value for this bank in IA32_MCi_CTL2. This bank is
> added to the bitmap of banks for this CPU to poll. The polling rate
> is increased to once per second.
>
> When a storm ends reset the

Spurious newline?

> threshold in IA32_MCi_CTL2 back to 1, removes the bank from the bitmap

removes -> remove

> for polling, and changes the polling rate back to the default.

changes -> change

>
> If a CPU with banks in storm mode is taken offline, the new CPU
> that inherits ownership of those banks takes over management of
> storm(s) in the inherited bank(s).
>
> The cmci_discover() function was already very large. These changes
> pushed it well over the top. Refactor with three helper functions
> to braing it back under control.

braing -> bring

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/internal.h | 2 +
> arch/x86/kernel/cpu/mce/core.c | 3 +
> arch/x86/kernel/cpu/mce/intel.c | 202 +++++++++++++++++++++--------
> 3 files changed, 156 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
> index da790d13d010..e641c991beb1 100644
> --- a/arch/x86/kernel/cpu/mce/internal.h
> +++ b/arch/x86/kernel/cpu/mce/internal.h
> @@ -41,12 +41,14 @@ struct dentry *mce_get_debugfs_dir(void);
> extern mce_banks_t mce_banks_ce_disabled;
>
> #ifdef CONFIG_X86_MCE_INTEL
> +void mce_intel_handle_storm(int bank, bool on);
> void cmci_disable_bank(int bank);
> void intel_init_cmci(void);
> void intel_init_lmce(void);
> void intel_clear_lmce(void);
> bool intel_filter_mce(struct mce *m);
> #else
> +static inline void mce_intel_handle_storm(int bank, bool on) { }
> static inline void cmci_disable_bank(int bank) { }
> static inline void intel_init_cmci(void) { }
> static inline void intel_init_lmce(void) { }
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 6a44e15d74fe..0a287998e62f 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2054,6 +2054,9 @@ static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
> void mce_handle_storm(int bank, bool on)
> {
> switch (boot_cpu_data.x86_vendor) {
> + case X86_VENDOR_INTEL:
> + mce_intel_handle_storm(bank, on);
> + break;
> }
> }
>
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index 052bf2708391..55643c5944e1 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -47,8 +47,27 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
> */
> static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
>
> +/* Linux non-storm CMCI threshold (may be overridden by BIOS) */
> #define CMCI_THRESHOLD 1

Just curious, but why use '1' for the default? We have a lot of code to
hide corrected errors. So why not just use the maximum limit? This would
effectively hide the corrected errors. And if not the maximum, maybe some
other intermediate value?

>
> +/*
> + * MCi_CTL2 threshold for each bank when there is no storm.
> + * Default value for each bank may have been set by BIOS.
> + */
> +static int cmci_threshold[MAX_NR_BANKS];

Can this be a 'u16', since the max threshold for Intel is 0x7FFF?

> +
> +/*
> + * High threshold to limit CMCI rate during storms. Max supported is
> + * 0x7FFF. Use this slightly smaller value so it has a distinctive
> + * signature when some asks "Why am I not seeing all corrected errors?"

Maybe this answers my question above.

> + * A high threshold is used instead of just disabling CMCI for a
> + * bank because both corrected and uncorrected errors may be logged
> + * in the same bank and signalled with CMCI. The threshold only applies
> + * to corrected errors, so keeping CMCI enabled means that uncorrected
> + * errors will still be processed in a timely fashion.
> + */
> +#define CMCI_STORM_THRESHOLD 32749
> +
> static int cmci_supported(int *banks)
> {
> u64 cap;
> @@ -103,6 +122,31 @@ static bool lmce_supported(void)
> return tmp & FEAT_CTL_LMCE_ENABLED;
> }
>
> +/*
> + * Set a new CMCI threshold value. Preserve the state of the
> + * MCI_CTL2_CMCI_EN bit in case this happens during a
> + * cmci_rediscover() operation.
> + */
> +static void cmci_set_threshold(int bank, int thresh)
> +{
> + unsigned long flags;
> + u64 val;
> +
> + raw_spin_lock_irqsave(&cmci_discover_lock, flags);
> + rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
> + val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
> + wrmsrl(MSR_IA32_MCx_CTL2(bank), val | thresh);
> + raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
> +}
> +
> +void mce_intel_handle_storm(int bank, bool on)
> +{
> + if (on)
> + cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
> + else
> + cmci_set_threshold(bank, cmci_threshold[bank]);
> +}
> +
> /*
> * The interrupt handler. This is called on every event.
> * Just call the poller directly to log any events.
> @@ -114,72 +158,126 @@ static void intel_threshold_interrupt(void)
> machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
> }
>
> +/*
> + * Check all the reasons why current CPU cannot claim
> + * ownership of a bank.
> + * 1: CPU already owns this bank
> + * 2: BIOS owns this bank
> + * 3: Some other CPU owns this bank
> + */
> +static bool cmci_skip_bank(int bank, u64 *val)
> +{
> + unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
> +
> + if (test_bit(bank, owned))
> + return true;
> +
> + /* Skip banks in firmware first mode */
> + if (test_bit(bank, mce_banks_ce_disabled))
> + return true;
> +
> + rdmsrl(MSR_IA32_MCx_CTL2(bank), *val);
> +
> + /* Already owned by someone else? */
> + if (*val & MCI_CTL2_CMCI_EN) {
> + clear_bit(bank, owned);
> + __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Decide which CMCI interrupt threshold to use:
> + * 1: If this bank is in storm mode from whichever CPU was
> + * the previous owner, stay in storm mode.
> + * 2: If ignoring any threshold set by BIOS, set Linux default
> + * 3: Try to honor BIOS threshold (unless buggy BIOS set it at zero).
> + */
> +static u64 cmci_pick_threshold(u64 val, int *bios_zero_thresh)
> +{
> + if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
> + return val;
> +
> + if (!mca_cfg.bios_cmci_threshold) {

Are there many users of this option? Maybe this is something we should
also include in the AMD threshold code. But I don't think anyone has
asked me about it yet.

> + val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
> + val |= CMCI_THRESHOLD;
> + } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
> + /*
> + * If bios_cmci_threshold boot option was specified
> + * but the threshold is zero, we'll try to initialize
> + * it to 1.
> + */
> + *bios_zero_thresh = 1;
> + val |= CMCI_THRESHOLD;
> + }
> +
> + return val;
> +}
> +
> +/*
> + * Try to claim ownership of a bank.
> + */
> +static void cmci_claim_bank(int bank, u64 val, int bios_zero_thresh, int *bios_wrong_thresh)
> +{
> + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> +
> + val |= MCI_CTL2_CMCI_EN;
> + wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
> + rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
> +
> + /* Did the enable bit stick? -- the bank supports CMCI */
> + if (val & MCI_CTL2_CMCI_EN) {
> + set_bit(bank, (void *)this_cpu_ptr(&mce_banks_owned));

Newline here, please.

> + if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
> + pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), bank);
> + storm->banks[bank].history = ~0ull;
> + storm->banks[bank].timestamp = jiffies;
> + cmci_storm_begin(bank);
> + } else {
> + __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
> + }

Newline here, please.

> + /*
> + * We are able to set thresholds for some banks that
> + * had a threshold of 0. This means the BIOS has not
> + * set the thresholds properly or does not work with
> + * this boot option. Note down now and report later.
> + */
> + if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
> + (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
> + *bios_wrong_thresh = 1;
> +
> + /* Save default threshold for each bank */
> + if (cmci_threshold[bank] == 0)
> + cmci_threshold[bank] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
> + } else {
> + WARN_ON(!test_bit(bank, this_cpu_ptr(mce_poll_banks)));

Could you invert the "MCI_CTL2_CMCI_EN" check and WARN/return early?
This could save an indentation level.
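
To make the suggestion concrete, here is a small user-space sketch of the early-return shape. It is only a model: claim_bank(), its parameters, and the CTL2_CMCI_EN bit position are assumptions for illustration, and the "warned" flag stands in for the WARN_ON() in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CTL2_CMCI_EN (1ULL << 30)	/* assumed position of the enable bit */

/*
 * Hypothetical model of the claim step. 'readback' is the MCi_CTL2 value
 * read after writing the enable bit; 'in_poll_set' says whether the bank
 * is already in this CPU's poll bitmap. Returns true if the bank is owned.
 */
static bool claim_bank(uint64_t readback, bool in_poll_set, bool *warned)
{
	assert(warned);
	*warned = false;

	/* Enable bit did not stick: warn if the bank is not polled, then leave. */
	if (!(readback & CTL2_CMCI_EN)) {
		*warned = !in_poll_set;
		return false;
	}

	/* The ownership path now sits one indentation level shallower. */
	return true;
}
```

With this shape the storm-inheritance and threshold bookkeeping would follow the early return without the enclosing if/else.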

> + }
> +}
> +
> /*
> * Enable CMCI (Corrected Machine Check Interrupt) for available MCE banks
> * on this CPU. Use the algorithm recommended in the SDM to discover shared
> - * banks.
> + * banks. Called during initial bootstrap, and also for hotplug CPU operations
> + * to rediscover/reassign machine check banks.
> */
> static void cmci_discover(int banks)
> {
> - unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
> - unsigned long flags;
> - int i;
> int bios_wrong_thresh = 0;
> + unsigned long flags;
> + int i;
>
> raw_spin_lock_irqsave(&cmci_discover_lock, flags);
> for (i = 0; i < banks; i++) {
> u64 val;
> int bios_zero_thresh = 0;
>
> - if (test_bit(i, owned))
> + if (cmci_skip_bank(i, &val))
> continue;
>
> - /* Skip banks in firmware first mode */
> - if (test_bit(i, mce_banks_ce_disabled))
> - continue;
> -
> - rdmsrl(MSR_IA32_MCx_CTL2(i), val);
> -
> - /* Already owned by someone else? */
> - if (val & MCI_CTL2_CMCI_EN) {
> - clear_bit(i, owned);
> - __clear_bit(i, this_cpu_ptr(mce_poll_banks));
> - continue;
> - }
> -
> - if (!mca_cfg.bios_cmci_threshold) {
> - val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
> - val |= CMCI_THRESHOLD;
> - } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
> - /*
> - * If bios_cmci_threshold boot option was specified
> - * but the threshold is zero, we'll try to initialize
> - * it to 1.
> - */
> - bios_zero_thresh = 1;
> - val |= CMCI_THRESHOLD;
> - }
> -
> - val |= MCI_CTL2_CMCI_EN;
> - wrmsrl(MSR_IA32_MCx_CTL2(i), val);
> - rdmsrl(MSR_IA32_MCx_CTL2(i), val);
> -
> - /* Did the enable bit stick? -- the bank supports CMCI */
> - if (val & MCI_CTL2_CMCI_EN) {
> - set_bit(i, owned);
> - __clear_bit(i, this_cpu_ptr(mce_poll_banks));
> - /*
> - * We are able to set thresholds for some banks that
> - * had a threshold of 0. This means the BIOS has not
> - * set the thresholds properly or does not work with
> - * this boot option. Note down now and report later.
> - */
> - if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
> - (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
> - bios_wrong_thresh = 1;
> - } else {
> - WARN_ON(!test_bit(i, this_cpu_ptr(mce_poll_banks)));
> - }
> + val = cmci_pick_threshold(val, &bios_zero_thresh);
> + cmci_claim_bank(i, val, bios_zero_thresh, &bios_wrong_thresh);
> }
> raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
> if (mca_cfg.bios_cmci_threshold && bios_wrong_thresh) {
> @@ -218,6 +316,8 @@ static void __cmci_disable_bank(int bank)
> val &= ~MCI_CTL2_CMCI_EN;
> wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
> __clear_bit(bank, this_cpu_ptr(mce_banks_owned));

Newline here, please.

> + if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
> + cmci_storm_end(bank);
> }
>
> /*

Thanks,
Yazen

2023-09-30 03:50:56

by Luck, Tony

Subject: [PATCH v8 0/3] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank, it is mitigated on Intel
by setting a very high threshold for corrected errors to signal CMCI.
This threshold does not affect signaling CMCI for uncorrected errors.
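
As a rough illustration of the Intel mitigation described above, the following user-space sketch replaces only the threshold field of an MCi_CTL2 value and leaves the enable bit alone. The macro names, the helper ctl2_with_threshold(), and the bit positions (enable at bit 30, 15-bit threshold field) are assumptions made for the example; the value 32749 and the 0x7FFF maximum are the ones quoted in this series.

```c
#include <assert.h>
#include <stdint.h>

#define CTL2_CMCI_EN        (1ULL << 30)	/* assumed: CMCI enable bit */
#define CTL2_THRESHOLD_MASK 0x7fffULL		/* assumed: 15-bit threshold field */
#define STORM_THRESHOLD     32749ULL		/* storm value used by this series */

/* Replace the threshold field and keep every other bit, including EN. */
static uint64_t ctl2_with_threshold(uint64_t ctl2, uint64_t thresh)
{
	assert(thresh <= CTL2_THRESHOLD_MASK);
	return (ctl2 & ~CTL2_THRESHOLD_MASK) | thresh;
}
```

Entering a storm would correspond to ctl2_with_threshold(val, STORM_THRESHOLD), and leaving it to calling the same helper with the saved per-bank threshold.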

Signed-off-by: Tony Luck <[email protected]>

---

Changes since v7:

Applied all the suggestions from Yazen's review of v7

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/all/[email protected]/

Including placing most of the storm tracking code into threshold.c
instead of bloating core.c.

Tony Luck (3):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation
x86/mce: Handle Intel threshold interrupt storms

arch/x86/kernel/cpu/mce/internal.h | 47 +++-
arch/x86/kernel/cpu/mce/core.c | 45 ++--
arch/x86/kernel/cpu/mce/intel.c | 338 ++++++++++++----------------
arch/x86/kernel/cpu/mce/threshold.c | 86 +++++++
4 files changed, 293 insertions(+), 223 deletions(-)


base-commit: 6465e260f48790807eef06b583b38ca9789b6072
--
2.41.0

2023-10-02 20:41:02

by Luck, Tony

Subject: RE: [PATCH v8 0/3] Handle corrected machine check interrupt storms

> Including placing most of the storm tracking code into threshold.c
> instead of bloating core.c.

The lkp test robot complains on a randconfig build with:

# CONFIG_X86_MCE_INTEL is not set
# CONFIG_X86_MCE_AMD is not set

about some undefined symbols.

>> core.c:(.text+0x1130): undefined reference to `storm_desc'
>> core.c:(.text+0x1634): undefined reference to `mce_track_storm'

A simple fix would be to move the definition of storm_desc into core.c
and provide a stub in internal.h:

static inline void mce_track_storm(struct mce *mce) { }

for the case where neither INTEL nor AMD is configured.
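
A minimal user-space sketch of that stub pattern follows. DEMO_HAVE_THRESHOLD is a hypothetical stand-in for the real Kconfig symbol, and struct mce, tracked_polls and poll_bank() are simplified placeholders invented for the example. With the symbol undefined, the caller still compiles and links, and the call does nothing.

```c
#include <assert.h>

struct mce { int bank; };	/* hypothetical stand-in for the kernel struct */

static int tracked_polls;	/* illustration only: counts real tracker calls */

#ifdef DEMO_HAVE_THRESHOLD	/* hypothetical stand-in for the Kconfig symbol */
static inline void mce_track_storm(struct mce *mce)
{
	assert(mce);
	tracked_polls++;
}
#else
/* Stub: callers build unchanged and the call compiles away to nothing. */
static inline void mce_track_storm(struct mce *mce) { (void)mce; }
#endif

/* Caller that does not need to know which variant was selected. */
static int poll_bank(struct mce *m)
{
	assert(m);
	mce_track_storm(m);
	return tracked_polls;
}
```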

-Tony



2023-10-04 18:38:49

by Luck, Tony

Subject: [PATCH v9 0/3] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank, it is mitigated on Intel
by setting a very high threshold for corrected errors to signal CMCI.
This threshold does not affect signaling CMCI for uncorrected errors.

Signed-off-by: Tony Luck <[email protected]>

---

Changes since v8:

Fixed issue reported by lkp with randconfig build with neither
CONFIG_X86_MCE_INTEL nor CONFIG_X86_MCE_AMD set by making a
cleaner division between the storm tracking code in threshold.c
and the rest of the code, using more function accessors that can
be stubbed out.

Tony Luck (3):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation
x86/mce: Handle Intel threshold interrupt storms

arch/x86/kernel/cpu/mce/internal.h | 48 ++++-
arch/x86/kernel/cpu/mce/core.c | 45 ++---
arch/x86/kernel/cpu/mce/intel.c | 303 ++++++++++++----------------
arch/x86/kernel/cpu/mce/threshold.c | 115 +++++++++++
4 files changed, 304 insertions(+), 207 deletions(-)


base-commit: 6465e260f48790807eef06b583b38ca9789b6072
--
2.41.0

2023-11-15 19:55:09

by Luck, Tony

Subject: [PATCH v10 0/3] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank, it is mitigated on Intel
by setting a very high threshold for corrected errors to signal CMCI.
This threshold does not affect signaling CMCI for uncorrected errors.

Signed-off-by: Tony Luck <[email protected]>

---
Changes since v9 (based on Boris reviews)

#1 Better commit comment on flow. Added detail that both timer poll
and CMCI feed results of scanning each bank into the history
calculation. Also added comment in code where mce_track_storm()
is called.
#2 Set a flag for banks that don't support CMCI so they can be
excluded from history processing
#3 Skip history processing if CMCI globally disabled with boot
argument mce=cmci_disable
#4 Move struct mca_storm_desc definition to internal.h (I had argued
against the need for this, but the new "poll_mode" flag added in
change #2 needs to be set in intel.c).
#5 Add #define NUM_HISTORY_BITS instead of hard-coded "64".
#6 Rebase to v6.7-rc1


Tony Luck (3):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation
x86/mce: Handle Intel threshold interrupt storms

arch/x86/kernel/cpu/mce/internal.h | 66 +++++-
arch/x86/kernel/cpu/mce/core.c | 53 +++--
arch/x86/kernel/cpu/mce/intel.c | 304 ++++++++++++----------------
arch/x86/kernel/cpu/mce/threshold.c | 115 +++++++++++
4 files changed, 332 insertions(+), 206 deletions(-)


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
--
2.41.0

2023-11-15 19:56:47

by Luck, Tony

Subject: [PATCH v10 2/3] x86/mce: Add per-bank CMCI storm mitigation

This is the core functionality to track CMCI storms at the
machine check bank granularity. Subsequent patches will add
the vendor specific hooks to supply input to the storm
detection and take actions on the start/end of a storm.

machine_check_poll() is called both by the CMCI interrupt code,
and for periodic polls from a timer. Add a hook in this routine
to maintain a bitmap history for each bank showing whether the bank
logged a corrected error or not each time it is polled.

In normal operation the interval between polls of this bank
determines how far to shift the history. The 64 bit width corresponds
to about one second.

When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm. The bank
is added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second. During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the history
covers just over a minute.

Declare a storm for that bank if the number of corrected interrupts
seen in that history is above some threshold (defined as 5 in this
series, could be tuned later if there is data to suggest a better
value).

A storm on a bank ends if enough consecutive polls of the bank show
no corrected errors (defined as 30, may also change). That calls the
CPU vendor specific function to revert to normal operational mode,
and changes the polling rate back to the default.
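
The begin/end rules above can be modelled in user space roughly as follows. This is a simplified sketch, not the kernel code: struct bank_model, track_poll() and popcount64() are names made up for the example, the caller supplies the shift directly instead of deriving it from jiffies, and the thresholds (5 errors to begin, 30 clean polls to end) are the values quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_HISTORY_BITS      64
#define STORM_BEGIN_THRESHOLD 5
#define STORM_END_MASK        ((1ULL << 30) - 1)	/* bits [0 ... 29] */

struct bank_model {
	uint64_t history;	/* one bit per poll, 1 = corrected error seen */
	bool in_storm;
};

/* Portable bit count, standing in for the kernel's hweight64(). */
static unsigned int popcount64(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Record one poll result. 'shift' is how many history slots elapsed since
 * the previous poll (always 1 while in storm mode). Returns true when the
 * bank's storm state changed.
 */
static bool track_poll(struct bank_model *b, unsigned int shift, bool error)
{
	uint64_t history = 0;

	assert(b && shift >= 1);

	/* A shift of 64 or more would be undefined, so drop stale history. */
	if (shift < NUM_HISTORY_BITS)
		history = b->history << shift;
	if (error)
		history |= 1;
	b->history = history;

	if (b->in_storm) {
		/* Any error in the last 30 polls keeps the storm alive. */
		if (history & STORM_END_MASK)
			return false;
		b->in_storm = false;
		b->history = 0;
		return true;
	}

	if (popcount64(history) < STORM_BEGIN_THRESHOLD)
		return false;
	b->in_storm = true;
	return true;
}
```

In this model five errors in consecutive polls start a storm, and the storm ends on the 30th consecutive clean poll after the last error.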

[Changes made based on Boris' comments 23 Oct 2023]

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 58 +++++++++++++-
arch/x86/kernel/cpu/mce/core.c | 33 ++++++--
arch/x86/kernel/cpu/mce/threshold.c | 112 ++++++++++++++++++++++++++++
3 files changed, 194 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index b18e99016ce5..e55676f096d8 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -56,7 +56,63 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }
static inline bool intel_mce_usable_address(struct mce *m) { return false; }
#endif

-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+void cmci_storm_begin(unsigned int bank);
+void cmci_storm_end(unsigned int bank);
+void mce_track_storm(struct mce *mce);
+void mce_inherit_storm(unsigned int bank);
+bool mce_get_storm_mode(void);
+void mce_set_storm_mode(bool storm);
+#else
+static inline void cmci_storm_begin(unsigned int bank) {}
+static inline void cmci_storm_end(unsigned int bank) {}
+static inline void mce_track_storm(struct mce *mce) {}
+static inline void mce_inherit_storm(unsigned int bank) {}
+static inline bool mce_get_storm_mode(void) { return false; }
+static inline void mce_set_storm_mode(bool storm) {}
+#endif
+
+/*
+ * history: bitmask tracking whether errors were seen or not seen in
+ * the most recent polls of a bank. Each '1' bit represents
+ * an error seen.
+ * timestamp: last time (in jiffies) that the bank was polled
+ * in_storm_mode: Is this bank in storm mode?
+ * poll_only: Bank does not support CMCI, skip storm tracking
+ */
+struct storm_bank {
+ u64 history;
+ u64 timestamp;
+ bool in_storm_mode;
+ bool poll_only;
+};
+
+#define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)
+
+/* How many errors within the history buffer mark the start of a storm. */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over. Since it is tracked by the bitmask in the history
+ * field of struct storm_bank the mask is 30 bits [0 ... 29].
+ */
+#define STORM_END_POLL_THRESHOLD 29
+
+/*
+ * banks: per-cpu, per-bank details
+ * stormy_bank_count: count of MC banks in storm state
+ * poll_mode: CPU is in poll mode
+ */
+struct mca_storm_desc {
+ struct storm_bank banks[MAX_NR_BANKS];
+ u8 stormy_bank_count;
+ bool poll_mode;
+};
+
+DECLARE_PER_CPU(struct mca_storm_desc, storm_desc);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 117848a63aff..820bd7d448c1 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -670,6 +670,16 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
barrier();
m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));

+ /*
+ * Update storm tracking here, before checking for the
+ * MCI_STATUS_VAL bit. Valid corrected errors count
+ * towards declaring, or maintaining, storm status. No
+ * error in a bank counts towards avoiding, or ending,
+ * storm status.
+ */
+ if (!mca_cfg.cmci_disabled)
+ mce_track_storm(&m);
+
/* If this entry is not valid, ignore it */
if (!(m.status & MCI_STATUS_VAL))
continue;
@@ -1642,22 +1652,29 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

- __this_cpu_write(mce_next_interval, iv);
- __start_timer(t, iv);
+ if (mce_get_storm_mode()) {
+ __start_timer(t, HZ);
+ } else {
+ __this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
}

/*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
*/
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- unsigned long iv = __this_cpu_read(mce_next_interval);

- __start_timer(t, interval);
+ mce_set_storm_mode(storm);

- if (interval < iv)
- __this_cpu_write(mce_next_interval, interval);
+ if (storm)
+ __start_timer(t, HZ);
+ else
+ __this_cpu_write(mce_next_interval, check_interval * HZ);
}

/* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ef4e7bb5fd88..0e1988468ee4 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
apic_eoi();
}
+
+DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+
+void mce_inherit_storm(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ /*
+ * Previous CPU owning this bank had put it into storm mode,
+ * but the precise history of that storm is unknown. Assume
+ * the worst (all recent polls of the bank found a valid error
+ * logged). This will avoid the new owner prematurely declaring
+ * the storm has ended.
+ */
+ storm->banks[bank].history = ~0ull;
+ storm->banks[bank].timestamp = jiffies;
+}
+
+bool mce_get_storm_mode(void)
+{
+ return __this_cpu_read(storm_desc.poll_mode);
+}
+
+void mce_set_storm_mode(bool storm)
+{
+ __this_cpu_write(storm_desc.poll_mode, storm);
+}
+
+static void mce_handle_storm(unsigned int bank, bool on)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ }
+}
+
+void cmci_storm_begin(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ storm->banks[bank].in_storm_mode = true;
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling.
+ */
+ if (++storm->stormy_bank_count == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ storm->banks[bank].history = 0;
+ storm->banks[bank].in_storm_mode = false;
+
+ /* If no banks left in storm mode, stop polling. */
+ if (!this_cpu_dec_return(storm_desc.stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void mce_track_storm(struct mce *mce)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history = 0;
+
+ /* No tracking needed for banks that do not support CMCI */
+ if (storm->banks[mce->bank].poll_only)
+ return;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!storm->banks[mce->bank].in_storm_mode) {
+ delta = now - storm->banks[mce->bank].timestamp;
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If it has been a long time since the last poll, clear history. */
+ if (shift < NUM_HISTORY_BITS)
+ history = storm->banks[mce->bank].history << shift;
+
+ storm->banks[mce->bank].timestamp = now;
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((mce->status & MCI_STATUS_VAL) && mce_is_correctable(mce))
+ history |= 1;
+
+ storm->banks[mce->bank].history = history;
+
+ if (storm->banks[mce->bank].in_storm_mode) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD, 0))
+ return;
+ printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), mce->bank);
+ mce_handle_storm(mce->bank, false);
+ cmci_storm_end(mce->bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), mce->bank);
+ mce_handle_storm(mce->bank, true);
+ cmci_storm_begin(mce->bank);
+ }
+}
--
2.41.0

2023-11-15 19:56:47

by Luck, Tony

Subject: [PATCH v10 3/3] x86/mce: Handle Intel threshold interrupt storms

Add an Intel specific hook into machine_check_poll() to keep track
of per-CPU, per-bank corrected error logs (with a stub for the
CONFIG_X86_MCE_INTEL=n case).

When a storm is observed the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second.

When a storm ends reset the threshold in IA32_MCi_CTL2 back to 1, remove
the bank from the bitmap for polling, and change the polling rate back
to the default.

If a CPU with banks in storm mode is taken offline, the new CPU
that inherits ownership of those banks takes over management of
storm(s) in the inherited bank(s).
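
To see what inheriting a storm implies for the new owner, the sketch below counts how many consecutive clean polls are needed before the low 30 history bits are clear. The helper clean_polls_to_end() and the STORM_END_MASK macro are hypothetical names for this illustration. It assumes, as in patch 2/3, that an inherited bank starts with an all-ones history and that the storm ends once bits [0 ... 29] are zero.

```c
#include <assert.h>
#include <stdint.h>

#define STORM_END_MASK ((1ULL << 30) - 1)	/* bits [0 ... 29] */

/* Number of error-free polls needed before the storm-end test passes. */
static unsigned int clean_polls_to_end(uint64_t history)
{
	unsigned int polls = 0;

	while (history & STORM_END_MASK) {
		history <<= 1;		/* one clean poll shifts in a 0 bit */
		polls++;
		assert(polls <= 64);	/* history is empty after 64 shifts */
	}
	return polls;
}
```

Under these assumptions an inherited bank (history of ~0ull) needs the full 30 clean polls, the same as a bank whose most recent poll logged an error.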

The cmci_discover() function was already very large. These changes
pushed it well over the top. Refactor with three helper functions
to bring it back under control.

Updated with review comments from Yazen.
Link: https://lore.kernel.org/r/[email protected]

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 2 +
arch/x86/kernel/cpu/mce/intel.c | 205 +++++++++++++++++++++-------
arch/x86/kernel/cpu/mce/threshold.c | 3 +
3 files changed, 160 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index e55676f096d8..6315dbf58146 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,6 +41,7 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_handle_storm(int bank, bool on);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
@@ -48,6 +49,7 @@ void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
bool intel_mce_usable_address(struct mce *m);
#else
+static inline void mce_intel_handle_storm(int bank, bool on) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index fc4ffc434023..399b62e223d2 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -54,8 +54,27 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
*/
static DEFINE_SPINLOCK(cmci_poll_lock);

+/* Linux non-storm CMCI threshold (may be overridden by BIOS) */
#define CMCI_THRESHOLD 1

+/*
+ * MCi_CTL2 threshold for each bank when there is no storm.
+ * Default value for each bank may have been set by BIOS.
+ */
+static u16 cmci_threshold[MAX_NR_BANKS];
+
+/*
+ * High threshold to limit CMCI rate during storms. Max supported is
+ * 0x7FFF. Use this slightly smaller value so it has a distinctive
+ * signature when someone asks "Why am I not seeing all corrected errors?"
+ * A high threshold is used instead of just disabling CMCI for a
+ * bank because both corrected and uncorrected errors may be logged
+ * in the same bank and signalled with CMCI. The threshold only applies
+ * to corrected errors, so keeping CMCI enabled means that uncorrected
+ * errors will still be processed in a timely fashion.
+ */
+#define CMCI_STORM_THRESHOLD 32749
+
static int cmci_supported(int *banks)
{
u64 cap;
@@ -110,6 +129,31 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

+/*
+ * Set a new CMCI threshold value. Preserve the state of the
+ * MCI_CTL2_CMCI_EN bit in case this happens during a
+ * cmci_rediscover() operation.
+ */
+static void cmci_set_threshold(int bank, int thresh)
+{
+ unsigned long flags;
+ u64 val;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val | thresh);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+void mce_intel_handle_storm(int bank, bool on)
+{
+ if (on)
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ else
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+}
+
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -121,72 +165,130 @@ static void intel_threshold_interrupt(void)
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

+/*
+ * Check all the reasons why current CPU cannot claim
+ * ownership of a bank.
+ * 1: CPU already owns this bank
+ * 2: BIOS owns this bank
+ * 3: Some other CPU owns this bank
+ */
+static bool cmci_skip_bank(int bank, u64 *val)
+{
+ unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
+
+ if (test_bit(bank, owned))
+ return true;
+
+ /* Skip banks in firmware first mode */
+ if (test_bit(bank, mce_banks_ce_disabled))
+ return true;
+
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), *val);
+
+ /* Already owned by someone else? */
+ if (*val & MCI_CTL2_CMCI_EN) {
+ clear_bit(bank, owned);
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Decide which CMCI interrupt threshold to use:
+ * 1: If this bank is in storm mode from whichever CPU was
+ * the previous owner, stay in storm mode.
+ * 2: If ignoring any threshold set by BIOS, set Linux default
+ * 3: Try to honor BIOS threshold (unless buggy BIOS set it at zero).
+ */
+static u64 cmci_pick_threshold(u64 val, int *bios_zero_thresh)
+{
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ return val;
+
+ if (!mca_cfg.bios_cmci_threshold) {
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ val |= CMCI_THRESHOLD;
+ } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
+ /*
+ * If bios_cmci_threshold boot option was specified
+ * but the threshold is zero, we'll try to initialize
+ * it to 1.
+ */
+ *bios_zero_thresh = 1;
+ val |= CMCI_THRESHOLD;
+ }
+
+ return val;
+}
+
+/*
+ * Try to claim ownership of a bank.
+ */
+static void cmci_claim_bank(int bank, u64 val, int bios_zero_thresh, int *bios_wrong_thresh)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ val |= MCI_CTL2_CMCI_EN;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+
+ /* If the enable bit did not stick, this bank should be polled. */
+ if (!(val & MCI_CTL2_CMCI_EN)) {
+ WARN_ON(!test_bit(bank, this_cpu_ptr(mce_poll_banks)));
+ storm->banks[bank].poll_only = true;
+ return;
+ }
+
+ /* This CPU successfully set the enable bit. */
+ set_bit(bank, (void *)this_cpu_ptr(&mce_banks_owned));
+
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
+ pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), bank);
+ mce_inherit_storm(bank);
+ cmci_storm_begin(bank);
+ } else {
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ }
+
+ /*
+ * We are able to set thresholds for some banks that
+ * had a threshold of 0. This means the BIOS has not
+ * set the thresholds properly or does not work with
+ * this boot option. Note down now and report later.
+ */
+ if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
+ (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
+ *bios_wrong_thresh = 1;
+
+ /* Save default threshold for each bank */
+ if (cmci_threshold[bank] == 0)
+ cmci_threshold[bank] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
+}
+
/*
* Enable CMCI (Corrected Machine Check Interrupt) for available MCE banks
* on this CPU. Use the algorithm recommended in the SDM to discover shared
- * banks.
+ * banks. Called during initial bootstrap, and also for hotplug CPU operations
+ * to rediscover/reassign machine check banks.
*/
static void cmci_discover(int banks)
{
- unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
+ int bios_wrong_thresh = 0;
unsigned long flags;
int i;
- int bios_wrong_thresh = 0;

raw_spin_lock_irqsave(&cmci_discover_lock, flags);
for (i = 0; i < banks; i++) {
u64 val;
int bios_zero_thresh = 0;

- if (test_bit(i, owned))
- continue;
-
- /* Skip banks in firmware first mode */
- if (test_bit(i, mce_banks_ce_disabled))
+ if (cmci_skip_bank(i, &val))
continue;

- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Already owned by someone else? */
- if (val & MCI_CTL2_CMCI_EN) {
- clear_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
- continue;
- }
-
- if (!mca_cfg.bios_cmci_threshold) {
- val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
- val |= CMCI_THRESHOLD;
- } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
- /*
- * If bios_cmci_threshold boot option was specified
- * but the threshold is zero, we'll try to initialize
- * it to 1.
- */
- bios_zero_thresh = 1;
- val |= CMCI_THRESHOLD;
- }
-
- val |= MCI_CTL2_CMCI_EN;
- wrmsrl(MSR_IA32_MCx_CTL2(i), val);
- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Did the enable bit stick? -- the bank supports CMCI */
- if (val & MCI_CTL2_CMCI_EN) {
- set_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
- /*
- * We are able to set thresholds for some banks that
- * had a threshold of 0. This means the BIOS has not
- * set the thresholds properly or does not work with
- * this boot option. Note down now and report later.
- */
- if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
- (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
- bios_wrong_thresh = 1;
- } else {
- WARN_ON(!test_bit(i, this_cpu_ptr(mce_poll_banks)));
- }
+ val = cmci_pick_threshold(val, &bios_zero_thresh);
+ cmci_claim_bank(i, val, bios_zero_thresh, &bios_wrong_thresh);
}
raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
if (mca_cfg.bios_cmci_threshold && bios_wrong_thresh) {
@@ -225,6 +327,9 @@ static void __cmci_disable_bank(int bank)
val &= ~MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
__clear_bit(bank, this_cpu_ptr(mce_banks_owned));
+
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ cmci_storm_end(bank);
}

/*
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index 0e1988468ee4..89e31e1e5c9c 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -60,6 +60,9 @@ void mce_set_storm_mode(bool storm)
static void mce_handle_storm(unsigned int bank, bool on)
{
switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_handle_storm(bank, on);
+ break;
}
}

--
2.41.0

2023-11-15 19:57:08

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v10 1/3] x86/mce: Remove old CMCI storm mitigation code

When a "storm" of CMCI is detected this code mitigates by
disabling CMCI interrupt signalling from all of the banks
owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the
banks was generating the interrupts, but CMCI is disabled for all.
This means Linux may delay seeing and processing errors logged
from other banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it
is also used to signal when an uncorrected error is logged. This
is a problem because these errors should be handled in a timely
manner.

Delete all this code in preparation for a finer grained solution.

Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/internal.h | 6 --
arch/x86/kernel/cpu/mce/core.c | 20 +---
arch/x86/kernel/cpu/mce/intel.c | 145 -----------------------------
3 files changed, 1 insertion(+), 170 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index e13a26c9c0ac..b18e99016ce5 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,9 +41,6 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long cmci_intel_adjust_timer(unsigned long interval);
-bool mce_intel_cmci_poll(void);
-void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
@@ -51,9 +48,6 @@ void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
bool intel_mce_usable_address(struct mce *m);
#else
-# define cmci_intel_adjust_timer mce_adjust_timer_default
-static inline bool mce_intel_cmci_poll(void) { return false; }
-static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7b397370b4d6..117848a63aff 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1601,13 +1601,6 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);

-static unsigned long mce_adjust_timer_default(unsigned long interval)
-{
- return interval;
-}
-
-static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
-
static void __start_timer(struct timer_list *t, unsigned long interval)
{
unsigned long when = jiffies + interval;
@@ -1637,15 +1630,9 @@ static void mce_timer_fn(struct timer_list *t)

iv = __this_cpu_read(mce_next_interval);

- if (mce_available(this_cpu_ptr(&cpu_info))) {
+ if (mce_available(this_cpu_ptr(&cpu_info)))
mc_poll_banks();

- if (mce_intel_cmci_poll()) {
- iv = mce_adjust_timer(iv);
- goto done;
- }
- }
-
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
@@ -1655,7 +1642,6 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

-done:
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -1995,7 +1981,6 @@ static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)

intel_init_cmci();
intel_init_lmce();
- mce_adjust_timer = cmci_intel_adjust_timer;
}

static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
@@ -2008,7 +1993,6 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = cmci_intel_adjust_timer;
break;

case X86_VENDOR_AMD: {
@@ -2665,8 +2649,6 @@ static void mce_reenable_cpu(void)

static int mce_cpu_dead(unsigned int cpu)
{
- mce_intel_hcpu_update(cpu);
-
/* intentionally ignoring frozen here */
if (!cpuhp_tasks_frozen)
cmci_rediscover();
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 52bce533ddcc..fc4ffc434023 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -41,15 +41,6 @@
*/
static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);

-/*
- * CMCI storm detection backoff counter
- *
- * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
- * encountered an error. If not, we decrement it by one. We signal the end of
- * the CMCI storm when it reaches 0.
- */
-static DEFINE_PER_CPU(int, cmci_backoff_cnt);
-
/*
* cmci_discover_lock protects against parallel discovery attempts
* which could race against each other.
@@ -64,21 +55,6 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
static DEFINE_SPINLOCK(cmci_poll_lock);

#define CMCI_THRESHOLD 1
-#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (HZ)
-#define CMCI_STORM_THRESHOLD 15
-
-static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
-
-enum {
- CMCI_STORM_NONE,
- CMCI_STORM_ACTIVE,
- CMCI_STORM_SUBSIDED,
-};
-
-static atomic_t cmci_storm_on_cpus;

static int cmci_supported(int *banks)
{
@@ -134,124 +110,6 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

-bool mce_intel_cmci_poll(void)
-{
- if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
- return false;
-
- /*
- * Reset the counter if we've logged an error in the last poll
- * during the storm.
- */
- if (machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)))
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
- else
- this_cpu_dec(cmci_backoff_cnt);
-
- return true;
-}
-
-void mce_intel_hcpu_update(unsigned long cpu)
-{
- if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
- atomic_dec(&cmci_storm_on_cpus);
-
- per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
-}
-
-static void cmci_toggle_interrupt_mode(bool on)
-{
- unsigned long flags, *owned;
- int bank;
- u64 val;
-
- raw_spin_lock_irqsave(&cmci_discover_lock, flags);
- owned = this_cpu_ptr(mce_banks_owned);
- for_each_set_bit(bank, owned, MAX_NR_BANKS) {
- rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-
- if (on)
- val |= MCI_CTL2_CMCI_EN;
- else
- val &= ~MCI_CTL2_CMCI_EN;
-
- wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
- }
- raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-unsigned long cmci_intel_adjust_timer(unsigned long interval)
-{
- if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
- (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
- mce_notify_irq();
- return CMCI_STORM_INTERVAL;
- }
-
- switch (__this_cpu_read(cmci_storm_state)) {
- case CMCI_STORM_ACTIVE:
-
- /*
- * We switch back to interrupt mode once the poll timer has
- * silenced itself. That means no events recorded and the timer
- * interval is back to our poll interval.
- */
- __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- if (!atomic_sub_return(1, &cmci_storm_on_cpus))
- pr_notice("CMCI storm subsided: switching to interrupt mode\n");
-
- fallthrough;
-
- case CMCI_STORM_SUBSIDED:
- /*
- * We wait for all CPUs to go back to SUBSIDED state. When that
- * happens we switch back to interrupt mode.
- */
- if (!atomic_read(&cmci_storm_on_cpus)) {
- __this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
- cmci_toggle_interrupt_mode(true);
- cmci_recheck();
- }
- return CMCI_POLL_INTERVAL;
- default:
-
- /* We have shiny weather. Let the poll do whatever it thinks. */
- return interval;
- }
-}
-
-static bool cmci_storm_detect(void)
-{
- unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
- unsigned long ts = __this_cpu_read(cmci_time_stamp);
- unsigned long now = jiffies;
- int r;
-
- if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
- return true;
-
- if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
- cnt++;
- } else {
- cnt = 1;
- __this_cpu_write(cmci_time_stamp, now);
- }
- __this_cpu_write(cmci_storm_cnt, cnt);
-
- if (cnt <= CMCI_STORM_THRESHOLD)
- return false;
-
- cmci_toggle_interrupt_mode(false);
- __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
- r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_STORM_INTERVAL);
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
-
- if (r == 1)
- pr_notice("CMCI storm detected: switching to poll mode\n");
- return true;
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -260,9 +118,6 @@ static bool cmci_storm_detect(void)
*/
static void intel_threshold_interrupt(void)
{
- if (cmci_storm_detect())
- return;
-
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

--
2.41.0

Subject: [tip: ras/core] x86/mce: Add per-bank CMCI storm mitigation

The following commit has been merged into the ras/core branch of tip:

Commit-ID: 7eae17c4add5de46efcca45356388f480103e6d9
Gitweb: https://git.kernel.org/tip/7eae17c4add5de46efcca45356388f480103e6d9
Author: Tony Luck <[email protected]>
AuthorDate: Wed, 15 Nov 2023 11:54:49 -08:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Fri, 15 Dec 2023 14:52:01 +01:00

x86/mce: Add per-bank CMCI storm mitigation

This is the core functionality to track CMCI storms at the machine check
bank granularity. Subsequent patches will add the vendor specific hooks
to supply input to the storm detection and take actions on the start/end
of a storm.

machine_check_poll() is called both by the CMCI interrupt code, and for
periodic polls from a timer. Add a hook in this routine to maintain
a bitmap history for each bank showing whether the bank logged a
corrected error or not each time it is polled.

In normal operation the interval between polls of these banks determines
how far to shift the history. The 64 bit width corresponds to about one
second.

When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm. The bank is
added to the bitmap of banks for this CPU to poll. The polling rate is
increased to once per second. During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the
history covers just over a minute.

Declare a storm for that bank if the number of corrected errors seen
in that history is above some threshold (defined as 5 in this series,
could be tuned later if there is data to suggest a better value).

A storm on a bank ends if enough consecutive polls of the bank show no
corrected errors (defined as 30, may also change). That calls the CPU
vendor specific function to revert to normal operational mode, and
changes the polling rate back to the default.
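
The begin/end rules above can be modelled in a few lines of userspace C.
This is only a sketch of the bookkeeping, not the kernel implementation:
struct bank_state, track_poll() and popcount64() are names invented for
the example, time is passed in as whole seconds since the previous poll
instead of jiffies, and the vendor hooks and timer changes are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS    64
#define BEGIN_THRESHOLD 5                    /* set bits needed to start a storm */
#define END_MASK        ((1ULL << 30) - 1)   /* 30 most recent polls */

struct bank_state {
	uint64_t history;   /* bit 0 is the most recent poll */
	bool in_storm;
};

static int popcount64(uint64_t v)
{
	int n = 0;

	for (; v; v &= v - 1)	/* clear the lowest set bit each pass */
		n++;
	return n;
}

/*
 * Record one poll of a bank. Returns 1 if a storm begins, -1 if a storm
 * ends, 0 otherwise. In storm mode the bank is polled once per second,
 * so the history always shifts by one; otherwise the shift grows with
 * the time since the previous poll.
 */
static int track_poll(struct bank_state *b, unsigned int elapsed_secs,
		      bool corrected_error)
{
	uint64_t shift = b->in_storm ? 1 : (uint64_t)elapsed_secs + 1;
	uint64_t history = 0;

	/* A long gap since the last poll discards the old history. */
	if (shift < HISTORY_BITS)
		history = b->history << shift;
	if (corrected_error)
		history |= 1;
	b->history = history;

	if (b->in_storm) {
		if (history & END_MASK)
			return 0;	/* an error within the last 30 polls */
		b->in_storm = false;
		b->history = 0;
		return -1;
	}

	if (popcount64(history) < BEGIN_THRESHOLD)
		return 0;
	b->in_storm = true;
	return 1;
}
```

In this model, five back-to-back polls that each find a corrected error
start a storm, and the storm then ends on the 30th consecutive clean poll.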

[ bp: Massage. ]

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/mce/core.c | 33 ++++++--
arch/x86/kernel/cpu/mce/internal.h | 58 +++++++++++++-
arch/x86/kernel/cpu/mce/threshold.c | 112 +++++++++++++++++++++++++++-
3 files changed, 194 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index b2ef487..fd5ce12 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -686,6 +686,16 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
barrier();
m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));

+ /*
+ * Update storm tracking here, before checking for the
+ * MCI_STATUS_VAL bit. Valid corrected errors count
+ * towards declaring, or maintaining, storm status. No
+ * error in a bank counts towards avoiding, or ending,
+ * storm status.
+ */
+ if (!mca_cfg.cmci_disabled)
+ mce_track_storm(&m);
+
/* If this entry is not valid, ignore it */
if (!(m.status & MCI_STATUS_VAL))
continue;
@@ -1658,22 +1668,29 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

- __this_cpu_write(mce_next_interval, iv);
- __start_timer(t, iv);
+ if (mce_get_storm_mode()) {
+ __start_timer(t, HZ);
+ } else {
+ __this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
}

/*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
*/
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- unsigned long iv = __this_cpu_read(mce_next_interval);

- __start_timer(t, interval);
+ mce_set_storm_mode(storm);

- if (interval < iv)
- __this_cpu_write(mce_next_interval, interval);
+ if (storm)
+ __start_timer(t, HZ);
+ else
+ __this_cpu_write(mce_next_interval, check_interval * HZ);
}

/* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index b18e990..157b2f2 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -56,7 +56,63 @@ static inline bool intel_filter_mce(struct mce *m) { return false; }
static inline bool intel_mce_usable_address(struct mce *m) { return false; }
#endif

-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+void cmci_storm_begin(unsigned int bank);
+void cmci_storm_end(unsigned int bank);
+void mce_track_storm(struct mce *mce);
+void mce_inherit_storm(unsigned int bank);
+bool mce_get_storm_mode(void);
+void mce_set_storm_mode(bool storm);
+#else
+static inline void cmci_storm_begin(unsigned int bank) {}
+static inline void cmci_storm_end(unsigned int bank) {}
+static inline void mce_track_storm(struct mce *mce) {}
+static inline void mce_inherit_storm(unsigned int bank) {}
+static inline bool mce_get_storm_mode(void) { return false; }
+static inline void mce_set_storm_mode(bool storm) {}
+#endif
+
+/*
+ * history: Bitmask tracking errors occurrence. Each set bit
+ * represents an error seen.
+ *
+ * timestamp: Last time (in jiffies) that the bank was polled.
+ * in_storm_mode: Is this bank in storm mode?
+ * poll_only: Bank does not support CMCI, skip storm tracking.
+ */
+struct storm_bank {
+ u64 history;
+ u64 timestamp;
+ bool in_storm_mode;
+ bool poll_only;
+};
+
+#define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)
+
+/* How many errors within the history buffer mark the start of a storm. */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over. Since it is tracked by the bitmasks in the history
+ * field of struct storm_bank the mask is 30 bits [0 ... 29].
+ */
+#define STORM_END_POLL_THRESHOLD 29
+
+/*
+ * banks: per-cpu, per-bank details
+ * stormy_bank_count: count of MC banks in storm state
+ * poll_mode: CPU is in poll mode
+ */
+struct mca_storm_desc {
+ struct storm_bank banks[MAX_NR_BANKS];
+ u8 stormy_bank_count;
+ bool poll_mode;
+};
+
+DECLARE_PER_CPU(struct mca_storm_desc, storm_desc);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ef4e7bb..0e19884 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
apic_eoi();
}
+
+DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+
+void mce_inherit_storm(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ /*
+ * Previous CPU owning this bank had put it into storm mode,
+ * but the precise history of that storm is unknown. Assume
+ * the worst (all recent polls of the bank found a valid error
+ * logged). This will avoid the new owner prematurely declaring
+ * the storm has ended.
+ */
+ storm->banks[bank].history = ~0ull;
+ storm->banks[bank].timestamp = jiffies;
+}
+
+bool mce_get_storm_mode(void)
+{
+ return __this_cpu_read(storm_desc.poll_mode);
+}
+
+void mce_set_storm_mode(bool storm)
+{
+ __this_cpu_write(storm_desc.poll_mode, storm);
+}
+
+static void mce_handle_storm(unsigned int bank, bool on)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ }
+}
+
+void cmci_storm_begin(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ storm->banks[bank].in_storm_mode = true;
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling.
+ */
+ if (++storm->stormy_bank_count == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ storm->banks[bank].history = 0;
+ storm->banks[bank].in_storm_mode = false;
+
+ /* If no banks left in storm mode, stop polling. */
+ if (!this_cpu_dec_return(storm_desc.stormy_bank_count))
+ mce_timer_kick(false);
+}
+
+void mce_track_storm(struct mce *mce)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history = 0;
+
+ /* No tracking needed for banks that do not support CMCI */
+ if (storm->banks[mce->bank].poll_only)
+ return;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record about the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
+ if (!storm->banks[mce->bank].in_storm_mode) {
+ delta = now - storm->banks[mce->bank].timestamp;
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If it has been a long time since the last poll, clear history. */
+ if (shift < NUM_HISTORY_BITS)
+ history = storm->banks[mce->bank].history << shift;
+
+ storm->banks[mce->bank].timestamp = now;
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((mce->status & MCI_STATUS_VAL) && mce_is_correctable(mce))
+ history |= 1;
+
+ storm->banks[mce->bank].history = history;
+
+ if (storm->banks[mce->bank].in_storm_mode) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD, 0))
+ return;
+ printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), mce->bank);
+ mce_handle_storm(mce->bank, false);
+ cmci_storm_end(mce->bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), mce->bank);
+ mce_handle_storm(mce->bank, true);
+ cmci_storm_begin(mce->bank);
+ }
+}

Subject: [tip: ras/core] x86/mce: Remove old CMCI storm mitigation code

The following commit has been merged into the ras/core branch of tip:

Commit-ID: 3ed57b41a4125609e9fd03e32228aec61d95fe1f
Gitweb: https://git.kernel.org/tip/3ed57b41a4125609e9fd03e32228aec61d95fe1f
Author: Tony Luck <[email protected]>
AuthorDate: Wed, 15 Nov 2023 11:54:48 -08:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Fri, 15 Dec 2023 13:44:12 +01:00

x86/mce: Remove old CMCI storm mitigation code

When a "storm" of corrected machine check interrupts (CMCI) is detected
this code mitigates by disabling CMCI interrupt signalling from all of
the banks owned by the CPU that saw the storm.

There are problems with this approach:

1) It is very coarse grained. In all likelihood only one of the banks
was generating the interrupts, but CMCI is disabled for all. This
means Linux may delay seeing and processing errors logged from other
banks.

2) Although CMCI stands for Corrected Machine Check Interrupt, it is
also used to signal when an uncorrected error is logged. This is
a problem because these errors should be handled in a timely manner.

Delete all this code in preparation for a finer grained solution.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Tested-by: Yazen Ghannam <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/mce/core.c | 20 +----
arch/x86/kernel/cpu/mce/intel.c | 145 +----------------------------
arch/x86/kernel/cpu/mce/internal.h | 6 +-
3 files changed, 1 insertion(+), 170 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 1642018..b2ef487 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1617,13 +1617,6 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
static DEFINE_PER_CPU(struct timer_list, mce_timer);

-static unsigned long mce_adjust_timer_default(unsigned long interval)
-{
- return interval;
-}
-
-static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
-
static void __start_timer(struct timer_list *t, unsigned long interval)
{
unsigned long when = jiffies + interval;
@@ -1653,15 +1646,9 @@ static void mce_timer_fn(struct timer_list *t)

iv = __this_cpu_read(mce_next_interval);

- if (mce_available(this_cpu_ptr(&cpu_info))) {
+ if (mce_available(this_cpu_ptr(&cpu_info)))
mc_poll_banks();

- if (mce_intel_cmci_poll()) {
- iv = mce_adjust_timer(iv);
- goto done;
- }
- }
-
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
@@ -1671,7 +1658,6 @@ static void mce_timer_fn(struct timer_list *t)
else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

-done:
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -2011,7 +1997,6 @@ static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)

intel_init_cmci();
intel_init_lmce();
- mce_adjust_timer = cmci_intel_adjust_timer;
}

static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
@@ -2024,7 +2009,6 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = cmci_intel_adjust_timer;
break;

case X86_VENDOR_AMD: {
@@ -2678,8 +2662,6 @@ static void mce_reenable_cpu(void)

static int mce_cpu_dead(unsigned int cpu)
{
- mce_intel_hcpu_update(cpu);
-
/* intentionally ignoring frozen here */
if (!cpuhp_tasks_frozen)
cmci_rediscover();
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 52bce53..fc4ffc4 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -42,15 +42,6 @@
static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);

/*
- * CMCI storm detection backoff counter
- *
- * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
- * encountered an error. If not, we decrement it by one. We signal the end of
- * the CMCI storm when it reaches 0.
- */
-static DEFINE_PER_CPU(int, cmci_backoff_cnt);
-
-/*
* cmci_discover_lock protects against parallel discovery attempts
* which could race against each other.
*/
@@ -64,21 +55,6 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
static DEFINE_SPINLOCK(cmci_poll_lock);

#define CMCI_THRESHOLD 1
-#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (HZ)
-#define CMCI_STORM_THRESHOLD 15
-
-static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
-static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
-
-enum {
- CMCI_STORM_NONE,
- CMCI_STORM_ACTIVE,
- CMCI_STORM_SUBSIDED,
-};
-
-static atomic_t cmci_storm_on_cpus;

static int cmci_supported(int *banks)
{
@@ -134,124 +110,6 @@ static bool lmce_supported(void)
return tmp & FEAT_CTL_LMCE_ENABLED;
}

-bool mce_intel_cmci_poll(void)
-{
- if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
- return false;
-
- /*
- * Reset the counter if we've logged an error in the last poll
- * during the storm.
- */
- if (machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)))
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
- else
- this_cpu_dec(cmci_backoff_cnt);
-
- return true;
-}
-
-void mce_intel_hcpu_update(unsigned long cpu)
-{
- if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
- atomic_dec(&cmci_storm_on_cpus);
-
- per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
-}
-
-static void cmci_toggle_interrupt_mode(bool on)
-{
- unsigned long flags, *owned;
- int bank;
- u64 val;
-
- raw_spin_lock_irqsave(&cmci_discover_lock, flags);
- owned = this_cpu_ptr(mce_banks_owned);
- for_each_set_bit(bank, owned, MAX_NR_BANKS) {
- rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-
- if (on)
- val |= MCI_CTL2_CMCI_EN;
- else
- val &= ~MCI_CTL2_CMCI_EN;
-
- wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
- }
- raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-unsigned long cmci_intel_adjust_timer(unsigned long interval)
-{
- if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
- (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
- mce_notify_irq();
- return CMCI_STORM_INTERVAL;
- }
-
- switch (__this_cpu_read(cmci_storm_state)) {
- case CMCI_STORM_ACTIVE:
-
- /*
- * We switch back to interrupt mode once the poll timer has
- * silenced itself. That means no events recorded and the timer
- * interval is back to our poll interval.
- */
- __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- if (!atomic_sub_return(1, &cmci_storm_on_cpus))
- pr_notice("CMCI storm subsided: switching to interrupt mode\n");
-
- fallthrough;
-
- case CMCI_STORM_SUBSIDED:
- /*
- * We wait for all CPUs to go back to SUBSIDED state. When that
- * happens we switch back to interrupt mode.
- */
- if (!atomic_read(&cmci_storm_on_cpus)) {
- __this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
- cmci_toggle_interrupt_mode(true);
- cmci_recheck();
- }
- return CMCI_POLL_INTERVAL;
- default:
-
- /* We have shiny weather. Let the poll do whatever it thinks. */
- return interval;
- }
-}
-
-static bool cmci_storm_detect(void)
-{
- unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
- unsigned long ts = __this_cpu_read(cmci_time_stamp);
- unsigned long now = jiffies;
- int r;
-
- if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
- return true;
-
- if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
- cnt++;
- } else {
- cnt = 1;
- __this_cpu_write(cmci_time_stamp, now);
- }
- __this_cpu_write(cmci_storm_cnt, cnt);
-
- if (cnt <= CMCI_STORM_THRESHOLD)
- return false;
-
- cmci_toggle_interrupt_mode(false);
- __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
- r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_STORM_INTERVAL);
- this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
-
- if (r == 1)
- pr_notice("CMCI storm detected: switching to poll mode\n");
- return true;
-}
-
/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
@@ -260,9 +118,6 @@ static bool cmci_storm_detect(void)
*/
static void intel_threshold_interrupt(void)
{
- if (cmci_storm_detect())
- return;
-
machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index e13a26c..b18e990 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,9 +41,6 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long cmci_intel_adjust_timer(unsigned long interval);
-bool mce_intel_cmci_poll(void);
-void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
@@ -51,9 +48,6 @@ void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
bool intel_mce_usable_address(struct mce *m);
#else
-# define cmci_intel_adjust_timer mce_adjust_timer_default
-static inline bool mce_intel_cmci_poll(void) { return false; }
-static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
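
For readers comparing old and new behaviour: the per-CPU detection removed
above (cmci_storm_detect()) can be approximated in userspace roughly as
below. This is only a sketch, not kernel code. struct storm_detector,
storm_detect() and the HZ value are made up for illustration; the interval
and threshold are the ones from the removed #defines. A storm is declared
once more than 15 interrupts land within one second of the window start.

```c
#include <stdbool.h>

/* Interval and threshold from the removed code; HZ is an assumed tick rate. */
#define HZ                   1000UL
#define CMCI_STORM_INTERVAL  (HZ)
#define CMCI_STORM_THRESHOLD 15

/* Hypothetical stand-in for the removed per-CPU variables. */
struct storm_detector {
	unsigned long time_stamp;	/* start of the current counting window */
	unsigned int  count;		/* interrupts seen in that window */
	bool          active;		/* set once a storm has been declared */
};

/*
 * Call once per simulated CMCI at time 'now' (in ticks).
 * Returns true if this CPU is in, or has just entered, storm mode.
 */
bool storm_detect(struct storm_detector *d, unsigned long now)
{
	if (d->active)
		return true;

	if (now - d->time_stamp <= CMCI_STORM_INTERVAL) {
		d->count++;
	} else {
		/* Window expired: start a new one at this interrupt. */
		d->count = 1;
		d->time_stamp = now;
	}

	if (d->count <= CMCI_STORM_THRESHOLD)
		return false;

	d->active = true;
	return true;
}
```

Starting from a zeroed struct storm_detector, sixteen calls with closely
spaced timestamps flip it into storm mode, while calls spaced two seconds
apart never do. The sketch leaves out the global cmci_storm_on_cpus count
and the switch to polling.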

Subject: [tip: ras/core] x86/mce: Handle Intel threshold interrupt storms

The following commit has been merged into the ras/core branch of tip:

Commit-ID: 1f68ce2a027250aeeb1756391110cdc4dc97c797
Gitweb: https://git.kernel.org/tip/1f68ce2a027250aeeb1756391110cdc4dc97c797
Author: Tony Luck <[email protected]>
AuthorDate: Wed, 15 Nov 2023 11:54:50 -08:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Fri, 15 Dec 2023 14:53:42 +01:00

x86/mce: Handle Intel threshold interrupt storms

Add an Intel specific hook into machine_check_poll() to keep track of
per-CPU, per-bank corrected error logs (with a stub for the
CONFIG_X86_MCE_INTEL=n case).

When a storm is observed the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate is
increased to once per second.

When a storm ends, reset the threshold in IA32_MCi_CTL2 back to its saved
non-storm value (1 unless a BIOS-set threshold is being honored), remove
the bank from the bitmap for polling, and change the polling rate back
to the default.

If a CPU with banks in storm mode is taken offline, the new CPU that
inherits ownership of those banks takes over management of storm(s) in
the inherited bank(s).

The cmci_discover() function was already very large. These changes
pushed it well over the top. Refactor with three helper functions to
bring it back under control.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
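
A rough userspace model of the threshold switch described in the commit
message, for anyone who wants to poke at the bit manipulation outside the
kernel. It is a sketch under assumptions: fake_ctl2[] and saved_threshold[]
are hypothetical stand-ins for the IA32_MCi_CTL2 MSRs and cmci_threshold[],
and there is no locking. The mask and enable-bit values are meant to
mirror MCI_CTL2_CMCI_THRESHOLD_MASK and MCI_CTL2_CMCI_EN.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NR_BANKS        64
#define CTL2_THRESHOLD_MASK 0x7fffULL		/* threshold field, bits 14:0 */
#define CTL2_CMCI_EN        (1ULL << 30)	/* CMCI enable bit */
#define STORM_THRESHOLD     32749		/* just under the 0x7FFF maximum */

/* Hypothetical stand-ins for the per-bank MSRs and the saved defaults. */
uint64_t fake_ctl2[MAX_NR_BANKS];
uint16_t saved_threshold[MAX_NR_BANKS];

/* Replace only the threshold field; every other bit (e.g. EN) is kept. */
void set_threshold(int bank, unsigned int thresh)
{
	uint64_t val = fake_ctl2[bank];

	val &= ~CTL2_THRESHOLD_MASK;
	fake_ctl2[bank] = val | (thresh & CTL2_THRESHOLD_MASK);
}

/* Storm on: raise the threshold. Storm off: restore the saved value. */
void handle_storm(int bank, bool on)
{
	set_threshold(bank, on ? STORM_THRESHOLD : saved_threshold[bank]);
}
```

With fake_ctl2[3] = CTL2_CMCI_EN | 1 and saved_threshold[3] = 1, calling
handle_storm(3, true) leaves the enable bit set with a threshold of 32749,
and handle_storm(3, false) restores the original register value.
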
arch/x86/kernel/cpu/mce/intel.c | 205 ++++++++++++++++++++-------
arch/x86/kernel/cpu/mce/internal.h | 2 +-
arch/x86/kernel/cpu/mce/threshold.c | 3 +-
3 files changed, 160 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index fc4ffc4..399b62e 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -54,8 +54,27 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
*/
static DEFINE_SPINLOCK(cmci_poll_lock);

+/* Linux non-storm CMCI threshold (may be overridden by BIOS) */
#define CMCI_THRESHOLD 1

+/*
+ * MCi_CTL2 threshold for each bank when there is no storm.
+ * Default value for each bank may have been set by BIOS.
+ */
+static u16 cmci_threshold[MAX_NR_BANKS];
+
+/*
+ * High threshold to limit CMCI rate during storms. Max supported is
+ * 0x7FFF. Use this slightly smaller value so it has a distinctive
+ * signature when someone asks "Why am I not seeing all corrected errors?"
+ * A high threshold is used instead of just disabling CMCI for a
+ * bank because both corrected and uncorrected errors may be logged
+ * in the same bank and signalled with CMCI. The threshold only applies
+ * to corrected errors, so keeping CMCI enabled means that uncorrected
+ * errors will still be processed in a timely fashion.
+ */
+#define CMCI_STORM_THRESHOLD 32749
+
static int cmci_supported(int *banks)
{
u64 cap;
@@ -111,6 +130,31 @@ static bool lmce_supported(void)
}

/*
+ * Set a new CMCI threshold value. Preserve the state of the
+ * MCI_CTL2_CMCI_EN bit in case this happens during a
+ * cmci_rediscover() operation.
+ */
+static void cmci_set_threshold(int bank, int thresh)
+{
+ unsigned long flags;
+ u64 val;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val | thresh);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+void mce_intel_handle_storm(int bank, bool on)
+{
+ if (on)
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ else
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+}
+
+/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
* This could in theory increase the threshold under high load,
@@ -122,71 +166,129 @@ static void intel_threshold_interrupt(void)
}

/*
+ * Check all the reasons why current CPU cannot claim
+ * ownership of a bank.
+ * 1: CPU already owns this bank
+ * 2: BIOS owns this bank
+ * 3: Some other CPU owns this bank
+ */
+static bool cmci_skip_bank(int bank, u64 *val)
+{
+ unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
+
+ if (test_bit(bank, owned))
+ return true;
+
+ /* Skip banks in firmware first mode */
+ if (test_bit(bank, mce_banks_ce_disabled))
+ return true;
+
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), *val);
+
+ /* Already owned by someone else? */
+ if (*val & MCI_CTL2_CMCI_EN) {
+ clear_bit(bank, owned);
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Decide which CMCI interrupt threshold to use:
+ * 1: If this bank is in storm mode from whichever CPU was
+ * the previous owner, stay in storm mode.
+ * 2: If ignoring any threshold set by BIOS, set Linux default
+ * 3: Try to honor BIOS threshold (unless buggy BIOS set it at zero).
+ */
+static u64 cmci_pick_threshold(u64 val, int *bios_zero_thresh)
+{
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ return val;
+
+ if (!mca_cfg.bios_cmci_threshold) {
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ val |= CMCI_THRESHOLD;
+ } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
+ /*
+ * If bios_cmci_threshold boot option was specified
+ * but the threshold is zero, we'll try to initialize
+ * it to 1.
+ */
+ *bios_zero_thresh = 1;
+ val |= CMCI_THRESHOLD;
+ }
+
+ return val;
+}
+
+/*
+ * Try to claim ownership of a bank.
+ */
+static void cmci_claim_bank(int bank, u64 val, int bios_zero_thresh, int *bios_wrong_thresh)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ val |= MCI_CTL2_CMCI_EN;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+
+ /* If the enable bit did not stick, this bank should be polled. */
+ if (!(val & MCI_CTL2_CMCI_EN)) {
+ WARN_ON(!test_bit(bank, this_cpu_ptr(mce_poll_banks)));
+ storm->banks[bank].poll_only = true;
+ return;
+ }
+
+ /* This CPU successfully set the enable bit. */
+ set_bit(bank, (void *)this_cpu_ptr(&mce_banks_owned));
+
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
+ pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), bank);
+ mce_inherit_storm(bank);
+ cmci_storm_begin(bank);
+ } else {
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ }
+
+ /*
+ * We are able to set thresholds for some banks that
+ * had a threshold of 0. This means the BIOS has not
+ * set the thresholds properly or does not work with
+ * this boot option. Note down now and report later.
+ */
+ if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
+ (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
+ *bios_wrong_thresh = 1;
+
+ /* Save default threshold for each bank */
+ if (cmci_threshold[bank] == 0)
+ cmci_threshold[bank] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
+}
+
+/*
* Enable CMCI (Corrected Machine Check Interrupt) for available MCE banks
* on this CPU. Use the algorithm recommended in the SDM to discover shared
- * banks.
+ * banks. Called during initial bootstrap, and also for hotplug CPU operations
+ * to rediscover/reassign machine check banks.
*/
static void cmci_discover(int banks)
{
- unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
+ int bios_wrong_thresh = 0;
unsigned long flags;
int i;
- int bios_wrong_thresh = 0;

raw_spin_lock_irqsave(&cmci_discover_lock, flags);
for (i = 0; i < banks; i++) {
u64 val;
int bios_zero_thresh = 0;

- if (test_bit(i, owned))
- continue;
-
- /* Skip banks in firmware first mode */
- if (test_bit(i, mce_banks_ce_disabled))
+ if (cmci_skip_bank(i, &val))
continue;

- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Already owned by someone else? */
- if (val & MCI_CTL2_CMCI_EN) {
- clear_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
- continue;
- }
-
- if (!mca_cfg.bios_cmci_threshold) {
- val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
- val |= CMCI_THRESHOLD;
- } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
- /*
- * If bios_cmci_threshold boot option was specified
- * but the threshold is zero, we'll try to initialize
- * it to 1.
- */
- bios_zero_thresh = 1;
- val |= CMCI_THRESHOLD;
- }
-
- val |= MCI_CTL2_CMCI_EN;
- wrmsrl(MSR_IA32_MCx_CTL2(i), val);
- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Did the enable bit stick? -- the bank supports CMCI */
- if (val & MCI_CTL2_CMCI_EN) {
- set_bit(i, owned);
- __clear_bit(i, this_cpu_ptr(mce_poll_banks));
- /*
- * We are able to set thresholds for some banks that
- * had a threshold of 0. This means the BIOS has not
- * set the thresholds properly or does not work with
- * this boot option. Note down now and report later.
- */
- if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
- (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
- bios_wrong_thresh = 1;
- } else {
- WARN_ON(!test_bit(i, this_cpu_ptr(mce_poll_banks)));
- }
+ val = cmci_pick_threshold(val, &bios_zero_thresh);
+ cmci_claim_bank(i, val, bios_zero_thresh, &bios_wrong_thresh);
}
raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
if (mca_cfg.bios_cmci_threshold && bios_wrong_thresh) {
@@ -225,6 +327,9 @@ static void __cmci_disable_bank(int bank)
val &= ~MCI_CTL2_CMCI_EN;
wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
__clear_bit(bank, this_cpu_ptr(mce_banks_owned));
+
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ cmci_storm_end(bank);
}

/*
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 157b2f2..01f8f03 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -41,6 +41,7 @@ struct dentry *mce_get_debugfs_dir(void);
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_handle_storm(int bank, bool on);
void cmci_disable_bank(int bank);
void intel_init_cmci(void);
void intel_init_lmce(void);
@@ -48,6 +49,7 @@ void intel_clear_lmce(void);
bool intel_filter_mce(struct mce *m);
bool intel_mce_usable_address(struct mce *m);
#else
+static inline void mce_intel_handle_storm(int bank, bool on) { }
static inline void cmci_disable_bank(int bank) { }
static inline void intel_init_cmci(void) { }
static inline void intel_init_lmce(void) { }
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index 0e19884..89e31e1 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -60,6 +60,9 @@ void mce_set_storm_mode(bool storm)
static void mce_handle_storm(unsigned int bank, bool on)
{
switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_handle_storm(bank, on);
+ break;
}
}
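
The three-way decision in cmci_pick_threshold() is easy to check in
isolation. Below is a minimal userspace sketch of that logic, not the
kernel function itself: pick_threshold() and its honor_bios parameter
(standing in for mca_cfg.bios_cmci_threshold) are illustrative names, and
the mask is assumed to match MCI_CTL2_CMCI_THRESHOLD_MASK.

```c
#include <stdbool.h>
#include <stdint.h>

#define CTL2_THRESHOLD_MASK 0x7fffULL	/* threshold field, bits 14:0 */
#define DEFAULT_THRESHOLD   1		/* Linux non-storm default */
#define STORM_THRESHOLD     32749	/* signature of a bank in storm mode */

/*
 * Return the CTL2 value to write when claiming a bank.
 * 'honor_bios' is a hypothetical stand-in for mca_cfg.bios_cmci_threshold.
 */
uint64_t pick_threshold(uint64_t val, bool honor_bios, bool *bios_zero)
{
	/* 1: bank inherited in storm mode, keep the storm threshold. */
	if ((val & CTL2_THRESHOLD_MASK) == STORM_THRESHOLD)
		return val;

	if (!honor_bios) {
		/* 2: ignore whatever BIOS set, use the Linux default. */
		val &= ~CTL2_THRESHOLD_MASK;
		val |= DEFAULT_THRESHOLD;
	} else if (!(val & CTL2_THRESHOLD_MASK)) {
		/* 3: BIOS left the threshold at zero, fall back to 1. */
		*bios_zero = true;
		val |= DEFAULT_THRESHOLD;
	}

	return val;
}
```

A non-zero BIOS value such as 100 survives only when honor_bios is true,
and a value of 32749 is returned untouched in either mode. That second
property is what lets a newly claiming CPU recognise an inherited storm.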


2023-12-15 17:28:50

by Luck, Tony

Subject: RE: [tip: ras/core] x86/mce: Handle Intel threshold interrupt storms

> The following commit has been merged into the ras/core branch of tip:
>
> Commit-ID: 1f68ce2a027250aeeb1756391110cdc4dc97c797
> Gitweb: https://git.kernel.org/tip/1f68ce2a027250aeeb1756391110cdc4dc97c797
> Author: Tony Luck <[email protected]>
> AuthorDate: Wed, 15 Nov 2023 11:54:50 -08:00
> Committer: Borislav Petkov (AMD) <[email protected]>
> CommitterDate: Fri, 15 Dec 2023 14:53:42 +01:00

Early X-Mas present for me! Thanks Boris.

-Tony

2023-12-16 15:55:55

by Borislav Petkov

Subject: Re: [tip: ras/core] x86/mce: Handle Intel threshold interrupt storms

On Fri, Dec 15, 2023 at 05:21:12PM +0000, Luck, Tony wrote:
> > The following commit has been merged into the ras/core branch of tip:
> >
> > Commit-ID: 1f68ce2a027250aeeb1756391110cdc4dc97c797
> > Gitweb: https://git.kernel.org/tip/1f68ce2a027250aeeb1756391110cdc4dc97c797
> > Author: Tony Luck <[email protected]>
> > AuthorDate: Wed, 15 Nov 2023 11:54:50 -08:00
> > Committer: Borislav Petkov (AMD) <[email protected]>
> > CommitterDate: Fri, 15 Dec 2023 14:53:42 +01:00
>
> Early X-Mas present for me! Thanks Boris.

:-)

You're welcome - thanks for answering my silly questions.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette