LinuxLists.cc - [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

2014-12-05 02:31:13

Subject: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

The Intel CMCI interrupt handler calls mce_timer_kick() to force more
frequent polling for MCE events when a CMCI storm occurs and CMCI
interrupts are subsequently disabled.

If a CMCI interrupt storm happens to be detected while the timer
interrupt is executing timer functions, mce_timer_kick() can race with
mce_timer_fn(), which results in a double-add and the following BUG:

#0 [ffff88047fda3ad0] machine_kexec at ffffffff8102bdf5
#1 [ffff88047fda3b20] crash_kexec at ffffffff8109e788
#2 [ffff88047fda3bf0] oops_end at ffffffff815f20e8
#3 [ffff88047fda3c20] die at ffffffff81005c08
#4 [ffff88047fda3c50] do_trap at ffffffff815f192b
#5 [ffff88047fda3cb0] do_invalid_op at ffffffff81002f42
#6 [ffff88047fda3d60] invalid_op at ffffffff815fa668
[exception RIP: add_timer_on+234]
RIP: ffffffff8104d05a RSP: ffff88047fda3e18 RFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88047fdacbc0 RCX: 000000001fbee3ff
RDX: ffff88047fda0000 RSI: 000000000000001d RDI: ffff88047fdacbc0
RBP: ffff88047fda3e58 R8: 0000000000000000 R9: ffffffff81aa0940
R10: 0720072007200720 R11: 0720072007200765 R12: ffff880474a6c000
R13: 0000000000000101 R14: 000000000000001d R15: ffff88047fdacbc0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff88047fda3e60] mce_timer_fn at ffffffff8101f524
#8 [ffff88047fda3e80] call_timer_fn at ffffffff8104b4fa
#9 [ffff88047fda3ec0] run_timer_softirq at ffffffff8104ce70

The timer_add() in mce_timer_kick() is actually unnecessary: since the
timer is re-added by its handler function, the only case in which the
timer doesn't exist is when the CMCI interrupt calls mce_timer_kick() in
the interval between the timer firing and mce_timer_fn() actually being
executed. Thus, the timer work will be performed by mce_timer_fn() just
after the interrupt exits.

This patch removes the add_timer() from mce_timer_kick(), and disables
local interrupts during mce_timer_fn() so that mce_timer_fn() will
always pick up the timer interval value that mce_timer_kick() drops
in the PERCPU variable.

This means that the CMCI interrupt that hits the storm threshold will
call mce_timer_kick() either:

1) In the interval between the mce_timer firing and mce_timer_fn()
disabling local IRQs. In this case, mce_timer_fn() will
immediately execute after the CMCI handler exits, and will
use the interval loaded in the PERCPU variable from
mce_timer_kick() to calculate its next timer interval.

2) Happen after mce_timer_fn() has done its work, in which case
the existing timer will be updated with the new interval if
it is before the existing one.

Signed-off-by: Calvin Owens <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 61a9668ce..7074a90 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1286,7 +1286,7 @@ static int cmc_error_seen(void)
static void mce_timer_fn(unsigned long data)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- unsigned long iv;
+ unsigned long iv, flags;
int notify;

WARN_ON(smp_processor_id() != data);
@@ -1301,6 +1301,9 @@ static void mce_timer_fn(unsigned long data)
* Alert userspace if needed. If we logged an MCE, reduce the
* polling interval, otherwise increase the polling interval.
*/
+
+ local_irq_save(flags);
+
iv = __this_cpu_read(mce_next_interval);
notify = mce_notify_irq();
notify |= cmc_error_seen();
@@ -1316,6 +1319,8 @@ static void mce_timer_fn(unsigned long data)
t->expires = jiffies + iv;
add_timer_on(t, smp_processor_id());
}
+
+ local_irq_restore(flags);
}

/*
@@ -1330,9 +1335,6 @@ void mce_timer_kick(unsigned long interval)
if (timer_pending(t)) {
if (time_before(when, t->expires))
mod_timer_pinned(t, when);
- } else {
- t->expires = round_jiffies(when);
- add_timer_on(t, smp_processor_id());
}
if (interval < iv)
__this_cpu_write(mce_next_interval, interval);
--
2.1.1

2014-12-09 18:08:42

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Thu, Dec 04, 2014 at 06:29:35PM -0800, Calvin Owens wrote:
> The Intel CMCI interrupt handler calls mce_timer_kick() to force more
> frequent polling for MCE events when a CMCI storm occurs and CMCI
> interrupts are subsequently disabled.
>
> If a CMCI interrupt storm happens to be detected while the timer
> interrupt is executing timer functions, mce_timer_kick() can race with
> mce_timer_fn(), which results in a double-add and the following BUG:
>
> #0 [ffff88047fda3ad0] machine_kexec at ffffffff8102bdf5
> #1 [ffff88047fda3b20] crash_kexec at ffffffff8109e788
> #2 [ffff88047fda3bf0] oops_end at ffffffff815f20e8
> #3 [ffff88047fda3c20] die at ffffffff81005c08
> #4 [ffff88047fda3c50] do_trap at ffffffff815f192b
> #5 [ffff88047fda3cb0] do_invalid_op at ffffffff81002f42
> #6 [ffff88047fda3d60] invalid_op at ffffffff815fa668
> [exception RIP: add_timer_on+234]

Do I understand this correctly?

This is

BUG_ON(timer_pending(timer) || !timer->function);

in add_timer_on() and the first check, at that, the timer_pending()?

> RIP: ffffffff8104d05a RSP: ffff88047fda3e18 RFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffff88047fdacbc0 RCX: 000000001fbee3ff
> RDX: ffff88047fda0000 RSI: 000000000000001d RDI: ffff88047fdacbc0
> RBP: ffff88047fda3e58 R8: 0000000000000000 R9: ffffffff81aa0940
> R10: 0720072007200720 R11: 0720072007200765 R12: ffff880474a6c000
> R13: 0000000000000101 R14: 000000000000001d R15: ffff88047fdacbc0
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #7 [ffff88047fda3e60] mce_timer_fn at ffffffff8101f524
> #8 [ffff88047fda3e80] call_timer_fn at ffffffff8104b4fa
> #9 [ffff88047fda3ec0] run_timer_softirq at ffffffff8104ce70
>
> The timer_add() in mce_timer_kick() is actually unnecessary: since the
> timer is re-added by its handler function,

What about the case where iv can become 0?

/* Might have become 0 after CMCI storm subsided */
if (iv) {
t->expires = jiffies + iv;
add_timer_on(t, smp_processor_id());
}

I have to say that whole story of iv becoming 0 doesn't sound all too
sane to me, though. This interval either decreases when polling - but up
to HZ/100 jiffies and not below. So I don't see how that would become 0
ever!

Regardless, you need to readd the timer because you won't be able to
poll the MCE banks in a storm mode.

> the only case in which the
> timer doesn't exist is when the CMCI interrupt calls mce_timer_kick() in
> the interval between the timer firing and mce_timer_fn() actually being
> executed. Thus, the timer work will be performed by mce_timer_fn() just
> after the interrupt exits.
>
> This patch removes the add_timer() from mce_timer_kick(), and disables
> local interrupts during mce_timer_fn() so that mce_timer_fn() will
> always pick up the timer interval value that mce_timer_kick() drops
> in the PERCPU variable.
>
> This means that the CMCI interrupt that hits the storm threshold will
> call mce_timer_kick() either:
>
> 1) In the interval between the mce_timer firing and mce_timer_fn()
> disabling local IRQs. In this case, mce_timer_fn() will
> immediately execute after the CMCI handler exits, and will
> use the interval loaded in the PERCPU variable from
> mce_timer_kick() to calculate its next timer interval.
>
> 2) Happen after mce_timer_fn() has done its work, in which case
> the existing timer will be updated with the new interval if
> it is before the existing one.

Right, so this polling thing once again proves its fragility to me - we
have had problems with it in the past so maybe we should think about
replacing it with something simpler and and much more robust instead of
this flaky dynamically adjustable polling thing.

So I'm thinking of leaving the detection code as it is, when we detect
a storm on a CPU, we set CMCI_STORM_ACTIVE and start a kernel thread at
max freq HZ/100 and polling the MCA banks. No adjustable frequency, no
timers, no nothing. A stupid per-cpu thread which polls.

Then, once we haven't seen any more CMCI errors - cmc_error_seen() - we
park that thread and switch back to interrupt mode.

I think this should be a lot simpler and hopefully more robust.

Downsides? I don't see any. We will normally poll for only short periods
of time. On boxes where we have to poll for much longer, the current
mechanism will bring us to max frequency polling anyway. But I might be
missing some nasty use cases...

Tony, what do you think?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-12-09 23:00:13

by Luck, Tony

[permalink] [raw]

Subject: RE: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

> Right, so this polling thing once again proves its fragility to me - we
> have had problems with it in the past so maybe we should think about
> replacing it with something simpler and and much more robust instead of
> this flaky dynamically adjustable polling thing.

Dynamic intervals for polling make sense for cpus that don't support
CMCI. We need to check occasionally to see if there are any corrected errors,
but we don't want to waste a lot of cpu time doing that too often. There are
almost never any errors to be found. So begin polling at 5 minute intervals
(eternity on a multi-GHz cpu). If we do find an error, then look more frequently,
because there are several cases where a single error source might generate
multiple errors (e.g. stuck bit).

But then we came along an co-opted this mechanism for CMCI storm
control. And you are right that we made things needlessly complex
by using the same variable rate mechanism. If we had a storm, we know
we are having a high rate of errors (15 in one second) ... so we just want
to poll at a high-ish rate to collect a good sample of subsequent errors.
Also to detect when the storm ends in a timely manner. So we don't
gain much by tweaking the poll rate, and we have complex code.

> So I'm thinking of leaving the detection code as it is, when we detect
> a storm on a CPU, we set CMCI_STORM_ACTIVE and start a kernel thread at
> max freq HZ/100 and polling the MCA banks. No adjustable frequency, no
> timers, no nothing. A stupid per-cpu thread which polls.

Go for it.

-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2014-12-10 03:12:13

by Calvin Owens

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Tuesday 12/09 at 23:00 +0000, Luck, Tony wrote:
> > Right, so this polling thing once again proves its fragility to me - we
> > have had problems with it in the past so maybe we should think about
> > replacing it with something simpler and and much more robust instead of
> > this flaky dynamically adjustable polling thing.
>
> Dynamic intervals for polling make sense for cpus that don't support
> CMCI. We need to check occasionally to see if there are any corrected errors,
> but we don't want to waste a lot of cpu time doing that too often. There are
> almost never any errors to be found. So begin polling at 5 minute intervals
> (eternity on a multi-GHz cpu). If we do find an error, then look more frequently,
> because there are several cases where a single error source might generate
> multiple errors (e.g. stuck bit).
>
> But then we came along an co-opted this mechanism for CMCI storm
> control. And you are right that we made things needlessly complex
> by using the same variable rate mechanism. If we had a storm, we know
> we are having a high rate of errors (15 in one second) ... so we just want
> to poll at a high-ish rate to collect a good sample of subsequent errors.
> Also to detect when the storm ends in a timely manner. So we don't
> gain much by tweaking the poll rate, and we have complex code.
>
> > So I'm thinking of leaving the detection code as it is, when we detect
> > a storm on a CPU, we set CMCI_STORM_ACTIVE and start a kernel thread at
> > max freq HZ/100 and polling the MCA banks. No adjustable frequency, no
> > timers, no nothing. A stupid per-cpu thread which polls.
>
> Go for it.

Just to make sure I understand what you're looking for:

When MCE is initialized, spawn a kthread for each CPU instead of the
current timers. If CMCI is supported, we just leave this thread parked,
and only process errors from the CMCI interrupt handler.

When a CMCI storm happens, we disable CMCI interrupts and kick the
kthread, which polls every HZ/100 until the storm has subsided, at which
point it re-enables CMCI interrupts and parks itself.

If CMCI isn't supported though, how is the polling done? You said the
dynamic interval is desirable, wouldn't that need to be in the kthread?
Having both the kthread and the timer around seems ugly, even if only
one is used on a given machine.

Thanks,
Calvin

2014-12-10 19:23:47

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Tue, Dec 09, 2014 at 07:11:02PM -0800, Calvin Owens wrote:
> Just to make sure I understand what you're looking for:
>
> When MCE is initialized, spawn a kthread for each CPU instead of the
> current timers. If CMCI is supported, we just leave this thread parked,
> and only process errors from the CMCI interrupt handler.
>
> When a CMCI storm happens, we disable CMCI interrupts and kick the
> kthread, which polls every HZ/100 until the storm has subsided, at which
> point it re-enables CMCI interrupts and parks itself.
>
> If CMCI isn't supported though, how is the polling done? You said the
> dynamic interval is desirable, wouldn't that need to be in the kthread?
> Having both the kthread and the timer around seems ugly, even if only
> one is used on a given machine.

Ok, so in talking with the guys here internally it sounds to me like
a kernel thread or a workqueue or whatever other facility relying on
wake_up_process() we take, would have scheduling latency in there and
get delayed when polling the MCA banks. In a storm condition, this might
not be such a good idea.

So we maybe better off with the timer interrupt after all but try
to decouple the dynamic adjusting of polling frequency for non-CMCI
machines and do an on/off simpler polling on CMCI machines.

So what I'm thinking of is:

* once we've detected CMCI storm, we immediately start polling with
CMCI_STORM_INTERVAL, i.e. once per second.

* as long as we keep seeing errors during polling in storm mode, we keep
that polling frequency.

* the moment we don't see any errors anymore, we go to
CMCI_STORM_SUBSIDED and then eventually to CMCI_STORM_NONE.

The code remains unchanged for CMCI machines which are not in storm
mode and for non-CMCI machines.

Anyway, this below is completely untested but it seems simpler to me and
does away with the adaptive logic for CMCI storms (you might want to
apply it first as the diff is hardly readable and this code is nasty as
hell anyway).

Thoughts?

---
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..1b9733614e4c 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -108,6 +108,7 @@ struct mca_config {
bool disabled;
bool ser;
bool bios_cmci_threshold;
+ bool cmci;
u8 banks;
s8 bootlog;
int tolerant;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..6e4984fc37ce 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -30,18 +30,19 @@ extern struct mce_bank *mce_banks;
extern mce_banks_t mce_banks_ce_disabled;

#ifdef CONFIG_X86_MCE_INTEL
-unsigned long mce_intel_adjust_timer(unsigned long interval);
+unsigned long cmci_intel_adjust_timer(unsigned long interval);
void mce_intel_cmci_poll(void);
void mce_intel_hcpu_update(unsigned long cpu);
void cmci_disable_bank(int bank);
#else
-# define mce_intel_adjust_timer mce_adjust_timer_default
+# define cmci_intel_adjust_timer mce_adjust_timer_default
static inline void mce_intel_cmci_poll(void) { }
static inline void mce_intel_hcpu_update(unsigned long cpu) { }
static inline void cmci_disable_bank(int bank) { }
#endif

void mce_timer_kick(unsigned long interval);
+int ce_error_seen(void);

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d2c611699cd9..e3f426698164 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1324,7 +1324,7 @@ static unsigned long mce_adjust_timer_default(unsigned long interval)
static unsigned long (*mce_adjust_timer)(unsigned long interval) =
mce_adjust_timer_default;

-static int cmc_error_seen(void)
+int ce_error_seen(void)
{
unsigned long *v = this_cpu_ptr(&mce_polled_error);

@@ -1334,36 +1334,36 @@ static int cmc_error_seen(void)
static void mce_timer_fn(unsigned long data)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
+ int cpu = smp_processor_id();
unsigned long iv;
- int notify;

- WARN_ON(smp_processor_id() != data);
+ WARN_ON(cpu != data);

if (mce_available(this_cpu_ptr(&cpu_info))) {
- machine_check_poll(MCP_TIMESTAMP,
- this_cpu_ptr(&mce_poll_banks));
+ machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
mce_intel_cmci_poll();
}

+ iv = __this_cpu_read(mce_next_interval);
+
+ if (mca_cfg.cmci) {
+ iv = mce_adjust_timer(iv);
+ goto done;
+ }
+
/*
- * Alert userspace if needed. If we logged an MCE, reduce the
- * polling interval, otherwise increase the polling interval.
+ * Alert userspace if needed. If we logged an MCE, reduce the polling
+ * interval, otherwise increase the polling interval.
*/
- iv = __this_cpu_read(mce_next_interval);
- notify = mce_notify_irq();
- notify |= cmc_error_seen();
- if (notify) {
+ if (mce_notify_irq())
iv = max(iv / 2, (unsigned long) HZ/100);
- } else {
+ else
iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
- iv = mce_adjust_timer(iv);
- }
+
+done:
__this_cpu_write(mce_next_interval, iv);
- /* Might have become 0 after CMCI storm subsided */
- if (iv) {
- t->expires = jiffies + iv;
- add_timer_on(t, smp_processor_id());
- }
+ t->expires = jiffies + iv;
+ add_timer_on(t, cpu);
}

/*
@@ -1682,7 +1682,7 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
switch (c->x86_vendor) {
case X86_VENDOR_INTEL:
mce_intel_feature_init(c);
- mce_adjust_timer = mce_intel_adjust_timer;
+ mce_adjust_timer = cmci_intel_adjust_timer;
break;
case X86_VENDOR_AMD:
mce_amd_feature_init(c);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b3c97bafc123..6b8cbeaaca88 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -46,7 +46,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

#define CMCI_THRESHOLD 1
#define CMCI_POLL_INTERVAL (30 * HZ)
-#define CMCI_STORM_INTERVAL (1 * HZ)
+#define CMCI_STORM_INTERVAL (HZ)
#define CMCI_STORM_THRESHOLD 15

static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
@@ -68,6 +68,9 @@ static int cmci_supported(int *banks)
if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
return 0;

+ if (mca_cfg.cmci)
+ return 1;
+
/*
* Vendor check is not strictly needed, but the initial
* initialization is vendor keyed and this
@@ -79,7 +82,9 @@ static int cmci_supported(int *banks)
return 0;
rdmsrl(MSR_IA32_MCG_CAP, cap);
*banks = min_t(unsigned, MAX_NR_BANKS, cap & 0xff);
- return !!(cap & MCG_CMCI_P);
+ mca_cfg.cmci = !!(cap & MCG_CMCI_P);
+
+ return mca_cfg.cmci;
}

void mce_intel_cmci_poll(void)
@@ -97,12 +102,13 @@ void mce_intel_hcpu_update(unsigned long cpu)
per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
}

-unsigned long mce_intel_adjust_timer(unsigned long interval)
+unsigned long cmci_intel_adjust_timer(unsigned long interval)
{
- int r;
-
- if (interval < CMCI_POLL_INTERVAL)
- return interval;
+ if (ce_error_seen() &&
+ (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
+ mce_notify_irq();
+ return CMCI_STORM_INTERVAL;
+ }

switch (__this_cpu_read(cmci_storm_state)) {
case CMCI_STORM_ACTIVE:
@@ -112,8 +118,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
* timer interval is back to our poll interval.
*/
__this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
- r = atomic_sub_return(1, &cmci_storm_on_cpus);
- if (r == 0)
+ if (!atomic_sub_return(1, &cmci_storm_on_cpus))
pr_notice("CMCI storm subsided: switching to interrupt mode\n");
/* FALLTHROUGH */

@@ -178,7 +183,7 @@ static bool cmci_storm_detect(void)
cmci_storm_disable_banks();
__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
r = atomic_add_return(1, &cmci_storm_on_cpus);
- mce_timer_kick(CMCI_POLL_INTERVAL);
+ mce_timer_kick(CMCI_STORM_INTERVAL);

if (r == 1)
pr_notice("CMCI storm detected: switching to poll mode\n");

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-12-11 02:35:17

by Calvin Owens

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Wednesday 12/10 at 20:23 +0100, Borislav Petkov wrote:
> On Tue, Dec 09, 2014 at 07:11:02PM -0800, Calvin Owens wrote:
> > Just to make sure I understand what you're looking for:
> >
> > When MCE is initialized, spawn a kthread for each CPU instead of the
> > current timers. If CMCI is supported, we just leave this thread parked,
> > and only process errors from the CMCI interrupt handler.
> >
> > When a CMCI storm happens, we disable CMCI interrupts and kick the
> > kthread, which polls every HZ/100 until the storm has subsided, at which
> > point it re-enables CMCI interrupts and parks itself.
> >
> > If CMCI isn't supported though, how is the polling done? You said the
> > dynamic interval is desirable, wouldn't that need to be in the kthread?
> > Having both the kthread and the timer around seems ugly, even if only
> > one is used on a given machine.
>
> Ok, so in talking with the guys here internally it sounds to me like
> a kernel thread or a workqueue or whatever other facility relying on
> wake_up_process() we take, would have scheduling latency in there and
> get delayed when polling the MCA banks. In a storm condition, this might
> not be such a good idea.
>
> So we maybe better off with the timer interrupt after all but try
> to decouple the dynamic adjusting of polling frequency for non-CMCI
> machines and do an on/off simpler polling on CMCI machines.
>
> So what I'm thinking of is:
>
> * once we've detected CMCI storm, we immediately start polling with
> CMCI_STORM_INTERVAL, i.e. once per second.
>
> * as long as we keep seeing errors during polling in storm mode, we keep
> that polling frequency.
>
> * the moment we don't see any errors anymore, we go to
> CMCI_STORM_SUBSIDED and then eventually to CMCI_STORM_NONE.
>
> The code remains unchanged for CMCI machines which are not in storm
> mode and for non-CMCI machines.
>
> Anyway, this below is completely untested but it seems simpler to me and
> does away with the adaptive logic for CMCI storms (you might want to
> apply it first as the diff is hardly readable and this code is nasty as
> hell anyway).
>
> Thoughts?

This doens't fix the original issue though: the timer double-add can
still happen if the CMCI interrupt that hits the STORM threshold pops
off during mce_timer_fn() and calls mce_timer_kick().

The polling is unnecessary on a CMCI-capable machine except in STORMs,
right? Wouldn't it be better to eliminate it in that case?

That said, I ran mce-test with your patch on top of 3.18, looks good
there. But that doesn't simulate CMCI interrupts.

Thanks,
Calvin

> ---
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 51b26e895933..1b9733614e4c 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -108,6 +108,7 @@ struct mca_config {
> bool disabled;
> bool ser;
> bool bios_cmci_threshold;
> + bool cmci;
> u8 banks;
> s8 bootlog;
> int tolerant;
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index 10b46906767f..6e4984fc37ce 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -30,18 +30,19 @@ extern struct mce_bank *mce_banks;
> extern mce_banks_t mce_banks_ce_disabled;
>
> #ifdef CONFIG_X86_MCE_INTEL
> -unsigned long mce_intel_adjust_timer(unsigned long interval);
> +unsigned long cmci_intel_adjust_timer(unsigned long interval);
> void mce_intel_cmci_poll(void);
> void mce_intel_hcpu_update(unsigned long cpu);
> void cmci_disable_bank(int bank);
> #else
> -# define mce_intel_adjust_timer mce_adjust_timer_default
> +# define cmci_intel_adjust_timer mce_adjust_timer_default
> static inline void mce_intel_cmci_poll(void) { }
> static inline void mce_intel_hcpu_update(unsigned long cpu) { }
> static inline void cmci_disable_bank(int bank) { }
> #endif
>
> void mce_timer_kick(unsigned long interval);
> +int ce_error_seen(void);
>
> #ifdef CONFIG_ACPI_APEI
> int apei_write_mce(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index d2c611699cd9..e3f426698164 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1324,7 +1324,7 @@ static unsigned long mce_adjust_timer_default(unsigned long interval)
> static unsigned long (*mce_adjust_timer)(unsigned long interval) =
> mce_adjust_timer_default;
>
> -static int cmc_error_seen(void)
> +int ce_error_seen(void)
> {
> unsigned long *v = this_cpu_ptr(&mce_polled_error);
>
> @@ -1334,36 +1334,36 @@ static int cmc_error_seen(void)
> static void mce_timer_fn(unsigned long data)
> {
> struct timer_list *t = this_cpu_ptr(&mce_timer);
> + int cpu = smp_processor_id();
> unsigned long iv;
> - int notify;
>
> - WARN_ON(smp_processor_id() != data);
> + WARN_ON(cpu != data);
>
> if (mce_available(this_cpu_ptr(&cpu_info))) {
> - machine_check_poll(MCP_TIMESTAMP,
> - this_cpu_ptr(&mce_poll_banks));
> + machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
> mce_intel_cmci_poll();
> }
>
> + iv = __this_cpu_read(mce_next_interval);
> +
> + if (mca_cfg.cmci) {
> + iv = mce_adjust_timer(iv);
> + goto done;
> + }
> +
> /*
> - * Alert userspace if needed. If we logged an MCE, reduce the
> - * polling interval, otherwise increase the polling interval.
> + * Alert userspace if needed. If we logged an MCE, reduce the polling
> + * interval, otherwise increase the polling interval.
> */
> - iv = __this_cpu_read(mce_next_interval);
> - notify = mce_notify_irq();
> - notify |= cmc_error_seen();
> - if (notify) {
> + if (mce_notify_irq())
> iv = max(iv / 2, (unsigned long) HZ/100);
> - } else {
> + else
> iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
> - iv = mce_adjust_timer(iv);
> - }
> +
> +done:
> __this_cpu_write(mce_next_interval, iv);
> - /* Might have become 0 after CMCI storm subsided */
> - if (iv) {
> - t->expires = jiffies + iv;
> - add_timer_on(t, smp_processor_id());
> - }
> + t->expires = jiffies + iv;
> + add_timer_on(t, cpu);
> }
>
> /*
> @@ -1682,7 +1682,7 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
> switch (c->x86_vendor) {
> case X86_VENDOR_INTEL:
> mce_intel_feature_init(c);
> - mce_adjust_timer = mce_intel_adjust_timer;
> + mce_adjust_timer = cmci_intel_adjust_timer;
> break;
> case X86_VENDOR_AMD:
> mce_amd_feature_init(c);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> index b3c97bafc123..6b8cbeaaca88 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -46,7 +46,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
>
> #define CMCI_THRESHOLD 1
> #define CMCI_POLL_INTERVAL (30 * HZ)
> -#define CMCI_STORM_INTERVAL (1 * HZ)
> +#define CMCI_STORM_INTERVAL (HZ)
> #define CMCI_STORM_THRESHOLD 15
>
> static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
> @@ -68,6 +68,9 @@ static int cmci_supported(int *banks)
> if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
> return 0;
>
> + if (mca_cfg.cmci)
> + return 1;
> +
> /*
> * Vendor check is not strictly needed, but the initial
> * initialization is vendor keyed and this
> @@ -79,7 +82,9 @@ static int cmci_supported(int *banks)
> return 0;
> rdmsrl(MSR_IA32_MCG_CAP, cap);
> *banks = min_t(unsigned, MAX_NR_BANKS, cap & 0xff);
> - return !!(cap & MCG_CMCI_P);
> + mca_cfg.cmci = !!(cap & MCG_CMCI_P);
> +
> + return mca_cfg.cmci;
> }
>
> void mce_intel_cmci_poll(void)
> @@ -97,12 +102,13 @@ void mce_intel_hcpu_update(unsigned long cpu)
> per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
> }
>
> -unsigned long mce_intel_adjust_timer(unsigned long interval)
> +unsigned long cmci_intel_adjust_timer(unsigned long interval)
> {
> - int r;
> -
> - if (interval < CMCI_POLL_INTERVAL)
> - return interval;
> + if (ce_error_seen() &&
> + (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
> + mce_notify_irq();
> + return CMCI_STORM_INTERVAL;
> + }
>
> switch (__this_cpu_read(cmci_storm_state)) {
> case CMCI_STORM_ACTIVE:
> @@ -112,8 +118,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
> * timer interval is back to our poll interval.
> */
> __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
> - r = atomic_sub_return(1, &cmci_storm_on_cpus);
> - if (r == 0)
> + if (!atomic_sub_return(1, &cmci_storm_on_cpus))
> pr_notice("CMCI storm subsided: switching to interrupt mode\n");
> /* FALLTHROUGH */
>
> @@ -178,7 +183,7 @@ static bool cmci_storm_detect(void)
> cmci_storm_disable_banks();
> __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
> r = atomic_add_return(1, &cmci_storm_on_cpus);
> - mce_timer_kick(CMCI_POLL_INTERVAL);
> + mce_timer_kick(CMCI_STORM_INTERVAL);
>
> if (r == 1)
> pr_notice("CMCI storm detected: switching to poll mode\n");
>
>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

2014-12-11 14:43:19

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Wed, Dec 10, 2014 at 06:34:38PM -0800, Calvin Owens wrote:
> This doens't fix the original issue though: the timer double-add can
> still happen if the CMCI interrupt that hits the STORM threshold pops
> off during mce_timer_fn() and calls mce_timer_kick().

Right, I guess we could do something similar to what you proposed
already (diff below ontop of my previous diff).

> The polling is unnecessary on a CMCI-capable machine except in STORMs,
> right? Wouldn't it be better to eliminate it in that case?

Right, the problem with error thresholding is that it only counts errors
and raises the APIC interrupt when a certain count has been reached.
But there's no way for the software to look at *each* error and try to
discover trends and stuff.

Thus, we've been working on something which looks at each memory error,
collects them and then decays them over time and acts only for the
significant ones by poisoning pages:

https://lkml.kernel.org/r/[email protected]

When we have that thing in place we can probably even go a step further
and simply disable error thresholding because it simply doesn't give us
what we need.

Btw, we would very much welcome your thoughs on this collector thing and
whether it would be usable in large installations. Or if not, what else
would it need added.

> That said, I ran mce-test with your patch on top of 3.18, looks good
> there. But that doesn't simulate CMCI interrupts.

Thanks, I had a machine here which was the perfect natural error
injector (has faulty DIMMs). I need to dig it out and play with it again
:)

Thanks.

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e3f426698164..8cf2f486b81e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,6 +1331,24 @@ int ce_error_seen(void)
return test_and_clear_bit(0, v);
}

+static void __restart_timer(struct timer_list *t, unsigned long interval)
+{
+ unsigned long when = jiffies + interval;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ if (timer_pending(t)) {
+ if (time_before(when, t->expires))
+ mod_timer_pinned(t, when);
+ } else {
+ t->expires = round_jiffies(when);
+ add_timer_on(t, smp_processor_id());
+ }
+
+ local_irq_restore(flags);
+}
+
static void mce_timer_fn(unsigned long data)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
@@ -1362,8 +1380,7 @@ static void mce_timer_fn(unsigned long data)

done:
__this_cpu_write(mce_next_interval, iv);
- t->expires = jiffies + iv;
- add_timer_on(t, cpu);
+ __restart_timer(t, iv);
}

/*
@@ -1372,16 +1389,10 @@ done:
void mce_timer_kick(unsigned long interval)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- unsigned long when = jiffies + interval;
unsigned long iv = __this_cpu_read(mce_next_interval);

- if (timer_pending(t)) {
- if (time_before(when, t->expires))
- mod_timer_pinned(t, when);
- } else {
- t->expires = round_jiffies(when);
- add_timer_on(t, smp_processor_id());
- }
+ __restart_timer(t, interval);
+
if (interval < iv)
__this_cpu_write(mce_next_interval, interval);
}

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-12-17 00:17:50

by Calvin Owens

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Thursday 12/11 at 15:43 +0100, Borislav Petkov wrote:
> On Wed, Dec 10, 2014 at 06:34:38PM -0800, Calvin Owens wrote:
> > This doens't fix the original issue though: the timer double-add can
> > still happen if the CMCI interrupt that hits the STORM threshold pops
> > off during mce_timer_fn() and calls mce_timer_kick().
>
> Right, I guess we could do something similar to what you proposed
> already (diff below ontop of my previous diff).

That looks good, thanks :)

> > The polling is unnecessary on a CMCI-capable machine except in STORMs,
> > right? Wouldn't it be better to eliminate it in that case?
>
> Right, the problem with error thresholding is that it only counts errors
> and raises the APIC interrupt when a certain count has been reached.
> But there's no way for the software to look at *each* error and try to
> discover trends and stuff.
>
> Thus, we've been working on something which looks at each memory error,
> collects them and then decays them over time and acts only for the
> significant ones by poisoning pages:
>
> https://urldefense.proofpoint.com/v1/url?u=https://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp%40alien8.de&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=oEb120Cp%2FehdhuY2M7qjelK5yT8IPB5WC2nEG4obDh4%3D%0A&m=Uvvje79YqtSSsaolAYcJ9N3FPGiWTIz7feUxwbhAkNc%3D%0A&s=c88a2b8458e7f07ba887d26414f2ce5b80f9180353d0895a7f4f8b47d19dbec8
>
> When we have that thing in place we can probably even go a step further
> and simply disable error thresholding because it simply doesn't give us
> what we need.
>
> Btw, we would very much welcome your thoughs on this collector thing and
> whether it would be usable in large installations. Or if not, what else
> would it need added.

Ooh I like this, I'll add some thoughts over in that thread.

> > That said, I ran mce-test with your patch on top of 3.18, looks good
> > there. But that doesn't simulate CMCI interrupts.
>
> Thanks, I had a machine here which was the perfect natural error
> injector (has faulty DIMMs). I need to dig it out and play with it again
> :)
>
> Thanks.
>
> ---
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index e3f426698164..8cf2f486b81e 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1331,6 +1331,24 @@ int ce_error_seen(void)
> return test_and_clear_bit(0, v);
> }
>
> +static void __restart_timer(struct timer_list *t, unsigned long interval)
> +{
> + unsigned long when = jiffies + interval;
> + unsigned long flags;
> +
> + local_irq_save(flags);
> +
> + if (timer_pending(t)) {
> + if (time_before(when, t->expires))
> + mod_timer_pinned(t, when);
> + } else {
> + t->expires = round_jiffies(when);
> + add_timer_on(t, smp_processor_id());
> + }
> +
> + local_irq_restore(flags);
> +}
> +
> static void mce_timer_fn(unsigned long data)
> {
> struct timer_list *t = this_cpu_ptr(&mce_timer);
> @@ -1362,8 +1380,7 @@ static void mce_timer_fn(unsigned long data)
>
> done:
> __this_cpu_write(mce_next_interval, iv);
> - t->expires = jiffies + iv;
> - add_timer_on(t, cpu);
> + __restart_timer(t, iv);
> }
>
> /*
> @@ -1372,16 +1389,10 @@ done:
> void mce_timer_kick(unsigned long interval)
> {
> struct timer_list *t = this_cpu_ptr(&mce_timer);
> - unsigned long when = jiffies + interval;
> unsigned long iv = __this_cpu_read(mce_next_interval);
>
> - if (timer_pending(t)) {
> - if (time_before(when, t->expires))
> - mod_timer_pinned(t, when);
> - } else {
> - t->expires = round_jiffies(when);
> - add_timer_on(t, smp_processor_id());
> - }
> + __restart_timer(t, interval);
> +
> if (interval < iv)
> __this_cpu_write(mce_next_interval, interval);
> }
>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

2014-12-17 21:19:18

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Tue, Dec 16, 2014 at 04:17:11PM -0800, Calvin Owens wrote:
> > Right, I guess we could do something similar to what you proposed
> > already (diff below ontop of my previous diff).
>
> That looks good, thanks :)

Ok, thanks. I'll try to give it a run once I get a quiet minute from all
the CVE fun recently.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2015-02-02 11:05:41

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

On Fri, Jan 23, 2015 at 12:03:35PM +0100, Borislav Petkov wrote:
> I've uploaded it as a branch if you want to give it a run as there are
> other changes to MCE pending for 3.20 which might cause merge conflicts:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git, branch mce-fb
>
> Shout if there are some issues remaining not addressed.

In the meantime, I've tested it successfully on an EINJ box too, will
queue it for 3.21 for extra simmering time in linux-next et al.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--