2015-02-23 09:15:09

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. This disablement is done via NMI, and the crashing
CPU wait for the completions by spinning at most for 1 second.
However, there is a race window if this NMI handling doesn't complete within
the 1 second on some CPU, which cause the fragile situation where only a
portion of online CPUs are responsive to MCE interrupt. If MCE happens during
this race window, MCE synchronization always timeouts and results in kernel
panic. So the user-visible effect of this bug is kdump failure.

Note that this race window did exist when current MCE handler was implemented
around 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has
reached a timeout") made it more visible by changing the default behavior of
the synchronization timeout from "ignore" to "panic".

This patch adds a global variable representing that the system is running
kdump code in order to "turn off" the MCE handling code in kdump context.

Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: <[email protected]> [2.6.32+]
---
arch/x86/include/asm/mce.h | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 13 +++++++++++++
arch/x86/kernel/crash.c | 8 ++++++++
3 files changed, 22 insertions(+)

diff --git v3.19.orig/arch/x86/include/asm/mce.h v3.19/arch/x86/include/asm/mce.h
index 51b26e895933..7ae9927d781a 100644
--- v3.19.orig/arch/x86/include/asm/mce.h
+++ v3.19/arch/x86/include/asm/mce.h
@@ -175,6 +175,7 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { }
#endif

int mce_available(struct cpuinfo_x86 *c);
+void cpu_emergency_mce_disable(void);

DECLARE_PER_CPU(unsigned, mce_exception_count);
DECLARE_PER_CPU(unsigned, mce_poll_count);
diff --git v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c v3.19/arch/x86/kernel/cpu/mcheck/mce.c
index 3112b79ace8e..3a155b9e276e 100644
--- v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ v3.19/arch/x86/kernel/cpu/mcheck/mce.c
@@ -87,6 +87,8 @@ static DECLARE_WAIT_QUEUE_HEAD(mce_chrdev_wait);
static DEFINE_PER_CPU(struct mce, mces_seen);
static int cpu_missing;

+static int under_crashdumping;
+
/* CMCI storm detection filter */
static DEFINE_PER_CPU(unsigned long, mce_polled_error);

@@ -1085,6 +1087,12 @@ void do_machine_check(struct pt_regs *regs, long error_code)
DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
char *msg = "Unknown";

+ if (under_crashdumping) {
+ pr_err("CPU#%d: Machine Check ignored because crash dump is running\n",
+ smp_processor_id());
+ return;
+ }
+
this_cpu_inc(mce_exception_count);

if (!cfg->banks)
@@ -2104,6 +2112,11 @@ static void mce_syscore_shutdown(void)
mce_disable_error_reporting();
}

+void cpu_emergency_mce_disable(void)
+{
+ under_crashdumping = 1;
+}
+
/*
* On resume clear all MCE state. Don't want to see leftovers from the BIOS.
* Only one CPU is active at this time, the others get re-added later using
diff --git v3.19.orig/arch/x86/kernel/crash.c v3.19/arch/x86/kernel/crash.c
index aceb2f90c716..08c9eaaaa8cb 100644
--- v3.19.orig/arch/x86/kernel/crash.c
+++ v3.19/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>
+#include <asm/mce.h>

/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
@@ -157,6 +158,13 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/* The kernel is broken so disable interrupts */
local_irq_disable();

+ /*
+ * MCE should be disabled in all CPUs together, because otherwise there
+ * exists a race window where a portion of online CPUs is responsive to
+ * MCE, which causes MCE synchronization timeout.
+ */
+ cpu_emergency_mce_disable();
+
kdump_nmi_shootdown_cpus();

/*
--
1.9.3


2015-02-23 09:15:07

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 2/2] x86: mce: comment about MCE synchronization timeout on definition of tolerant

commit 716079f66eac ("mce: Panic when a core has reached a timeout") changed
the behavior of mca_cfg->tolerant. So let's add comment about it.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c v3.19/arch/x86/kernel/cpu/mcheck/mce.c
index 3a155b9e276e..2dc93bd62bbc 100644
--- v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ v3.19/arch/x86/kernel/cpu/mcheck/mce.c
@@ -69,8 +69,10 @@ struct mca_config mca_cfg __read_mostly = {
/*
* Tolerant levels:
* 0: always panic on uncorrected errors, log corrected errors
- * 1: panic or SIGBUS on uncorrected errors, log corrected errors
- * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
+ * 1: panic or SIGBUS on uncorrected errors, log corrected errors,
+ * panic on MCE synchronization timeout.
+ * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors,
+ * no panic on MCE synchronization timeout.
* 3: never panic or SIGBUS, log all errors (for testing only)
*/
.tolerant = 1,
--
1.9.3

2015-02-23 09:28:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Mon, Feb 23, 2015 at 09:12:29AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. This disablement is done via NMI, and the crashing
> CPU wait for the completions by spinning at most for 1 second.
> However, there is a race window if this NMI handling doesn't complete within
> the 1 second on some CPU, which cause the fragile situation where only a
> portion of online CPUs are responsive to MCE interrupt. If MCE happens during
> this race window, MCE synchronization always timeouts and results in kernel
> panic. So the user-visible effect of this bug is kdump failure.
>
> Note that this race window did exist when current MCE handler was implemented
> around 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has
> reached a timeout") made it more visible by changing the default behavior of
> the synchronization timeout from "ignore" to "panic".

Let me guess: you could raise the tolerance level to 3 temporarily from
native_machine_crash_shutdown() and not touch the #MC handler at all,
right?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-23 13:01:58

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

# I resend this, sorry if you receive this twice.

On Mon, Feb 23, 2015 at 10:27:39AM +0100, Borislav Petkov wrote:
> On Mon, Feb 23, 2015 at 09:12:29AM +0000, Naoya Horiguchi wrote:
> > kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> > entering the 2nd kernel. This disablement is done via NMI, and the crashing
> > CPU wait for the completions by spinning at most for 1 second.
> > However, there is a race window if this NMI handling doesn't complete within
> > the 1 second on some CPU, which cause the fragile situation where only a
> > portion of online CPUs are responsive to MCE interrupt. If MCE happens during
> > this race window, MCE synchronization always timeouts and results in kernel
> > panic. So the user-visible effect of this bug is kdump failure.
> >
> > Note that this race window did exist when current MCE handler was implemented
> > around 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has
> > reached a timeout") made it more visible by changing the default behavior of
> > the synchronization timeout from "ignore" to "panic".
>
> Let me guess: you could raise the tolerance level to 3 temporarily from
> native_machine_crash_shutdown() and not touch the #MC handler at all,
> right?

Yes, that can be a right solution for fixing the kdump failure itself, but I
think that it might not be the best solution from the viewpoint of messaging to
userspace. What end users see is like these timeout messages:
- "Timeout: Not all CPUs entered broadcast exception handler",
- "Timeout: Subject CPUs unable to finish machine check processing",
- "Timeout: Monarch CPU unable to finish machine check processing", or
- "Timeout: Monarch CPU did not finish machine check processing".
These are informative for developers like us, but confusing for end users.
If we can guess that what end users want to know is whether the kdump is
reliable or not, so "Machine Check ignored because crash dump is running."
sounds a bit better to me.

But yes, I agree that using mca_cfg->tolerant is a nice idea, so I'd like to
define another value to show that kdump is running. Does it make sense to you?

Thanks,
Naoya Horiguchi

2015-02-23 13:59:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Mon, Feb 23, 2015 at 10:01:50PM +0900, Naoya Horiguchi wrote:
> userspace. What end users see is like these timeout messages:
> - "Timeout: Not all CPUs entered broadcast exception handler",
> - "Timeout: Subject CPUs unable to finish machine check processing",
> - "Timeout: Monarch CPU unable to finish machine check processing", or
> - "Timeout: Monarch CPU did not finish machine check processing".
> These are informative for developers like us, but confusing for end users.

Those messages won't go out if tolerant level is > 1 AFAICT and from
looking at mce_timed_out() and the machine wouldn't panic, for that
matter.

So what is the actual problem you're seeing?

Cores timeoutting when a machine check happens during entering kdump or
you not wanting cores to panic due to a machine check while the machine
enters kdump?

Something else?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-23 15:41:18

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On 02/23/2015 10:58 PM, Borislav Petkov wrote:
> On Mon, Feb 23, 2015 at 10:01:50PM +0900, Naoya Horiguchi wrote:
>> userspace. What end users see is like these timeout messages:
>> - "Timeout: Not all CPUs entered broadcast exception handler",
>> - "Timeout: Subject CPUs unable to finish machine check processing",
>> - "Timeout: Monarch CPU unable to finish machine check processing", or
>> - "Timeout: Monarch CPU did not finish machine check processing".
>> These are informative for developers like us, but confusing for end users.
>
> Those messages won't go out if tolerant level is > 1 AFAICT and from
> looking at mce_timed_out() and the machine wouldn't panic, for that
> matter.

Sorry, I misread the code, and you're right. Please ignore this part.

>
> So what is the actual problem you're seeing?
>
> Cores timeoutting when a machine check happens during entering kdump or
> you not wanting cores to panic due to a machine check while the machine
> enters kdump?

What I saw was that once in hundreds of kdump and reboot cycle we hit
kdump failure and panic with "Timeout synchronizing machine check over
CPUs" message.

Panic is OK if the MCE is severe enough, but I don't think panic due to
this synchronization timeout is good because it is not related to MCE's
nature (like victim component or type of error) or severity, so even
recoverable MCEs could trigger this panic. This timeout is just an artifact
of current kdump code, so I think we can/should avoid it.

Anyway your suggestion of raising tolerant to 3 should solve this problem,
so I'll take this approach in the next post.

Thanks,
Naoya Horiguchi

2015-02-23 17:08:07

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Tue, Feb 24, 2015 at 12:41:11AM +0900, Naoya Horiguchi wrote:
> What I saw was that once in hundreds of kdump and reboot cycle we hit
> kdump failure and panic with "Timeout synchronizing machine check over
> CPUs" message.

Ok, but this doesn't necessarily mean you're seeing an MCE.

Or perhaps your NMI shooting down is causing an MCE once in a hundred
kdump cycles, i.e. I'm looking at nmi_shootdown_cpus().

Can you send me a dmesg from such a case where this happens? The more
verbose, the better.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-24 08:16:36

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Mon, Feb 23, 2015 at 06:06:53PM +0100, Borislav Petkov wrote:
> On Tue, Feb 24, 2015 at 12:41:11AM +0900, Naoya Horiguchi wrote:
> > What I saw was that once in hundreds of kdump and reboot cycle we hit
> > kdump failure and panic with "Timeout synchronizing machine check over
> > CPUs" message.
>
> Ok, but this doesn't necessarily mean you're seeing an MCE.

This message is spit out only in do_machine_check(), so I think this is
a real MCE.

> Or perhaps your NMI shooting down is causing an MCE once in a hundred
> kdump cycles, i.e. I'm looking at nmi_shootdown_cpus().
>
> Can you send me a dmesg from such a case where this happens? The more
> verbose, the better.

Unfortunately, the reproduced case was on develoment version of a distribution
kernel, so I'm afraid I can't show the dmesg here now. I don't reproduce this
on upstream kernel yet.

Let me update my explanation about the problem. I wrote the description about
race window of nmi shoot down threads. That's not wrong, but that's only the
part of the problem. The more suitable description is that all "shot down"
CPUs keep MCE enabled (disable_local_APIC() doesn't stop it) after entering
infinite loop of cpu_relax(), so any MCE event causes kernel panic due to
synchronization timeout whenever after the 2nd kernel launches on the crashing
CPU (where the CPU don't run do_machine_check(), but the other CPUs do).

Thanks,
Naoya Horiguchi-

2015-02-24 09:57:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Tue, Feb 24, 2015 at 08:15:35AM +0000, Naoya Horiguchi wrote:
> Let me update my explanation about the problem. I wrote the description about
> race window of nmi shoot down threads. That's not wrong, but that's only the
> part of the problem. The more suitable description is that all "shot down"
> CPUs keep MCE enabled (disable_local_APIC() doesn't stop it) after entering
> infinite loop of cpu_relax(), so any MCE event causes kernel panic due to
> synchronization timeout whenever after the 2nd kernel launches on the crashing
> CPU (where the CPU don't run do_machine_check(), but the other CPUs do).

Ok, now we're getting closer.

So, I'm thinking you want to disable MCA on all cores going offline
as part of the work done in crash_nmi_callback(). And off the top
of my head, it should be something what mcheck_cpu_init() and
__mcheck_cpu_init_generic() in particular does but in reverse.

I'd even venture a guess and say that clearing CR4.MCE should be enough
but I *think* that doesn't prevent errors from being logged. Just to be
extra sure, you should clear MCG_CTL bits too. It all depends on what
exactly you want to do.

HTH.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-24 18:20:43

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

> I'd even venture a guess and say that clearing CR4.MCE should be enough
> but I *think* that doesn't prevent errors from being logged. Just to be
> extra sure, you should clear MCG_CTL bits too. It all depends on what
> exactly you want to do.

I'm not sure - the broadcast MCE will still arrive at that cpu - and with CR4.MCE==0, it will shutdown.
Not sure if that just affects that logical thread, or if it pulls a signal line to take down the whole machine.

-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2015-02-24 18:39:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Tue, Feb 24, 2015 at 06:20:16PM +0000, Luck, Tony wrote:
> I'm not sure - the broadcast MCE will still arrive at that cpu - and
> with CR4.MCE==0, it will shutdown. Not sure if that just affects that
> logical thread, or if it pulls a signal line to take down the whole
> machine.

If that is the case, the tolerance level might be the better approach
after all...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-24 18:47:27

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

> If that is the case, the tolerance level might be the better approach
> after all...

We should also take a look at mce_start():

int cpus = num_online_cpus();

...

/*
* Wait for everyone.
*/
while (atomic_read(&mce_callin) != cpus) {

since offline cpus will still show up to rendezvous ... perhaps "num_present_cpus()"
is the right number??

-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2015-02-24 21:18:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Tue, Feb 24, 2015 at 06:47:23PM +0000, Luck, Tony wrote:
> since offline cpus will still show up to rendezvous ... perhaps
> "num_present_cpus()" is the right number??

Provided nmi_shootdown_cpus() code path clears offlined CPUs from
cpumasks so that num_present_cpus() or any other for that matter returns
the correct number, but I don't see it upon a quick grep...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-25 00:55:44

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together

On Tue, Feb 24, 2015 at 10:56:16AM +0100, Borislav Petkov wrote:
> On Tue, Feb 24, 2015 at 08:15:35AM +0000, Naoya Horiguchi wrote:
> > Let me update my explanation about the problem. I wrote the description about
> > race window of nmi shoot down threads. That's not wrong, but that's only the
> > part of the problem. The more suitable description is that all "shot down"
> > CPUs keep MCE enabled (disable_local_APIC() doesn't stop it) after entering
> > infinite loop of cpu_relax(), so any MCE event causes kernel panic due to
> > synchronization timeout whenever after the 2nd kernel launches on the crashing
> > CPU (where the CPU don't run do_machine_check(), but the other CPUs do).
>
> Ok, now we're getting closer.
>
> So, I'm thinking you want to disable MCA on all cores going offline
> as part of the work done in crash_nmi_callback(). And off the top
> of my head, it should be something what mcheck_cpu_init() and
> __mcheck_cpu_init_generic() in particular does but in reverse.
>
> I'd even venture a guess and say that clearing CR4.MCE should be enough
> but I *think* that doesn't prevent errors from being logged. Just to be
> extra sure, you should clear MCG_CTL bits too. It all depends on what
> exactly you want to do.

Comparing with setting tolerant to 3, clearing CR4.MCE and MCG_CTL has one
benefit that we can also disable reporting to userspace.
do_machine_check() calls mce_report_event() which makes the mcelog process
read the log from /dev/mcelog and write it to storage even when tolerant == 3,
but we can't expect that works correctly under kdump context.
So keeping log data on /dev/mcelog buffer and letting them included in the
crashdump looks better to me.

Thanks,
Naoya Horiguchi-