kexec disables (or "shoots down") all CPUs other than the crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that, so
if an MCE happens and is broadcast to all CPUs after the main thread starts the
2nd kernel (which might not have set up MCE handling yet, or might decide not
to set it up at all), the MCE handler runs only on the other CPUs (not on the
main thread), leading to a kernel panic on MCE synchronization timeout. The
user-visible effect of this bug is kdump failure.

Note that this problem has existed since the current MCE handler was
implemented in 2.6.32, and recently commit 716079f66eac ("mce: Panic when a
core has reached a timeout") made it more visible by changing the default
behavior of the synchronization timeout from "ignore" to "panic".

This patch clears MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE on every CPU
entering the kdump code in order to "turn off" MCE handling in kdump context.

Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: <[email protected]> [2.6.32+]
---
ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
global flag to ignore MCE events.
- fixed the description of the problem
---
arch/x86/include/asm/mce.h | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 17 +++++++++++++++++
arch/x86/kernel/crash.c | 8 ++++++++
3 files changed, 26 insertions(+)
diff --git v3.19.orig/arch/x86/include/asm/mce.h v3.19/arch/x86/include/asm/mce.h
index 51b26e895933..7ae9927d781a 100644
--- v3.19.orig/arch/x86/include/asm/mce.h
+++ v3.19/arch/x86/include/asm/mce.h
@@ -175,6 +175,7 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { }
#endif
int mce_available(struct cpuinfo_x86 *c);
+void cpu_emergency_mce_disable(void);
DECLARE_PER_CPU(unsigned, mce_exception_count);
DECLARE_PER_CPU(unsigned, mce_poll_count);
diff --git v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c v3.19/arch/x86/kernel/cpu/mcheck/mce.c
index 3112b79ace8e..10359ae1f558 100644
--- v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ v3.19/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2105,6 +2105,23 @@ static void mce_syscore_shutdown(void)
}
/*
+ * Called in kdump entering code to turn off MCE handling function. We clear
+ * global switch first to forbid the situation where only portion of CPUs are
+ * responsive to MCE and MCE causes kernel panic with synchronization timeout.
+ */
+void cpu_emergency_mce_disable(void)
+{
+ u64 cap;
+ int i;
+
+ rdmsrl(MSR_IA32_MCG_CAP, cap);
+ if (cap & MCG_CTL_P)
+ wrmsr(MSR_IA32_MCG_CTL, 0, 0);
+ mce_disable_error_reporting();
+ clear_in_cr4(X86_CR4_MCE);
+}
+
+/*
* On resume clear all MCE state. Don't want to see leftovers from the BIOS.
* Only one CPU is active at this time, the others get re-added later using
* CPU hotplug:
diff --git v3.19.orig/arch/x86/kernel/crash.c v3.19/arch/x86/kernel/crash.c
index aceb2f90c716..22451c687fca 100644
--- v3.19.orig/arch/x86/kernel/crash.c
+++ v3.19/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>
+#include <asm/mce.h>
/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
@@ -112,6 +113,8 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
#endif
crash_save_cpu(regs, cpu);
+ cpu_emergency_mce_disable();
+
/*
* VMCLEAR VMCSs loaded on all cpus if needed.
*/
@@ -157,6 +160,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/* The kernel is broken so disable interrupts */
local_irq_disable();
+ /*
+ * We can't expect MCE handling to work any more, so turn it off.
+ */
+ cpu_emergency_mce_disable();
+
kdump_nmi_shootdown_cpus();
/*
--
1.9.3
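
The effect of cpu_emergency_mce_disable() on one CPU can be illustrated outside the kernel with a small user-space model. This is only a sketch under stated assumptions: struct sim_mca_regs, sim_mce_disable() and MAX_BANKS are hypothetical names invented for illustration, plain struct fields stand in for the MSRs and CR4, and every bank is cleared unconditionally, whereas the real mce_disable_error_reporting() helper may skip some banks. The bit positions (bank count in MCG_CAP[7:0], MCG_CTL_P in MCG_CAP[8], CR4.MCE in CR4[6]) are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit definitions; the names mirror the kernel's. */
#define MCG_BANKCNT_MASK 0xffULL      /* MCG_CAP[7:0]: number of banks */
#define MCG_CTL_P        (1ULL << 8)  /* MCG_CAP[8]: MCG_CTL is implemented */
#define X86_CR4_MCE      (1ULL << 6)  /* CR4[6]: machine-check enable */
#define MAX_BANKS        32           /* hypothetical limit for this model */

/* Hypothetical stand-in for one CPU's machine-check registers. */
struct sim_mca_regs {
	uint64_t mcg_cap;            /* models MSR_IA32_MCG_CAP */
	uint64_t mcg_ctl;            /* models MSR_IA32_MCG_CTL */
	uint64_t mc_ctl[MAX_BANKS];  /* models MSR_IA32_MCx_CTL */
	uint64_t cr4;                /* models CR4 */
};

/* Same order as the patch: global switch, per-bank controls, CR4.MCE. */
static void sim_mce_disable(struct sim_mca_regs *r)
{
	unsigned int nbanks = r->mcg_cap & MCG_BANKCNT_MASK;
	unsigned int i;

	assert(nbanks <= MAX_BANKS);

	/* MCG_CTL exists only when MCG_CAP advertises it. */
	if (r->mcg_cap & MCG_CTL_P)
		r->mcg_ctl = 0;

	/* Assumption: clear every bank; the kernel helper may skip some. */
	for (i = 0; i < nbanks; i++)
		r->mc_ctl[i] = 0;

	/* Clear only the MCE bit and leave the other CR4 bits alone. */
	r->cr4 &= ~X86_CR4_MCE;
}
```

Calling sim_mce_disable() on a structure whose mcg_cap advertises MCG_CTL_P and four banks leaves mcg_ctl and the first four mc_ctl entries at zero and clears bit 6 of cr4 while preserving its other bits.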
Commit 716079f66eac ("mce: Panic when a core has reached a timeout") changed
the behavior of mca_cfg->tolerant. So let's add a comment about it.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c v3.19/arch/x86/kernel/cpu/mcheck/mce.c
index 10359ae1f558..abdd2631036b 100644
--- v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ v3.19/arch/x86/kernel/cpu/mcheck/mce.c
@@ -69,8 +69,10 @@ struct mca_config mca_cfg __read_mostly = {
/*
* Tolerant levels:
* 0: always panic on uncorrected errors, log corrected errors
- * 1: panic or SIGBUS on uncorrected errors, log corrected errors
- * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
+ * 1: panic or SIGBUS on uncorrected errors, log corrected errors,
+ * panic on MCE synchronization timeout.
+ * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors,
+ * no panic on MCE synchronization timeout.
* 3: never panic or SIGBUS, log all errors (for testing only)
*/
.tolerant = 1,
--
1.9.3
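
The timeout policy documented by this comment can be modelled in user space as follows. It is a sketch, not the kernel implementation: mce_sync_model() and enum sync_result are hypothetical names, and the model assumes the decision is a plain threshold (tolerant <= 1 panics), which matches the comment for levels 1 and 2 and extends it to levels 0 and 3 by assumption. The kdump case from the first patch corresponds to one CPU fewer arriving than expected, because the crashing CPU has already jumped to the 2nd kernel.

```c
#include <assert.h>

/* Hypothetical outcome of one MCE rendezvous in this model. */
enum sync_result {
	SYNC_OK,               /* every expected CPU checked in */
	SYNC_TIMEOUT_IGNORED,  /* some CPU missing, timeout tolerated */
	SYNC_TIMEOUT_PANIC     /* some CPU missing, timeout is fatal */
};

/*
 * expected: CPUs the handler waits for
 * arrived:  CPUs that actually entered the handler
 * tolerant: tolerant level, 0..3
 */
static enum sync_result mce_sync_model(int expected, int arrived, int tolerant)
{
	assert(tolerant >= 0 && tolerant <= 3);

	if (arrived >= expected)
		return SYNC_OK;

	/* Assumption: plain threshold, levels 0 and 1 panic on timeout. */
	return tolerant <= 1 ? SYNC_TIMEOUT_PANIC : SYNC_TIMEOUT_IGNORED;
}
```

With the default tolerant level of 1, a rendezvous in which only three of four CPUs arrive yields SYNC_TIMEOUT_PANIC in this model, while the same situation at level 2 yields SYNC_TIMEOUT_IGNORED.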
On 02/26/2015 11:58 PM, Naoya Horiguchi wrote:
> @@ -157,6 +160,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> /* The kernel is broken so disable interrupts */
> local_irq_disable();
>
> + /*
> + * We can't expect MCE handling to work any more, so turn it off.
> + */
> + cpu_emergency_mce_disable();
What if the system is actually having problems with MCE errors -- which are
leading to system panics of some sort. Do you *really* want the system to
continue on at that point?
P.
> +
> kdump_nmi_shootdown_cpus();
>
> /*
>
On Fri, Feb 27, 2015 at 06:09:52AM -0500, Prarit Bhargava wrote:
> What if the system is actually having problems with MCE errors --
> which are leading to system panics of some sort. Do you *really* want
> the system to continue on at that point?
No one said that disabling MCA and doing kdump is a 100% reliable thing.

When CR4.MCE=0b and an MCE happens, it will shut down the system, at
least on Intel, according to Tony. On AMD, disabling error reporting in
addition leads to CR4.MCE being ignored.

In any case, disabling MCA carries a risk kdump should be willing to
take. Let's ask the reverse question: is kdump prepared to handle an MCE
when one happens during dumping?

If we have to be really correct, kdump should actually be prepared
to handle MCEs and, in the case where it cannot recover, stop dumping
because the already dumped data might be faulty and corrupted... And
print a nasty message on the screen...
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
Hi Prarit,
On Fri, Feb 27, 2015 at 06:09:52AM -0500, Prarit Bhargava wrote:
...
> > @@ -157,6 +160,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> > /* The kernel is broken so disable interrupts */
> > local_irq_disable();
> >
> > + /*
> > + * We can't expect MCE handling to work any more, so turn it off.
> > + */
> > + cpu_emergency_mce_disable();
>
> What if the system is actually having problems with MCE errors -- which are
> leading to system panics of some sort. Do you *really* want the system to
> continue on at that point?
Yes. While the above code is running, the system doesn't run any business
logic, so there is no risk of consuming data corrupted by HW errors.
And what we really want is any kind of information that helps find out what
caused the 1st panic, which is likely to be contained in the kdump data.
So I think it's justified to improve the success rate of kdump by continuing
the operation here.
Thanks,
Naoya Horiguchi
On 02/27/2015 07:46 AM, Naoya Horiguchi wrote:
> Hi Prarit,
>
> On Fri, Feb 27, 2015 at 06:09:52AM -0500, Prarit Bhargava wrote:
> ...
>> > @@ -157,6 +160,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>> > /* The kernel is broken so disable interrupts */
>> > local_irq_disable();
>> >
>> > + /*
>> > + * We can't expect MCE handling to work any more, so turn it off.
>> > + */
>> > + cpu_emergency_mce_disable();
>>
>> What if the system is actually having problems with MCE errors -- which are
>> leading to system panics of some sort. Do you *really* want the system to
>> continue on at that point?
>
> Yes, when running the above code, the system doesn't run any business logic,
> so no worry about consuming broken data caused by HW errors.
> And what we really want to get is any kind of information to find out what
> caused the 1st panic, which are likely to be contained in kdump data.
> So I think it's justified to improve the success rate of kdump by continuing
> the operation here.
I looked into it a bit further -- IIUC (according to the Intel spec) disabling
MCE this way will result in a power cycle of the system if an MCE is detected. So
I guess it isn't a worry for Intel. If anyone from AMD can hazard a guess as to
what happens in their case, it would be appreciated.

I still don't like this approach all that much, as a corrected non-fatal error is
something I would want to know about as an admin, but that risk is mitigated by
BMC and system monitoring hardware.
>But the MCE handler is still enabled after that, so
>if MCE happens and broadcasts around CPUs after the main thread starts the
>2nd kernel (which might not start MCE yet, or might decide not to start MCE,)
>MCE handler runs only on the other CPUs (not on the main thread,) leading to
>kernel panic with MCE synchronization.
Not having looked at the code (and relying on your description) -- there is no
way to disable the MCE handler?
P.
> When CR4.MCE=0b and an MCE happens, it will shutdown the system, at
> least on Intel, according to Tony
I checked with the architects ... and I was right. If you clear CR4.MCE you'll still
see the machine check - and you'll pull the big system reset lever.
If you think the other cpus can survive the reset - then the right thing to do is to
have any offline cpus that show up in the machine check handler just clear MCG_STATUS
and return:
do_machine_check()
{
	/*
	 * Offline cpus may show up for the party, but they don't need to
	 * do anything here. Send them back home.
	 */
	if (!cpu_online(smp_processor_id())) {
		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
		return;
	}
	...
}
If we are crashing because of a machine check - I wonder how useful it is to
run kdump(). There is a very small set of ways that you can induce a machine
check from program action - normally the problem is that something bad happened
in the h/w ... a kdump will just fill your disk and waste your time looking at
what the s/w was doing when the machine check happened.
-Tony
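
The early return suggested above can be tried out in user space with a small model. This is a sketch under assumptions, not kernel code: struct sim_cpu and sim_machine_check_entry() are hypothetical names, a struct field stands in for MSR_IA32_MCG_STATUS, and the return value only says whether the CPU would go on to join the MCE rendezvous. MCG_STATUS_MCIP is the architectural "machine check in progress" bit (bit 2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_MCIP (1ULL << 2)  /* MCG_STATUS[2]: machine check in progress */

/* Hypothetical per-CPU state for this model. */
struct sim_cpu {
	bool online;          /* models cpu_online(smp_processor_id()) */
	uint64_t mcg_status;  /* models MSR_IA32_MCG_STATUS */
};

/*
 * Models the top of the machine check handler.
 * Returns true if the CPU should continue into the rendezvous,
 * false if it was sent straight back.
 */
static bool sim_machine_check_entry(struct sim_cpu *cpu)
{
	if (!cpu->online) {
		/* Offline CPU: acknowledge the event and leave at once. */
		cpu->mcg_status = 0;
		return false;
	}

	/* Online CPU: MCG_STATUS is left for the normal handling path. */
	return true;
}
```

In this model an offline CPU that enters with MCG_STATUS_MCIP set leaves with mcg_status cleared and never takes part in synchronization, so the remaining CPUs would not wait for it; whether the offlined CPUs survive the reset path on real hardware is the open question raised above.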