Various new Intel x86 machine check features, including initial support
for machine check recovery and new status bits. The actual machine check
recovery requires some generic VM changes, which will be posted in a separate
series ("POISON series"). This series can be applied without the POISON
series, but the POISON series requires these patches.
The broadcast and early "no way out" patches in particular are fairly important
for Nehalem CPUs; without them machine checks are often incorrectly reported
due to the shared bank topology of the system.
-Andi
Impact: Spec compliance
The example code in the IA32 SDM recommends synchronizing the CPU
after machine check handling, so do that here.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:15.000000000 +0200
@@ -254,6 +254,8 @@
* Don't clear MCG_STATUS here because it's only defined for
* exceptions.
*/
+
+ sync_core();
}
/*
@@ -419,6 +421,7 @@
wrmsrl(MSR_IA32_MCG_STATUS, 0);
out2:
atomic_dec(&mce_entry);
+ sync_core();
}
#ifdef CONFIG_X86_MCE_INTEL
Impact: bug fix, fixes an old regression
For some time each panic() called with interrupts disabled triggered the
!irqs_disabled() WARN_ON in smp_call_function(), producing ugly
backtraces and confusing users.
This is a common situation with machine checks, for example, which tend
to call panic() with interrupts disabled, but it will also hit
in other situations, e.g. a panic during early boot. In effect
it means that panic() cannot be called safely in many circumstances, which
would be bad.
This all started with the new fancy queued smp_call_function,
which is then used by the shutdown path to shut down the other CPUs.
On closer examination it turned out that the fancy RCU
smp_call_function() does lots of things not suitable in a panic
situation anyway, like allocating memory and relying on complex system
state.
I originally tried to patch this over by checking for panic
there, but it was quite complicated and the original patch
was also not very popular. This also didn't fix some
of the underlying complexity problems.
The new code post 2.6.29 tries to patch around this by
checking for oops_in_progress, but that is not enough to make
this fully safe, and I don't think it's a real solution
because panic() has to be reliable.
So instead use a dedicated vector to reboot. This makes the reboot code
extremely straightforward, which is definitely a big plus
in a panic situation where it is important to avoid relying
on too much kernel state. The new code is also
safe to call from an interrupts-off region because it is
very, very simple.
There can be situations where it is important that panic()
is reliable. For example, on a fatal machine check the panic
is needed to get the system back up and running as quickly
as possible. So it's important that panic() is reliable and
that all functions it calls are simple.
This is why I came up with this simple vector scheme.
It's very hard to beat in simplicity. Vectors are not
particularly precious anymore since all big systems
are using per-CPU vectors.
Another possibility would have been to use an NMI, similar
to kdump, but there is still the problem that NMIs don't
work reliably on some systems due to BIOS issues. NMIs
would also have been able to stop CPUs running with interrupts
off. For the sake of universal reliability I opted for
using a non-NMI vector for now.
I put the reboot vector into the highest priority bucket
of the APIC vectors and moved the 64bit UV_BAU message
down into the next lower priority instead.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/entry_arch.h | 1 +
arch/x86/include/asm/hw_irq.h | 1 +
arch/x86/include/asm/irq_vectors.h | 8 ++++++--
arch/x86/kernel/entry_64.S | 2 ++
arch/x86/kernel/irqinit_32.c | 3 +++
arch/x86/kernel/irqinit_64.c | 3 +++
arch/x86/kernel/smp.c | 28 +++++++++++++++++++++++++++-
7 files changed, 43 insertions(+), 3 deletions(-)
Index: linux/arch/x86/kernel/irqinit_32.c
===================================================================
--- linux.orig/arch/x86/kernel/irqinit_32.c 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/kernel/irqinit_32.c 2009-04-07 16:09:58.000000000 +0200
@@ -167,6 +167,9 @@
/* Low priority IPI to cleanup after moving an irq */
set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
set_bit(IRQ_MOVE_CLEANUP_VECTOR, used_vectors);
+
+ /* IPI used for rebooting/stopping */
+ alloc_intr_gate(REBOOT_VECTOR, reboot_interrupt);
#endif
#ifdef CONFIG_X86_LOCAL_APIC
Index: linux/arch/x86/kernel/entry_64.S
===================================================================
--- linux.orig/arch/x86/kernel/entry_64.S 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/kernel/entry_64.S 2009-04-07 16:43:12.000000000 +0200
@@ -976,6 +976,8 @@
#ifdef CONFIG_SMP
apicinterrupt IRQ_MOVE_CLEANUP_VECTOR \
irq_move_cleanup_interrupt smp_irq_move_cleanup_interrupt
+apicinterrupt REBOOT_VECTOR \
+ reboot_interrupt smp_reboot_interrupt
#endif
#ifdef CONFIG_X86_UV
Index: linux/arch/x86/kernel/irqinit_64.c
===================================================================
--- linux.orig/arch/x86/kernel/irqinit_64.c 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/kernel/irqinit_64.c 2009-04-07 16:43:12.000000000 +0200
@@ -133,6 +133,9 @@
/* Low priority IPI to cleanup after moving an irq */
set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
set_bit(IRQ_MOVE_CLEANUP_VECTOR, used_vectors);
+
+ /* IPI for rebooting/panicking */
+ alloc_intr_gate(REBOOT_VECTOR, reboot_interrupt);
#endif
}
Index: linux/arch/x86/kernel/smp.c
===================================================================
--- linux.orig/arch/x86/kernel/smp.c 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/kernel/smp.c 2009-04-07 16:43:16.000000000 +0200
@@ -150,14 +150,40 @@
* this function calls the 'stop' function on all other CPUs in the system.
*/
+asmlinkage void smp_reboot_interrupt(void)
+{
+ ack_APIC_irq();
+ irq_enter();
+ stop_this_cpu(NULL);
+ irq_exit();
+}
+
static void native_smp_send_stop(void)
{
unsigned long flags;
+ unsigned long wait;
if (reboot_force)
return;
- smp_call_function(stop_this_cpu, NULL, 0);
+ /*
+ * Use our own vector here because smp_call_function
+ * does lots of things not suitable in a panic situation.
+ * On most systems we could also use an NMI here,
+ * but there are a few systems around where NMI
+ * is problematic so stay with a non-NMI vector for now
+ * (this implies we cannot stop CPUs spinning with irq off
+ * currently)
+ */
+ if (num_online_cpus() > 1) {
+ apic->send_IPI_allbutself(REBOOT_VECTOR);
+
+ /* Don't wait longer than a second */
+ wait = USEC_PER_SEC;
+ while (num_online_cpus() > 1 && wait--)
+ udelay(1);
+ }
+
local_irq_save(flags);
disable_local_APIC();
local_irq_restore(flags);
Index: linux/arch/x86/include/asm/entry_arch.h
===================================================================
--- linux.orig/arch/x86/include/asm/entry_arch.h 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/include/asm/entry_arch.h 2009-04-07 16:09:58.000000000 +0200
@@ -14,6 +14,7 @@
BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
+BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
BUILD_INTERRUPT3(invalidate_interrupt0,INVALIDATE_TLB_VECTOR_START+0,
smp_invalidate_interrupt)
Index: linux/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux.orig/arch/x86/include/asm/irq_vectors.h 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/include/asm/irq_vectors.h 2009-04-07 16:43:12.000000000 +0200
@@ -88,12 +88,14 @@
#define THERMAL_APIC_VECTOR 0xfa
#ifdef CONFIG_X86_32
-/* 0xf8 - 0xf9 : free */
+/* 0xf9 : free */
#else
# define THRESHOLD_APIC_VECTOR 0xf9
-# define UV_BAU_MESSAGE 0xf8
#endif
+#define REBOOT_VECTOR 0xf8
+
+
/* f0-f7 used for spreading out TLB flushes: */
#define INVALIDATE_TLB_VECTOR_END 0xf7
#define INVALIDATE_TLB_VECTOR_START 0xf0
@@ -116,6 +118,8 @@
*/
#define GENERIC_INTERRUPT_VECTOR 0xed
+#define UV_BAU_MESSAGE 0xec
+
/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
Index: linux/arch/x86/include/asm/hw_irq.h
===================================================================
--- linux.orig/arch/x86/include/asm/hw_irq.h 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/include/asm/hw_irq.h 2009-04-07 16:43:12.000000000 +0200
@@ -44,6 +44,7 @@
extern void invalidate_interrupt7(void);
extern void irq_move_cleanup_interrupt(void);
+extern void reboot_interrupt(void);
extern void threshold_interrupt(void);
extern void call_function_interrupt(void);
Impact: Spec compliance
The code assumed that on P4 the extended RIP MSR would
be an exact error location. But no P4s have exact EIPV, and on
rereading, the SDM doesn't claim that either. So let's drop that assumption.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:14.000000000 +0200
@@ -184,11 +184,8 @@
m->ip = 0;
m->cs = 0;
}
- if (rip_msr) {
- /* Assume the RIP in the MSR is exact. Is this true? */
- m->mcgstatus |= MCG_STATUS_EIPV;
+ if (rip_msr)
rdmsrl(rip_msr, m->ip);
- }
}
/*
Impact: Cleanup
Instead of using its own callbacks, use the generic ones provided by
the sysdev layer.
This finally allows getting rid of the ugly ACCESSOR macros. Should
also save some text size.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 38 +++++++++++++++---------------------
1 file changed, 16 insertions(+), 22 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:15.000000000 +0200
@@ -921,26 +921,6 @@
DEFINE_PER_CPU(struct sys_device, device_mce);
void (*threshold_cpu_callback)(unsigned long action, unsigned int cpu) __cpuinitdata;
-
-/* Why are there no generic functions for this? */
-#define ACCESSOR(name, var, start) \
- static ssize_t show_ ## name(struct sys_device *s, \
- struct sysdev_attribute *attr, \
- char *buf) { \
- return sprintf(buf, "%lx\n", (unsigned long)var); \
- } \
- static ssize_t set_ ## name(struct sys_device *s, \
- struct sysdev_attribute *attr, \
- const char *buf, size_t siz) { \
- char *end; \
- unsigned long new = simple_strtoul(buf, &end, 0); \
- if (end == buf) return -EINVAL; \
- var = new; \
- start; \
- return end-buf; \
- } \
- static SYSDEV_ATTR(name, 0644, show_ ## name, set_ ## name);
-
static struct sysdev_attribute *bank_attrs;
static ssize_t show_bank(struct sys_device *s, struct sysdev_attribute *attr,
@@ -983,11 +963,25 @@
return len;
}
+static ssize_t store_int_with_restart(struct sys_device *s,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ ssize_t ret = sysdev_store_int(s, attr, buf, size);
+ mce_restart();
+ return ret;
+}
+
static SYSDEV_ATTR(trigger, 0644, show_trigger, set_trigger);
static SYSDEV_INT_ATTR(tolerant, 0644, tolerant);
-ACCESSOR(check_interval,check_interval,mce_restart())
+static struct sysdev_ext_attribute attr_check_interval = {
+ _SYSDEV_ATTR(check_interval, 0644, sysdev_show_int,
+ store_int_with_restart),
+ &check_interval
+};
+
static struct sysdev_attribute *mce_attributes[] = {
- &attr_tolerant.attr, &attr_check_interval, &attr_trigger,
+ &attr_tolerant.attr, &attr_check_interval.attr, &attr_trigger,
NULL
};
Impact: cleanup
Makes the code easier to read. No functional changes.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 2 ++
arch/x86/kernel/cpu/mcheck/mce_64.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:15.000000000 +0200
@@ -551,7 +551,7 @@
unsigned b;
rdmsrl(MSR_IA32_MCG_CAP, cap);
- b = cap & 0xff;
+ b = cap & MCG_BANKS_MASK;
if (b > MAX_NR_BANKS) {
printk(KERN_WARNING
"MCE: Using only %u machine check banks out of %u\n",
@@ -570,7 +570,7 @@
}
/* Use accurate RIP reporting if available. */
- if ((cap & (1<<9)) && ((cap >> 16) & 0xff) >= 9)
+ if ((cap & MCG_EXT_P) && (MCG_NUM_EXT(cap) >= 9))
rip_msr = MSR_IA32_MCG_EIP;
return 0;
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:14.000000000 +0200
@@ -10,8 +10,10 @@
* Machine Check support for x86
*/
+#define MCG_BANKS_MASK (0xff)
#define MCG_CTL_P (1UL<<8) /* MCG_CAP register available */
#define MCG_EXT_P (1ULL<<9) /* Extended registers available */
+#define MCG_NUM_EXT(c) (((c) >> 16) & 0xff)
#define MCG_CMCI_P (1ULL<<10) /* CMCI supported */
#define MCG_STATUS_RIPV (1UL<<0) /* restart ip valid */
Impact: feature, debugging tool
Useful for debugging, but it's also good general policy
to have a counter for all special interrupts visible in /proc/interrupts.
This makes it easier to diagnose where a CPU is spending its time.
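For illustration, the line this adds to /proc/interrupts would look roughly like this on a 4-CPU box (the counts are made up):
	MCE:        142          3          0         17   Machine check exceptions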
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 3 +++
arch/x86/kernel/cpu/mcheck/mce_64.c | 4 ++++
arch/x86/kernel/irq.c | 10 ++++++++++
3 files changed, 17 insertions(+)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:14.000000000 +0200
@@ -75,6 +75,8 @@
rdtscll(m->tsc);
}
+DEFINE_PER_CPU(unsigned, mce_exception_count);
+
/*
* Lockless MCE logging infrastructure.
* This avoids deadlocks on printk locks without having to break locks. Also
@@ -285,6 +287,8 @@
atomic_inc(&mce_entry);
+ __get_cpu_var(mce_exception_count)++;
+
if (notify_die(DIE_NMI, "machine check", regs, error_code,
18, SIGKILL) == NOTIFY_STOP)
goto out2;
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:10.000000000 +0200
@@ -93,6 +93,7 @@
#else /* CONFIG_X86_32 */
#include <asm/atomic.h>
+#include <linux/percpu.h>
void mce_setup(struct mce *m);
void mce_log(struct mce *m);
@@ -127,6 +128,8 @@
extern int mce_available(struct cpuinfo_x86 *c);
+DECLARE_PER_CPU(unsigned, mce_exception_count);
+
void mce_log_therm_throt_event(__u64 status);
extern atomic_t mce_entry;
Index: linux/arch/x86/kernel/irq.c
===================================================================
--- linux.orig/arch/x86/kernel/irq.c 2009-04-07 16:09:56.000000000 +0200
+++ linux/arch/x86/kernel/irq.c 2009-04-07 16:09:59.000000000 +0200
@@ -12,6 +12,7 @@
#include <asm/io_apic.h>
#include <asm/irq.h>
#include <asm/idle.h>
+#include <asm/mce.h>
atomic_t irq_err_count;
@@ -96,6 +97,12 @@
seq_printf(p, " Threshold APIC interrupts\n");
# endif
#endif
+#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
+ seq_printf(p, "%*s: ", prec, "MCE");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
+ seq_printf(p, " Machine check exceptions\n");
+#endif
seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
#if defined(CONFIG_X86_IO_APIC)
seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
@@ -163,6 +170,9 @@
{
u64 sum = irq_stats(cpu)->__nmi_count;
+#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
+ sum += per_cpu(mce_exception_count, cpu);
+#endif
#ifdef CONFIG_X86_LOCAL_APIC
sum += irq_stats(cpu)->apic_timer_irqs;
sum += irq_stats(cpu)->irq_spurious_count;
Impact: bugfix
Normally the machine check handler ignores corrected errors and leaves
them to machine_check_poll(). But when panicking, machine_check_poll() won't
run, so log all errors in that case.
Note: this can still miss some cases until the "early no way out"
patch later is applied too.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:14.000000000 +0200
@@ -317,9 +317,9 @@
/*
* Non uncorrected errors are handled by machine_check_poll
- * Leave them alone.
+ * Leave them alone, unless this panics.
*/
- if ((m.status & MCI_STATUS_UC) == 0)
+ if ((m.status & MCI_STATUS_UC) == 0 && !no_way_out)
continue;
/*
Impact: Cleanup
Remove unused mce_events static variable.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 2 --
1 file changed, 2 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:13.000000000 +0200
@@ -55,7 +55,6 @@
static unsigned long notify_user;
static int rip_msr;
static int mce_bootlog = -1;
-static atomic_t mce_events;
static char trigger[128];
static char *trigger_argv[2] = { trigger, NULL };
@@ -91,7 +90,6 @@
void mce_log(struct mce *mce)
{
unsigned next, entry;
- atomic_inc(&mce_events);
mce->finished = 0;
wmb();
for (;;) {
For corrected machine checks found by the poller or by
threshold interrupts, going through an idle notifier is not needed
because the wake_up can just be done directly and doesn't
need the idle notifier. The idle notifier is only needed for logging
exceptions.
The code going through the idle notifier is also
fairly ugly and intrusive to the rest of the kernel.
To be honest I never liked the idle notifier even though I signed
off on it. On closer investigation the code actually turned out
to be nearly dead. Right now machine check exceptions on x86 are always
unrecoverable (they lead to a panic due to PCC), which means we never execute
the idle notifier path in the normal case.
The only exception is the somewhat weird tolerant==3 case, which
ignores PCC. I'll fix this in a followup patch in a much cleaner way.
So remove the "mcelog wakeup through idle notifier" code
from 64bit.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 22 ----------------------
1 file changed, 22 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:13.000000000 +0200
@@ -521,28 +521,6 @@
return 0;
}
-/* see if the idle task needs to notify userspace */
-static int
-mce_idle_callback(struct notifier_block *nfb, unsigned long action, void *junk)
-{
- /* IDLE_END should be safe - interrupts are back on */
- if (action == IDLE_END && test_thread_flag(TIF_MCE_NOTIFY))
- mce_notify_user();
-
- return NOTIFY_OK;
-}
-
-static struct notifier_block mce_idle_notifier = {
- .notifier_call = mce_idle_callback,
-};
-
-static __init int periodic_mcheck_init(void)
-{
- idle_notifier_register(&mce_idle_notifier);
- return 0;
-}
-__initcall(periodic_mcheck_init);
-
/*
* Initialize Machine Checks for a CPU.
*/
Impact: Minor cleanup, no functional changes
There's only a single out path in do_machine_check now, so rename the label from
out2 to out. Also align it at the first column.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:12.000000000 +0200
@@ -287,9 +287,9 @@
if (notify_die(DIE_NMI, "machine check", regs, error_code,
18, SIGKILL) == NOTIFY_STOP)
- goto out2;
+ goto out;
if (!banks)
- goto out2;
+ goto out;
mce_setup(&m);
@@ -418,7 +418,7 @@
wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
}
wrmsrl(MSR_IA32_MCG_STATUS, 0);
- out2:
+out:
atomic_dec(&mce_entry);
sync_core();
}
From: Thomas Gleixner <[email protected]>
Impact: Cleanup
Remove unused mce_init argument.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:13.000000000 +0200
@@ -556,7 +556,7 @@
return 0;
}
-static void mce_init(void *dummy)
+static void mce_init(void)
{
u64 cap;
int i;
@@ -640,7 +640,7 @@
}
mce_cpu_quirks(c);
- mce_init(NULL);
+ mce_init();
mce_cpu_features(c);
mce_init_timer();
}
@@ -873,7 +873,7 @@
CPU hotplug. */
static int mce_resume(struct sys_device *dev)
{
- mce_init(NULL);
+ mce_init();
mce_cpu_features(&current_cpu_data);
return 0;
}
@@ -882,7 +882,7 @@
{
del_timer_sync(&__get_cpu_var(mce_timer));
if (mce_available(&current_cpu_data))
- mce_init(NULL);
+ mce_init();
mce_init_timer();
}
On closer examination it turns out oops_begin() is not
a good idea in a machine check panic. All oops_begin()
does is check for recursive/parallel oopses and implement the
"wait on oops" heuristic. But there's actually no good reason
to lock machine checks against oopses or prevent them
from recursing, and "wait on oops" does not really make
sense for a machine check either.
So just remove it and replace it with a manual
bust_spinlocks()/console_verbose().
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:13.000000000 +0200
@@ -151,7 +151,8 @@
{
int i;
- oops_begin();
+ bust_spinlocks(1);
+ console_verbose();
for (i = 0; i < MCE_LOG_LEN; i++) {
unsigned long tsc = mcelog.entry[i].tsc;
Machine checks support waking up the mcelog daemon quickly.
The original wake-up code for this was pretty ugly, relying on
an idle notifier and a special process flag. The reason it did
it this way is that the machine check handler is not subject
to normal interrupt locking rules, so it's not safe
to call wake_up() from it. Instead it set a process flag
and then did the wakeup either on syscall return
or in the idle notifier.
This patch adds a new "bootstrapping" method as a replacement.
The idea is that the handler checks whether it's in a state where
it is safe to call wake_up(). If so, it calls it directly.
When it's not safe -- that is, it interrupted a critical
section with interrupts disabled -- it uses a new "self IPI" to trigger
an IPI to its own CPU. This can be done safely because IPI
triggers are atomic with some care. The IPI is delivered
once interrupts are reenabled and can then safely call
wake_up().
When APICs are disabled the event is just queued and will be picked up
eventually by the next polling timer. I think that's a reasonable
compromise, since it should only happen quite rarely.
Contains fixes from Ying Huang
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/hw_irq.h | 1
arch/x86/include/asm/irq_vectors.h | 5 +++
arch/x86/kernel/cpu/mcheck/mce_64.c | 52 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/entry_64.S | 5 +++
arch/x86/kernel/irqinit_64.c | 4 ++
5 files changed, 67 insertions(+)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:12.000000000 +0200
@@ -7,6 +7,7 @@
* Author: Andi Kleen
*/
+#include <linux/interrupt.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/kernel.h>
@@ -35,6 +36,9 @@
#include <asm/uaccess.h>
#include <asm/smp.h>
#include <asm/idle.h>
+#include <asm/ipi.h>
+#include <asm/hw_irq.h>
+#include <asm/apic.h>
#define MISC_MCELOG_MINOR 227
@@ -188,6 +192,52 @@
}
/*
+ * Called after interrupts have been reenabled again
+ * when an MCE happened during an interrupts-off region
+ * in the kernel.
+ */
+asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
+{
+ ack_APIC_irq();
+ exit_idle();
+ irq_enter();
+ mce_notify_user();
+ irq_exit();
+}
+
+static void mce_report_event(struct pt_regs *regs)
+{
+ if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
+ mce_notify_user();
+ return;
+ }
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ /*
+ * Without APIC do not notify. The event will be picked
+ * up eventually.
+ */
+ if (!cpu_has_apic)
+ return;
+
+ /*
+ * When interrupts are disabled we cannot use
+ * kernel services safely. Trigger a self interrupt
+ * through the APIC to instead do the notification
+ * after interrupts are reenabled again.
+ */
+ apic->send_IPI_self(MCE_SELF_VECTOR);
+
+ /*
+ * Wait for idle afterwards again so that we don't leave the
+ * APIC in a non idle state because the normal APIC writes
+ * cannot exclude us.
+ */
+ apic_wait_icr_idle();
+#endif
+}
+
+/*
* Poll for corrected events or events that happened before reset.
* Those are just logged through /dev/mcelog.
*
@@ -412,6 +462,8 @@
/* notify userspace ASAP */
set_thread_flag(TIF_MCE_NOTIFY);
+ mce_report_event(regs);
+
/* the last thing we do is clear state */
for (i = 0; i < banks; i++) {
if (test_bit(i, toclear))
Index: linux/arch/x86/kernel/entry_64.S
===================================================================
--- linux.orig/arch/x86/kernel/entry_64.S 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/kernel/entry_64.S 2009-04-07 16:09:59.000000000 +0200
@@ -1013,6 +1013,11 @@
apicinterrupt THERMAL_APIC_VECTOR \
thermal_interrupt smp_thermal_interrupt
+#ifdef CONFIG_X86_MCE
+apicinterrupt MCE_SELF_VECTOR \
+ mce_self_interrupt smp_mce_self_interrupt
+#endif
+
#ifdef CONFIG_SMP
apicinterrupt CALL_FUNCTION_SINGLE_VECTOR \
call_function_single_interrupt smp_call_function_single_interrupt
Index: linux/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux.orig/arch/x86/include/asm/irq_vectors.h 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/include/asm/irq_vectors.h 2009-04-07 16:43:04.000000000 +0200
@@ -121,6 +121,11 @@
#define UV_BAU_MESSAGE 0xec
/*
+ * Self IPI vector for machine checks
+ */
+#define MCE_SELF_VECTOR 0xeb
+
+/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
Index: linux/arch/x86/include/asm/hw_irq.h
===================================================================
--- linux.orig/arch/x86/include/asm/hw_irq.h 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/include/asm/hw_irq.h 2009-04-07 16:09:59.000000000 +0200
@@ -32,6 +32,7 @@
extern void spurious_interrupt(void);
extern void thermal_interrupt(void);
extern void reschedule_interrupt(void);
+extern void mce_self_interrupt(void);
extern void invalidate_interrupt(void);
extern void invalidate_interrupt0(void);
Index: linux/arch/x86/kernel/irqinit_64.c
===================================================================
--- linux.orig/arch/x86/kernel/irqinit_64.c 2009-04-07 16:09:58.000000000 +0200
+++ linux/arch/x86/kernel/irqinit_64.c 2009-04-07 16:09:59.000000000 +0200
@@ -155,6 +155,10 @@
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+#ifdef CONFIG_X86_MCE
+ alloc_intr_gate(MCE_SELF_VECTOR, mce_self_interrupt);
+#endif
}
void __init native_init_IRQ(void)
Impact: cleanup
I don't think the BKL is needed for anything in mce_open(), so drop it.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 4 ----
1 file changed, 4 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:11.000000000 +0200
@@ -12,7 +12,6 @@
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/sched.h>
-#include <linux/smp_lock.h>
#include <linux/string.h>
#include <linux/rcupdate.h>
#include <linux/kallsyms.h>
@@ -723,12 +722,10 @@
static int mce_open(struct inode *inode, struct file *file)
{
- lock_kernel();
spin_lock(&mce_state_lock);
if (open_exclu || (open_count && (file->f_flags & O_EXCL))) {
spin_unlock(&mce_state_lock);
- unlock_kernel();
return -EBUSY;
}
@@ -737,7 +734,6 @@
open_count++;
spin_unlock(&mce_state_lock);
- unlock_kernel();
return nonseekable_open(inode, file);
}
Impact: bug fix, cleanup
Previously mce_panic() used a simple heuristic to avoid printing
old, so far unreported machine check events on an MCE panic. This worked
by comparing the TSC value at the start of the machine check handler
with the event timestamp and only printing newer ones.
This has a couple of issues; in particular, on systems where the TSC
is not fully synchronized between CPUs it could lose events or print
old ones.
It is also problematic with the full system synchronization
added by the next patch.
Remove the TSC heuristic and replace it with a simple scheme:
print corrected errors first, after that uncorrected errors,
and finally the worst machine check as determined by the machine
check handler.
This simplifies the code because there is no need to pass the
original TSC value around.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 31 +++++++++++++++++--------------
1 file changed, 17 insertions(+), 14 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:12.000000000 +0200
@@ -151,23 +151,29 @@
"and contact your hardware vendor\n");
}
-static void mce_panic(char *msg, struct mce *backup, unsigned long start)
+static void mce_panic(char *msg, struct mce *final)
{
int i;
bust_spinlocks(1);
console_verbose();
+ /* First print corrected ones that are still unlogged */
for (i = 0; i < MCE_LOG_LEN; i++) {
- unsigned long tsc = mcelog.entry[i].tsc;
-
- if (time_before(tsc, start))
+ struct mce *m = &mcelog.entry[i];
+ if ((m->status & MCI_STATUS_VAL) &&
+ !(m->status & MCI_STATUS_UC))
+ print_mce(m);
+ }
+ /* Now print uncorrected but with the final one last */
+ for (i = 0; i < MCE_LOG_LEN; i++) {
+ struct mce *m = &mcelog.entry[i];
+ if (!(m->status & MCI_STATUS_VAL))
continue;
- print_mce(&mcelog.entry[i]);
- if (backup && mcelog.entry[i].tsc == backup->tsc)
- backup = NULL;
+ if (!final || memcmp(m, final, sizeof(struct mce)))
+ print_mce(m);
}
- if (backup)
- print_mce(backup);
+ if (final)
+ print_mce(final);
panic(msg);
}
@@ -329,7 +335,6 @@
void do_machine_check(struct pt_regs * regs, long error_code)
{
struct mce m, panicm;
- u64 mcestart = 0;
int i;
int panicm_found = 0;
/*
@@ -361,7 +366,6 @@
if (!(m.mcgstatus & MCG_STATUS_RIPV))
no_way_out = 1;
- rdtscll(mcestart);
barrier();
for (i = 0; i < banks; i++) {
@@ -439,7 +443,7 @@
* has not set tolerant to an insane level, give up and die.
*/
if (no_way_out && tolerant < 3)
- mce_panic("Machine check", &panicm, mcestart);
+ mce_panic("Machine check", &panicm);
/*
* If the error seems to be unrecoverable, something should be
@@ -467,8 +471,7 @@
if (user_space && tolerant > 0) {
force_sig(SIGBUS, current);
} else if (panic_on_oops || tolerant < 2) {
- mce_panic("Uncorrected machine check",
- &panicm, mcestart);
+ mce_panic("Uncorrected machine check", &panicm);
}
}
Impact: Feature
The machine check grading (as in deciding what should be done for a given
register value) will soon have to be done in multiple places, and it's also
getting more complicated.
So it makes sense to consolidate it into a single function. To get smaller,
more straightforward and possibly more extensible code I opted for
a new table-driven method. The various rules are put into a table,
which is then executed by a very simple interpreter.
The grading engine is in a new file, mce-severity.c. I also added a private
include file, mce-internal.h, because mce.h is already a bit too cluttered.
This is dead code right now, but will be used in follow-on patches.
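For reference, here is a minimal sketch of how later patches in this series end up calling the grading engine (the real call sites are in the "early no way out" and broadcast patches further down; the variable names follow those patches):
	char *msg = NULL;
	int severity;
	/* Grade one collected event; tolerant relaxes the rules */
	severity = mce_severity(&m, tolerant, &msg);
	if (severity >= MCE_PANIC_SEVERITY)
		mce_panic("Machine check", &m, msg);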
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/Makefile | 1
arch/x86/kernel/cpu/mcheck/mce-internal.h | 10 ++++
arch/x86/kernel/cpu/mcheck/mce-severity.c | 61 ++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+)
Index: linux/arch/x86/kernel/cpu/mcheck/mce-severity.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/arch/x86/kernel/cpu/mcheck/mce-severity.c 2009-04-07 16:43:08.000000000 +0200
@@ -0,0 +1,61 @@
+/*
+ * MCE grading rules.
+ * Copyright 2008, 2009 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ * Author: Andi Kleen
+ */
+#include <linux/kernel.h>
+#include <asm/mce.h>
+
+#include "mce-internal.h"
+
+/*
+ * Grade an mce by severity. In general the most severe ones are processed
+ * first. Since there are quite a lot of combinations test the bits in a
+ * table-driven way. The rules are simply processed in order, first
+ * match wins.
+ */
+
+static struct severity {
+ u64 mask;
+ u64 result;
+ unsigned char sev;
+ unsigned char mcgmask;
+ unsigned char mcgres;
+ char *msg;
+} severities[] = {
+#define SEV(s) .sev = MCE_ ## s ## _SEVERITY
+#define BITCLR(x, s, m, r...) { .mask = x, .result = 0, SEV(s), .msg = m, ## r }
+#define BITSET(x, s, m, r...) { .mask = x, .result = x, SEV(s), .msg = m, ## r }
+#define MCGMASK(x, res, s, m, r...) \
+ { .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
+ BITCLR(MCI_STATUS_VAL, NO, "Invalid"),
+ BITCLR(MCI_STATUS_EN, NO, "Not enabled"),
+ BITSET(MCI_STATUS_PCC, PANIC, "Processor context corrupt"),
+ MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "No restart IP"),
+ BITSET(MCI_STATUS_UC|MCI_STATUS_OVER, PANIC, "Overflowed uncorrected"),
+ BITSET(MCI_STATUS_UC, UC, "Uncorrected"),
+ BITSET(0, SOME, "No match") /* always matches. keep at end */
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg)
+{
+ struct severity *s;
+ for (s = severities;; s++) {
+ if ((a->status & s->mask) != s->result)
+ continue;
+ if ((a->mcgstatus & s->mcgmask) != s->mcgres)
+ continue;
+ if (s->sev > MCE_NO_SEVERITY && (a->status & MCI_STATUS_UC) &&
+ tolerant < 1)
+ return MCE_PANIC_SEVERITY;
+ if (msg)
+ *msg = s->msg;
+ return s->sev;
+ }
+}
Index: linux/arch/x86/kernel/cpu/mcheck/Makefile
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/Makefile 2009-04-07 16:09:53.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/Makefile 2009-04-07 16:43:09.000000000 +0200
@@ -5,3 +5,4 @@
obj-$(CONFIG_X86_MCE_AMD) += mce_amd_64.o
obj-$(CONFIG_X86_MCE_NONFATAL) += non-fatal.o
obj-$(CONFIG_X86_MCE_THRESHOLD) += threshold.o
+obj-$(CONFIG_X86_64) += mce-severity.o
Index: linux/arch/x86/kernel/cpu/mcheck/mce-internal.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/arch/x86/kernel/cpu/mcheck/mce-internal.h 2009-04-07 16:43:08.000000000 +0200
@@ -0,0 +1,10 @@
+#include <asm/mce.h>
+
+enum severity_level {
+ MCE_NO_SEVERITY,
+ MCE_SOME_SEVERITY,
+ MCE_UC_SEVERITY,
+ MCE_PANIC_SEVERITY,
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg);
This will be used by future patches to allow machine check error injection.
Right now it's a nop, except for adding some wrappers around the MSR accesses.
This is early in the sequence to avoid too many conflicts.
Andi Kleen <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 37 ++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:12.000000000 +0200
@@ -171,6 +171,19 @@
panic(msg);
}
+/* MSR access wrappers used for error injection */
+static u64 mce_rdmsrl(u32 msr)
+{
+ u64 v;
+ rdmsrl(msr, v);
+ return v;
+}
+
+static void mce_wrmsrl(u32 msr, u64 v)
+{
+ wrmsrl(msr, v);
+}
+
int mce_available(struct cpuinfo_x86 *c)
{
if (mce_dont_init)
@@ -188,7 +201,7 @@
m->cs = 0;
}
if (rip_msr)
- rdmsrl(rip_msr, m->ip);
+ m->ip = mce_rdmsrl(rip_msr);
}
/*
@@ -250,7 +263,7 @@
mce_setup(&m);
- rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
+ m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
for (i = 0; i < banks; i++) {
if (!bank[i] || !test_bit(i, *b))
continue;
@@ -261,7 +274,7 @@
m.tsc = 0;
barrier();
- rdmsrl(MSR_IA32_MC0_STATUS + i*4, m.status);
+ m.status = mce_rdmsrl(MSR_IA32_MC0_STATUS + i*4);
if (!(m.status & MCI_STATUS_VAL))
continue;
@@ -276,9 +289,9 @@
continue;
if (m.status & MCI_STATUS_MISCV)
- rdmsrl(MSR_IA32_MC0_MISC + i*4, m.misc);
+ m.misc = mce_rdmsrl(MSR_IA32_MC0_MISC + i*4);
if (m.status & MCI_STATUS_ADDRV)
- rdmsrl(MSR_IA32_MC0_ADDR + i*4, m.addr);
+ m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;
@@ -294,7 +307,7 @@
/*
* Clear state for this bank.
*/
- wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
+ mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
}
/*
@@ -342,8 +355,8 @@
goto out;
mce_setup(&m);
+ m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
- rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
/* if the restart IP is not valid, we're done for */
if (!(m.mcgstatus & MCG_STATUS_RIPV))
no_way_out = 1;
@@ -360,7 +373,7 @@
m.addr = 0;
m.bank = i;
- rdmsrl(MSR_IA32_MC0_STATUS + i*4, m.status);
+ m.status = mce_rdmsrl(MSR_IA32_MC0_STATUS + i*4);
if ((m.status & MCI_STATUS_VAL) == 0)
continue;
@@ -400,9 +413,9 @@
}
if (m.status & MCI_STATUS_MISCV)
- rdmsrl(MSR_IA32_MC0_MISC + i*4, m.misc);
+ m.misc = mce_rdmsrl(MSR_IA32_MC0_MISC + i*4);
if (m.status & MCI_STATUS_ADDRV)
- rdmsrl(MSR_IA32_MC0_ADDR + i*4, m.addr);
+ m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
mce_get_rip(&m, regs);
mce_log(&m);
@@ -467,9 +480,9 @@
/* the last thing we do is clear state */
for (i = 0; i < banks; i++) {
if (test_bit(i, toclear))
- wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
+ mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
}
- wrmsrl(MSR_IA32_MCG_STATUS, 0);
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
out:
atomic_dec(&mce_entry);
sync_core();
Impact: Feature
The exception handler should behave differently depending on whether the
exception is fatal or one that can be returned from. In the fatal case it
should never clear any registers, because these need to be preserved
for logging after the next boot. Otherwise it should clear them
on each CPU step by step, so that other CPUs sharing the same bank don't
see duplicate events. Without this we risk reporting events multiple
times on any CPUs which have shared machine check banks, which
is a common problem on Intel Nehalem, which has both SMT (two
CPU threads sharing banks) and shared machine check banks in the uncore.
Determine early, in a special pass, whether any event requires a panic.
This uses the mce_severity() function added earlier.
This is needed for the next patch.
Together with an earlier patch this also fixes a problem where
corrected events weren't logged on a fatal MCE.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 38 +++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:11.000000000 +0200
@@ -38,6 +38,7 @@
#include <asm/ipi.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
+#include "mce-internal.h"
#define MISC_MCELOG_MINOR 227
@@ -150,7 +151,7 @@
"and contact your hardware vendor\n");
}
-static void mce_panic(char *msg, struct mce *final)
+static void mce_panic(char *msg, struct mce *final, char *exp)
{
int i;
@@ -173,6 +174,8 @@
}
if (final)
print_mce(final);
+ if (exp)
+ printk(KERN_EMERG "Machine check: %s\n", exp);
panic(msg);
}
@@ -324,6 +327,22 @@
}
/*
+ * Do a quick check if any of the events requires a panic.
+ * This decides if we keep the events around or clear them.
+ */
+static int mce_no_way_out(struct mce *m, char **msg)
+{
+ int i;
+
+ for (i = 0; i < banks; i++) {
+ m->status = mce_rdmsrl(MSR_IA32_MC0_STATUS + i*4);
+ if (mce_severity(m, tolerant, msg) >= MCE_PANIC_SEVERITY)
+ return 1;
+ }
+ return 0;
+}
+
+/*
* The actual machine check handler. This only handles real
* exceptions when something got corrupted coming in through int 18.
*
@@ -347,6 +366,7 @@
*/
int kill_it = 0;
DECLARE_BITMAP(toclear, MAX_NR_BANKS);
+ char *msg = "Unknown";
atomic_inc(&mce_entry);
@@ -360,10 +380,7 @@
mce_setup(&m);
m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
-
- /* if the restart IP is not valid, we're done for */
- if (!(m.mcgstatus & MCG_STATUS_RIPV))
- no_way_out = 1;
+ no_way_out = mce_no_way_out(&m, &msg);
barrier();
@@ -395,18 +412,13 @@
__set_bit(i, toclear);
if (m.status & MCI_STATUS_EN) {
- /* if PCC was set, there's no way out */
- no_way_out |= !!(m.status & MCI_STATUS_PCC);
/*
* If this error was uncorrectable and there was
* an overflow, we're in trouble. If no overflow,
* we might get away with just killing a task.
*/
- if (m.status & MCI_STATUS_UC) {
- if (tolerant < 1 || m.status & MCI_STATUS_OVER)
- no_way_out = 1;
+ if (m.status & MCI_STATUS_UC)
kill_it = 1;
- }
} else {
/*
* Machine check event was not enabled. Clear, but
@@ -442,7 +454,7 @@
* has not set tolerant to an insane level, give up and die.
*/
if (no_way_out && tolerant < 3)
- mce_panic("Machine check", &panicm);
+ mce_panic("Machine check", &panicm, msg);
/*
* If the error seems to be unrecoverable, something should be
@@ -470,7 +482,7 @@
if (user_space && tolerant > 0) {
force_sig(SIGBUS, current);
} else if (panic_on_oops || tolerant < 2) {
- mce_panic("Uncorrected machine check", &panicm);
+ mce_panic("Uncorrected machine check", &panicm, msg);
}
}
Impact: bugfix
In some circumstances multiple CPUs can enter mce_panic() in parallel. This
gives quite confusing output because they will all dump the same machine check
buffer.
The other problem is that they would all panic in parallel, but not process
each other's shutdown IPIs because interrupts are disabled.
Detect this situation early on in mce_panic(). The first CPU
entering will do the panic; the others will just wait to be killed.
For paranoia reasons, in case the panicking CPU dies during the MCE, I added
a 5 second timeout. If it expires, each CPU will panic on its own again.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:11.000000000 +0200
@@ -151,10 +151,32 @@
"and contact your hardware vendor\n");
}
+#define PANIC_TIMEOUT 5 /* 5 seconds */
+
+static atomic_t mce_paniced;
+
+/* Panic in progress. Enable interrupts and wait for final IPI */
+static void wait_for_panic(void)
+{
+ long timeout = PANIC_TIMEOUT*USEC_PER_SEC;
+ preempt_disable();
+ local_irq_enable();
+ while (timeout-- > 0)
+ udelay(1);
+ panic("Panicing machine check CPU died");
+}
+
static void mce_panic(char *msg, struct mce *final, char *exp)
{
int i;
+ /*
+ * Make sure only one CPU runs in machine check panic
+ */
+ if (atomic_add_return(1, &mce_paniced) > 1)
+ wait_for_panic();
+ barrier();
+
bust_spinlocks(1);
console_verbose();
/* First print corrected ones that are still unlogged */
This makes it easier for tools that want to extract the mcelog out of
crash images or memory dumps to adapt to a changing struct mce size.
The record length field replaces padding, so it's fully compatible.
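As an illustration, a crash-dump tool could walk the log roughly like this (a sketch only; "log" points at a mapped copy of the kernel's mce_log image and process_record() is a hypothetical consumer, not part of this patch):
	/* Step through the entries using recordlen as the stride, so the
	 * tool keeps working even if struct mce grows in a newer kernel. */
	char *base = (char *)log->entry;
	unsigned int i;
	for (i = 0; i < log->len; i++) {
		struct mce *m = (struct mce *)(base + (size_t)i * log->recordlen);
		process_record(m);
	}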
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 2 +-
arch/x86/kernel/cpu/mcheck/mce_64.c | 5 +++--
2 files changed, 4 insertions(+), 3 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:10.000000000 +0200
@@ -94,8 +94,9 @@
*/
static struct mce_log mcelog = {
- MCE_LOG_SIGNATURE,
- MCE_LOG_LEN,
+ .signature = MCE_LOG_SIGNATURE,
+ .len = MCE_LOG_LEN,
+ .recordlen = sizeof(struct mce),
};
void mce_log(struct mce *mce)
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:10.000000000 +0200
@@ -59,7 +59,7 @@
unsigned len; /* = MCE_LOG_LEN */
unsigned next;
unsigned flags;
- unsigned pad0;
+ unsigned recordlen; /* length of struct mce */
struct mce entry[MCE_LOG_LEN];
};
Impact: feature, fixes long-standing problems.
On Intel platforms machine check exceptions are always broadcast to all CPUs.
This patch makes the machine check handler synchronize all these machine
checks, elect a Monarch to handle the event and collect the worst event from
all CPUs and then process it first.
This has some advantages:
- When there is a truly data corrupting error the system panics as
quickly as possible. This improves containment of corrupted
data and makes sure the corrupted data never hits stable storage.
- The panics are synchronized and do not reenter the panic code
on multiple CPUs (which currently does not handle this well).
- All the errors are reported. Currently it often happens that
another CPU happens to do the panic first, but reports useless
information (empty machine check) because the real error
happened on another CPU which came in later.
This is a big advantage on Nehalem where the 8 threads per CPU
often lead to the wrong CPU winning the race and dumping
useless information on a machine check. The problem also occurs
in a less severe form on older CPUs.
- The system can detect when no CPUs detected a machine check
and shut down the system. This can happen when one CPU is so
badly hung that it cannot process a machine check anymore
or when some external agent wants to stop the system by
asserting the machine check pin. This follows Intel hardware
recommendations.
- This matches the error model recommended by the CPU designers.
- The events can be output in true severity order.
- When a panic happens on another CPU it makes sure it is actually
able to process the stop IPI by enabling interrupts.
The code is extremely careful to handle timeouts while waiting
for other CPUs. It can't rely on the normal timing mechanisms
(jiffies, ktime_get) because of its asynchronous/lockless nature,
so it uses its own timeouts based on ndelay() and a "SPINUNIT" granularity.
The timeout is configurable. By default it waits for up to one
second for the other CPUs. This can also be disabled.
From some informal testing, AMD systems do not seem to broadcast
machine checks, so right now it's always disabled by default on
non-Intel CPUs and also on very old Intel systems.
Includes fixes from Ying Huang
Signed-off-by: Andi Kleen <[email protected]>
---
Documentation/x86/x86_64/boot-options.txt | 6
Documentation/x86/x86_64/machinecheck | 4
arch/x86/kernel/cpu/mcheck/mce_64.c | 358 +++++++++++++++++++++++++++---
3 files changed, 340 insertions(+), 28 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:10.000000000 +0200
@@ -29,6 +29,8 @@
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/ratelimit.h>
+#include <linux/nmi.h>
+#include <linux/delay.h>
#include <asm/processor.h>
#include <asm/msr.h>
#include <asm/mce.h>
@@ -42,6 +44,8 @@
#define MISC_MCELOG_MINOR 227
+#define SPINUNIT 100 /* 100ns */
+
atomic_t mce_entry;
static int mce_dont_init;
@@ -59,11 +63,14 @@
static unsigned long notify_user;
static int rip_msr;
static int mce_bootlog = -1;
+static int monarch_timeout = -1;
static char trigger[128];
static char *trigger_argv[2] = { trigger, NULL };
static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
+static DEFINE_PER_CPU(struct mce, mces_seen);
+static int cpu_missing;
/* MCA banks polled by the period polling timer for corrected events */
DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
@@ -196,6 +203,8 @@
}
if (final)
print_mce(final);
+ if (cpu_missing)
+ printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
if (exp)
printk(KERN_EMERG "Machine check: %s\n", exp);
panic(msg);
@@ -365,18 +374,284 @@
}
/*
+ * Variable to establish order between CPUs while scanning.
+ * Each CPU spins initially until executing is equal its number.
+ */
+static atomic_t mce_executing;
+
+/*
+ * Defines order of CPUs on entry. First CPU becomes Monarch.
+ */
+static atomic_t mce_callin;
+
+/*
+ * Check if a timeout waiting for other CPUs happened.
+ */
+static int mce_timed_out(u64 *t)
+{
+ /*
+ * The others already did panic for some reason.
+ * Bail out like in a timeout.
+ * rmb() to tell the compiler that system_state
+ * might have been modified by someone else.
+ */
+ rmb();
+ if (atomic_read(&mce_paniced))
+ wait_for_panic();
+ if (!monarch_timeout)
+ goto out;
+ if ((s64)*t < SPINUNIT) {
+ /* CHECKME: Make panic default for 1 too? */
+ if (tolerant < 1)
+ mce_panic("Timeout synchronizing machine check over CPUs",
+ NULL, NULL);
+ cpu_missing = 1;
+ return 1;
+ }
+ *t -= SPINUNIT;
+out:
+ touch_nmi_watchdog();
+ return 0;
+}
+
+/*
+ * The Monarch's reign. The Monarch is the CPU who entered
+ * the machine check handler first. It waits for the others to
+ * raise the exception too and then grades them. When any
+ * error is fatal panic. Only then let the others continue.
+ *
+ * The other CPUs entering the MCE handler will be controlled by the
+ * Monarch. They are called Subjects.
+ *
+ * This way we prevent any potential data corruption in an unrecoverable case
+ * and also make sure that all CPUs' errors are always examined.
+ *
+ * Also this detects the case of a machine check event coming from outer
+ * space (not detected by any CPUs). In this case some external agent wants
+ * us to shut down, so panic too.
+ *
+ * The other CPUs might still decide to panic if the handler happens
+ * in a unrecoverable place, but in this case the system is in a semi-stable
+ * state and won't corrupt anything by itself. It's ok to let the others
+ * continue for a bit first.
+ *
+ * All the spin loops have timeouts; when a timeout happens a CPU
+ * typically elects itself to be Monarch.
+ */
+static void mce_reign(void)
+{
+ int cpu;
+ struct mce *m = NULL;
+ int global_worst = 0;
+ char *msg = NULL;
+
+ /*
+ * This CPU is the Monarch and the other CPUs have run
+ * through their handlers.
+ * Grade the severity of the errors of all the CPUs.
+ */
+ for_each_possible_cpu (cpu) {
+ int severity = mce_severity(&per_cpu(mces_seen, cpu), tolerant,
+ &msg);
+ if (severity > global_worst) {
+ global_worst = severity;
+ m = &per_cpu(mces_seen, cpu);
+ }
+ }
+
+ /*
+ * Cannot recover? Panic here then.
+ * This dumps all the mces in the log buffer and stops the
+ * other CPUs.
+ */
+ if (m && global_worst >= MCE_PANIC_SEVERITY && tolerant < 3)
+ mce_panic("Fatal machine check", m, msg);
+
+ /*
+ * For UC somewhere we let the CPU who detects it handle it.
+ * Also must let continue the others, otherwise the handling
+ * CPU could deadlock on a lock.
+ */
+
+ /*
+ * No machine check event found. Must be some external
+ * source or one CPU is hung. Panic.
+ */
+ if (!m && tolerant < 3)
+ mce_panic("Machine check from unknown source", NULL, NULL);
+
+ /*
+ * Now clear all the mces_seen so that they don't reappear on
+ * the next mce.
+ */
+ for_each_possible_cpu (cpu)
+ memset(&per_cpu(mces_seen, cpu), 0, sizeof(struct mce));
+}
+
+/*
+ * Start of Monarch synchronization. This waits until all CPUs have
+ * entered the exception handler and then determines if any of them
+ * saw a fatal event that requires panic. Then it executes them
+ * in the entry order.
+ * TBD double check parallel CPU hotunplug
+ */
+static int mce_start(int no_way_out, int *order)
+{
+ int nwo;
+ int cpus = num_online_cpus();
+ static atomic_t global_nwo;
+ u64 timeout = (u64)monarch_timeout * NSEC_PER_USEC;
+
+ if (!timeout) {
+ *order = -1;
+ return no_way_out;
+ }
+
+ atomic_add(no_way_out, &global_nwo);
+
+ /*
+ * Wait for everyone.
+ */
+ while (atomic_read(&mce_callin) != cpus) {
+ if (mce_timed_out(&timeout)) {
+ atomic_set(&global_nwo, 0);
+ *order = -1;
+ return no_way_out;
+ }
+ ndelay(SPINUNIT);
+ }
+
+ /*
+ * Cache the global no_way_out state.
+ */
+ nwo = atomic_read(&global_nwo);
+
+ /*
+ * Monarch starts executing now, the others wait.
+ */
+ if (*order == 1) {
+ atomic_set(&global_nwo, 0);
+ atomic_set(&mce_executing, 1);
+ return nwo;
+ }
+
+ /*
+ * Now start the scanning loop one by one
+ * in the original callin order.
+ * This way when there are any shared banks it will
+ * be only seen by one CPU before cleared, avoiding duplicates.
+ */
+ while (atomic_read(&mce_executing) < *order) {
+ if (mce_timed_out(&timeout)) {
+ atomic_set(&global_nwo, 0);
+ *order = -1;
+ return no_way_out;
+ }
+ ndelay(SPINUNIT);
+ }
+ return nwo;
+}
+
+/*
+ * Synchronize between CPUs after main scanning loop.
+ * This invokes the bulk of the Monarch processing.
+ */
+static int mce_end(int order)
+{
+ int ret = -1;
+ u64 timeout = (u64)monarch_timeout * NSEC_PER_USEC;
+
+ if (!timeout)
+ goto reset;
+ if (order < 0)
+ goto reset;
+
+ /*
+ * Allow others to run.
+ */
+ atomic_inc(&mce_executing);
+
+ if (order == 1) {
+ /* CHECKME: Can this race with a parallel hotplug? */
+ int cpus = num_online_cpus();
+
+ /*
+ * Monarch: Wait for everyone to go through their scanning
+ * loops.
+ */
+ while (atomic_read(&mce_executing) <= cpus) {
+ if (mce_timed_out(&timeout))
+ goto reset;
+ ndelay(SPINUNIT);
+ }
+
+ mce_reign();
+ barrier();
+ ret = 0;
+ } else {
+ /*
+ * Subject: Wait for Monarch to finish.
+ */
+ while (atomic_read(&mce_executing) != 0) {
+ if (mce_timed_out(&timeout))
+ goto reset;
+ ndelay(SPINUNIT);
+ }
+
+ /*
+ * Don't reset anything. That's done by the Monarch.
+ */
+ return 0;
+ }
+
+ /*
+ * Reset all global state.
+ */
+reset:
+ atomic_set(&mce_callin, 0);
+ barrier();
+
+ /*
+ * Let others run again.
+ */
+ atomic_set(&mce_executing, 0);
+ return ret;
+}
+
+static void mce_clear_state(unsigned long *toclear)
+{
+ int i;
+
+ for (i = 0; i < banks; i++) {
+ if (test_bit(i, toclear))
+ mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
+ }
+}
+
+/*
* The actual machine check handler. This only handles real
* exceptions when something got corrupted coming in through int 18.
*
* This is executed in NMI context not subject to normal locking rules. This
* implies that most kernel services cannot be safely used. Don't even
* think about putting a printk in there!
+ *
+ * On Intel systems this is entered on all CPUs in parallel through
+ * MCE broadcast. However some CPUs might be broken beyond repair,
+ * so be always careful when synchronizing with others.
*/
void do_machine_check(struct pt_regs * regs, long error_code)
{
- struct mce m, panicm;
+ struct mce m, *final;
int i;
- int panicm_found = 0;
+ int worst = 0;
+ int severity;
+ /*
+ * Establish sequential order between the CPUs entering the machine
+ * check handler.
+ */
+ int order;
+
/*
* If no_way_out gets set, there is no safe way to recover from this
* MCE. If tolerant is cranked up, we'll try anyway.
@@ -400,12 +675,22 @@
if (!banks)
goto out;
+ order = atomic_add_return(1, &mce_callin);
mce_setup(&m);
m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
no_way_out = mce_no_way_out(&m, &msg);
+ final = &__get_cpu_var(mces_seen);
+ *final = m;
+
barrier();
+ /*
+ * Go through all the banks in exclusion of the other CPUs.
+ * This way we don't report duplicated events on shared banks
+ * because the first one to see it will clear it.
+ */
+ no_way_out = mce_start(no_way_out, &order);
for (i = 0; i < banks; i++) {
__clear_bit(i, toclear);
if (!bank[i])
@@ -457,26 +742,32 @@
mce_get_rip(&m, regs);
mce_log(&m);
- /* Did this bank cause the exception? */
- /* Assume that the bank with uncorrectable errors did it,
- and that there is only a single one. */
- if ((m.status & MCI_STATUS_UC) && (m.status & MCI_STATUS_EN)) {
- panicm = m;
- panicm_found = 1;
+ severity = mce_severity(&m, tolerant, NULL);
+ if (severity > worst) {
+ *final = m;
+ worst = severity;
}
}
- /* If we didn't find an uncorrectable error, pick
- the last one (shouldn't happen, just being safe). */
- if (!panicm_found)
- panicm = m;
+ if (!no_way_out)
+ mce_clear_state(toclear);
+
+ /*
+ * Do most of the synchronization with other CPUs.
+ * When there's any problem use only local no_way_out state.
+ */
+ if (mce_end(order) < 0)
+ no_way_out = worst >= MCE_PANIC_SEVERITY;
/*
* If we have decided that we just CAN'T continue, and the user
- * has not set tolerant to an insane level, give up and die.
+ * has not set tolerant to an insane level, give up and die.
+ *
+ * This is mainly used in the case when the system doesn't
+ * support MCE broadcasting or it has been disabled.
*/
if (no_way_out && tolerant < 3)
- mce_panic("Machine check", &panicm, msg);
+ mce_panic("Machine check", final, msg);
/*
* If the error seems to be unrecoverable, something should be
@@ -492,7 +783,7 @@
* instruction which caused the MCE.
*/
if (m.mcgstatus & MCG_STATUS_EIPV)
- user_space = panicm.ip && (panicm.cs & 3);
+ user_space = final->ip && (final->cs & 3);
/*
* If we know that the error was in user space, send a
@@ -504,20 +795,15 @@
if (user_space && tolerant > 0) {
force_sig(SIGBUS, current);
} else if (panic_on_oops || tolerant < 2) {
- mce_panic("Uncorrected machine check", &panicm, msg);
+ mce_panic("Uncorrected machine check", final, msg);
}
}
/* notify userspace ASAP */
set_thread_flag(TIF_MCE_NOTIFY);
- mce_report_event(regs);
-
- /* the last thing we do is clear state */
- for (i = 0; i < banks; i++) {
- if (test_bit(i, toclear))
- mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
- }
+ if (worst > 0)
+ mce_report_event(regs);
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
out:
atomic_dec(&mce_entry);
@@ -695,7 +981,17 @@
by default and leave crap in there. Don't log. */
mce_bootlog = 0;
}
-
+ if (c->x86_vendor == X86_VENDOR_INTEL) {
+ /*
+ * All newer Intel systems support MCE broadcasting. Enable
+ * synchronization with a one second timeout.
+ */
+ if ((c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xe)) &&
+ monarch_timeout < 0)
+ monarch_timeout = USEC_PER_SEC;
+ }
+ if (monarch_timeout < 0)
+ monarch_timeout = 0;
}
static void mce_cpu_features(struct cpuinfo_x86 *c)
@@ -920,7 +1216,9 @@
}
/* mce=off disables machine check.
- mce=TOLERANCELEVEL (number, see above)
+ mce=TOLERANCELEVEL[,monarchtimeout] (number, see above)
+ monarchtimeout is how long to wait for other CPUs on machine check, or 0
+ to not wait
mce=bootlog Log MCEs from before booting. Disabled by default on AMD.
mce=nobootlog Don't log MCEs from before booting. */
static int __init mcheck_enable(char *str)
@@ -929,9 +1227,13 @@
mce_dont_init = 1;
else if (!strcmp(str, "bootlog") || !strcmp(str,"nobootlog"))
mce_bootlog = str[0] == 'b';
- else if (isdigit(str[0]))
+ else if (isdigit(str[0])) {
get_option(&str, &tolerant);
- else
+ if (*str == ',') {
+ ++str;
+ get_option(&str, &monarch_timeout);
+ }
+ } else
printk("mce= argument %s ignored. Please use /sys", str);
return 1;
}
@@ -1057,9 +1359,11 @@
store_int_with_restart),
&check_interval
};
+static SYSDEV_INT_ATTR(monarch_timeout, 0644, monarch_timeout);
static struct sysdev_attribute *mce_attributes[] = {
&attr_tolerant.attr, &attr_check_interval.attr, &attr_trigger,
+ &attr_monarch_timeout.attr,
NULL
};
Index: linux/Documentation/x86/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86/x86_64/boot-options.txt 2009-04-07 16:09:52.000000000 +0200
+++ linux/Documentation/x86/x86_64/boot-options.txt 2009-04-07 16:43:10.000000000 +0200
@@ -13,13 +13,17 @@
in a reboot. On Intel systems it is enabled by default.
mce=nobootlog
Disable boot machine check logging.
- mce=tolerancelevel (number)
+ mce=tolerancelevel[,monarchtimeout] (number,number)
+ tolerance levels:
0: always panic on uncorrected errors, log corrected errors
1: panic or SIGBUS on uncorrected errors, log corrected errors
2: SIGBUS or log uncorrected errors, log corrected errors
3: never panic or SIGBUS, log all errors (for testing only)
Default is 1
Can be also set using sysfs which is preferable.
+ monarchtimeout:
+ Sets the time in us to wait for other CPUs on machine checks. 0
+ to disable.
nomce (for compatibility with i386): same as mce=off
Index: linux/Documentation/x86/x86_64/machinecheck
===================================================================
--- linux.orig/Documentation/x86/x86_64/machinecheck 2009-04-07 16:09:52.000000000 +0200
+++ linux/Documentation/x86/x86_64/machinecheck 2009-04-07 16:43:10.000000000 +0200
@@ -67,6 +67,10 @@
Program to run when a machine check event is detected.
This is an alternative to running mcelog regularly from cron
and allows to detect events faster.
+monarch_timeout
+ How long to wait for the other CPUs to also machine check on an
+ exception. 0 to disable waiting for other CPUs.
+ Unit: us
TBD document entries for AMD threshold interrupt configuration
Document that check_interval set to 0 means no polling.
Noticed by Hidetoshi Seto
Also add a reference from boot options to the sysfs tunables
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
Documentation/x86/x86_64/boot-options.txt | 2 ++
Documentation/x86/x86_64/machinecheck | 4 +++-
2 files changed, 5 insertions(+), 1 deletion(-)
Index: linux/Documentation/x86/x86_64/machinecheck
===================================================================
--- linux.orig/Documentation/x86/x86_64/machinecheck 2009-04-07 16:09:59.000000000 +0200
+++ linux/Documentation/x86/x86_64/machinecheck 2009-04-07 16:09:59.000000000 +0200
@@ -41,7 +41,9 @@
the polling interval. When the poller stops finding MCEs, it
triggers an exponential backoff (poll less often) on the polling
interval. The check_interval variable is both the initial and
- maximum polling interval.
+ maximum polling interval. 0 means no polling for corrected machine
+ check errors (but some corrected errors might still be reported
+ in other ways)
tolerant
Tolerance level. When a machine check exception occurs for a non
Index: linux/Documentation/x86/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86/x86_64/boot-options.txt 2009-04-07 16:09:59.000000000 +0200
+++ linux/Documentation/x86/x86_64/boot-options.txt 2009-04-07 16:09:59.000000000 +0200
@@ -5,6 +5,8 @@
Machine check
+ Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
+
mce=off disable machine check
mce=bootlog Enable logging of machine checks left over from booting.
Disabled by default on AMD because some BIOS leave bogus ones.
Fatal machine checks can be logged to disk after boot, but only if
the system did a warm reboot. That's unfortunately difficult with the
default panic behaviour, which waits forever and the admin has to
press the power button because modern systems usually miss a reset button.
This clears the machine checks in the registers and makes
it impossible to log them.
This patch changes the default for machine check panic to always
reboot after 30s. Then the mce can be successfully logged after
reboot.
I believe this will improve the machine check experience for any
system running the X server.
This depends on successful boot logging of MCEs. That currently
only works on Intel systems; on AMD there are quite a lot of systems
around that leave junk in the machine check registers after boot,
so the new default is disabled for them. These systems will continue
to default to the endless waiting panic.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce_64.c | 5 +++++
1 file changed, 5 insertions(+)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:10.000000000 +0200
@@ -64,6 +64,7 @@
static int rip_msr;
static int mce_bootlog = -1;
static int monarch_timeout = -1;
+static int mce_panic_timeout;
static char trigger[128];
static char *trigger_argv[2] = { trigger, NULL };
@@ -171,6 +172,7 @@
local_irq_enable();
while (timeout-- > 0)
udelay(1);
+ panic_timeout = mce_panic_timeout;
panic("Panicing machine check CPU died");
}
@@ -208,6 +210,7 @@
printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
if (exp)
printk(KERN_EMERG "Machine check: %s\n", exp);
+ panic_timeout = mce_panic_timeout;
panic(msg);
}
@@ -993,6 +996,8 @@
}
if (monarch_timeout < 0)
monarch_timeout = 0;
+ if (mce_bootlog != 0)
+ mce_panic_timeout = 30;
}
static void mce_cpu_features(struct cpuinfo_x86 *c)
The old struct mce was limited to 256 CPUs, but x86 Linux now supports more than
that with x2APIC. Add a new field, extcpu, to report the extended CPU number.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 4 ++--
arch/x86/kernel/cpu/mcheck/mce_64.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:09.000000000 +0200
@@ -40,9 +40,9 @@
__u64 res2; /* dito. */
__u8 cs; /* code segment */
__u8 bank; /* machine check bank */
- __u8 cpu; /* cpu that raised the error */
+ __u8 cpu; /* cpu number; obsolete; use extcpu now */
__u8 finished; /* entry is valid */
- __u32 pad;
+ __u32 extcpu; /* linux cpu number that detected the error */
};
/*
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:09.000000000 +0200
@@ -82,7 +82,7 @@
void mce_setup(struct mce *m)
{
memset(m, 0, sizeof(struct mce));
- m->cpu = smp_processor_id();
+ m->cpu = m->extcpu = smp_processor_id();
rdtscll(m->tsc);
}
@@ -140,7 +140,7 @@
KERN_EMERG "HARDWARE ERROR\n"
KERN_EMERG
"CPU %d: Machine Check Exception: %16Lx Bank %d: %016Lx\n",
- m->cpu, m->mcgstatus, m->bank, m->status);
+ m->extcpu, m->mcgstatus, m->bank, m->status);
if (m->ip) {
printk(KERN_EMERG "RIP%s %02x:<%016Lx> ",
!(m->mcgstatus & MCG_STATUS_EIPV) ? " !INEXACT!" : "",
Experience has shown that struct mce, which is used to pass a machine check
to the user space daemon, currently has a few limitations. Some data
which is useful to print at panic level is also missing.
This patch addresses most of them. The same information is also
printed out together with the mce panic message.
struct mce can be painlessly extended in a compatible way: the mcelog user space
code just ignores additional fields with a warning.
- It doesn't provide a wall time timestamp. There have been a few
complaints about that. Fix that by adding a 64bit time_t
- It doesn't provide the exact CPU identification. This makes
it awkward for mcelog to decode the event correctly, especially
when there are variations in the supported MCE codes on different
CPU models or when mcelog is running on a different host after a panic.
Previously the administrator had to specify the correct CPU
when mcelog ran on a different host, but with the greater variation
in machine checks now it's better to auto-detect that.
It's also useful for more detailed analysis of CPU events.
Pass CPUID 1.EAX and the CPU vendor (as encoded in processor.h) instead.
- Socket ID and initial APIC ID are useful to report because they
allow identifying the failing CPU in some (but not all) cases.
This is especially useful in the panic situation.
This addresses one of the complaints from Thomas Gleixner earlier.
- The MCG capabilities MSR needs to be reported for some advanced
error processing in mcelog
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 10 ++++++++--
arch/x86/kernel/cpu/mcheck/mce_64.c | 12 ++++++++++++
2 files changed, 20 insertions(+), 2 deletions(-)
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:09.000000000 +0200
@@ -36,13 +36,19 @@
__u64 mcgstatus;
__u64 ip;
__u64 tsc; /* cpu time stamp counter */
- __u64 res1; /* for future extension */
- __u64 res2; /* dito. */
+ __u64 time; /* wall time_t when error was detected */
+ __u8 cpuvendor; /* cpu vendor as encoded in system.h */
+ __u8 pad1;
+ __u16 pad2;
+ __u32 cpuid; /* CPUID 1 EAX */
__u8 cs; /* code segment */
__u8 bank; /* machine check bank */
__u8 cpu; /* cpu number; obsolete; use extcpu now */
__u8 finished; /* entry is valid */
__u32 extcpu; /* linux cpu number that detected the error */
+ __u32 socketid; /* CPU socket ID */
+ __u32 apicid; /* CPU initial apic ID */
+ __u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
};
/*
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:09.000000000 +0200
@@ -84,6 +84,15 @@
memset(m, 0, sizeof(struct mce));
m->cpu = m->extcpu = smp_processor_id();
rdtscll(m->tsc);
+ /* We hope get_seconds stays lockless */
+ m->time = get_seconds();
+ m->cpuvendor = boot_cpu_data.x86_vendor;
+ m->cpuid = cpuid_eax(1);
+#ifdef CONFIG_SMP
+ m->socketid = cpu_data(m->extcpu).phys_proc_id;
+#endif
+ m->apicid = cpu_data(m->extcpu).initial_apicid;
+ rdmsrl(MSR_IA32_MCG_CAP, m->mcgcap);
}
DEFINE_PER_CPU(unsigned, mce_exception_count);
@@ -155,6 +164,9 @@
if (m->misc)
printk("MISC %llx ", m->misc);
printk("\n");
+ printk(KERN_EMERG "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x\n",
+ m->cpuvendor, m->cpuid, m->time, m->socketid,
+ m->apicid);
printk(KERN_EMERG "This is not a software problem!\n");
printk(KERN_EMERG "Run through mcelog --ascii to decode "
"and contact your hardware vendor\n");
Allow user programs to write mce records into /dev/mcelog. When they do,
a fake machine check is triggered to test the machine check code.
This uses the MCE MSR wrappers added earlier.
The implementation is straightforward: there is a struct mce record
per CPU, and the MCE MSR accesses return data from there when valid
data has been injected. This allows testing the machine check code
relatively realistically, because only the lowest layer of hardware
access is intercepted.
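For illustration, a minimal user-space sketch of driving the injector
(an assumption about usage, not part of the patch; it relies on the
exported <asm/mce.h> matching the patched kernel, and the file name and
chosen bank/CPU are arbitrary):

/* inject_one.c -- illustrative sketch only.
 * Writes one fake corrected-error record to /dev/mcelog, which makes
 * the injector run the poll path on the target CPU.
 * Build: cc -O2 -o inject_one inject_one.c   (run as root)
 */
#include <asm/mce.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        struct mce m;
        int fd = open("/dev/mcelog", O_WRONLY);

        if (fd < 0) {
                perror("open /dev/mcelog");
                return 1;
        }

        memset(&m, 0, sizeof(m));
        m.status = MCI_STATUS_VAL | MCI_STATUS_EN;  /* corrected error */
        m.bank   = 1;
        m.extcpu = 0;                               /* inject on CPU 0 */

        /* Without MCI_STATUS_UC set this exercises the poll path;
         * with it set a full #MC exception would be simulated. */
        if (write(fd, &m, sizeof(m)) < 0)
                perror("write");

        close(fd);
        return 0;
}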
The test suite and injector are available at
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/Kconfig | 8 +
arch/x86/include/asm/mce.h | 4
arch/x86/kernel/cpu/mcheck/Makefile | 1
arch/x86/kernel/cpu/mcheck/mce-inject.c | 129 ++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mcheck/mce_64.c | 39 +++++++++
5 files changed, 180 insertions(+), 1 deletion(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce-inject.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/arch/x86/kernel/cpu/mcheck/mce-inject.c 2009-04-07 16:43:05.000000000 +0200
@@ -0,0 +1,129 @@
+/*
+ * Machine check injection support.
+ * Copyright 2008 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ * Authors:
+ * Andi Kleen
+ * Ying Huang
+ */
+#include <linux/module.h>
+#include <linux/timer.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/fs.h>
+#include <linux/smp.h>
+#include <asm/uaccess.h>
+#include <asm/mce.h>
+
+/* Update fake mce registers on current CPU. */
+static void inject_mce(struct mce *m)
+{
+ struct mce *i = &per_cpu(injectm, m->extcpu);
+
+ /* Make sure noone reads partially written injectm */
+ i->finished = 0;
+ mb();
+ m->finished = 0;
+ /* First set the fields after finished */
+ i->extcpu = m->extcpu;
+ mb();
+ /* Now write record in order, finished last (except above) */
+ memcpy(i, m, sizeof(struct mce));
+ /* Finally activate it */
+ mb();
+ i->finished = 1;
+}
+
+struct delayed_mce {
+ struct timer_list timer;
+ struct mce m;
+};
+
+/* Inject mce on current CPU */
+static void raise_mce(unsigned long data)
+{
+ struct delayed_mce *dm = (struct delayed_mce *)data;
+ struct mce *m = &dm->m;
+ int cpu = m->extcpu;
+
+ inject_mce(m);
+ if (m->status & MCI_STATUS_UC) {
+ struct pt_regs regs;
+ memset(&regs, 0, sizeof(struct pt_regs));
+ regs.ip = m->ip;
+ regs.cs = m->cs;
+ printk(KERN_INFO "Triggering MCE exception on CPU %d\n", cpu);
+ do_machine_check(&regs, 0);
+ printk(KERN_INFO "MCE exception done on CPU %d\n", cpu);
+ } else {
+ mce_banks_t b;
+ memset(&b, 0xff, sizeof(mce_banks_t));
+ printk(KERN_INFO "Starting machine check poll CPU %d\n", cpu);
+ machine_check_poll(0, &b);
+ mce_notify_user();
+ printk(KERN_INFO "Finished machine check poll on CPU %d\n",
+ cpu);
+ }
+ kfree(dm);
+}
+
+/* Error injection interface */
+static ssize_t mce_write(struct file *filp, const char __user *ubuf,
+ size_t usize, loff_t *off)
+{
+ struct delayed_mce *dm;
+ struct mce m;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ /*
+ * There are some cases where real MSR reads could slip
+ * through.
+ */
+ if (!boot_cpu_has(X86_FEATURE_MCE) || !boot_cpu_has(X86_FEATURE_MCA))
+ return -EIO;
+
+ if ((unsigned long)usize > sizeof(struct mce))
+ usize = sizeof(struct mce);
+ if (copy_from_user(&m, ubuf, usize))
+ return -EFAULT;
+
+ if (m.cpu && !m.extcpu)
+ m.extcpu = m.cpu;
+
+ if (m.extcpu >= NR_CPUS || !cpu_online(m.extcpu))
+ return -EINVAL;
+
+ dm = kmalloc(sizeof(struct delayed_mce), GFP_KERNEL);
+ if (!dm)
+ return -ENOMEM;
+
+ /*
+ * Need to give user space some time to set everything up,
+ * so do it a jiffie or two later everywhere.
+ * Should we use a hrtimer here for better synchronization?
+ */
+ memcpy(&dm->m, &m, sizeof(struct mce));
+ setup_timer(&dm->timer, raise_mce, (unsigned long)dm);
+ dm->timer.expires = jiffies + 2;
+ add_timer_on(&dm->timer, m.extcpu);
+ return usize;
+}
+
+static int inject_init(void)
+{
+ printk(KERN_INFO "Machine check injector initialized\n");
+ mce_chrdev_ops.write = mce_write;
+ return 0;
+}
+
+module_init(inject_init);
+/* Cannot tolerate unloading currently because we cannot
+ * guarantee all openers of mce_chrdev will get a reference to us.
+ */
+MODULE_LICENSE("GPL");
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:08.000000000 +0200
@@ -156,6 +156,10 @@
#endif /* !CONFIG_X86_32 */
+DECLARE_PER_CPU(struct mce, injectm);
+extern struct file_operations mce_chrdev_ops;
+extern int mce_notify_user(void);
+
#ifdef CONFIG_X86_MCE
extern void mcheck_init(struct cpuinfo_x86 *c);
#else
Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig 2009-04-07 16:09:50.000000000 +0200
+++ linux/arch/x86/Kconfig 2009-04-07 16:43:04.000000000 +0200
@@ -795,6 +795,14 @@
bool
default y
+config X86_MCE_INJECT
+ depends on X86_64
+ tristate "Machine check injector support"
+ help
+ Provide support for injecting machine checks for testing purposes.
+ If you don't know what a machine check is and you don't do kernel
+ QA it is safe to say n.
+
config X86_MCE_NONFATAL
tristate "Check for non-fatal errors on AMD Athlon/Duron / Intel Pentium 4"
depends on X86_32 && X86_MCE
Index: linux/arch/x86/kernel/cpu/mcheck/Makefile
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/Makefile 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/Makefile 2009-04-07 16:09:59.000000000 +0200
@@ -5,4 +5,5 @@
obj-$(CONFIG_X86_MCE_AMD) += mce_amd_64.o
obj-$(CONFIG_X86_MCE_NONFATAL) += non-fatal.o
obj-$(CONFIG_X86_MCE_THRESHOLD) += threshold.o
+obj-$(CONFIG_X86_MCE_INJECT) += mce-inject.o
obj-$(CONFIG_X86_64) += mce-severity.o
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:08.000000000 +0200
@@ -97,6 +97,9 @@
DEFINE_PER_CPU(unsigned, mce_exception_count);
+DEFINE_PER_CPU(struct mce, injectm);
+EXPORT_PER_CPU_SYMBOL_GPL(injectm);
+
/*
* Lockless MCE logging infrastructure.
* This avoids deadlocks on printk locks without having to break locks. Also
@@ -226,16 +229,46 @@
panic(msg);
}
+/* Support code for software error injection */
+
+static int msr_to_offset(u32 msr)
+{
+ unsigned bank = __get_cpu_var(injectm.bank);
+ if (msr == rip_msr)
+ return offsetof(struct mce, ip);
+ if (msr == MSR_IA32_MC0_STATUS + bank*4)
+ return offsetof(struct mce, status);
+ if (msr == MSR_IA32_MC0_ADDR + bank*4)
+ return offsetof(struct mce, addr);
+ if (msr == MSR_IA32_MC0_MISC + bank*4)
+ return offsetof(struct mce, misc);
+ if (msr == MSR_IA32_MCG_STATUS)
+ return offsetof(struct mce, mcgstatus);
+ return -1;
+}
+
/* MSR access wrappers used for error injection */
static u64 mce_rdmsrl(u32 msr)
{
u64 v;
+ if (__get_cpu_var(injectm).finished) {
+ int offset = msr_to_offset(msr);
+ if (offset < 0)
+ return 0;
+ return *(u64 *)((char *)&__get_cpu_var(injectm) + offset);
+ }
rdmsrl(msr, v);
return v;
}
static void mce_wrmsrl(u32 msr, u64 v)
{
+ if (__get_cpu_var(injectm).finished) {
+ int offset = msr_to_offset(msr);
+ if (offset >= 0)
+ *(u64 *)((char *)&__get_cpu_var(injectm) + offset) = v;
+ return;
+ }
wrmsrl(msr, v);
}
@@ -372,6 +405,7 @@
sync_core();
}
+EXPORT_SYMBOL_GPL(machine_check_poll);
/*
* Do a quick check if any of the events requires a panic.
@@ -825,6 +859,7 @@
atomic_dec(&mce_entry);
sync_core();
}
+EXPORT_SYMBOL_GPL(do_machine_check);
#ifdef CONFIG_X86_MCE_INTEL
/***
@@ -924,6 +959,7 @@
}
return 0;
}
+EXPORT_SYMBOL_GPL(mce_notify_user);
/*
* Initialize Machine Checks for a CPU.
@@ -1210,13 +1246,14 @@
}
}
-static const struct file_operations mce_chrdev_ops = {
+struct file_operations mce_chrdev_ops = {
.open = mce_open,
.release = mce_release,
.read = mce_read,
.poll = mce_poll,
.unlocked_ioctl = mce_ioctl,
};
+EXPORT_SYMBOL_GPL(mce_chrdev_ops);
static struct miscdevice mce_log_device = {
MISC_MCELOG_MINOR,
Needed for the modular machine check injector.
Signed-off-by: Andi Kleen <[email protected]>
---
kernel/timer.c | 1 +
1 file changed, 1 insertion(+)
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c 2009-04-07 16:09:50.000000000 +0200
+++ linux/kernel/timer.c 2009-04-07 16:09:59.000000000 +0200
@@ -753,6 +753,7 @@
wake_up_idle_cpu(cpu);
spin_unlock_irqrestore(&base->lock, flags);
}
+EXPORT_SYMBOL_GPL(add_timer_on);
/**
* del_timer - deactive a timer.
The x86 architecture recently added some new machine check status bits:
S(ignalled) and AR (Action-Required). Signalled allows checking
whether a specific event caused an exception or was just logged through CMCI.
AR allows the kernel to decide whether an event needs immediate action
or can be delayed or ignored.
Implement support for these new status bits. mce_severity() uses
the new bits to grade the machine check correctly and decide what
to do. The exception handler uses AR to decide whether to kill or not.
The S bit is used to separate events between the poll/CMCI handler
and the exception handler.
Classical UC always leads to panic. That was true before anyway,
because the existing CPUs always passed PCC with it.
Also corrects the rules for whether to kill in user or kernel context
and how to handle a missing RIPV.
The machine check handler largely uses the mce-severity grading
engine now instead of making its own decisions. This means the logic
is centralized in one place.
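As a rough illustration only -- a simplified reading of the severities[]
table added below, with the S/AR bit values taken from the mce.h hunk and
the names in the sketch being mine -- the new bits partition the cases
like this when the CPU advertises MCG_SER_P:

/* Sketch: how S/AR split events under MCG_SER_P. The authoritative
 * logic is the severities[] table in mce-severity.c. */
#include <stdint.h>

#define MCI_STATUS_UC (1ULL << 61)
#define MCI_STATUS_S  (1ULL << 56)
#define MCI_STATUS_AR (1ULL << 55)

enum mca_class { CORRECTED, UCNA, ACTION_OPTIONAL, ACTION_REQUIRED };

static enum mca_class classify_ser(uint64_t status)
{
        if (!(status & MCI_STATUS_UC))
                return CORRECTED;        /* poll/CMCI path */
        if (!(status & MCI_STATUS_S))
                return UCNA;             /* logged, no exception signalled */
        if (status & MCI_STATUS_AR)
                return ACTION_REQUIRED;  /* handler must act (kill/panic) */
        return ACTION_OPTIONAL;          /* can be handled lazily */
}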
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 10 +++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 5 +
arch/x86/kernel/cpu/mcheck/mce-severity.c | 63 ++++++++++++++++++++++--
arch/x86/kernel/cpu/mcheck/mce_64.c | 76 +++++++++++++-----------------
4 files changed, 108 insertions(+), 46 deletions(-)
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:05.000000000 +0200
@@ -15,6 +15,7 @@
#define MCG_EXT_P (1ULL<<9) /* Extended registers available */
#define MCG_NUM_EXT(c) (((c) >> 16) & 0xff)
#define MCG_CMCI_P (1ULL<<10) /* CMCI supported */
+#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */
#define MCG_STATUS_RIPV (1UL<<0) /* restart ip valid */
#define MCG_STATUS_EIPV (1UL<<1) /* ip points to correct instruction */
@@ -27,6 +28,15 @@
#define MCI_STATUS_MISCV (1UL<<59) /* misc error reg. valid */
#define MCI_STATUS_ADDRV (1UL<<58) /* addr reg. valid */
#define MCI_STATUS_PCC (1UL<<57) /* processor context corrupt */
+#define MCI_STATUS_S (1ULL<<56) /* Signaled machine check */
+#define MCI_STATUS_AR (1ULL<<55) /* Action required */
+
+/* MISC register defines */
+#define MCM_ADDR_SEGOFF 0 /* segment offset */
+#define MCM_ADDR_LINEAR 1 /* linear address */
+#define MCM_ADDR_PHYS 2 /* physical address */
+#define MCM_ADDR_MEM 3 /* memory address */
+#define MCM_ADDR_GENERIC 7 /* generic */
/* Fields are zero when not available */
struct mce {
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:05.000000000 +0200
@@ -65,6 +65,7 @@
static int mce_bootlog = -1;
static int monarch_timeout = -1;
static int mce_panic_timeout;
+int mce_ser;
static char trigger[128];
static char *trigger_argv[2] = { trigger, NULL };
@@ -367,13 +368,13 @@
continue;
/*
- * Uncorrected events are handled by the exception handler
- * when it is enabled. But when the exception is disabled log
- * everything.
+ * Uncorrected or signalled events are handled by the exception
+ * handler when it is enabled, so don't process those here.
*
* TBD do the same check for MCI_STATUS_EN here?
*/
- if ((m.status & MCI_STATUS_UC) && !(flags & MCP_UC))
+ if (!(flags & MCP_UC) &&
+ (m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
continue;
if (m.status & MCI_STATUS_MISCV)
@@ -736,6 +737,12 @@
barrier();
/*
+ * When there is no restart IP we must always kill or panic.
+ */
+ if (!(m.mcgstatus & MCG_STATUS_RIPV))
+ kill_it = 1;
+
+ /*
* Go through all the banks in exclusion of the other CPUs.
* This way we don't report duplicated events on shared banks
* because the first one to see it will clear it.
@@ -755,10 +762,11 @@
continue;
/*
- * Non uncorrected errors are handled by machine_check_poll
- * Leave them alone, unless this panics.
+ * Non uncorrected or non signaled errors are handled by
+ * machine_check_poll. Leave them alone, unless this panics.
*/
- if ((m.status & MCI_STATUS_UC) == 0 && !no_way_out)
+ if (!(m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)) &&
+ !no_way_out)
continue;
/*
@@ -766,17 +774,16 @@
*/
add_taint(TAINT_MACHINE_CHECK);
- __set_bit(i, toclear);
+ severity = mce_severity(&m, tolerant, NULL);
- if (m.status & MCI_STATUS_EN) {
- /*
- * If this error was uncorrectable and there was
- * an overflow, we're in trouble. If no overflow,
- * we might get away with just killing a task.
- */
- if (m.status & MCI_STATUS_UC)
- kill_it = 1;
- } else {
+ /*
+ * When the machine check was for the corrected handler don't touch,
+ * unless we're panicking.
+ */
+ if (severity == MCE_KEEP_SEVERITY && !no_way_out)
+ continue;
+ __set_bit(i, toclear);
+ if (severity == MCE_NO_SEVERITY) {
/*
* Machine check event was not enabled. Clear, but
* ignore.
@@ -784,6 +791,12 @@
continue;
}
+ /*
+ * Kill on action required.
+ */
+ if (severity == MCE_AR_SEVERITY)
+ kill_it = 1;
+
if (m.status & MCI_STATUS_MISCV)
m.misc = mce_rdmsrl(MSR_IA32_MC0_MISC + i*4);
if (m.status & MCI_STATUS_ADDRV)
@@ -792,7 +805,6 @@
mce_get_rip(&m, regs);
mce_log(&m);
- severity = mce_severity(&m, tolerant, NULL);
if (severity > worst) {
*final = m;
worst = severity;
@@ -825,29 +837,8 @@
* one task, do that. If the user has set the tolerance very
* high, don't try to do anything at all.
*/
- if (kill_it && tolerant < 3) {
- int user_space = 0;
-
- /*
- * If the EIPV bit is set, it means the saved IP is the
- * instruction which caused the MCE.
- */
- if (m.mcgstatus & MCG_STATUS_EIPV)
- user_space = final->ip && (final->cs & 3);
-
- /*
- * If we know that the error was in user space, send a
- * SIGBUS. Otherwise, panic if tolerance is low.
- *
- * force_sig() takes an awful lot of locks and has a slight
- * risk of deadlocking.
- */
- if (user_space && tolerant > 0) {
- force_sig(SIGBUS, current);
- } else if (panic_on_oops || tolerant < 2) {
- mce_panic("Uncorrected machine check", final, msg);
- }
- }
+ if (kill_it && tolerant < 3)
+ force_sig(SIGBUS, current);
/* notify userspace ASAP */
set_thread_flag(TIF_MCE_NOTIFY);
@@ -992,6 +983,9 @@
if ((cap & MCG_EXT_P) && (MCG_NUM_EXT(cap) >= 9))
rip_msr = MSR_IA32_MCG_EIP;
+ if (cap & MCG_SER_P)
+ mce_ser = 1;
+
return 0;
}
Index: linux/arch/x86/kernel/cpu/mcheck/mce-severity.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-severity.c 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce-severity.c 2009-04-07 16:43:04.000000000 +0200
@@ -21,28 +21,78 @@
* match wins.
*/
+enum context { IN_KERNEL, IN_USER };
+
static struct severity {
u64 mask;
u64 result;
unsigned char sev;
unsigned char mcgmask;
unsigned char mcgres;
+ unsigned char ser;
+ unsigned char context;
char *msg;
} severities[] = {
+#define KERNEL .context = IN_KERNEL
+#define USER .context = IN_USER
+#define SER .ser = 1
+#define NOSER .ser = -1
#define SEV(s) .sev = MCE_ ## s ## _SEVERITY
#define BITCLR(x, s, m, r...) { .mask = x, .result = 0, SEV(s), .msg = m, ## r }
#define BITSET(x, s, m, r...) { .mask = x, .result = x, SEV(s), .msg = m, ## r }
#define MCGMASK(x, res, s, m, r...) \
{ .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
+#define MASK(x, y, s, m, r...) \
+ { .mask = x, .result = y, SEV(s), .msg = m, ## r }
+#define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
+#define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
+#define MCACOD 0xffff
+
BITCLR(MCI_STATUS_VAL, NO, "Invalid"),
- BITCLR(MCI_STATUS_EN, NO, "Not enabled"),
+ BITCLR(MCI_STATUS_EN, NO, "Not enabled", NOSER),
BITSET(MCI_STATUS_PCC, PANIC, "Processor context corrupt"),
- MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "No restart IP"),
+ /* When MCIP is not set something is very confused */
+ MCGMASK(MCG_STATUS_MCIP, 0, PANIC, "MCIP not set in MCA handler"),
+ MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
+ KERNEL),
+ /* Neither return nor error IP -- no chance to recover -> PANIC */
+ MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0, PANIC,
+ "Neither restart nor error IP"),
+ BITCLR(MCI_STATUS_UC, KEEP, "Corrected error"),
+ MASK(MCI_STATUS_OVER|MCI_STATUS_UC|MCI_STATUS_EN, MCI_STATUS_UC, SOME,
+ "Spurious not enabled", SER),
+ BITCLR(MCI_STATUS_EN, NO, "Not enabled ser", SER),
+ /* AR add known MCACODs here */
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
+ "Action required; unknown MCACOD", SER),
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
+ "Action required with lost events", SER),
+ /* AO add known MCACODs here */
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
+ "Action optional unknown MCACOD", SER),
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
+ "Action optional with lost events", SER),
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_UC, KEEP,
+ "Uncorrected no action required", SER),
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_UC|MCI_STATUS_AR, PANIC,
+ "Illegal combination (UCNA with S=1)", SER),
BITSET(MCI_STATUS_UC|MCI_STATUS_OVER, PANIC, "Overflowed uncorrected"),
BITSET(MCI_STATUS_UC, UC, "Uncorrected"),
BITSET(0, SOME, "No match") /* always matches. keep at end */
};
+/*
+ * If the EIPV bit is set, it means the saved IP is the
+ * instruction which caused the MCE.
+ */
+static int error_context(struct mce *m)
+{
+ if (m->mcgstatus & MCG_STATUS_EIPV)
+ return (m->ip && (m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
+ /* Unknown, assume kernel */
+ return IN_KERNEL;
+}
+
int mce_severity(struct mce *a, int tolerant, char **msg)
{
struct severity *s;
@@ -51,11 +101,14 @@
continue;
if ((a->mcgstatus & s->mcgmask) != s->mcgres)
continue;
- if (s->sev > MCE_NO_SEVERITY && (a->status & MCI_STATUS_UC) &&
- tolerant < 1)
- return MCE_PANIC_SEVERITY;
+ if ((s->ser == 1 && !mce_ser) || (s->ser == -1 && mce_ser))
+ continue;
+ if (s->context && error_context(a) != s->context)
+ continue;
if (msg)
*msg = s->msg;
+ if (s->context == IN_KERNEL && panic_on_oops)
+ return MCE_PANIC_SEVERITY;
return s->sev;
}
}
Index: linux/arch/x86/kernel/cpu/mcheck/mce-internal.h
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-internal.h 2009-04-07 16:09:59.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce-internal.h 2009-04-07 16:39:36.000000000 +0200
@@ -2,9 +2,14 @@
enum severity_level {
MCE_NO_SEVERITY,
+ MCE_KEEP_SEVERITY,
MCE_SOME_SEVERITY,
+ MCE_AO_SEVERITY,
MCE_UC_SEVERITY,
+ MCE_AR_SEVERITY,
MCE_PANIC_SEVERITY,
};
int mce_severity(struct mce *a, int tolerant, char **msg);
+
+extern int mce_ser;
Andi Kleen wrote:
> @@ -96,6 +97,12 @@
> seq_printf(p, " Threshold APIC interrupts\n");
> # endif
> #endif
> +#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
> + seq_printf(p, "%*s: ", prec, "MCE");
> + for_each_online_cpu(j)
> + seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
> + seq_printf(p, " Machine check exceptions\n");
> +#endif
> seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
> #if defined(CONFIG_X86_IO_APIC)
> seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
How about unifying ifdefs?
The patched code will be:
#ifdef CONFIG_X86_MCE
seq_printf(p, "%*s: ", prec, "TRM");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_thermal_count);
seq_printf(p, " Thermal event interrupts\n");
# ifdef CONFIG_X86_64
seq_printf(p, "%*s: ", prec, "THR");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_threshold_count);
seq_printf(p, " Threshold APIC interrupts\n");
# endif
#endif
#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
seq_printf(p, "%*s: ", prec, "MCE");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
seq_printf(p, " Machine check exceptions\n");
#endif
> @@ -163,6 +170,9 @@
> {
> u64 sum = irq_stats(cpu)->__nmi_count;
>
> +#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
> + sum += per_cpu(mce_exception_count, cpu);
> +#endif
> #ifdef CONFIG_X86_LOCAL_APIC
> sum += irq_stats(cpu)->apic_timer_irqs;
> sum += irq_stats(cpu)->irq_spurious_count;
Ditto.
Thanks,
H.Seto
Andi Kleen wrote:
> Document that check_interval set to 0 means no polling.
> Noticed by Hidetoshi Seto
>
> Also add a reference from boot options to the sysfs tunables
>
> Cc: [email protected]
>
> Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Hidetoshi Seto <[email protected]>
Thanks!
H.Seto
Hidetoshi Seto <[email protected]> writes:
> Andi Kleen wrote:
>> @@ -96,6 +97,12 @@
>> seq_printf(p, " Threshold APIC interrupts\n");
>> # endif
>> #endif
>> +#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
>> + seq_printf(p, "%*s: ", prec, "MCE");
>> + for_each_online_cpu(j)
>> + seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
>> + seq_printf(p, " Machine check exceptions\n");
>> +#endif
>> seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
>> #if defined(CONFIG_X86_IO_APIC)
>> seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
>
> How about unifying ifdefs?
I did it this way because these are only temporary until the 32bit<->64bit merge
patch goes in too. So the next patch would be to change it to X86_NEW_MCE, and
then, when the old 32bit machine check code is finally dropped one or two
releases later, to just X86_MCE.
-Andi
--
[email protected] -- Speaking for myself only.
Andi Kleen wrote:
> This will be used by future patches to allow machine check error injection.
> Right now it's a nop, except for adding some wrappers around the MSR reads.
>
> This is early in the sequence to avoid too many conflicts.
>
> Andi Kleen <[email protected]>
>
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/kernel/cpu/mcheck/mce_64.c | 37 ++++++++++++++++++++++++------------
> 1 file changed, 25 insertions(+), 12 deletions(-)
>
> Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
> +++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:12.000000000 +0200
> @@ -171,6 +171,19 @@
> panic(msg);
> }
Since you introduce X86_MCE_INJECT in later patch,
how about this style?
#ifdef CONFIG_X86_MCE_INJECT
#define mce_rdmsrl(msr,v) (v) = __mce_rdmsrl((msr))
#define mce_wrmsrl(msr,v) __mce_wrmsrl((msr),(v))
static u64 __mce_rdmsrl(u32 msr)
{
...
}
static void __mce_wrmsrl(u32 msr, u64 v)
{
...
}
#else
#define mce_rdmsrl(msr,v) rdmsrl((msr),(v))
#define mce_wrmsrl(msr,v) wrmsrl((msr),(v))
#endif /* CONFIG_X86_MCE_INJECT */
Thanks,
H.Seto
>
> +/* MSR access wrappers used for error injection */
> +static u64 mce_rdmsrl(u32 msr)
> +{
> + u64 v;
> + rdmsrl(msr, v);
> + return v;
> +}
> +
> +static void mce_wrmsrl(u32 msr, u64 v)
> +{
> + wrmsrl(msr, v);
> +}
> +
> int mce_available(struct cpuinfo_x86 *c)
> {
> if (mce_dont_init)
> @@ -188,7 +201,7 @@
> m->cs = 0;
> }
> if (rip_msr)
> - rdmsrl(rip_msr, m->ip);
> + m->ip = mce_rdmsrl(rip_msr);
> }
>
> /*
> @@ -250,7 +263,7 @@
>
> mce_setup(&m);
>
> - rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> + m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
> for (i = 0; i < banks; i++) {
> if (!bank[i] || !test_bit(i, *b))
> continue;
> @@ -261,7 +274,7 @@
> m.tsc = 0;
>
> barrier();
> - rdmsrl(MSR_IA32_MC0_STATUS + i*4, m.status);
> + m.status = mce_rdmsrl(MSR_IA32_MC0_STATUS + i*4);
> if (!(m.status & MCI_STATUS_VAL))
> continue;
>
> @@ -276,9 +289,9 @@
> continue;
>
> if (m.status & MCI_STATUS_MISCV)
> - rdmsrl(MSR_IA32_MC0_MISC + i*4, m.misc);
> + m.misc = mce_rdmsrl(MSR_IA32_MC0_MISC + i*4);
> if (m.status & MCI_STATUS_ADDRV)
> - rdmsrl(MSR_IA32_MC0_ADDR + i*4, m.addr);
> + m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
>
> if (!(flags & MCP_TIMESTAMP))
> m.tsc = 0;
> @@ -294,7 +307,7 @@
> /*
> * Clear state for this bank.
> */
> - wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
> + mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
> }
>
> /*
> @@ -342,8 +355,8 @@
> goto out;
>
> mce_setup(&m);
> + m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>
> - rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> /* if the restart IP is not valid, we're done for */
> if (!(m.mcgstatus & MCG_STATUS_RIPV))
> no_way_out = 1;
> @@ -360,7 +373,7 @@
> m.addr = 0;
> m.bank = i;
>
> - rdmsrl(MSR_IA32_MC0_STATUS + i*4, m.status);
> + m.status = mce_rdmsrl(MSR_IA32_MC0_STATUS + i*4);
> if ((m.status & MCI_STATUS_VAL) == 0)
> continue;
>
> @@ -400,9 +413,9 @@
> }
>
> if (m.status & MCI_STATUS_MISCV)
> - rdmsrl(MSR_IA32_MC0_MISC + i*4, m.misc);
> + m.misc = mce_rdmsrl(MSR_IA32_MC0_MISC + i*4);
> if (m.status & MCI_STATUS_ADDRV)
> - rdmsrl(MSR_IA32_MC0_ADDR + i*4, m.addr);
> + m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
>
> mce_get_rip(&m, regs);
> mce_log(&m);
> @@ -467,9 +480,9 @@
> /* the last thing we do is clear state */
> for (i = 0; i < banks; i++) {
> if (test_bit(i, toclear))
> - wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
> + mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
> }
> - wrmsrl(MSR_IA32_MCG_STATUS, 0);
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> out:
> atomic_dec(&mce_entry);
> sync_core();
Andi Kleen wrote:
> +/*
> + * Check if a timeout waiting for other CPUs happened.
> + */
> +static int mce_timed_out(u64 *t)
> +{
> + /*
> + * The others already did panic for some reason.
> + * Bail out like in a timeout.
> + * rmb() to tell the compiler that system_state
> + * might have been modified by someone else.
> + */
> + rmb();
> + if (atomic_read(&mce_paniced))
> + wait_for_panic();
> + if (!monarch_timeout)
> + goto out;
> + if ((s64)*t < SPINUNIT) {
> + /* CHECKME: Make panic default for 1 too? */
> + if (tolerant < 1)
> + mce_panic("Timeout synchronizing machine check over CPUs",
> + NULL, NULL);
Assuming we came here from mce_start() and panic, then I suppose no mce
log would appear on the console, since no CPU has invoked mce_log(&m) yet.
Is it expected behavior?
> + cpu_missing = 1;
> + return 1;
> + }
> + *t -= SPINUNIT;
> +out:
> + touch_nmi_watchdog();
> + return 0;
> +}
(snip)
> +/*
> + * Start of Monarch synchronization. This waits until all CPUs have
> + * entered the ecception handler and then determines if any of them
^^^^^^^^^
exception
> + * saw a fatal event that requires panic. Then it executes them
> + * in the entry order.
> + * TBD double check parallel CPU hotunplug
> + */
> +static int mce_start(int no_way_out, int *order)
> +{
> + int nwo;
> + int cpus = num_online_cpus();
> + static atomic_t global_nwo;
> + u64 timeout = (u64)monarch_timeout * NSEC_PER_USEC;
> +
> + if (!timeout) {
> + *order = -1;
> + return no_way_out;
> + }
> +
> + atomic_add(no_way_out, &global_nwo);
> +
> + /*
> + * Wait for everyone.
> + */
> + while (atomic_read(&mce_callin) != cpus) {
> + if (mce_timed_out(&timeout)) {
> + atomic_set(&global_nwo, 0);
> + *order = -1;
> + return no_way_out;
> + }
> + ndelay(SPINUNIT);
> + }
> +
> + /*
> + * Cache the global no_way_out state.
> + */
> + nwo = atomic_read(&global_nwo);
> +
> + /*
> + * Monarch starts executing now, the others wait.
> + */
> + if (*order == 1) {
> + atomic_set(&global_nwo, 0);
Monarch should clear global_nwo after all Subjects have read it.
Or it should be cleared by the last Subject instead.
> + atomic_set(&mce_executing, 1);
> + return nwo;
> + }
> +
> + /*
> + * Now start the scanning loop one by one
> + * in the original callin order.
> + * This way when there are any shared banks it will
> + * be only seen by one CPU before cleared, avoiding duplicates.
> + */
> + while (atomic_read(&mce_executing) < *order) {
> + if (mce_timed_out(&timeout)) {
> + atomic_set(&global_nwo, 0);
> + *order = -1;
> + return no_way_out;
> + }
> + ndelay(SPINUNIT);
> + }
> + return nwo;
> +}
Thanks,
H.Seto
Andi Kleen wrote:
> Fatal machine checks can be logged to disk after boot, but only if
> the system did a warm reboot. That's unfortunately difficult with the
> default panic behaviour, which waits forever and the admin has to
> press the power button because modern systems usually miss a reset button.
> This clears the machine checks in the registers and make
> it impossible to log them.
>
> This patch changes the default for machine check panic to always
> reboot after 30s. Then the mce can be successfully logged after
> reboot.
In case the user has already set panic_timeout to 5s, then that user needs to
wait an extra 25s for a machine check panic...?
The idea is good, but I think some switch like "no_reboot_on_mce" would
be required.
Thanks,
H.Seto
Andi Kleen wrote:
> Various new Intel x86 machine check features, including initial support
> for machine check recovery and new status bits. The actual machine check
> recovery requires some generic VM changes, which will be posted in a separate
> series ("POISON series") This series can be applied without the POISON
> series and POISON requires these patches.
>
> Especially the broadcast and early nowayout patches are fairly important for
> Nehalem CPUs, without them machine checks are often incorrectly reported
> due to the shared bank topology of the system.
>
> -Andi
Andi, could you divide this series into "cleanups" and "features"?
Or may I help you?
Thanks,
H.Seto
Andi Kleen wrote:
> static struct severity {
> u64 mask;
> u64 result;
> unsigned char sev;
> unsigned char mcgmask;
> unsigned char mcgres;
> + unsigned char ser;
> + unsigned char context;
> char *msg;
> } severities[] = {
> +#define KERNEL .context = IN_KERNEL
> +#define USER .context = IN_USER
> +#define SER .ser = 1
> +#define NOSER .ser = -1
ser is unsigned or signed?
> int mce_severity(struct mce *a, int tolerant, char **msg)
> {
> struct severity *s;
> @@ -51,11 +101,14 @@
> continue;
> if ((a->mcgstatus & s->mcgmask) != s->mcgres)
> continue;
> - if (s->sev > MCE_NO_SEVERITY && (a->status & MCI_STATUS_UC) &&
> - tolerant < 1)
> - return MCE_PANIC_SEVERITY;
> + if ((s->ser == 1 && !mce_ser) || (s->ser == -1 && mce_ser))
> + continue;
> + if (s->context && error_context(a) != s->context)
> + continue;
> if (msg)
> *msg = s->msg;
> + if (s->context == IN_KERNEL && panic_on_oops)
> + return MCE_PANIC_SEVERITY;
> return s->sev;
> }
> }
Where did you throw away the statements for "tolerant < 1"?
Thanks,
H.Seto
On Fri, Apr 17, 2009 at 08:23:51PM +0900, Hidetoshi Seto wrote:
> Andi Kleen wrote:
> > This will be used by future patches to allow machine check error injection.
> > Right now it's a nop, except for adding some wrappers around the MSR reads.
> >
> > This is early in the sequence to avoid too many conflicts.
> >
> > Andi Kleen <[email protected]>
> >
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> >
> > ---
> > arch/x86/kernel/cpu/mcheck/mce_64.c | 37 ++++++++++++++++++++++++------------
> > 1 file changed, 25 insertions(+), 12 deletions(-)
> >
> > Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
> > ===================================================================
> > --- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:09:59.000000000 +0200
> > +++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:12.000000000 +0200
> > @@ -171,6 +171,19 @@
> > panic(msg);
> > }
>
> Since you introduce X86_MCE_INJECT in later patch,
> how about this style?
If someone feels strongly about that I can do that, but personally
I would think using standard function return arguments without
macros is more regular and more readable in general. That seems
to be also the general trend in the source base, going away from
magic macros.
Do you feel strongly about it?
-Andi
>
> #ifdef CONFIG_X86_MCE_INJECT
> #define mce_rdmsrl(msr,v) (v) = __mce_rdmsrl((msr))
> #define mce_wrmsrl(msr,v) __mce_wrmsrl((msr),(v))
Thanks for your review.
On Fri, Apr 17, 2009 at 08:24:03PM +0900, Hidetoshi Seto wrote:
> > + goto out;
> > + if ((s64)*t < SPINUNIT) {
> > + /* CHECKME: Make panic default for 1 too? */
> > + if (tolerant < 1)
> > + mce_panic("Timeout synchronizing machine check over CPUs",
> > + NULL, NULL);
>
> Assuming that if we came here from mce_start() and panic, then I suppose no mce
> log would be appeared on the console since no cpu have invoked mce_log(&m) yet.
Well, it would be rather random whether the CPU that detects the timeout
actually has something useful to report.
More likely the useful information is in the banks of some CPU that doesn't
answer.
On the other hand the real log will come out to disk after reboot from
the CPU registers (especially together with the new mce panic=30 default).
> Is it expected behavior?
Kind of expected. We could probably fix it, adding a fallback path here,
but due to the reasons above I have my doubts it would improve things
in practice.
> > + }
> > + *t -= SPINUNIT;
> > +out:
> > + touch_nmi_watchdog();
> > + return 0;
> > +}
> (snip)
> > +/*
> > + * Start of Monarch synchronization. This waits until all CPUs have
> > + * entered the ecception handler and then determines if any of them
> ^^^^^^^^^
> exception
Thanks, fixed. I did actually run the spell checker on the comments,
but that one must have slipped through somehow.
>
> > + while (atomic_read(&mce_callin) != cpus) {
> > + if (mce_timed_out(&timeout)) {
> > + atomic_set(&global_nwo, 0);
> > + *order = -1;
> > + return no_way_out;
> > + }
> > + ndelay(SPINUNIT);
> > + }
> > +
> > + /*
> > + * Cache the global no_way_out state.
> > + */
> > + nwo = atomic_read(&global_nwo);
> > +
> > + /*
> > + * Monarch starts executing now, the others wait.
> > + */
> > + if (*order == 1) {
> > + atomic_set(&global_nwo, 0);
>
> Monarch should clear global_nwo after all Subjects have read it.
The subjects don't care about the global nwo state, it only matters to the
Monarch who does the panic in mce_end(). The only exception would be timeout,
but in this case all the decisions are local only anyways.
We ensure that all the subjects have written it first.
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, Apr 17, 2009 at 08:24:16PM +0900, Hidetoshi Seto wrote:
> Andi Kleen wrote:
> > Fatal machine checks can be logged to disk after boot, but only if
> > the system did a warm reboot. That's unfortunately difficult with the
> > default panic behaviour, which waits forever and the admin has to
> > press the power button because modern systems usually miss a reset button.
> > This clears the machine checks in the registers and make
> > it impossible to log them.
> >
> > This patch changes the default for machine check panic to always
> > reboot after 30s. Then the mce can be successfully logged after
> > reboot.
>
> In case if user already set panic_timeout to 5s, the that user need to
> wait extra 25s for machine check panic... ?
That's true. I'll fix that.
>
> The idea is good, but I think some switch like "no_reboot_on_mce" would
> be required.
mce=nobootlog should do that.
If someone feels strongly I can add a separate switch.
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, Apr 17, 2009 at 08:24:23PM +0900, Hidetoshi Seto wrote:
Note. I have some fixes on my own for this one already. I wrote
some new validation tools for the grader which detected some problems.
> Andi Kleen wrote:
> > static struct severity {
> > u64 mask;
> > u64 result;
> > unsigned char sev;
> > unsigned char mcgmask;
> > unsigned char mcgres;
> > + unsigned char ser;
> > + unsigned char context;
> > char *msg;
> > } severities[] = {
> > +#define KERNEL .context = IN_KERNEL
> > +#define USER .context = IN_USER
> > +#define SER .ser = 1
> > +#define NOSER .ser = -1
>
> ser is unsigned or signed?
We only really use it as an abstract flag that is only compared for
equality, so it doesn't matter. I can change it to 2, or better, define
another enum.
>
> > int mce_severity(struct mce *a, int tolerant, char **msg)
> > {
> > struct severity *s;
> > @@ -51,11 +101,14 @@
> > continue;
> > if ((a->mcgstatus & s->mcgmask) != s->mcgres)
> > continue;
> > - if (s->sev > MCE_NO_SEVERITY && (a->status & MCI_STATUS_UC) &&
> > - tolerant < 1)
> > - return MCE_PANIC_SEVERITY;
> > + if ((s->ser == 1 && !mce_ser) || (s->ser == -1 && mce_ser))
> > + continue;
> > + if (s->context && error_context(a) != s->context)
> > + continue;
> > if (msg)
> > *msg = s->msg;
> > + if (s->context == IN_KERNEL && panic_on_oops)
> > + return MCE_PANIC_SEVERITY;
> > return s->sev;
> > }
> > }
>
> Where did you throw away the statements for "tolerant < 1"?
You mean why?
It didn't really fit into the new status bits and didn't improve
behaviour with recovery. I had originally
planned to fit it in, but after trying hard I gave up on that.
It only has its old meaning now, which is whether to risk
do_exit in kernel context (slight risk of deadlock) or not.
This has the advantage that it doesn't change behaviour
(although at least without MCA recovery it didn't really matter,
because you tended to always panic anyway).
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, Apr 17, 2009 at 08:24:34PM +0900, Hidetoshi Seto wrote:
> Andi Kleen wrote:
> > Various new Intel x86 machine check features, including initial support
> > for machine check recovery and new status bits. The actual machine check
> > recovery requires some generic VM changes, which will be posted in a separate
> > series ("POISON series") This series can be applied without the POISON
> > series and POISON requires these patches.
> >
> > Especially the broadcast and early nowayout patches are fairly important for
> > Nehalem CPUs, without them machine checks are often incorrectly reported
> > due to the shared bank topology of the system.
> >
> > -Andi
>
> Andi, could you divide this series into "cleanups" and "features"?
There is very little classic cleanup (as in making code prettier
to read and doing nothing else) in this series; it's pretty much all features or
feature-like fixes.
The ones that could be seen as cleanups are:
x86: MCE: Remove unused mce_events variable
x86: MCE: Use symbolic macros to access MCG_CAP register
x86: MCE: Remove mce_init unused argument
x86: MCE: Rename and align out2 label
That one is technically a cleanup, but I would argue it's a fix because
it shrinks code size:
x86: MCE: Use extended sysattrs for the check_interval attribute.
That one could be seen as a cleanup too, although I personally consider removing
the BKL a feature too:
x86: MCE: Drop BKL in mce_open
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, Apr 17, 2009 at 03:09:44PM +0200, Andi Kleen wrote:
> > > +
> > > + /*
> > > + * Monarch starts executing now, the others wait.
> > > + */
> > > + if (*order == 1) {
> > > + atomic_set(&global_nwo, 0);
> >
> > Monarch should clear global_nwo after all Subjects have read it.
>
> The subjects don't care about the global nwo state, it only matters to the
> Monarch who does the panic in mce_end(). The only exception would be timeout,
> but in this case all the decisions are local only anyways.
> We ensure that all the subjects have written it first.
I thought some more about this and I think I was a little too quick to dismiss
it, given the bank sharing heuristics. It's indeed safer to wait
for the subjects here too before clearing. I'll fix that up, thanks.
-Andi
--
[email protected] -- Speaking for myself only.
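For illustration only, the ordering concern discussed above can be modeled
in user space. The sketch below is an assumption about the shape of a fix
(clear global_nwo only once every CPU has read it, tracked here with an
extra nwo_readers counter that does not exist in the patch), not the actual
follow-up change; NCPUS and the thread model are likewise just a toy.

/* rendezvous_model.c -- user-space model of the global_nwo ordering.
 * One thread per "CPU" mirrors the mce_callin/mce_executing rendezvous.
 * Build: cc -O2 -pthread rendezvous_model.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCPUS 4

static atomic_int callin;      /* models mce_callin */
static atomic_int executing;   /* models mce_executing */
static atomic_int global_nwo;  /* models global_nwo */
static atomic_int nwo_readers; /* extra: CPUs that have read global_nwo */

static void *cpu_thread(void *arg)
{
        (void)arg;
        int order = atomic_fetch_add(&callin, 1) + 1; /* 1-based order */
        int local_nwo = (order == 3);  /* pretend this CPU saw a fatal event */

        atomic_fetch_add(&global_nwo, local_nwo);

        /* Wait for everyone to call in. */
        while (atomic_load(&callin) != NCPUS)
                ;

        /* Every CPU caches the global no-way-out state... */
        int nwo = atomic_load(&global_nwo);
        atomic_fetch_add(&nwo_readers, 1);

        if (order == 1) {
                /* Monarch: clear only after all CPUs have read it. */
                while (atomic_load(&nwo_readers) != NCPUS)
                        ;
                atomic_store(&global_nwo, 0);
                atomic_store(&executing, 1);
        } else {
                /* Subjects: run one by one in callin order. */
                while (atomic_load(&executing) < order)
                        ;
        }

        printf("cpu (order %d) scanning banks, saw nwo=%d\n", order, nwo);

        /* Model mce_end(): let the next CPU in callin order proceed. */
        atomic_fetch_add(&executing, 1);
        return NULL;
}

int main(void)
{
        pthread_t t[NCPUS];
        int i;

        for (i = 0; i < NCPUS; i++)
                pthread_create(&t[i], NULL, cpu_thread, NULL);
        for (i = 0; i < NCPUS; i++)
                pthread_join(t[i], NULL);
        return 0;
}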
Andi Kleen wrote:
>
> If someone feels strongly about that I can do that, but personally
> I would think using standard function return arguments without
> macros is more regular and more readable in general. That seems
> to be also the general trend in the source base, going away from
> magic macros.
>
> Do you feel strongly about it?
>
I would agree with Andi on this; macros or not macros, we should use
"function-like" style, analogous to rdmsrl() and wrmsrl() in this case.
Inline functions are also preferred over macros, obviously.
-hpa
Andi Kleen wrote:
> I put the reboot vector into the highest priority bucket
> of the APIC vectors and moved the 64bit UV_BAU message
> down instead into the next lower priority.
I had forgotten to point this...
> @@ -88,12 +88,14 @@
> #define THERMAL_APIC_VECTOR 0xfa
>
> #ifdef CONFIG_X86_32
> -/* 0xf8 - 0xf9 : free */
> +/* 0xf9 : free */
> #else
> # define THRESHOLD_APIC_VECTOR 0xf9
> -# define UV_BAU_MESSAGE 0xf8
> #endif
>
> +#define REBOOT_VECTOR 0xf8
> +
> +
> /* f0-f7 used for spreading out TLB flushes: */
> #define INVALIDATE_TLB_VECTOR_END 0xf7
> #define INVALIDATE_TLB_VECTOR_START 0xf0
> @@ -116,6 +118,8 @@
> */
> #define GENERIC_INTERRUPT_VECTOR 0xed
>
> +#define UV_BAU_MESSAGE 0xec
> +
> /*
> * First APIC vector available to drivers: (vectors 0x30-0xee) we
> * start at 0x31(0x41) to spread out vectors evenly between priority
Does this change (=pulling down the priority of UV_BAU_VECTOR) not impact
users of the UV_BAU_MESSAGE?
I can see why REBOOT_VECTOR need to be highest priority.
Maybe you could pull down THERMAL_APIC_VECTOR/THRESHOLD_APIC_VECTOR
instead. Why you choose UV_BAU_MESSAGE?
Thanks,
H.Seto
On Mon, Apr 20, 2009 at 09:26:56AM +0900, Hidetoshi Seto wrote:
> Andi Kleen wrote:
> > I put the reboot vector into the highest priority bucket
> > of the APIC vectors and moved the 64bit UV_BAU message
> > down instead into the next lower priority.
>
> I had forgotten to point this...
>
> > @@ -88,12 +88,14 @@
> > #define THERMAL_APIC_VECTOR 0xfa
> >
> > #ifdef CONFIG_X86_32
> > -/* 0xf8 - 0xf9 : free */
> > +/* 0xf9 : free */
> > #else
> > # define THRESHOLD_APIC_VECTOR 0xf9
> > -# define UV_BAU_MESSAGE 0xf8
> > #endif
> >
> > +#define REBOOT_VECTOR 0xf8
> > +
> > +
> > /* f0-f7 used for spreading out TLB flushes: */
> > #define INVALIDATE_TLB_VECTOR_END 0xf7
> > #define INVALIDATE_TLB_VECTOR_START 0xf0
> > @@ -116,6 +118,8 @@
> > */
> > #define GENERIC_INTERRUPT_VECTOR 0xed
> >
> > +#define UV_BAU_MESSAGE 0xec
> > +
> > /*
> > * First APIC vector available to drivers: (vectors 0x30-0xee) we
> > * start at 0x31(0x41) to spread out vectors evenly between priority
>
> Does this change (=pulling down the priority of UV_BAU_VECTOR) not impact
> users of the UV_BAU_MESSAGE?
I don't think UV_BAU_MESSAGE is so critical that the world
explodes if it gets processed a little later. AFAIK it's just
used for user space TLB flushes, so putting it in a lower bucket
is fine.
BTW these priorities don't make too much difference anyway; they only
help slightly under livelock situations when a lot of messages
are pending in the APIC. They're not a full SPL-like scheme.
> I can see why REBOOT_VECTOR need to be highest priority.
In theory an NMI would be even better, but that has the usual
problems with broken systems that don't handle NMIs properly.
> Maybe you could pull down THERMAL_APIC_VECTOR/THRESHOLD_APIC_VECTOR
> instead. Why you choose UV_BAU_MESSAGE?
Thermal should probably stay in the high bucket too, but yes, THRESHOLD_APIC_VECTOR
could have been moved down instead.
-Andi
P.S.: This patch is IMHO a 2.6.30 candidate too; without it all panics
with interrupts off (including all machine checks) print ugly backtraces.
This has been a regression for two releases or so.
--
[email protected] -- Speaking for myself only.