2011-05-23 21:54:30

by Luck, Tony

Subject: [RFC 0/9] mce recovery for Sandy Bridge server

Here's a nine-part patch series to implement "AR=1" recovery
that will be available on high-end Sandy Bridge server processors.
In this case the processor detects an uncorrectable memory error
during an instruction or data fetch that is about to be
consumed. This is in contrast to the recoverable errors on
Nehalem and Westmere, which were out of immediate execution context
(patrol scrubber and cache line write-back).

The code is based on work done by Andi last year and published in
the "mce/action-required" branch of his mce git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git
Thus he gets author credit on 6 out of 9 patches (but I'll take
the blame for all of them).

The first eight patches are mostly cleanups and minor new bits
that are needed by part 9 where the interesting stuff happens.

For the "in context" case, we must not return from the machine
check handler (in the data fetch case we'd re-execute the fetch
and take another machine check; in the instruction fetch case
we don't even have a precise IP to return to). We use the
TIF_MCE_NOTIFY task flag bit to ensure that we don't return to
the user context - but we also need to keep track of the memory
address where the fault occurred. The hardware only gives us the
physical address, which we must keep track of ... to do so we have
added "mce_error_pfn" to the task structure. This feels odd, but it
is an attribute of the task (e.g. this task may be migrated to
another processor before we get to look at TIF_MCE_NOTIFY and
head into do_notify_resume() to process it).

Andi's recovery code can also handle a few cases where the
error is detected while running kernel code (when copying
data to/from a user process) - but the TIF_MCE_NOTIFY method
doesn't actually ever get to this code (since the entry_64.S code
only checks TIF_MCE_NOTIFY on return to userspace). I'd
appreciate any ideas on how to handle this. Perhaps we could
do good things when CONFIG_PREEMPT=y (it seems probable that
any error in a non-preemptible section of kernel code is going
to be fatal).

-Tony

arch/x86/include/asm/mce.h | 3 +-
arch/x86/kernel/cpu/mcheck/mce-severity.c | 37 +++-
arch/x86/kernel/cpu/mcheck/mce.c | 286 ++++++++++++++++++++++++-----
arch/x86/kernel/signal.c | 2 +-
include/linux/init_task.h | 7 +
include/linux/sched.h | 3 +
mm/memory-failure.c | 28 ++--
7 files changed, 300 insertions(+), 66 deletions(-)

Andi Kleen (6):
MCE: Always retrieve mce rip before calling no_way_out
MCE: Move ADDR/MISC reading code into common function
MCE: Mask out address bits below address granularity
HWPOISON: Handle hwpoison in current process
MCE: Pass registers to work handlers
MCE: Add Action-Required support

Tony Luck (3):
mce: fixes for mce severity table
mce: save most severe error information
mce: run through processors with more severe problems first


2011-05-23 22:02:10

by Luck, Tony

Subject: [RFC 1/9] mce: fixes for mce severity table

From: Tony Luck <[email protected]>

The "Spurious not enabled" entry is redundant. The "Not enabled"
entry earlier in the table will cover this case.

The "Action required; unknown MCACOD" entry shouldn't specify
MCACOD in the .mask field. The current code only matches when
mcacod==0 rather than matching all AR=1 errors.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce-severity.c | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 1e8d66c..352d16a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -69,8 +69,6 @@ static struct severity {
MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
KERNEL),
BITCLR(MCI_STATUS_UC, KEEP, "Corrected error", NOSER),
- MASK(MCI_STATUS_OVER|MCI_STATUS_UC|MCI_STATUS_EN, MCI_STATUS_UC, SOME,
- "Spurious not enabled", SER),

/* ignore OVER for UCNA */
MASK(MCI_UC_SAR, MCI_STATUS_UC, KEEP,
@@ -82,7 +80,7 @@ static struct severity {
/* AR add known MCACODs here */
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
"Action required with lost events", SER),
- MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCACOD, MCI_UC_SAR, PANIC,
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
"Action required; unknown MCACOD", SER),

/* known AO MCACODs: */
--
1.7.3.1

2011-05-23 22:12:59

by Luck, Tony

Subject: [RFC 2/9] mce: save most severe error information

From: Tony Luck <[email protected]>

The monarch CPU clears all of the per-cpu "mces_seen", so we must keep
a copy to use after mce_end().

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3385ea2..ed1542a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1046,6 +1046,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
}
}

+ /* Save our worst error locally, monarch will clear mces_seen */
+ m = *final;
+
if (!no_way_out)
mce_clear_state(toclear);

@@ -1064,7 +1067,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* support MCE broadcasting or it has been disabled.
*/
if (no_way_out && tolerant < 3)
- mce_panic("Fatal machine check on current CPU", final, msg);
+ mce_panic("Fatal machine check on current CPU", &m, msg);

/*
* If the error seems to be unrecoverable, something should be
--
1.7.3.1

2011-05-23 22:13:20

by Luck, Tony

Subject: [RFC 3/9] MCE: Always retrieve mce rip before calling no_way_out

From: Andi Kleen <[email protected]>

Some of the rules in the decision engine need a valid rip,
so set it up before calling the engine.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ed1542a..ee3b14e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -957,6 +957,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
final = &__get_cpu_var(mces_seen);
*final = m;

+ mce_get_rip(&m, regs);
no_way_out = mce_no_way_out(&m, &msg);

barrier();
@@ -1037,7 +1038,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
mce_ring_add(m.addr >> PAGE_SHIFT);

- mce_get_rip(&m, regs);
mce_log(&m);

if (severity > worst) {
--
1.7.3.1

2011-05-23 22:13:40

by Luck, Tony

Subject: [RFC 4/9] MCE: Move ADDR/MISC reading code into common function

From: Andi Kleen <[email protected]>

Used for next patch.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ee3b14e..f4924dc 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -525,6 +525,17 @@ static void mce_report_event(struct pt_regs *regs)
#endif
}

+/*
+ * Read ADDR and MISC registers.
+ */
+static void mce_read_aux(struct mce *m, int i)
+{
+ if (m->status & MCI_STATUS_MISCV)
+ m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
+ if (m->status & MCI_STATUS_ADDRV)
+ m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+}
+
DEFINE_PER_CPU(unsigned, mce_poll_count);

/*
@@ -576,10 +587,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
(m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
continue;

- if (m.status & MCI_STATUS_MISCV)
- m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m.status & MCI_STATUS_ADDRV)
- m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+ mce_read_aux(&m, i);

if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;
@@ -1023,10 +1031,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (severity == MCE_AR_SEVERITY)
kill_it = 1;

- if (m.status & MCI_STATUS_MISCV)
- m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m.status & MCI_STATUS_ADDRV)
- m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+ mce_read_aux(&m, i);

/*
* Action optional error. Queue address for later processing.
--
1.7.3.1

2011-05-23 22:13:59

by Luck, Tony

Subject: [RFC 5/9] MCE: Mask out address bits below address granularity

From: Andi Kleen <[email protected]>

SER-enabled systems report the address granularity for each
reported address in a machine check. But the bits below
the granularity are undefined. Mask them out before
logging the machine check.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 12 +++++++++++-
1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index f4924dc..a9a0e25 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -532,8 +532,18 @@ static void mce_read_aux(struct mce *m, int i)
{
if (m->status & MCI_STATUS_MISCV)
m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m->status & MCI_STATUS_ADDRV)
+ if (m->status & MCI_STATUS_ADDRV) {
m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+
+ /*
+ * Mask the reported address by the reported granularity.
+ */
+ if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
+ u8 shift = m->misc & 0x1f;
+ m->addr >>= shift;
+ m->addr <<= shift;
+ }
+ }
}

DEFINE_PER_CPU(unsigned, mce_poll_count);
--
1.7.3.1

2011-05-23 22:14:23

by Luck, Tony

Subject: [RFC 6/9] HWPOISON: Handle hwpoison in current process

From: Andi Kleen <[email protected]>

When hwpoison handles an error hitting the current process, use
a forced signal with _AR severity.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
mm/memory-failure.c | 28 ++++++++++++++++------------
1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2b9a5ee..a203113 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -184,8 +184,7 @@ int hwpoison_filter(struct page *p)
EXPORT_SYMBOL_GPL(hwpoison_filter);

/*
- * Send all the processes who have the page mapped an ``action optional''
- * signal.
+ * Send all the processes who have the page mapped a SIGBUS.
*/
static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
unsigned long pfn, struct page *page)
@@ -194,23 +193,28 @@ static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
int ret;

printk(KERN_ERR
- "MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
- pfn, t->comm, t->pid);
+ "MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+ pfn, t->comm, t->pid);
si.si_signo = SIGBUS;
si.si_errno = 0;
- si.si_code = BUS_MCEERR_AO;
si.si_addr = (void *)addr;
#ifdef __ARCH_SI_TRAPNO
si.si_trapno = trapno;
#endif
si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;
- /*
- * Don't use force here, it's convenient if the signal
- * can be temporarily blocked.
- * This could cause a loop when the user sets SIGBUS
- * to SIG_IGN, but hopefully no one will do that?
- */
- ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
+ if (t == current) {
+ si.si_code = BUS_MCEERR_AR;
+ ret = force_sig_info(SIGBUS, &si, t);
+ } else {
+ /*
+ * Don't use force here, it's convenient if the signal
+ * can be temporarily blocked.
+ * This could cause a loop when the user sets SIGBUS
+ * to SIG_IGN, but hopefully no one will do that?
+ */
+ si.si_code = BUS_MCEERR_AO;
+ ret = send_sig_info(SIGBUS, &si, t);
+ }
if (ret < 0)
printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
t->comm, t->pid, ret);
--
1.7.3.1

2011-05-23 22:14:40

by Luck, Tony

Subject: [RFC 7/9] MCE: Pass registers to work handlers

From: Andi Kleen <[email protected]>

Pass the register context to the task work handler.
Needed for the next patch.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/mce.h | 3 ++-
arch/x86/kernel/cpu/mcheck/mce.c | 4 ++--
arch/x86/kernel/signal.c | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..66c41e0 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -195,7 +195,8 @@ enum mcp_flags {
void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);

int mce_notify_irq(void);
-void mce_notify_process(void);
+struct pt_regs;
+void mce_notify_process(struct pt_regs *regs);

DECLARE_PER_CPU(struct mce, injectm);
extern struct file_operations mce_chrdev_ops;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a9a0e25..6b2fa20 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1123,7 +1123,7 @@ void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
* This is merely a fast path to expedite processing in some common
* cases.
*/
-void mce_notify_process(void)
+void mce_notify_process(struct pt_regs *regs)
{
unsigned long pfn;
mce_notify_irq();
@@ -1133,7 +1133,7 @@ void mce_notify_process(void)

static void mce_process_work(struct work_struct *dummy)
{
- mce_notify_process();
+ mce_notify_process(NULL);
}

#ifdef CONFIG_X86_MCE_INTEL
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4fd173c..3e6d421 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -841,7 +841,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
#ifdef CONFIG_X86_MCE
/* notify userspace of pending MCEs */
if (thread_info_flags & _TIF_MCE_NOTIFY)
- mce_notify_process();
+ mce_notify_process(regs);
#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */

/* deal with pending signal delivery */
--
1.7.3.1

2011-05-23 22:15:00

by Luck, Tony

Subject: [RFC 8/9] mce: run through processors with more severe problems first

From: Tony Luck <[email protected]>

Instead of letting cpus run through the MC bank scanning code
in the order that they turned up in the handler, we arrange to
deal with those that have more severe problems (mcgstatus.ripv=0)
first. This will make life simpler in the case that banks are
shared between processors, since the cpu with the problem will
see it and clear it, leaving the other cpu(s) that share the
bank with nothing to do.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 87 +++++++++++++++++++++++++++-----------
1 files changed, 62 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6b2fa20..7da4a75 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -646,7 +646,7 @@ static int mce_no_way_out(struct mce *m, char **msg)
* Variable to establish order between CPUs while scanning.
* Each CPU spins initially until executing is equal its number.
*/
-static atomic_t mce_executing;
+static atomic_t mce_executing = ATOMIC_INIT(-1);

/*
* Defines order of CPUs on entry. First CPU becomes Monarch.
@@ -762,13 +762,40 @@ static void mce_reign(void)
static atomic_t global_nwo;

/*
+ * Keep separate bitmaps for cpus that have the option to return from
+ * the machine check handler (MCG_STATUS.RIPV == 1) and those that
+ * cannot.
+ */
+static cpumask_t can_return;
+static cpumask_t cant_return;
+
+static int monarch;
+
+/*
+ * next cpu choosing first from cant_return, and then from can_return
+ */
+int mce_nextcpu(int this)
+{
+ int next;
+
+ if (this == -1 || cpumask_test_cpu(this, &cant_return)) {
+ next = cpumask_next(this, &cant_return);
+ if (next >= nr_cpu_ids)
+ next = cpumask_next(-1, &can_return);
+ return next;
+ }
+
+ return cpumask_next(this, &can_return);
+}
+
+/*
* Start of Monarch synchronization. This waits until all CPUs have
* entered the exception handler and then determines if any of them
* saw a fatal event that requires panic. Then it executes them
- * in the entry order.
+ * one at a time.
* TBD double check parallel CPU hotunplug
*/
-static int mce_start(int *no_way_out)
+static int mce_start(int *no_way_out, int noreturn)
{
int order;
int cpus = num_online_cpus();
@@ -784,6 +811,11 @@ static int mce_start(int *no_way_out)
smp_wmb();
order = atomic_inc_return(&mce_callin);

+ if (noreturn)
+ cpumask_set_cpu(smp_processor_id(), &cant_return);
+ else
+ cpumask_set_cpu(smp_processor_id(), &can_return);
+
/*
* Wait for everyone.
*/
@@ -802,23 +834,26 @@ static int mce_start(int *no_way_out)

if (order == 1) {
/*
- * Monarch: Starts executing now, the others wait.
+ * Choose a cpu to be monarch. It will run first
*/
- atomic_set(&mce_executing, 1);
- } else {
- /*
- * Subject: Now start the scanning loop one by one in
- * the original callin order.
- * This way when there are any shared banks it will be
- * only seen by one CPU before cleared, avoiding duplicates.
- */
- while (atomic_read(&mce_executing) < order) {
- if (mce_timed_out(&timeout)) {
- atomic_set(&global_nwo, 0);
- return -1;
- }
- ndelay(SPINUNIT);
+ monarch = mce_nextcpu(-1);
+ atomic_set(&mce_executing, monarch);
+ }
+ /*
+ * Subject: Now start the scanning loop one by one in
+ * choosing first the "cant_return" cpus and then the others.
+ * This way when there are any shared banks it will be
+ * only seen by one CPU before cleared, avoiding duplicates.
+ * Letting the "cant_return" folks go first means that for
+ * "action required" errors, the cpu that hit the problem
+ * will find its error in the shared bank first.
+ */
+ while (atomic_read(&mce_executing) != smp_processor_id()) {
+ if (mce_timed_out(&timeout)) {
+ atomic_set(&global_nwo, 0);
+ return -1;
}
+ ndelay(SPINUNIT);
}

/*
@@ -846,9 +881,9 @@ static int mce_end(int order)
/*
* Allow others to run.
*/
- atomic_inc(&mce_executing);
+ atomic_set(&mce_executing, mce_nextcpu(smp_processor_id()));

- if (order == 1) {
+ if (smp_processor_id() == monarch) {
/* CHECKME: Can this race with a parallel hotplug? */
int cpus = num_online_cpus();

@@ -856,7 +891,7 @@ static int mce_end(int order)
* Monarch: Wait for everyone to go through their scanning
* loops.
*/
- while (atomic_read(&mce_executing) <= cpus) {
+ while (atomic_read(&mce_executing) != nr_cpu_ids) {
if (mce_timed_out(&timeout))
goto reset;
ndelay(SPINUNIT);
@@ -869,7 +904,7 @@ static int mce_end(int order)
/*
* Subject: Wait for Monarch to finish.
*/
- while (atomic_read(&mce_executing) != 0) {
+ while (atomic_read(&mce_executing) != -1) {
if (mce_timed_out(&timeout))
goto reset;
ndelay(SPINUNIT);
@@ -887,12 +922,14 @@ static int mce_end(int order)
reset:
atomic_set(&global_nwo, 0);
atomic_set(&mce_callin, 0);
+ cpumask_clear(&can_return);
+ cpumask_clear(&cant_return);
barrier();

/*
* Let others run again.
*/
- atomic_set(&mce_executing, 0);
+ atomic_set(&mce_executing, -1);
return ret;
}

@@ -991,7 +1028,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* This way we don't report duplicated events on shared banks
* because the first one to see it will clear it.
*/
- order = mce_start(&no_way_out);
+ order = mce_start(&no_way_out, kill_it);
for (i = 0; i < banks; i++) {
__clear_bit(i, toclear);
if (!mce_banks[i].ctl)
@@ -2193,7 +2230,7 @@ static void mce_reset(void)
{
cpu_missing = 0;
atomic_set(&mce_fake_paniced, 0);
- atomic_set(&mce_executing, 0);
+ atomic_set(&mce_executing, -1);
atomic_set(&mce_callin, 0);
atomic_set(&global_nwo, 0);
}
--
1.7.3.1

2011-05-23 22:15:22

by Luck, Tony

Subject: [RFC 9/9] MCE: Add Action-Required support

From: Andi Kleen <[email protected]>

Implement core MCA recovery. This is used for errors
that happen in the current execution context.

The kernel has to first pass the error information
to a function running on the current process stack.
This is done using a new work flag and then executing
the code after the exception through do_notify_resume.

Then hwpoison is allowed to sleep and can try to recover it.

To pass the information about the error around we need
to use a field in the current process. The old ways
to handle this (a per-cpu buffer) don't work because
the process could be switched to another CPU before
reaching the handler code.

For kernel recovery we only handle errors that happen in code
covered by the copy_*_user() exception tables, and inject -EFAULT.
When the tolerance level is sufficiently high we also allow an
unsafe oops-like do_exit() kill, which has some deadlock potential.

FIXME: fix 386 handling of mce notify bit in entry_32.S after mce

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce-severity.c | 35 ++++++-
arch/x86/kernel/cpu/mcheck/mce.c | 157 +++++++++++++++++++++++++++--
include/linux/init_task.h | 7 ++
include/linux/sched.h | 3 +
4 files changed, 189 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 352d16a..fe8a28c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -13,6 +13,7 @@
#include <linux/seq_file.h>
#include <linux/init.h>
#include <linux/debugfs.h>
+#include <linux/module.h>
#include <asm/mce.h>

#include "mce-internal.h"
@@ -54,6 +55,9 @@ static struct severity {
{ .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
#define MASK(x, y, s, m, r...) \
{ .mask = x, .result = y, SEV(s), .msg = m, ## r }
+#define ARMASK(x, y, s, m, r...) \
+ { .mcgmask = MCG_STATUS_RIPV, .mcgres = 0, \
+ .mask = x, .result = y, SEV(s), .msg = m, ## r }
#define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
#define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
#define MCACOD 0xffff
@@ -67,7 +71,7 @@ static struct severity {
MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0, PANIC,
"Neither restart nor error IP"),
MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
- KERNEL),
+ KERNEL, NOSER),
BITCLR(MCI_STATUS_UC, KEEP, "Corrected error", NOSER),

/* ignore OVER for UCNA */
@@ -77,10 +81,16 @@ static struct severity {
"Illegal combination (UCNA with AR=1)", SER),
MASK(MCI_STATUS_S, 0, KEEP, "Non signalled machine check", SER),

- /* AR add known MCACODs here */
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
"Action required with lost events", SER),
- MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
+
+ /* known AR MCACODs: */
+ ARMASK(MCI_UC_SAR|MCI_STATUS_OVER|0xffff, MCI_UC_SAR|0x134, AR,
+ "Action required: data load error", SER),
+ ARMASK(MCI_UC_SAR|MCI_STATUS_OVER|0xffff, MCI_UC_SAR|0x150, AR,
+ "Action required: instruction fetch error", SER),
+
+ ARMASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
"Action required; unknown MCACOD", SER),

/* known AO MCACODs: */
@@ -89,6 +99,7 @@ static struct severity {
MASK(MCI_UC_SAR|MCI_STATUS_OVER|MCACOD, MCI_UC_S|0x17a, AO,
"Action optional: last level cache writeback error", SER),

+
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
"Action optional unknown MCACOD", SER),
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
@@ -110,6 +121,17 @@ static int error_context(struct mce *m)
return IN_KERNEL;
}

+static int kernel_ar_recoverable(struct mce *m, int tolerant)
+{
+ if (tolerant >= 2)
+ return MCE_AR_SEVERITY;
+ if (!(m->mcgstatus & MCG_STATUS_EIPV) || !m->ip)
+ return MCE_PANIC_SEVERITY;
+ if (search_exception_tables(m->ip))
+ return MCE_AR_SEVERITY;
+ return MCE_PANIC_SEVERITY;
+}
+
int mce_severity(struct mce *a, int tolerant, char **msg)
{
enum context ctx = error_context(a);
@@ -129,9 +151,12 @@ int mce_severity(struct mce *a, int tolerant, char **msg)
if (msg)
*msg = s->msg;
s->covered = 1;
- if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
- if (panic_on_oops || tolerant < 1)
+ if (ctx == IN_KERNEL) {
+ if (s->sev >= MCE_UC_SEVERITY &&
+ (panic_on_oops || tolerant < 1))
return MCE_PANIC_SEVERITY;
+ if (s->sev == MCE_AR_SEVERITY)
+ return kernel_ar_recoverable(a, tolerant);
}
return s->sev;
}
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7da4a75..9d5e679 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -960,6 +960,131 @@ static void mce_clear_state(unsigned long *toclear)
}
}

+/* Stub when hwpoison is not compiled in */
+int __attribute__((weak)) __memory_failure(unsigned long pfn, int vector,
+ int precount)
+{
+ return -1;
+}
+
+/*
+ * Uncorrected error for current process.
+ */
+static void mce_action_required(struct mce *m, char *msg, struct pt_regs *regs)
+{
+ if (!mce_usable_address(m))
+ mce_panic("No address for Action-Required Machine Check",
+ m, msg);
+ if (!(m->mcgstatus & MCG_STATUS_EIPV))
+ mce_panic("No EIPV for Action-Required Machine Check",
+ m, msg);
+
+ WARN_ON(current->mce_error_pfn != -1L);
+ current->mce_error_pfn = m->addr >> PAGE_SHIFT;
+ set_thread_flag(TIF_MCE_NOTIFY);
+}
+
+#undef pr_fmt
+#define pr_fmt(x) "MCE: %s:%d " x "\n", current->comm, current->pid
+#define PADDR(x) ((u64)(x) << PAGE_SHIFT)
+
+/*
+ * No successful recovery. Make sure at least that there's
+ * a SIGBUS.
+ */
+static void ar_fallback(struct task_struct *me, unsigned long pfn)
+{
+ if (signal_pending(me) && sigismember(&me->pending.signal, SIGBUS))
+ return;
+
+ /*
+ * For some reason hwpoison wasn't able to send a proper
+ * SIGBUS. Send a fallback signal. Unfortunately we don't
+ * know the virtual address here, so can't tell the program
+ * details.
+ */
+ force_sig(SIGBUS, me);
+ pr_err("Killed due to action-required memory corruption");
+}
+
+/*
+ * Handle action-required on the process stack. hwpoison does the
+ * bulk of the work and with some luck might even be able to fix the
+ * problem.
+ *
+ * Logic changes here should be reflected in kernel_ar_recoverable().
+ */
+static void handle_action_required(struct pt_regs *regs)
+{
+ struct task_struct *me = current;
+ unsigned long pfn = me->mce_error_pfn;
+ unsigned long pstack;
+
+ me->mce_error_pfn = -1L;
+
+ /*
+ * User-mode:
+ *
+ * Guarantee of no kernel locks held. Do full VM level
+ * recovery. This will result either in a signal
+ * or transparent recovery.
+ */
+ if (user_mode(regs)) {
+ pr_err("Uncorrected hardware memory error in user-access at %llx",
+ PADDR(pfn));
+ if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
+ pr_err("Memory error not recovered");
+ ar_fallback(me, pfn);
+ } else
+ pr_err("Memory error recovered");
+ return;
+ }
+
+ /*
+ * Kernel-mode:
+ *
+ * Recover from faults with exception tables.
+ *
+ * We can't use VM recovery here, because there's no
+ * guarantee what locks are already held in the code
+ * interrupted and we don't have a virtual address.
+ *
+ * Simply EFAULT this case.
+ */
+ pr_err("Hardware memory error in kernel context at %llx",
+ PADDR(pfn));
+ if (fixup_exception(regs)) {
+ pr_err("Injecting EFAULT for kernel memory error");
+ return;
+ }
+
+ /*
+ * Corruption in kernel code that is not protected by
+ * an exception table.
+ *
+ * When the tolerance level is high enough treat like
+ * an oops. Note this is not fully safe and might deadlock
+ * when the current code path holds any locks taken by do_exit.
+ *
+ * Do various sanity checks to avoid looping etc.
+ */
+ pstack = (unsigned long)task_thread_info(current);
+ if (tolerant >= 2 &&
+ !(current->flags & PF_EXITING) &&
+ current->pid &&
+ !in_interrupt() &&
+ regs->sp >= pstack && regs->sp <= pstack + THREAD_SIZE) {
+ pr_err("Unsafe killing of current process in kernel context");
+ do_exit(SIGBUS);
+ }
+
+ panic("Memory error machine check in kernel context at %llx",
+ PADDR(pfn));
+}
+
+#undef pr_fmt
+#define pr_fmt(x) x
+
/*
* The actual machine check handler. This only handles real
* exceptions when something got corrupted coming in through int 18.
@@ -1072,12 +1197,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
continue;
}

- /*
- * Kill on action required.
- */
- if (severity == MCE_AR_SEVERITY)
- kill_it = 1;
-
mce_read_aux(&m, i);

/*
@@ -1122,6 +1241,15 @@ void do_machine_check(struct pt_regs *regs, long error_code)
mce_panic("Fatal machine check on current CPU", &m, msg);

/*
+ * Do recovery in current process if needed. This has to be delayed
+ * until we're back on the process stack.
+ */
+ if (worst == MCE_AR_SEVERITY) {
+ mce_action_required(&m, msg, regs);
+ kill_it = 0;
+ }
+
+ /*
* If the error seems to be unrecoverable, something should be
* done. Try to kill as little as possible. If we can kill just
* one task, do that. If the user has set the tolerance very
@@ -1136,6 +1264,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)

if (worst > 0)
mce_report_event(regs);
+
+ /*
+ * We seem to be making TIF_MCE_NOTIFY serve two purposes:
+ * 1: Get the log of this event moving
+ * 2: Don't let us return to an "Action Required" user process.
+ * But mce_report_event() may end up clearing the flag, so we
+ * set it again here if needed to stop us returning to the
+ * user code that triggered this machine check.
+ */
+ if (worst == MCE_AR_SEVERITY)
+ set_thread_flag(TIF_MCE_NOTIFY);
+
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
out:
atomic_dec(&mce_entry);
@@ -1157,8 +1297,6 @@ void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
* per CPU.
* Note we don't disable preemption, so this code might run on the wrong
* CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
*/
void mce_notify_process(struct pt_regs *regs)
{
@@ -1166,6 +1304,9 @@ void mce_notify_process(struct pt_regs *regs)
mce_notify_irq();
while (mce_ring_get(&pfn))
memory_failure(pfn, MCE_VECTOR);
+
+ if (regs && current->mce_error_pfn != -1L)
+ handle_action_required(regs);
}

static void mce_process_work(struct work_struct *dummy)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index caa151f..16ab936 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -124,6 +124,12 @@ extern struct cred init_cred;
# define INIT_PERF_EVENTS(tsk)
#endif

+#ifdef CONFIG_X86_MCE
+#define INIT_MCE_ERROR_PFN .mce_error_pfn = -1L,
+#else
+#define INIT_MCE_ERROR_PFN
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -192,6 +198,7 @@ extern struct cred init_cred;
INIT_FTRACE_GRAPH \
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
+ INIT_MCE_ERROR_PFN \
}


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 781abd1..a72f3aa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1540,6 +1540,9 @@ struct task_struct {
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
#endif
+#ifdef CONFIG_X86_MCE
+ unsigned long mce_error_pfn;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
--
1.7.3.1

2011-05-24 03:40:44

by Ingo Molnar

Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server


* Luck, Tony <[email protected]> wrote:

> Here's a nine-part patch series to implement "AR=1" recovery
> that will be available on high-end Sandy Bridge server processors.

So, the generic comment i have on this is similar to what i wrote to
Huang Ying already: i'd really like to see progress on the proper
RAS / event approach instead of seeing further iterations of the
very limited and isolated severity levels and mcelog approach ...

Also, more integration with EDAC would be required: there is no reason why the
framework and user/admin appearance of hardware error handling of CPU and
memory chipset level hardware should be different. In fact the physical
distinction is largely moot since memory controllers moved into the CPU ...

> In this case the process detects an uncorrectable memory error
> while doing an instruction of data fetch that is about to be
> consumed. This is in contrast to the recoverable errors on
> Nehalem and Westmere that were out of immediate execution context
> (patrol scrubber and cache line write-back).
>
> The code is based on work done by Andi last year and published in
> the "mce/action-required" branch of his mce git tree:
> git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git
> Thus he gets author credit on 6 out of 9 patches (but I'll take
> the blame for all of them).

Well, did Andi tell you that Thomas and I NAK-ed those bits and what reasons
we outlined? See:

http://lkml.org/lkml/2010/2/16/309

We are more than half a year down the road and i do not see much progress on
those important conceptual points in this RFC patchset either. The technical
points we made in our original NAK were left largely unaddressed AFAICS.

The two main technical points made there still hold today:

- Please work on integrating RAS/EDAC with MCE. Both Mauro and Boris have
made progress there, but frankly, i only saw Andi ignoring/stonewalling
them.

- Please work on using a proper event framework for new hardware events,
we really do not want to extend the mcelog mess, and we want to help
efforts like unified system event logging and the new RAS daemon.

Now, some of that might be fun and non-trivial infrastructure work - so i'm not
asking you to implement the world and everything on day 1. But seeing this
non-progress on the Intel MCE side why should i consider lifting the NAK?

> The first eight patches are mostly cleanups and minor new bits
> that are needed by part 9 where the interesting stuff happens.
>
> For the "in context" case, we must not return from the machine
> check handler (in the data fetch case we'd re-execute the fetch
> and take another machine check, in the instruction fetch case
> we actually don't have a precise IP to return to). We use the
> TIF_MCE_NOTIFY task flag bit to ensure that we don't return to
> the user context - but we also need to keep track of the memory
> address where the fault occurred. The h/w only gives us the physical
> address which we must keep track of ... to do so we have added
> "mce_error_pfn" to the task structure - this feels odd, but it
> is an attribute of the task (e.g. this task may be migrated to
> another processor before we get to look at TIF_MCE_NOTIFY and
> head to do_notify_resume() to process it).

There is absolutely *no reason* for the MCE code to hook in to the kernel at
such a low level and muck with task struct!

The error PFN is a useful event field - use it as a TRACE_EVENT() field.
Processing events in non-NMI context is something you can do already as well,
see how perf_event_wakeup() delays a ring-buffer wakeup to post-NMI IRQ
context.

Creating a callback there would be a good place to do the TIF_MCE work and also
to extract any events that got queued by other NMIs. Note that more events
might be queued by further NMIs while we are processing the MCE path - while
with the task->mce_error_pfn hack we are limited to a single pending event only
and subsequent NMIs will overwrite this value!

A happy side effect is that the TIF_MCE_NOTIFY hack could go away as well.

Also note that if it's done via events then injection support will naturally
extend to this capability as well and you can test it by injecting arbitrary
error-PFNs ...

So my broad take on this is that historically the MCE code started out as a
limited set of very low level hacks, both on the CPU support side, on the
logging side and the user ABI side. For years it was only supporting AMD CPUs
in a serious fashion. This was borderline acceptable because frankly back then
MCE did not really matter much: boxes died faster than anyone could do anything
intelligent about it.

But today it matters more. There are more intelligent errors emitted by the
hardware and less fatal hardware behavior. The boundary between 'hard errors',
'soft errors' and 'system events' has become fuzzier: so we really want to have
a generic notion for system events that matter to performance and reliability -
instead of these isolated islands of weirdly coded functionality that the Intel
MCE code is today. We want to be able to log all of these event classes, in one
unified framework, and define policy on them.

So we *really* want to promote this code to a higher level of abstraction.
Everyone would benefit from doing that: Intel hardware error handling features
would be enabled much more richly and i suspect they would also be *used* in a
much more meaningful way - driving the hw cycle as well.

Thanks,

Ingo

2011-05-24 08:14:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, May 24, 2011 at 05:40:23AM +0200, Ingo Molnar wrote:
> So we *really* want to promote this code to a higher level of abstraction.
> Everyone would benefit from doing that: Intel hardware error handling features
> would be enabled much more richly and i suspect they would also be *used* in a
> much more meaningful way - driving the hw cycle as well.

Absolutely agreed. The RAS architecture should look like this, IMHO:

I. Event collection: #MC handler and pollers, no queueing or buffering crap.

II. Pluggable and extensible filters which are
* per vendor
* configurable from userspace
* easily extensible
* decide whether action should be taken in the kernel or error is non-critical
and should go to RAS daemon

III. Error handling callback(s)
* also extensible
* also per vendor
* also configurable from userspace

Advantages:
* reuse perf code - no need for ad-hoc new buffers and lockless thingies when we
have it all already

* easy code and even hw testing with perf inject or ras inject
** this gives us also the different injection methods per vendor in a unified
way instead of interfaces in /sys or debugfs or mcelog or ...

* keep code design sane instead of letting it needlessly fiddle with
other parts of the kernel

* ...

Now I should better go and put my patches where my mouth is :).

--
Regards/Gruss,
Boris.

2011-05-24 16:58:06

by Luck, Tony

[permalink] [raw]
Subject: RE: [RFC 0/9] mce recovery for Sandy Bridge server

Other points noted - I'll go look at the previous discussion threads
you gave me links to.

I do want to comment on this point:
> Creating a callback there would be a good place to do the TIF_MCE work and also
> to extract any events that got queued by other NMIs. Note that more events
> might be queued by further NMIs while we are processing the MCE path - while
> with the task->mce_error_pfn hack we are limited to a single pending event only
> and subsequent NMIs will overwrite this value!

I wasn't very happy with task->mce_error_pfn either - but being overwritten
is not one of its flaws. The task that stumbled on the error must not be
run until the error is dealt with - any other NMIs for other errors must
be happening to other tasks (who have their own task->mce_error_pfn).

> A happy side effect is that the TIF_MCE_NOTIFY hack could go away as well.

We need some way to stop the task that found the error dead in its
tracks - if it tripped over a data error, then running it will just trip
over the same error again. If it had a memory error during an instruction
fetch we have no place to return to.

So can we talk about this part for a while before returning to the
"how to report this" discussion?

So here's the situation - we are in the NMI handler when we find from
looking at the machine check bank registers that we have a recoverable
error. We know the physical address, and we know the task (which might
have been in user or kernel context). I can package that information
into a perf/event ... but then how can I mark the current task as
not-fit-for-execution?

-Tony

2011-05-24 17:33:31

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, May 24, 2011 at 09:57:46AM -0700, Luck, Tony wrote:
> So can we talk about this part for a while before returning to the
> "how to report this" discussion?
>
> So here's the situation - we are in the NMI handler when we find from
> looking at the machine check bank registers that we have a recoverable
> error. We know the physical address, and we know the task (which might
> have been in user or kernel context). I can package that information
> into a perf/event ... but then how can I mark the current task as
> not-fit-for-execution?

Maybe something like

set_current_state(TASK_UNINTERRUPTIBLE);

finish work in NMI context

do remaining work in process context like sending appropriate signals
etc; finally:

set_task_state(tsk, TASK_RUNNING)

?

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-05-24 17:56:28

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

Dragging PeterZ to this thread, since we are now talking about scheduler.

On Tue, May 24, 2011 at 10:33 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, May 24, 2011 at 09:57:46AM -0700, Luck, Tony wrote:
>> So can we talk about this part for a while before returning to the
>> "how to report this" discussion?
>>
>> So here's the situation - we are in the NMI handler when we find from
>> looking at the machine check bank registers that we have a recoverable
>> error. We know the physical address, and we know the task (which might
>> have been in user or kernel context). I can package that information
>> into a perf/event ... but then how can I mark the current task as
>> not-fit-for-execution?
>
> Maybe something like
>
> set_current_state(TASK_UNINTERRUPTIBLE);
>
> finish work in NMI context
>
> do remaining work in process context like sending appropriate signals
> etc; finally:
>
> set_task_state(tsk, TASK_RUNNING)

That looks pretty easy - are there any weird side effects that I should
be worried about? My perf/event can't really include the "task" pointer
(that sounds way too internal) - but I can provide the process id, so
the "RAS daemon" that sees this event can look up the task to do that
final set_task_state(tsk, TASK_RUNNING).

Does this work in the threaded case? In the case where the task was in
kernel context (but in a CONFIG_PREEMPT=y kernel at some point
where preemption is allowed)?

-Tony

2011-05-24 21:04:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, May 24, 2011 at 10:56:26AM -0700, Tony Luck wrote:
> > Maybe something like
> >
> > set_current_state(TASK_UNINTERRUPTIBLE);
> >
> > finish work in NMI context
> >
> > do remaining work in process context like sending appropriate signals
> > etc; finally:
> >
> > set_task_state(tsk, TASK_RUNNING)
>
> That looks pretty easy - are there any weird side effects that I should
> be worried about? My perf/event can't really include the "task" pointer
> (that sounds way too internal) - but I can provide the process id, so
> the "RAS daemon" that sees this event can look up the task to do that
> final set_task_state(tsk, TASK_RUNNING).

Actually, I was thinking more in the direction of doing this in a kernel
thread or workqueue without going back to the RAS daemon. Then you would
only need to save the task_struct ptr.

> Does this work in the threaded case? In the case where the task was in
> kernel context (but in a CONFIG_PREEMPT=y kernel at some point
> where preemption is allowed)?

Well, IIUC and depending on the error, if it is severe enough, you would
want to run the remaining work right after the NMI handler finishes
without going to userspace.

Hmm..

--
Regards/Gruss,
Boris.

2011-05-24 21:21:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, 2011-05-24 at 10:56 -0700, Tony Luck wrote:
> Dragging PeterZ to this thread, since we are now talking about scheduler.
>
> On Tue, May 24, 2011 at 10:33 AM, Borislav Petkov <[email protected]> wrote:
> > On Tue, May 24, 2011 at 09:57:46AM -0700, Luck, Tony wrote:
> >> So can we talk about this part for a while before returning to the
> >> "how to report this" discussion?
> >>
> >> So here's the situation - we are in the NMI handler when we find from
> >> looking at the machine check bank registers that we have a recoverable
> >> error. We know the physical address, and we know the task (which might
> >> have been in user or kernel context). I can package that information
> >> into a perf/event ... but then how can I mark the current task as
> >> not-fit-for-execution?
> >
> > Maybe something like
> >
> > set_current_state(TASK_UNINTERRUPTIBLE);
> >
> > finish work in NMI context
> >
> > do remaining work in process context like sending appropriate signals
> > etc; finally:
> >
> > set_task_state(tsk, TASK_RUNNING)
>
> That looks pretty easy - are there any weird side effects that I should
> be worried about? My perf/event can't really include the "task" pointer
> (that sounds way too internal) - but I can provide the process id, so
> the "RAS daemon" that sees this event can look up the task to do that
> final set_task_state(tsk, TASK_RUNNING).
>
> Does this work in the threaded case? In the case where the task was in
> kernel context (but in a CONFIG_PREEMPT=y kernel at some point
> where preemption is allowed)?


Right, so you can't do things like that from NMI context, but what perf
can do is raise a self-IPI and continue from IRQ context (question for
the HW folks, can there be cycles between the NMI iret and IRQ assert
from whatever context was before the NMI hit?)

From IRQ context we can wake threads, set TIF flags etc.; you can
basically do what SIGSTOP does and put the task in TASK_STOPPED state,
wake your handler thread and set TIF_NEED_RESCHED. Then the handler
thread will be scheduled depending on your handler's sched policy.


2011-05-24 21:31:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, May 24, 2011 at 2:24 PM, Peter Zijlstra <[email protected]> wrote:
>
> Right, so you can't do things like that from NMI context, but what perf
> can do is raise a self-IPI and continue from IRQ context (question for
> the HW folks, can there be cycles between the NMI iret and IRQ assert
> from whatever context was before the NMI hit?)

Of course there can be - the code where the NMI hit may have
interrupts disabled.

Linus

2011-05-24 21:34:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, 2011-05-24 at 14:30 -0700, Linus Torvalds wrote:
> On Tue, May 24, 2011 at 2:24 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > Right, so you can't do things like that from NMI context, but what perf
> > can do is raise a self-IPI and continue from IRQ context (question for
> > the HW folks, can there be cycles between the NMI iret and IRQ assert
> > from whatever context was before the NMI hit?)
>
> Of course there can be - the code where the NMI hit may have
> interrupts disabled.

D'oh, yes of course!..

Then I can't immediately see a way to stop a thread dead in its tracks.
Userspace yes, but not kernel-space.

2011-05-24 21:42:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2011-05-24 at 14:30 -0700, Linus Torvalds wrote:
> > On Tue, May 24, 2011 at 2:24 PM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > Right, so you can't do things like that from NMI context, but what perf
> > > can do is raise a self-IPI and continue from IRQ context (question for
> > > the HW folks, can there be cycles between the NMI iret and IRQ assert
> > > from whatever context was before the NMI hit?)
> >
> > Of course there can be - the code where the NMI hit may have
> > interrupts disabled.
>
> D'oh, yes of course!..
>
> Then I can't immediately see a way to stop a thread dead in its tracks.
> Userspace yes, but not kernel-space.

It's a best-effort thing: run proper kernel code (hardirq context) as soon as
possible - but obviously not sooner than that! :-)

Thanks,

Ingo

2011-05-24 21:48:33

by Tony Luck

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, May 24, 2011 at 2:30 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, May 24, 2011 at 2:24 PM, Peter Zijlstra <[email protected]> wrote:
>>
>> Right, so you can't do things like that from NMI context, but what perf
>> can do is raise a self-IPI and continue from IRQ context (question for
>> the HW folks, can there be cycles between the NMI iret and IRQ assert
>> from whatever context was before the NMI hit?)
>
> Of course there can be - the code where the NMI hit may have
> interrupts disabled.

But the case when I'd want to do the "stop this task" thing is when I
think that I can recover - for memory errors detected while in kernel
code I expect this will only ever be a few special cases:
1) copy to/from user
2) copy page (for copy-on-write fault)
3) ...
and in these cases we don't have interrupts disabled. In fact I have
difficulty imagining a scenario where the kernel trips over a memory
error in interrupt disabled code that would ever be recoverable.

So my NMI handler can look at the saved pt_regs to see whether
it blasted its way into some interrupt disabled code and call that
fatal - if it came in while interrupts were enabled, then it could use
Peter's self-IPI thingy.

-Tony

2011-05-25 06:03:48

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

(2011/05/24 6:54), Luck, Tony wrote:
> Andi's recovery code can also handle a few cases where the
> error is detected while running kernel code (when copying
> data to/from a user process) - but the TIF_MCE_NOTIFY method
> doesn't actually ever get to this code (since the entry_64.S code
> only checks TIF_MCE_NOTIFY on return to userspace). I'd
> appreciate any ideas on how to handle this. Perhaps we could
> do good things when CONFIG_PREEMPT=y (it seems probable that
> any error in a non-preemptible section of kernel code is going
> to be fatal).

How about separating the work into steps:
step1) Add support for AR in user space :
- send sigbus to affected processes, poison affected memory
- panic if error is in kernel
step2) Add support for AR in kernel
- some new notify/handle mechanism etc.

It seems too big a jump for me.


Thanks,
H.Seto

2011-05-25 10:02:08

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

On Tue, May 24, 2011 at 02:48:30PM -0700, Tony Luck wrote:

> But the case when I'd want to do the "stop this task" thing is when I
> think that I can recover - for memory errors detected while in kernel
> code I expect this will only ever be a few special cases:
> 1) copy to/from user
> 2) copy page (for copy-on-write fault)
> 3) ...
> and in these cases we don't have interrupts disabled. In fact I have
> difficulty imagining a scenario where the kernel trips over a memory
> error in interrupt disabled code that would ever be recoverable.

Perf itself has a copy_from_user_nmi which is used to collect stack
traces. But those non-preemptible user accesses can be moved out of
your special cases.

Joerg

2011-05-25 13:44:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server


* Luck, Tony <[email protected]> wrote:

> Other points noted - I'll go look at the previous discussion threads
> you gave me links to.
>
> I do want to comment on this point:
> > Creating a callback there would be a good place to do the TIF_MCE work and also
> > to extract any events that got queued by other NMIs. Note that more events
> > might be queued by further NMIs while we are processing the MCE path - while
> > with the task->mce_error_pfn hack we are limited to a single pending event only
> > and subsequent NMIs will overwrite this value!
>
> I wasn't very happy with task->mce_error_pfn either - but being
> overwritten is not one of its flaws. The task that stumbled on the
> error must not be run until the error is dealt with - any other
> NMIs for other errors must be happening to other tasks (who have
> their own task->mce_error_pfn).

If that task is running right now you might not have much of a chance
but to be prepared for a repeat NMI the moment NMIs are enabled again
after the IRET.

So repeat errors are not just a theoretical possibility here.

But yeah, this is not the worst ugliness of it indeed.

> > A happy side effect is that the TIF_MCE_NOTIFY hack could go away
> > as well.
>
> We need some way to stop the task that found the error dead in its
> tracks - if it tripped over a data error, then running it will just
> trip over the same error again. If it had a memory error during an
> instruction fetch we have no place to return to.

Well, the primary thing TIF_MCE_NOTIFY does is a roundabout way to
iterate through repeat calls to memory_failure(), with all pfns that
got buffered so far.

We already have a generic facility to do such things at
return-to-userspace: _TIF_USER_RETURN_NOTIFY.

Btw., the SIGKILL logic is probably overcomplicated: when it's clear
that user-space can not recover why not do a do_exit() and be done
with it? As long as it's called from a syscall level codepath and no
locks/resources are held do_exit() can be called.

> So can we talk about this part for a while before returning to the
> "how to report this" discussion?
>
> So here's the situation - we are in the NMI handler when we find
> from looking at the machine check bank registers that we have a
> recoverable error. We know the physical address, and we know the
> task (which might have been in user or kernel context). I can
> package that information into a perf/event ... but then how can I
> mark the current task as not-fit-for-execution?

Well, you'd queue up a user return notifier, process the ring-buffer
at that end (as a ring-buffer consumer), call policy action like
memory_failure() which does its MM specific checks to see whether the
task is recoverable or not.

The goal is that we wouldn't lose any existing hwpoison functionality
over what hwpoison does today - we'd *gain* functionality:

- through the use of enumerated events we would give access to
*other* users of the same event source as well, not just hwpoison.

- we'd also eventually remove the mcelog ring-buffer mechanism
itself.

For this i'd suggest to take a look at Frederic's recent patches on
lkml which decouple the events code some more:

Subject: [PATCH v2] hw_breakpoint: Let the user choose not to build it (and perf too)

Those could be compounded with more decoupling of perf events: for
example right now 'struct perf_event' is pretty big, because it
carries hw PMU details as well. But those details are not needed for
non-PMU events so splitting those from struct perf_event would
streamline this code some more.

Likewise the kernel/events/core.c ring-buffer bits could be split out
into kernel/events/ring-buffer.c - i think Frederic might be working
on this one already.

More advanced steps in the MCE area would be to replace the
'severity' logic (which is really a poor-man's filter) with event
filters and panic the box (or do other immediate action) if the
filter matches.

This again would not lose any functionality that the severity bits
give us today, we'd gain functionality:

- the conditions in filter expressions are pretty flexible so we
could do more than the current stack of severity bitmask ops. For
example a system could be configured to ignore the first N
non-fatal messages but panic the box if more than 10 errors were
generated. If there's a "message_seq_nr" field available in the
TRACE_EVENT() then this would be a simple "message_seq_nr >= 10"
filter condition. With the current severity code this would have
to be coded in, the ABI extended, etc. etc.

Although initially it would probably want to match what the severity
filters do today and install all the default policy.

Thanks,

Ingo

2011-05-25 16:44:14

by Luck, Tony

[permalink] [raw]
Subject: RE: [RFC 0/9] mce recovery for Sandy Bridge server

> How about separating stuffs in:
> step1) Add support for AR in user space :
> - send sigbus to affected processes, poison affected memory
> - panic if error is in kernel
> step2) Add support for AR in kernel
> - some new notify/handle mechanism etc.
>
> It seems too big jump for me.

Agreed - we can go step by step with different recovery cases. Deciding
which to implement is a classic benefit vs. effort trade off.

For each case determine some ball-park numbers for the percentage of
memory that will be in the state you wish to recover (hence user-mode
scores very highly because most people buy machines to run applications,
though there are some exceptions like file servers).

Determine the effort and invasiveness of a solution to recover - here
it is clear that a way to handle arbitrary kernel memory corruption
is never going to fly - but there is hope for some simple cases like
copy to/from user (that already are specially tagged so that user level
page faults can be processed).

-Tony

2011-05-25 21:43:58

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

2011/5/25 Ingo Molnar <[email protected]>:
> Btw., the SIGKILL logic is probably overcomplicated: when it's clear
> that user-space can not recover why not do a do_exit() and be done
> with it? As long as it's called from a syscall level codepath and no
> locks/resources are held do_exit() can be called.

There is no SIGKILL - we use SIGBUS because it generally isn't clear
to the kernel whether the error is recoverable (the kernel can tell whether
it is *transparently* recoverable - e.g. by replacing a corrupt memory
page with a new copy read from disk in the case that the page is
mapped from a file and still marked as clean) - but if the kernel can't
recover, we want to give the application a shot at doing so. So we send
a SIGBUS with a payload specifying the virtual address and amount
of data that has been lost.

One database vendor has already used this mechanism in a demo
of application level recovery - a second is looking at doing so, and a
third was internally divided about whether the engineering cost of
doing this was justified given the rate of 2+bit memory errors.

[We do need a tweak here - it isn't helpful to have the application
drop a core file in the SIG_DFL case - so we really ought to stop
it from doing so]

> - the conditions in filter expressions are pretty flexible so we
>   could do more than the current stack of severity bitmask ops. For
>   example a system could be configured to ignore the first N
>   non-fatal messages but panic the box if more than 10 errors were
>   generated. If there's a "message_seq_nr" field available in the
>   TRACE_EVENT() then this would be a simple "message_seq_nr >= 10"
>   filter condition. With the current severity code this would have
>   to be coded in, the ABI extended, etc. etc.

Generally you'd want to avoid rules based on absolute counts like this,
if you simply panic when you get to an event count of 10, then any system
that runs for long enough will eventually accrue this many errors and die.
Much better to use "events per time-window" (or a leaky bucket algorithm
that slowly "forgets" about old errors). You might also want to keep
separate counts per component (e.g. DIMM stick) because 10 errors
from one DIMM stick may well indicate a problem with that DIMM, but
10 errors from different DIMMs is more likely an indication that your
power supply is glitchy.

I'll have to think about whether some parts of what is being done by
the existing severity code could be moved out to filters - I'm not
certain that they can - the code uses that table to parse what's in the
machine check banks as described in volume 3A, chapter 15 of the SDM to
determine just what is going on. The severity codes refer to each bank
(and each logical cpu nominally has its own set of banks - some banks
are actually shared between hyperthreads on the same core, or cores
on the same socket). The meanings are:

MCE_NO_SEVERITY = no error logged in this bank

MCE_KEEP_SEVERITY = something here, but is not useful in our
current context, leave it alone. The "S" bit in the MCi_STATUS register
is used to mark whether an entry should be processed by CMCI/poll of the
banks, or by the NMI machine check event handler (this resolves races
when a machine check is delivered while handling a CMCI)

MCE_SOME_SEVERITY = a real error, low severity (e.g. h/w has already
corrected it)

MCE_AO_SEVERITY = an uncorrected error has been found, but it need not
be handled right away (e.g. patrol scrubber found a 2-bit error in
memory that is not currently being accessed by any processor).

MCE_UC_SEVERITY - on pre-Nehalem cpus uncorrected errors are never
recoverable, so the AO and AR values are not used

MCE_AR_SEVERITY - an uncorrected error in current execution context - something
must be done, if OS can't figure out what, then this error is fatal.

MCE_PANIC_SEVERITY - instant death, no saving throw (log to NVRAM if you
have it)


So I think that we still need this triage - to tell us which sort of
perf/event to generate (corrected vs. uncorrected, memory vs. something
else, ...), and whether we need to take some action in the kernel
immediately.

Probably all the event filtering can do is count and analyse the stream
of corrected and recovered errors to look for patterns for some
pre-emptive action - but the bulk of the complex logic for this should
be in the user level "RASdaemon" that is consuming the perf/events.

-Tony

2011-05-25 21:47:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server


* Tony Luck <[email protected]> wrote:

> 2011/5/25 Ingo Molnar <[email protected]>:
>
> > Btw., the SIGKILL logic is probably overcomplicated: when it's
> > clear that user-space can not recover why not do a do_exit() and
> > be done with it? As long as it's called from a syscall level
> > codepath and no locks/resources are held do_exit() can be called.
>
> There is no SIGKILL [...]

there is one here:

mm/memory-failure.c::kill_procs_ao()

force_sig(SIGKILL, tk->tsk);

Which can be reached from memory_failure() in certain (justified)
circumstances.

Thanks,

Ingo

2011-05-25 23:53:13

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

2011/5/25 Ingo Molnar <[email protected]>:
> Well, the primary thing TIF_MCE_NOTIFY does is a roundabout way to
> iterate through repeat calls to memory_failure(), with all pfns that
> got buffered so far.
>
> We already have a generic facility to do such things at
> return-to-userspace: _TIF_USER_RETURN_NOTIFY.

This looked really promising as a way to drop one use of TIF_MCE_NOTIFY,
but it doesn't currently quite do what is needed for my new case.

What I need is a way to grab the current task just before it returns to
user space - what this code appears to do is to catch the current
*processor* just before it sees a flagged process trying to return to
user space.

These aren't quite the same ... if I use "user_return_notifier_register()" in
my machine check handler, what might happen is that entry_64.S
paranoid_userspace may see _TIF_NEED_RESCHED, and call schedule.
Now my "i don't want this to run" process could be picked up by a different
cpu that doesn't have the notifier registered.

The big clue was
head = &get_cpu_var(return_notifier_list);
in fire_user_return_notifiers()


But I wonder if I'm misreading the code - I'm not quite certain
what the kvm code is trying to do when using this, but it looks
to me that it might also suffer from the resched and migrate to
another cpu possibility.

-Tony

2011-05-26 20:16:13

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC 0/9] mce recovery for Sandy Bridge server

2011/5/25 Tony Luck <[email protected]>:
> But I wonder if I'm misreading the code - I'm not quite certain
> what the kvm code is trying to do when using this, but it looks
> to me that it might also suffer from the resched and migrate to
> another cpu possibility.

I didn't notice propagate_user_return_notify() ... which runs during
context switch to move TIF_USER_RETURN_NOTIFY from the
old process to the next one. This deals with the resched problem.

So this really is an "execute this function when the next task
tries to return to user mode" without any regard to *which* task
is being piggy-backed. It might be the one for which we first set
TIF_USER_RETURN_NOTIFY, or it might be some totally different
task many context switches later (if we happen to run a bunch of
kernel daemons that don't return to user mode).

This could work for the current usage of TIF_MCE_NOTIFY (which
looks like it is just trying to get some task to push the error logs
along their path to /dev/mcelog, and process any "action optional"
faults that have recorded page numbers of memory that needs to
be looked at, but for which there is no rush).

But it doesn't work for the "action required" case where I need to
stop the task that hit the error from running, and figure out the
virtual address that it is using to map the physical address that
the h/w reported as having an error. To get this behavior I'd need
the list of functions to call have its head in the task structure (so
the list is attached to a specific task). But I'd need another TIF
bit (unless I can re-purpose TIF_MCE_NOTIFY). It would get rid
of the task->mce_error_pfn (since I'd pass parameters in the
structure that is the container of the one that links onto the list).

Would such a change be worth prototyping?

-Tony