This is the in-kernel recovery path for memory errors. Still
no change to the eventual reporting mechanism (mcelog vs. perf).
Lots of changes based on feedback from the previous set of
discussions. Big stuff:
0) Scaled back on ambitions - just worry about recovery from
memory errors while executing in user mode. Adding recovery
for some kernel cases is deferred to some future date.
1) Took Seto-san's "mce_gather_info()" patch in place of the
one to just move mce_get_rip() around. Added an expanded
comment to explain that we need to collect all this information
to make decisions about severity for errors we find when scanning
the machine check banks. The rest of Seto-san's patches to
make the severity table easier to read look interesting.
2) Ingo pointed me to the user-return-notifier code as an alternative
   to TIF_MCE_NOTIFY. Part 7 uses return notifiers to cover the
   existing use of this TIF bit (Avi: Thanks for the review)
3) Take the newly freed TIF_MCE_NOTIFY bit and use it to implement
an extension in part 8 to make a "per-task" user notifier.
4) Re-implement the recovery path using the task_return_notifier ...
   this gets rid of the obnoxious "mce_error_pfn" field that was used
   in the previous implementation (the overall flow is sketched below).
N.B. Both the existing user return notifier code and my extensions to
it are not NMI-safe because they use ordinary Linux lists.
If this seems to be a worthwhile direction, that deficiency can
be fixed using Ying's NMI-safe singly linked lists. The same goes for
the hacky "allocator" of task_notify structures in mce_action_required():
it is just a proof of concept and should be replaced with a more
generic NMI-safe allocator.
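For reference, the action-required flow ends up looking roughly like
this (condensed from parts 8 and 10; the cmpxchg slot hunt is folded
into a hypothetical find_free_task_notify_slot() and all error handling
is omitted):

    /* NMI context: do_machine_check(), worst == MCE_AR_SEVERITY */
    np = find_free_task_notify_slot();          /* not a real function */
    np->urn.on_user_return = mce_do_task;
    np->pfn = m.addr >> PAGE_SHIFT;
    task_return_notifier_register(&np->urn);

    /* Later, on the victim task's own stack, just before it would
     * return to user mode: */
    static void mce_do_task(struct user_return_notifier *urn)
    {
            struct task_notify *np = container_of(urn, struct task_notify, urn);
            unsigned long pfn = np->pfn;

            task_return_notifier_unregister(urn);
            atomic_set(&np->inuse, 0);
            handle_action_required(pfn);    /* __memory_failure() or SIGBUS */
    }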
-Tony
0001-MCE-fixes-for-mce-severity-table
0002-MCE-save-most-severe-error-information
0003-MCE-introduce-mce_gather_info
0004-MCE-Move-ADDR-MISC-reading-code-into-common-function
0005-MCE-Mask-out-address-mask-bits-below-address-granual
0006-HWPOISON-Handle-hwpoison-in-current-process
0007-MCE-replace-mce.c-use-of-TIF_MCE_NOTIFY-with-user_re
0008-NOTIFIER-Take-over-TIF_MCE_NOTIFY-and-implement-task
0009-MCE-run-through-processors-with-more-severe-problems
0010-MCE-Add-Action-Required-support
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 8 +-
arch/x86/kernel/cpu/mcheck/mce-severity.c | 37 +++-
arch/x86/kernel/cpu/mcheck/mce.c | 339 ++++++++++++++++++++++-------
arch/x86/kernel/signal.c | 7 +-
include/linux/sched.h | 3 +
include/linux/user-return-notifier.h | 14 ++
kernel/fork.c | 1 +
kernel/user-return-notifier.c | 36 +++
mm/memory-failure.c | 28 ++-
10 files changed, 369 insertions(+), 105 deletions(-)
Andi Kleen (3):
MCE: Move ADDR/MISC reading code into common function
MCE: Mask out address mask bits below address granuality
HWPOISON: Handle hwpoison in current process
Hidetoshi Seto (1):
MCE: introduce mce_gather_info()
Tony Luck (6):
MCE: fixes for mce severity table
MCE: save most severe error information
MCE: replace mce.c use of TIF_MCE_NOTIFY with user_return_notifier
NOTIFIER: Take over TIF_MCE_NOTIFY and implement task return notifier
MCE: run through processors with more severe problems first
MCE: Add Action-Required support
From: Tony Luck <[email protected]>
The "Spurious not enabled" entry is redundant. The "Not enabled"
entry earlier in the table will cover this case.
The "Action required; unknown MCACOD" entry shouldn't specify
MCACOD in the .mask field. Current code will only match for
mcacod==0 rather than all AR=1 entries.
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce-severity.c | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 1e8d66c..352d16a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -69,8 +69,6 @@ static struct severity {
MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
KERNEL),
BITCLR(MCI_STATUS_UC, KEEP, "Corrected error", NOSER),
- MASK(MCI_STATUS_OVER|MCI_STATUS_UC|MCI_STATUS_EN, MCI_STATUS_UC, SOME,
- "Spurious not enabled", SER),
/* ignore OVER for UCNA */
MASK(MCI_UC_SAR, MCI_STATUS_UC, KEEP,
@@ -82,7 +80,7 @@ static struct severity {
/* AR add known MCACODs here */
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
"Action required with lost events", SER),
- MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCACOD, MCI_UC_SAR, PANIC,
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
"Action required; unknown MCACOD", SER),
/* known AO MCACODs: */
--
1.7.3.1
From: Tony Luck <[email protected]>
The monarch clears all of the per-CPU "mces_seen", so we must keep a
local copy to use after mce_end().
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3385ea2..ed1542a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1046,6 +1046,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
}
}
+ /* Save our worst error locally, monarch will clear mces_seen */
+ m = *final;
+
if (!no_way_out)
mce_clear_state(toclear);
@@ -1064,7 +1067,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* support MCE broadcasting or it has been disabled.
*/
if (no_way_out && tolerant < 3)
- mce_panic("Fatal machine check on current CPU", final, msg);
+ mce_panic("Fatal machine check on current CPU", &m, msg);
/*
* If the error seems to be unrecoverable, something should be
--
1.7.3.1
From: Hidetoshi Seto <[email protected]>
This patch introduces a new function, mce_gather_info(), to be called
at the beginning of error handling to gather the minimum error
information from the relevant error registers (and saved registers).
As a result, mce_get_rip() is folded in and the unnecessary
zeroing is removed. This also fixes an issue pointed out by Andi Kleen:
RIP is required to make some decisions about error severity
(for SRAR errors), but it was retrieved later in the handler.
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 50 ++++++++++++++++++++-----------------
1 files changed, 27 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ed1542a..cbd4b0f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -381,6 +381,31 @@ static void mce_wrmsrl(u32 msr, u64 v)
}
/*
+ * Collect all global (w.r.t. this processor) status about this machine
+ * check into our "mce" struct so that we can use it later to assess
+ * the severity of the problem as we read per-bank specific details.
+ */
+static inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
+{
+ mce_setup(m);
+
+ m->mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+ if (regs) {
+ /*
+ * Get the address of the instruction at the time of
+ * the machine check error.
+ */
+ if (m->mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) {
+ m->ip = regs->ip;
+ m->cs = regs->cs;
+ }
+ /* Use accurate RIP reporting if available. */
+ if (rip_msr)
+ m->ip = mce_rdmsrl(rip_msr);
+ }
+}
+
+/*
* Simple lockless ring to communicate PFNs from the exception handler with the
* process context work function. This is vastly simplified because there's
* only a single reader and a single writer.
@@ -451,24 +476,6 @@ static void mce_schedule_work(void)
}
}
-/*
- * Get the address of the instruction at the time of the machine check
- * error.
- */
-static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
-{
-
- if (regs && (m->mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))) {
- m->ip = regs->ip;
- m->cs = regs->cs;
- } else {
- m->ip = 0;
- m->cs = 0;
- }
- if (rip_msr)
- m->ip = mce_rdmsrl(rip_msr);
-}
-
#ifdef CONFIG_X86_LOCAL_APIC
/*
* Called after interrupts have been reenabled again
@@ -549,9 +556,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
percpu_inc(mce_poll_count);
- mce_setup(&m);
+ mce_gather_info(&m, NULL);
- m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
for (i = 0; i < banks; i++) {
if (!mce_banks[i].ctl || !test_bit(i, *b))
continue;
@@ -951,9 +957,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (!banks)
goto out;
- mce_setup(&m);
+ mce_gather_info(&m, regs);
- m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
final = &__get_cpu_var(mces_seen);
*final = m;
@@ -1037,7 +1042,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
mce_ring_add(m.addr >> PAGE_SHIFT);
- mce_get_rip(&m, regs);
mce_log(&m);
if (severity > worst) {
--
1.7.3.1
From: Andi Kleen <[email protected]>
Used for next patch.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index cbd4b0f..0349e87 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -532,6 +532,17 @@ static void mce_report_event(struct pt_regs *regs)
#endif
}
+/*
+ * Read ADDR and MISC registers.
+ */
+static void mce_read_aux(struct mce *m, int i)
+{
+ if (m->status & MCI_STATUS_MISCV)
+ m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
+ if (m->status & MCI_STATUS_ADDRV)
+ m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+}
+
DEFINE_PER_CPU(unsigned, mce_poll_count);
/*
@@ -582,10 +593,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
(m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
continue;
- if (m.status & MCI_STATUS_MISCV)
- m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m.status & MCI_STATUS_ADDRV)
- m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+ mce_read_aux(&m, i);
if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;
@@ -1027,10 +1035,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (severity == MCE_AR_SEVERITY)
kill_it = 1;
- if (m.status & MCI_STATUS_MISCV)
- m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m.status & MCI_STATUS_ADDRV)
- m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+ mce_read_aux(&m, i);
/*
* Action optional error. Queue address for later processing.
--
1.7.3.1
From: Andi Kleen <[email protected]>
SER-enabled systems report the address granularity for each
reported address in a machine check, but the bits below
that granularity are undefined. Mask them out before
logging the machine check.
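For example (illustrative numbers only): if the LSB field read from
MCi_MISC is 12, the low 12 bits of the reported address carry no
information and get cleared:

    u64 addr  = 0x12345abcULL;
    u8  shift = 12;                 /* m->misc & 0x1f in this patch */

    addr >>= shift;
    addr <<= shift;                 /* addr == 0x12345000 */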
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 12 +++++++++++-
1 files changed, 11 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 0349e87..ffc8d11 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -539,8 +539,18 @@ static void mce_read_aux(struct mce *m, int i)
{
if (m->status & MCI_STATUS_MISCV)
m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m->status & MCI_STATUS_ADDRV)
+ if (m->status & MCI_STATUS_ADDRV) {
m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+
+ /*
+ * Mask the reported address by the reported granuality.
+ */
+ if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
+ u8 shift = m->misc & 0x1f;
+ m->addr >>= shift;
+ m->addr <<= shift;
+ }
+ }
}
DEFINE_PER_CPU(unsigned, mce_poll_count);
--
1.7.3.1
From: Andi Kleen <[email protected]>
When hwpoison handles an error hitting the current process, use
a forced signal with _AR severity.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
mm/memory-failure.c | 28 ++++++++++++++++------------
1 files changed, 16 insertions(+), 12 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2b9a5ee..a203113 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -184,8 +184,7 @@ int hwpoison_filter(struct page *p)
EXPORT_SYMBOL_GPL(hwpoison_filter);
/*
- * Send all the processes who have the page mapped an ``action optional''
- * signal.
+ * Send all the processes who have the page mapped a SIGBUS.
*/
static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
unsigned long pfn, struct page *page)
@@ -194,23 +193,28 @@ static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
int ret;
printk(KERN_ERR
- "MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
- pfn, t->comm, t->pid);
+ "MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+ pfn, t->comm, t->pid);
si.si_signo = SIGBUS;
si.si_errno = 0;
- si.si_code = BUS_MCEERR_AO;
si.si_addr = (void *)addr;
#ifdef __ARCH_SI_TRAPNO
si.si_trapno = trapno;
#endif
si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;
- /*
- * Don't use force here, it's convenient if the signal
- * can be temporarily blocked.
- * This could cause a loop when the user sets SIGBUS
- * to SIG_IGN, but hopefully no one will do that?
- */
- ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
+ if (t == current) {
+ si.si_code = BUS_MCEERR_AR;
+ ret = force_sig_info(SIGBUS, &si, t);
+ } else {
+ /*
+ * Don't use force here, it's convenient if the signal
+ * can be temporarily blocked.
+ * This could cause a loop when the user sets SIGBUS
+ * to SIG_IGN, but hopefully noone will do that?
+ */
+ si.si_code = BUS_MCEERR_AO;
+ ret = send_sig_info(SIGBUS, &si, t);
+ }
if (ret < 0)
printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
t->comm, t->pid, ret);
--
1.7.3.1
From: Tony Luck <[email protected]>
Ingo wrote:
> We already have a generic facility to do such things at
> return-to-userspace: _TIF_USER_RETURN_NOTIFY.
This is just a proof-of-concept patch ... before this can become
real, the user-return-notifier code would have to be made NMI
safe (currently it uses hlist_add_head/hlist_del, which would
need to be changed to Ying's NMI-safe singly linked lists).
Reviewed-by: Avi Kivity <[email protected]>
NOT-Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 47 +++++++++++++++++++++++++++----------
arch/x86/kernel/signal.c | 6 -----
3 files changed, 35 insertions(+), 19 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..054e127 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -845,6 +845,7 @@ config X86_REROUTE_FOR_BROKEN_BOOT_IRQS
config X86_MCE
bool "Machine Check / overheating reporting"
+ select USER_RETURN_NOTIFIER
---help---
Machine Check support allows the processor to notify the
kernel if it detects a problem (e.g. overheating, data corruption).
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ffc8d11..28d223e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -38,6 +38,7 @@
#include <linux/mm.h>
#include <linux/debugfs.h>
#include <linux/edac_mce.h>
+#include <linux/user-return-notifier.h>
#include <asm/processor.h>
#include <asm/hw_irq.h>
@@ -69,6 +70,15 @@ atomic_t mce_entry;
DEFINE_PER_CPU(unsigned, mce_exception_count);
+struct mce_notify {
+ struct user_return_notifier urn;
+ bool registered;
+};
+
+static void mce_do_notify(struct user_return_notifier *urn);
+
+static DEFINE_PER_CPU(struct mce_notify, mce_notify);
+
/*
* Tolerant levels:
* 0: always panic on uncorrected errors, log corrected errors
@@ -947,6 +957,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
int i;
int worst = 0;
int severity;
+ struct mce_notify *np;
/*
* Establish sequential order between the CPUs entering the machine
* check handler.
@@ -1099,7 +1110,12 @@ void do_machine_check(struct pt_regs *regs, long error_code)
force_sig(SIGBUS, current);
/* notify userspace ASAP */
- set_thread_flag(TIF_MCE_NOTIFY);
+ np = &__get_cpu_var(mce_notify);
+ if (np->registered == 0) {
+ np->urn.on_user_return = mce_do_notify;
+ user_return_notifier_register(&np->urn);
+ np->registered = 1;
+ }
if (worst > 0)
mce_report_event(regs);
@@ -1116,28 +1132,35 @@ void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
}
+static void mce_process_ring(void)
+{
+ unsigned long pfn;
+
+ mce_notify_irq();
+ while (mce_ring_get(&pfn))
+ memory_failure(pfn, MCE_VECTOR);
+}
+
/*
* Called after mce notification in process context. This code
* is allowed to sleep. Call the high level VM handler to process
* any corrupted pages.
* Assume that the work queue code only calls this one at a time
* per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
*/
-void mce_notify_process(void)
+static void mce_do_notify(struct user_return_notifier *urn)
{
- unsigned long pfn;
- mce_notify_irq();
- while (mce_ring_get(&pfn))
- memory_failure(pfn, MCE_VECTOR);
+ struct mce_notify *np = container_of(urn, struct mce_notify, urn);
+
+ user_return_notifier_unregister(urn);
+ np->registered = 0;
+
+ mce_process_ring();
}
static void mce_process_work(struct work_struct *dummy)
{
- mce_notify_process();
+ mce_process_ring();
}
#ifdef CONFIG_X86_MCE_INTEL
@@ -1218,8 +1241,6 @@ int mce_notify_irq(void)
/* Not more than two messages every minute */
static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
- clear_thread_flag(TIF_MCE_NOTIFY);
-
if (test_and_clear_bit(0, &mce_need_notify)) {
wake_up_interruptible(&mce_wait);
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4fd173c..44efc22 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -838,12 +838,6 @@ static void do_signal(struct pt_regs *regs)
void
do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
{
-#ifdef CONFIG_X86_MCE
- /* notify userspace of pending MCEs */
- if (thread_info_flags & _TIF_MCE_NOTIFY)
- mce_notify_process();
-#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
-
/* deal with pending signal delivery */
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs);
--
1.7.3.1
From: Tony Luck <[email protected]>
The existing user return notifier mechanism is designed to catch a specific
cpu just as it returns to run any task in user mode. We also need a
mechanism to catch a specific task.
N.B. This version is not fit for use - it uses normal Linux lists (which
are not NMI safe). Once Ying's NMI-safe singly linked lists are merged,
it should be updated to use them.
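A minimal usage sketch (illustrative only - the callback and variable
names are made up; the real consumer is the MCE recovery code later in
this series):

    static struct user_return_notifier my_urn;

    static void my_on_return(struct user_return_notifier *urn)
    {
            /* Runs on the registering task's stack, just before that
             * task returns to user mode. */
            task_return_notifier_unregister(urn);
            /* ... deferred per-task work goes here ... */
    }

    /* From atomic context, while running as the task of interest: */
    my_urn.on_user_return = my_on_return;
    task_return_notifier_register(&my_urn);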
NOT-Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/thread_info.h | 8 +++---
arch/x86/kernel/signal.c | 3 ++
include/linux/sched.h | 3 ++
include/linux/user-return-notifier.h | 14 +++++++++++++
kernel/fork.c | 1 +
kernel/user-return-notifier.c | 36 ++++++++++++++++++++++++++++++++++
6 files changed, 61 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 1f2e61e..5574f96 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -82,7 +82,7 @@ struct thread_info {
#define TIF_SYSCALL_EMU 6 /* syscall emulation active */
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
-#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
+#define TIF_TASK_RETURN_NOTIFY 10 /* notify kernel of return to task */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
@@ -105,7 +105,7 @@ struct thread_info {
#define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU)
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
-#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
+#define _TIF_TASK_RETURN_NOTIFY (1 << TIF_TASK_RETURN_NOTIFY)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
@@ -139,8 +139,8 @@ struct thread_info {
((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT)
/* Only used for 64 bit */
-#define _TIF_DO_NOTIFY_MASK \
- (_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME | \
+#define _TIF_DO_NOTIFY_MASK \
+ (_TIF_SIGPENDING | _TIF_TASK_RETURN_NOTIFY | _TIF_NOTIFY_RESUME | \
_TIF_USER_RETURN_NOTIFY)
/* flags to check in __switch_to() */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 44efc22..f0f45b1 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -838,6 +838,9 @@ static void do_signal(struct pt_regs *regs)
void
do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
{
+ if (thread_info_flags & _TIF_TASK_RETURN_NOTIFY)
+ fire_task_return_notifiers();
+
/* deal with pending signal delivery */
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 781abd1..ec7a581 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1540,6 +1540,9 @@ struct task_struct {
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
#endif
+#ifdef CONFIG_USER_RETURN_NOTIFIER
+ struct hlist_head return_notifier_list;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/user-return-notifier.h b/include/linux/user-return-notifier.h
index 9c4a445..6114b25 100644
--- a/include/linux/user-return-notifier.h
+++ b/include/linux/user-return-notifier.h
@@ -15,6 +15,9 @@ struct user_return_notifier {
void user_return_notifier_register(struct user_return_notifier *urn);
void user_return_notifier_unregister(struct user_return_notifier *urn);
+void task_return_notifier_register(struct user_return_notifier *urn);
+void task_return_notifier_unregister(struct user_return_notifier *urn);
+
static inline void propagate_user_return_notify(struct task_struct *prev,
struct task_struct *next)
{
@@ -26,11 +29,18 @@ static inline void propagate_user_return_notify(struct task_struct *prev,
void fire_user_return_notifiers(void);
+void fire_task_return_notifiers(void);
+
static inline void clear_user_return_notifier(struct task_struct *p)
{
clear_tsk_thread_flag(p, TIF_USER_RETURN_NOTIFY);
}
+static inline void clear_task_return_notifier(struct task_struct *p)
+{
+ clear_tsk_thread_flag(p, TIF_TASK_RETURN_NOTIFY);
+}
+
#else
struct user_return_notifier {};
@@ -42,8 +52,12 @@ static inline void propagate_user_return_notify(struct task_struct *prev,
static inline void fire_user_return_notifiers(void) {}
+static inline void fire_task_return_notifiers(void) {}
+
static inline void clear_user_return_notifier(struct task_struct *p) {}
+static inline void clear_task_return_notifier(struct task_struct *p) {}
+
#endif
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..1af37c2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -282,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
+ clear_task_return_notifier(tsk);
clear_tsk_need_resched(tsk);
stackend = end_of_stack(tsk);
*stackend = STACK_END_MAGIC; /* for overflow detection */
diff --git a/kernel/user-return-notifier.c b/kernel/user-return-notifier.c
index 92cb706..7c9389e 100644
--- a/kernel/user-return-notifier.c
+++ b/kernel/user-return-notifier.c
@@ -42,3 +42,39 @@ void fire_user_return_notifiers(void)
urn->on_user_return(urn);
put_cpu_var(return_notifier_list);
}
+
+/*
+ * Request a notification when the current task returns to userspace. Must be
+ * called in atomic context. The notifier will also be called in atomic
+ * context.
+ */
+void task_return_notifier_register(struct user_return_notifier *urn)
+{
+ set_tsk_thread_flag(current, TIF_TASK_RETURN_NOTIFY);
+ hlist_add_head(&urn->link, &(current->return_notifier_list));
+}
+EXPORT_SYMBOL_GPL(task_return_notifier_register);
+
+/*
+ * Removes a registered user return notifier. Must be called from atomic
+ * context, and from the same cpu registration occurred in.
+ */
+void task_return_notifier_unregister(struct user_return_notifier *urn)
+{
+ hlist_del(&urn->link);
+ if (hlist_empty(&(current->return_notifier_list)))
+ clear_tsk_thread_flag(current, TIF_TASK_RETURN_NOTIFY);
+}
+EXPORT_SYMBOL_GPL(task_return_notifier_unregister);
+
+/* Calls registered user return notifiers */
+void fire_task_return_notifiers(void)
+{
+ struct user_return_notifier *urn;
+ struct hlist_node *tmp1, *tmp2;
+ struct hlist_head *head;
+
+ head = &(current->return_notifier_list);
+ hlist_for_each_entry_safe(urn, tmp1, tmp2, head, link)
+ urn->on_user_return(urn);
+}
--
1.7.3.1
From: Tony Luck <[email protected]>
Instead of letting cpus run through the MC bank scanning code
in the order that they turned up in the handler, we arrange to
deal with those that have more severe problems (mcgstatus.ripv=0)
first. This will make life simpler in the case that banks are
shared between processors, since the cpu with the problem will
see it and clear it, leaving the other cpu(s) that share the
bank with nothing to do.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 88 ++++++++++++++++++++++++++-----------
1 files changed, 62 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 28d223e..9c72245 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -662,7 +662,7 @@ static int mce_no_way_out(struct mce *m, char **msg)
* Variable to establish order between CPUs while scanning.
* Each CPU spins initially until executing is equal its number.
*/
-static atomic_t mce_executing;
+static atomic_t mce_executing = ATOMIC_INIT(-1);
/*
* Defines order of CPUs on entry. First CPU becomes Monarch.
@@ -778,13 +778,40 @@ static void mce_reign(void)
static atomic_t global_nwo;
/*
+ * Keep separate bitmaps for cpus that have the option return from
+ * machine check handler (MCG_STATUS.RIPV == 1) and those for that
+ * cannot.
+ */
+static cpumask_t can_return;
+static cpumask_t cant_return;
+
+static int monarch;
+
+/*
+ * next cpu choosing first from cant_return, and then from can_return
+ */
+int mce_nextcpu(int this)
+{
+ int next;
+
+ if (this == -1 || cpumask_test_cpu(this, &cant_return)) {
+ next = cpumask_next(this, &cant_return);
+ if (next >= nr_cpu_ids)
+ next = cpumask_next(-1, &can_return);
+ return next;
+ }
+
+ return cpumask_next(this, &can_return);
+}
+
+/*
* Start of Monarch synchronization. This waits until all CPUs have
* entered the exception handler and then determines if any of them
* saw a fatal event that requires panic. Then it executes them
- * in the entry order.
+ * one at a time.
* TBD double check parallel CPU hotunplug
*/
-static int mce_start(int *no_way_out)
+static int mce_start(int *no_way_out, int noreturn)
{
int order;
int cpus = num_online_cpus();
@@ -800,6 +827,11 @@ static int mce_start(int *no_way_out)
smp_wmb();
order = atomic_inc_return(&mce_callin);
+ if (noreturn)
+ cpumask_set_cpu(smp_processor_id(), &cant_return);
+ else
+ cpumask_set_cpu(smp_processor_id(), &can_return);
+
/*
* Wait for everyone.
*/
@@ -818,23 +850,26 @@ static int mce_start(int *no_way_out)
if (order == 1) {
/*
- * Monarch: Starts executing now, the others wait.
+ * Choose a cpu to be monarch. It will run first
*/
- atomic_set(&mce_executing, 1);
- } else {
- /*
- * Subject: Now start the scanning loop one by one in
- * the original callin order.
- * This way when there are any shared banks it will be
- * only seen by one CPU before cleared, avoiding duplicates.
- */
- while (atomic_read(&mce_executing) < order) {
- if (mce_timed_out(&timeout)) {
- atomic_set(&global_nwo, 0);
- return -1;
- }
- ndelay(SPINUNIT);
+ monarch = mce_nextcpu(-1);
+ atomic_set(&mce_executing, monarch);
+ }
+ /*
+ * Subject: Now start the scanning loop one by one in
+ * choosing first the "cant_return" cpus and the the others.
+ * This way when there are any shared banks it will be
+ * only seen by one CPU before cleared, avoiding duplicates.
+ * Letting the "cant_return" folks go first means that for
+ * "action required" errors, the cpu that hit the problem
+ * will find its error in the shared bank first.
+ */
+ while (atomic_read(&mce_executing) != smp_processor_id()) {
+ if (mce_timed_out(&timeout)) {
+ atomic_set(&global_nwo, 0);
+ return -1;
}
+ ndelay(SPINUNIT);
}
/*
@@ -862,17 +897,16 @@ static int mce_end(int order)
/*
* Allow others to run.
*/
- atomic_inc(&mce_executing);
+ atomic_set(&mce_executing, mce_nextcpu(smp_processor_id()));
- if (order == 1) {
+ if (smp_processor_id() == monarch) {
/* CHECKME: Can this race with a parallel hotplug? */
- int cpus = num_online_cpus();
/*
* Monarch: Wait for everyone to go through their scanning
* loops.
*/
- while (atomic_read(&mce_executing) <= cpus) {
+ while (atomic_read(&mce_executing) != nr_cpu_ids) {
if (mce_timed_out(&timeout))
goto reset;
ndelay(SPINUNIT);
@@ -885,7 +919,7 @@ static int mce_end(int order)
/*
* Subject: Wait for Monarch to finish.
*/
- while (atomic_read(&mce_executing) != 0) {
+ while (atomic_read(&mce_executing) != -1) {
if (mce_timed_out(&timeout))
goto reset;
ndelay(SPINUNIT);
@@ -903,12 +937,14 @@ static int mce_end(int order)
reset:
atomic_set(&global_nwo, 0);
atomic_set(&mce_callin, 0);
+ cpumask_clear(&can_return);
+ cpumask_clear(&cant_return);
barrier();
/*
* Let others run again.
*/
- atomic_set(&mce_executing, 0);
+ atomic_set(&mce_executing, -1);
return ret;
}
@@ -1006,7 +1042,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* This way we don't report duplicated events on shared banks
* because the first one to see it will clear it.
*/
- order = mce_start(&no_way_out);
+ order = mce_start(&no_way_out, kill_it);
for (i = 0; i < banks; i++) {
__clear_bit(i, toclear);
if (!mce_banks[i].ctl)
@@ -2218,7 +2254,7 @@ static void mce_reset(void)
{
cpu_missing = 0;
atomic_set(&mce_fake_paniced, 0);
- atomic_set(&mce_executing, 0);
+ atomic_set(&mce_executing, -1);
atomic_set(&mce_callin, 0);
atomic_set(&global_nwo, 0);
}
--
1.7.3.1
From: Tony Luck <[email protected]>
Implement core MCA recovery. This is used for errors
that happen in the current execution context.
The kernel has to first pass the error information
to a function running on the current process stack.
This is done using task_return_notifier_register().
Just handle errors in user mode for now. Later we
may be able to handle some kernel cases (e.g. when
the kernel is in copy_*_user()).
Based on some original code by Andi Kleen.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce-severity.c | 35 +++++++-
arch/x86/kernel/cpu/mcheck/mce.c | 118 +++++++++++++++++++++++++++--
2 files changed, 142 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 352d16a..fe8a28c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -13,6 +13,7 @@
#include <linux/seq_file.h>
#include <linux/init.h>
#include <linux/debugfs.h>
+#include <linux/module.h>
#include <asm/mce.h>
#include "mce-internal.h"
@@ -54,6 +55,9 @@ static struct severity {
{ .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
#define MASK(x, y, s, m, r...) \
{ .mask = x, .result = y, SEV(s), .msg = m, ## r }
+#define ARMASK(x, y, s, m, r...) \
+ { .mcgmask = MCG_STATUS_RIPV, .mcgres = 0, \
+ .mask = x, .result = y, SEV(s), .msg = m, ## r }
#define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
#define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
#define MCACOD 0xffff
@@ -67,7 +71,7 @@ static struct severity {
MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0, PANIC,
"Neither restart nor error IP"),
MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
- KERNEL),
+ KERNEL, NOSER),
BITCLR(MCI_STATUS_UC, KEEP, "Corrected error", NOSER),
/* ignore OVER for UCNA */
@@ -77,10 +81,16 @@ static struct severity {
"Illegal combination (UCNA with AR=1)", SER),
MASK(MCI_STATUS_S, 0, KEEP, "Non signalled machine check", SER),
- /* AR add known MCACODs here */
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
"Action required with lost events", SER),
- MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
+
+ /* known AR MCACODs: */
+ ARMASK(MCI_UC_SAR|MCI_STATUS_OVER|0xffff, MCI_UC_SAR|0x134, AR,
+ "Action required: data load error", SER),
+ ARMASK(MCI_UC_SAR|MCI_STATUS_OVER|0xffff, MCI_UC_SAR|0x150, AR,
+ "Action required: instruction fetch error", SER),
+
+ ARMASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
"Action required; unknown MCACOD", SER),
/* known AO MCACODs: */
@@ -89,6 +99,7 @@ static struct severity {
MASK(MCI_UC_SAR|MCI_STATUS_OVER|MCACOD, MCI_UC_S|0x17a, AO,
"Action optional: last level cache writeback error", SER),
+
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
"Action optional unknown MCACOD", SER),
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
@@ -110,6 +121,17 @@ static int error_context(struct mce *m)
return IN_KERNEL;
}
+static int kernel_ar_recoverable(struct mce *m, int tolerant)
+{
+ if (tolerant >= 2)
+ return MCE_AR_SEVERITY;
+ if (!(m->mcgstatus & MCG_STATUS_EIPV) || !m->ip)
+ return MCE_PANIC_SEVERITY;
+ if (search_exception_tables(m->ip))
+ return MCE_AR_SEVERITY;
+ return MCE_PANIC_SEVERITY;
+}
+
int mce_severity(struct mce *a, int tolerant, char **msg)
{
enum context ctx = error_context(a);
@@ -129,9 +151,12 @@ int mce_severity(struct mce *a, int tolerant, char **msg)
if (msg)
*msg = s->msg;
s->covered = 1;
- if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
- if (panic_on_oops || tolerant < 1)
+ if (ctx == IN_KERNEL) {
+ if (s->sev >= MCE_UC_SEVERITY &&
+ (panic_on_oops || tolerant < 1))
return MCE_PANIC_SEVERITY;
+ if (s->sev == MCE_AR_SEVERITY)
+ return kernel_ar_recoverable(a, tolerant);
}
return s->sev;
}
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 9c72245..a7a8c53 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -80,6 +80,20 @@ static void mce_do_notify(struct user_return_notifier *urn);
static DEFINE_PER_CPU(struct mce_notify, mce_notify);
/*
+ * Task return notifiers are used for "action required"
+ * recovery of tasks - i.e. we prevent return to the task
+ * that encountered the machine check, but we ensure that
+ * we process the error in task context.
+ */
+struct task_notify {
+ struct user_return_notifier urn;
+ unsigned long pfn;
+ atomic_t inuse;
+};
+static struct task_notify task_notifier[NR_CPUS];
+static void mce_do_task(struct user_return_notifier *urn);
+
+/*
* Tolerant levels:
* 0: always panic on uncorrected errors, log corrected errors
* 1: panic or SIGBUS on uncorrected errors, log corrected errors
@@ -975,6 +989,84 @@ static void mce_clear_state(unsigned long *toclear)
}
}
+/* Stub when hwpoison is not compiled in */
+int __attribute__((weak)) __memory_failure(unsigned long pfn, int vector,
+ int precount)
+{
+ return -1;
+}
+
+/*
+ * Uncorrected error for current process.
+ */
+static void mce_action_required(struct mce *m, char *msg, struct pt_regs *regs)
+{
+ int i;
+
+ if (!mce_usable_address(m))
+ mce_panic("No address for Action-Required Machine Check",
+ m, msg);
+ if (!(m->mcgstatus & MCG_STATUS_EIPV))
+ mce_panic("No EIPV for Action-Required Machine Check",
+ m, msg);
+
+ for (i = 0; i < NR_CPUS; i++)
+ if (!atomic_cmpxchg(&task_notifier[i].inuse, 0, 1))
+ break;
+ if (i == NR_CPUS)
+ mce_panic("Too many concurrent errors", m, msg);
+
+ task_notifier[i].urn.on_user_return = mce_do_task;
+ task_notifier[i].pfn = m->addr >> PAGE_SHIFT;
+ task_return_notifier_register(&task_notifier[i].urn);
+}
+
+#undef pr_fmt
+#define pr_fmt(x) "MCE: %s:%d " x "\n", current->comm, current->pid
+#define PADDR(x) ((u64)(x) << PAGE_SHIFT)
+
+/*
+ * No successfull recovery. Make sure at least that there's
+ * a SIGBUS.
+ */
+static void ar_fallback(struct task_struct *me, unsigned long pfn)
+{
+ if (signal_pending(me) && sigismember(&me->pending.signal, SIGBUS))
+ return;
+
+ /*
+ * For some reason hwpoison wasn't able to send a proper
+ * SIGBUS. Send a fallback signal. Unfortunately we don't
+ * know the virtual address here, so can't tell the program
+ * details.
+ */
+ force_sig(SIGBUS, me);
+ pr_err("Killed due to action-required memory corruption");
+}
+
+/*
+ * Handle action-required on the process stack. hwpoison does the
+ * bulk of the work and with some luck might even be able to fix the
+ * problem.
+ *
+ * Logic changes here should be reflected in kernel_ar_recoverable().
+ */
+static void handle_action_required(unsigned long pfn)
+{
+ struct task_struct *me = current;
+
+ pr_err("Uncorrected hardware memory error in user-access at %llx",
+ PADDR(pfn));
+ if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
+ pr_err("Memory error not recovered");
+ ar_fallback(me, pfn);
+ } else
+ pr_err("Memory error recovered");
+}
+
+#undef pr_fmt
+#define pr_fmt(x) x
+
/*
* The actual machine check handler. This only handles real
* exceptions when something got corrupted coming in through int 18.
@@ -1086,12 +1178,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
continue;
}
- /*
- * Kill on action required.
- */
- if (severity == MCE_AR_SEVERITY)
- kill_it = 1;
-
mce_read_aux(&m, i);
/*
@@ -1136,6 +1222,15 @@ void do_machine_check(struct pt_regs *regs, long error_code)
mce_panic("Fatal machine check on current CPU", &m, msg);
/*
+ * Do recovery in current process if needed. This has to be delayed
+ * until we're back on the process stack.
+ */
+ if (worst == MCE_AR_SEVERITY) {
+ mce_action_required(&m, msg, regs);
+ kill_it = 0;
+ }
+
+ /*
* If the error seems to be unrecoverable, something should be
* done. Try to kill as little as possible. If we can kill just
* one task, do that. If the user has set the tolerance very
@@ -1194,6 +1289,17 @@ static void mce_do_notify(struct user_return_notifier *urn)
mce_process_ring();
}
+static void mce_do_task(struct user_return_notifier *urn)
+{
+ struct task_notify *np = container_of(urn, struct task_notify, urn);
+ unsigned long pfn = np->pfn;
+
+ task_return_notifier_unregister(urn);
+ atomic_set(&np->inuse, 0);
+
+ handle_action_required(pfn);
+}
+
static void mce_process_work(struct work_struct *dummy)
{
mce_process_ring();
--
1.7.3.1
(2011/06/10 6:38), Luck, Tony wrote:
> From: Tony Luck <[email protected]>
>
> Implement core MCA recovery. This is used for errors
> that happen in the current execution context.
>
> The kernel has to first pass the error information
> to a function running on the current process stack.
> This is done using task_return_notifier_register().
>
> Just handle errors in user mode for now. Later we
> may be able to handle some kernel cases (e.g. when
> kernel is in copy_*_user())
>
> Based on some original code by Andi Kleen.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce-severity.c | 35 +++++++-
> arch/x86/kernel/cpu/mcheck/mce.c | 118 +++++++++++++++++++++++++++--
> 2 files changed, 142 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> index 352d16a..fe8a28c 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> @@ -13,6 +13,7 @@
> #include <linux/seq_file.h>
> #include <linux/init.h>
> #include <linux/debugfs.h>
> +#include <linux/module.h>
> #include <asm/mce.h>
>
> #include "mce-internal.h"
> @@ -54,6 +55,9 @@ static struct severity {
> { .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
> #define MASK(x, y, s, m, r...) \
> { .mask = x, .result = y, SEV(s), .msg = m, ## r }
> +#define ARMASK(x, y, s, m, r...) \
> + { .mcgmask = MCG_STATUS_RIPV, .mcgres = 0, \
> + .mask = x, .result = y, SEV(s), .msg = m, ## r }
> #define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
> #define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
> #define MCACOD 0xffff
> @@ -67,7 +71,7 @@ static struct severity {
> MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0, PANIC,
> "Neither restart nor error IP"),
> MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
> - KERNEL),
> + KERNEL, NOSER),
> BITCLR(MCI_STATUS_UC, KEEP, "Corrected error", NOSER),
>
> /* ignore OVER for UCNA */
> @@ -77,10 +81,16 @@ static struct severity {
> "Illegal combination (UCNA with AR=1)", SER),
> MASK(MCI_STATUS_S, 0, KEEP, "Non signalled machine check", SER),
>
> - /* AR add known MCACODs here */
> MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
> "Action required with lost events", SER),
> - MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
> +
> + /* known AR MCACODs: */
> + ARMASK(MCI_UC_SAR|MCI_STATUS_OVER|0xffff, MCI_UC_SAR|0x134, AR,
> + "Action required: data load error", SER),
> + ARMASK(MCI_UC_SAR|MCI_STATUS_OVER|0xffff, MCI_UC_SAR|0x150, AR,
> + "Action required: instruction fetch error", SER),
> +
> + ARMASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR, PANIC,
> "Action required; unknown MCACOD", SER),
>
> /* known AO MCACODs: */
> @@ -89,6 +99,7 @@ static struct severity {
> MASK(MCI_UC_SAR|MCI_STATUS_OVER|MCACOD, MCI_UC_S|0x17a, AO,
> "Action optional: last level cache writeback error", SER),
>
> +
> MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
> "Action optional unknown MCACOD", SER),
> MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
> @@ -110,6 +121,17 @@ static int error_context(struct mce *m)
> return IN_KERNEL;
> }
>
> +static int kernel_ar_recoverable(struct mce *m, int tolerant)
> +{
> + if (tolerant >= 2)
> + return MCE_AR_SEVERITY;
> + if (!(m->mcgstatus & MCG_STATUS_EIPV) || !m->ip)
> + return MCE_PANIC_SEVERITY;
> + if (search_exception_tables(m->ip))
> + return MCE_AR_SEVERITY;
> + return MCE_PANIC_SEVERITY;
> +}
> +
You said "Just handle errors in user mode for now." but ...?
> int mce_severity(struct mce *a, int tolerant, char **msg)
> {
> enum context ctx = error_context(a);
> @@ -129,9 +151,12 @@ int mce_severity(struct mce *a, int tolerant, char **msg)
> if (msg)
> *msg = s->msg;
> s->covered = 1;
> - if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
> - if (panic_on_oops || tolerant < 1)
> + if (ctx == IN_KERNEL) {
> + if (s->sev >= MCE_UC_SEVERITY &&
> + (panic_on_oops || tolerant < 1))
> return MCE_PANIC_SEVERITY;
> + if (s->sev == MCE_AR_SEVERITY)
> + return kernel_ar_recoverable(a, tolerant);
> }
> return s->sev;
> }
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 9c72245..a7a8c53 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -80,6 +80,20 @@ static void mce_do_notify(struct user_return_notifier *urn);
> static DEFINE_PER_CPU(struct mce_notify, mce_notify);
>
> /*
> + * Task return notifiers are used for "action required"
> + * recovery of tasks - i.e. we prevent return to the task
> + * that encountered the machine check, but we ensure that
> + * we process the error in task context.
> + */
> +struct task_notify {
> + struct user_return_notifier urn;
> + unsigned long pfn;
> + atomic_t inuse;
> +};
> +static struct task_notify task_notifier[NR_CPUS];
> +static void mce_do_task(struct user_return_notifier *urn);
> +
> +/*
> * Tolerant levels:
> * 0: always panic on uncorrected errors, log corrected errors
> * 1: panic or SIGBUS on uncorrected errors, log corrected errors
> @@ -975,6 +989,84 @@ static void mce_clear_state(unsigned long *toclear)
> }
> }
>
> +/* Stub when hwpoison is not compiled in */
> +int __attribute__((weak)) __memory_failure(unsigned long pfn, int vector,
> + int precount)
> +{
> + return -1;
> +}
> +
> +/*
> + * Uncorrected error for current process.
> + */
> +static void mce_action_required(struct mce *m, char *msg, struct pt_regs *regs)
> +{
> + int i;
> +
> + if (!mce_usable_address(m))
> + mce_panic("No address for Action-Required Machine Check",
> + m, msg);
> + if (!(m->mcgstatus & MCG_STATUS_EIPV))
> + mce_panic("No EIPV for Action-Required Machine Check",
> + m, msg);
When can this happen?
Why not create a new severity entry {PANIC, "Action Required: but No EIPV", ...}
in the severity table?
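For illustration only (untested, using the initializer layout of the
existing entries and placed ahead of the other AR entries):

    { .mcgmask = MCG_STATUS_EIPV, .mcgres = 0,
      .mask = MCI_UC_SAR, .result = MCI_UC_SAR,
      SEV(PANIC), .msg = "Action required: no EIPV", SER },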
> +
> + for (i = 0; i < NR_CPUS; i++)
> + if (!atomic_cmpxchg(&task_notifier[i].inuse, 0, 1))
> + break;
> + if (i == NR_CPUS)
> + mce_panic("Too many concurrent errors", m, msg);
> +
> + task_notifier[i].urn.on_user_return = mce_do_task;
> + task_notifier[i].pfn = m->addr >> PAGE_SHIFT;
> + task_return_notifier_register(&task_notifier[i].urn);
> +}
> +
> +#undef pr_fmt
> +#define pr_fmt(x) "MCE: %s:%d " x "\n", current->comm, current->pid
> +#define PADDR(x) ((u64)(x) << PAGE_SHIFT)
> +
> +/*
> + * No successfull recovery. Make sure at least that there's
> + * a SIGBUS.
> + */
> +static void ar_fallback(struct task_struct *me, unsigned long pfn)
> +{
> + if (signal_pending(me) && sigismember(&me->pending.signal, SIGBUS))
> + return;
Is it safe for _AR if SIGBUS is pending but blocked?
I think force_sig() is reasonable in such a situation.
> +
> + /*
> + * For some reason hwpoison wasn't able to send a proper
> + * SIGBUS. Send a fallback signal. Unfortunately we don't
> + * know the virtual address here, so can't tell the program
> + * details.
> + */
> + force_sig(SIGBUS, me);
> + pr_err("Killed due to action-required memory corruption");
> +}
> +
> +/*
> + * Handle action-required on the process stack. hwpoison does the
> + * bulk of the work and with some luck might even be able to fix the
> + * problem.
> + *
> + * Logic changes here should be reflected in kernel_ar_recoverable().
> + */
> +static void handle_action_required(unsigned long pfn)
> +{
> + struct task_struct *me = current;
> +
> + pr_err("Uncorrected hardware memory error in user-access at %llx",
> + PADDR(pfn));
> + if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
> + pr_err("Memory error not recovered");
> + ar_fallback(me, pfn);
> + } else
> + pr_err("Memory error recovered");
> +}
> +
> +#undef pr_fmt
> +#define pr_fmt(x) x
> +
> /*
> * The actual machine check handler. This only handles real
> * exceptions when something got corrupted coming in through int 18.
> @@ -1086,12 +1178,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> continue;
> }
>
> - /*
> - * Kill on action required.
> - */
> - if (severity == MCE_AR_SEVERITY)
> - kill_it = 1;
> -
> mce_read_aux(&m, i);
>
> /*
> @@ -1136,6 +1222,15 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> mce_panic("Fatal machine check on current CPU", &m, msg);
>
> /*
> + * Do recovery in current process if needed. This has to be delayed
> + * until we're back on the process stack.
> + */
> + if (worst == MCE_AR_SEVERITY) {
> + mce_action_required(&m, msg, regs);
A more comprehensible name would be appreciated, e.g.:
	mce_request_dpc_for_action_required(pfn);
And if we cannot request a context for recovery, it is better to suppress
subsequent attempts, e.g. before mce_end():
	if (!no_way_out && severity == MCE_AR_SEVERITY) {
		err = mce_request_dpc_for_action_required(pfn);
		if (err) {
			atomic_inc(&global_nwo);
			severity = MCE_PANIC_SEVERITY; /* escalated */
		}
	}
Thanks,
H.Seto
> + kill_it = 0;
> + }
> +
> + /*
> * If the error seems to be unrecoverable, something should be
> * done. Try to kill as little as possible. If we can kill just
> * one task, do that. If the user has set the tolerance very
> @@ -1194,6 +1289,17 @@ static void mce_do_notify(struct user_return_notifier *urn)
> mce_process_ring();
> }
>
> +static void mce_do_task(struct user_return_notifier *urn)
> +{
> + struct task_notify *np = container_of(urn, struct task_notify, urn);
> + unsigned long pfn = np->pfn;
> +
> + task_return_notifier_unregister(urn);
> + atomic_set(&np->inuse, 0);
> +
> + handle_action_required(pfn);
> +}
> +
> static void mce_process_work(struct work_struct *dummy)
> {
> mce_process_ring();
(2011/06/10 6:30), Luck, Tony wrote:
> From: Tony Luck <[email protected]>
>
> monarch clears all of the per cpu "mces_seen", so we must keep a copy
> to use after mce_end()
Could you clarify why we have to use mces_seen after mce_end(), please?
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 3385ea2..ed1542a 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1046,6 +1046,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> }
> }
>
> + /* Save our worst error locally, monarch will clear mces_seen */
> + m = *final;
> +
> if (!no_way_out)
> mce_clear_state(toclear);
>
> @@ -1064,7 +1067,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> * support MCE broadcasting or it has been disabled.
> */
> if (no_way_out && tolerant < 3)
> - mce_panic("Fatal machine check on current CPU", final, msg);
> + mce_panic("Fatal machine check on current CPU", &m, msg);
>
> /*
> * If the error seems to be unrecoverable, something should be
At least the mce_panic() here is called only when there is no monarch,
due to no broadcast or a failed rendezvous. Otherwise, if there is a monarch,
the monarch should check all "mces_seen" before clearing them and call
a single mce_panic() for the system from mce_end() by itself, rather than
having multiple mce_panic() calls from some of the subjects after mce_end().
Thanks,
H.Seto
(2011/06/10 6:33), Luck, Tony wrote:
> From: Andi Kleen <[email protected]>
>
> SER enabled systems report the address granuality for each
> reported address in a machine check. But the bits below
> the granuality are undefined. Mask them out before
> logging the machine check.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 12 +++++++++++-
> 1 files changed, 11 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 0349e87..ffc8d11 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -539,8 +539,18 @@ static void mce_read_aux(struct mce *m, int i)
> {
> if (m->status & MCI_STATUS_MISCV)
> m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> - if (m->status & MCI_STATUS_ADDRV)
> + if (m->status & MCI_STATUS_ADDRV) {
> m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> +
> + /*
> + * Mask the reported address by the reported granuality.
> + */
> + if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
> + u8 shift = m->misc & 0x1f;
> + m->addr >>= shift;
> + m->addr <<= shift;
> + }
> + }
> }
>
> DEFINE_PER_CPU(unsigned, mce_poll_count);
Why do you have to mask it out in the kernel, why not in the user/logger?
One possible story is:
 "... the brand-new Xeon XXXX has a new MCx_***_VALID bit in the ****
  register; if it is set, the lower bits of MCx_ADDR indicate
  ****, otherwise the bits are undefined ..."
So I think the kernel should convey the raw value from the hardware to
userland. Even if it contains some noise, the user can determine
whether it is useful or not. Moreover, since this is an error record,
there will be no second chance to retrieve the data afterward.
Thanks,
H.Seto
(2011/06/10 6:34), Luck, Tony wrote:
> From: Andi Kleen <[email protected]>
>
> When hardware poison handles the current process use
> a forced signal with _AR severity.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> mm/memory-failure.c | 28 ++++++++++++++++------------
> 1 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2b9a5ee..a203113 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -184,8 +184,7 @@ int hwpoison_filter(struct page *p)
> EXPORT_SYMBOL_GPL(hwpoison_filter);
>
> /*
> - * Send all the processes who have the page mapped an ``action optional''
> - * signal.
> + * Send all the processes who have the page mapped a SIGBUS.
> */
> static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
> unsigned long pfn, struct page *page)
It doesn't make sense that the function named "*_ao" sends _AR.
> @@ -194,23 +193,28 @@ static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
> int ret;
>
> printk(KERN_ERR
> - "MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
> - pfn, t->comm, t->pid);
> + "MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
> + pfn, t->comm, t->pid);
> si.si_signo = SIGBUS;
> si.si_errno = 0;
> - si.si_code = BUS_MCEERR_AO;
> si.si_addr = (void *)addr;
> #ifdef __ARCH_SI_TRAPNO
> si.si_trapno = trapno;
> #endif
> si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;
> - /*
> - * Don't use force here, it's convenient if the signal
> - * can be temporarily blocked.
> - * This could cause a loop when the user sets SIGBUS
> - * to SIG_IGN, but hopefully no one will do that?
> - */
> - ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
> + if (t == current) {
> + si.si_code = BUS_MCEERR_AR;
> + ret = force_sig_info(SIGBUS, &si, t);
> + } else {
> + /*
> + * Don't use force here, it's convenient if the signal
> + * can be temporarily blocked.
> + * This could cause a loop when the user sets SIGBUS
> + * to SIG_IGN, but hopefully noone will do that?
> + */
> + si.si_code = BUS_MCEERR_AO;
> + ret = send_sig_info(SIGBUS, &si, t);
> + }
> if (ret < 0)
> printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
> t->comm, t->pid, ret);
I suppose that usually SRAO is handled in a worker thread scheduled after
the MCE, so current is unlikely to be one of the affected threads in that
case... And I also suppose that you'd like this function to be called
from the affected thread before leaving the kernel in the SRAR case...
My concern is that "t == current" is neither a strong nor a clear way
to choose the type of signal. Someone might want to use this function
to inject _AO to current.
It would be better to have a new kill_proc_ar() (separate, or one shared
_common plus a pair of _ar/_ao wrappers), I think; see the sketch below.
I believe that there is no caller that does not know whether it should
request sending _AR or _AO.
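A rough sketch of that split, reusing the body of kill_proc_ao() from
this patch (the names and the exact parameter list are only a
suggestion):

    static int kill_proc_common(struct task_struct *t, unsigned long addr,
                                int trapno, unsigned long pfn,
                                struct page *page, int si_code)
    {
            struct siginfo si;

            printk(KERN_ERR
                   "MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
                   pfn, t->comm, t->pid);
            si.si_signo = SIGBUS;
            si.si_errno = 0;
            si.si_code = si_code;
            si.si_addr = (void *)addr;
    #ifdef __ARCH_SI_TRAPNO
            si.si_trapno = trapno;
    #endif
            si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;

            if (si_code == BUS_MCEERR_AR)
                    return force_sig_info(SIGBUS, &si, t);
            return send_sig_info(SIGBUS, &si, t);
    }

    static int kill_proc_ar(struct task_struct *t, unsigned long addr,
                            int trapno, unsigned long pfn, struct page *page)
    {
            return kill_proc_common(t, addr, trapno, pfn, page, BUS_MCEERR_AR);
    }

    static int kill_proc_ao(struct task_struct *t, unsigned long addr,
                            int trapno, unsigned long pfn, struct page *page)
    {
            return kill_proc_common(t, addr, trapno, pfn, page, BUS_MCEERR_AO);
    }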
Thanks,
H.Seto
(2011/06/10 6:35), Luck, Tony wrote:
> From: Tony Luck <[email protected]>
>
> Ingo wrote:
>> We already have a generic facility to do such things at
>> return-to-userspace: _TIF_USER_RETURN_NOTIFY.
>
> This just a proof of concept patch ... before this can become
> real the user-return-notifier code would have to be made NMI
> safe (currently it uses hlist_add_head/hlist_del, which would
> need to be changed to Ying's NMI-safe single threaded lists).
>
> Reviewed-by: Avi Kivity <[email protected]>
> NOT-Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/cpu/mcheck/mce.c | 47 +++++++++++++++++++++++++++----------
> arch/x86/kernel/signal.c | 6 -----
> 3 files changed, 35 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index cc6c53a..054e127 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -845,6 +845,7 @@ config X86_REROUTE_FOR_BROKEN_BOOT_IRQS
>
> config X86_MCE
> bool "Machine Check / overheating reporting"
> + select USER_RETURN_NOTIFIER
> ---help---
> Machine Check support allows the processor to notify the
> kernel if it detects a problem (e.g. overheating, data corruption).
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index ffc8d11..28d223e 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -38,6 +38,7 @@
> #include <linux/mm.h>
> #include <linux/debugfs.h>
> #include <linux/edac_mce.h>
> +#include <linux/user-return-notifier.h>
>
> #include <asm/processor.h>
> #include <asm/hw_irq.h>
> @@ -69,6 +70,15 @@ atomic_t mce_entry;
>
> DEFINE_PER_CPU(unsigned, mce_exception_count);
>
> +struct mce_notify {
> + struct user_return_notifier urn;
> + bool registered;
> +};
> +
> +static void mce_do_notify(struct user_return_notifier *urn);
> +
> +static DEFINE_PER_CPU(struct mce_notify, mce_notify);
> +
> /*
> * Tolerant levels:
> * 0: always panic on uncorrected errors, log corrected errors
> @@ -947,6 +957,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> int i;
> int worst = 0;
> int severity;
> + struct mce_notify *np;
> /*
> * Establish sequential order between the CPUs entering the machine
> * check handler.
> @@ -1099,7 +1110,12 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> force_sig(SIGBUS, current);
>
> /* notify userspace ASAP */
> - set_thread_flag(TIF_MCE_NOTIFY);
> + np = &__get_cpu_var(mce_notify);
> + if (np->registered == 0) {
> + np->urn.on_user_return = mce_do_notify;
> + user_return_notifier_register(&np->urn);
> + np->registered = 1;
> + }
>
> if (worst > 0)
> mce_report_event(regs);
> @@ -1116,28 +1132,35 @@ void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
> printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
> }
>
> +static void mce_process_ring(void)
> +{
> + unsigned long pfn;
> +
> + mce_notify_irq();
> + while (mce_ring_get(&pfn))
> + memory_failure(pfn, MCE_VECTOR);
> +}
> +
> /*
> * Called after mce notification in process context. This code
> * is allowed to sleep. Call the high level VM handler to process
> * any corrupted pages.
> * Assume that the work queue code only calls this one at a time
> * per CPU.
> - * Note we don't disable preemption, so this code might run on the wrong
> - * CPU. In this case the event is picked up by the scheduled work queue.
> - * This is merely a fast path to expedite processing in some common
> - * cases.
> */
> -void mce_notify_process(void)
> +static void mce_do_notify(struct user_return_notifier *urn)
> {
> - unsigned long pfn;
> - mce_notify_irq();
> - while (mce_ring_get(&pfn))
> - memory_failure(pfn, MCE_VECTOR);
> + struct mce_notify *np = container_of(urn, struct mce_notify, urn);
> +
> + user_return_notifier_unregister(urn);
> + np->registered = 0;
> +
> + mce_process_ring();
> }
Now I'm reconsidering the MCE event notification mechanism.
One thing that makes me nervous is whether it is really required to process
"_AO" memory poisoning (i.e. mce_process_ring()) here in a process
context that was unfortunately interrupted by the MCE (or preempted after that).
I'm uncertain how long walking through the task_list for the unmap will take,
and not sure it is acceptable if the unlucky thread is latency
sensitive...
If we can move mce_process_ring() to worker thread completely, what
we have to do will be:
1) from NMI context, request non-NMI context by irq_work()
2) from (irq) context, wake up loggers and schedule work if required
3) from worker thread, process "_AO" memory poisoning etc.
So now question is why user_return_notifier is needed here.
Is it just an alternative of irq_work() for !LOCAL_APIC ?
Thanks,
H.Seto
>
> static void mce_process_work(struct work_struct *dummy)
> {
> - mce_notify_process();
> + mce_process_ring();
> }
>
> #ifdef CONFIG_X86_MCE_INTEL
> @@ -1218,8 +1241,6 @@ int mce_notify_irq(void)
> /* Not more than two messages every minute */
> static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
>
> - clear_thread_flag(TIF_MCE_NOTIFY);
> -
> if (test_and_clear_bit(0, &mce_need_notify)) {
> wake_up_interruptible(&mce_wait);
>
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index 4fd173c..44efc22 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -838,12 +838,6 @@ static void do_signal(struct pt_regs *regs)
> void
> do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
> {
> -#ifdef CONFIG_X86_MCE
> - /* notify userspace of pending MCEs */
> - if (thread_info_flags & _TIF_MCE_NOTIFY)
> - mce_notify_process();
> -#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
> -
> /* deal with pending signal delivery */
> if (thread_info_flags & _TIF_SIGPENDING)
> do_signal(regs);
(2011/06/10 6:37), Luck, Tony wrote:
> From: Tony Luck <[email protected]>
>
> Instead of letting cpus run through the MC bank scanning code
> in the order that they turned up in the handler, we arrange to
> deal with those that have more severe problems (mcgstatus.ripv=0)
> first. This will make life simpler in the case that banks are
> shared between processors, since the cpu with the problem will
> see it and clear it, leaving the other cpu(s) that share the
> bank with nothing to do.
Well, I agree that reordering is required to handle shared banks.
I think it is better to put this change early in the series.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
(snip)
> +/*
> + * next cpu choosing first from cant_return, and then from can_return
> + */
> +int mce_nextcpu(int this)
> +{
> + int next;
> +
> + if (this == -1 || cpumask_test_cpu(this, &cant_return)) {
> + next = cpumask_next(this, &cant_return);
> + if (next >= nr_cpu_ids)
> + next = cpumask_next(-1, &can_return);
> + return next;
> + }
> +
> + return cpumask_next(this, &can_return);
> +}
I don't like having multiple cpumasks here, notably when one is just the
inversion of the other...
How about using severity-leveling?
Pick cpus with PANIC level first, then AR, AO ...
Or how about checking rip in each mces_seen?
Thanks,
H.Seto
On Thu, Jun 09, 2011 at 05:32:41PM -0400, Luck, Tony wrote:
> From: Andi Kleen <[email protected]>
>
> Used for next patch.
Yeah, let's have a more general comment here like "save us some code
duplication"
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 21 +++++++++++++--------
> 1 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index cbd4b0f..0349e87 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -532,6 +532,17 @@ static void mce_report_event(struct pt_regs *regs)
> #endif
> }
>
> +/*
> + * Read ADDR and MISC registers.
> + */
No need for that comment, IMO. Function is small enough and the macros
speak for themselves :).
> +static void mce_read_aux(struct mce *m, int i)
inline?
> +{
> + if (m->status & MCI_STATUS_MISCV)
> + m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> + if (m->status & MCI_STATUS_ADDRV)
> + m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> +}
> +
> DEFINE_PER_CPU(unsigned, mce_poll_count);
>
> /*
> @@ -582,10 +593,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> (m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
> continue;
>
> - if (m.status & MCI_STATUS_MISCV)
> - m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> - if (m.status & MCI_STATUS_ADDRV)
> - m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> + mce_read_aux(&m, i);
>
> if (!(flags & MCP_TIMESTAMP))
> m.tsc = 0;
> @@ -1027,10 +1035,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> if (severity == MCE_AR_SEVERITY)
> kill_it = 1;
>
> - if (m.status & MCI_STATUS_MISCV)
> - m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> - if (m.status & MCI_STATUS_ADDRV)
> - m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> + mce_read_aux(&m, i);
>
> /*
> * Action optional error. Queue address for later processing.
> --
> 1.7.3.1
>
>
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
On Thu, Jun 09, 2011 at 05:33:42PM -0400, Luck, Tony wrote:
> From: Andi Kleen <[email protected]>
>
> SER enabled systems report the address granuality for each
> reported address in a machine check. But the bits below
> the granuality are undefined. Mask them out before
> logging the machine check.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 12 +++++++++++-
> 1 files changed, 11 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 0349e87..ffc8d11 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -539,8 +539,18 @@ static void mce_read_aux(struct mce *m, int i)
> {
> if (m->status & MCI_STATUS_MISCV)
> m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> - if (m->status & MCI_STATUS_ADDRV)
> + if (m->status & MCI_STATUS_ADDRV) {
> m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> +
> + /*
> + * Mask the reported address by the reported granuality.
> + */
> + if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
> + u8 shift = m->misc & 0x1f;
> + m->addr >>= shift;
> + m->addr <<= shift;
> + }
> + }
> }
>
> DEFINE_PER_CPU(unsigned, mce_poll_count);
s/granuality/granularity/g;
I would very much like to know how you guys got 'granularity' wrong
_three_ times in one small patch?!!
:-)
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
On Fri, Jun 10, 2011 at 04:07:13AM -0400, Hidetoshi Seto wrote:
> (2011/06/10 6:33), Luck, Tony wrote:
> > From: Andi Kleen <[email protected]>
> >
> > SER enabled systems report the address granuality for each
> > reported address in a machine check. But the bits below
> > the granuality are undefined. Mask them out before
> > logging the machine check.
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> > Signed-off-by: Tony Luck <[email protected]>
> > ---
> > arch/x86/kernel/cpu/mcheck/mce.c | 12 +++++++++++-
> > 1 files changed, 11 insertions(+), 1 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> > index 0349e87..ffc8d11 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -539,8 +539,18 @@ static void mce_read_aux(struct mce *m, int i)
> > {
> > if (m->status & MCI_STATUS_MISCV)
> > m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> > - if (m->status & MCI_STATUS_ADDRV)
> > + if (m->status & MCI_STATUS_ADDRV) {
> > m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> > +
> > + /*
> > + * Mask the reported address by the reported granuality.
> > + */
> > + if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
> > + u8 shift = m->misc & 0x1f;
> > + m->addr >>= shift;
> > + m->addr <<= shift;
> > + }
> > + }
> > }
> >
> > DEFINE_PER_CPU(unsigned, mce_poll_count);
>
> Why do you have to mask it out in kernel, why not in user/logger?
>
> One possible story is:
> "... the brand-new Xeon XXXX has new MCx_***_VALID bit in ****
> register, if it is set the lower bits of MCx_ADDR indicates
> ****, otherwise the bits are undefined ..."
>
> So I think that kernel should convey the raw value from hardware to
> userland. Even if it contains some noise on it, user can determine
> whether it is useful or not. And more, since this is an error record,
> there will be no second chance to retrieve the data afterward.
I think they need the correct address for the poisoning later:
if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
mce_ring_add(m.addr >> PAGE_SHIFT);
Tony?
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
2011/6/10 Hidetoshi Seto <[email protected]>:
> Could you clarify why we have to use mces_seen after mce_end(), please?
In part 10 I add:
+ if (worst == MCE_AR_SEVERITY) {
+ mce_action_required(&m, msg, regs);
+ kill_it = 0;
+ }
Which happens after mce_end() ... and wants to access the struct mce that
we filled out for this processor.
But it is possible that we don't need it - your comments to later parts
of this patch series point out that better severity table entries could
cover the checks for ADDRV and MISCV being set, and make the
error immediately fatal, rather than leaving it until this late stage to
do so.
I think we only actually need to save the address from the mce structure.
-Tony
On Fri, Jun 10, 2011 at 2:33 AM, Borislav Petkov <[email protected]> wrote:
>> Used for next patch.
>
> Yeah, let's have a more general comment here like "save us some code
> duplication"
Sure.
>> +/*
>> + * Read ADDR and MISC registers.
>> + */
>
> No need for that comment, IMO. Function is small enough and the macros
> speak for themselves :).
Ok.
>> +static void mce_read_aux(struct mce *m, int i)
>
> inline?
Compiler will possibly do this anyway - I hate to second guess it. I'm not a
big fan of inline for code that isn't in the critical performance path.
-Tony
2011/6/10 Hidetoshi Seto <[email protected]>:
> Why do you have to mask it out in kernel, why not in user/logger?
>
> One possible story is:
> ?"... the brand-new Xeon XXXX has new MCx_***_VALID bit in ****
> ?register, if it is set the lower bits of MCx_ADDR indicates
> ?****, otherwise the bits are undefined ..."
>
> So I think that kernel should convey the raw value from hardware to
> userland. ?Even if it contains some noise on it, user can determine
> whether it is useful or not. ?And more, since this is an error record,
> there will be no second chance to retrieve the data afterward.
This is a good point - we should log the original untouched bits someplace.
But we also want to avoid having to perform this same calculation over
and over (and risk having some place miss doing it right). So perhaps
we need to leave m->addr with the exact bits that were pulled from
MCi_ADDR - but add an <address,size> or <address,end> tuple that
has the post-calculation numbers for use in the kernel recovery code.
-Tony
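For illustration, that derived tuple could be computed once, right next to the
raw read (mce_set_addr_range and the addr_size/addr_aligned fields are
hypothetical additions to struct mce, not part of the posted series; the LSB
extraction follows the masking patch above):

	static void mce_set_addr_range(struct mce *m)
	{
		if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
			u8 lsb = m->misc & 0x1f;	/* recoverable-address LSB */

			m->addr_size    = 1ULL << lsb;
			m->addr_aligned = m->addr & ~(m->addr_size - 1);
		} else {
			/* no granularity information: treat the raw address as exact */
			m->addr_size    = 1;
			m->addr_aligned = m->addr;
		}
	}

mce_read_aux() could call this after reading MCi_ADDR, leaving m->addr itself
untouched for logging.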
On Fri, Jun 10, 2011 at 2:42 AM, Borislav Petkov <[email protected]> wrote:
> s/granuality/granularity/g;
>
> I would very much like to know how you guys got 'granularity' wrong
> _three_ times in one small patch?!!
Speshial skiil, tallent, and reguler speelling praktice sesssions :-)
[At least we spelled it wrong the same way each time]
Will fix.
-Tony
2011/6/10 Hidetoshi Seto <[email protected]>:
> It doesn't make sense that the function named "*_ao" sends _AR.
Oops. Good point.
> I suppose that usually SRAO is handled in a worker thread scheduled after
> the MCE, so current is unlikely to be one of the affected threads in that case...
> And I also suppose that you'd like this function to be called
> from the affected thread before leaving the kernel in the SRAR case...
>
> My concern is that "t == current" is neither a strong nor a clear statement
> on which to switch the type of signal. Someone might want to use this
> function to inject _AO to current.
>
> It is better to have a new kill_proc_ar() (separate, or one shared _common
> plus a pair of _ar/_ao wrappers), I think. I believe there is no caller
> that does not know whether it should request sending _AR or _AO.
Agreed - each call chain should know whether it is AR or AO. I think
there is plenty of room to make this code cleaner and clearer.
In the AR case we will always be in the current task context - and we
know that we cannot let the process ignore the signal.
In the AO case we will usually be in some other task context (except
by random chance) - and we shouldn't mind if the signal is currently
blocked, or ignored.
-Tony
2011/6/10 Hidetoshi Seto <[email protected]>:
> Now I'm reconsidering the MCE event notification mechanism.
> One thing that makes me nervous is whether it is really required to process
> "_AO" memory poisoning (i.e. mce_process_ring()) here in a process
> context that was unfortunately interrupted by the MCE (or preempted after that).
> I'm uncertain how long walking through the task_list for the unmap will take,
> and not sure it is acceptable if the unlucky thread is latency
> sensitive...
>
> If we can move mce_process_ring() to worker thread completely, what
> we have to do will be:
>  1) from NMI context, request non-NMI context by irq_work()
>  2) from (irq) context, wake up loggers and schedule work if required
>  3) from worker thread, process "_AO" memory poisoning etc.
>
> So now question is why user_return_notifier is needed here.
> Is it just an alternative of irq_work() for !LOCAL_APIC ?
I switched this notification over from old TIF_MCE_NOTIFY to
using user_return_notifier to preserve existing semantics, and
free up that TIF bit for the task notifier for the action required case.
You are right that we should have some better method - the
shot-gun approach of hitting every task on every cpu in the hope
that one of them will run our code soon sounds like overkill.
Using some NMI-safe method to wake a single worker thread
sounds a good idea for a subsequent clean up.
-Tony
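A rough sketch of that direction, using the NMI-safe irq_work API to kick one
piece of deferred work instead of flagging every task (mce_irq_cb,
mce_deferred_work and mce_kick_worker are illustrative names;
mce_process_ring() is the helper introduced in part 7):

	#include <linux/irq_work.h>
	#include <linux/workqueue.h>
	#include <linux/percpu.h>

	static void mce_deferred_fn(struct work_struct *w)
	{
		mce_process_ring();	/* may sleep: unmap/poison queued pfns */
	}
	static DECLARE_WORK(mce_deferred_work, mce_deferred_fn);

	static void mce_irq_cb(struct irq_work *w)
	{
		/* irq context: just punt to a context that may sleep */
		schedule_work(&mce_deferred_work);
	}
	static DEFINE_PER_CPU(struct irq_work, mce_irq_cb_work) = {
		.func = mce_irq_cb,
	};

	/* called from do_machine_check(), NMI-like context */
	static void mce_kick_worker(void)
	{
		irq_work_queue(&__get_cpu_var(mce_irq_cb_work));
	}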
2011/6/10 Hidetoshi Seto <[email protected]>:
> Well, I agree about the point that reordering is required to handle
> shared banks.
Thanks (and thanks for taking the time to look at all of this series
to provide such good feedback)
> I think it is better to put this change in early of the series.
Several of the early parts are position independent w.r.t. each
other - I'm not sure if there is some strong benefit to having
them in any particular order.
> I don't like to have multiple cpumasks here, notably one is just an
> inversion of another...
I wasn't completely thrilled with the way this turned out either.
> How about using severity-leveling?
> Pick cpus with PANIC level first, then AR, AO ...
That sounds like I need even more masks (or lists).
> Or how about checking rip in each mces_seen?
This is equivalent to what I did - but I think the code
will be cleaner. I'll give it a try.
-Tony
2011/6/10 Hidetoshi Seto <[email protected]>:
>> +static int kernel_ar_recoverable(struct mce *m, int tolerant)
>> +{
>> +	if (tolerant >= 2)
>> +		return MCE_AR_SEVERITY;
>> +	if (!(m->mcgstatus & MCG_STATUS_EIPV) || !m->ip)
>> +		return MCE_PANIC_SEVERITY;
>> +	if (search_exception_tables(m->ip))
>> +		return MCE_AR_SEVERITY;
>> +	return MCE_PANIC_SEVERITY;
>> +}
>> +
>
> You said "Just handle errors in user mode for now." but ...?
Oops. I dropped the kernel handling bits from mce.c, but forgot to
clean them out of mce-severity.c.
Will fix.
>> +	if (!mce_usable_address(m))
>> +		mce_panic("No address for Action-Required Machine Check",
>> +			  m, msg);
>> +	if (!(m->mcgstatus & MCG_STATUS_EIPV))
>> +		mce_panic("No EIPV for Action-Required Machine Check",
>> +			  m, msg);
>
> When can this happen?
Probably never (or only if there was a chip bug). AR errors should
provide addresses (otherwise there is no way to perform any useful
"action").
> Why not create new severity {PANIC, "Action Required: but No EIPV", ...}
> in severity table?
Very good idea.
>> +static void ar_fallback(struct task_struct *me, unsigned long pfn)
>> +{
>> +	if (signal_pending(me) && sigismember(&me->pending.signal, SIGBUS))
>> +		return;
>
> Is it safe for _AR if SIGBUS is pending but blocked?
> I think force_sig() is reasonable for such situation.
Agreed - it is not safe if the signal is blocked. force_sig() sounds better.
>> +	if (worst == MCE_AR_SEVERITY) {
>> +		mce_action_required(&m, msg, regs);
>
> A comprehensible name would be appreciated, e.g.:
>		mce_request_dpc_for_action_required(pfn);
I'm always in the mood for better names ... but I'm not sure what
the "dpc" in this name is trying to tell me.
> And if we cannot request context for recovery, it is better to suppress
> trailing attempts, e.g. before mce_end():
>
>	if (!no_way_out && severity == MCE_AR_SEVERITY) {
>		err = mce_request_dpc_for_action_required(pfn);
>		if (err) {
>			atomic_inc(&global_nwo);
>			severity = MCE_PANIC_SEVERITY;	/* escalated */
>		}
>	}
Agreed - if we can't perform the recovery action, we should escalate
the severity.
-Tony
Tony Luck <[email protected]> writes:
> 2011/6/10 Hidetoshi Seto <[email protected]>:
>> Why do you have to mask it out in kernel, why not in user/logger?
>>
>> One possible story is:
>> Â "... the brand-new Xeon XXXX has new MCx_***_VALID bit in ****
>> Â register, if it is set the lower bits of MCx_ADDR indicates
>> Â ****, otherwise the bits are undefined ..."
>>
>> So I think that kernel should convey the raw value from hardware to
>> userland. Â Even if it contains some noise on it, user can determine
>> whether it is useful or not. Â And more, since this is an error record,
>> there will be no second chance to retrieve the data afterward.
>
> This is a good point - we should log the original untouched bits someplace.
These bits are undefined. It doesn't make sense to log undefined bits.
They can contain random values. Showing random values just confuses
people.
How would the user even know it's undefined? They normally don't read
the specs. Really programs have to handle that for the user. And there
the only sane way is to just mask them off.
-Andi
--
[email protected] -- Speaking for myself only
Just a couple of notes:
On Fri, Jun 10, 2011 at 04:42:33PM -0400, Tony Luck wrote:
> 2011/6/10 Hidetoshi Seto <[email protected]>:
> > Now I'm reconsidering the MCE event notification mechanism.
> > One thing that makes me nervous is whether it is really required to process
> > "_AO" memory poisoning (i.e. mce_process_ring()) here in a process
> > context that was unfortunately interrupted by the MCE (or preempted after that).
> > I'm uncertain how long walking through the task_list for the unmap will take,
> > and not sure it is acceptable if the unlucky thread is latency
> > sensitive...
> >
> > If we can move mce_process_ring() to worker thread completely, what
> > we have to do will be:
> >  1) from NMI context, request non-NMI context by irq_work()
> >  2) from (irq) context, wake up loggers and schedule work if required
> >  3) from worker thread, process "_AO" memory poisoning etc.
> >
> > So now question is why user_return_notifier is needed here.
> > Is it just an alternative of irq_work() for !LOCAL_APIC ?
I think that an x86 CPU with hw poisoning feature support makes
LOCAL_APIC's presence almost axiomatic ...
> I switched this notification over from old TIF_MCE_NOTIFY to
> using user_return_notifier to preserve existing semantics, and
> free up that TIF bit for the task notifier for the action required case.
>
> You are right that we should have some better method - the
> shot-gun approach of hitting every task on every cpu in the hope
> that one of them will run our code soon sounds like overkill.
> Using some NMI-safe method to wake a single worker thread
> sounds a good idea for a subsequent clean up.
and we should go ahead and do it correctly from the get-go, methinks:
for the AR case, we definitely need to go to userspace _but_ the
worker thread that does the recovery for us has to be the _next_ thing
scheduled seamlessly following the process that was running when the MCE
happened.
If schedule_work() cannot guarantee us that then we need to do something
else like yield_to() to the worker thread, thus not penalizing the
unlucky thread. Maybe PeterZ, Ingo could comment on whether that is an ok
thing to do. Bottom line is, we absolutely want to execute the AR handling
thread _before_ we return to normal process scheduling and thus run the
risk of executing the thread that caused the AR MCE.
Btw, Tony, you could do without atomic NMI lists if we change the
user return notifier a bit:
diff --git a/include/linux/user-return-notifier.h b/include/linux/user-return-notifier.h
index 9c4a445..eb3c977 100644
--- a/include/linux/user-return-notifier.h
+++ b/include/linux/user-return-notifier.h
@@ -26,6 +26,11 @@ static inline void propagate_user_return_notify(struct task_struct *prev,
void fire_user_return_notifiers(void);
+static inline void set_user_return_notifier(struct task_struct *p)
+{
+ set_tsk_thread_flag(p, TIF_USER_RETURN_NOTIFY);
+}
+
static inline void clear_user_return_notifier(struct task_struct *p)
{
clear_tsk_thread_flag(p, TIF_USER_RETURN_NOTIFY);
diff --git a/kernel/user-return-notifier.c b/kernel/user-return-notifier.c
index 92cb706..c54c5d2 100644
--- a/kernel/user-return-notifier.c
+++ b/kernel/user-return-notifier.c
@@ -1,4 +1,3 @@
-
#include <linux/user-return-notifier.h>
#include <linux/percpu.h>
#include <linux/sched.h>
@@ -13,12 +12,23 @@ static DEFINE_PER_CPU(struct hlist_head, return_notifier_list);
*/
void user_return_notifier_register(struct user_return_notifier *urn)
{
- set_tsk_thread_flag(current, TIF_USER_RETURN_NOTIFY);
hlist_add_head(&urn->link, &__get_cpu_var(return_notifier_list));
}
EXPORT_SYMBOL_GPL(user_return_notifier_register);
/*
+ * Request a constant notification when the current cpu returns to userspace.
+ * Must be called in atomic context. The notifier will also be called in atomic
+ * context.
+ */
+void user_return_notifier_register_and_enable(struct user_return_notifier *urn)
+{
+ set_tsk_thread_flag(current, TIF_USER_RETURN_NOTIFY);
+ hlist_add_head(&urn->link, &__get_cpu_var(return_notifier_list));
+}
+EXPORT_SYMBOL_GPL(user_return_notifier_register_and_enable);
+
+/*
* Removes a registered user return notifier. Must be called from atomic
* context, and from the same cpu registration occurred in.
*/
--
and then in the #MC handler you do set_user_return_notifier(),
having registered it during MCE core init. memory_failure() does
clear_user_return_notifier(current) after it finishes. The _enable
version is what others like KVM would still use. Or something to that
effect.
Hmm...
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
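A sketch of the flow Boris describes, on top of the proposed split above
(mce_urn and the init/arm helpers are illustrative names; mce_do_notify() is
the callback from the proof-of-concept patch, and since return_notifier_list
is per-cpu the registration has to run once on each cpu):

	#include <linux/user-return-notifier.h>
	#include <linux/percpu.h>

	static void mce_do_notify(struct user_return_notifier *urn);	/* from patch 7 */

	static DEFINE_PER_CPU(struct user_return_notifier, mce_urn) = {
		.on_user_return = mce_do_notify,
	};

	/* per-cpu MCE init: put the notifier on the list, but set no TIF bit */
	static void mce_urn_init(void)
	{
		user_return_notifier_register(&__get_cpu_var(mce_urn));
	}

	/* do_machine_check(), NMI context: only flip the TIF bit on current */
	static void mce_urn_arm(void)
	{
		set_user_return_notifier(current);
	}

memory_failure() (or the notifier itself) would then call
clear_user_return_notifier(current) once the work is done, as Boris suggests.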
On 06/10/2011 12:35 AM, Luck, Tony wrote:
> From: Tony Luck<[email protected]>
>
> Ingo wrote:
> > We already have a generic facility to do such things at
> > return-to-userspace: _TIF_USER_RETURN_NOTIFY.
>
> This is just a proof of concept patch ... before this can become
> real the user-return-notifier code would have to be made NMI
> safe (currently it uses hlist_add_head/hlist_del, which would
> need to be changed to Ying's NMI-safe single threaded lists).
You could use irq_work_queue() to push this into an irq context, which
is user-return-notifier safe.
--
error compiling committee.c: too many arguments to function
On 06/10/2011 11:42 PM, Tony Luck wrote:
> You are right that we should have some better method - the
> shot-gun approach of hitting every task on every cpu in the hope
> that one of them will run our code soon sounds like overkill.
> Using some NMI-safe method to wake a single worker thread
> sounds a good idea for a subsequent clean up.
>
That's irq_work_queue().
--
error compiling committee.c: too many arguments to function
On Sun, Jun 12, 2011 at 04:29:41AM -0400, Avi Kivity wrote:
> On 06/10/2011 12:35 AM, Luck, Tony wrote:
> > From: Tony Luck<[email protected]>
> >
> > Ingo wrote:
> > > We already have a generic facility to do such things at
> > > return-to-userspace: _TIF_USER_RETURN_NOTIFY.
> >
> > This is just a proof of concept patch ... before this can become
> > real the user-return-notifier code would have to be made NMI
> > safe (currently it uses hlist_add_head/hlist_del, which would
> > need to be changed to Ying's NMI-safe single threaded lists).
>
> You could use irq_work_queue() to push this into an irq context, which
> is user-return-notifier safe.
Maybe I'm missing something but it looks like irq_work_queue() queues
work which is run in irq_work_run() with IRQs disabled. However, user
return notifiers are run after IRQs get enabled in entry_64.S. And we
want to run memory_failure() with IRQs enabled.
More importantly, we want to be able to do the following:
* run #MC handler which queues work
* when returning to userspace, preempt and schedule that previously
queued work _before_ the process that caused the MCE gets to execute.
Imagine this scenario:
Your userspace process causes a data cache read error due to either
alpha particles or maybe because the DRAM device containing the process
page is faulty and generates ECC errors which the ECC code cannot
correct, i.e. an uncorrectable error we definitely want to handle; IOW
Action Required MCE.
Now, if you get lucky and this page is mapped only by the process that
caused the MCE, you could unmap it, mark it PageReserved and cause the
process to refault. But in order to do that, you want to execute the
memory_failure() handler _before_ you schedule the process again.
In the instruction cache read error case, you don't have processor
context to return to (or you're being too conservative and don't want to
risk it) so you kill the process, which is pretty easy to do.
Does that make a bit more sense? Tony?
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
On 06/12/2011 01:24 PM, Borislav Petkov wrote:
> On Sun, Jun 12, 2011 at 04:29:41AM -0400, Avi Kivity wrote:
> > On 06/10/2011 12:35 AM, Luck, Tony wrote:
> > > From: Tony Luck<[email protected]>
> > >
> > > Ingo wrote:
> > > > We already have a generic facility to do such things at
> > > > return-to-userspace: _TIF_USER_RETURN_NOTIFY.
> > >
> > > This is just a proof of concept patch ... before this can become
> > > real the user-return-notifier code would have to be made NMI
> > > safe (currently it uses hlist_add_head/hlist_del, which would
> > > need to be changed to Ying's NMI-safe single threaded lists).
> >
> > You could use irq_work_queue() to push this into an irq context, which
> > is user-return-notifier safe.
>
> Maybe I'm missing something but it looks like irq_work_queue() queues
> work which is run in irq_work_run() with IRQs disabled. However, user
> return notifiers are run after IRQs get enabled in entry_64.S. And we
> want to run memory_failure() with IRQs enabled.
>
> More importantly, we want to be able to do the following:
>
> * run #MC handler which queues work
>
> * when returning to userspace, preempt and schedule that previously
> queued work _before_ the process that caused the MCE gets to execute.
Yes.
> Imagine this scenario:
>
> Your userspace process causes a data cache read error due to either
> alpha particles or maybe because the DRAM device containing the process
> page is faulty and generates ECC errors which the ECC code cannot
> correct, i.e. an uncorrectable error we definitely want to handle; IOW
> Action Required MCE.
>
> Now, if you get lucky and this page is mapped only by the process that
> caused the MCE, you could unmap it, mark it PageReserved and cause the
> process to refault. But in order to do that, you want to execute the
> memory_failure() handler _before_ you schedule the process again.
>
> In the instruction cache read error case, you don't have processor
> context to return to (or you're being too conservative and don't want to
> risk it) so you kill the process, which is pretty easy to do.
>
> Does that make a bit more sense? Tony?
>
You're missing the flow. The MCE handler calls irq_work_queue(), which
schedules a user return notifier, which does any needed processing in
task context.
--
error compiling committee.c: too many arguments to function
On Sun, Jun 12, 2011 at 06:30:07AM -0400, Avi Kivity wrote:
> You're missing the flow. The MCE handler calls irq_work_queue(), which
> schedules a user return notifier, which does any needed processing in
> task context.
Ah ok, this will obviate the need for atomic lists in the user return
notifier. This could work, yep.
Thanks for clarifying.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
On Thu, Jun 09, 2011 at 05:36:42PM -0400, Luck, Tony wrote:
> From: Tony Luck <[email protected]>
>
> Existing user return notifier mechanism is designed to catch a specific
> cpu just as it returns to run any task in user mode. We also need a
> mechanism to catch a specific task.
Why do we need that? I mean, in the remaining patches we end up either
running memory_failure() or sending signals to a task. Can't we do it
all in the user return notifier and not have a different notifier for
each policy?
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
On Sun, Jun 12, 2011 at 3:38 PM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 09, 2011 at 05:36:42PM -0400, Luck, Tony wrote:
>> From: Tony Luck <[email protected]>
>>
>> Existing user return notifier mechanism is designed to catch a specific
>> cpu just as it returns to run any task in user mode. We also need a
>> mechanism to catch a specific task.
>
> Why do we need that? I mean, in the remaining patches we end up either
> running memory_failure() or sending signals to a task. Can't we do it
> all in the user return notifier and not have a different notifier for
> each policy?
Unless I'm mis-reading the user-return-notifier code, it is possible that
we'll context switch before we get to the notifier. At that point the
user-return-notifier TIF bit is passed on from our task to the newly
run-able task. But our task is still viable, so another cpu could grab
it and start running it ... then we have a race ... will the new task
that inherited the notifier unmap the page fast enough, or will there
be a loud BANG as the original task runs right into the machine
check again.
-Tony
On 06/13/2011 08:31 AM, Tony Luck wrote:
> On Sun, Jun 12, 2011 at 3:38 PM, Borislav Petkov<[email protected]> wrote:
> > On Thu, Jun 09, 2011 at 05:36:42PM -0400, Luck, Tony wrote:
> >> From: Tony Luck<[email protected]>
> >>
> >> Existing user return notifier mechanism is designed to catch a specific
> >> cpu just as it returns to run any task in user mode. We also need a
> >> mechanism to catch a specific task.
> >
> > Why do we need that? I mean, in the remaining patches we end up either
> > running memory_failure() or sending signals to a task. Can't we do it
> > all in the user return notifier and not have a different notifier for
> > each policy?
>
> Unless I'm mis-reading the user-return-notifier code, it is possible that
> we'll context switch before we get to the notifier. At that point the
> user-return-notifier TIF bit is passed on from our task to the newly
> run-able task. But our task is still viable, so another cpu could grab
> it and start running it ... then we have a race ... will the new task
> that inherited the notifier unmap the page fast enough, or will there
> be a loud BANG as the original task runs right into the machine
> check again.
Right. user-return-notifiers are really a per-cpu notifier, unrelated
to any specific task. The use of per-task flags was an optimization.
If running into the MCE again is really bad, then you need something
more, since other threads (or other processes) could run into the same
page as well. If not, do we care? Let it hit the MCE again, as long as
we'll catch it eventually.
--
error compiling committee.c: too many arguments to function
On Mon, Jun 13, 2011 at 03:59:38AM -0400, Avi Kivity wrote:
> On 06/13/2011 08:31 AM, Tony Luck wrote:
> > On Sun, Jun 12, 2011 at 3:38 PM, Borislav Petkov<[email protected]> wrote:
> > > On Thu, Jun 09, 2011 at 05:36:42PM -0400, Luck, Tony wrote:
> > >> From: Tony Luck<[email protected]>
> > >>
> > >> Existing user return notifier mechanism is designed to catch a specific
> > >> cpu just as it returns to run any task in user mode. We also need a
> > >> mechanism to catch a specific task.
> > >
> > > Why do we need that? I mean, in the remaining patches we end up either
> > > running memory_failure() or sending signals to a task. Can't we do it
> > > all in the user return notifier and not have a different notifier for
> > > each policy?
> >
> > Unless I'm mis-reading the user-return-notifier code, it is possible that
> > we'll context switch before we get to the notifier. At that point the
> > user-return-notifier TIF bit is passed on from our task to the newly
> > run-able task. But our task is still viable, so another cpu could grab
> > it and start running it ... then we have a race ... will the new task
> > that inherited the notifier unmap the page fast enough, or will there
> > be a loud BANG as the original task runs right into the machine
> > check again.
>
> Right. user-return-notifiers are really a per-cpu notifier, unrelated
> to any specific task. The use of per-task flags was an optimization.
>
> If running into the MCE again is really bad, then you need something
> more, since other threads (or other processes) could run into the same
> page as well.
Well, the #MC handler runs on all CPUs on Intel so what we could do is
set the current task to TASK_STOPPED or _UNINTERRUPTIBLE or something
that doesn't make it viable for scheduling anymore.
Then we can take our time running the notifier since the "problematic"
task won't get scheduled until we're done. Then, when we finish
analyzing the MCE, we either kill it so it has to handle SIGKILL the
next time it gets scheduled or we unmap its page with error in it so
that it #PFs on the next run.
But no, I don't think we can catch all possible situations where a page
is mapped by multiple tasks ...
> If not, do we care? Let it hit the MCE again, as long as
> we'll catch it eventually.
... and in that case we are going to have to let it hit again. Or is
there a way to get to the tasklist of all the tasks mapping a page in
atomic context, stop them from scheduling and run the notifier work in
process context?
Hmmm..
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
On 06/13/2011 12:55 PM, Borislav Petkov wrote:
> >
> > If running into the MCE again is really bad, then you need something
> > more, since other threads (or other processes) could run into the same
> > page as well.
>
> Well, the #MC handler runs on all CPUs on Intel so what we could do is
> set the current task to TASK_STOPPED or _UNINTERRUPTIBLE or something
> that doesn't make it viable for scheduling anymore.
>
> Then we can take our time running the notifier since the "problematic"
> task won't get scheduled until we're done. Then, when we finish
> analyzing the MCE, we either kill it so it has to handle SIGKILL the
> next time it gets scheduled or we unmap its page with error in it so
> that it #PFs on the next run.
If all cpus catch it, do we even know which task it is?
On the other hand, it makes user return notifiers attractive, since they
are per-cpu, and combined with MCE broadcast that turns them into a
global event.
> But no, I don't think we can catch all possible situations where a page
> is mapped by multiple tasks ...
>
> > If not, do we care? Let it hit the MCE again, as long as
> > we'll catch it eventually.
>
> ... and in that case we are going to have to let it hit again. Or is
> there a way to get to the tasklist of all the tasks mapping a page in
> atomic context, stop them from scheduling and run the notifier work in
> process context?
>
> Hmmm..
Surely not in atomic context, but you can use rmap to find all mappers
of a given page.
So: MCE uses irq_work_queue() -> wake up a realtime task -> process the
mce, unmap the page, go back to sleep.
--
error compiling committee.c: too many arguments to function
On Mon, Jun 13, 2011 at 07:40:25AM -0400, Avi Kivity wrote:
> On 06/13/2011 12:55 PM, Borislav Petkov wrote:
> > >
> > > If running into the MCE again is really bad, then you need something
> > > more, since other threads (or other processes) could run into the same
> > > page as well.
> >
> > Well, the #MC handler runs on all CPUs on Intel so what we could do is
> > set the current task to TASK_STOPPED or _UNINTERRUPTIBLE or something
> > that doesn't make it viable for scheduling anymore.
> >
> > Then we can take our time running the notifier since the "problematic"
> > task won't get scheduled until we're done. Then, when we finish
> > analyzing the MCE, we either kill it so it has to handle SIGKILL the
> > next time it gets scheduled or we unmap its page with error in it so
> > that it #PFs on the next run.
>
> If all cpus catch it, do we even know which task it is?
Well, in the ActionRequired case, the error is obviously reported
through a #MC exception meaning that the core definitely generates the
MCE before we've made a context switch (CR3 change etc) so in that case
'current' will point to the task at fault.
The problem is finding which 'current' it is, from all the tasks running
on all cores when the #MC is raised. Tony, can you tell from the hw
which core actually caused the MCE? Is it the monarch, so to speak?
> On the other hand, it makes user return notifiers attractive, since
> they are per-cpu, and combined with MCE broadcast that turns them into
> a global event.
Yes, I think that having a global event makes error handling much safer.
It also gives you the proper amount of conservative optimism that you've
actually handled the error properly before returning to userspace.
I'm thinking that in cases where we have a page shared by multiple
processes, we still would want to run a 'main' user return notifier on
one core which does the rmap lookup _but_, _also_ very importantly, the
other cores still hold off from executing userspace until that main
notifier has finished finding out how big the fallout is, i.e. how
many other processes would run into the same page.
> > But no, I don't think we can catch all possible situations where a page
> > is mapped by multiple tasks ...
> >
> > > If not, do we care? Let it hit the MCE again, as long as
> > > we'll catch it eventually.
> >
> > ... and in that case we are going to have to let it hit again. Or is
> > there a way to get to the tasklist of all the tasks mapping a page in
> > atomic context, stop them from scheduling and run the notifier work in
> > process context?
> >
> > Hmmm..
>
> Surely not in atomic context, but you can use rmap to find all mappers
> of a given page.
>
> So: MCE uses irq_work_queue() -> wake up a realtime task -> process the
> mce, unmap the page, go back to sleep.
Yes, this is basically it. However, the other cores cannot schedule a
task which maps the compromised page until we have finished finding
and 'fixing' all the mappers.
So we either hold off the cores from executing userspace - in that
case no need to mark a task as unsuitable to run - or use the task
return notifiers in patch 10/10.
HOWEVER, AFAICT, if the page is mapped multiple times,
killing/recovering the current task doesn't stop another core from
touching it and causing a follow-up MCE. So holding off all the cores
from scheduling userspace in some manner might be the superior solution.
Especially if you don't execute the #MC handler on all CPUs as is the
case on AMD.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
On 06/13/2011 03:40 PM, Borislav Petkov wrote:
<snippage>
> >
> > So: MCE uses irq_work_queue() -> wake up a realtime task -> process the
> > mce, unmap the page, go back to sleep.
>
> Yes, this is basically it. However, the other cores cannot schedule a
> task which maps the compromised page until we have finished finding
> and 'fixing' all the mappers.
>
> So we either hold off the cores from executing userspace - in that
> case no need to mark a task as unsuitable to run - or use the task
> return notifiers in patch 10/10.
>
> HOWEVER, AFAICT, if the page is mapped multiple times,
> > killing/recovering the current task doesn't stop another core from
> > touching it and causing a follow-up MCE. So holding off all the cores
> from scheduling userspace in some manner might be the superior solution.
> Especially if you don't execute the #MC handler on all CPUs as is the
> case on AMD.
>
That's basically impossible, since the other cores may be in fact
executing userspace, with the next instruction accessing the bad page.
In fact the access may have been started simultaneously with the one
that triggered the #MC.
The best you can do is IPI everyone as soon as you've caught the #MC,
but you have to be prepared for multiple #MC for the same page. Once
you have that, global synchronization is not so important anymore.
--
error compiling committee.c: too many arguments to function
On Mon, Jun 13, 2011 at 08:47:05AM -0400, Avi Kivity wrote:
> > HOWEVER, AFAICT, if the page is mapped multiple times,
> > killing/recovering the current task doesn't stop another core from
> > touching it and causing a follow-up MCE. So holding off all the cores
> > from scheduling userspace in some manner might be the superior solution.
> > Especially if you don't execute the #MC handler on all CPUs as is the
> > case on AMD.
> >
>
> That's basically impossible, since the other cores may be in fact
> executing userspace, with the next instruction accessing the bad page.
> In fact the access may have been started simultaneously with the one
> that triggered the #MC.
True.
> The best you can do is IPI everyone as soon as you've caught the #MC,
> but you have to be prepared for multiple #MC for the same page. Once
> you have that, global synchronization is not so important anymore.
Yeah, in the multiple #MC case the memory_failure() thing should
probably be made reentrant-safe (if it is not yet). And in that case,
we'll be starting a worker thread on each CPU that caused an MCE from
accessing that page. The thread that manages to clear all the mappings
of our page simply does so while the others should be able to 'see' that
there's no work to be done anymore (PFN is not mapped in the pagetables
anymore) and exit without doing anything. Yeah, sounds doable with the
irq_work_queue -> user_return_notifier flow.
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
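On the reentrancy point: memory_failure() already starts with a claim-the-page
test, roughly the following (paraphrased from mm/memory-failure.c; the wrapper
name hwpoison_claim_page is illustrative), which gives exactly the "see that
there is no work left" behaviour described above:

	/* Whoever wins TestSetPageHWPoison() owns the cleanup; later callers
	 * for the same pfn see the flag already set and back off. */
	static int hwpoison_claim_page(unsigned long pfn)
	{
		struct page *p = pfn_to_page(pfn);

		if (TestSetPageHWPoison(p)) {
			printk(KERN_ERR "MCE %#lx: already hardware poisoned\n", pfn);
			return -EBUSY;
		}
		return 0;
	}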
On 06/13/2011 06:12 PM, Borislav Petkov wrote:
> > The best you can do is IPI everyone as soon as you've caught the #MC,
> > but you have to be prepared for multiple #MC for the same page. Once
> > you have that, global synchronization is not so important anymore.
>
> Yeah, in the multiple #MC case the memory_failure() thing should
> probably be made reentrant-safe (if it is not yet). And in that case,
> we'll be starting a worker thread on each CPU that caused an MCE from
> accessing that page. The thread that manages to clear all the mappings
> of our page simply does so while the others should be able to 'see' that
> there's no work to be done anymore (PFN is not mapped in the pagetables
> anymore) and exit without doing anything. Yeah, sounds doable with the
> irq_work_queue -> user_return_notifier flow.
I don't think a user_return_notifier is needed here. You don't just
want to do things before a userspace return, you also want to do them
soon. A user return notifier might take a very long time to run, if a
context switch occurs to a thread that spends a lot of time in the
kernel (perhaps a realtime thread).
So I think the best choice here is MCE -> irq_work -> realtime kernel
thread (or work queue)
--
error compiling committee.c: too many arguments to function
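A sketch of the worker-thread half of that "MCE -> irq_work -> realtime kernel
thread" flow (the thread, its name and its priority handling are illustrative;
mce_process_ring()/mce_ring_empty() are the existing helpers, and the irq_work
callback would simply wake_up_process() this task):

	#include <linux/kthread.h>
	#include <linux/irq_work.h>
	#include <linux/sched.h>

	static struct task_struct *mce_recover_task;	/* kthread_run() at MCE init */

	static int mce_recover_thread(void *unused)
	{
		struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

		sched_setscheduler(current, SCHED_FIFO, &param);

		while (!kthread_should_stop()) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (mce_ring_empty())
				schedule();		/* wait for the irq_work wakeup */
			__set_current_state(TASK_RUNNING);
			mce_process_ring();		/* may sleep: unmap/poison pfns */
		}
		return 0;
	}

	/* irq_work callback */
	static void mce_wake_recover(struct irq_work *w)
	{
		wake_up_process(mce_recover_task);
	}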
On Mon, Jun 13, 2011 at 5:40 AM, Borislav Petkov <[email protected]> wrote:
> Well, in the ActionRequired case, the error is obviously reported
> through a #MC exception meaning that the core definitely generates the
> MCE before we've made a context switch (CR3 change etc) so in that case
> 'current' will point to the task at fault.
>
> The problem is finding which 'current' it is, from all the tasks running
> on all cores when the #MC is raised. Tony, can you tell from the hw
> which core actually caused the MCE? Is it the monarch, so to speak?
We can tell which cpu hit the problem by looking at the MCG_STATUS register.
All the innocent bystanders who were dragged into this machine check will
have the RIPV bit set and the EIPV bit clear (SDM Vol 3A, Table 15-20 in section
15.3.9.2). With my patch to re-order processing this will usually be the
monarch (though it might not be if more than one cpu has RIPV==0).
> I'm thinking that in cases where we have a page shared by multiple
> processes, we still would want to run a 'main' user return notifier on
> one core which does the rmap lookup _but_, _also_ very importantly, the
> other cores still hold off from executing userspace until that main
> notifier has finished finding out how big the fallout is, i.e. how
> many other processes would run into the same page.
I don't think that we have any hope of fixing the "multiple processes
about to hit the same page" problem. We can't track down all the users
from the MC handler (because the data structures may be in the process
of being changed by some threads that were in the kernel at the time of the
machine check). Our only hope to solve this would be to let all the kernel
threads return from the MC handler - with a self-IRQ to grab them back
into our clutches when they get out of any critical sections.
But that sounds like something to defer to "phase N+1" after we have
solved all the easier cases.
-Tony
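For reference, the bystander test described above boils down to a check on
mces_seen like the following (the helper name is illustrative;
MCG_STATUS_RIPV/MCG_STATUS_EIPV are the existing bit definitions):

	/* RIPV=1, EIPV=0 is the "innocent bystander" signature (SDM Table 15-20) */
	static int mce_is_bystander(struct mce *m)
	{
		return (m->mcgstatus & MCG_STATUS_RIPV) &&
		       !(m->mcgstatus & MCG_STATUS_EIPV);
	}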
On Mon, Jun 13, 2011 at 9:31 AM, Avi Kivity <[email protected]> wrote:
> I don't think a user_return_notifier is needed here. You don't just want to
> do things before a userspace return, you also want to do them soon. A user
> return notifier might take a very long time to run, if a context switch
> occurs to a thread that spends a lot of time in the kernel (perhaps a
> realtime thread).
>
> So I think the best choice here is MCE -> irq_work -> realtime kernel thread
> (or work queue)
In the AO (action optional) case (e.g. patrol scrubber) there isn't much rush.
We'd like to process things "soon" (before someone hits the corrupt location)
but we don't need to take extraordinary efforts to make "soon" happen.
In the AR (action required - instruction or data fetch from a corrupted
memory location) our main priority is making sure that we don't continue
the task that hit the error - because we don't want to hit it again (as Boris
said, on Intel cpus this is very disruptive to the system as every cpu is
sent the machine check signal - and the code has to read a large number
of slow "msr" registers to figure out what happened. If we can guarantee
that we won't run this task - then the time pressure is greatly reduced.
So if we can do:
MCE -> irq_work -> make-task-not-runnable -> thread-or-work-queue
in a reliable way, then that would meet the needs. PeterZ didn't like the
idea of setting TASK_STOPPED or _UNINTERRUPTIBLE in NMI
context in the MC handler - but I think he was okay with it inside the
irq_work handler.
-Tony
>> Or how about checking rip in each mces_seen?
>
> This is equivalent to what I did - but I think the code
> will be cleaner. I'll give it a try.
Here's a patch on top of my previous series that just looks at
mces_seen to choose the order. Obviously I'd fold this into the
other patch for a final version - but this one lets you see what
the "mce_nextcpu()" function would look like (and how removing
the bitmaps cleans up the other parts of the code). It does look
better to me.
Seto-san: Does this fit with what you were thinking?
Compile tested only.
-Tony
---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a7a8c53..6b4176b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -791,31 +791,47 @@ static void mce_reign(void)
static atomic_t global_nwo;
-/*
- * Keep separate bitmaps for cpus that have the option return from
- * machine check handler (MCG_STATUS.RIPV == 1) and those for that
- * cannot.
- */
-static cpumask_t can_return;
-static cpumask_t cant_return;
-
static int monarch;
/*
- * next cpu choosing first from cant_return, and then from can_return
+ * Find next cpu that will run through the core of do_machine_check()
+ * checking all the banks of machine check registers. We first take
+ * cpus with serious problems (as indicated by MCG_STATUS_RIPV being
+ * clear in the mcgstatus register). A second pass through mces_seen
+ * is made to process the remaining cpus.
+ * We do this because some machine check banks are shared between cpus,
+ * and it is better to find the error on the cpu that has the problem
+ * and clear the bank so that the innocent bystanders do not have to
+ * worry about errors that do not affect them.
*/
-int mce_nextcpu(int this)
+int mce_nextcpu(int cur)
{
- int next;
+ struct mce *m;
+ int cpu = cur;
+ u64 mask = MCG_STATUS_MCIP;
- if (this == -1 || cpumask_test_cpu(this, &cant_return)) {
- next = cpumask_next(this, &cant_return);
- if (next >= nr_cpu_ids)
- next = cpumask_next(-1, &can_return);
- return next;
+ if (cpu != -1) {
+ m = &per_cpu(mces_seen, cpu);
+ if (m->mcgstatus & MCG_STATUS_RIPV)
+ mask |= MCG_STATUS_RIPV;
}
- return cpumask_next(this, &can_return);
+ while (1) {
+ cpu = cpumask_next(cpu, cpu_possible_mask);
+ if (cpu >= nr_cpu_ids) {
+ if (mask & MCG_STATUS_RIPV)
+ return cpu;
+ mask |= MCG_STATUS_RIPV;
+ cpu = -1;
+ continue;
+ }
+
+ m = &per_cpu(mces_seen, cpu);
+ if ((m->mcgstatus & (MCG_STATUS_MCIP|MCG_STATUS_RIPV)) == mask)
+ break;
+ }
+
+ return cpu;
}
/*
@@ -825,7 +841,7 @@ int mce_nextcpu(int this)
* one at a time.
* TBD double check parallel CPU hotunplug
*/
-static int mce_start(int *no_way_out, int noreturn)
+static int mce_start(int *no_way_out)
{
int order;
int cpus = num_online_cpus();
@@ -841,11 +857,6 @@ static int mce_start(int *no_way_out, int noreturn)
smp_wmb();
order = atomic_inc_return(&mce_callin);
- if (noreturn)
- cpumask_set_cpu(smp_processor_id(), &cant_return);
- else
- cpumask_set_cpu(smp_processor_id(), &can_return);
-
/*
* Wait for everyone.
*/
@@ -951,8 +962,6 @@ static int mce_end(int order)
reset:
atomic_set(&global_nwo, 0);
atomic_set(&mce_callin, 0);
- cpumask_clear(&can_return);
- cpumask_clear(&cant_return);
barrier();
/*
@@ -1134,7 +1143,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* This way we don't report duplicated events on shared banks
* because the first one to see it will clear it.
*/
- order = mce_start(&no_way_out, kill_it);
+ order = mce_start(&no_way_out);
for (i = 0; i < banks; i++) {
__clear_bit(i, toclear);
if (!mce_banks[i].ctl)
(2011/06/14 7:03), Tony Luck wrote:
>>> Or how about checking rip in each mces_seen?
>>
>> This is equivalent to what I did - but I think the code
>> will be cleaner. I'll give it a try.
>
> Here's a patch on top of my previous series that just looks at
> mces_seen to choose the order. Obviously I'd fold this into the
> other patch for a final version - but this one lets you see what
> the "mce_nextcpu()" function would look like (and how removing
> the bitmaps cleans up the other parts of the code). It does look
> better to me.
>
> Seto-san: Does this fit with what you were thinking?
>
> Compile tested only.
>
> -Tony
>
> ---
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index a7a8c53..6b4176b 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -791,31 +791,47 @@ static void mce_reign(void)
>
> static atomic_t global_nwo;
>
> -/*
> - * Keep separate bitmaps for cpus that have the option return from
> - * machine check handler (MCG_STATUS.RIPV == 1) and those for that
> - * cannot.
> - */
> -static cpumask_t can_return;
> -static cpumask_t cant_return;
> -
> static int monarch;
>
> /*
> - * next cpu choosing first from cant_return, and then from can_return
> + * Find next cpu that will run through the core of do_machine_check()
> + * checking all the banks of machine check registers. We first take
> + * cpus with serious problems (as indicated by MCG_STATUS_RIPV being
> + * clear in the mcgstatus register). A second pass through mces_seen
> + * is made to process the remaining cpus.
> + * We do this because some machine check banks are shared between cpus,
> + * and it is better to find the error on the cpu that has the problem
> + * and clear the bank so that the innocent bystanders do not have to
> + * worry about errors that do not affect them.
BTW, in the case of "no_way_out" events we don't clear banks because they
could be carried over to the next boot (expecting them to be logged as bootlog).
So we may need to have some trick for some known cases; e.g. ignore
an AR observed by bystanders, anyway.
> */
> -int mce_nextcpu(int this)
> +int mce_nextcpu(int cur)
> {
> - int next;
> + struct mce *m;
> + int cpu = cur;
> + u64 mask = MCG_STATUS_MCIP;
Why do you check the MCG_STATUS_MCIP too here?
What happens if there is a problematic cpu that could not read
the MCG register properly and so indicates "PANIC with !MCIP"?
>
> - if (this == -1 || cpumask_test_cpu(this, &cant_return)) {
> - next = cpumask_next(this, &cant_return);
> - if (next >= nr_cpu_ids)
> - next = cpumask_next(-1, &can_return);
> - return next;
> + if (cpu != -1) {
> + m = &per_cpu(mces_seen, cpu);
> + if (m->mcgstatus & MCG_STATUS_RIPV)
> + mask |= MCG_STATUS_RIPV;
> }
>
> - return cpumask_next(this, &can_return);
> + while (1) {
> + cpu = cpumask_next(cpu, cpu_possible_mask);
possible? online?
Ah, I guess you assumed that all cpus checked in should have
mces_seen with MCIP set while offline cpus have a cleared mces_seen.
Though we know there might be races with cpu hotplug, since we
already use num_online_cpus() in this rendezvous mechanism
it is OK to use cpu_online_mask here at the moment, I think.
Or we should invent a new, better rendezvous mechanism...
> + if (cpu >= nr_cpu_ids) {
> + if (mask & MCG_STATUS_RIPV)
> + return cpu;
> + mask |= MCG_STATUS_RIPV;
> + cpu = -1;
> + continue;
> + }
> +
> + m = &per_cpu(mces_seen, cpu);
> + if ((m->mcgstatus & (MCG_STATUS_MCIP|MCG_STATUS_RIPV)) == mask)
> + break;
> + }
> +
> + return cpu;
> }
>
> /*
> @@ -825,7 +841,7 @@ int mce_nextcpu(int this)
> * one at a time.
> * TBD double check parallel CPU hotunplug
> */
> -static int mce_start(int *no_way_out, int noreturn)
> +static int mce_start(int *no_way_out)
> {
> int order;
> int cpus = num_online_cpus();
> @@ -841,11 +857,6 @@ static int mce_start(int *no_way_out, int noreturn)
> smp_wmb();
> order = atomic_inc_return(&mce_callin);
>
> - if (noreturn)
> - cpumask_set_cpu(smp_processor_id(), &cant_return);
> - else
> - cpumask_set_cpu(smp_processor_id(), &can_return);
> -
> /*
> * Wait for everyone.
> */
> @@ -951,8 +962,6 @@ static int mce_end(int order)
> reset:
> atomic_set(&global_nwo, 0);
> atomic_set(&mce_callin, 0);
> - cpumask_clear(&can_return);
> - cpumask_clear(&cant_return);
> barrier();
>
> /*
> @@ -1134,7 +1143,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> * This way we don't report duplicated events on shared banks
> * because the first one to see it will clear it.
> */
> - order = mce_start(&no_way_out, kill_it);
> + order = mce_start(&no_way_out);
> for (i = 0; i < banks; i++) {
> __clear_bit(i, toclear);
> if (!mce_banks[i].ctl)
>
>
The rest looks good.
Thanks,
H.Seto
(2011/06/14 2:13), Tony Luck wrote:
> On Mon, Jun 13, 2011 at 9:31 AM, Avi Kivity <[email protected]> wrote:
>> I don't think a user_return_notifier is needed here. You don't just want to
>> do things before a userspace return, you also want to do them soon. A user
>> return notifier might take a very long time to run, if a context switch
>> occurs to a thread that spends a lot of time in the kernel (perhaps a
>> realtime thread).
>>
>> So I think the best choice here is MCE -> irq_work -> realtime kernel thread
>> (or work queue)
>
> In the AO (action optional case (e.g. patrol scrubber) - there isn't much rush.
> We'd like to process things "soon" (before someone hits the corrupt location)
> but we don't need to take extraordinary efforts to make "soon" happen.
>
> In the AR (action required - instruction or data fetch from a corrupted
> memory location) our main priority is making sure that we don't continue
> the task that hit the error - because we don't want to hit it again (as Boris
> said, on Intel cpus this is very disruptive to the system as every cpu is
> sent the machine check signal - and the code has to read a large number
> of slow "msr" registers to figure out what happened. If we can guarantee
> that we won't run this task - then the time pressure is greatly reduced.
>
> So if we can do:
>
> MCE -> irq_work -> make-task-not-runnable -> thread-or-work-queue
>
> in a reliable way, then that would meet the needs. PeterZ didn't like the
> idea of setting TASK_STOPPED or _UNINTERRUPTIBLE in NMI
> context in the MC handler - but I think he was okay with it inside the
> irq_work handler.
>
> -Tony
I've made small patches to clear these things up. Could you take a look?
These are based on my cleanup patch set:
https://lkml.org/lkml/2011/6/7/677
Thanks,
H.Seto
arch/x86/kernel/cpu/mcheck/mce.c | 52 ++++++++++++++++++++-----------------
1 files changed, 28 insertions(+), 24 deletions(-)
Hidetoshi Seto (2):
x86, mce: introduce mce_memory_failure_process
x86, mce: rework use of TIF_MCE_NOTIFY
Also relocate the related functions so they sit near the mce_ring* routines.
Signed-off-by: Hidetoshi Seto <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 26 +++++++++++++++++---------
1 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 08363b0..26d7994 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -395,6 +395,8 @@ static inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
}
/*
+ * mce_ring, mce_memory_failure: Support for Memory errors
+ *
* Simple lockless ring to communicate PFNs from the exception handler with the
* process context work function. This is vastly simplified because there's
* only a single reader and a single writer.
@@ -449,6 +451,20 @@ static int mce_ring_add(unsigned long pfn)
return 0;
}
+/* dummy to break dependency. actual code is in mm/memory-failure.c */
+void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
+{
+ pr_err("Action optional memory failure at %lx ignored\n", pfn);
+}
+
+static inline void mce_memory_failure_process(void)
+{
+ unsigned long pfn;
+
+ while (mce_ring_get(&pfn))
+ memory_failure(pfn, MCE_VECTOR);
+}
+
int mce_available(struct cpuinfo_x86 *c)
{
if (mce_disabled)
@@ -1049,12 +1065,6 @@ out:
}
EXPORT_SYMBOL_GPL(do_machine_check);
-/* dummy to break dependency. actual code is in mm/memory-failure.c */
-void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
-{
- printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
-}
-
/*
* Called after mce notification in process context. This code
* is allowed to sleep. Call the high level VM handler to process
@@ -1068,10 +1078,8 @@ void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
*/
void mce_notify_process(void)
{
- unsigned long pfn;
mce_notify_irq();
- while (mce_ring_get(&pfn))
- memory_failure(pfn, MCE_VECTOR);
+ mce_memory_failure_process();
}
static void mce_process_work(struct work_struct *dummy)
--
1.7.1
The basic flow of the MCE handler is summarized as follows:
1) from NMI context:
check hardware error registers, determine error severity,
and then either panic or request non-NMI context via irq_work()
to continue the system.
2) from (irq) context:
call non-NMI safe functions,
wake up loggers and schedule work if required
3) from worker thread:
process time-consuming work like memory poisoning.
The TIF_MCE_NOTIFY flag is a relatively legacy mechanism that has been
used to do the tasks of 2) and 3) in the context of the thread that was
interrupted by the MCE. Now that irq_work() and the work queue are
enough for these tasks, this patch removes the duplicated work from
mce_notify_process().
As a result there is nothing left to do in the interrupted context, but
once SRAR is supported there will be some thread-specific work to do
for action-required errors. So keep the flag for such possible future
use, until a better mechanism is introduced.
Signed-off-by: Hidetoshi Seto <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 28 ++++++++++++----------------
1 files changed, 12 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 26d7994..b1c2f1f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1053,8 +1053,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (kill_it && tolerant < 3)
force_sig(SIGBUS, current);
- /* notify userspace ASAP */
- set_thread_flag(TIF_MCE_NOTIFY);
+ /* Trap this thread before returning to user, for action required */
+ if (worst == MCE_AR_SEVERITY)
+ set_thread_flag(TIF_MCE_NOTIFY);
if (worst > 0)
mce_report_event(regs);
@@ -1066,25 +1067,22 @@ out:
EXPORT_SYMBOL_GPL(do_machine_check);
/*
- * Called after mce notification in process context. This code
- * is allowed to sleep. Call the high level VM handler to process
- * any corrupted pages.
- * Assume that the work queue code only calls this one at a time
- * per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
+ * Called in the process context that was interrupted by MCE and marked
+ * with TIF_MCE_NOTIFY, just before returning to the erroneous userland.
+ * This code is allowed to sleep.
+ * Attempt possible recovery, such as calling the high level VM handler to
+ * process any corrupted pages, and kill/signal the current process if required.
*/
void mce_notify_process(void)
{
- mce_notify_irq();
- mce_memory_failure_process();
+ clear_thread_flag(TIF_MCE_NOTIFY);
+
+ /* TBD: do recovery for action required event */
}
static void mce_process_work(struct work_struct *dummy)
{
- mce_notify_process();
+ mce_memory_failure_process();
}
#ifdef CONFIG_X86_MCE_INTEL
@@ -1165,8 +1163,6 @@ int mce_notify_irq(void)
/* Not more than two messages every minute */
static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
- clear_thread_flag(TIF_MCE_NOTIFY);
-
if (test_and_clear_bit(0, &mce_need_notify)) {
/* wake processes polling /dev/mcelog */
wake_up_interruptible(&mce_chrdev_wait);
--
1.7.1
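
(As a minimal sketch of the three-stage flow described in the commit
message above - illustration only, not code from this series: the names
mce_irq_work, mce_irq_work_cb, mce_work_fn, mce_sketch_init and
mce_kick_nonnmi_work are made up here, while mce_ring_get(),
mce_chrdev_wait and memory_failure() are the pieces that appear in the
patches in this thread.)

#include <linux/init.h>
#include <linux/irq_work.h>
#include <linux/workqueue.h>
#include <linux/wait.h>
#include <linux/mm.h>
#include <asm/mce.h>

static void mce_work_fn(struct work_struct *work)
{
	unsigned long pfn;

	/* 3) worker thread: slow, sleepable work such as poisoning pages */
	while (mce_ring_get(&pfn))
		memory_failure(pfn, MCE_VECTOR);
}
static DECLARE_WORK(mce_work, mce_work_fn);

static struct irq_work mce_irq_work;

static void mce_irq_work_cb(struct irq_work *entry)
{
	/* 2) irq context: safe to wake loggers and kick the work queue */
	wake_up_interruptible(&mce_chrdev_wait);
	schedule_work(&mce_work);
}

static int __init mce_sketch_init(void)
{
	init_irq_work(&mce_irq_work, mce_irq_work_cb);
	return 0;
}

/* 1) called from do_machine_check() in NMI context: defer everything else */
static void mce_kick_nonnmi_work(void)
{
	irq_work_queue(&mce_irq_work);
}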
2011/6/13 Hidetoshi Seto <[email protected]>:
> BTW in case of "no_way_out" events, we don't clear banks because they
> could be carried over to the next boot (expecting logged as bootlog).
> So we may need to have some trick for some known cases; e.g. ignore
> observed AR by bystanders, anyway.
Yes. The overall plan is that we should leave the machine check banks
alone for fatal errors (so that the BIOS, or next OS after the reset can
do something with them). Non-fatal errors can be handled, logged and
cleared. But this leaves us in a pickle if we initially think we can handle
an error, and later decide that we can't. Leaving errors for too long in
the machine check banks has its own problems too - overwrite rules
mean that two errors in the same bank which are each non-fatal, may
become a fatal error for the OS.
>> +     u64     mask = MCG_STATUS_MCIP;
>
> Why do you check the MCG_STATUS_MCIP too here?
> What happens if there is a problematic cpu that could not read
> MCG register properly so indicates "PANIC with !MCIP"?
You figured out the answer later - but perhaps I should have given better
clues in the comments. I think that the !MCIP panic is a "can't
happen" case.
>> +             cpu = cpumask_next(cpu, cpu_possible_mask);
>
> possible? online?
The old code has "for_each_possible_cpu" when scanning through
mces_seen - and I didn't want to change this functionality at this
point.
> Ah, I guess you assumed that all cpus checked in should have
> mces_seen with MCIP while offline cpus have cleaned mces_seen.
>
> Though we know there might be races with cpu hotplug, now we
> already use num_online_cpus() in this rendezvous mechanism,
> it is OK to use cpu_online_mask here at the moment, I think.
>
> Or we should invent new, better rendezvous mechanism...
Eventually we need something better. Currently we may do some
very strange things if someone has taken cpus offline (since they
will still arrive at the rendezvous and we'll get more than
num_online_cpus() cpus showing up). Ditto if someone hot-added another
cpu board but hasn't yet brought the cpus online. Or if we booted with
fewer than all cpus because of a kernel command line argument. Etc.
Unfortunately I don't have good ideas on how to do this better -
ideally we'd have some very small time interval in which to expect that
cpus would arrive at the handler. But the SDM gives no guidance on this.
-Tony
On Mon, Jun 13, 2011 at 7:50 PM, Hidetoshi Seto
<[email protected]> wrote:
> I've made small patches to clear things. Could you take a look?
>
> These are based on my cleanup patch set:
> https://lkml.org/lkml/2011/6/7/677
Boris pointed me to your v2 patchset - I got about halfway through it
today. It all looks
really good. I got stuck trying to test the new irq_work() pieces
[because my injection
system is not working well today]. I will try again tomorrow.
-Tony
On 06/13/2011 08:13 PM, Tony Luck wrote:
> On Mon, Jun 13, 2011 at 9:31 AM, Avi Kivity<[email protected]> wrote:
> > I don't think a user_return_notifier is needed here. You don't just want to
> > do things before a userspace return, you also want to do them soon. A user
> > return notifier might take a very long time to run, if a context switch
> > occurs to a thread that spends a lot of time in the kernel (perhaps a
> > realtime thread).
> >
> > So I think the best choice here is MCE -> irq_work -> realtime kernel thread
> > (or work queue)
>
> In the AO (action optional case (e.g. patrol scrubber) - there isn't much rush.
> We'd like to process things "soon" (before someone hits the corrupt location)
> but we don't need to take extraordinary efforts to make "soon" happen.
>
> In the AR (action required - instruction or data fetch from a corrupted
> memory location) our main priority is making sure that we don't continue
> the task that hit the error - because we don't want to hit it again (as Boris
> said, on Intel cpus this is very disruptive to the system as every cpu is
> sent the machine check signal - and the code has to read a large number
> of slow "msr" registers to figure out what happened. If we can guarantee
> that we won't run this task - then the time pressure is greatly reduced.
Aren't these events extraordinarily rare? I think we can afford a
little inefficiency there.
Even with mce -> irq_work -> rt thread, we're unlikely to return to the
task as the rt thread will displace the task. It may be migrated to an
idle cpu, but even then we may be able to drop the page before it gets
back to userspace.
> So if we can do:
>
> MCE -> irq_work -> make-task-not-runnable -> thread-or-work-queue
>
> in a reliable way, then that would meet the needs. PeterZ didn't like the
> idea of setting TASK_STOPPED or _UNINTERRUPTIBLE in NMI
> context in the MC handler - but I think he was okay with it inside the
> irq_work handler.
How about signalling it with a kernel-internal signal?
I don't think that doing anything to the task is correct, though, as the
problem is with the page, not the task itself. In fact if the task is
executing a vgather instruction, or if another thread munmap()s the
page, it may not hit the same page again when re-executed.
--
error compiling committee.c: too many arguments to function
On Tue, Jun 14, 2011 at 07:40:50AM -0400, Avi Kivity wrote:
> > > So I think the best choice here is MCE -> irq_work -> realtime kernel thread
> > > (or work queue)
> >
> > In the AO (action optional case (e.g. patrol scrubber) - there isn't much rush.
> > We'd like to process things "soon" (before someone hits the corrupt location)
> > but we don't need to take extraordinary efforts to make "soon" happen.
> >
> > In the AR (action required - instruction or data fetch from a corrupted
> > memory location) our main priority is making sure that we don't continue
> > the task that hit the error - because we don't want to hit it again (as Boris
> > said, on Intel cpus this is very disruptive to the system as every cpu is
> > sent the machine check signal - and the code has to read a large number
> > of slow "msr" registers to figure out what happened. If we can guarantee
> > that we won't run this task - then the time pressure is greatly reduced.
>
> Aren't these events extraordinarily rare? I think we can afford a
> little inefficiency there.
>
> Even with mce -> irq_work -> rt thread, we're unlikely to return to
> the task as the rt thread will displace the task. It may be migrated
> to an idle cpu, but even then we may be able to drop the page before
> it gets back to userspace.
This doesn't give you the guarantee that the realtime task manages to
unmap the page from all pagetables before another process running on
another core accesses it.
I think your previous suggestion of making the memory failure handling
code reentrant would cover all holes.
Even marking all processes mapping a faulty page STOPPED or
UNINTERRUPTIBLE won't work in all cases, since you have to go out and
find out which processes those are. And this is what the rt thread will do.
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
On 06/14/2011 04:33 PM, Borislav Petkov wrote:
> >
> > Even with mce -> irq_work -> rt thread, we're unlikely to return to
> > the task as the rt thread will displace the task. It may be migrated
> > to an idle cpu, but even then we may be able to drop the page before
> > it gets back to userspace.
>
> This doesn't give you the guarantee that the realtime task manages to
> unmap the page from all pagetables before another process running on
> another core accesses it.
Right, it's not about a guarantee, it's about maintaining decent
performance.
> I think your previous suggestion of making the memory failure handling
> code reentrant would cover all holes.
I think it's required, yes.
Since we can't have nested #MC (due to the IST mechanism resetting %rsp
and clobbering the previous invocation's stack), we have to clear MCIP
outside the #MC handler. And that means irq_work_queue()
(note that this changes the behaviour from memory corruption to shutdown
state; both suck, but one more than the other).
> Even marking all processes mapping a faulty page STOPPED or
> UNINTERRUPTIBLE won't work in all cases since you have to go out and
> find which those processes are. And this is what the rt thread will do.
Yes.
--
error compiling committee.c: too many arguments to function
> Aren't these events extraordinarily rare? I think we can afford a
> little inefficiency there.
Yes. Very rare. But also very disruptive (On Intel all cpus are signaled
and will be stuck processing the machine check for hundreds of microseconds).
So we'd like to try hard not to take the same fault more than once.
There's also the issue of post-error analysis. Some people like to dig
around in the MCA logs to figure out if the memory is really going bad,
or is just being hit occasionally by stray alpha-particles or neutrons.
Getting two errors close together might cause someone to replace a DIMM
that isn't really bad. In Linux user space tools we could take account
of this repetition - but the OEM tools are embedded in their BIOS or
maintenance processors.
>I don't think that doing anything to the task is correct, though, as the
>problem is with the page, not the task itself. In fact if the task is
>executing a vgather instruction, or if another thread munmap()s the
>page, it may not hit the same page again when re-executed.
True the memory is the source of the problem - but the task is
intimately affected. Time for a car analogy :-) ...
You are driving along the road when you notice a giant hole. You
hit the brakes and stop on the very edge. The problem is with the
road, not with your car. But I don't think you want to start driving
again (at least not in the forward direction!)
-Tony
>Since we can't have nested #MC (due to the IST mechanism resetting %rsp
>and clobbering the previous invocation's stack), we have to clear MCIP
>outside the #MC handler. And that means irq_work_queue()
>
>(note that this changes the behaviour from memory corruption to shutdown
>state; both suck, but one more than the other).
Hmm. We currently clear MCIP inside the handler (at the very end, but
still inside). Deferring clearing would make sense - but we'd have to
be sure to do so a.s.a.p on all processors. It also would change how
we look at the possibility of a process being able to run and re-execute
an instruction that causes a machine check. If we defer clearing MCIP
this won't just be annoying, it will be fatal (if any cpu hasn't yet
got to the work queue to clear MCIP).
-Tony
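
(For concreteness, the "clear MCIP outside the #MC handler" idea being
discussed might look roughly like the hypothetical sketch below - the
names mcip_clear_work, mcip_clear_fn and mcip_clear_defer are made up,
and the per-cpu init_irq_work() setup at boot is omitted. The risk Tony
describes is visible here: until a cpu's irq_work has run, a second #MC
on that cpu finds MCIP still set and the machine shuts down.)

#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <asm/msr.h>

static DEFINE_PER_CPU(struct irq_work, mcip_clear_work);

static void mcip_clear_fn(struct irq_work *entry)
{
	/* re-arm machine check delivery on this cpu */
	wrmsrl(MSR_IA32_MCG_STATUS, 0);
}

/* would replace the MCIP clear at the very end of do_machine_check() */
static void mcip_clear_defer(void)
{
	irq_work_queue(&__get_cpu_var(mcip_clear_work));
}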
On Mon, Jun 13, 2011 at 7:53 PM, Hidetoshi Seto
<[email protected]> wrote:
> + * Called in the process context that was interrupted by MCE and marked
> + * with TIF_MCE_NOTIFY, just before returning to the erroneous userland.
> + * This code is allowed to sleep.
> + * Attempt possible recovery, such as calling the high level VM handler to
> + * process any corrupted pages, and kill/signal the current process if required.
>  */
>  void mce_notify_process(void)
>  {
> -        mce_notify_irq();
> -        mce_memory_failure_process();
> +        clear_thread_flag(TIF_MCE_NOTIFY);
> +
> +        /* TBD: do recovery for action required event */
>  }
I liked where this series was going - but I'm not sure how we will
be able to write code to fill in the TBD here. You've got us to
a good state ... the process that hit the action-required error
can't get to user space to re-execute because of TIF_MCE_NOTIFY.
So that part is great. But ... we don't have the information we
need (failing address) to take some action. That was put into
the ring ... and it might still be there, but it could have been
grabbed and handled by the worker thread (???). So the error
might have been handled (or might be in the process of being
handled - we could be racing with the worker) - but we don't know.
I think that for action-required we need to pass the PFN from
the MC handler to this mce_notify_process() function. Andi
put it into the task structure - and although I didn't like that
much (and Ingo hated it even more) - it was quite a simple way
to pass the information. The bad "pfn" *is* task-relevant data.
It's the reason that the task can't run, and the only hope to get
the process back onto its feet again.
My detour into task-return-notifiers was a massively more
complex way to achieve this same goal (the pfn there was
dropped into the container structure for the "urn" pointer that
was passed to the handler.)
Maybe I'm missing something obvious - but I think that
to fix the action-required error - we need to know some more
about the error than which task is affected.
-Tony
On Tue, Jun 14, 2011 at 11:02 AM, Tony Luck <[email protected]> wrote:
> need (failing address) to take some action. That was put into
> the ring
Correction - we only put the pfn into the ring for AO (action optional)
errors [in the current state of the patch series].
But the need for a path to get the pfn from the MC handler to the
mce_notify_process() function remains the same.
-Tony
(2011/06/15 3:28), Tony Luck wrote:
> On Tue, Jun 14, 2011 at 11:02 AM, Tony Luck <[email protected]> wrote:
>> need (failing address) to take some action. That was put into
>> the ring
>
> Correction - we only put the pfn into the ring for AO (action optional)
> errors [in the current state of the patch series].
>
> But the need for a path to get the pfn from the MC handler to the
> mce_notify_process() function remains the same.
Is there any reason why we cannot have a small special buffer for AR?
For example:
# writer side:
	for (i = 0; i < ar_buf_size; i++) {
		if (!cmpxchg(&ar_buf[i].pid, 0, current->pid)) {
			ar_buf[i].pfn = pfn;
			return 0;
		}
	}
	return -1; /* buffer full */

# reader side:
	for (i = 0; i < ar_buf_size; i++) {
		if (ar_buf[i].pid == current->pid) {
			mce_handle_srar(ar_buf[i].pfn);
			ar_buf[i].pid = 0;
		}
	}
I think it is less intrusive than having X bytes in every thread,
which would usually be 99.99% unused for its lifetime.
Or ... is it possible to push siginfo w/ addr and pop here?
Thanks,
H.Seto
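
(Fleshed out a little, the buffer proposed above might look like the
sketch below - illustration only: the struct, AR_BUF_SIZE and the names
mce_ar_record_pfn/mce_ar_process are placeholders, and mce_handle_srar()
is the hypothetical recovery hook from the pseudocode. The writer runs
in #MC context and claims a free slot with cmpxchg on the pid field; the
reader runs later in the context of the same task, e.g. from
mce_notify_process(), so the pfn stored by that task's own machine check
is what it finds.)

#include <linux/sched.h>
#include <asm/cmpxchg.h>

#define AR_BUF_SIZE	8

struct mce_ar_record {
	pid_t		pid;	/* 0 == slot free; claimed with cmpxchg */
	unsigned long	pfn;
};
static struct mce_ar_record ar_buf[AR_BUF_SIZE];

/* writer side: #MC (NMI) context */
static int mce_ar_record_pfn(unsigned long pfn)
{
	int i;

	for (i = 0; i < AR_BUF_SIZE; i++) {
		if (!cmpxchg(&ar_buf[i].pid, 0, current->pid)) {
			ar_buf[i].pfn = pfn;
			return 0;
		}
	}
	return -1;	/* buffer full */
}

/* reader side: process context, just before returning to user space */
static void mce_ar_process(void)
{
	int i;

	for (i = 0; i < AR_BUF_SIZE; i++) {
		if (ar_buf[i].pid == current->pid) {
			mce_handle_srar(ar_buf[i].pfn);
			ar_buf[i].pid = 0;
		}
	}
}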
On Tue, Jun 14, 2011 at 6:29 PM, Hidetoshi Seto
<[email protected]> wrote:
> Or ... is it possible to push siginfo w/ addr and pop here?
I chatted to Peter Anvin about this over lunch ... his suggestion was that,
since we know (for now) that the recovery case is always from user mode, we
can let all the non-involved cpus return from do_machine_check() ... but
catch the cpu with the problem and do a sideways stack jump from the machine
check stack to the normal trap stack. At that point we'll be executing in a
context that is effectively the same as a page fault - so we have plenty of
safe options on functions we can call, locks we can take, etc.
So perhaps we can change "void do_machine_check()" to "unsigned long
do_machine_check()" and have the bystander cpus "return 0;" and the
cpu that hit the error "return m.addr;" ... and then do the necessary magic
in entry_64.S to leap from stack to stack in one mighty leap (and then
on to a "handle_action_required(regs, addr)" function).
-Tony
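
(In C terms the proposal might look something like the sketch below -
purely illustrative: do_machine_check_tail() and handle_action_required()
are hypothetical names, MCE_AR_SEVERITY comes from the severity-table
patches in this series, and the entry_64.S glue that actually switches
from the #MC IST stack to the normal kernel stack is omitted.)

#include <linux/mm.h>
#include <asm/ptrace.h>
#include <asm/mce.h>

/* what the tail of a "returns unsigned long" do_machine_check() decides */
static unsigned long do_machine_check_tail(struct mce *m, int worst)
{
	if (worst != MCE_AR_SEVERITY)
		return 0;	/* bystander cpu: nothing to recover */
	return m->addr;		/* affected cpu: hand the address to the asm glue */
}

/*
 * Called from entry_64.S on the normal kernel stack when the value above
 * was non-zero - i.e. in a context that is effectively a page fault, so
 * it may take locks, call the VM, etc.
 */
void handle_action_required(struct pt_regs *regs, unsigned long addr)
{
	memory_failure(addr >> PAGE_SHIFT, MCE_VECTOR);
}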
(2011/06/15 11:10), Tony Luck wrote:
> On Tue, Jun 14, 2011 at 6:29 PM, Hidetoshi Seto
> <[email protected]> wrote:
>> Or ... is it possible to push siginfo w/ addr and pop here?
>
> I chatted to Peter Anvin about this over lunch ... his suggestion was that,
> since we know (for now) that the recovery case is always from user mode, we
> can let all the non-involved cpus return from do_machine_check() ... but
> catch the cpu with the problem and do a sideways stack jump from the machine
> check stack to the normal trap stack. At that point we'll be executing in a
> context that is effectively the same as a page fault - so we have plenty of
> safe options on functions we can call, locks we can take, etc.
>
> So perhaps we can change "void do_machine_check()" to "unsigned long
> do_machine_check()" and have the bystander cpus "return 0;" and the
> cpu that hit the error "return m.addr;" ... and then do the necessary magic
> in entry_64.S to leap from stack to stack in one mighty leap (and then
> on to a "handle_action_required(regs, addr)" function).
>
> -Tony
Sounds good.
I guess we need something more for high-level recovery in kernel mode,
but it is better to set such difficulties aside for now and take things
one at a time.
Thanks,
H.Seto
On 06/14/2011 08:13 PM, Luck, Tony wrote:
> >Since we can't have nested #MC (due to the IST mechanism resetting %rsp
> >and clobbering the previous invocation's stack), we have to clear MCIP
> >outside the #MC handler. And that means irq_work_queue()
> >
> >(note that this changes the behaviour from memory corruption to shutdown
> >state; both suck, but one more than the other).
>
> Hmm. We currently clear MCIP inside the handler (at the very end, but
> still inside). Deferring clearing would make sense - but we'd have to
> be sure to do so a.s.a.p on all processors. It also would change how
> we look at the possibility of a process being able to run and re-execute
> an instruction that causes a machine check. If we defer clearing MCIP
> this won't just be annoying, it will be fatal (if any cpu hasn't yet
> got to the work queue to clear MCIP).
Clearly, the work queue (or thread) has to be real-time at the highest
priority.
--
error compiling committee.c: too many arguments to function
On 06/14/2011 07:59 PM, Luck, Tony wrote:
> >I don't think that doing anything to the task is correct, though, as the
> >problem is with the page, not the task itself. In fact if the task is
> >executing a vgather instruction, or if another thread munmap()s the
> >page, it may not hit the same page again when re-executed.
>
> True the memory is the source of the problem - but the task is
> intimately affected. Time for a car analogy :-) ...
>
> You are driving along the road when you notice a giant hole. You
> hit the brakes and stop on the very edge. The problem is with the
> road, not with your car. But I don't think you want to start driving
> again (at least not in the forward direction!)
>
No argument, the current task has to be preempted. But a complete
solution needs to stop other tasks from hitting the page as well.
--
error compiling committee.c: too many arguments to function