When HA clustering software or an administrator detects that a host
has become unresponsive, it sends an NMI to the host to stop all
current work and take a crash dump. If the kernel has already panicked
or is capturing a crash dump at that time, the additional NMI can
cause the crash dump to fail.
Also, crash_kexec() called directly from oops context can race with
crash_kexec() called from panic().
To solve these issues, this patch set does the following:
- Don't call panic() on NMI if the kernel has already panicked
- Extend the exclusion control currently done by panic_lock to crash_kexec()
- Introduce the "apic_extnmi=none" boot option, which masks external
NMI at boot time
Additionally, "apic_extnmi=all" is provided. This option unmasks
external NMI for all CPUs. This helps the kernel panic even when
CPU 0 can't handle an external NMI because it hangs up in NMI context
or the NMI is swallowed by another NMI handler.
This patch set can be applied to current -tip tree.
V6:
- Update comments and patch descriptions all over the patch series
- Add documentation for kernel.panic_on_io_nmi sysctl (PATCH 6/6)
- Separate PATCH 5/6 from PATCH 2/6 because the portion is actually
needed for "apic_extnmi=all" case introduced by PATCH 4/6
- ...and various fixes (please see the change logs in each patch
description for details)
V5: https://lkml.org/lkml/2015/11/20/228
- Use WRITE_ONCE() for crash_ipi_done to keep the instruction order
(PATCH 2/4)
- Address concurrent unknown/external NMI case, too (PATCH 2/4)
- Fix build errors (PATCH 3/4)
- Rename "noextnmi" boot option to "apic_extnmi" and expand its
feature (PATCH 4/4)
V4: https://lkml.org/lkml/2015/9/25/193
- Improve comments and descriptions (PATCH 1/4 to 3/4)
- Use new __crash_kexec(), no exclusion check version of crash_kexec(),
instead of checking if panic_cpu is the current cpu or not
(PATCH 3/4)
V3: https://lkml.org/lkml/2015/8/6/39
- Introduce nmi_panic() macro to reduce code duplication
- In the case of panic on NMI, don't return from NMI handlers
if another cpu already panicked
V2: https://lkml.org/lkml/2015/7/27/31
- Use atomic_cmpxchg() instead of current spin_trylock() to exclude
concurrent accesses to panic() and crash_kexec()
- Don't introduce no-lock version of panic() and crash_kexec()
V1: https://lkml.org/lkml/2015/7/22/81
---
Hidehiro Kawai (6):
panic/x86: Fix re-entrance problem due to panic on NMI
panic/x86: Allow CPUs to save registers even if they are looping in NMI context
kexec: Fix race between panic() and crash_kexec() called directly
x86/apic: Introduce apic_extnmi boot option
x86/nmi: Fix to save registers for crash dump on external NMI broadcast
Documentation: Add documentation for kernel.panic_on_io_nmi sysctl
Documentation/kernel-parameters.txt | 9 +++++++++
Documentation/sysctl/kernel.txt | 15 +++++++++++++++
arch/x86/include/asm/apic.h | 5 +++++
arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/apic/apic.c | 35 +++++++++++++++++++++++++++++++++--
arch/x86/kernel/nmi.c | 27 ++++++++++++++++++++++-----
arch/x86/kernel/reboot.c | 28 ++++++++++++++++++++++++++++
include/linux/kernel.h | 29 +++++++++++++++++++++++++++++
include/linux/kexec.h | 2 ++
kernel/kexec_core.c | 30 +++++++++++++++++++++++++++++-
kernel/panic.c | 29 ++++++++++++++++++++++++-----
kernel/watchdog.c | 2 +-
12 files changed, 198 insertions(+), 14 deletions(-)
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
If panic on NMI happens just after panic() on the same CPU, panic()
is called recursively. As a result, the CPU stalls after failing to
acquire panic_lock.
To avoid this problem, don't call panic() in NMI context if
we've already entered panic().
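The entry check above can be modeled in user space with C11 atomics. This is a hypothetical stand-alone sketch, not the kernel code: nmi_panic_should_call_panic() and the plain int CPU ids are illustrative names. Like the real nmi_panic() macro, it calls for panic() unless this CPU already claimed panic_cpu (same-CPU re-entry from NMI).

```c
#include <assert.h>
#include <stdatomic.h>

#define PANIC_CPU_INVALID -1

/* Which CPU, if any, currently owns the panic path. */
static atomic_int panic_cpu = PANIC_CPU_INVALID;

/*
 * Model of the nmi_panic() check: returns 1 if panic() should be
 * called, 0 if this CPU already owns panic_cpu (re-entry from NMI
 * context) -- exactly the recursion this patch avoids.  A different
 * CPU still calls panic(), which then stops itself inside panic().
 */
static int nmi_panic_should_call_panic(int this_cpu)
{
	int old = PANIC_CPU_INVALID;

	atomic_compare_exchange_strong(&panic_cpu, &old, this_cpu);
	/* old now holds the previous owner (or PANIC_CPU_INVALID) */
	return old != this_cpu;
}
```

The same one-shot cmpxchg replaces panic_lock inside panic() itself, so NMI-time and normal-time panics share one exclusion variable.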
V6:
- Add a comment about panic_cpu
- Replace the magic number -1 for panic_cpu with a macro
V4:
- Improve comments in io_check_error() and panic()
V3:
- Introduce nmi_panic() macro to reduce code duplication
- In the case of panic on NMI, don't return from NMI handlers
if another CPU already panicked
V2:
- Use atomic_cmpxchg() instead of current spin_trylock() to
exclude concurrent accesses to the panic routines
- Don't introduce no-lock version of panic()
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
---
arch/x86/kernel/nmi.c | 16 ++++++++++++----
include/linux/kernel.h | 21 +++++++++++++++++++++
kernel/panic.c | 15 ++++++++++++---
kernel/watchdog.c | 2 +-
4 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 697f90d..5131714 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
#endif
if (panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
@@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
reason, smp_processor_id());
show_regs(regs);
- if (panic_on_io_nmi)
- panic("NMI IOCK error: Not continuing");
+ if (panic_on_io_nmi) {
+ nmi_panic("NMI IOCK error: Not continuing");
+
+ /*
+ * If we return from nmi_panic(), it means we have received
+ * an NMI while processing panic(). So, simply return without
+ * delaying and without re-enabling the NMI.
+ */
+ return;
+ }
/* Re-enable the IOCK line, wait for a few seconds */
reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
@@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
}
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 350dfb0..db66867 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -446,6 +446,27 @@ extern int sysctl_panic_on_stackoverflow;
extern bool crash_kexec_post_notifiers;
/*
+ * panic_cpu holds a panicking CPU number and is used for exclusive
+ * execution of panic and crash_kexec routines. If the value is
+ * PANIC_CPU_INVALID, it means that no CPU has entered panic() or
+ * crash_kexec().
+ */
+extern atomic_t panic_cpu;
+#define PANIC_CPU_INVALID -1
+
+/*
+ * A variant of panic() called from NMI context.
+ * If we've already panicked on this CPU, return from here.
+ */
+#define nmi_panic(fmt, ...) \
+ do { \
+ int this_cpu = raw_smp_processor_id(); \
+ if (atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu) \
+ != this_cpu) \
+ panic(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/*
* Only to be used by arch init code. If the user over-wrote the default
* CONFIG_PANIC_TIMEOUT, honor it.
*/
diff --git a/kernel/panic.c b/kernel/panic.c
index 4b150bc..3261e2d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}
+atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
*/
void panic(const char *fmt, ...)
{
- static DEFINE_SPINLOCK(panic_lock);
static char buf[1024];
va_list args;
long i, i_next = 0;
int state = 0;
+ int old_cpu, this_cpu;
/*
* Disable local interrupts. This will prevent panic_smp_self_stop
* from deadlocking the first cpu that invokes the panic, since
* there is nothing to prevent an interrupt handler (that runs
- * after the panic_lock is acquired) from invoking panic again.
+ * after setting panic_cpu) from invoking panic again.
*/
local_irq_disable();
@@ -94,8 +96,15 @@ void panic(const char *fmt, ...)
* multiple parallel invocations of panic, all other CPUs either
* stop themself or will wait until they are stopped by the 1st CPU
* with smp_send_stop().
+ *
+ * `old_cpu == PANIC_CPU_INVALID' means this is the 1st CPU which
+ * comes here, so go ahead.
+ * `old_cpu == this_cpu' means we came from nmi_panic() which sets
+ * panic_cpu to this CPU. In this case, this is also the 1st CPU.
*/
- if (!spin_trylock(&panic_lock))
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
+ if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)
panic_smp_self_stop();
console_verbose();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18f34cf..b9be18f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
trigger_allbutself_cpu_backtrace();
if (hardlockup_panic)
- panic("Hard LOCKUP");
+ nmi_panic("Hard LOCKUP");
__this_cpu_write(hard_watchdog_warn, true);
return;
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
sends an NMI IPI to non-panicking CPUs to stop them, save their
register information, and do some cleanups for crash dumping. However,
if a non-panicking CPU is looping infinitely in NMI context, we fail
to save its register information into the crash dump.
For example, this can happen when unknown NMIs are broadcast to all
CPUs as follows (before applying this patch series):
CPU 0 CPU 1
=========================== ==========================
receive an unknown NMI
unknown_nmi_error()
panic() receive an unknown NMI
spin_trylock(&panic_lock) unknown_nmi_error()
crash_kexec() panic()
spin_trylock(&panic_lock)
panic_smp_self_stop()
infinite loop
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
infinite loop...
Here, since CPU 1 is in NMI context, additional NMI from CPU 0
is blocked until CPU 1 executes IRET. However, CPU 1 never
executes IRET, so the NMI is not handled and the callback function
to save registers is never called.
Actually, this can happen on some servers which broadcast NMIs to
all CPUs when the dump button is pushed.
To save registers in this case, we need to either:
a) Return from the NMI handler instead of looping infinitely, or
b) Call the callback function directly from the infinite loop
Inherently, a) is risky because NMI is also used to prevent corrupted
data from being propagated to devices. So, we chose b).
This patch does the following:
1. Move the `infinite loop in NMI context' (actually done by
panic_smp_self_stop()) outside of panic() so that we can refer
to pt_regs. Please note that panic_smp_self_stop() is still
used in normal context
2. Call the callback of kdump_nmi_shootdown_cpus() directly to save
registers and do some cleanups after setting waiting_for_crash_ipi,
which is used to count down the number of CPUs that have handled
the callback
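Step 2 above can be sketched in user space as a poll of the crash-IPI flag. This is an illustrative model, not the kernel code: save_regs_callback() stands in for crash_nmi_callback(), which in the kernel saves the CPU's registers and never returns.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int crash_ipi_issued;	/* set by the crashing CPU */
static atomic_int regs_saved;		/* stands in for the saved dump */

/* stands in for crash_nmi_callback(); the real one saves this
 * CPU's registers for the crash dump and then halts */
static void save_regs_callback(void)
{
	atomic_store(&regs_saved, 1);
}

/*
 * One iteration of the b) loop: if the crashing CPU has raised the
 * flag, run the register-saving callback directly instead of waiting
 * for an NMI IPI that can never be delivered while we sit in NMI
 * context.  Returns 1 once the callback has run.
 */
static int poll_crash_ipi_once(void)
{
	if (atomic_load(&crash_ipi_issued)) {
		save_regs_callback();
		return 1;
	}
	return 0;
}
```

In the patch, nmi_panic_self_stop() wraps this poll in a `while (1) { ...; cpu_relax(); }` loop, and WRITE_ONCE()/READ_ONCE() keep the flag update ordered after smp_send_nmi_allbutself().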
V6:
- Revise the commit description and many comments
- Move changes involved with nmi_reason_lock to a later patch because
it turned out there is no problem at this point
- Rename crash_ipi_done to crash_ipi_issued to clarify its meaning
V5:
- Use WRITE_ONCE() when setting crash_ipi_done to 1 so that the
compiler doesn't change the instruction order
- Support the case of b in the above description
- Add poll_crash_ipi_and_callback()
V4:
- Rewrite the patch description
V3:
- Newly introduced
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Michal Hocko <[email protected]>
---
arch/x86/kernel/nmi.c | 6 +++---
arch/x86/kernel/reboot.c | 20 ++++++++++++++++++++
include/linux/kernel.h | 14 +++++++++++---
kernel/panic.c | 10 ++++++++++
kernel/watchdog.c | 2 +-
5 files changed, 45 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 5131714..5e00de7 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
#endif
if (panic_on_unrecovered_nmi)
- nmi_panic("NMI: Not continuing");
+ nmi_panic(regs, "NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
@@ -256,7 +256,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
show_regs(regs);
if (panic_on_io_nmi) {
- nmi_panic("NMI IOCK error: Not continuing");
+ nmi_panic(regs, "NMI IOCK error: Not continuing");
/*
* If we return from nmi_panic(), it means we have received
@@ -305,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- nmi_panic("NMI: Not continuing");
+ nmi_panic(regs, "NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
}
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 02693dd..1da1302 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -718,6 +718,7 @@ static int crashing_cpu;
static nmi_shootdown_cb shootdown_callback;
static atomic_t waiting_for_crash_ipi;
+static int crash_ipi_issued;
static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
{
@@ -780,6 +781,9 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
smp_send_nmi_allbutself();
+ /* Kick CPUs looping in NMI context. */
+ WRITE_ONCE(crash_ipi_issued, 1);
+
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
@@ -788,6 +792,22 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
}
+
+/* Override the weak function in kernel/panic.c */
+void nmi_panic_self_stop(struct pt_regs *regs)
+{
+ while (1) {
+ /*
+ * Wait for the crash dumping IPI to be issued, and then
+ * call its callback directly.
+ */
+ if (READ_ONCE(crash_ipi_issued))
+ crash_nmi_callback(0, regs); /* Don't return */
+
+ cpu_relax();
+ }
+}
+
#else /* !CONFIG_SMP */
void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index db66867..f28eebb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -255,6 +255,7 @@ extern long (*panic_blink)(int state);
__printf(1, 2)
void panic(const char *fmt, ...)
__noreturn __cold;
+void nmi_panic_self_stop(struct pt_regs *);
extern void oops_enter(void);
extern void oops_exit(void);
void print_oops_end_marker(void);
@@ -457,13 +458,20 @@ extern atomic_t panic_cpu;
/*
* A variant of panic() called from NMI context.
* If we've already panicked on this CPU, return from here.
+ * If another CPU already panicked, loop in nmi_panic_self_stop() which
+ * can provide architecture dependent code such as saving register states
+ * for crash dump.
*/
-#define nmi_panic(fmt, ...) \
+#define nmi_panic(regs, fmt, ...) \
do { \
+ int old_cpu; \
int this_cpu = raw_smp_processor_id(); \
- if (atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu) \
- != this_cpu) \
+ old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, \
+ this_cpu); \
+ if (old_cpu == PANIC_CPU_INVALID) \
panic(fmt, ##__VA_ARGS__); \
+ else if (old_cpu != this_cpu) \
+ nmi_panic_self_stop(regs); \
} while (0)
/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 3261e2d..3d6c3f1 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,16 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}
+/*
+ * Stop ourselves in NMI context if another cpu has already panicked.
+ * Architecture code may override this to prepare for crash dumping
+ * (e.g. save register information).
+ */
+void __weak nmi_panic_self_stop(struct pt_regs *regs)
+{
+ panic_smp_self_stop();
+}
+
atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);
/**
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b9be18f..84b5035 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
trigger_allbutself_cpu_backtrace();
if (hardlockup_panic)
- nmi_panic("Hard LOCKUP");
+ nmi_panic(regs, "Hard LOCKUP");
__this_cpu_write(hard_watchdog_warn, true);
return;
Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other CPUs
CPU 1:
panic()
crash_kexec()
mutex_trylock() // failed to acquire
smp_send_stop() // stop other CPUs
infinite loop
If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.
In another case:
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
<NMI>
io_check_error()
panic()
crash_kexec()
mutex_trylock() // failed to acquire
infinite loop
Clearly, this is an undesirable result.
To fix this problem, this patch changes crash_kexec() to exclude
concurrent execution by using atomic_t panic_cpu.
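The exclusion can be sketched in user space with C11 atomics. This is a hypothetical model: crash_kexec_model() and kexec_runs are made-up names, and unlike the real __crash_kexec() the model returns so that the panic_cpu release path can be exercised.

```c
#include <assert.h>
#include <stdatomic.h>

#define PANIC_CPU_INVALID -1

static atomic_int panic_cpu = PANIC_CPU_INVALID;
static int kexec_runs;			/* counts crash-kernel entries */

static void __crash_kexec_model(void)	/* stands in for __crash_kexec() */
{
	kexec_runs++;
}

/* Model of the patched crash_kexec(): only the CPU that wins the
 * cmpxchg runs the crash path; it then releases panic_cpu so a later
 * panic()/crash_kexec() call is possible.  A plain store suffices for
 * the release because no other CPU may take over in between. */
static void crash_kexec_model(int this_cpu)
{
	int old = PANIC_CPU_INVALID;

	if (atomic_compare_exchange_strong(&panic_cpu, &old, this_cpu)) {
		__crash_kexec_model();
		atomic_store(&panic_cpu, PANIC_CPU_INVALID);
	}
}
```

panic() itself calls the unchecked __crash_kexec() because it has already claimed panic_cpu before reaching the kexec path.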
V6:
- Use PANIC_CPU_INVALID instead of hard-coded -1
- Add comments about __crash_kexec()
V5:
- Add missing dummy __crash_kexec() for !CONFIG_KEXEC_CORE case
- Replace atomic_xchg() with atomic_set() in crash_kexec() because
it is used as a release operation and there is no need of memory
barrier effect. This change also removes an unused value warning
V4:
- Use new __crash_kexec(), no exclusion check version of crash_kexec(),
instead of checking if panic_cpu is the current CPU or not
V2:
- Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
to exclude concurrent accesses
- Don't introduce no-lock version of crash_kexec()
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/kexec.h | 2 ++
kernel/kexec_core.c | 30 +++++++++++++++++++++++++++++-
kernel/panic.c | 4 ++--
3 files changed, 33 insertions(+), 3 deletions(-)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d140b1e..7b68d27 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -237,6 +237,7 @@ extern int kexec_purgatory_get_set_symbol(struct kimage *image,
unsigned int size, bool get_value);
extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
const char *name);
+extern void __crash_kexec(struct pt_regs *);
extern void crash_kexec(struct pt_regs *);
int kexec_should_crash(struct task_struct *);
void crash_save_cpu(struct pt_regs *regs, int cpu);
@@ -332,6 +333,7 @@ int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
#else /* !CONFIG_KEXEC_CORE */
struct pt_regs;
struct task_struct;
+static inline void __crash_kexec(struct pt_regs *regs) { }
static inline void crash_kexec(struct pt_regs *regs) { }
static inline int kexec_should_crash(struct task_struct *p) { return 0; }
#define kexec_in_progress false
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 11b64a6..c823f30 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -853,7 +853,12 @@ struct kimage *kexec_image;
struct kimage *kexec_crash_image;
int kexec_load_disabled;
-void crash_kexec(struct pt_regs *regs)
+/*
+ * No panic_cpu check version of crash_kexec(). This function is called
+ * only when panic_cpu holds the current CPU number; this is the only CPU
+ * which processes crash_kexec routines.
+ */
+void __crash_kexec(struct pt_regs *regs)
{
/* Take the kexec_mutex here to prevent sys_kexec_load
* running on one cpu from replacing the crash kernel
@@ -876,6 +881,29 @@ void crash_kexec(struct pt_regs *regs)
}
}
+void crash_kexec(struct pt_regs *regs)
+{
+ int old_cpu, this_cpu;
+
+ /*
+ * Only one CPU is allowed to execute the crash_kexec() code as with
+ * panic(). Otherwise parallel calls of panic() and crash_kexec()
+ * may stop each other. To exclude them, we use panic_cpu here too.
+ */
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
+ if (old_cpu == PANIC_CPU_INVALID) {
+ /* This is the 1st CPU which comes here, so go ahead. */
+ __crash_kexec(regs);
+
+ /*
+ * Reset panic_cpu to allow another panic()/crash_kexec()
+ * call.
+ */
+ atomic_set(&panic_cpu, PANIC_CPU_INVALID);
+ }
+}
+
size_t crash_get_memory_size(void)
{
size_t size = 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 3d6c3f1..81a0a3d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -138,7 +138,7 @@ void panic(const char *fmt, ...)
* the "crash_kexec_post_notifiers" option to the kernel.
*/
if (!crash_kexec_post_notifiers)
- crash_kexec(NULL);
+ __crash_kexec(NULL); /* bypass panic_cpu check */
/*
* Note smp_send_stop is the usual smp shutdown function, which
@@ -163,7 +163,7 @@ void panic(const char *fmt, ...)
* more unstable, it can increase risks of the kdump failure too.
*/
if (crash_kexec_post_notifiers)
- crash_kexec(NULL);
+ __crash_kexec(NULL); /* bypass panic_cpu check */
bust_spinlocks(0);
This patch introduces a new boot option, apic_extnmi:
apic_extnmi={ bsp | all | none }
The default value is "bsp", which preserves the current behavior:
only the BSP receives external NMIs. "all" allows external NMIs to
be broadcast to all CPUs. This raises the success rate of panic on
NMI when the BSP hangs up in NMI context or the external NMI is
swallowed by other NMI handlers on the BSP. With "none", no CPU
receives external NMIs. This is useful for the dump capture kernel
so that it won't be shot down while saving a crash dump.
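The option's effect can be modeled as below. This is an illustrative user-space sketch, not the kernel code: parse_apic_extnmi() and lint1_unmasked() are made-up names mirroring the parser and the LINT1 masking condition in the patch.

```c
#include <assert.h>
#include <string.h>

#define APIC_EXTNMI_BSP		0	/* default */
#define APIC_EXTNMI_ALL		1
#define APIC_EXTNMI_NONE	2

/* model of the "apic_extnmi=" value parser; -1 means invalid */
static int parse_apic_extnmi(const char *arg)
{
	if (!arg)
		return -1;
	if (!strncmp(arg, "all", 3))
		return APIC_EXTNMI_ALL;
	if (!strncmp(arg, "none", 4))
		return APIC_EXTNMI_NONE;
	if (!strncmp(arg, "bsp", 3))
		return APIC_EXTNMI_BSP;
	return -1;
}

/* does this CPU receive external NMIs (LINT1 unmasked)?  Mirrors
 * the condition added to setup_local_APIC() */
static int lint1_unmasked(int cpu, int mode)
{
	return (cpu == 0 && mode != APIC_EXTNMI_NONE) ||
	       mode == APIC_EXTNMI_ALL;
}
```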
V6:
- Fix comments
- Change to use strncmp to parse the value of "apic_extnmi="
- Make apic_set_extnmi() return -EINVAL if the given value is invalid
V5:
- Rename the option from "noextnmi" to "apic_extnmi"
- Add apic_extnmi=all feature
- Fix the wrong documentation about "noextnmi" (apic_extnmi=none)
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Jonathan Corbet <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++++++++
arch/x86/include/asm/apic.h | 5 +++++
arch/x86/kernel/apic/apic.c | 35 +++++++++++++++++++++++++++++++++--
3 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d..74acea5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -472,6 +472,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
+ apic_extnmi= [APIC,X86] External NMI delivery setting
+ Format: { bsp (default) | all | none }
+ bsp: External NMI is delivered only to CPU 0
+ all: External NMIs are broadcast to all CPUs as a
+ backup of CPU 0
+ none: External NMI is masked for all CPUs. This is
+ useful so that a dump capture kernel won't be
+ shot down by NMI
+
autoconf= [IPV6]
See Documentation/networking/ipv6.txt.
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 7f62ad4..c80f6b6 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -23,6 +23,11 @@
#define APIC_VERBOSE 1
#define APIC_DEBUG 2
+/* Macros for apic_extnmi which controls external NMI masking */
+#define APIC_EXTNMI_BSP 0 /* Default */
+#define APIC_EXTNMI_ALL 1
+#define APIC_EXTNMI_NONE 2
+
/*
* Define the default level of output to be very little
* This can be turned up by using apic=verbose for more
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 8d7df74..162d2bd 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -82,6 +82,12 @@ physid_mask_t phys_cpu_present_map;
static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;
/*
+ * This variable controls which CPUs receive external NMIs. By default,
+ * external NMIs are delivered only to the BSP.
+ */
+static int apic_extnmi = APIC_EXTNMI_BSP;
+
+/*
* Map cpu index to physical APIC ID
*/
DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -1161,6 +1167,8 @@ void __init init_bsp_APIC(void)
value = APIC_DM_NMI;
if (!lapic_is_integrated()) /* 82489DX */
value |= APIC_LVT_LEVEL_TRIGGER;
+ if (apic_extnmi == APIC_EXTNMI_NONE)
+ value |= APIC_LVT_MASKED;
apic_write(APIC_LVT1, value);
}
@@ -1378,9 +1386,11 @@ void setup_local_APIC(void)
apic_write(APIC_LVT0, value);
/*
- * only the BP should see the LINT1 NMI signal, obviously.
+ * Only the BP sees the LINT1 NMI signal by default. This can be
+ * modified by apic_extnmi= boot option.
*/
- if (!cpu)
+ if ((!cpu && apic_extnmi != APIC_EXTNMI_NONE) ||
+ apic_extnmi == APIC_EXTNMI_ALL)
value = APIC_DM_NMI;
else
value = APIC_DM_NMI | APIC_LVT_MASKED;
@@ -2557,3 +2567,24 @@ static int __init apic_set_disabled_cpu_apicid(char *arg)
return 0;
}
early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
+
+static int __init apic_set_extnmi(char *arg)
+{
+ if (!arg)
+ return -EINVAL;
+
+ if (!strncmp("all", arg, 3))
+ apic_extnmi = APIC_EXTNMI_ALL;
+ else if (!strncmp("none", arg, 4))
+ apic_extnmi = APIC_EXTNMI_NONE;
+ else if (!strncmp("bsp", arg, 3))
+ apic_extnmi = APIC_EXTNMI_BSP;
+ else {
+ pr_warn("Unknown external NMI delivery mode `%s' ignored\n",
+ arg);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+early_param("apic_extnmi", apic_set_extnmi);
Now, multiple CPUs can receive an external NMI simultaneously by
specifying "apic_extnmi=all" as a boot option. When we take a
crash dump by using an external NMI with this option, we fail to save
register values into the crash dump. This happens as follows:
CPU 0 CPU 1
================================ =============================
receive an external NMI
default_do_nmi() receive an external NMI
spin_lock(&nmi_reason_lock) default_do_nmi()
io_check_error() spin_lock(&nmi_reason_lock)
panic() busy loop
...
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
busy loop...
Here, since CPU 1 is in NMI context, additional NMI from CPU 0
is blocked until CPU 1 executes IRET. However, CPU 1 never
executes IRET, so the NMI is not handled and the callback function
to save registers is never called.
To solve this issue, we check whether the IPI for crash dumping has
been issued while waiting for nmi_reason_lock to be released, and if
so, call its callback function directly. If the IPI has not been
issued (e.g. kdump is disabled), the actual behavior doesn't change.
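The trylock-plus-poll loop can be modeled in user space with pthreads. This is a simplified sketch: the callback stands in for crash_nmi_callback() (which never returns in the kernel), and made-up names are used throughout.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t nmi_reason_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int crash_ipi_issued;
static atomic_int callback_ran;

/* stands in for crash_nmi_callback(); the real one saves registers
 * and never returns */
static void run_crash_ipi_callback_model(void)
{
	if (atomic_load(&crash_ipi_issued))
		atomic_store(&callback_ran, 1);
}

/*
 * Model of the patched default_do_nmi() entry: instead of blocking
 * on nmi_reason_lock (which the crashing CPU may hold forever), keep
 * trying the lock and poll the crash-IPI flag in between.
 */
static void nmi_take_reason_lock(void)
{
	while (pthread_mutex_trylock(&nmi_reason_lock) != 0)
		run_crash_ipi_callback_model();
}
```

When nobody is crash-dumping, the poll is a no-op, so uncontended and lightly contended paths behave as before.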
V6:
- Separated from the former patch `panic/x86: Allow cpus to save
registers even if they are looping in NMI context'
- Fix comments
- Remove unneeded UP version of poll_crash_ipi_and_callback
- Rename poll_crash_ipi_and_callback to run_crash_ipi_callback
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
---
arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/nmi.c | 11 ++++++++++-
arch/x86/kernel/reboot.c | 18 +++++++++++++-----
3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index a82c4f1..2cb1cc2 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
void nmi_shootdown_cpus(nmi_shootdown_cb callback);
+void run_crash_ipi_callback(struct pt_regs *regs);
#endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 5e00de7..cbfa0b5 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -29,6 +29,7 @@
#include <asm/mach_traps.h>
#include <asm/nmi.h>
#include <asm/x86_init.h>
+#include <asm/reboot.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
}
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
- raw_spin_lock(&nmi_reason_lock);
+ /*
+ * Another CPU may be processing panic routines while holding
+ * nmi_reason_lock. Check if the CPU issued the IPI for crash
+ * dumping, and if so, call its callback directly. If there is
+ * no CPU preparing crash dump, we simply loop here without doing
+ * special things.
+ */
+ while (!raw_spin_trylock(&nmi_reason_lock))
+ run_crash_ipi_callback(regs);
reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 1da1302..60a216b 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -793,17 +793,25 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
}
+/*
+ * Wait for the crash dumping IPI to be issued, and then call its callback
+ * directly. This function is used when we are already in an NMI handler.
+ */
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+ if (crash_ipi_issued)
+ crash_nmi_callback(0, regs); /* Don't return */
+}
+
/* Override the weak function in kernel/panic.c */
void nmi_panic_self_stop(struct pt_regs *regs)
{
while (1) {
/*
- * Wait for the crash dumping IPI to be issued, and then
- * call its callback directly.
+ * If there is no CPU preparing crash dump, we simply loop
+ * here without doing special things.
*/
- if (READ_ONCE(crash_ipi_issued))
- crash_nmi_callback(0, regs); /* Don't return */
-
+ run_crash_ipi_callback(regs);
cpu_relax();
}
}
The kernel.panic_on_io_nmi sysctl was introduced by commit 5211a242d0cb
("x86: Add sysctl to allow panic on IOCK NMI error"), but its
documentation is missing, so add it.
V6:
- Newly added
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Andrew Morton <[email protected]>
---
Documentation/sysctl/kernel.txt | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index af70d15..235f804 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -551,6 +551,21 @@ the recommended setting is 60.
==============================================================
+panic_on_io_nmi:
+
+Controls the kernel's behavior when a CPU receives an NMI caused by
+an IO error.
+
+0: try to continue operation (default)
+
+1: panic immediately. The IO error triggered NMI indicates a serious
+ system condition, which could result in IO data corruption. Rather
+ than continuing, panicking might be a better choice. Some servers
+ issue this sort of NMI when the dump button is pushed, and you
+ can use this option to take a crash dump
+
+==============================================================
+
panic_on_oops:
Controls the kernel's behaviour when an oops or BUG is encountered.
Hi Hidehiro,
[auto build test ERROR on v4.4-rc4]
[also build test ERROR on next-20151209]
[cannot apply to tip/x86/core]
url: https://github.com/0day-ci/linux/commits/Hidehiro-Kawai/Fix-race-issues-among-panic-NMI-and-crash_kexec/20151210-095254
config: x86_64-randconfig-s4-12101030 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
arch/x86/built-in.o: In function `do_nmi':
>> (.text+0x7339): undefined reference to `run_crash_ipi_callback'
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Steven,
> From: Steven Rostedt [mailto:[email protected]]
> On Tue, Nov 24, 2015 at 11:48:53AM +0100, Borislav Petkov wrote:
> > > + */
> > > + while (!raw_spin_trylock(&nmi_reason_lock))
> > > + poll_crash_ipi_and_callback(regs);
> >
> > Waaait a minute: so if we're getting NMIs broadcasted on every core but
> > we're *not* crash dumping, we will run into here too. This can't be
> > right. :-\
>
> This only does something if crash_ipi_done is set, which means you are killing
> the box. But perhaps a comment that states that here would be useful, or maybe
> just put in the check here. There's no need to make it depend on SMP, as
> raw_spin_trylock() will turn to just ({1}) for UP, and that code wont even be
> hit.
It seems that poll_crash_ipi_and_callback (now renamed to run_crash_ipi_callback)
is referenced in the UP case if CONFIG_DEBUG_SPINLOCK=y, and that causes a build
error as below. run_crash_ipi_callback references crash_ipi_issued and
crash_nmi_callback, which are defined only if CONFIG_SMP=y, so we need to define
it separately for SMP and UP.
I'll resend this patch later.
> Hi Hidehiro,
>
> [auto build test ERROR on v4.4-rc4]
> [also build test ERROR on next-20151209]
> [cannot apply to tip/x86/core]
>
> url: https://github.com/0day-ci/linux/commits/Hidehiro-Kawai/Fix-race-issues-among-panic-NMI-and-crash_kexec/20151210-095254
> config: x86_64-randconfig-s4-12101030 (attached as .config)
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64
>
> All errors (new ones prefixed by >>):
>
> arch/x86/built-in.o: In function `do_nmi':
> >> (.text+0x7339): undefined reference to `run_crash_ipi_callback'
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
Now, multiple CPUs can receive an external NMI simultaneously by
specifying "apic_extnmi=all" as a boot option. When we take a
crash dump by using external NMI with this option, we fail to save
register values into the crash dump. This happens as follows:
CPU 0 CPU 1
================================ =============================
receive an external NMI
default_do_nmi() receive an external NMI
spin_lock(&nmi_reason_lock) default_do_nmi()
io_check_error() spin_lock(&nmi_reason_lock)
panic() busy loop
...
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
busy loop...
Here, since CPU 1 is in NMI context, additional NMI from CPU 0
is blocked until CPU 1 executes IRET. However, CPU 1 never
executes IRET, so the NMI is not handled and the callback function
to save registers is never called.
To solve this issue, we check if the IPI for crash dumping was
issued while waiting for nmi_reason_lock to be released, and if so,
call its callback function directly. If the IPI is not issued (e.g.
kdump is disabled), the actual behavior doesn't change.
V6.1:
- Reintroduce the UP version of run_crash_ipi_callback to fix build
error for CONFIG_SMP=n and CONFIG_DEBUG_SPINLOCK=y case
V6:
- Separated from the former patch `panic/x86: Allow cpus to save
registers even if they are looping in NMI context'
- Fix comments
- Remove unneeded UP version of poll_crash_ipi_and_callback
- Rename poll_crash_ipi_and_callback to run_crash_ipi_callback
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
---
arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/nmi.c | 11 ++++++++++-
arch/x86/kernel/reboot.c | 22 +++++++++++++++++-----
3 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index a82c4f1..2cb1cc2 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
void nmi_shootdown_cpus(nmi_shootdown_cb callback);
+void run_crash_ipi_callback(struct pt_regs *regs);
#endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 5e00de7..cbfa0b5 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -29,6 +29,7 @@
#include <asm/mach_traps.h>
#include <asm/nmi.h>
#include <asm/x86_init.h>
+#include <asm/reboot.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
}
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
- raw_spin_lock(&nmi_reason_lock);
+ /*
+ * Another CPU may be processing panic routines while holding
+ * nmi_reason_lock. Check if the CPU issued the IPI for crash
+ * dumping, and if so, call its callback directly. If there is
+ * no CPU preparing crash dump, we simply loop here without doing
+ * special things.
+ */
+ while (!raw_spin_trylock(&nmi_reason_lock))
+ run_crash_ipi_callback(regs);
reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 1da1302..8a184e3 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -793,17 +793,25 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
}
+/*
+ * Wait for the crash dumping IPI to be issued, and then call its callback
+ * directly. This function is used when we are already in an NMI handler.
+ */
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+ if (crash_ipi_issued)
+ crash_nmi_callback(0, regs); /* Don't return */
+}
+
/* Override the weak function in kernel/panic.c */
void nmi_panic_self_stop(struct pt_regs *regs)
{
while (1) {
/*
- * Wait for the crash dumping IPI to be issued, and then
- * call its callback directly.
+ * If there is no CPU preparing crash dump, we simply loop
+ * here without doing special things.
*/
- if (READ_ONCE(crash_ipi_issued))
- crash_nmi_callback(0, regs); /* Don't return */
-
+ run_crash_ipi_callback(regs);
cpu_relax();
}
}
@@ -813,4 +821,8 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
/* No other CPUs to shoot down */
}
+
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+}
#endif
On Thu, Dec 10, 2015 at 10:46:26AM +0900, Hidehiro Kawai wrote:
> If panic on NMI happens just after panic() on the same CPU, panic()
> is recursively called. As the result, it stalls after failing to
> acquire panic_lock.
>
> To avoid this problem, don't call panic() in NMI context if
> we've already entered panic().
>
> V6:
> - Add a comment about panic_cpu
> - Replace the magic number -1 for panic_cpu with a macro
>
> V4:
> - Improve comments in io_check_error() and panic()
>
> V3:
> - Introduce nmi_panic() macro to reduce code duplication
> - In the case of panic on NMI, don't return from NMI handlers
> if another CPU already panicked
>
> V2:
> - Use atomic_cmpxchg() instead of current spin_trylock() to
> exclude concurrent accesses to the panic routines
> - Don't introduce no-lock version of panic()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Michal Hocko <[email protected]>
> ---
> arch/x86/kernel/nmi.c | 16 ++++++++++++----
> include/linux/kernel.h | 21 +++++++++++++++++++++
> kernel/panic.c | 15 ++++++++++++---
> kernel/watchdog.c | 2 +-
> 4 files changed, 46 insertions(+), 8 deletions(-)
Looks better.
Did some comments cleanup and nmi_panic() macro reformatting so that it
is more readable and ended up applying this:
---
From: Hidehiro Kawai <[email protected]>
Date: Thu, 10 Dec 2015 10:46:26 +0900
Subject: [PATCH] panic, x86: Fix re-entrance problem due to panic on NMI
If panic on NMI happens just after panic() on the same CPU, panic() is
recursively called. Kernel stalls, as a result, after failing to acquire
panic_lock.
To avoid this problem, don't call panic() in NMI context if we've
already entered panic().
For that, introduce nmi_panic() macro to reduce code duplication. In
the case of panic on NMI, don't return from NMI handlers if another CPU
already panicked.
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gobinda Charan Maji <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Javi Merino <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: lkml <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Nicolas Iooss <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vivek Goyal <[email protected]>
Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/nmi.c | 16 ++++++++++++----
include/linux/kernel.h | 20 ++++++++++++++++++++
kernel/panic.c | 16 +++++++++++++---
kernel/watchdog.c | 2 +-
4 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 697f90db0e37..292a24bd0553 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
#endif
if (panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
@@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
reason, smp_processor_id());
show_regs(regs);
- if (panic_on_io_nmi)
- panic("NMI IOCK error: Not continuing");
+ if (panic_on_io_nmi) {
+ nmi_panic("NMI IOCK error: Not continuing");
+
+ /*
+ * If we end up here, it means we have received an NMI while
+ * processing panic(). Simply return without delaying and
+ * re-enabling NMI.
+ */
+ return;
+ }
/* Re-enable the IOCK line, wait for a few seconds */
reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
@@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
}
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 350dfb08aee3..750cc5c7c999 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -446,6 +446,26 @@ extern int sysctl_panic_on_stackoverflow;
extern bool crash_kexec_post_notifiers;
/*
+ * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
+ * holds a CPU number which is executing panic() currently. A value of
+ * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec().
+ */
+extern atomic_t panic_cpu;
+#define PANIC_CPU_INVALID -1
+
+/*
+ * A variant of panic() called from NMI context. We return if we've already
+ * panicked on this CPU.
+ */
+#define nmi_panic(fmt, ...) \
+do { \
+ int cpu = raw_smp_processor_id(); \
+ \
+ if (atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, cpu) != cpu) \
+ panic(fmt, ##__VA_ARGS__); \
+} while (0)
+
+/*
* Only to be used by arch init code. If the user over-wrote the default
* CONFIG_PANIC_TIMEOUT, honor it.
*/
diff --git a/kernel/panic.c b/kernel/panic.c
index 4b150bc0c6c1..3344524cf6ff 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}
+atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
*/
void panic(const char *fmt, ...)
{
- static DEFINE_SPINLOCK(panic_lock);
static char buf[1024];
va_list args;
long i, i_next = 0;
int state = 0;
+ int old_cpu, this_cpu;
/*
* Disable local interrupts. This will prevent panic_smp_self_stop
* from deadlocking the first cpu that invokes the panic, since
* there is nothing to prevent an interrupt handler (that runs
- * after the panic_lock is acquired) from invoking panic again.
+ * after setting panic_cpu) from invoking panic() again.
*/
local_irq_disable();
@@ -94,8 +96,16 @@ void panic(const char *fmt, ...)
* multiple parallel invocations of panic, all other CPUs either
* stop themself or will wait until they are stopped by the 1st CPU
* with smp_send_stop().
+ *
+ * `old_cpu == PANIC_CPU_INVALID' means this is the 1st CPU which
+ * comes here, so go ahead.
+ * `old_cpu == this_cpu' means we came from nmi_panic() which sets
+ * panic_cpu to this CPU. In this case, this is also the 1st CPU.
*/
- if (!spin_trylock(&panic_lock))
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
+
+ if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)
panic_smp_self_stop();
console_verbose();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18f34cf75f74..b9be18fae154 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
trigger_allbutself_cpu_backtrace();
if (hardlockup_panic)
- panic("Hard LOCKUP");
+ nmi_panic("Hard LOCKUP");
__this_cpu_write(hard_watchdog_warn, true);
return;
--
2.3.5
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
> From: Borislav Petkov [mailto:[email protected]]
[...]
> Looks better.
>
> Did some comments cleanup and nmi_panic() macro reformatting so that it
> is more readable and ended up applying this:
Thanks a lot!
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
On Thu, Dec 10, 2015 at 03:52:46PM +0900, Hidehiro Kawai wrote:
> Now, multiple CPUs can receive an external NMI simultaneously by
> specifying "apic_extnmi=all" as a boot option. When we take a
> crash dump by using external NMI with this option, we fail to save
> register values into the crash dump. This happens as follows:
>
> CPU 0 CPU 1
> ================================ =============================
> receive an external NMI
> default_do_nmi() receive an external NMI
> spin_lock(&nmi_reason_lock) default_do_nmi()
> io_check_error() spin_lock(&nmi_reason_lock)
> panic() busy loop
> ...
> kdump_nmi_shootdown_cpus()
> issue NMI IPI -----------> blocked until IRET
> busy loop...
>
> Here, since CPU 1 is in NMI context, additional NMI from CPU 0
> is blocked until CPU 1 executes IRET. However, CPU 1 never
> executes IRET, so the NMI is not handled and the callback function
> to save registers is never called.
>
> To solve this issue, we check if the IPI for crash dumping was
> issued while waiting for nmi_reason_lock to be released, and if so,
> call its callback function directly. If the IPI is not issued (e.g.
> kdump is disabled), the actual behavior doesn't change.
>
> V6.1:
> - Reintroduce the UP version of run_crash_ipi_callback to fix build
> error for CONFIG_SMP=n and CONFIG_DEBUG_SPINLOCK=y case
>
> V6:
> - Separated from the former patch `panic/x86: Allow cpus to save
> registers even if they are looping in NMI context'
> - Fix comments
> - Remove unneeded UP version of poll_crash_ipi_and_callback
> - Rename poll_crash_ipi_and_callback to run_crash_ipi_callback
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Michal Hocko <[email protected]>
> ---
> arch/x86/include/asm/reboot.h | 1 +
> arch/x86/kernel/nmi.c | 11 ++++++++++-
> arch/x86/kernel/reboot.c | 22 +++++++++++++++++-----
> 3 files changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
> index a82c4f1..2cb1cc2 100644
> --- a/arch/x86/include/asm/reboot.h
> +++ b/arch/x86/include/asm/reboot.h
> @@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
>
> typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
> void nmi_shootdown_cpus(nmi_shootdown_cb callback);
> +void run_crash_ipi_callback(struct pt_regs *regs);
>
> #endif /* _ASM_X86_REBOOT_H */
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 5e00de7..cbfa0b5 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -29,6 +29,7 @@
> #include <asm/mach_traps.h>
> #include <asm/nmi.h>
> #include <asm/x86_init.h>
> +#include <asm/reboot.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/nmi.h>
> @@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
> }
>
> /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
> - raw_spin_lock(&nmi_reason_lock);
> + /*
> + * Another CPU may be processing panic routines while holding
> + * nmi_reason_lock. Check if the CPU issued the IPI for crash
> + * dumping, and if so, call its callback directly. If there is
> + * no CPU preparing crash dump, we simply loop here without doing
> + * special things.
> + */
> + while (!raw_spin_trylock(&nmi_reason_lock))
> + run_crash_ipi_callback(regs);
Added cpu_relax() here too:
+ while (!raw_spin_trylock(&nmi_reason_lock)) {
+ run_crash_ipi_callback(regs);
+ cpu_relax();
+ }
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Thu, Dec 10, 2015 at 10:46:24AM +0900, Hidehiro Kawai wrote:
> Hidehiro Kawai (6):
> panic/x86: Fix re-entrance problem due to panic on NMI
> panic/x86: Allow CPUs to save registers even if they are looping in NMI context
> kexec: Fix race between panic() and crash_kexec() called directly
> x86/apic: Introduce apic_extnmi boot option
> x86/nmi: Fix to save registers for crash dump on external NMI broadcast
> Documentation: Add documentation for kernel.panic_on_io_nmi sysctl
>
>
> Documentation/kernel-parameters.txt | 9 +++++++++
> Documentation/sysctl/kernel.txt | 15 +++++++++++++++
> arch/x86/include/asm/apic.h | 5 +++++
> arch/x86/include/asm/reboot.h | 1 +
> arch/x86/kernel/apic/apic.c | 35 +++++++++++++++++++++++++++++++++--
> arch/x86/kernel/nmi.c | 27 ++++++++++++++++++++++-----
> arch/x86/kernel/reboot.c | 28 ++++++++++++++++++++++++++++
> include/linux/kernel.h | 29 +++++++++++++++++++++++++++++
> include/linux/kexec.h | 2 ++
> kernel/kexec_core.c | 30 +++++++++++++++++++++++++++++-
> kernel/panic.c | 29 ++++++++++++++++++++++++-----
> kernel/watchdog.c | 2 +-
> 12 files changed, 198 insertions(+), 14 deletions(-)
All applied,
thanks.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Commit-ID: 1717f2096b543cede7a380c858c765c41936bc35
Gitweb: http://git.kernel.org/tip/1717f2096b543cede7a380c858c765c41936bc35
Author: Hidehiro Kawai <[email protected]>
AuthorDate: Mon, 14 Dec 2015 11:19:09 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Sat, 19 Dec 2015 11:07:00 +0100
panic, x86: Fix re-entrance problem due to panic on NMI
If panic on NMI happens just after panic() on the same CPU, panic() is
recursively called. Kernel stalls, as a result, after failing to acquire
panic_lock.
To avoid this problem, don't call panic() in NMI context if we've
already entered panic().
For that, introduce nmi_panic() macro to reduce code duplication. In
the case of panic on NMI, don't return from NMI handlers if another CPU
already panicked.
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gobinda Charan Maji <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Javi Merino <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: lkml <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Nicolas Iooss <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vivek Goyal <[email protected]>
Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/nmi.c | 16 ++++++++++++----
include/linux/kernel.h | 20 ++++++++++++++++++++
kernel/panic.c | 16 +++++++++++++---
kernel/watchdog.c | 2 +-
4 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 697f90d..fca8793 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
#endif
if (panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
@@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
reason, smp_processor_id());
show_regs(regs);
- if (panic_on_io_nmi)
- panic("NMI IOCK error: Not continuing");
+ if (panic_on_io_nmi) {
+ nmi_panic("NMI IOCK error: Not continuing");
+
+ /*
+ * If we end up here, it means we have received an NMI while
+ * processing panic(). Simply return without delaying and
+ * re-enabling NMIs.
+ */
+ return;
+ }
/* Re-enable the IOCK line, wait for a few seconds */
reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
@@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
}
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 350dfb0..750cc5c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -446,6 +446,26 @@ extern int sysctl_panic_on_stackoverflow;
extern bool crash_kexec_post_notifiers;
/*
+ * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
+ * holds a CPU number which is executing panic() currently. A value of
+ * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec().
+ */
+extern atomic_t panic_cpu;
+#define PANIC_CPU_INVALID -1
+
+/*
+ * A variant of panic() called from NMI context. We return if we've already
+ * panicked on this CPU.
+ */
+#define nmi_panic(fmt, ...) \
+do { \
+ int cpu = raw_smp_processor_id(); \
+ \
+ if (atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, cpu) != cpu) \
+ panic(fmt, ##__VA_ARGS__); \
+} while (0)
+
+/*
* Only to be used by arch init code. If the user over-wrote the default
* CONFIG_PANIC_TIMEOUT, honor it.
*/
diff --git a/kernel/panic.c b/kernel/panic.c
index 4b150bc..3344524 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}
+atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
*/
void panic(const char *fmt, ...)
{
- static DEFINE_SPINLOCK(panic_lock);
static char buf[1024];
va_list args;
long i, i_next = 0;
int state = 0;
+ int old_cpu, this_cpu;
/*
* Disable local interrupts. This will prevent panic_smp_self_stop
* from deadlocking the first cpu that invokes the panic, since
* there is nothing to prevent an interrupt handler (that runs
- * after the panic_lock is acquired) from invoking panic again.
+ * after setting panic_cpu) from invoking panic() again.
*/
local_irq_disable();
@@ -94,8 +96,16 @@ void panic(const char *fmt, ...)
* multiple parallel invocations of panic, all other CPUs either
* stop themself or will wait until they are stopped by the 1st CPU
* with smp_send_stop().
+ *
+ * `old_cpu == PANIC_CPU_INVALID' means this is the 1st CPU which
+ * comes here, so go ahead.
+ * `old_cpu == this_cpu' means we came from nmi_panic() which sets
+ * panic_cpu to this CPU. In this case, this is also the 1st CPU.
*/
- if (!spin_trylock(&panic_lock))
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
+
+ if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)
panic_smp_self_stop();
console_verbose();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18f34cf..b9be18f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
trigger_allbutself_cpu_backtrace();
if (hardlockup_panic)
- panic("Hard LOCKUP");
+ nmi_panic("Hard LOCKUP");
__this_cpu_write(hard_watchdog_warn, true);
return;
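
The panic_cpu handover implemented above can be modeled in plain C as a
single-threaded userspace sketch (cmpxchg_panic_cpu() and
may_enter_panic() are hypothetical names standing in for
atomic_cmpxchg() and the entry check in panic()):

```c
#define PANIC_CPU_INVALID -1

static int panic_cpu = PANIC_CPU_INVALID;

/* Single-threaded stand-in for atomic_cmpxchg(&panic_cpu, old, new):
 * stores new only if the current value equals old, returns the old
 * value either way. */
static int cmpxchg_panic_cpu(int old, int new)
{
	int cur = panic_cpu;

	if (cur == old)
		panic_cpu = new;
	return cur;
}

/* Mirrors the entry check in panic(): the first CPU to claim panic_cpu
 * proceeds, as does the CPU that already claimed it via nmi_panic();
 * every other CPU must call panic_smp_self_stop(). Returns nonzero if
 * this CPU may proceed. */
int may_enter_panic(int this_cpu)
{
	int old_cpu = cmpxchg_panic_cpu(PANIC_CPU_INVALID, this_cpu);

	return old_cpu == PANIC_CPU_INVALID || old_cpu == this_cpu;
}
```

Note the `old_cpu == this_cpu` leg: it is what lets a CPU that entered
via nmi_panic() call panic() itself without self-stopping.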
Commit-ID: 58c5661f2144c089bbc2e5d87c9ec1dc1d2964fe
Gitweb: http://git.kernel.org/tip/58c5661f2144c089bbc2e5d87c9ec1dc1d2964fe
Author: Hidehiro Kawai <[email protected]>
AuthorDate: Mon, 14 Dec 2015 11:19:10 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Sat, 19 Dec 2015 11:07:01 +0100
panic, x86: Allow CPUs to save registers even if looping in NMI context
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
sends an NMI IPI to CPUs which haven't called panic() to stop them,
save their register information and do some cleanups for crash dumping.
However, if such a CPU is looping infinitely in NMI context, we fail to
save its register information into the crash dump.
For example, this can happen when unknown NMIs are broadcast to all
CPUs as follows:
CPU 0 CPU 1
=========================== ==========================
receive an unknown NMI
unknown_nmi_error()
panic() receive an unknown NMI
spin_trylock(&panic_lock) unknown_nmi_error()
crash_kexec() panic()
spin_trylock(&panic_lock)
panic_smp_self_stop()
infinite loop
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
infinite loop...
Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is
blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET,
so the NMI is not handled and the callback function to save registers is
never called.
In practice, this can happen on some servers which broadcast NMIs to all
CPUs when the NMI button is pushed.
To save registers in this case, we need to:
a) Return from NMI handler instead of looping infinitely
or
b) Call the callback function directly from the infinite loop
Inherently, a) is risky because NMI is also used to prevent corrupted
data from being propagated to devices. So, we chose b).
This patch does the following:
1. Move the infinite loop for CPUs which haven't called panic() in NMI
context (actually done by panic_smp_self_stop()) out of panic() so that
we can refer to pt_regs. Please note that panic_smp_self_stop() is
still used for normal context.
2. Call the callback of kdump_nmi_shootdown_cpus() directly to save
registers and do some cleanups after setting waiting_for_crash_ipi, which
is used to count down the number of CPUs that have handled the callback.
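
The resulting three-way dispatch in nmi_panic(regs, ...) can be modeled
in plain C (single-threaded userspace sketch; cmpxchg_int() and
nmi_panic_dispatch() are hypothetical names, and the enum encodes the
action the macro would take):

```c
#define PANIC_CPU_INVALID -1

enum nmi_panic_action { DO_PANIC, DO_SELF_STOP, DO_RETURN };

static int panic_cpu = PANIC_CPU_INVALID;

/* Single-threaded stand-in for atomic_cmpxchg(). */
static int cmpxchg_int(int *v, int old, int new)
{
	int cur = *v;

	if (cur == old)
		*v = new;
	return cur;
}

/* The first CPU panics; a different CPU loops in nmi_panic_self_stop(),
 * where arch code can save its registers for the crash dump; the
 * panicking CPU itself simply returns if it re-enters from NMI. */
enum nmi_panic_action nmi_panic_dispatch(int cpu)
{
	int old_cpu = cmpxchg_int(&panic_cpu, PANIC_CPU_INVALID, cpu);

	if (old_cpu == PANIC_CPU_INVALID)
		return DO_PANIC;	/* panic(fmt, ...) */
	else if (old_cpu != cpu)
		return DO_SELF_STOP;	/* nmi_panic_self_stop(regs) */
	return DO_RETURN;
}
```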
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Dave Young <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gobinda Charan Maji <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: Hidehiro Kawai <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Javi Merino <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: lkml <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Nicolas Iooss <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Stefan Lippers-Hollmann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/nmi.c | 6 +++---
arch/x86/kernel/reboot.c | 20 ++++++++++++++++++++
include/linux/kernel.h | 16 ++++++++++++----
kernel/panic.c | 9 +++++++++
kernel/watchdog.c | 2 +-
5 files changed, 45 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index fca8793..424aec4 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
#endif
if (panic_on_unrecovered_nmi)
- nmi_panic("NMI: Not continuing");
+ nmi_panic(regs, "NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
@@ -256,7 +256,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
show_regs(regs);
if (panic_on_io_nmi) {
- nmi_panic("NMI IOCK error: Not continuing");
+ nmi_panic(regs, "NMI IOCK error: Not continuing");
/*
* If we end up here, it means we have received an NMI while
@@ -305,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- nmi_panic("NMI: Not continuing");
+ nmi_panic(regs, "NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
}
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 02693dd..1da1302 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -718,6 +718,7 @@ static int crashing_cpu;
static nmi_shootdown_cb shootdown_callback;
static atomic_t waiting_for_crash_ipi;
+static int crash_ipi_issued;
static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
{
@@ -780,6 +781,9 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
smp_send_nmi_allbutself();
+ /* Kick CPUs looping in NMI context. */
+ WRITE_ONCE(crash_ipi_issued, 1);
+
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
@@ -788,6 +792,22 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
}
+
+/* Override the weak function in kernel/panic.c */
+void nmi_panic_self_stop(struct pt_regs *regs)
+{
+ while (1) {
+ /*
+ * Wait for the crash dumping IPI to be issued, and then
+ * call its callback directly.
+ */
+ if (READ_ONCE(crash_ipi_issued))
+ crash_nmi_callback(0, regs); /* Don't return */
+
+ cpu_relax();
+ }
+}
+
#else /* !CONFIG_SMP */
void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 750cc5c..7311c32 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -255,6 +255,7 @@ extern long (*panic_blink)(int state);
__printf(1, 2)
void panic(const char *fmt, ...)
__noreturn __cold;
+void nmi_panic_self_stop(struct pt_regs *);
extern void oops_enter(void);
extern void oops_exit(void);
void print_oops_end_marker(void);
@@ -455,14 +456,21 @@ extern atomic_t panic_cpu;
/*
* A variant of panic() called from NMI context. We return if we've already
- * panicked on this CPU.
+ * panicked on this CPU. If another CPU already panicked, loop in
+ * nmi_panic_self_stop() which can provide architecture dependent code such
+ * as saving register state for crash dump.
*/
-#define nmi_panic(fmt, ...) \
+#define nmi_panic(regs, fmt, ...) \
do { \
- int cpu = raw_smp_processor_id(); \
+ int old_cpu, cpu; \
\
- if (atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, cpu) != cpu) \
+ cpu = raw_smp_processor_id(); \
+ old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, cpu); \
+ \
+ if (old_cpu == PANIC_CPU_INVALID) \
panic(fmt, ##__VA_ARGS__); \
+ else if (old_cpu != cpu) \
+ nmi_panic_self_stop(regs); \
} while (0)
/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 3344524..06f31b49 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,15 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}
+/*
+ * Stop ourselves in NMI context if another CPU has already panicked. Arch code
+ * may override this to prepare for crash dumping, e.g. save regs info.
+ */
+void __weak nmi_panic_self_stop(struct pt_regs *regs)
+{
+ panic_smp_self_stop();
+}
+
atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);
/**
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b9be18f..84b5035 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
trigger_allbutself_cpu_backtrace();
if (hardlockup_panic)
- nmi_panic("Hard LOCKUP");
+ nmi_panic(regs, "Hard LOCKUP");
__this_cpu_write(hard_watchdog_warn, true);
return;
Commit-ID: 7bbee5ca3896f69f09c68be549cb8997abe6bca6
Gitweb: http://git.kernel.org/tip/7bbee5ca3896f69f09c68be549cb8997abe6bca6
Author: Hidehiro Kawai <[email protected]>
AuthorDate: Mon, 14 Dec 2015 11:19:11 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Sat, 19 Dec 2015 11:07:01 +0100
kexec: Fix race between panic() and crash_kexec()
Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other CPUs
CPU 1:
panic()
crash_kexec()
mutex_trylock() // failed to acquire
smp_send_stop() // stop other CPUs
infinite loop
If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.
In another case:
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
<NMI>
io_check_error()
panic()
crash_kexec()
mutex_trylock() // failed to acquire
infinite loop
Clearly, this is an undesirable result.
To fix this problem, this patch changes crash_kexec() to exclude others
by using the panic_cpu atomic.
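
The claim-and-release pattern this patch adds around __crash_kexec() can
be modeled in plain C (single-threaded userspace sketch; the *_model
names are hypothetical, and the counter stands in for the real kexec
jump, which would not return if a crash kernel is loaded):

```c
#define PANIC_CPU_INVALID -1

static int panic_cpu = PANIC_CPU_INVALID;
static int crash_kexec_runs;

/* Single-threaded stand-in for atomic_cmpxchg(). */
static int cmpxchg_int(int *v, int old, int new)
{
	int cur = *v;

	if (cur == old)
		*v = new;
	return cur;
}

static void __crash_kexec_model(void)
{
	crash_kexec_runs++;	/* stands in for machine_kexec() */
}

/* Only the CPU that wins panic_cpu runs __crash_kexec(); it then resets
 * panic_cpu so a later panic()/crash_kexec() call can still proceed
 * (e.g. when no crash kernel was loaded). Losing CPUs just return. */
void crash_kexec_model(int this_cpu)
{
	int old_cpu = cmpxchg_int(&panic_cpu, PANIC_CPU_INVALID, this_cpu);

	if (old_cpu == PANIC_CPU_INVALID) {
		__crash_kexec_model();
		panic_cpu = PANIC_CPU_INVALID;	/* atomic_set() */
	}
}

/* Sequential scenario: first call claims and releases the gate, so a
 * second call from another CPU can run as well. */
int crash_kexec_scenario(void)
{
	int ok;

	crash_kexec_model(0);
	ok = (crash_kexec_runs == 1) && (panic_cpu == PANIC_CPU_INVALID);
	crash_kexec_model(1);
	return ok && crash_kexec_runs == 2;
}
```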
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Martin Schwidefsky <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Minfei Huang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/20151210014630.25437.94161.stgit@softrs
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/kexec.h | 2 ++
kernel/kexec_core.c | 30 +++++++++++++++++++++++++++++-
kernel/panic.c | 8 ++++++--
3 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d140b1e..7b68d27 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -237,6 +237,7 @@ extern int kexec_purgatory_get_set_symbol(struct kimage *image,
unsigned int size, bool get_value);
extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
const char *name);
+extern void __crash_kexec(struct pt_regs *);
extern void crash_kexec(struct pt_regs *);
int kexec_should_crash(struct task_struct *);
void crash_save_cpu(struct pt_regs *regs, int cpu);
@@ -332,6 +333,7 @@ int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
#else /* !CONFIG_KEXEC_CORE */
struct pt_regs;
struct task_struct;
+static inline void __crash_kexec(struct pt_regs *regs) { }
static inline void crash_kexec(struct pt_regs *regs) { }
static inline int kexec_should_crash(struct task_struct *p) { return 0; }
#define kexec_in_progress false
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 11b64a6..c823f30 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -853,7 +853,12 @@ struct kimage *kexec_image;
struct kimage *kexec_crash_image;
int kexec_load_disabled;
-void crash_kexec(struct pt_regs *regs)
+/*
+ * No panic_cpu check version of crash_kexec(). This function is called
+ * only when panic_cpu holds the current CPU number; this is the only CPU
+ * which processes crash_kexec routines.
+ */
+void __crash_kexec(struct pt_regs *regs)
{
/* Take the kexec_mutex here to prevent sys_kexec_load
* running on one cpu from replacing the crash kernel
@@ -876,6 +881,29 @@ void crash_kexec(struct pt_regs *regs)
}
}
+void crash_kexec(struct pt_regs *regs)
+{
+ int old_cpu, this_cpu;
+
+ /*
+ * Only one CPU is allowed to execute the crash_kexec() code as with
+ * panic(). Otherwise parallel calls of panic() and crash_kexec()
+ * may stop each other. To exclude them, we use panic_cpu here too.
+ */
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
+ if (old_cpu == PANIC_CPU_INVALID) {
+ /* This is the 1st CPU which comes here, so go ahead. */
+ __crash_kexec(regs);
+
+ /*
+ * Reset panic_cpu to allow another panic()/crash_kexec()
+ * call.
+ */
+ atomic_set(&panic_cpu, PANIC_CPU_INVALID);
+ }
+}
+
size_t crash_get_memory_size(void)
{
size_t size = 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 06f31b49..b333380 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -136,9 +136,11 @@ void panic(const char *fmt, ...)
* everything else.
* If we want to run this after calling panic_notifiers, pass
* the "crash_kexec_post_notifiers" option to the kernel.
+ *
+ * Bypass the panic_cpu check and call __crash_kexec directly.
*/
if (!crash_kexec_post_notifiers)
- crash_kexec(NULL);
+ __crash_kexec(NULL);
/*
* Note smp_send_stop is the usual smp shutdown function, which
@@ -161,9 +163,11 @@ void panic(const char *fmt, ...)
* panic_notifiers and dumping kmsg before kdump.
* Note: since some panic_notifiers can make crashed kernel
* more unstable, it can increase risks of the kdump failure too.
+ *
+ * Bypass the panic_cpu check and call __crash_kexec directly.
*/
if (crash_kexec_post_notifiers)
- crash_kexec(NULL);
+ __crash_kexec(NULL);
bust_spinlocks(0);
Commit-ID: b7c4948e9881fb38b048269f376fb4bf194ce24a
Gitweb: http://git.kernel.org/tip/b7c4948e9881fb38b048269f376fb4bf194ce24a
Author: Hidehiro Kawai <[email protected]>
AuthorDate: Mon, 14 Dec 2015 11:19:12 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Sat, 19 Dec 2015 11:07:01 +0100
x86/apic: Introduce apic_extnmi command line parameter
This patch introduces a command line parameter apic_extnmi:
apic_extnmi=( bsp|all|none )
The default value is "bsp" and this is the current behavior: only the
Boot-Strapping Processor receives an external NMI.
"all" allows external NMIs to be broadcast to all CPUs. This would
raise the success rate of panic on NMI when the BSP hangs in NMI context
or when the external NMI is swallowed by other NMI handlers on the BSP.
If you specify "none", no CPU receives external NMIs. This is useful for
the dump capture kernel so that it cannot be shot down by accidentally
pressing the external NMI button (on platforms which have one) while
saving a crash dump.
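
The three delivery modes and the resulting LINT1 masking decision can be
sketched in plain C (userspace model; parse_mode() and lint1_unmasked()
are hypothetical names mirroring apic_set_extnmi() and the LVT1 logic in
setup_local_APIC()):

```c
#include <string.h>

#define APIC_EXTNMI_BSP	 0	/* default */
#define APIC_EXTNMI_ALL	 1
#define APIC_EXTNMI_NONE 2

/* Maps the boot-option string to a delivery mode; anything else is
 * rejected with -22 (-EINVAL), matching the early_param handler. */
int parse_mode(const char *arg)
{
	if (!arg)
		return -22;
	if (!strncmp("all", arg, 3))
		return APIC_EXTNMI_ALL;
	if (!strncmp("none", arg, 4))
		return APIC_EXTNMI_NONE;
	if (!strncmp("bsp", arg, 3))
		return APIC_EXTNMI_BSP;
	return -22;
}

/* Mirrors the LVT1 decision: LINT1 (external NMI) stays unmasked on
 * CPU 0 unless "none" is set, and on every CPU when "all" is set. */
int lint1_unmasked(int cpu, int mode)
{
	return (cpu == 0 && mode != APIC_EXTNMI_NONE) ||
	       mode == APIC_EXTNMI_ALL;
}
```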
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Bandan Das <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: "Maciej W. Rozycki" <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ricardo Ribalda Delgado <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/20151210014632.25437.43778.stgit@softrs
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++++++++
arch/x86/include/asm/apic.h | 5 +++++
arch/x86/kernel/apic/apic.c | 34 ++++++++++++++++++++++++++++++++--
3 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d..74acea5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -472,6 +472,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
+ apic_extnmi= [APIC,X86] External NMI delivery setting
+ Format: { bsp (default) | all | none }
+ bsp: External NMI is delivered only to CPU 0
+ all: External NMIs are broadcast to all CPUs as a
+ backup of CPU 0
+ none: External NMI is masked for all CPUs. This is
+ useful so that a dump capture kernel won't be
+ shot down by NMI
+
autoconf= [IPV6]
See Documentation/networking/ipv6.txt.
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 7f62ad4..c80f6b6 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -23,6 +23,11 @@
#define APIC_VERBOSE 1
#define APIC_DEBUG 2
+/* Macros for apic_extnmi which controls external NMI masking */
+#define APIC_EXTNMI_BSP 0 /* Default */
+#define APIC_EXTNMI_ALL 1
+#define APIC_EXTNMI_NONE 2
+
/*
* Define the default level of output to be very little
* This can be turned up by using apic=verbose for more
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 8d7df74..8a5cdda 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -82,6 +82,12 @@ physid_mask_t phys_cpu_present_map;
static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;
/*
+ * This variable controls which CPUs receive external NMIs. By default,
+ * external NMIs are delivered only to the BSP.
+ */
+static int apic_extnmi = APIC_EXTNMI_BSP;
+
+/*
* Map cpu index to physical APIC ID
*/
DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -1161,6 +1167,8 @@ void __init init_bsp_APIC(void)
value = APIC_DM_NMI;
if (!lapic_is_integrated()) /* 82489DX */
value |= APIC_LVT_LEVEL_TRIGGER;
+ if (apic_extnmi == APIC_EXTNMI_NONE)
+ value |= APIC_LVT_MASKED;
apic_write(APIC_LVT1, value);
}
@@ -1378,9 +1386,11 @@ void setup_local_APIC(void)
apic_write(APIC_LVT0, value);
/*
- * only the BP should see the LINT1 NMI signal, obviously.
+ * Only the BSP sees the LINT1 NMI signal by default. This can be
+ * modified by apic_extnmi= boot option.
*/
- if (!cpu)
+ if ((!cpu && apic_extnmi != APIC_EXTNMI_NONE) ||
+ apic_extnmi == APIC_EXTNMI_ALL)
value = APIC_DM_NMI;
else
value = APIC_DM_NMI | APIC_LVT_MASKED;
@@ -2557,3 +2567,23 @@ static int __init apic_set_disabled_cpu_apicid(char *arg)
return 0;
}
early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
+
+static int __init apic_set_extnmi(char *arg)
+{
+ if (!arg)
+ return -EINVAL;
+
+ if (!strncmp("all", arg, 3))
+ apic_extnmi = APIC_EXTNMI_ALL;
+ else if (!strncmp("none", arg, 4))
+ apic_extnmi = APIC_EXTNMI_NONE;
+ else if (!strncmp("bsp", arg, 3))
+ apic_extnmi = APIC_EXTNMI_BSP;
+ else {
+ pr_warn("Unknown external NMI delivery mode `%s' ignored\n", arg);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+early_param("apic_extnmi", apic_set_extnmi);
Commit-ID: b279d67df88a49c6ca32b3eebd195660254be394
Gitweb: http://git.kernel.org/tip/b279d67df88a49c6ca32b3eebd195660254be394
Author: Hidehiro Kawai <[email protected]>
AuthorDate: Mon, 14 Dec 2015 11:19:13 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Sat, 19 Dec 2015 11:07:01 +0100
x86/nmi: Save regs in crash dump on external NMI
Now, multiple CPUs can receive an external NMI simultaneously by
specifying the "apic_extnmi=all" command line parameter. When we take
a crash dump by using an external NMI with this option, we fail to save
registers into the crash dump. This happens as follows:
CPU 0 CPU 1
================================ =============================
receive an external NMI
default_do_nmi() receive an external NMI
spin_lock(&nmi_reason_lock) default_do_nmi()
io_check_error() spin_lock(&nmi_reason_lock)
panic() busy loop
...
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
busy loop...
Here, since CPU 1 is in NMI context, an additional NMI from CPU 0
remains unhandled until CPU 1 IRETs. However, CPU 1 will never execute
IRET so the NMI is not handled and the callback function to save
registers is never called.
To solve this issue, we check if the IPI for crash dumping was issued
while waiting for nmi_reason_lock to be released, and if so, call its
callback function directly. If the IPI is not issued (e.g. kdump is
disabled), the actual behavior doesn't change.
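
The trylock-with-fallback entry described above can be modeled in plain
C (single-threaded userspace sketch with hypothetical names; the real
code spins with cpu_relax() forever, and crash_nmi_callback() never
returns, so the sketch adds a bounded-spin escape hatch purely to stay
terminating):

```c
static int nmi_reason_lock;	/* 0 = free, 1 = held */
static int crash_ipi_issued;
static int callback_ran;

/* Single-threaded stand-in for raw_spin_trylock(). */
static int trylock(int *l)
{
	if (*l)
		return 0;
	*l = 1;
	return 1;
}

/* Stand-in for run_crash_ipi_callback(): only acts once the shootdown
 * CPU has issued the crash-dump IPI. */
static void run_crash_ipi_callback_model(void)
{
	if (crash_ipi_issued)
		callback_ran = 1;
}

/* Returns 1 if the lock was taken, 0 if the crash callback ran instead,
 * -1 if we gave up (sketch-only; the kernel keeps spinning). */
int nmi_reason_enter(int max_spins)
{
	while (!trylock(&nmi_reason_lock)) {
		run_crash_ipi_callback_model();
		if (callback_ran)
			return 0;
		if (--max_spins <= 0)
			return -1;
	}
	return 1;
}

/* Scenario: first NMI takes the lock; a second NMI finds it held, but
 * once the crash IPI is issued it runs the callback instead of waiting
 * for a release that will never come. */
int nmi_reason_scenario(void)
{
	int first = nmi_reason_enter(4);	/* lock free: taken */

	crash_ipi_issued = 1;			/* shootdown CPU fired the IPI */
	return first == 1 && nmi_reason_enter(4) == 0 && callback_ran;
}
```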
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stefan Lippers-Hollmann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/20151210065245.4587.39316.stgit@softrs
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/nmi.c | 16 ++++++++++++++--
arch/x86/kernel/reboot.c | 24 +++++++++++++++++-------
3 files changed, 32 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index a82c4f1..2cb1cc2 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
void nmi_shootdown_cpus(nmi_shootdown_cb callback);
+void run_crash_ipi_callback(struct pt_regs *regs);
#endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 424aec4..8a2cdd7 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -29,6 +29,7 @@
#include <asm/mach_traps.h>
#include <asm/nmi.h>
#include <asm/x86_init.h>
+#include <asm/reboot.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -356,8 +357,19 @@ static void default_do_nmi(struct pt_regs *regs)
return;
}
- /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
- raw_spin_lock(&nmi_reason_lock);
+ /*
+ * Non-CPU-specific NMI: NMI sources can be processed on any CPU.
+ *
+ * Another CPU may be processing panic routines while holding
+ * nmi_reason_lock. Check if the CPU issued the IPI for crash dumping,
+ * and if so, call its callback directly. If there is no CPU preparing
+ * crash dump, we simply loop here.
+ */
+ while (!raw_spin_trylock(&nmi_reason_lock)) {
+ run_crash_ipi_callback(regs);
+ cpu_relax();
+ }
+
reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 1da1302..d64889a 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -793,17 +793,23 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
}
+/*
+ * Check if the crash dumping IPI got issued and if so, call its callback
+ * directly. This function is used when we have already been in NMI handler.
+ * It doesn't return.
+ */
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+ if (crash_ipi_issued)
+ crash_nmi_callback(0, regs);
+}
+
/* Override the weak function in kernel/panic.c */
void nmi_panic_self_stop(struct pt_regs *regs)
{
while (1) {
- /*
- * Wait for the crash dumping IPI to be issued, and then
- * call its callback directly.
- */
- if (READ_ONCE(crash_ipi_issued))
- crash_nmi_callback(0, regs); /* Don't return */
-
+ /* If no CPU is preparing crash dump, we simply loop here. */
+ run_crash_ipi_callback(regs);
cpu_relax();
}
}
@@ -813,4 +819,8 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
/* No other CPUs to shoot down */
}
+
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+}
#endif
Commit-ID: 9f318e3fcb1d4c48c26e8ca2ff2a459b82f36a23
Gitweb: http://git.kernel.org/tip/9f318e3fcb1d4c48c26e8ca2ff2a459b82f36a23
Author: Hidehiro Kawai <[email protected]>
AuthorDate: Mon, 14 Dec 2015 11:19:14 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Sat, 19 Dec 2015 11:07:01 +0100
Documentation: Document kernel.panic_on_io_nmi sysctl
The kernel.panic_on_io_nmi sysctl was introduced by commit
5211a242d0cb ("x86: Add sysctl to allow panic on IOCK NMI error"),
but its documentation was missing. So, add it.
Signed-off-by: Hidehiro Kawai <[email protected]>
Requested-by: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Heinrich Schuchardt <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Manfred Spraul <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nicolas Iooss <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/20151210014637.25437.71903.stgit@softrs
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/sysctl/kernel.txt | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index af70d15..73c6b1e 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -551,6 +551,21 @@ the recommended setting is 60.
==============================================================
+panic_on_io_nmi:
+
+Controls the kernel's behavior when a CPU receives an NMI caused by
+an IO error.
+
+0: try to continue operation (default)
+
+1: panic immediately. The IO error triggered an NMI. This indicates a
+ serious system condition which could result in IO data corruption.
+ Rather than continuing, panicking might be a better choice. Some
+ servers issue this sort of NMI when the dump button is pushed,
+ and you can use this option to take a crash dump.
+
+==============================================================
+
panic_on_oops:
Controls the kernel's behaviour when an oops or BUG is encountered.