When HA clustering software or an administrator detects that a host
is unresponsive, it issues an NMI to the host to stop all current
work and take a crash dump. If the kernel has already panicked or is
capturing a crash dump at that time, a further NMI can cause the
crash dump to fail.
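Such an NMI is typically injected via the BMC's diagnostic interrupt
on bare metal, or via libvirt for a KVM guest. Illustrative commands
(not part of this patch set):

    ipmitool chassis power diag    # bare metal, via IPMI
    virsh inject-nmi <domain>      # KVM guest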
Also, crash_kexec() called from oops context and panic() can race
with each other.
To solve these issues, this patch set does the following:
- Don't call panic() on NMI if the kernel has already panicked
- Extend the exclusion control currently done by panic_lock to
  crash_kexec()
- Introduce the "apic_extnmi=none" boot option, which masks external
  NMIs at boot time
Additionally, "apic_extnmi=all" is provided. This option unmasks
external NMIs for all CPUs. This helps trigger a kernel panic even if
CPU 0 can't handle an external NMI because it is hung up in NMI
context or the NMI is swallowed by other NMI handlers.
This patch set can be applied to the current -tip tree.
V5:
- Use WRITE_ONCE() for crash_ipi_done to keep the instruction order
(PATCH 2/4)
- Address concurrent unknown/external NMI case, too (PATCH 2/4)
- Fix build errors (PATCH 3/4)
- Rename "noextnmi" boot option to "apic_extnmi" and expand its
feature (PATCH 4/4)
V4: https://lkml.org/lkml/2015/9/25/193
- Improve comments and descriptions (PATCH 1/4 to 3/4)
- Use new __crash_kexec(), no exclusion check version of crash_kexec(),
instead of checking if panic_cpu is the current cpu or not
(PATCH 3/4)
V3: https://lkml.org/lkml/2015/8/6/39
- Introduce nmi_panic() macro to reduce code duplication
- In the case of panic on NMI, don't return from NMI handlers
if another cpu already panicked
V2: https://lkml.org/lkml/2015/7/27/31
- Use atomic_cmpxchg() instead of current spin_trylock() to exclude
concurrent accesses to panic() and crash_kexec()
- Don't introduce no-lock version of panic() and crash_kexec()
V1: https://lkml.org/lkml/2015/7/22/81
---
Hidehiro Kawai (4):
panic/x86: Fix re-entrance problem due to panic on NMI
panic/x86: Allow cpus to save registers even if they are looping in NMI context
kexec: Fix race between panic() and crash_kexec() called directly
x86/apic: Introduce apic_extnmi boot option
Documentation/kernel-parameters.txt | 9 +++++++++
arch/x86/include/asm/apic.h | 5 +++++
arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
arch/x86/kernel/nmi.c | 27 ++++++++++++++++++++++-----
arch/x86/kernel/reboot.c | 28 ++++++++++++++++++++++++++++
include/linux/kernel.h | 21 +++++++++++++++++++++
include/linux/kexec.h | 2 ++
kernel/kexec_core.c | 26 +++++++++++++++++++++++++-
kernel/panic.c | 29 ++++++++++++++++++++++++-----
kernel/watchdog.c | 2 +-
11 files changed, 168 insertions(+), 13 deletions(-)
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
If a panic on NMI happens just after panic() on the same CPU,
panic() is called recursively. As a result, it stalls after failing
to acquire panic_lock.
To avoid this problem, don't call panic() in NMI context if we've
already entered panic().
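For illustration, here is a minimal userspace model of the exclusion
scheme introduced below; panic_cpu and the cmpxchg test mirror the
patch, while the demo harness around them is mine:

    /* Userspace sketch, not kernel code. */
    #include <stdio.h>

    static int panic_cpu = -1;  /* -1: nobody has panicked yet */

    /* Mimics nmi_panic(): returns 1 iff panic() should be called. */
    static int nmi_panic_allowed(int this_cpu)
    {
            int old = __sync_val_compare_and_swap(&panic_cpu, -1, this_cpu);

            /*
             * old == -1      : first to panic, go ahead
             * old == this_cpu: panic() is already running on this CPU
             *                  (NMI arrived during panic), just return
             */
            return old != this_cpu;
    }

    int main(void)
    {
            printf("CPU 1, first NMI : %d\n", nmi_panic_allowed(1)); /* 1 */
            printf("CPU 1, nested NMI: %d\n", nmi_panic_allowed(1)); /* 0 */
            return 0;
    }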
V4:
- Improve comments in io_check_error() and panic()
V3:
- Introduce nmi_panic() macro to reduce code duplication
- In the case of panic on NMI, don't return from NMI handlers
if another cpu already panicked
V2:
- Use atomic_cmpxchg() instead of current spin_trylock() to
exclude concurrent accesses to the panic routines
- Don't introduce no-lock version of panic()
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
---
arch/x86/kernel/nmi.c | 16 ++++++++++++----
include/linux/kernel.h | 13 +++++++++++++
kernel/panic.c | 15 ++++++++++++---
kernel/watchdog.c | 2 +-
4 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 697f90d..5131714 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
#endif
if (panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
@@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
reason, smp_processor_id());
show_regs(regs);
- if (panic_on_io_nmi)
- panic("NMI IOCK error: Not continuing");
+ if (panic_on_io_nmi) {
+ nmi_panic("NMI IOCK error: Not continuing");
+
+ /*
+ * If we return from nmi_panic(), it means we have received
+ * NMI while processing panic(). So, simply return without
+ * a delay and re-enabling NMI.
+ */
+ return;
+ }
/* Re-enable the IOCK line, wait for a few seconds */
reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
@@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ nmi_panic("NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
}
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 350dfb0..480a4fd 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -445,6 +445,19 @@ extern int sysctl_panic_on_stackoverflow;
extern bool crash_kexec_post_notifiers;
+extern atomic_t panic_cpu;
+
+/*
+ * A variant of panic() called from NMI context.
+ * If we've already panicked on this cpu, return from here.
+ */
+#define nmi_panic(fmt, ...) \
+ do { \
+ int this_cpu = raw_smp_processor_id(); \
+ if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
+ panic(fmt, ##__VA_ARGS__); \
+ } while (0)
+
/*
* Only to be used by arch init code. If the user over-wrote the default
* CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/panic.c b/kernel/panic.c
index 4579dbb..24ee2ea 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}
+atomic_t panic_cpu = ATOMIC_INIT(-1);
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
*/
void panic(const char *fmt, ...)
{
- static DEFINE_SPINLOCK(panic_lock);
static char buf[1024];
va_list args;
long i, i_next = 0;
int state = 0;
+ int old_cpu, this_cpu;
/*
* Disable local interrupts. This will prevent panic_smp_self_stop
* from deadlocking the first cpu that invokes the panic, since
* there is nothing to prevent an interrupt handler (that runs
- * after the panic_lock is acquired) from invoking panic again.
+ * after setting panic_cpu) from invoking panic again.
*/
local_irq_disable();
@@ -94,8 +96,15 @@ void panic(const char *fmt, ...)
* multiple parallel invocations of panic, all other CPUs either
* stop themself or will wait until they are stopped by the 1st CPU
* with smp_send_stop().
+ *
+ * `old_cpu == -1' means this is the 1st CPU which comes here, so
+ * go ahead.
+ * `old_cpu == this_cpu' means we came from nmi_panic() which sets
+ * panic_cpu to this cpu. In this case, this is also the 1st CPU.
*/
- if (!spin_trylock(&panic_lock))
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
+ if (old_cpu != -1 && old_cpu != this_cpu)
panic_smp_self_stop();
console_verbose();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18f34cf..b9be18f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
trigger_allbutself_cpu_backtrace();
if (hardlockup_panic)
- panic("Hard LOCKUP");
+ nmi_panic("Hard LOCKUP");
__this_cpu_write(hard_watchdog_warn, true);
return;
nmi_shootdown_cpus(), a subroutine of crash_kexec(), sends an NMI IPI
to non-panicking CPUs to stop them and save their register
information, and to do some cleanups for crash dumping. So if a
non-panicking CPU is looping infinitely in NMI context, we fail to
save its register information and that information is lost from the
crash dump.
`Infinite loop in NMI context' can happen:
a. when a CPU panics on NMI while another CPU is already processing
   panic()
b. when a CPU receives an external or unknown NMI while another
   CPU is processing a panic on NMI
In case a, the CPU loops in panic_smp_self_stop(). In case b, it
loops in raw_spin_lock() on nmi_reason_lock. Case b can happen on
some servers which broadcast NMIs to all CPUs when a dump button is
pushed, as sketched below.
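A sketch of case b, in the style of the timelines in PATCH 3/4
(my illustration):

CPU 0:                                CPU 1:
<external NMI (broadcast)>            <external NMI (broadcast)>
default_do_nmi()                      default_do_nmi()
  raw_spin_lock(&nmi_reason_lock)       raw_spin_lock(&nmi_reason_lock)
  panic() on NMI                          // spins: CPU 0 holds the lock
    crash_kexec()
      nmi_shootdown_cpus()
        // NMI IPI to CPU 1 is never
        // serviced: CPU 1 is already
        // in NMI context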
To save registers in these cases too, this patch does the following:
1. Move the `infinite loop in NMI context' (actually
   panic_smp_self_stop()) out of panic() so that we can refer to
   pt_regs there
2. Call the callback of nmi_shootdown_cpus() directly to save
   registers and do some cleanups, after setting waiting_for_crash_ipi
   which is used for counting down the number of CPUs which handled
   the callback
V5:
- Use WRITE_ONCE() when setting crash_ipi_done to 1 so that the
compiler doesn't change the instruction order
- Support the case of b in the above description
- Add poll_crash_ipi_and_callback()
V4:
- Rewrite the patch description
V3:
- Newly introduced
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Michal Hocko <[email protected]>
---
arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/nmi.c | 17 +++++++++++++----
arch/x86/kernel/reboot.c | 28 ++++++++++++++++++++++++++++
include/linux/kernel.h | 12 ++++++++++--
kernel/panic.c | 10 ++++++++++
kernel/watchdog.c | 2 +-
6 files changed, 63 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index a82c4f1..964e82f 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
void nmi_shootdown_cpus(nmi_shootdown_cb callback);
+void poll_crash_ipi_and_callback(struct pt_regs *regs);
#endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 5131714..74a1434 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -29,6 +29,7 @@
#include <asm/mach_traps.h>
#include <asm/nmi.h>
#include <asm/x86_init.h>
+#include <asm/reboot.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -231,7 +232,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
#endif
if (panic_on_unrecovered_nmi)
- nmi_panic("NMI: Not continuing");
+ nmi_panic(regs, "NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
@@ -256,7 +257,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
show_regs(regs);
if (panic_on_io_nmi) {
- nmi_panic("NMI IOCK error: Not continuing");
+ nmi_panic(regs, "NMI IOCK error: Not continuing");
/*
* If we return from nmi_panic(), it means we have received
@@ -305,7 +306,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- nmi_panic("NMI: Not continuing");
+ nmi_panic(regs, "NMI: Not continuing");
pr_emerg("Dazed and confused, but trying to continue\n");
}
@@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
}
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
- raw_spin_lock(&nmi_reason_lock);
+
+ /*
+ * Another CPU may be processing panic routines with holding
+ * nmi_reason_lock. Check IPI issuance from the panicking CPU
+ * and call the callback directly.
+ */
+ while (!raw_spin_trylock(&nmi_reason_lock))
+ poll_crash_ipi_and_callback(regs);
+
reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 02693dd..44c5f5b 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -718,6 +718,7 @@ static int crashing_cpu;
static nmi_shootdown_cb shootdown_callback;
static atomic_t waiting_for_crash_ipi;
+static int crash_ipi_done;
static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
{
@@ -780,6 +781,9 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
smp_send_nmi_allbutself();
+ /* Kick cpus looping in nmi context. */
+ WRITE_ONCE(crash_ipi_done, 1);
+
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
@@ -788,9 +792,33 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
}
+
+/*
+ * Wait for the timing of IPI for crash dumping, and then call its callback
+ * directly. This function is used when we have already been in NMI handler.
+ */
+void poll_crash_ipi_and_callback(struct pt_regs *regs)
+{
+ if (crash_ipi_done)
+ crash_nmi_callback(0, regs); /* Shouldn't return */
+}
+
+/* Override the weak function in kernel/panic.c */
+void nmi_panic_self_stop(struct pt_regs *regs)
+{
+ while (1) {
+ poll_crash_ipi_and_callback(regs);
+ cpu_relax();
+ }
+}
+
#else /* !CONFIG_SMP */
void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
/* No other CPUs to shoot down */
}
+
+void poll_crash_ipi_and_callback(struct pt_regs *regs)
+{
+}
#endif
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 480a4fd..728a31b 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -255,6 +255,7 @@ extern long (*panic_blink)(int state);
__printf(1, 2)
void panic(const char *fmt, ...)
__noreturn __cold;
+void nmi_panic_self_stop(struct pt_regs *);
extern void oops_enter(void);
extern void oops_exit(void);
void print_oops_end_marker(void);
@@ -450,12 +451,19 @@ extern atomic_t panic_cpu;
/*
* A variant of panic() called from NMI context.
* If we've already panicked on this cpu, return from here.
+ * If another cpu already panicked, loop in nmi_panic_self_stop() which
+ * can provide architecture dependent code such as saving register states
+ * for crash dump.
*/
-#define nmi_panic(fmt, ...) \
+#define nmi_panic(regs, fmt, ...) \
do { \
+ int old_cpu; \
int this_cpu = raw_smp_processor_id(); \
- if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
+ old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu); \
+ if (old_cpu == -1) \
panic(fmt, ##__VA_ARGS__); \
+ else if (old_cpu != this_cpu) \
+ nmi_panic_self_stop(regs); \
} while (0)
/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 24ee2ea..4fce2be 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,16 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}
+/*
+ * Stop ourself in NMI context if another cpu has already panicked.
+ * Architecture code may override this to prepare for crash dumping
+ * (e.g. save register information).
+ */
+void __weak nmi_panic_self_stop(struct pt_regs *regs)
+{
+ panic_smp_self_stop();
+}
+
atomic_t panic_cpu = ATOMIC_INIT(-1);
/**
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b9be18f..84b5035 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
trigger_allbutself_cpu_backtrace();
if (hardlockup_panic)
- nmi_panic("Hard LOCKUP");
+ nmi_panic(regs, "Hard LOCKUP");
__this_cpu_write(hard_watchdog_warn, true);
return;
Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other cpus
CPU 1:
panic()
crash_kexec()
mutex_trylock() // failed to acquire
smp_send_stop() // stop other cpus
infinite loop
If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.
In another case:
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
<NMI>
io_check_error()
panic()
crash_kexec()
mutex_trylock() // failed to acquire
infinite loop
Clearly, this is an undesirable result.
To fix this problem, this patch changes crash_kexec() to exclude
other CPUs by using the atomic variable panic_cpu.
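With this change, the first scenario above becomes (sketch):

CPU 0:                              CPU 1:
oops_end()                          panic()
  crash_kexec()                       atomic_cmpxchg(&panic_cpu)
    atomic_cmpxchg(&panic_cpu)          // fails, CPU 0 won the race
      // succeeds                     panic_smp_self_stop()
    __crash_kexec()
      nmi_shootdown_cpus()            // stops CPU 1 via NMI and
                                      // saves its registers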
V5:
- Add missing dummy __crash_kexec() for !CONFIG_KEXEC_CORE case
- Replace atomic_xchg() with atomic_set() in crash_kexec() because
  it is used as a release operation and there is no need for a memory
  barrier effect. This change also removes an unused-value warning
V4:
- Use new __crash_kexec(), no exclusion check version of crash_kexec(),
instead of checking if panic_cpu is the current cpu or not
V2:
- Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
to exclude concurrent accesses
- Don't introduce no-lock version of crash_kexec()
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/kexec.h | 2 ++
kernel/kexec_core.c | 26 +++++++++++++++++++++++++-
kernel/panic.c | 4 ++--
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d140b1e..7b68d27 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -237,6 +237,7 @@ extern int kexec_purgatory_get_set_symbol(struct kimage *image,
unsigned int size, bool get_value);
extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
const char *name);
+extern void __crash_kexec(struct pt_regs *);
extern void crash_kexec(struct pt_regs *);
int kexec_should_crash(struct task_struct *);
void crash_save_cpu(struct pt_regs *regs, int cpu);
@@ -332,6 +333,7 @@ int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
#else /* !CONFIG_KEXEC_CORE */
struct pt_regs;
struct task_struct;
+static inline void __crash_kexec(struct pt_regs *regs) { }
static inline void crash_kexec(struct pt_regs *regs) { }
static inline int kexec_should_crash(struct task_struct *p) { return 0; }
#define kexec_in_progress false
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 11b64a6..9d097f5 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -853,7 +853,8 @@ struct kimage *kexec_image;
struct kimage *kexec_crash_image;
int kexec_load_disabled;
-void crash_kexec(struct pt_regs *regs)
+/* No panic_cpu check version of crash_kexec */
+void __crash_kexec(struct pt_regs *regs)
{
/* Take the kexec_mutex here to prevent sys_kexec_load
* running on one cpu from replacing the crash kernel
@@ -876,6 +877,29 @@ void crash_kexec(struct pt_regs *regs)
}
}
+void crash_kexec(struct pt_regs *regs)
+{
+ int old_cpu, this_cpu;
+
+ /*
+ * Only one CPU is allowed to execute the crash_kexec() code as with
+ * panic(). Otherwise parallel calls of panic() and crash_kexec()
+ * may stop each other. To exclude them, we use panic_cpu here too.
+ */
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
+ if (old_cpu == -1) {
+ /* This is the 1st CPU which comes here, so go ahead. */
+ __crash_kexec(regs);
+
+ /*
+ * Reset panic_cpu to allow another panic()/crash_kexec()
+ * call.
+ */
+ atomic_set(&panic_cpu, -1);
+ }
+}
+
size_t crash_get_memory_size(void)
{
size_t size = 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 4fce2be..5d0b807 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -138,7 +138,7 @@ void panic(const char *fmt, ...)
* the "crash_kexec_post_notifiers" option to the kernel.
*/
if (!crash_kexec_post_notifiers)
- crash_kexec(NULL);
+ __crash_kexec(NULL);
/*
* Note smp_send_stop is the usual smp shutdown function, which
@@ -163,7 +163,7 @@ void panic(const char *fmt, ...)
* more unstable, it can increase risks of the kdump failure too.
*/
if (crash_kexec_post_notifiers)
- crash_kexec(NULL);
+ __crash_kexec(NULL);
bust_spinlocks(0);
This patch introduces a new boot option, apic_extnmi:
    apic_extnmi={ bsp | all | none }
The default value is "bsp", which is the current behavior: only the
BSP receives external NMIs. "all" allows external NMIs to be
broadcast to all CPUs. This raises the success rate of panic on NMI
when the BSP hangs up in NMI context or the external NMI is swallowed
by other NMI handlers on the BSP. If "none" is specified, no CPU
receives external NMIs. This is useful for the dump capture kernel,
so that it won't be shot down by an NMI while saving a crash dump.
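For example, a kdump setup might pass (illustrative command lines,
not part of this patch):

    first kernel  : ... apic_extnmi=all crashkernel=256M ...
    capture kernel: ... apic_extnmi=none ...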
V5:
- Rename the option from "noextnmi" to "apic_extnmi"
- Add apic_extnmi=all feature
- Fix the wrong documentation about "noextnmi" (apic_extnmi=none)
Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Jonathan Corbet <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++++++++
arch/x86/include/asm/apic.h | 5 +++++
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
3 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f8aae63..ceed3bc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -472,6 +472,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
+ apic_extnmi= [APIC,X86] External NMI delivery setting
+ Format: { bsp (default) | all | none }
+ bsp: External NMI is delivered to only CPU 0
+ all: External NMIs are broadcast to all CPUs as a
+ backup of CPU 0
+ none: External NMI is masked for all CPUs. This is
+ useful so that a dump capture kernel won't be
+ shot down by NMI
+
autoconf= [IPV6]
See Documentation/networking/ipv6.txt.
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 7f62ad4..c80f6b6 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -23,6 +23,11 @@
#define APIC_VERBOSE 1
#define APIC_DEBUG 2
+/* Macros for apic_extnmi which controls external NMI masking */
+#define APIC_EXTNMI_BSP 0 /* Default */
+#define APIC_EXTNMI_ALL 1
+#define APIC_EXTNMI_NONE 2
+
/*
* Define the default level of output to be very little
* This can be turned up by using apic=verbose for more
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 2f69e3b..a2a8074 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -82,6 +82,12 @@ physid_mask_t phys_cpu_present_map;
static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;
/*
+ * This variable controls which CPUs receive external NMIs. By default,
+ * external NMIs are delivered to only BSP.
+ */
+static int apic_extnmi = APIC_EXTNMI_BSP;
+
+/*
* Map cpu index to physical APIC ID
*/
DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -1161,6 +1167,8 @@ void __init init_bsp_APIC(void)
value = APIC_DM_NMI;
if (!lapic_is_integrated()) /* 82489DX */
value |= APIC_LVT_LEVEL_TRIGGER;
+ if (apic_extnmi == APIC_EXTNMI_NONE)
+ value |= APIC_LVT_MASKED;
apic_write(APIC_LVT1, value);
}
@@ -1380,7 +1388,8 @@ void setup_local_APIC(void)
/*
* only the BP should see the LINT1 NMI signal, obviously.
*/
- if (!cpu)
+ if ((!cpu && apic_extnmi != APIC_EXTNMI_NONE) ||
+ apic_extnmi == APIC_EXTNMI_ALL)
value = APIC_DM_NMI;
else
value = APIC_DM_NMI | APIC_LVT_MASKED;
@@ -2548,3 +2557,23 @@ static int __init apic_set_disabled_cpu_apicid(char *arg)
return 0;
}
early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
+
+static int __init apic_set_extnmi(char *arg)
+{
+ if (!arg)
+ return -EINVAL;
+
+ if (strcmp("all", arg) == 0)
+ apic_extnmi = APIC_EXTNMI_ALL;
+ else if (strcmp("none", arg) == 0)
+ apic_extnmi = APIC_EXTNMI_NONE;
+ else if (strcmp("bsp", arg) == 0)
+ apic_extnmi = APIC_EXTNMI_BSP;
+ else {
+ pr_warn("Unknown external NMI delivery mode `%s' is ignored\n",
+ arg);
+ }
+
+ return 0;
+}
+early_param("apic_extnmi", apic_set_extnmi);
On Fri, Nov 20, 2015 at 06:36:44PM +0900, Hidehiro Kawai wrote:
> If panic on NMI happens just after panic() on the same CPU, panic()
> is recursively called. As the result, it stalls after failing to
> acquire panic_lock.
>
> To avoid this problem, don't call panic() in NMI context if
> we've already entered panic().
>
> V4:
> - Improve comments in io_check_error() and panic()
>
> V3:
> - Introduce nmi_panic() macro to reduce code duplication
> - In the case of panic on NMI, don't return from NMI handlers
> if another cpu already panicked
>
> V2:
> - Use atomic_cmpxchg() instead of current spin_trylock() to
> exclude concurrent accesses to the panic routines
> - Don't introduce no-lock version of panic()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Michal Hocko <[email protected]>
> ---
> arch/x86/kernel/nmi.c | 16 ++++++++++++----
> include/linux/kernel.h | 13 +++++++++++++
> kernel/panic.c | 15 ++++++++++++---
> kernel/watchdog.c | 2 +-
> 4 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 697f90d..5131714 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
> #endif
>
> if (panic_on_unrecovered_nmi)
> - panic("NMI: Not continuing");
> + nmi_panic("NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
>
> @@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
> reason, smp_processor_id());
> show_regs(regs);
>
> - if (panic_on_io_nmi)
> - panic("NMI IOCK error: Not continuing");
> + if (panic_on_io_nmi) {
> + nmi_panic("NMI IOCK error: Not continuing");
Btw, that panic_on_io_nmi seems undocumented in
Documentation/sysctl/kernel.txt. Care to document it, please, as a
separate patch?
> +
> + /*
> + * If we return from nmi_panic(), it means we have received
> + * NMI while processing panic(). So, simply return without
> + * a delay and re-enabling NMI.
> + */
> + return;
> + }
>
> /* Re-enable the IOCK line, wait for a few seconds */
> reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
> @@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
>
> pr_emerg("Do you have a strange power saving mode enabled?\n");
> if (unknown_nmi_panic || panic_on_unrecovered_nmi)
> - panic("NMI: Not continuing");
> + nmi_panic("NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
> }
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 350dfb0..480a4fd 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -445,6 +445,19 @@ extern int sysctl_panic_on_stackoverflow;
>
> extern bool crash_kexec_post_notifiers;
>
> +extern atomic_t panic_cpu;
This needs a comment explaining what this variable is and what it
denotes.
> +
> +/*
> + * A variant of panic() called from NMI context.
> + * If we've already panicked on this cpu, return from here.
> + */
> +#define nmi_panic(fmt, ...) \
> + do { \
> + int this_cpu = raw_smp_processor_id(); \
> + if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> + panic(fmt, ##__VA_ARGS__); \
> + } while (0)
> +
> /*
> * Only to be used by arch init code. If the user over-wrote the default
> * CONFIG_PANIC_TIMEOUT, honor it.
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 4579dbb..24ee2ea 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
> cpu_relax();
> }
>
> +atomic_t panic_cpu = ATOMIC_INIT(-1);
> +
> /**
> * panic - halt the system
> * @fmt: The text string to print
> @@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
> */
> void panic(const char *fmt, ...)
> {
> - static DEFINE_SPINLOCK(panic_lock);
> static char buf[1024];
> va_list args;
> long i, i_next = 0;
> int state = 0;
> + int old_cpu, this_cpu;
>
> /*
> * Disable local interrupts. This will prevent panic_smp_self_stop
> * from deadlocking the first cpu that invokes the panic, since
> * there is nothing to prevent an interrupt handler (that runs
> - * after the panic_lock is acquired) from invoking panic again.
> + * after setting panic_cpu) from invoking panic again.
> */
> local_irq_disable();
>
> @@ -94,8 +96,15 @@ void panic(const char *fmt, ...)
> * multiple parallel invocations of panic, all other CPUs either
> * stop themself or will wait until they are stopped by the 1st CPU
> * with smp_send_stop().
> + *
> + * `old_cpu == -1' means this is the 1st CPU which comes here, so
> + * go ahead.
> + * `old_cpu == this_cpu' means we came from nmi_panic() which sets
> + * panic_cpu to this cpu. In this case, this is also the 1st CPU.
I'd prefer that -1 to be
#define INVALID_CPU_NUM -1
or
#define UNDEFINED_CPU_NUM -1
or so instead of a naked number.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Hi,
> On Fri, Nov 20, 2015 at 06:36:44PM +0900, Hidehiro Kawai wrote:
> > If panic on NMI happens just after panic() on the same CPU, panic()
> > is recursively called. As the result, it stalls after failing to
> > acquire panic_lock.
> >
> > To avoid this problem, don't call panic() in NMI context if
> > we've already entered panic().
> >
> > V4:
> > - Improve comments in io_check_error() and panic()
> >
> > V3:
> > - Introduce nmi_panic() macro to reduce code duplication
> > - In the case of panic on NMI, don't return from NMI handlers
> > if another cpu already panicked
> >
> > V2:
> > - Use atomic_cmpxchg() instead of current spin_trylock() to
> > exclude concurrent accesses to the panic routines
> > - Don't introduce no-lock version of panic()
> >
> > Signed-off-by: Hidehiro Kawai <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > ---
> > arch/x86/kernel/nmi.c | 16 ++++++++++++----
> > include/linux/kernel.h | 13 +++++++++++++
> > kernel/panic.c | 15 ++++++++++++---
> > kernel/watchdog.c | 2 +-
> > 4 files changed, 38 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> > index 697f90d..5131714 100644
> > --- a/arch/x86/kernel/nmi.c
> > +++ b/arch/x86/kernel/nmi.c
> > @@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
> > #endif
> >
> > if (panic_on_unrecovered_nmi)
> > - panic("NMI: Not continuing");
> > + nmi_panic("NMI: Not continuing");
> >
> > pr_emerg("Dazed and confused, but trying to continue\n");
> >
> > @@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
> > reason, smp_processor_id());
> > show_regs(regs);
> >
> > - if (panic_on_io_nmi)
> > - panic("NMI IOCK error: Not continuing");
> > + if (panic_on_io_nmi) {
> > + nmi_panic("NMI IOCK error: Not continuing");
>
> Btw, that panic_on_io_nmi seems undocumented in
> Documentation/sysctl/kernel.txt. Care to document it, please, as a
> separate patch?
Sure. I'll post a documentation patch for it separately.
Because panic_on_io_nmi has been available for a relatively long
time, I didn't think to check whether it was documented.
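A possible entry might look like this (my wording, to be refined in
the actual patch):

panic_on_io_nmi:

Controls the kernel's behaviour when a CPU receives an NMI caused by
an IO error.

0: try to continue operation (default)
1: panic immediately. The IO error triggered an NMI. This indicates a
   serious system condition which could result in IO data corruption.
   Rather than continuing, panicking might be a better choice.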
> > +
> > + /*
> > + * If we return from nmi_panic(), it means we have received
> > + * NMI while processing panic(). So, simply return without
> > + * a delay and re-enabling NMI.
> > + */
> > + return;
> > + }
> >
> > /* Re-enable the IOCK line, wait for a few seconds */
> > reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
> > @@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
> >
> > pr_emerg("Do you have a strange power saving mode enabled?\n");
> > if (unknown_nmi_panic || panic_on_unrecovered_nmi)
> > - panic("NMI: Not continuing");
> > + nmi_panic("NMI: Not continuing");
> >
> > pr_emerg("Dazed and confused, but trying to continue\n");
> > }
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index 350dfb0..480a4fd 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -445,6 +445,19 @@ extern int sysctl_panic_on_stackoverflow;
> >
> > extern bool crash_kexec_post_notifiers;
> >
> > +extern atomic_t panic_cpu;
>
> This needs a comment explaining what this variable is and what it
> denotes.
OK, I'll do that in the next version.
>
> > +
> > +/*
> > + * A variant of panic() called from NMI context.
> > + * If we've already panicked on this cpu, return from here.
> > + */
> > +#define nmi_panic(fmt, ...) \
> > + do { \
> > + int this_cpu = raw_smp_processor_id(); \
> > + if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> > + panic(fmt, ##__VA_ARGS__); \
> > + } while (0)
> > +
> > /*
> > * Only to be used by arch init code. If the user over-wrote the default
> > * CONFIG_PANIC_TIMEOUT, honor it.
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index 4579dbb..24ee2ea 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
> > cpu_relax();
> > }
> >
> > +atomic_t panic_cpu = ATOMIC_INIT(-1);
> > +
> > /**
> > * panic - halt the system
> > * @fmt: The text string to print
> > @@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
> > */
> > void panic(const char *fmt, ...)
> > {
> > - static DEFINE_SPINLOCK(panic_lock);
> > static char buf[1024];
> > va_list args;
> > long i, i_next = 0;
> > int state = 0;
> > + int old_cpu, this_cpu;
> >
> > /*
> > * Disable local interrupts. This will prevent panic_smp_self_stop
> > * from deadlocking the first cpu that invokes the panic, since
> > * there is nothing to prevent an interrupt handler (that runs
> > - * after the panic_lock is acquired) from invoking panic again.
> > + * after setting panic_cpu) from invoking panic again.
> > */
> > local_irq_disable();
> >
> > @@ -94,8 +96,15 @@ void panic(const char *fmt, ...)
> > * multiple parallel invocations of panic, all other CPUs either
> > * stop themself or will wait until they are stopped by the 1st CPU
> > * with smp_send_stop().
> > + *
> > + * `old_cpu == -1' means this is the 1st CPU which comes here, so
> > + * go ahead.
> > + * `old_cpu == this_cpu' means we came from nmi_panic() which sets
> > + * panic_cpu to this cpu. In this case, this is also the 1st CPU.
>
> I'd prefer that -1 to be
>
> #define INVALID_CPU_NUM -1
>
> or
>
> #define UNDEFINED_CPU_NUM -1
>
> or so instead of a naked number.
OK, I'll use a macro.
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
On Fri, Nov 20, 2015 at 06:36:46PM +0900, Hidehiro Kawai wrote:
> nmi_shootdown_cpus(), a subroutine of crash_kexec(), sends NMI IPI
> to non-panic cpus to stop them while saving their register
...to stop them and save their register...
> information and doing some cleanups for crash dumping. So if a
> non-panic cpus is infinitely looping in NMI context, we fail to
That should be CPU. Please use "CPU" instead of "cpu" in all your text
in your next submission.
> save its register information and lose the information from the
> crash dump.
>
> `Infinite loop in NMI context' can happen:
>
> a. when a cpu panics on NMI while another cpu is processing panic
> b. when a cpu received an external or unknown NMI while another
> cpu is processing panic on NMI
>
> In the case of a, it loops in panic_smp_self_stop(). In the case
> of b, it loops in raw_spin_lock() of nmi_reason_lock.
Please describe those two cases more verbosely - it takes slow people
like me a while to figure out what exactly can happen.
> This can
> happen on some servers which broadcasts NMIs to all CPUs when a dump
> button is pushed.
>
> To save registers in these case too, this patch does following things:
>
> 1. Move the timing of `infinite loop in NMI context' (actually
> done by panic_smp_self_stop()) outside of panic() to enable us to
> refer pt_regs
I can't parse that sentence. And I really tried :-\
panic_smp_self_stop() is still in panic().
> 2. call a callback of nmi_shootdown_cpus() directly to save
> registers and do some cleanups after setting waiting_for_crash_ipi
> which is used for counting down the number of cpus which handled
> the callback
>
> V5:
> - Use WRITE_ONCE() when setting crash_ipi_done to 1 so that the
> compiler doesn't change the instruction order
> - Support the case of b in the above description
> - Add poll_crash_ipi_and_callback()
>
> V4:
> - Rewrite the patch description
>
> V3:
> - Newly introduced
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Eric Biederman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: Michal Hocko <[email protected]>
> ---
> arch/x86/include/asm/reboot.h | 1 +
> arch/x86/kernel/nmi.c | 17 +++++++++++++----
> arch/x86/kernel/reboot.c | 28 ++++++++++++++++++++++++++++
> include/linux/kernel.h | 12 ++++++++++--
> kernel/panic.c | 10 ++++++++++
> kernel/watchdog.c | 2 +-
> 6 files changed, 63 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
> index a82c4f1..964e82f 100644
> --- a/arch/x86/include/asm/reboot.h
> +++ b/arch/x86/include/asm/reboot.h
> @@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
>
> typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
> void nmi_shootdown_cpus(nmi_shootdown_cb callback);
> +void poll_crash_ipi_and_callback(struct pt_regs *regs);
>
> #endif /* _ASM_X86_REBOOT_H */
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 5131714..74a1434 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -29,6 +29,7 @@
> #include <asm/mach_traps.h>
> #include <asm/nmi.h>
> #include <asm/x86_init.h>
> +#include <asm/reboot.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/nmi.h>
> @@ -231,7 +232,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
> #endif
>
> if (panic_on_unrecovered_nmi)
> - nmi_panic("NMI: Not continuing");
> + nmi_panic(regs, "NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
>
> @@ -256,7 +257,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
> show_regs(regs);
>
> if (panic_on_io_nmi) {
> - nmi_panic("NMI IOCK error: Not continuing");
> + nmi_panic(regs, "NMI IOCK error: Not continuing");
>
> /*
> * If we return from nmi_panic(), it means we have received
> @@ -305,7 +306,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
>
> pr_emerg("Do you have a strange power saving mode enabled?\n");
> if (unknown_nmi_panic || panic_on_unrecovered_nmi)
> - nmi_panic("NMI: Not continuing");
> + nmi_panic(regs, "NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
> }
> @@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
> }
>
> /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
> - raw_spin_lock(&nmi_reason_lock);
> +
> + /*
> + * Another CPU may be processing panic routines with holding
while
> + * nmi_reason_lock. Check IPI issuance from the panicking CPU
> + * and call the callback directly.
This is one strange sentence. Does it mean something like:
"Check if the panicking CPU issued the IPI and, if so, call the crash
callback directly."
?
> + */
> + while (!raw_spin_trylock(&nmi_reason_lock))
> + poll_crash_ipi_and_callback(regs);
Waaait a minute: so if we're getting NMIs broadcasted on every core but
we're *not* crash dumping, we will run into here too. This can't be
right. :-\
> +
> reason = x86_platform.get_nmi_reason();
>
> if (reason & NMI_REASON_MASK) {
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 02693dd..44c5f5b 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -718,6 +718,7 @@ static int crashing_cpu;
> static nmi_shootdown_cb shootdown_callback;
>
> static atomic_t waiting_for_crash_ipi;
> +static int crash_ipi_done;
>
> static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
> {
> @@ -780,6 +781,9 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
>
> smp_send_nmi_allbutself();
>
> + /* Kick cpus looping in nmi context. */
> + WRITE_ONCE(crash_ipi_done, 1);
> +
> msecs = 1000; /* Wait at most a second for the other cpus to stop */
> while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
> mdelay(1);
> @@ -788,9 +792,33 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
>
> /* Leave the nmi callback set */
> }
> +
> +/*
> + * Wait for the timing of IPI for crash dumping, and then call its callback
"Wait for the crash dumping IPI to complete... "
> + * directly. This function is used when we have already been in NMI handler.
> + */
> +void poll_crash_ipi_and_callback(struct pt_regs *regs)
Why "poll"? We won't return from crash_nmi_callback() if we're not the
crashing CPU.
> +{
> + if (crash_ipi_done)
> + crash_nmi_callback(0, regs); /* Shouldn't return */
> +}
> +
> +/* Override the weak function in kernel/panic.c */
> +void nmi_panic_self_stop(struct pt_regs *regs)
> +{
> + while (1) {
> + poll_crash_ipi_and_callback(regs);
> + cpu_relax();
> + }
> +}
> +
> #else /* !CONFIG_SMP */
> void nmi_shootdown_cpus(nmi_shootdown_cb callback)
> {
> /* No other CPUs to shoot down */
> }
> +
> +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> +{
> +}
> #endif
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 480a4fd..728a31b 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -255,6 +255,7 @@ extern long (*panic_blink)(int state);
> __printf(1, 2)
> void panic(const char *fmt, ...)
> __noreturn __cold;
> +void nmi_panic_self_stop(struct pt_regs *);
> extern void oops_enter(void);
> extern void oops_exit(void);
> void print_oops_end_marker(void);
> @@ -450,12 +451,19 @@ extern atomic_t panic_cpu;
> /*
> * A variant of panic() called from NMI context.
> * If we've already panicked on this cpu, return from here.
> + * If another cpu already panicked, loop in nmi_panic_self_stop() which
> + * can provide architecture dependent code such as saving register states
> + * for crash dump.
> */
> -#define nmi_panic(fmt, ...) \
> +#define nmi_panic(regs, fmt, ...) \
> do { \
> + int old_cpu; \
> int this_cpu = raw_smp_processor_id(); \
> - if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu); \
> + if (old_cpu == -1) \
> panic(fmt, ##__VA_ARGS__); \
> + else if (old_cpu != this_cpu) \
> + nmi_panic_self_stop(regs); \
Same here - this is assuming that broadcasting NMIs to all cores
automatically means there's a crash dump in progress:
nmi_panic_self_stop() -> poll_crash_ipi_and_callback()
and this cannot be right.
> } while (0)
>
> /*
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 24ee2ea..4fce2be 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -61,6 +61,16 @@ void __weak panic_smp_self_stop(void)
> cpu_relax();
> }
>
> +/*
> + * Stop ourself in NMI context if another cpu has already panicked.
"ourselves"
> + * Architecture code may override this to prepare for crash dumping
> + * (e.g. save register information).
> + */
> +void __weak nmi_panic_self_stop(struct pt_regs *regs)
> +{
> + panic_smp_self_stop();
> +}
> +
> atomic_t panic_cpu = ATOMIC_INIT(-1);
>
> /**
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index b9be18f..84b5035 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> trigger_allbutself_cpu_backtrace();
>
> if (hardlockup_panic)
> - nmi_panic("Hard LOCKUP");
> + nmi_panic(regs, "Hard LOCKUP");
>
> __this_cpu_write(hard_watchdog_warn, true);
> return;
>
>
>
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Fri 20-11-15 18:36:44, Hidehiro Kawai wrote:
> If panic on NMI happens just after panic() on the same CPU, panic()
> is recursively called. As the result, it stalls after failing to
> acquire panic_lock.
>
> To avoid this problem, don't call panic() in NMI context if
> we've already entered panic().
>
> V4:
> - Improve comments in io_check_error() and panic()
>
> V3:
> - Introduce nmi_panic() macro to reduce code duplication
> - In the case of panic on NMI, don't return from NMI handlers
> if another cpu already panicked
>
> V2:
> - Use atomic_cmpxchg() instead of current spin_trylock() to
> exclude concurrent accesses to the panic routines
> - Don't introduce no-lock version of panic()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Michal Hocko <[email protected]>
I've finally seen testing results for these patches and managed to look
at them again.
Acked-by: Michal Hocko <[email protected]>
> ---
> arch/x86/kernel/nmi.c | 16 ++++++++++++----
> include/linux/kernel.h | 13 +++++++++++++
> kernel/panic.c | 15 ++++++++++++---
> kernel/watchdog.c | 2 +-
> 4 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 697f90d..5131714 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
> #endif
>
> if (panic_on_unrecovered_nmi)
> - panic("NMI: Not continuing");
> + nmi_panic("NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
>
> @@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
> reason, smp_processor_id());
> show_regs(regs);
>
> - if (panic_on_io_nmi)
> - panic("NMI IOCK error: Not continuing");
> + if (panic_on_io_nmi) {
> + nmi_panic("NMI IOCK error: Not continuing");
> +
> + /*
> + * If we return from nmi_panic(), it means we have received
> + * NMI while processing panic(). So, simply return without
> + * a delay and re-enabling NMI.
> + */
> + return;
> + }
>
> /* Re-enable the IOCK line, wait for a few seconds */
> reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
> @@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
>
> pr_emerg("Do you have a strange power saving mode enabled?\n");
> if (unknown_nmi_panic || panic_on_unrecovered_nmi)
> - panic("NMI: Not continuing");
> + nmi_panic("NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
> }
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 350dfb0..480a4fd 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -445,6 +445,19 @@ extern int sysctl_panic_on_stackoverflow;
>
> extern bool crash_kexec_post_notifiers;
>
> +extern atomic_t panic_cpu;
> +
> +/*
> + * A variant of panic() called from NMI context.
> + * If we've already panicked on this cpu, return from here.
> + */
> +#define nmi_panic(fmt, ...) \
> + do { \
> + int this_cpu = raw_smp_processor_id(); \
> + if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> + panic(fmt, ##__VA_ARGS__); \
> + } while (0)
> +
> /*
> * Only to be used by arch init code. If the user over-wrote the default
> * CONFIG_PANIC_TIMEOUT, honor it.
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 4579dbb..24ee2ea 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
> cpu_relax();
> }
>
> +atomic_t panic_cpu = ATOMIC_INIT(-1);
> +
> /**
> * panic - halt the system
> * @fmt: The text string to print
> @@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
> */
> void panic(const char *fmt, ...)
> {
> - static DEFINE_SPINLOCK(panic_lock);
> static char buf[1024];
> va_list args;
> long i, i_next = 0;
> int state = 0;
> + int old_cpu, this_cpu;
>
> /*
> * Disable local interrupts. This will prevent panic_smp_self_stop
> * from deadlocking the first cpu that invokes the panic, since
> * there is nothing to prevent an interrupt handler (that runs
> - * after the panic_lock is acquired) from invoking panic again.
> + * after setting panic_cpu) from invoking panic again.
> */
> local_irq_disable();
>
> @@ -94,8 +96,15 @@ void panic(const char *fmt, ...)
> * multiple parallel invocations of panic, all other CPUs either
> * stop themself or will wait until they are stopped by the 1st CPU
> * with smp_send_stop().
> + *
> + * `old_cpu == -1' means this is the 1st CPU which comes here, so
> + * go ahead.
> + * `old_cpu == this_cpu' means we came from nmi_panic() which sets
> + * panic_cpu to this cpu. In this case, this is also the 1st CPU.
> */
> - if (!spin_trylock(&panic_lock))
> + this_cpu = raw_smp_processor_id();
> + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
> + if (old_cpu != -1 && old_cpu != this_cpu)
> panic_smp_self_stop();
>
> console_verbose();
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 18f34cf..b9be18f 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> trigger_allbutself_cpu_backtrace();
>
> if (hardlockup_panic)
> - panic("Hard LOCKUP");
> + nmi_panic("Hard LOCKUP");
>
> __this_cpu_write(hard_watchdog_warn, true);
> return;
>
--
Michal Hocko
SUSE Labs
On Fri 20-11-15 18:36:46, Hidehiro Kawai wrote:
> nmi_shootdown_cpus(), a subroutine of crash_kexec(), sends NMI IPI
> to non-panic cpus to stop them while saving their register
> information and doing some cleanups for crash dumping. So if a
> non-panic cpus is infinitely looping in NMI context, we fail to
> save its register information and lose the information from the
> crash dump.
>
> `Infinite loop in NMI context' can happen:
>
> a. when a cpu panics on NMI while another cpu is processing panic
> b. when a cpu received an external or unknown NMI while another
> cpu is processing panic on NMI
>
> In the case of a, it loops in panic_smp_self_stop(). In the case
> of b, it loops in raw_spin_lock() of nmi_reason_lock. This can
> happen on some servers which broadcasts NMIs to all CPUs when a dump
> button is pushed.
>
> To save registers in these case too, this patch does following things:
>
> 1. Move the timing of `infinite loop in NMI context' (actually
> done by panic_smp_self_stop()) outside of panic() to enable us to
> refer pt_regs
> 2. call a callback of nmi_shootdown_cpus() directly to save
> registers and do some cleanups after setting waiting_for_crash_ipi
> which is used for counting down the number of cpus which handled
> the callback
>
> V5:
> - Use WRITE_ONCE() when setting crash_ipi_done to 1 so that the
> compiler doesn't change the instruction order
> - Support the case of b in the above description
> - Add poll_crash_ipi_and_callback()
>
> V4:
> - Rewrite the patch description
>
> V3:
> - Newly introduced
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Eric Biederman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: Michal Hocko <[email protected]>
Yes this seems correct
Acked-by: Michal Hocko <[email protected]>
> ---
> arch/x86/include/asm/reboot.h | 1 +
> arch/x86/kernel/nmi.c | 17 +++++++++++++----
> arch/x86/kernel/reboot.c | 28 ++++++++++++++++++++++++++++
> include/linux/kernel.h | 12 ++++++++++--
> kernel/panic.c | 10 ++++++++++
> kernel/watchdog.c | 2 +-
> 6 files changed, 63 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
> index a82c4f1..964e82f 100644
> --- a/arch/x86/include/asm/reboot.h
> +++ b/arch/x86/include/asm/reboot.h
> @@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
>
> typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
> void nmi_shootdown_cpus(nmi_shootdown_cb callback);
> +void poll_crash_ipi_and_callback(struct pt_regs *regs);
>
> #endif /* _ASM_X86_REBOOT_H */
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 5131714..74a1434 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -29,6 +29,7 @@
> #include <asm/mach_traps.h>
> #include <asm/nmi.h>
> #include <asm/x86_init.h>
> +#include <asm/reboot.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/nmi.h>
> @@ -231,7 +232,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
> #endif
>
> if (panic_on_unrecovered_nmi)
> - nmi_panic("NMI: Not continuing");
> + nmi_panic(regs, "NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
>
> @@ -256,7 +257,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
> show_regs(regs);
>
> if (panic_on_io_nmi) {
> - nmi_panic("NMI IOCK error: Not continuing");
> + nmi_panic(regs, "NMI IOCK error: Not continuing");
>
> /*
> * If we return from nmi_panic(), it means we have received
> @@ -305,7 +306,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
>
> pr_emerg("Do you have a strange power saving mode enabled?\n");
> if (unknown_nmi_panic || panic_on_unrecovered_nmi)
> - nmi_panic("NMI: Not continuing");
> + nmi_panic(regs, "NMI: Not continuing");
>
> pr_emerg("Dazed and confused, but trying to continue\n");
> }
> @@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
> }
>
> /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
> - raw_spin_lock(&nmi_reason_lock);
> +
> + /*
> + * Another CPU may be processing panic routines with holding
> + * nmi_reason_lock. Check IPI issuance from the panicking CPU
> + * and call the callback directly.
> + */
> + while (!raw_spin_trylock(&nmi_reason_lock))
> + poll_crash_ipi_and_callback(regs);
> +
> reason = x86_platform.get_nmi_reason();
>
> if (reason & NMI_REASON_MASK) {
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 02693dd..44c5f5b 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -718,6 +718,7 @@ static int crashing_cpu;
> static nmi_shootdown_cb shootdown_callback;
>
> static atomic_t waiting_for_crash_ipi;
> +static int crash_ipi_done;
>
> static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
> {
> @@ -780,6 +781,9 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
>
> smp_send_nmi_allbutself();
>
> + /* Kick cpus looping in nmi context. */
> + WRITE_ONCE(crash_ipi_done, 1);
> +
> msecs = 1000; /* Wait at most a second for the other cpus to stop */
> while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
> mdelay(1);
> @@ -788,9 +792,33 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
>
> /* Leave the nmi callback set */
> }
> +
> +/*
> + * Wait for the timing of IPI for crash dumping, and then call its callback
> + * directly. This function is used when we have already been in NMI handler.
> + */
> +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> +{
> + if (crash_ipi_done)
> + crash_nmi_callback(0, regs); /* Shouldn't return */
> +}
> +
> +/* Override the weak function in kernel/panic.c */
> +void nmi_panic_self_stop(struct pt_regs *regs)
> +{
> + while (1) {
> + poll_crash_ipi_and_callback(regs);
> + cpu_relax();
> + }
> +}
> +
> #else /* !CONFIG_SMP */
> void nmi_shootdown_cpus(nmi_shootdown_cb callback)
> {
> /* No other CPUs to shoot down */
> }
> +
> +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> +{
> +}
> #endif
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 480a4fd..728a31b 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -255,6 +255,7 @@ extern long (*panic_blink)(int state);
> __printf(1, 2)
> void panic(const char *fmt, ...)
> __noreturn __cold;
> +void nmi_panic_self_stop(struct pt_regs *);
> extern void oops_enter(void);
> extern void oops_exit(void);
> void print_oops_end_marker(void);
> @@ -450,12 +451,19 @@ extern atomic_t panic_cpu;
> /*
> * A variant of panic() called from NMI context.
> * If we've already panicked on this cpu, return from here.
> + * If another cpu already panicked, loop in nmi_panic_self_stop() which
> + * can provide architecture dependent code such as saving register states
> + * for crash dump.
> */
> -#define nmi_panic(fmt, ...) \
> +#define nmi_panic(regs, fmt, ...) \
> do { \
> + int old_cpu; \
> int this_cpu = raw_smp_processor_id(); \
> - if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu); \
> + if (old_cpu == -1) \
> panic(fmt, ##__VA_ARGS__); \
> + else if (old_cpu != this_cpu) \
> + nmi_panic_self_stop(regs); \
> } while (0)
>
> /*
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 24ee2ea..4fce2be 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -61,6 +61,16 @@ void __weak panic_smp_self_stop(void)
> cpu_relax();
> }
>
> +/*
> + * Stop ourself in NMI context if another cpu has already panicked.
> + * Architecture code may override this to prepare for crash dumping
> + * (e.g. save register information).
> + */
> +void __weak nmi_panic_self_stop(struct pt_regs *regs)
> +{
> + panic_smp_self_stop();
> +}
> +
> atomic_t panic_cpu = ATOMIC_INIT(-1);
>
> /**
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index b9be18f..84b5035 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -351,7 +351,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> trigger_allbutself_cpu_backtrace();
>
> if (hardlockup_panic)
> - nmi_panic("Hard LOCKUP");
> + nmi_panic(regs, "Hard LOCKUP");
>
> __this_cpu_write(hard_watchdog_warn, true);
> return;
>
--
Michal Hocko
SUSE Labs
On Fri 20-11-15 18:36:48, Hidehiro Kawai wrote:
> Currently, panic() and crash_kexec() can be called at the same time.
> For example (x86 case):
>
> CPU 0:
> oops_end()
> crash_kexec()
> mutex_trylock() // acquired
> nmi_shootdown_cpus() // stop other cpus
>
> CPU 1:
> panic()
> crash_kexec()
> mutex_trylock() // failed to acquire
> smp_send_stop() // stop other cpus
> infinite loop
>
> If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
> fails.
>
> In another case:
>
> CPU 0:
> oops_end()
> crash_kexec()
> mutex_trylock() // acquired
> <NMI>
> io_check_error()
> panic()
> crash_kexec()
> mutex_trylock() // failed to acquire
> infinite loop
>
> Clearly, this is an undesirable result.
>
> To fix this problem, this patch changes crash_kexec() to exclude
> others by using atomic_t panic_cpu.
>
> V5:
> - Add missing dummy __crash_kexec() for !CONFIG_KEXEC_CORE case
> - Replace atomic_xchg() with atomic_set() in crash_kexec() because
> it is used as a release operation and there is no need of memory
> barrier effect. This change also removes an unused value warning
>
> V4:
> - Use new __crash_kexec(), no exclusion check version of crash_kexec(),
> instead of checking if panic_cpu is the current cpu or not
>
> V2:
> - Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
> to exclude concurrent accesses
> - Don't introduce no-lock version of crash_kexec()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Eric Biederman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
Looks good to me as well
Acked-by: Michal Hocko <[email protected]>
[...]
> +void crash_kexec(struct pt_regs *regs)
> +{
> + int old_cpu, this_cpu;
> +
> + /*
> + * Only one CPU is allowed to execute the crash_kexec() code as with
> + * panic(). Otherwise parallel calls of panic() and crash_kexec()
> + * may stop each other. To exclude them, we use panic_cpu here too.
> + */
> + this_cpu = raw_smp_processor_id();
> + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
> + if (old_cpu == -1) {
> + /* This is the 1st CPU which comes here, so go ahead. */
> + __crash_kexec(regs);
> +
> + /*
> + * Reset panic_cpu to allow another panic()/crash_kexec()
> + * call.
> + */
> + atomic_set(&panic_cpu, -1);
This was slightly more obvious in the previous version, where the reset
happened after the trylock on the mutex failed; maybe the comment could
be more specific:
+ /*
+ * Reset panic_cpu to allow another panic()/crash_kexec()
+ * call if __crash_kexec couldn't handle the situation.
+ */
--
Michal Hocko
SUSE Labs
On Fri, Nov 20, 2015 at 06:36:44PM +0900, Hidehiro Kawai wrote:
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 350dfb0..480a4fd 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -445,6 +445,19 @@ extern int sysctl_panic_on_stackoverflow;
>
> extern bool crash_kexec_post_notifiers;
>
> +extern atomic_t panic_cpu;
> +
> +/*
> + * A variant of panic() called from NMI context.
> + * If we've already panicked on this cpu, return from here.
> + */
> +#define nmi_panic(fmt, ...) \
> + do { \
> + int this_cpu = raw_smp_processor_id(); \
> + if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> + panic(fmt, ##__VA_ARGS__); \
Hmm,
What happens if:
CPU 0: CPU 1:
------ ------
nmi_panic();
nmi_panic();
<external nmi>
nmi_panic();
?
cmpxchg(&panic_cpu, -1, 0) != 0
returns -1 for cpu 0, thus 0 != 0, and sets panic_cpu to 0
cmpxchg(&panic_cpu, -1, 1) != 1
returns 0, and then it too panics, but does not set panic_cpu to 1
Now you have your external NMI triggering on CPU 1
cmpxchg(&panic_cpu, -1, 1) != 1
returns 0 again, and you call panic again within the panic of CPU 1.
Is this OK?
Perhaps you want a per-CPU bitmask, and to do a test_and_set() on the
current CPU. That would prevent any CPU from running panic() twice.
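Something like this, perhaps (completely untested sketch; the mask name
is made up):

static struct cpumask panic_entered_mask;

#define nmi_panic(fmt, ...)						\
do {									\
	/* Atomic test_and_set: only the first entry on each CPU	\
	   actually calls panic() */					\
	if (!cpumask_test_and_set_cpu(raw_smp_processor_id(),		\
				      &panic_entered_mask))		\
		panic(fmt, ##__VA_ARGS__);				\
} while (0)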
-- Steve
> + } while (0)
> +
> /*
> * Only to be used by arch init code. If the user over-wrote the default
> * CONFIG_PANIC_TIMEOUT, honor it.
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 4579dbb..24ee2ea 100644
On Tue, 24 Nov 2015 10:05:10 -0500
Steven Rostedt <[email protected]> wrote:
> cmpxchg(&panic_cpu, -1, 0) != 0
>
> returns -1 for cpu 0, thus 0 != 0, and sets panic_cpu to 0
That was supposed to be "thus -1 != 0".
-- Steve
On Tue, Nov 24, 2015 at 11:48:53AM +0100, Borislav Petkov wrote:
>
> > + */
> > + while (!raw_spin_trylock(&nmi_reason_lock))
> > + poll_crash_ipi_and_callback(regs);
>
> Waaait a minute: so if we're getting NMIs broadcasted on every core but
> we're *not* crash dumping, we will run into here too. This can't be
> right. :-\
This only does something if crash_ipi_done is set, which means you are killing
the box. But perhaps a comment that states that here would be useful, or maybe
just put in the check here. There's no need to make it depend on SMP, as
raw_spin_trylock() will turn into just ({1}) for UP, and that code won't even be
hit.
-- Steve
On Tue, Nov 24, 2015 at 02:37:00PM -0500, Steven Rostedt wrote:
> On Tue, Nov 24, 2015 at 11:48:53AM +0100, Borislav Petkov wrote:
> >
> > > + */
> > > + while (!raw_spin_trylock(&nmi_reason_lock))
> > > + poll_crash_ipi_and_callback(regs);
> >
> > Waaait a minute: so if we're getting NMIs broadcasted on every core but
> > we're *not* crash dumping, we will run into here too. This can't be
> > right. :-\
>
> This only does something if crash_ipi_done is set, which means you are killing
> the box.
Yeah, Michal and I discussed that on IRC today. And yeah, it is really
tricky stuff. So I appreciate it a lot you looking at it too. Thanks!
> But perhaps a comment that states that here would be useful, or maybe
> just put in the check here. There's no need to make it depend on SMP, as
> raw_spin_trylock() will turn into just ({1}) for UP, and that code won't even be
> hit.
Right, this code needs much more thorough documentation to counter the
trickiness.
Thanks.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Tue 24-11-15 10:05:10, Steven Rostedt wrote:
> On Fri, Nov 20, 2015 at 06:36:44PM +0900, Hidehiro Kawai wrote:
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index 350dfb0..480a4fd 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -445,6 +445,19 @@ extern int sysctl_panic_on_stackoverflow;
> >
> > extern bool crash_kexec_post_notifiers;
> >
> > +extern atomic_t panic_cpu;
> > +
> > +/*
> > + * A variant of panic() called from NMI context.
> > + * If we've already panicked on this cpu, return from here.
> > + */
> > +#define nmi_panic(fmt, ...) \
> > + do { \
> > + int this_cpu = raw_smp_processor_id(); \
> > + if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> > + panic(fmt, ##__VA_ARGS__); \
>
> Hmm,
>
> What happens if:
>
> CPU 0: CPU 1:
> ------ ------
> nmi_panic();
>
> nmi_panic();
> <external nmi>
> nmi_panic();
I thought that nmi_panic is called only from the nmi context. If so how
can we get a nested NMI like that?
--
Michal Hocko
SUSE Labs
On Fri, Nov 20, 2015 at 06:36:48PM +0900, Hidehiro Kawai wrote:
> Currently, panic() and crash_kexec() can be called at the same time.
> For example (x86 case):
>
> CPU 0:
> oops_end()
> crash_kexec()
> mutex_trylock() // acquired
> nmi_shootdown_cpus() // stop other cpus
>
> CPU 1:
> panic()
> crash_kexec()
> mutex_trylock() // failed to acquire
> smp_send_stop() // stop other cpus
> infinite loop
>
> If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
> fails.
So the smp_send_stop() stops CPU 0 from calling nmi_shootdown_cpus(), right?
>
> In another case:
>
> CPU 0:
> oops_end()
> crash_kexec()
> mutex_trylock() // acquired
> <NMI>
> io_check_error()
> panic()
> crash_kexec()
> mutex_trylock() // failed to acquire
> infinite loop
>
> Clearly, this is an undesirable result.
I'm trying to see how this patch fixes this case.
>
> To fix this problem, this patch changes crash_kexec() to exclude
> others by using atomic_t panic_cpu.
>
> V5:
> - Add missing dummy __crash_kexec() for !CONFIG_KEXEC_CORE case
> - Replace atomic_xchg() with atomic_set() in crash_kexec() because
> it is used as a release operation and there is no need of memory
> barrier effect. This change also removes an unused value warning
>
> V4:
> - Use new __crash_kexec(), no exclusion check version of crash_kexec(),
> instead of checking if panic_cpu is the current cpu or not
>
> V2:
> - Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
> to exclude concurrent accesses
> - Don't introduce no-lock version of crash_kexec()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Eric Biederman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> ---
> include/linux/kexec.h | 2 ++
> kernel/kexec_core.c | 26 +++++++++++++++++++++++++-
> kernel/panic.c | 4 ++--
> 3 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d140b1e..7b68d27 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -237,6 +237,7 @@ extern int kexec_purgatory_get_set_symbol(struct kimage *image,
> unsigned int size, bool get_value);
> extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
> const char *name);
> +extern void __crash_kexec(struct pt_regs *);
> extern void crash_kexec(struct pt_regs *);
> int kexec_should_crash(struct task_struct *);
> void crash_save_cpu(struct pt_regs *regs, int cpu);
> @@ -332,6 +333,7 @@ int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
> #else /* !CONFIG_KEXEC_CORE */
> struct pt_regs;
> struct task_struct;
> +static inline void __crash_kexec(struct pt_regs *regs) { }
> static inline void crash_kexec(struct pt_regs *regs) { }
> static inline int kexec_should_crash(struct task_struct *p) { return 0; }
> #define kexec_in_progress false
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 11b64a6..9d097f5 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -853,7 +853,8 @@ struct kimage *kexec_image;
> struct kimage *kexec_crash_image;
> int kexec_load_disabled;
>
> -void crash_kexec(struct pt_regs *regs)
> +/* No panic_cpu check version of crash_kexec */
> +void __crash_kexec(struct pt_regs *regs)
> {
> /* Take the kexec_mutex here to prevent sys_kexec_load
> * running on one cpu from replacing the crash kernel
> @@ -876,6 +877,29 @@ void crash_kexec(struct pt_regs *regs)
> }
> }
>
> +void crash_kexec(struct pt_regs *regs)
> +{
> + int old_cpu, this_cpu;
> +
> + /*
> + * Only one CPU is allowed to execute the crash_kexec() code as with
> + * panic(). Otherwise parallel calls of panic() and crash_kexec()
> + * may stop each other. To exclude them, we use panic_cpu here too.
> + */
> + this_cpu = raw_smp_processor_id();
> + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
> + if (old_cpu == -1) {
> + /* This is the 1st CPU which comes here, so go ahead. */
> + __crash_kexec(regs);
> +
> + /*
> + * Reset panic_cpu to allow another panic()/crash_kexec()
> + * call.
> + */
> + atomic_set(&panic_cpu, -1);
> + }
> +}
> +
> size_t crash_get_memory_size(void)
> {
> size_t size = 0;
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 4fce2be..5d0b807 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -138,7 +138,7 @@ void panic(const char *fmt, ...)
> * the "crash_kexec_post_notifiers" option to the kernel.
> */
> if (!crash_kexec_post_notifiers)
> - crash_kexec(NULL);
> + __crash_kexec(NULL);
Why call the __crash_kexec() version and not just crash_kexec() here.
This needs to be documented.
>
> /*
> * Note smp_send_stop is the usual smp shutdown function, which
> @@ -163,7 +163,7 @@ void panic(const char *fmt, ...)
> * more unstable, it can increase risks of the kdump failure too.
> */
> if (crash_kexec_post_notifiers)
> - crash_kexec(NULL);
> + __crash_kexec(NULL);
ditto.
-- Steve
>
> bust_spinlocks(0);
>
>
On Tue, 24 Nov 2015 21:27:13 +0100
Michal Hocko <[email protected]> wrote:
> > What happens if:
> >
> > CPU 0: CPU 1:
> > ------ ------
> > nmi_panic();
> >
> > nmi_panic();
> > <external nmi>
> > nmi_panic();
>
> I thought that nmi_panic is called only from the nmi context. If so how
> can we get a nested NMI like that?
Never mind. I was thinking the external NMI could nest, but I'm guessing
it can't. Anyway, the patches later on modify this code to check for
something other than != this_cpu, which makes this issue moot even if
it could nest.
-- Steve
> On Fri, Nov 20, 2015 at 06:36:46PM +0900, Hidehiro Kawai wrote:
> > nmi_shootdown_cpus(), a subroutine of crash_kexec(), sends NMI IPI
> > to non-panic cpus to stop them while saving their register
>
> ...to stop them and save their register...
Thanks for the correction.
> > information and doing some cleanups for crash dumping. So if a
> > non-panic cpus is infinitely looping in NMI context, we fail to
>
> That should be CPU. Please use "CPU" instead of "cpu" in all your text
> in your next submission.
OK, I'll fix that.
> > save its register information and lose the information from the
> > crash dump.
> >
> > `Infinite loop in NMI context' can happen:
> >
> > a. when a cpu panics on NMI while another cpu is processing panic
> > b. when a cpu received an external or unknown NMI while another
> > cpu is processing panic on NMI
> >
> > In the case of a, it loops in panic_smp_self_stop(). In the case
> > of b, it loops in raw_spin_lock() of nmi_reason_lock.
>
> Please describe those two cases more verbosely - it takes slow people
> like me a while to figure out what exactly can happen.
a. when a cpu panics on NMI while another cpu is processing panic
Ex.
CPU 0 CPU 1
================= =================
panic()
panic_cpu <-- 0
check panic_cpu
crash_kexec()
receive an unknown NMI
unknown_nmi_error()
nmi_panic()
panic()
check panic_cpu
panic_smp_self_stop()
infinite loop in NMI context
b. when a cpu received an external or unknown NMI while another
cpu is processing panic on NMI
Ex.
CPU 0 CPU 1
====================== ==================
receive an unknown NMI
unknown_nmi_error()
nmi_panic() receive an unknown NMI
panic_cpu <-- 0 unknown_nmi_error()
panic() nmi_panic()
check panic_cpu panic
crash_kexec() check panic_cpu
panic_smp_self_stop()
infinite loop in NMI context
> > This can
> > happen on some servers which broadcasts NMIs to all CPUs when a dump
> > button is pushed.
> >
> > To save registers in these case too, this patch does following things:
> >
> > 1. Move the timing of `infinite loop in NMI context' (actually
> > done by panic_smp_self_stop()) outside of panic() to enable us to
> > refer pt_regs
>
> I can't parse that sentence. And I really tried :-\
> panic_smp_self_stop() is still in panic().
panic_smp_self_stop() is still used when a CPU in normal context
should go into an infinite loop. Only when a CPU is in NMI context
is nmi_panic_self_stop() called, and then the CPU loops infinitely
without entering panic().
I'll try to revise this sentence.
> > 2. call a callback of nmi_shootdown_cpus() directly to save
> > registers and do some cleanups after setting waiting_for_crash_ipi
> > which is used for counting down the number of cpus which handled
> > the callback
> >
> > V5:
> > - Use WRITE_ONCE() when setting crash_ipi_done to 1 so that the
> > compiler doesn't change the instruction order
> > - Support the case of b in the above description
> > - Add poll_crash_ipi_and_callback()
> >
> > V4:
> > - Rewrite the patch description
> >
> > V3:
> > - Newly introduced
> >
> > Signed-off-by: Hidehiro Kawai <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Eric Biederman <[email protected]>
> > Cc: Vivek Goyal <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > ---
> > arch/x86/include/asm/reboot.h | 1 +
> > arch/x86/kernel/nmi.c | 17 +++++++++++++----
> > arch/x86/kernel/reboot.c | 28 ++++++++++++++++++++++++++++
> > include/linux/kernel.h | 12 ++++++++++--
> > kernel/panic.c | 10 ++++++++++
> > kernel/watchdog.c | 2 +-
> > 6 files changed, 63 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
> > index a82c4f1..964e82f 100644
> > --- a/arch/x86/include/asm/reboot.h
> > +++ b/arch/x86/include/asm/reboot.h
> > @@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
> >
> > typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
> > void nmi_shootdown_cpus(nmi_shootdown_cb callback);
> > +void poll_crash_ipi_and_callback(struct pt_regs *regs);
> >
> > #endif /* _ASM_X86_REBOOT_H */
> > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> > index 5131714..74a1434 100644
> > --- a/arch/x86/kernel/nmi.c
> > +++ b/arch/x86/kernel/nmi.c
> > @@ -29,6 +29,7 @@
> > #include <asm/mach_traps.h>
> > #include <asm/nmi.h>
> > #include <asm/x86_init.h>
> > +#include <asm/reboot.h>
> >
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/nmi.h>
> > @@ -231,7 +232,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
> > #endif
> >
> > if (panic_on_unrecovered_nmi)
> > - nmi_panic("NMI: Not continuing");
> > + nmi_panic(regs, "NMI: Not continuing");
> >
> > pr_emerg("Dazed and confused, but trying to continue\n");
> >
> > @@ -256,7 +257,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
> > show_regs(regs);
> >
> > if (panic_on_io_nmi) {
> > - nmi_panic("NMI IOCK error: Not continuing");
> > + nmi_panic(regs, "NMI IOCK error: Not continuing");
> >
> > /*
> > * If we return from nmi_panic(), it means we have received
> > @@ -305,7 +306,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
> >
> > pr_emerg("Do you have a strange power saving mode enabled?\n");
> > if (unknown_nmi_panic || panic_on_unrecovered_nmi)
> > - nmi_panic("NMI: Not continuing");
> > + nmi_panic(regs, "NMI: Not continuing");
> >
> > pr_emerg("Dazed and confused, but trying to continue\n");
> > }
> > @@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
> > }
> >
> > /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
> > - raw_spin_lock(&nmi_reason_lock);
> > +
> > + /*
> > + * Another CPU may be processing panic routines with holding
>
> while
I'll fix it.
> > + * nmi_reason_lock. Check IPI issuance from the panicking CPU
> > + * and call the callback directly.
>
> This is one strange sentence. Does it mean something like:
>
> "Check if the panicking CPU issued the IPI and, if so, call the crash
> callback directly."
>
> ?
Yes. Thanks for the suggestion.
> > + */
> > + while (!raw_spin_trylock(&nmi_reason_lock))
> > + poll_crash_ipi_and_callback(regs);
>
> Waaait a minute: so if we're getting NMIs broadcasted on every core but
> we're *not* crash dumping, we will run into here too. This can't be
> right. :-\
As Steven commented, poll_crash_ipi_and_callback() does nothing
if we're not crash dumping. But yes, this is confusing. I'll add
a more detailed comment, or change the logic a bit if I come up with
a better one.
> > +
> > reason = x86_platform.get_nmi_reason();
> >
> > if (reason & NMI_REASON_MASK) {
> > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> > index 02693dd..44c5f5b 100644
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -718,6 +718,7 @@ static int crashing_cpu;
> > static nmi_shootdown_cb shootdown_callback;
> >
> > static atomic_t waiting_for_crash_ipi;
> > +static int crash_ipi_done;
> >
> > static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
> > {
> > @@ -780,6 +781,9 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
> >
> > smp_send_nmi_allbutself();
> >
> > + /* Kick cpus looping in nmi context. */
> > + WRITE_ONCE(crash_ipi_done, 1);
> > +
> > msecs = 1000; /* Wait at most a second for the other cpus to stop */
> > while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
> > mdelay(1);
> > @@ -788,9 +792,33 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
> >
> > /* Leave the nmi callback set */
> > }
> > +
> > +/*
> > + * Wait for the timing of IPI for crash dumping, and then call its callback
>
> "Wait for the crash dumping IPI to complete... "
So, I think "Wait for the crash dumping IPI to be issued..." is better.
"complete" would be ambiguous in this context.
> > + * directly. This function is used when we have already been in NMI handler.
> > + */
> > +void poll_crash_ipi_and_callback(struct pt_regs *regs)
>
> Why "poll"? We won't return from crash_nmi_callback() if we're not the
> crashing CPU.
This function polls until the crash IPI has been issued, by checking
crash_ipi_done, and then invokes the callback. This is different
from so-called "poll" functions. Do you have a good name?
> > +{
> > + if (crash_ipi_done)
> > + crash_nmi_callback(0, regs); /* Shouldn't return */
> > +}
> > +
> > +/* Override the weak function in kernel/panic.c */
> > +void nmi_panic_self_stop(struct pt_regs *regs)
> > +{
> > + while (1) {
> > + poll_crash_ipi_and_callback(regs);
> > + cpu_relax();
> > + }
> > +}
> > +
> > #else /* !CONFIG_SMP */
> > void nmi_shootdown_cpus(nmi_shootdown_cb callback)
> > {
> > /* No other CPUs to shoot down */
> > }
> > +
> > +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> > +{
> > +}
> > #endif
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index 480a4fd..728a31b 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -255,6 +255,7 @@ extern long (*panic_blink)(int state);
> > __printf(1, 2)
> > void panic(const char *fmt, ...)
> > __noreturn __cold;
> > +void nmi_panic_self_stop(struct pt_regs *);
> > extern void oops_enter(void);
> > extern void oops_exit(void);
> > void print_oops_end_marker(void);
> > @@ -450,12 +451,19 @@ extern atomic_t panic_cpu;
> > /*
> > * A variant of panic() called from NMI context.
> > * If we've already panicked on this cpu, return from here.
> > + * If another cpu already panicked, loop in nmi_panic_self_stop() which
> > + * can provide architecture dependent code such as saving register states
> > + * for crash dump.
> > */
> > -#define nmi_panic(fmt, ...) \
> > +#define nmi_panic(regs, fmt, ...) \
> > do { \
> > + int old_cpu; \
> > int this_cpu = raw_smp_processor_id(); \
> > - if (atomic_cmpxchg(&panic_cpu, -1, this_cpu) != this_cpu) \
> > + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu); \
> > + if (old_cpu == -1) \
> > panic(fmt, ##__VA_ARGS__); \
> > + else if (old_cpu != this_cpu) \
> > + nmi_panic_self_stop(regs); \
>
> Same here - this is assuming that broadcasting NMIs to all cores
> automatically means there's a crash dump in progress:
>
> nmi_panic_self_stop() -> poll_crash_ipi_and_callback()
>
> and this cannot be right.
>
> > } while (0)
> >
> > /*
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index 24ee2ea..4fce2be 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -61,6 +61,16 @@ void __weak panic_smp_self_stop(void)
> > cpu_relax();
> > }
> >
> > +/*
> > + * Stop ourself in NMI context if another cpu has already panicked.
>
> "ourselves"
Thanks. I'll fix it.
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
> On Tue, Nov 24, 2015 at 11:48:53AM +0100, Borislav Petkov wrote:
> >
> > > + */
> > > + while (!raw_spin_trylock(&nmi_reason_lock))
> > > + poll_crash_ipi_and_callback(regs);
> >
> > Waaait a minute: so if we're getting NMIs broadcasted on every core but
> > we're *not* crash dumping, we will run into here too. This can't be
> > right. :-\
>
> This only does something if crash_ipi_done is set, which means you are killing
> the box. But perhaps a comment that states that here would be useful, or maybe
> just put in the check here.
OK, I'll add more comments around this.
> There's no need to make it depend on SMP, as
> raw_spin_trylock() will turn into just ({1}) for UP, and that code won't even be
> hit.
I'll integrate these SMP and UP versions with a comment about
that.
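Maybe something like this (just an untested sketch; it assumes
crash_ipi_done and crash_nmi_callback are visible to both the SMP
and UP configs):

/*
 * Wait for the crash dumping IPI to be issued, and then call its
 * callback directly.  This does nothing unless nmi_shootdown_cpus()
 * has set crash_ipi_done, i.e. unless we are actually crash dumping.
 * On UP this is unreachable anyway, because raw_spin_trylock()
 * always succeeds there.
 */
void poll_crash_ipi_and_callback(struct pt_regs *regs)
{
	if (crash_ipi_done)
		crash_nmi_callback(0, regs);	/* Shouldn't return */
}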
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
> On Fri, Nov 20, 2015 at 06:36:48PM +0900, Hidehiro Kawai wrote:
> > Currently, panic() and crash_kexec() can be called at the same time.
> > For example (x86 case):
> >
> > CPU 0:
> > oops_end()
> > crash_kexec()
> > mutex_trylock() // acquired
> > nmi_shootdown_cpus() // stop other cpus
> >
> > CPU 1:
> > panic()
> > crash_kexec()
> > mutex_trylock() // failed to acquire
> > smp_send_stop() // stop other cpus
> > infinite loop
> >
> > If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
> > fails.
>
> So the smp_send_stop() stops CPU 0 from calling nmi_shootdown_cpus(), right?
Yes, but the important thing is that CPU 1 stops CPU 0, which is
the only CPU processing the crash_kexec routines.
> >
> > In another case:
> >
> > CPU 0:
> > oops_end()
> > crash_kexec()
> > mutex_trylock() // acquired
> > <NMI>
> > io_check_error()
> > panic()
> > crash_kexec()
> > mutex_trylock() // failed to acquire
> > infinite loop
> >
> > Clearly, this is an undesirable result.
>
> I'm trying to see how this patch fixes this case.
>
> >
> > To fix this problem, this patch changes crash_kexec() to exclude
> > others by using atomic_t panic_cpu.
> >
> > V5:
> > - Add missing dummy __crash_kexec() for !CONFIG_KEXEC_CORE case
> > - Replace atomic_xchg() with atomic_set() in crash_kexec() because
> > it is used as a release operation and there is no need of memory
> > barrier effect. This change also removes an unused value warning
> >
> > V4:
> > - Use new __crash_kexec(), no exclusion check version of crash_kexec(),
> > instead of checking if panic_cpu is the current cpu or not
> >
> > V2:
> > - Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
> > to exclude concurrent accesses
> > - Don't introduce no-lock version of crash_kexec()
> >
> > Signed-off-by: Hidehiro Kawai <[email protected]>
> > Cc: Eric Biederman <[email protected]>
> > Cc: Vivek Goyal <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > ---
> > include/linux/kexec.h | 2 ++
> > kernel/kexec_core.c | 26 +++++++++++++++++++++++++-
> > kernel/panic.c | 4 ++--
> > 3 files changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index d140b1e..7b68d27 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -237,6 +237,7 @@ extern int kexec_purgatory_get_set_symbol(struct kimage *image,
> > unsigned int size, bool get_value);
> > extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
> > const char *name);
> > +extern void __crash_kexec(struct pt_regs *);
> > extern void crash_kexec(struct pt_regs *);
> > int kexec_should_crash(struct task_struct *);
> > void crash_save_cpu(struct pt_regs *regs, int cpu);
> > @@ -332,6 +333,7 @@ int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
> > #else /* !CONFIG_KEXEC_CORE */
> > struct pt_regs;
> > struct task_struct;
> > +static inline void __crash_kexec(struct pt_regs *regs) { }
> > static inline void crash_kexec(struct pt_regs *regs) { }
> > static inline int kexec_should_crash(struct task_struct *p) { return 0; }
> > #define kexec_in_progress false
> > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > index 11b64a6..9d097f5 100644
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -853,7 +853,8 @@ struct kimage *kexec_image;
> > struct kimage *kexec_crash_image;
> > int kexec_load_disabled;
> >
> > -void crash_kexec(struct pt_regs *regs)
> > +/* No panic_cpu check version of crash_kexec */
> > +void __crash_kexec(struct pt_regs *regs)
> > {
> > /* Take the kexec_mutex here to prevent sys_kexec_load
> > * running on one cpu from replacing the crash kernel
> > @@ -876,6 +877,29 @@ void crash_kexec(struct pt_regs *regs)
> > }
> > }
> >
> > +void crash_kexec(struct pt_regs *regs)
> > +{
> > + int old_cpu, this_cpu;
> > +
> > + /*
> > + * Only one CPU is allowed to execute the crash_kexec() code as with
> > + * panic(). Otherwise parallel calls of panic() and crash_kexec()
> > + * may stop each other. To exclude them, we use panic_cpu here too.
> > + */
> > + this_cpu = raw_smp_processor_id();
> > + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
> > + if (old_cpu == -1) {
> > + /* This is the 1st CPU which comes here, so go ahead. */
> > + __crash_kexec(regs);
> > +
> > + /*
> > + * Reset panic_cpu to allow another panic()/crash_kexec()
> > + * call.
> > + */
> > + atomic_set(&panic_cpu, -1);
> > + }
> > +}
> > +
> > size_t crash_get_memory_size(void)
> > {
> > size_t size = 0;
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index 4fce2be..5d0b807 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -138,7 +138,7 @@ void panic(const char *fmt, ...)
> > * the "crash_kexec_post_notifiers" option to the kernel.
> > */
> > if (!crash_kexec_post_notifiers)
> > - crash_kexec(NULL);
> > + __crash_kexec(NULL);
>
> Why call the __crash_kexec() version and not just crash_kexec() here.
> This needs to be documented.
In this patch, exclusive execution control via panic_cpu is added
to crash_kexec(). When crash_kexec() is called from panic(), we
don't need to check panic_cpu because we already hold the exclusive
control. So, __crash_kexec() is used here to bypass the check.
Of course, we could call crash_kexec() here instead; crash_kexec()
would check whether panic_cpu equals the current CPU number and,
if so, continue to process the crash_kexec() routines.
This was done in an older version of this patch series, but Peter
got a wrong impression from the check of whether panic_cpu equals
the current CPU number; it seemed to permit a recursive call of
crash_kexec() (actually, a recursive call of crash_kexec() can't
happen).
Anyway, I'll add some comments.
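For reference, the older check-based variant looked roughly like
this (from memory, not the exact posted code):

void crash_kexec(struct pt_regs *regs)
{
	int old_cpu, this_cpu;

	this_cpu = raw_smp_processor_id();
	old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);

	/*
	 * Proceed if we are the first CPU, or if we already own
	 * panic_cpu (i.e. we were called from panic() on this CPU).
	 */
	if (old_cpu == -1 || old_cpu == this_cpu)
		__crash_kexec(regs);

	if (old_cpu == -1)
		atomic_set(&panic_cpu, -1);
}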
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
On Wed, Nov 25, 2015 at 05:51:59AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > > `Infinite loop in NMI context' can happen:
> > >
> > > a. when a cpu panics on NMI while another cpu is processing panic
> > > b. when a cpu received an external or unknown NMI while another
> > > cpu is processing panic on NMI
> > >
> > > In the case of a, it loops in panic_smp_self_stop(). In the case
> > > of b, it loops in raw_spin_lock() of nmi_reason_lock.
> >
> > Please describe those two cases more verbosely - it takes slow people
> > like me a while to figure out what exactly can happen.
>
> a. when a cpu panics on NMI while another cpu is processing panic
> Ex.
> CPU 0 CPU 1
> ================= =================
> panic()
> panic_cpu <-- 0
> check panic_cpu
> crash_kexec()
> receive an unknown NMI
> unknown_nmi_error()
> nmi_panic()
> panic()
> check panic_cpu
> panic_smp_self_stop()
> infinite loop in NMI context
>
> b. when a cpu received an external or unknown NMI while another
> cpu is processing panic on NMI
> Ex.
> CPU 0 CPU 1
> ====================== ==================
> receive an unknown NMI
> unknown_nmi_error()
> nmi_panic() receive an unknown NMI
> panic_cpu <-- 0 unknown_nmi_error()
> panic() nmi_panic()
> check panic_cpu panic
> crash_kexec() check panic_cpu
> panic_smp_self_stop()
> infinite loop in NMI context
Ok, that's what I saw too, thanks for confirming.
But please write those examples with the *old* code in the commit
message, i.e. without panic_cpu and nmi_panic() which you're adding.
Basically, you want to structure your commit message this way:
This is the problem the current code has: ...
But we need to do this: ...
We fix it by doing that: ...
This will be of great help now when reading the commit message and of
invaluable help later, when we all have forgotten about the issue and
are scratching heads over why stuff was added.
> > > To save registers in these case too, this patch does following things:
> > >
> > > 1. Move the timing of `infinite loop in NMI context' (actually
> > > done by panic_smp_self_stop()) outside of panic() to enable us to
> > > refer pt_regs
> >
> > I can't parse that sentence. And I really tried :-\
> > panic_smp_self_stop() is still in panic().
>
> panic_smp_self_stop() is still used when a CPU in normal context
> should go into an infinite loop. Only when a CPU is in NMI context
> is nmi_panic_self_stop() called, and then the CPU loops infinitely
> without entering panic().
>
> I'll try to revise this sentence.
FWIW, it sounds better already! :)
> > > + */
> > > + while (!raw_spin_trylock(&nmi_reason_lock))
> > > + poll_crash_ipi_and_callback(regs);
> >
> > Waaait a minute: so if we're getting NMIs broadcasted on every core but
> > we're *not* crash dumping, we will run into here too. This can't be
> > right. :-\
>
> As Steven commented, poll_crash_ipi_and_callback() does nothing
> if we're not crash dumping. But yes, this is confusing. I'll add
> a more detailed comment, or change the logic a bit if I come up with
> a better one.
Thanks, much appreciated!
> > > +/*
> > > + * Wait for the timing of IPI for crash dumping, and then call its callback
> >
> > "Wait for the crash dumping IPI to complete... "
>
> So, I think "Wait for the crash dumping IPI to be issued..." is better.
> "complete" would be ambiguous in this context.
Ok.
>
> > > + * directly. This function is used when we have already been in NMI handler.
> > > + */
> > > +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> >
> > Why "poll"? We won't return from crash_nmi_callback() if we're not the
> > crashing CPU.
>
> This function polls until the crash IPI has been issued, by checking
> crash_ipi_done, and then invokes the callback. This is different
> from so-called "poll" functions. Do you have a good name?
Maybe something as simple as "run_crash_callback"?
Or since we're calling it from other places, maybe add the "crash"
prefix:
while (!raw_spin_trylock(&nmi_reason_lock))
crash_run_callback(regs);
Looks even better to me in code context. :)
Thanks!
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
> On Wed, Nov 25, 2015 at 05:51:59AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > > > `Infinite loop in NMI context' can happen:
> > > >
> > > > a. when a cpu panics on NMI while another cpu is processing panic
> > > > b. when a cpu received an external or unknown NMI while another
> > > > cpu is processing panic on NMI
> > > >
> > > > In the case of a, it loops in panic_smp_self_stop(). In the case
> > > > of b, it loops in raw_spin_lock() of nmi_reason_lock.
> > >
> > > Please describe those two cases more verbosely - it takes slow people
> > > like me a while to figure out what exactly can happen.
> >
> > a. when a cpu panics on NMI while another cpu is processing panic
> > Ex.
> > CPU 0 CPU 1
> > ================= =================
> > panic()
> > panic_cpu <-- 0
> > check panic_cpu
> > crash_kexec()
> > receive an unknown NMI
> > unknown_nmi_error()
> > nmi_panic()
> > panic()
> > check panic_cpu
> > panic_smp_self_stop()
> > infinite loop in NMI context
> >
> > b. when a cpu received an external or unknown NMI while another
> > cpu is processing panic on NMI
> > Ex.
> > CPU 0 CPU 1
> > ====================== ==================
> > receive an unknown NMI
> > unknown_nmi_error()
> > nmi_panic() receive an unknown NMI
> > panic_cpu <-- 0 unknown_nmi_error()
> > panic() nmi_panic()
> > check panic_cpu panic
> > crash_kexec() check panic_cpu
> > panic_smp_self_stop()
> > infinite loop in NMI context
>
> Ok, that's what I saw too, thanks for confirming.
>
> But please write those examples with the *old* code in the commit
> message, i.e. without panic_cpu and nmi_panic() which you're adding.
Does *old* code mean the code without this patch *series*?
panic_cpu and nmi_panic() are introduced by PATCH 1/4, not this patch.
> Basically, you want to structure your commit message this way:
>
> This is the problem the current code has: ...
>
> But we need to do this: ...
>
> We fix it by doing that: ...
Good suggestion! I'll revise it a bit, following your comment.
> > > > + * directly. This function is used when we have already been in NMI handler.
> > > > + */
> > > > +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> > >
> > > Why "poll"? We won't return from crash_nmi_callback() if we're not the
> > > crashing CPU.
> >
> > This function polls until the crash IPI has been issued, by checking
> > crash_ipi_done, and then invokes the callback. This is different
> > from so-called "poll" functions. Do you have a good name?
>
> Maybe something as simple as "run_crash_callback"?
I prefer this, but we might want to add some more prefix or suffix.
For example, "conditionally_run_crash_nmi_callback".
> Or since we're calling it from other places, maybe add the "crash"
> prefix:
>
> while (!raw_spin_trylock(&nmi_reason_lock))
> crash_run_callback(regs);
>
> Looks even better to me in code context. :)
Thanks for your deep review!
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
On Fri, Nov 20, 2015 at 06:36:48PM +0900, Hidehiro Kawai wrote:
> Currently, panic() and crash_kexec() can be called at the same time.
> For example (x86 case):
>
> CPU 0:
> oops_end()
> crash_kexec()
> mutex_trylock() // acquired
> nmi_shootdown_cpus() // stop other cpus
>
> CPU 1:
> panic()
> crash_kexec()
> mutex_trylock() // failed to acquire
> smp_send_stop() // stop other cpus
> infinite loop
>
> If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
> fails.
>
> In another case:
>
> CPU 0:
> oops_end()
> crash_kexec()
> mutex_trylock() // acquired
> <NMI>
> io_check_error()
> panic()
> crash_kexec()
> mutex_trylock() // failed to acquire
> infinite loop
>
> Clearly, this is an undesirable result.
>
> To fix this problem, this patch changes crash_kexec() to exclude
> others by using atomic_t panic_cpu.
>
> V5:
> - Add missing dummy __crash_kexec() for !CONFIG_KEXEC_CORE case
> - Replace atomic_xchg() with atomic_set() in crash_kexec() because
> it is used as a release operation and there is no need of memory
> barrier effect. This change also removes an unused value warning
>
> V4:
> - Use new __crash_kexec(), no exclusion check version of crash_kexec(),
> instead of checking if panic_cpu is the current cpu or not
>
> V2:
> - Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
> to exclude concurrent accesses
> - Don't introduce no-lock version of crash_kexec()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Eric Biederman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> ---
> include/linux/kexec.h | 2 ++
> kernel/kexec_core.c | 26 +++++++++++++++++++++++++-
> kernel/panic.c | 4 ++--
> 3 files changed, 29 insertions(+), 3 deletions(-)
...
> +void crash_kexec(struct pt_regs *regs)
> +{
> + int old_cpu, this_cpu;
> +
> + /*
> + * Only one CPU is allowed to execute the crash_kexec() code as with
> + * panic(). Otherwise parallel calls of panic() and crash_kexec()
> + * may stop each other. To exclude them, we use panic_cpu here too.
> + */
> + this_cpu = raw_smp_processor_id();
> + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
> + if (old_cpu == -1) {
> + /* This is the 1st CPU which comes here, so go ahead. */
> + __crash_kexec(regs);
> +
> + /*
> + * Reset panic_cpu to allow another panic()/crash_kexec()
> + * call.
So can we make __crash_kexec() return error values?
* failed to grab kexec_mutex -> reset panic_cpu
* no kexec_crash_image -> no need to reset it, all future crash_kexec()
calls won't work so no need to run into that path anymore. However, this could
be problematic if we want the other CPUs to panic. Do we care?
* machine_kexec successful -> doesn't matter
Thanks.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Wed, Nov 25, 2015 at 09:46:37AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> Does *old* code mean the code without this patch *series*?
Yes.
> I prefer this, but we might want to add some more prefix or suffix.
> For example, "conditionally_run_crash_nmi_callback".
That's unnecessary IMO. If you need to express that, you could write
that in a comment above the function definition. Anyone who looks at
the code then will know that it is conditional, like so many other
kernel functions. :)
> Thanks for your deep review!
Thanks for the patience!
:-)
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Fri, Nov 20, 2015 at 06:36:50PM +0900, Hidehiro Kawai wrote:
> This patch introduces new boot option, apic_extnmi:
>
> apic_extnmi={ bsp | all | none}
>
> The default value is "bsp" and this is the current behavior; only
> BSP receives external NMI. "all" allows external NMIs to be
> broadcast to all CPUs. This would raise the success rate of panic
> on NMI when BSP hangs up in NMI context or the external NMI is
> swallowed by other NMI handlers on BSP. If you specified "none",
> any CPUs don't receive external NMIs. This is useful for dump
> capture kernel so that it wouldn't be shot down while saving a
> crash dump.
>
> V5:
> - Rename the option from "noextnmi" to "apic_extnmi"
> - Add apic_extnmi=all feature
> - Fix the wrong documentation about "noextnmi" (apic_extnmi=none)
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 9 +++++++++
> arch/x86/include/asm/apic.h | 5 +++++
> arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
> 3 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index f8aae63..ceed3bc 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -472,6 +472,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> Change the amount of debugging information output
> when initialising the APIC and IO-APIC components.
>
> + apic_extnmi= [APIC,X86] External NMI delivery setting
> + Format: { bsp (default) | all | none }
> + bsp: External NMI is delivered to only CPU 0
only to
> + all: External NMIs are broadcast to all CPUs as a
> + backup of CPU 0
> + none: External NMI is masked for all CPUs. This is
> + useful so that a dump capture kernel won't be
> + shot down by NMI
> +
> autoconf= [IPV6]
> See Documentation/networking/ipv6.txt.
>
> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index 7f62ad4..c80f6b6 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -23,6 +23,11 @@
> #define APIC_VERBOSE 1
> #define APIC_DEBUG 2
>
> +/* Macros for apic_extnmi which controls external NMI masking */
> +#define APIC_EXTNMI_BSP 0 /* Default */
> +#define APIC_EXTNMI_ALL 1
> +#define APIC_EXTNMI_NONE 2
> +
> /*
> * Define the default level of output to be very little
> * This can be turned up by using apic=verbose for more
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 2f69e3b..a2a8074 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -82,6 +82,12 @@ physid_mask_t phys_cpu_present_map;
> static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;
>
> /*
> + * This variable controls which CPUs receive external NMIs. By default,
> + * external NMIs are delivered to only BSP.
only to the BSP.
> + */
> +static int apic_extnmi = APIC_EXTNMI_BSP;
> +
> +/*
> * Map cpu index to physical APIC ID
> */
> DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
> @@ -1161,6 +1167,8 @@ void __init init_bsp_APIC(void)
> value = APIC_DM_NMI;
> if (!lapic_is_integrated()) /* 82489DX */
> value |= APIC_LVT_LEVEL_TRIGGER;
> + if (apic_extnmi == APIC_EXTNMI_NONE)
> + value |= APIC_LVT_MASKED;
> apic_write(APIC_LVT1, value);
> }
>
> @@ -1380,7 +1388,8 @@ void setup_local_APIC(void)
> /*
> * only the BP should see the LINT1 NMI signal, obviously.
> */
That comment needs adjusting.
> - if (!cpu)
> + if ((!cpu && apic_extnmi != APIC_EXTNMI_NONE) ||
> + apic_extnmi == APIC_EXTNMI_ALL)
> value = APIC_DM_NMI;
> else
> value = APIC_DM_NMI | APIC_LVT_MASKED;
> @@ -2548,3 +2557,23 @@ static int __init apic_set_disabled_cpu_apicid(char *arg)
> return 0;
> }
> early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
> +
> +static int __init apic_set_extnmi(char *arg)
> +{
> + if (!arg)
> + return -EINVAL;
> +
> + if (strcmp("all", arg) == 0)
if (!strncmp("all", arg, 3))
ditto for the rest
> + apic_extnmi = APIC_EXTNMI_ALL;
> + else if (strcmp("none", arg) == 0)
> + apic_extnmi = APIC_EXTNMI_NONE;
> + else if (strcmp("bsp", arg) == 0)
> + apic_extnmi = APIC_EXTNMI_BSP;
> + else {
> + pr_warn("Unknown external NMI delivery mode `%s' is ignored\n",
s/is //
Also, if there's no other delivery mode which makes sense, you can do:
pr_warn("Unknown external NMI delivery mode `%s', defaulting to 'bsp'\n", arg);
apic_extnmi = APIC_EXTNMI_BSP;
Btw, you can let the pr_warn line be longer than 80 cols.
And if you don't default, you need
return -EINVAL;
here.
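IOW, something like this (completely untested):

static int __init apic_set_extnmi(char *arg)
{
	if (!arg)
		return -EINVAL;

	if (!strncmp("all", arg, 3))
		apic_extnmi = APIC_EXTNMI_ALL;
	else if (!strncmp("none", arg, 4))
		apic_extnmi = APIC_EXTNMI_NONE;
	else if (!strncmp("bsp", arg, 3))
		apic_extnmi = APIC_EXTNMI_BSP;
	else {
		pr_warn("Unknown external NMI delivery mode `%s' ignored\n", arg);
		return -EINVAL;
	}

	return 0;
}
early_param("apic_extnmi", apic_set_extnmi);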
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
> On Wed, Nov 25, 2015 at 09:46:37AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
...
> > I prefer this, but we might want to add some more prefix or suffix.
> > For example, "conditionally_run_crash_nmi_callback".
>
> That's unnecessary IMO. If you need to express that, you could write
> that in a comment above the function definition. Anyone who looks at
> the code then will know that it is conditional, like so many other
> kernel functions. :)
OK, I'll use a simple one with a comment.
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
> On Fri, Nov 20, 2015 at 06:36:50PM +0900, Hidehiro Kawai wrote:
> > This patch introduces new boot option, apic_extnmi:
> >
> > apic_extnmi={ bsp | all | none}
> >
> > The default value is "bsp" and this is the current behavior; only
> > BSP receives external NMI. "all" allows external NMIs to be
> > broadcast to all CPUs. This would raise the success rate of panic
> > on NMI when BSP hangs up in NMI context or the external NMI is
> > swallowed by other NMI handlers on BSP. If you specified "none",
> > any CPUs don't receive external NMIs. This is useful for dump
> > capture kernel so that it wouldn't be shot down while saving a
> > crash dump.
> >
> > V5:
> > - Rename the option from "noextnmi" to "apic_extnmi"
> > - Add apic_extnmi=all feature
> > - Fix the wrong documentation about "noextnmi" (apic_extnmi=none)
> >
> > Signed-off-by: Hidehiro Kawai <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Jonathan Corbet <[email protected]>
> > ---
> > Documentation/kernel-parameters.txt | 9 +++++++++
> > arch/x86/include/asm/apic.h | 5 +++++
> > arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
> > 3 files changed, 44 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index f8aae63..ceed3bc 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -472,6 +472,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > Change the amount of debugging information output
> > when initialising the APIC and IO-APIC components.
> >
> > + apic_extnmi= [APIC,X86] External NMI delivery setting
> > + Format: { bsp (default) | all | none }
> > + bsp: External NMI is delivered to only CPU 0
>
> only to
Thanks for the correction.
>
> > + all: External NMIs are broadcast to all CPUs as a
> > + backup of CPU 0
> > + none: External NMI is masked for all CPUs. This is
> > + useful so that a dump capture kernel won't be
> > + shot down by NMI
> > +
> > autoconf= [IPV6]
> > See Documentation/networking/ipv6.txt.
> >
> > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> > index 7f62ad4..c80f6b6 100644
> > --- a/arch/x86/include/asm/apic.h
> > +++ b/arch/x86/include/asm/apic.h
> > @@ -23,6 +23,11 @@
> > #define APIC_VERBOSE 1
> > #define APIC_DEBUG 2
> >
> > +/* Macros for apic_extnmi which controls external NMI masking */
> > +#define APIC_EXTNMI_BSP 0 /* Default */
> > +#define APIC_EXTNMI_ALL 1
> > +#define APIC_EXTNMI_NONE 2
> > +
> > /*
> > * Define the default level of output to be very little
> > * This can be turned up by using apic=verbose for more
> > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> > index 2f69e3b..a2a8074 100644
> > --- a/arch/x86/kernel/apic/apic.c
> > +++ b/arch/x86/kernel/apic/apic.c
> > @@ -82,6 +82,12 @@ physid_mask_t phys_cpu_present_map;
> > static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;
> >
> > /*
> > + * This variable controls which CPUs receive external NMIs. By default,
> > + * external NMIs are delivered to only BSP.
>
> only to the BSP.
...and again.
>
> > + */
> > +static int apic_extnmi = APIC_EXTNMI_BSP;
> > +
> > +/*
> > * Map cpu index to physical APIC ID
> > */
> > DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
> > @@ -1161,6 +1167,8 @@ void __init init_bsp_APIC(void)
> > value = APIC_DM_NMI;
> > if (!lapic_is_integrated()) /* 82489DX */
> > value |= APIC_LVT_LEVEL_TRIGGER;
> > + if (apic_extnmi == APIC_EXTNMI_NONE)
> > + value |= APIC_LVT_MASKED;
> > apic_write(APIC_LVT1, value);
> > }
> >
> > @@ -1380,7 +1388,8 @@ void setup_local_APIC(void)
> > /*
> > * only the BP should see the LINT1 NMI signal, obviously.
> > */
>
> That comment needs adjusting.
OK, I'll do that.
>
> > - if (!cpu)
> > + if ((!cpu && apic_extnmi != APIC_EXTNMI_NONE) ||
> > + apic_extnmi == APIC_EXTNMI_ALL)
> > value = APIC_DM_NMI;
> > else
> > value = APIC_DM_NMI | APIC_LVT_MASKED;
> > @@ -2548,3 +2557,23 @@ static int __init apic_set_disabled_cpu_apicid(char *arg)
> > return 0;
> > }
> > early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
> > +
> > +static int __init apic_set_extnmi(char *arg)
> > +{
> > + if (!arg)
> > + return -EINVAL;
> > +
> > + if (strcmp("all", arg) == 0)
>
> if (!strncmp("all", arg, 3))
>
> ditto for the rest
I'll fix them.
> > + apic_extnmi = APIC_EXTNMI_ALL;
> > + else if (strcmp("none", arg) == 0)
> > + apic_extnmi = APIC_EXTNMI_NONE;
> > + else if (strcmp("bsp", arg) == 0)
> > + apic_extnmi = APIC_EXTNMI_BSP;
> > + else {
> > + pr_warn("Unknown external NMI delivery mode `%s' is ignored\n",
>
> s/is //
I'll fix it, thanks.
> Also, if there's no other delivery mode which makes sense, you can do:
>
> pr_warn("Unknown external NMI delivery mode `%s', defaulting to 'bsp'\n", arg);
> apic_extnmi = APIC_EXTNMI_BSP;
I intended to keep the previous or initial value of apic_extnmi,
because a boot option can be specified multiple times. If a user
does that, making the last valid value effective would be the
natural behavior. This is an unclear part of boot option semantics,
as Ingo pointed out, but...
> Btw, you can let the pr_warn line be longer than 80 cols.
I like 80 cols because I'm working with multiple 80-col terminals. :-)
> And if you don't default, you need
>
> return -EINVAL;
>
> here.
You are right. It seems that I deleted it while reworking the
surrounding code.
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
Hello Borislav,
Sorry, I haven't replied to this mail yet.
> On Fri, Nov 20, 2015 at 06:36:48PM +0900, Hidehiro Kawai wrote:
...
> > +void crash_kexec(struct pt_regs *regs)
> > +{
> > + int old_cpu, this_cpu;
> > +
> > + /*
> > + * Only one CPU is allowed to execute the crash_kexec() code as with
> > + * panic(). Otherwise parallel calls of panic() and crash_kexec()
> > + * may stop each other. To exclude them, we use panic_cpu here too.
> > + */
> > + this_cpu = raw_smp_processor_id();
> > + old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
> > + if (old_cpu == -1) {
> > + /* This is the 1st CPU which comes here, so go ahead. */
> > + __crash_kexec(regs);
> > +
> > + /*
> > + * Reset panic_cpu to allow another panic()/crash_kexec()
> > + * call.
>
> So can we make __crash_kexec() return error values?
>
> * failed to grab kexec_mutex -> reset panic_cpu
>
> * no kexec_crash_image -> no need to reset it, all future crash_kexec()
> calls won't work so no need to run into that path anymore. However, this could
> be problematic if we want the other CPUs to panic. Do we care?
>
> * machine_kexec successful -> doesn't matter
We can do so, but I think resetting panic_cpu always would be
simpler and safer.
Although checking kexec_crash_image each time is pointless, it
doesn't cause any actual problem.
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
On Wed, Dec 02, 2015 at 11:57:38AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> We can do so, but I think resetting panic_cpu always would be
> simpler and safer.
Well, I think executing code needlessly *especially* at panic time is
not all that rosy either.
Besides something like this:
static bool kexec_failed;
...
if (crash_kexec_post_notifiers && !kexec_failed)
kexec_failed = __crash_kexec(NULL);
is as simple as it gets.
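And __crash_kexec() reporting failure could look something like this
(a purely hypothetical sketch - the return convention is made up):

bool __crash_kexec(struct pt_regs *regs)
{
	/*
	 * We only ever return on failure; a successful machine_kexec()
	 * does not come back.  (A fuller version could distinguish
	 * mutex contention from a missing crash image.)
	 */
	bool failed = true;

	if (mutex_trylock(&kexec_mutex)) {
		if (kexec_crash_image) {
			struct pt_regs fixed_regs;

			crash_setup_regs(&fixed_regs, regs);
			crash_save_vmcoreinfo();
			machine_crash_shutdown(&fixed_regs);
			machine_kexec(kexec_crash_image);
			/* not reached */
		}
		mutex_unlock(&kexec_mutex);
	}
	return failed;
}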
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
> On Wed, Dec 02, 2015 at 11:57:38AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > We can do so, but I think resetting panic_cpu always would be
> > simpler and safer.
I'll state in detail.
When we call crash_kexec() without entering panic() and return from
it, panic() should be called eventually. But the code paths are
a bit complicated and there are many implementations for each
architecture. So one day, this assumption may be broken; the CPU
doesn't call panic(). Or the CPU may fail to call panic() because
we are already in an insane state. It may sound paranoid, but
allowing another CPU to process the panic routines by resetting
panic_cpu is the safer approach.
> Well, I think executing code needlessly *especially* at panic time is
> not all that rosy either.
>
> Besides something like this:
>
> static bool kexec_failed;
>
> ...
>
> if (crash_kexec_post_notifiers && !kexec_failed)
> kexec_failed = __crash_kexec(NULL);
>
> is as simple as it gets.
Since this code is executed only once due to panic_cpu,
I think introducing this logic is not very valuable.
Also, the current implementation is already quite simple:

panic()
{
	...
	__crash_kexec(NULL) {
		if (mutex_trylock(&kexec_mutex)) {
			if (kexec_crash_image) {
				/* don't return */
			}
			mutex_unlock(&kexec_mutex);
		}
	}
	...
}

What do you think?
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
> @@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
> }
>
> /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
> - raw_spin_lock(&nmi_reason_lock);
> +
> + /*
> + * Another CPU may be processing panic routines with holding
> + * nmi_reason_lock. Check IPI issuance from the panicking CPU
> + * and call the callback directly.
> + */
> + while (!raw_spin_trylock(&nmi_reason_lock))
> + poll_crash_ipi_and_callback(regs);
> +
> reason = x86_platform.get_nmi_reason();
I noticed this logic is not needed until PATCH 4/4 is applied.
Currently, an unknown NMI can be broadcast to all CPUs, but in that
case panic()/nmi_panic() are called after releasing nmi_reason_lock,
so CPUs can't loop infinitely here.
PATCH 4/4 allows us to broadcast external NMIs to all CPUs, and that
causes an infinite loop in raw_spin_lock(&nmi_reason_lock). So the
above changes are needed.
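For reference, the helper polled in that loop comes from PATCH 2/4
and is roughly the following (a sketch; the exact names and the
flag's type may differ in the next version):

        /* Run by CPUs spinning for nmi_reason_lock in NMI context. */
        void poll_crash_ipi_and_callback(struct pt_regs *regs)
        {
                /*
                 * If the panicking CPU has already issued the crash
                 * NMI-IPI, a second NMI cannot be delivered to us while
                 * we are in NMI context, so invoke the register-saving
                 * callback directly.
                 */
                if (crash_ipi_done)
                        crash_nmi_callback(0, regs);    /* doesn't return */
        }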
I'll move these changes to a later patch in the next version.
Thanks,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
On Thu, Dec 03, 2015 at 02:01:38AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > On Wed, Dec 02, 2015 at 11:57:38AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > > We can do so, but I think always resetting panic_cpu would be
> > > simpler and safer.
>
> Let me explain in detail.
>
> When we call crash_kexec() without entering panic() and return from
> it, panic() should be called eventually.
Huh, the call chain is
panic->crash_kexec
Or do you mean, when crash_kexec() is not called by panic() but by some
of its other callers?
> But the code paths are a bit complicated and there is a separate
> implementation for each architecture, so one day this assumption may
> be broken: the CPU doesn't call panic(), or it fails to call panic()
> because we are already in an insane state. It may sound paranoid,
> but allowing another CPU to process the panic routines by resetting
> panic_cpu is the safer approach.
My suggestion was to do this only on the panic path - not necessarily on
the others.
> Since this code is executed only once thanks to panic_cpu,
> I don't think introducing this logic adds much value.
> Also, the current implementation is already quite simple:
>
> panic()
> {
>         ...
>         __crash_kexec(NULL) {
>                 if (mutex_trylock(&kexec_mutex)) {
>                         if (kexec_crash_image) {
>                                 /* don't return */
>                         }
I don't mean the kexec_crash_image case - I mean the opposite one:
!kexec_crash_image. And I think I know now what you're trying to tell
me: the first CPU which hits panic, will finish panic eventually and so
it will take down the machine.
Every other CPU which happens to enter panic in between the first CPU
and the machine being taken down, doesn't matter because, well, who
cares, we're panicking already.
Am I close?
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
> On Thu, Dec 03, 2015 at 02:01:38AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > > On Wed, Dec 02, 2015 at 11:57:38AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > > > We can do so, but I think always resetting panic_cpu would be
> > > > simpler and safer.
> >
> > Let me explain in detail.
> >
> > When we call crash_kexec() without entering panic() and return from
> > it, panic() should be called eventually.
>
> Huh, the call chain is
>
> panic->crash_kexec
>
> Or do you mean, when crash_kexec() is not called by panic() but by some
> of its other callers?
I was arguing about the case of oops_end --> crash_kexec
--> return from crash_kexec because of !kexec_crash_image -->
panic.
In the case of panic --> __crash_kexec, __crash_kexec is called
only once, so we don't need to check the return value of __crash_kexec
as you suggested. So I thought you were talking about the
crash_kexec --> panic case.
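(For reference, the oops-path chain I mean is roughly this, sketched
from arch/x86's oops_end(); simplified, and details may differ:

        if (regs && kexec_should_crash(current))
                crash_kexec(regs);      /* returns if !kexec_crash_image */
        if (in_interrupt())
                panic("Fatal exception in interrupt");
        if (panic_on_oops)
                panic("Fatal exception");

So crash_kexec() can return here and panic() is reached afterwards.)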
> > But the code paths are a bit complicated and there is a separate
> > implementation for each architecture, so one day this assumption may
> > be broken: the CPU doesn't call panic(), or it fails to call panic()
> > because we are already in an insane state. It may sound paranoid,
> > but allowing another CPU to process the panic routines by resetting
> > panic_cpu is the safer approach.
>
> My suggestion was to do this only on the panic path - not necessarily on
> the others.
>
> > Since this code is executed only once thanks to panic_cpu,
> > I don't think introducing this logic adds much value.
> > Also, the current implementation is already quite simple:
> >
> > panic()
> > {
> >         ...
> >         __crash_kexec(NULL) {
> >                 if (mutex_trylock(&kexec_mutex)) {
> >                         if (kexec_crash_image) {
> >                                 /* don't return */
> >                         }
>
> I don't mean the kexec_crash_image case - I mean the opposite one:
> !kexec_crash_image.
I also mentioned the !kexec_crash_image case...
> And I think I know now what you're trying to tell
> me: the first CPU which hits panic, will finish panic eventually and so
> it will take down the machine.
No. The first CPU calls panic, and then it calls __crash_kexec.
Because of !kexec_crash_image, it returns from __crash_kexec and
continues to the panic procedure. At the same time, another CPU
tries to call panic(), but it doesn't run the panic procedure;
panic_cpu prevents the second CPU from running it.
This means __crash_kexec is called only once even if we don't
check the return value of __crash_kexec.
(Please note that crash_kexec can be called multiple times in the
case of oops_end() --> crash_kexec().)
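To make the gating concrete, the check at the top of panic() in this
series is roughly the following (sketched; the actual patch may use a
named constant rather than -1):

        int old_cpu, this_cpu;

        this_cpu = raw_smp_processor_id();
        old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
        if (old_cpu != -1 && old_cpu != this_cpu)
                panic_smp_self_stop();  /* later CPUs spin here forever */
        /* Only the first CPU proceeds to __crash_kexec(NULL) below. */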
I'm sorry I couldn't explain my thinking well.
Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
On Thu, Dec 03, 2015 at 11:29:21AM +0000, 河合英宏 / KAWAI,HIDEHIRO wrote:
> I was arguing about the case of oops_end --> crash_kexec
> --> return from crash_kexec because of !kexec_crash_image -->
> panic.
Aha.
> In the case of panic --> __crash_kexec, __crash_kexec is called
> only once, so we don't need to check the return value of __crash_kexec
> as you suggested. So I thought you stated about crash_kexec --> panic
> case.
No, I meant the other way around.
> I also mentioned the !kexec_crash_image case...
I must've missed it.
> No. The first CPU calls panic, and then it calls __crash_kexec.
> Because of !kexec_crash_image, it returns from __crash_kexec and
> continues to the panic procedure. At the same time, another CPU
> tries to call panic(), but it doesn't run the panic procedure;
> panic_cpu prevents the second CPU from running it.
>
> This means __crash_kexec is called only once even if we don't
> check the return value of __crash_kexec.
I think we're on the same page, even if we express it differently - the
other CPUs entering panic() will loop in panic_smp_self_stop() so they
won't reach __crash_kexec().
> (Please note that crash_kexec can be called multiple times in the
> case of oops_end() --> crash_kexec().)
Right, and that was the case that was bugging me - calling into
crash_kexec() on multiple CPUs. But it's just a trylock and a pointer
test, so I guess that's a diminishingly small overhead, not worth
worrying about.
Thanks.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.