LinuxLists.cc - [V2 PATCH 0/3] x86: Fix panic vs. NMI issues

2015-07-27 05:11:15

by Hidehiro Kawai

[permalink] [raw]

Subject: [V2 PATCH 0/3] x86: Fix panic vs. NMI issues

When an HA cluster software or administrator detects non-response
of a host, they issue an NMI to the host to completely stop current
works and take a crash dump. If the kernel has already panicked
or is capturing a crash dump at that time, further NMI can cause
a crash dump failure.

To solve this issue, this patch set does two things:

- Don't panic on NMI if the kernel has already panicked
- Introduce "noextnmi" boot option which masks external NMI at the
boot time (supported only for x86)

V2:
- Use atomic_cmpxchg() instead of current spin_trylock() to exclude
concurrent accesses to panic() and crash_kexec()
- Don't introduce no-lock version of panic() and crash_kexec()

---

Hidehiro Kawai (3):
x86/panic: Fix re-entrance problem due to panic on NMI
kexec: Fix race between panic() and crash_kexec() called directly
x86/apic: Introduce noextnmi boot option

Documentation/kernel-parameters.txt | 4 ++++
arch/x86/kernel/apic/apic.c | 17 ++++++++++++++++-
arch/x86/kernel/nmi.c | 15 +++++++++++----
include/linux/kernel.h | 1 +
kernel/kexec.c | 20 ++++++++++++++++++++
kernel/panic.c | 13 ++++++++++---
6 files changed, 62 insertions(+), 8 deletions(-)

--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group

2015-07-27 05:11:29

by Hidehiro Kawai

[permalink] [raw]

Subject: [V2 PATCH 3/3] x86/apic: Introduce noextnmi boot option

This patch introduces new boot option "noextnmi" which disables
external NMI. This option is useful for the dump capture kernel
so that an HA application or administrator wouldn't mistakenly
shoot down the kernel by NMI.

Currently, only x86 supports this option.

Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Jonathan Corbet <[email protected]>
---
Documentation/kernel-parameters.txt | 4 ++++
arch/x86/kernel/apic/apic.c | 17 ++++++++++++++++-
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f045..2cbd40b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2364,6 +2364,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
noexec=on: enable non-executable mappings (default)
noexec=off: disable non-executable mappings

+ noextnmi [X86]
+ Mask external NMI. This option is useful for a
+ dump capture kernel to be shot down by NMI.
+
nosmap [X86]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..a140410 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -82,6 +82,12 @@
static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;

/*
+ * Don't enable external NMI via LINT1 on BSP. This is useful for
+ * the dump capture kernel.
+ */
+static bool apic_noextnmi;
+
+/*
* Map cpu index to physical APIC ID
*/
DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -1150,6 +1156,8 @@ void __init init_bsp_APIC(void)
value = APIC_DM_NMI;
if (!lapic_is_integrated()) /* 82489DX */
value |= APIC_LVT_LEVEL_TRIGGER;
+ if (apic_noextnmi)
+ value |= APIC_LVT_MASKED;
apic_write(APIC_LVT1, value);
}

@@ -1369,7 +1377,7 @@ void setup_local_APIC(void)
/*
* only the BP should see the LINT1 NMI signal, obviously.
*/
- if (!cpu)
+ if (!cpu && !apic_noextnmi)
value = APIC_DM_NMI;
else
value = APIC_DM_NMI | APIC_LVT_MASKED;
@@ -2537,3 +2545,10 @@ static int __init apic_set_disabled_cpu_apicid(char *arg)
return 0;
}
early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
+
+static int __init apic_set_noextnmi(char *arg)
+{
+ apic_noextnmi = true;
+ return 0;
+}
+early_param("noextnmi", apic_set_noextnmi);

2015-07-27 05:11:28

by Hidehiro Kawai

[permalink] [raw]

Subject: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

If panic on NMI happens just after panic() on the same CPU, panic()
is recursively called. As the result, it stalls after failing to
acquire panic_lock.

To avoid this problem, don't call panic() in NMI context if
we've already entered panic().

V2:
- Use atomic_cmpxchg() instead of current spin_trylock() to
exclude concurrent accesses to the panic routines
- Don't introduce no-lock version of panic()

Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/x86/kernel/nmi.c | 15 +++++++++++----
include/linux/kernel.h | 1 +
kernel/panic.c | 13 ++++++++++---
3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index d05bd2e..5b32d81 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -230,7 +230,8 @@ void unregister_nmi_handler(unsigned int type, const char *name)
}
#endif

- if (panic_on_unrecovered_nmi)
+ if (panic_on_unrecovered_nmi &&
+ atomic_cmpxchg(&panicking_cpu, -1, raw_smp_processor_id()) == -1)
panic("NMI: Not continuing");

pr_emerg("Dazed and confused, but trying to continue\n");
@@ -255,8 +256,13 @@ void unregister_nmi_handler(unsigned int type, const char *name)
reason, smp_processor_id());
show_regs(regs);

- if (panic_on_io_nmi)
- panic("NMI IOCK error: Not continuing");
+ if (panic_on_io_nmi) {
+ if (atomic_cmpxchg(&panicking_cpu, -1, raw_smp_processor_id())
+ == -1)
+ panic("NMI IOCK error: Not continuing");
+ else
+ return; /* We don't want to wait and re-enable NMI */
+ }

/* Re-enable the IOCK line, wait for a few seconds */
reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
@@ -296,7 +302,8 @@ void unregister_nmi_handler(unsigned int type, const char *name)
reason, smp_processor_id());

pr_emerg("Do you have a strange power saving mode enabled?\n");
- if (unknown_nmi_panic || panic_on_unrecovered_nmi)
+ if ((unknown_nmi_panic || panic_on_unrecovered_nmi) &&
+ atomic_cmpxchg(&panicking_cpu, -1, raw_smp_processor_id()) == -1)
panic("NMI: Not continuing");

pr_emerg("Dazed and confused, but trying to continue\n");
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5582410..8ca199b 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -442,6 +442,7 @@ extern __scanf(2, 0)
extern int sysctl_panic_on_stackoverflow;

extern bool crash_kexec_post_notifiers;
+extern atomic_t panicking_cpu;

/*
* Only to be used by arch init code. If the user over-wrote the default
diff --git a/kernel/panic.c b/kernel/panic.c
index 04e91ff..7e6b568 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -60,6 +60,8 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}

+atomic_t panicking_cpu = ATOMIC_INIT(-1);
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -70,17 +72,17 @@ void __weak panic_smp_self_stop(void)
*/
void panic(const char *fmt, ...)
{
- static DEFINE_SPINLOCK(panic_lock);
static char buf[1024];
va_list args;
long i, i_next = 0;
int state = 0;
+ int old_cpu, this_cpu;

/*
* Disable local interrupts. This will prevent panic_smp_self_stop
* from deadlocking the first cpu that invokes the panic, since
* there is nothing to prevent an interrupt handler (that runs
- * after the panic_lock is acquired) from invoking panic again.
+ * after setting panicking_cpu) from invoking panic again.
*/
local_irq_disable();

@@ -93,8 +95,13 @@ void panic(const char *fmt, ...)
* multiple parallel invocations of panic, all other CPUs either
* stop themself or will wait until they are stopped by the 1st CPU
* with smp_send_stop().
+ *
+ * `old_cpu == -1' means we are the first comer.
+ * `old_cpu == this_cpu' means we came here due to panic on NMI.
*/
- if (!spin_trylock(&panic_lock))
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panicking_cpu, -1, this_cpu);
+ if (old_cpu != -1 && old_cpu != this_cpu)
panic_smp_self_stop();

console_verbose();

2015-07-27 05:11:53

by Hidehiro Kawai

[permalink] [raw]

Subject: [V2 PATCH 2/3] kexec: Fix race between panic() and crash_kexec() called directly

Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):

CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other cpus

CPU 1:
panic()
crash_kexec()
mutex_trylock() // failed to acquire
smp_send_stop() // stop other cpus
infinite loop

If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.

In another case:

CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
<NMI>
io_check_error()
panic()
crash_kexec()
mutex_trylock() // failed to acquire
infinite loop

Clearly, this is an undesirable result.

To fix this problem, this patch changes crash_kexec() to exclude
others by using atomic_t panicking_cpu.

V2:
- Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
to exclude concurrent accesses
- Don't introduce no-lock version of crash_kexec()

Signed-off-by: Hidehiro Kawai <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Andrew Morton <[email protected]>
---
kernel/kexec.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index a785c10..ca40a19 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1472,6 +1472,18 @@ void __weak crash_unmap_reserved_pages(void)

void crash_kexec(struct pt_regs *regs)
{
+ int old_cpu, this_cpu;
+
+ /*
+ * `old_cpu == -1' means we are the first comer and crash_kexec()
+ * was called without entering panic().
+ * `old_cpu == this_cpu' means crash_kexec() was called from panic().
+ */
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panicking_cpu, -1, this_cpu);
+ if (old_cpu != -1 && old_cpu != this_cpu)
+ return;
+
/* Take the kexec_mutex here to prevent sys_kexec_load
* running on one cpu from replacing the crash kernel
* we are using after a panic on a different cpu.
@@ -1491,6 +1503,14 @@ void crash_kexec(struct pt_regs *regs)
}
mutex_unlock(&kexec_mutex);
}
+
+ /*
+ * If we came here from panic(), we have to keep panicking_cpu
+ * to prevent other cpus from entering panic(). Otherwise,
+ * resetting it so that other cpus can enter panic()/crash_kexec().
+ */
+ if (old_cpu == this_cpu)
+ atomic_set(&panicking_cpu, -1);
}

size_t crash_get_memory_size(void)

2015-07-27 14:34:12

by Michal Hocko

[permalink] [raw]

Subject: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
[...]
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index d05bd2e..5b32d81 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -230,7 +230,8 @@ void unregister_nmi_handler(unsigned int type, const char *name)
> }
> #endif
>
> - if (panic_on_unrecovered_nmi)
> + if (panic_on_unrecovered_nmi &&
> + atomic_cmpxchg(&panicking_cpu, -1, raw_smp_processor_id()) == -1)
> panic("NMI: Not continuing");

Spreading the check to all NMI callers is quite ugly. Wouldn't it be
better to introduce nmi_panic() which wouldn't be __noreturn unlike the
regular panic. The check could be also relaxed a bit and nmi_panic would
return only if the ongoing panic is the current cpu when we really have
to return and allow the preempted panic to finish.

Something like
---
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5582410727cb..409091c48e6c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -253,6 +253,7 @@ static inline void might_fault(void) { }
extern struct atomic_notifier_head panic_notifier_list;
extern long (*panic_blink)(int state);
__printf(1, 2)
+void nmi_panic(const char *fmt, ...) __cold;
void panic(const char *fmt, ...)
__noreturn __cold;
extern void oops_enter(void);
diff --git a/kernel/panic.c b/kernel/panic.c
index 04e91ff7560b..4c1ff7e19cdc 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -60,6 +60,8 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
}

+static atomic_t panic_cpu = ATOMIC_INIT(-1);
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -70,11 +72,11 @@ void __weak panic_smp_self_stop(void)
*/
void panic(const char *fmt, ...)
{
- static DEFINE_SPINLOCK(panic_lock);
static char buf[1024];
va_list args;
long i, i_next = 0;
int state = 0;
+ int this_cpu, old_cpu;

/*
* Disable local interrupts. This will prevent panic_smp_self_stop
@@ -94,7 +96,9 @@ void panic(const char *fmt, ...)
* stop themself or will wait until they are stopped by the 1st CPU
* with smp_send_stop().
*/
- if (!spin_trylock(&panic_lock))
+ this_cpu = raw_smp_processor_id();
+ old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
+ if (old_cpu != -1 && old_cpu != this_cpu)
panic_smp_self_stop();

console_verbose();
@@ -201,9 +205,20 @@ void panic(const char *fmt, ...)
mdelay(PANIC_TIMER_STEP);
}
}
-
EXPORT_SYMBOL(panic);

+void nmi_panic(const char *fmt, ...)
+{
+ /*
+ * We have to back off if the NMI has preempted an ongoing panic and
+ * allow it to finish
+ */
+ if (atomic_read(&panic_cpu) == raw_smp_processor_id())
+ return;
+
+ panic();
+}
+EXPORT_SYMBOL(nmi_panic);

struct tnt {
u8 bit;
--
Michal Hocko
SUSE Labs

2015-07-27 14:55:12

by Michal Hocko

[permalink] [raw]

Subject: Re: [V2 PATCH 2/3] kexec: Fix race between panic() and crash_kexec() called directly

On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
[...]
> @@ -1472,6 +1472,18 @@ void __weak crash_unmap_reserved_pages(void)
>
> void crash_kexec(struct pt_regs *regs)
> {
> + int old_cpu, this_cpu;
> +
> + /*
> + * `old_cpu == -1' means we are the first comer and crash_kexec()
> + * was called without entering panic().
> + * `old_cpu == this_cpu' means crash_kexec() was called from panic().
> + */
> + this_cpu = raw_smp_processor_id();
> + old_cpu = atomic_cmpxchg(&panicking_cpu, -1, this_cpu);
> + if (old_cpu != -1 && old_cpu != this_cpu)
> + return;
> +
> /* Take the kexec_mutex here to prevent sys_kexec_load
> * running on one cpu from replacing the crash kernel
> * we are using after a panic on a different cpu.
> @@ -1491,6 +1503,14 @@ void crash_kexec(struct pt_regs *regs)
> }
> mutex_unlock(&kexec_mutex);
> }
> +
> + /*
> + * If we came here from panic(), we have to keep panicking_cpu
> + * to prevent other cpus from entering panic(). Otherwise,
> + * resetting it so that other cpus can enter panic()/crash_kexec().
> + */
> + if (old_cpu == this_cpu)
> + atomic_set(&panicking_cpu, -1);

This do the opposite what the comment says, wouldn't it? You should
check old_cpu == -1. Also atomic_set doesn't imply memory barriers which
might be a problem.
--
Michal Hocko
SUSE Labs

2015-07-28 02:02:18

by Hidehiro Kawai

[permalink] [raw]

Subject: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

Hi,

Thanks for the review.

(2015/07/27 23:34), Michal Hocko wrote:
> On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
> [...]
>> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
>> index d05bd2e..5b32d81 100644
>> --- a/arch/x86/kernel/nmi.c
>> +++ b/arch/x86/kernel/nmi.c
>> @@ -230,7 +230,8 @@ void unregister_nmi_handler(unsigned int type, const char *name)
>> }
>> #endif
>>
>> - if (panic_on_unrecovered_nmi)
>> + if (panic_on_unrecovered_nmi &&
>> + atomic_cmpxchg(&panicking_cpu, -1, raw_smp_processor_id()) == -1)
>> panic("NMI: Not continuing");
>
> Spreading the check to all NMI callers is quite ugly. Wouldn't it be
> better to introduce nmi_panic() which wouldn't be __noreturn unlike the
> regular panic.

Sure. I'll fix it.

> The check could be also relaxed a bit and nmi_panic would
> return only if the ongoing panic is the current cpu when we really have
> to return and allow the preempted panic to finish.

It's reasonable. I'll do that in the next version.

> Something like
[...]
> +void nmi_panic(const char *fmt, ...)

Since we can't directly pass variable arguments to a subroutine,
we have to use a macro or do like this:

void nmi_panic(const char *msg)
{
...
panic("%s", msg);
}

If there is no objection, I'm going to use a macro.

> +{
> + /*
> + * We have to back off if the NMI has preempted an ongoing panic and
> + * allow it to finish
> + */
> + if (atomic_read(&panic_cpu) == raw_smp_processor_id())
> + return;
> +
> + panic();
> +}
> +EXPORT_SYMBOL(nmi_panic);
>
> struct tnt {
> u8 bit;
>

--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group

2015-07-28 02:15:16

by Hidehiro Kawai

[permalink] [raw]

Subject: Re: [V2 PATCH 2/3] kexec: Fix race between panic() and crash_kexec() called directly

Hi,

(2015/07/27 23:55), Michal Hocko wrote:
> On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
> [...]
>> @@ -1472,6 +1472,18 @@ void __weak crash_unmap_reserved_pages(void)
>>
>> void crash_kexec(struct pt_regs *regs)
>> {
>> + int old_cpu, this_cpu;
>> +
>> + /*
>> + * `old_cpu == -1' means we are the first comer and crash_kexec()
>> + * was called without entering panic().
>> + * `old_cpu == this_cpu' means crash_kexec() was called from panic().
>> + */
>> + this_cpu = raw_smp_processor_id();
>> + old_cpu = atomic_cmpxchg(&panicking_cpu, -1, this_cpu);
>> + if (old_cpu != -1 && old_cpu != this_cpu)
>> + return;
>> +
>> /* Take the kexec_mutex here to prevent sys_kexec_load
>> * running on one cpu from replacing the crash kernel
>> * we are using after a panic on a different cpu.
>> @@ -1491,6 +1503,14 @@ void crash_kexec(struct pt_regs *regs)
>> }
>> mutex_unlock(&kexec_mutex);
>> }
>> +
>> + /*
>> + * If we came here from panic(), we have to keep panicking_cpu
>> + * to prevent other cpus from entering panic(). Otherwise,
>> + * resetting it so that other cpus can enter panic()/crash_kexec().
>> + */
>> + if (old_cpu == this_cpu)
>> + atomic_set(&panicking_cpu, -1);
>
> This do the opposite what the comment says, wouldn't it? You should
> check old_cpu == -1.

Sorry, you are right. I performed same tests as for the
previous patch set, but I missed the test case for this
new logic.

> Also atomic_set doesn't imply memory barriers which
> might be a problem.

OK, I'll use atomic_xchg().

Regards,
--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group

2015-07-28 08:01:12

by Michal Hocko

[permalink] [raw]

Subject: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Tue 28-07-15 11:02:11, Hidehiro Kawai wrote:
[...]
> > Something like
> [...]
> > +void nmi_panic(const char *fmt, ...)
>
> Since we can't directly pass variable arguments to a subroutine,

Sure, I was just too lazy to finish this as it was just an illustration
of the idea.

> we have to use a macro or do like this:
>
> void nmi_panic(const char *msg)
> {
> ...
> panic("%s", msg);
> }
>
> If there is no objection, I'm going to use a macro.

Your other patch needs panic_cpu externally visible so the macro should
be OK.
--
Michal Hocko
SUSE Labs

2015-07-29 05:48:54

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

Hi,

> From: [email protected] [mailto:[email protected]] On Behalf Of Hidehiro Kawai
> (2015/07/27 23:34), Michal Hocko wrote:
> > On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
[...]
> > The check could be also relaxed a bit and nmi_panic would
> > return only if the ongoing panic is the current cpu when we really have
> > to return and allow the preempted panic to finish.
>
> It's reasonable. I'll do that in the next version.

I noticed atomic_read() is insufficient. Please consider the following
scenario.

CPU 1: call panic() in the normal context
CPU 0: call nmi_panic(), check the value of panic_cpu, then call panic()
CPU 1: set 1 to panic_cpu
CPU 0: fail to set 0 to panic_cpu, then do an infinite loop
CPU 1: call crash_kexec(), then call kdump_nmi_shootdown_cpus()

At this point, since CPU 0 loops in NMI context, it never executes
the NMI handler registered by kdump_nmi_shootdown_cpus(). This means
that no register states are saved and no cleanups for VMX/SVM are
performed. So, we should still use atomic_cmpxchg() in nmi_panic() to
prevent other cpus from running panic routines.

> > +void nmi_panic(const char *fmt, ...)
> > +{
> > + /*
> > + * We have to back off if the NMI has preempted an ongoing panic and
> > + * allow it to finish
> > + */
> > + if (atomic_read(&panic_cpu) == raw_smp_processor_id())
> > + return;
> > +
> > + panic();
> > +}
> > +EXPORT_SYMBOL(nmi_panic);

2015-07-29 08:23:38

by Michal Hocko

[permalink] [raw]

Subject: Re: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Wed 29-07-15 05:48:47, 河合英宏 / KAWAI，HIDEHIRO wrote:
> Hi,
>
> > From: [email protected] [mailto:[email protected]] On Behalf Of Hidehiro Kawai
> > (2015/07/27 23:34), Michal Hocko wrote:
> > > On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
> [...]
> > > The check could be also relaxed a bit and nmi_panic would
> > > return only if the ongoing panic is the current cpu when we really have
> > > to return and allow the preempted panic to finish.
> >
> > It's reasonable. I'll do that in the next version.
>
> I noticed atomic_read() is insufficient. Please consider the following
> scenario.
>
> CPU 1: call panic() in the normal context
> CPU 0: call nmi_panic(), check the value of panic_cpu, then call panic()
> CPU 1: set 1 to panic_cpu
> CPU 0: fail to set 0 to panic_cpu, then do an infinite loop
> CPU 1: call crash_kexec(), then call kdump_nmi_shootdown_cpus()
>
> At this point, since CPU 0 loops in NMI context, it never executes
> the NMI handler registered by kdump_nmi_shootdown_cpus(). This means
> that no register states are saved and no cleanups for VMX/SVM are
> performed.

Yes this is true but it is no different from the current state, isn't
it? So if you want to handle that then it deserves a separate patch.
It is certainly not harmful wrt. panic behavior.

> So, we should still use atomic_cmpxchg() in nmi_panic() to
> prevent other cpus from running panic routines.

Not sure what you mean by that.

--
Michal Hocko
SUSE Labs

2015-07-29 09:09:29

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

> From: Michal Hocko [mailto:[email protected]]
> On Wed 29-07-15 05:48:47, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > Hi,
> >
> > > From: [email protected] [mailto:[email protected]] On Behalf Of Hidehiro Kawai
> > > (2015/07/27 23:34), Michal Hocko wrote:
> > > > On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
> > [...]
> > > > The check could be also relaxed a bit and nmi_panic would
> > > > return only if the ongoing panic is the current cpu when we really have
> > > > to return and allow the preempted panic to finish.
> > >
> > > It's reasonable. I'll do that in the next version.
> >
> > I noticed atomic_read() is insufficient. Please consider the following
> > scenario.
> >
> > CPU 1: call panic() in the normal context
> > CPU 0: call nmi_panic(), check the value of panic_cpu, then call panic()
> > CPU 1: set 1 to panic_cpu
> > CPU 0: fail to set 0 to panic_cpu, then do an infinite loop
> > CPU 1: call crash_kexec(), then call kdump_nmi_shootdown_cpus()
> >
> > At this point, since CPU 0 loops in NMI context, it never executes
> > the NMI handler registered by kdump_nmi_shootdown_cpus(). This means
> > that no register states are saved and no cleanups for VMX/SVM are
> > performed.
>
> Yes this is true but it is no different from the current state, isn't
> it? So if you want to handle that then it deserves a separate patch.
> It is certainly not harmful wrt. panic behavior.
>
> > So, we should still use atomic_cmpxchg() in nmi_panic() to
> > prevent other cpus from running panic routines.
>
> Not sure what you mean by that.

I mean that we should use the same logic as my V2 patch like this:

#define nmi_panic(fmt, ...) \
do { \
if (atomic_cmpxchg(&panic_cpu, -1, raw_smp_processor_id()) \
== -1) \
panic(fmt, ##__VA_ARGS__); \
} while (0)

By using atomic_cmpxchg here, we can ensure that only this cpu
runs panic routines. It is important to prevent a NMI-context cpu
from calling panic_smp_self_stop().

void panic(const char *fmt, ...)
{
...
* `old_cpu == -1' means we are the first comer.
* `old_cpu == this_cpu' means we came here due to panic on NMI.
*/
this_cpu = raw_smp_processor_id();
old_cpu = atomic_cmpxchg(&panic_cpu, -1, this_cpu);
if (old_cpu != -1 && old_cpu != this_cpu)
panic_smp_self_stop();

Please assume that CPU 0 calls nmi_panic() in NMI context
and CPU 1 calls panic() in normal context at tha same time.

If CPU 1 set panic_cpu before CPU 0 does, CPU 1 runs panic routines
and CPU 0 return from the nmi handler. Eventually CPU 0 is stopped
by nmi_shootdown_cpus().

If CPU 0 set panic_cpu before CPU 1 does, CPU 0 runs panic routines.
CPU 1 calls panic_smp_self_stop(), and wait for NMI by
nmi_shootdown_cpus().

Anyway, I tested my approach and it worked fine.

Regards,
Kawai

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2015-07-29 09:22:09

by Michal Hocko

[permalink] [raw]

Subject: Re: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Wed 29-07-15 09:09:18, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > From: Michal Hocko [mailto:[email protected]]
> > On Wed 29-07-15 05:48:47, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > > Hi,
> > >
> > > > From: [email protected] [mailto:[email protected]] On Behalf Of Hidehiro Kawai
> > > > (2015/07/27 23:34), Michal Hocko wrote:
> > > > > On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
> > > [...]
> > > > > The check could be also relaxed a bit and nmi_panic would
> > > > > return only if the ongoing panic is the current cpu when we really have
> > > > > to return and allow the preempted panic to finish.
> > > >
> > > > It's reasonable. I'll do that in the next version.
> > >
> > > I noticed atomic_read() is insufficient. Please consider the following
> > > scenario.
> > >
> > > CPU 1: call panic() in the normal context
> > > CPU 0: call nmi_panic(), check the value of panic_cpu, then call panic()
> > > CPU 1: set 1 to panic_cpu
> > > CPU 0: fail to set 0 to panic_cpu, then do an infinite loop
> > > CPU 1: call crash_kexec(), then call kdump_nmi_shootdown_cpus()
> > >
> > > At this point, since CPU 0 loops in NMI context, it never executes
> > > the NMI handler registered by kdump_nmi_shootdown_cpus(). This means
> > > that no register states are saved and no cleanups for VMX/SVM are
> > > performed.
> >
> > Yes this is true but it is no different from the current state, isn't
> > it? So if you want to handle that then it deserves a separate patch.
> > It is certainly not harmful wrt. panic behavior.
> >
> > > So, we should still use atomic_cmpxchg() in nmi_panic() to
> > > prevent other cpus from running panic routines.
> >
> > Not sure what you mean by that.
>
> I mean that we should use the same logic as my V2 patch like this:
>
> #define nmi_panic(fmt, ...) \
> do { \
> if (atomic_cmpxchg(&panic_cpu, -1, raw_smp_processor_id()) \
> == -1) \
> panic(fmt, ##__VA_ARGS__); \
> } while (0)

This would allow to return from NMI too eagerly. When I was testing my
previous approach (on 3.0 based kernel) I had basically the same thing
(one NMI to process panic) and others to return. This led to a strange
behavior when the NMI button triggered NMI on all (hundreds) CPUs. The
crash kernel booted eventually but the log contained lockups when a
CPU waited for an IPI to the CPU which was handling the NMI panic.

Anyway, I do not thing this is really necessary to solve the panic
reentrancy issue. If the missing saved state is a real problem then it
should be handled separately - maybe it can be achieved without an IPI
and directly from the panic context if we are in NMI.
--
Michal Hocko
SUSE Labs

2015-07-30 01:45:45

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

Hi,

> From: Michal Hocko [mailto:[email protected]]
>
> On Wed 29-07-15 09:09:18, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > > From: Michal Hocko [mailto:[email protected]]
> > > On Wed 29-07-15 05:48:47, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > > > Hi,
> > > >
> > > > > From: [email protected] [mailto:[email protected]] On Behalf Of Hidehiro Kawai
> > > > > (2015/07/27 23:34), Michal Hocko wrote:
> > > > > > On Mon 27-07-15 10:58:50, Hidehiro Kawai wrote:
> > > > [...]
> > > > > > The check could be also relaxed a bit and nmi_panic would
> > > > > > return only if the ongoing panic is the current cpu when we really have
> > > > > > to return and allow the preempted panic to finish.
> > > > >
> > > > > It's reasonable. I'll do that in the next version.
> > > >
> > > > I noticed atomic_read() is insufficient. Please consider the following
> > > > scenario.
> > > >
> > > > CPU 1: call panic() in the normal context
> > > > CPU 0: call nmi_panic(), check the value of panic_cpu, then call panic()
> > > > CPU 1: set 1 to panic_cpu
> > > > CPU 0: fail to set 0 to panic_cpu, then do an infinite loop
> > > > CPU 1: call crash_kexec(), then call kdump_nmi_shootdown_cpus()
> > > >
> > > > At this point, since CPU 0 loops in NMI context, it never executes
> > > > the NMI handler registered by kdump_nmi_shootdown_cpus(). This means
> > > > that no register states are saved and no cleanups for VMX/SVM are
> > > > performed.
> > >
> > > Yes this is true but it is no different from the current state, isn't
> > > it? So if you want to handle that then it deserves a separate patch.
> > > It is certainly not harmful wrt. panic behavior.
> > >
> > > > So, we should still use atomic_cmpxchg() in nmi_panic() to
> > > > prevent other cpus from running panic routines.
> > >
> > > Not sure what you mean by that.
> >
> > I mean that we should use the same logic as my V2 patch like this:
> >
> > #define nmi_panic(fmt, ...) \
> > do { \
> > if (atomic_cmpxchg(&panic_cpu, -1, raw_smp_processor_id()) \
> > == -1) \
> > panic(fmt, ##__VA_ARGS__); \
> > } while (0)
>
> This would allow to return from NMI too eagerly.

Yes, but what's the problem?
The root cause of your case hasn't been clarified yet.
I can't fix for an unclear issue because I don't know what's the right
solution.

> When I was testing my
> previous approach (on 3.0 based kernel) I had basically the same thing
> (one NMI to process panic) and others to return. This led to a strange
> behavior when the NMI button triggered NMI on all (hundreds) CPUs.

It's strange. Usually, NMI caused by NMI button is routed to only CPU 0
as an external NMI. External NMI for CPUs other than CPU 0 are masked
at boot time. Does it really happen? Does the problem still happen on
the latest kernel? What kind of NMI is deliverd to each CPU?

Traditionally, we should have assumed that NMI for crash dumping is
delivered to only one cpu. Otherwise, we should often fail to take
a proper crash dump. It seems that your case is another problem to be
solved separately.

> The
> crash kernel booted eventually but the log contained lockups when a
> CPU waited for an IPI to the CPU which was handling the NMI panic.

Could you explain more precisely?

> Anyway, I do not thing this is really necessary to solve the panic
> reentrancy issue.
> If the missing saved state is a real problem then it
> should be handled separately - maybe it can be achieved without an IPI
> and directly from the panic context if we are in NMI.

What I would like to do via this patchse is to solve race issues
among NMI, panic() and crash_kexec(). So, I don't think we should fix
that separately, although I would need to reword some descriptions
and titles.

Anyway, I'm going to sent out my revised version once in order to
tidy up. I also would like to hear kexec/kdump guys' opinions.

Regards,
Kawai

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2015-07-30 07:33:34

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

Hi Michal,

> From: 河合英宏 / KAWAI，HIDEHIRO [mailto:[email protected]]
> > When I was testing my
> > previous approach (on 3.0 based kernel) I had basically the same thing
> > (one NMI to process panic) and others to return. This led to a strange
> > behavior when the NMI button triggered NMI on all (hundreds) CPUs.
>
> It's strange. Usually, NMI caused by NMI button is routed to only CPU 0
> as an external NMI. External NMI for CPUs other than CPU 0 are masked
> at boot time. Does it really happen? Does the problem still happen on
> the latest kernel? What kind of NMI is deliverd to each CPU?

Are you using SGI UV? On that platform, NMIs may be delivered to
all cpus because LVT1 of all cpus are not masked as follows:

void uv_nmi_init(void)
{
unsigned int value;

/*
* Unmask NMI on all cpus
*/
value = apic_read(APIC_LVT1) | APIC_DM_NMI;
value &= ~APIC_LVT_MASKED;
apic_write(APIC_LVT1, value);
}

>
> Traditionally, we should have assumed that NMI for crash dumping is
> delivered to only one cpu. Otherwise, we should often fail to take
> a proper crash dump. It seems that your case is another problem to be
> solved separately.
>
> > The
> > crash kernel booted eventually but the log contained lockups when a
> > CPU waited for an IPI to the CPU which was handling the NMI panic.
>
> Could you explain more precisely?
>
> > Anyway, I do not thing this is really necessary to solve the panic
> > reentrancy issue.
> > If the missing saved state is a real problem then it
> > should be handled separately - maybe it can be achieved without an IPI
> > and directly from the panic context if we are in NMI.
>
> What I would like to do via this patchse is to solve race issues
> among NMI, panic() and crash_kexec(). So, I don't think we should fix
> that separately, although I would need to reword some descriptions
> and titles.
>
> Anyway, I'm going to sent out my revised version once in order to
> tidy up. I also would like to hear kexec/kdump guys' opinions.
>
> Regards,
> Kawai

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2015-07-30 07:48:22

by Michal Hocko

[permalink] [raw]

Subject: Re: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Thu 30-07-15 01:45:35, 河合英宏 / KAWAI，HIDEHIRO wrote:
> Hi,
>
> > From: Michal Hocko [mailto:[email protected]]
> >
> > On Wed 29-07-15 09:09:18, 河合英宏 / KAWAI，HIDEHIRO wrote:
[...]
> > > #define nmi_panic(fmt, ...) \
> > > do { \
> > > if (atomic_cmpxchg(&panic_cpu, -1, raw_smp_processor_id()) \
> > > == -1) \
> > > panic(fmt, ##__VA_ARGS__); \
> > > } while (0)
> >
> > This would allow to return from NMI too eagerly.
>
> Yes, but what's the problem?

I believe that panic should be noreturn as much as possible and return
only when we do not have any other options. Moreover I would ask an
opposite question, what is the problem to loop in NMI on other CPUs than
the one which is performing crash_kexec? We will not save registers, so
what?

> The root cause of your case hasn't been clarified yet.
> I can't fix for an unclear issue because I don't know what's the right
> solution.
>
> > When I was testing my
> > previous approach (on 3.0 based kernel) I had basically the same thing
> > (one NMI to process panic) and others to return. This led to a strange
> > behavior when the NMI button triggered NMI on all (hundreds) CPUs.
>
> It's strange. Usually, NMI caused by NMI button is routed to only CPU 0
> as an external NMI. External NMI for CPUs other than CPU 0 are masked
> at boot time. Does it really happen?

Could you point me to the code which does that, please? Maybe we are
missing that in our 3.0 kernel. I was quite surprised to see this
behavior as well.

> Does the problem still happen on the latest kernel?

I do not have machine accessible so I have to rely on the customer to
test and the current vanilla might be an issue.

> What kind of NMI is deliverd to each CPU?

See the log below.

> Traditionally, we should have assumed that NMI for crash dumping is
> delivered to only one cpu. Otherwise, we should often fail to take
> a proper crash dump.

You might still get a panic on hardlockup which will happen on all CPUs
from the NMI context so we have to be able to handle panic in NMI on
many CPUs.

> It seems that your case is another problem to be solved separately.

I do not think so, quite contrary. If you want to solve the reentrancy
then other CPUs might be spinning in NMI if there is a guarantee that at
least one CPU can progress to finish crash_kexec().

> > The
> > crash kernel booted eventually but the log contained lockups when a
> > CPU waited for an IPI to the CPU which was handling the NMI panic.
>
> Could you explain more precisely?

[ 167.843761] Uhhuh. NMI received for unknown reason 3d on CPU 130.
[ 167.843763] Do you have a strange power saving mode enabled?
[... Mangled output ....]
[ 167.856415] Dazed and confused, but trying to continue
[ 167.856428] Dazed and confused, but trying to continue
[ 167.856442] Dazed and confused, but trying to continue
[...]
[ 193.108440] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
[...]
[ 193.108586] Call Trace:
[ 193.108595] [<ffffffff8109baeb>] smp_call_function_single+0x15b/0x170
[ 193.108600] [<ffffffff8109bb4e>] smp_call_function_any+0x4e/0x110
[ 193.108607] [<ffffffffa04a332c>] get_cur_val+0xbc/0x130 [acpi_cpufreq]
[ 193.108630] [<ffffffffa04a3417>] get_cur_freq_on_cpu+0x77/0xf0 [acpi_cpufreq]
[ 193.108638] [<ffffffff8137bc37>] cpufreq_update_policy+0x97/0x140
[ 193.108646] [<ffffffffa00ca04b>] acpi_processor_notify+0x4b/0x145 [processor]
[ 193.108654] [<ffffffff812d2eca>] acpi_ev_notify_dispatch+0x61/0x77
[ 193.108659] [<ffffffff812c1785>] acpi_os_execute_deferred+0x21/0x2c
[ 193.108667] [<ffffffff8107d03c>] process_one_work+0x16c/0x350
[ 193.108673] [<ffffffff8107fd6a>] worker_thread+0x17a/0x410
[ 193.108679] [<ffffffff81084136>] kthread+0x96/0xa0
[ 193.108688] [<ffffffff8146df64>] kernel_thread_helper+0x4/0x10
[...]
[ 221.068390] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
[...]
[ 227.991235] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 130} (detected by 56, t=15002 jiffies)
[ 227.991247] sending NMI to all CPUs:
[ 227.991251] NMI backtrace for cpu 0
[ 229.074091] INFO: rcu_bh_state detected stalls on CPUs/tasks: { 130} (detected by 105, t=15013 jiffies)
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.0.101-0.47.55.9.8853.0.TEST-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Thu May 28 08:25:11 UTC 2015 (dc083ee)
[ 0.000000] Command line: root=/dev/system/lvroot resume=/dev/system/lvswap intel_idle.max_cstate=0 processor.max_cstate=0 elevator=deadline nmi_watchdog=1 console=tty0 console=ttyS1,115200 elevator=deadline sysrq=yes reset_devices irqpoll maxcpus=1 disable_cpu_apicid=0 noefi acpi_rsdp=0xba7a4014 crashkernel=1024M-:512M memmap=exactmap memmap=576K@64K memmap=523684K@393216K elfcorehdr=916900K memmap=32768K#3018748K memmap=3736K#3051516K memmap=262144K$3145728K

I can provide the full log but it is quite mangled. I guess the
CPU130 was the only one allowed to proceed with the panic while others
returned from the unknown NMI handling. It took a lot of time until
CPU130 managed to boot the crash kernel with soft lockups and RCU stalls
reports. CPU0 is most probably locked up waiting for CPU130 to
acknowledge the IPI which will not happen apparently.

Maybe this is not possible in the current kernels for some reason but it
tells me that returning from panic is quite fragile so I would like to
prevent from it as much as possible.

> > Anyway, I do not thing this is really necessary to solve the panic
> > reentrancy issue.
> > If the missing saved state is a real problem then it
> > should be handled separately - maybe it can be achieved without an IPI
> > and directly from the panic context if we are in NMI.
>
> What I would like to do via this patchse is to solve race issues
> among NMI, panic() and crash_kexec().

Yes I fully support you in this ;) I just believe that spinning in NMI
vs. saving registers is a separate issue.

> So, I don't think we should fix that separately, although I would need
> to reword some descriptions and titles.

I can have them tested.

--
Michal Hocko
SUSE Labs

2015-07-30 07:55:19

by Michal Hocko

[permalink] [raw]

Subject: Re: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Thu 30-07-15 07:33:15, 河合英宏 / KAWAI，HIDEHIRO wrote:
[...]
> Are you using SGI UV? On that platform, NMIs may be delivered to
> all cpus because LVT1 of all cpus are not masked as follows:

This is Compute Blade 520XB1 from Hitachi with 240 cpus.

--
Michal Hocko
SUSE Labs

2015-07-30 08:07:10

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

Hi,

> -----Original Message-----
> From: Michal Hocko [mailto:[email protected]]
>
> On Thu 30-07-15 07:33:15, 河合英宏 / KAWAI，HIDEHIRO wrote:
> [...]
> > Are you using SGI UV? On that platform, NMIs may be delivered to
> > all cpus because LVT1 of all cpus are not masked as follows:
>
> This is Compute Blade 520XB1 from Hitachi with 240 cpus.

Thanks for the information!

I asked my colleague in other department about NMI button behavior
of our server just before receive your mail, and certainly what you
said happens; all cpus say "Uhhuh. NMI received for unknown reason 3d
on CPU N" when NMI button is pusshed. So, this was also our problem...

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2015-07-30 11:56:01

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

> From: Michal Hocko [mailto:[email protected]]
>
> On Thu 30-07-15 01:45:35, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > Hi,
> >
> > > From: Michal Hocko [mailto:[email protected]]
> > >
> > > On Wed 29-07-15 09:09:18, 河合英宏 / KAWAI，HIDEHIRO wrote:
> [...]
> > > > #define nmi_panic(fmt, ...) \
> > > > do { \
> > > > if (atomic_cmpxchg(&panic_cpu, -1, raw_smp_processor_id()) \
> > > > == -1) \
> > > > panic(fmt, ##__VA_ARGS__); \
> > > > } while (0)
> > >
> > > This would allow to return from NMI too eagerly.
> >
> > Yes, but what's the problem?
>
> I believe that panic should be noreturn as much as possible and return
> only when we do not have any other options.

In my case, external NMI is delivered to only CPU 0. This means
other CPUs continues to run until receiving NMI IPI. I think returning
from NMI handler soon doesn't differ from this so much.

> Moreover I would ask an
> opposite question, what is the problem to loop in NMI on other CPUs than
> the one which is performing crash_kexec? We will not save registers, so
> what?

Actually, we have to do at least save registers, clean up VMX/SVM states
and announce that the cpu has stopped. Just returning from NMI handler
is simpler.

> > The root cause of your case hasn't been clarified yet.
> > I can't fix for an unclear issue because I don't know what's the right
> > solution.
> >
> > > When I was testing my
> > > previous approach (on 3.0 based kernel) I had basically the same thing
> > > (one NMI to process panic) and others to return. This led to a strange
> > > behavior when the NMI button triggered NMI on all (hundreds) CPUs.
> >
> > It's strange. Usually, NMI caused by NMI button is routed to only CPU 0
> > as an external NMI. External NMI for CPUs other than CPU 0 are masked
> > at boot time. Does it really happen?
>
> Could you point me to the code which does that, please? Maybe we are
> missing that in our 3.0 kernel. I was quite surprised to see this
> behavior as well.

Please see the snippet below.

void setup_local_APIC(void)
{
...
/*
* only the BP should see the LINT1 NMI signal, obviously.
*/
if (!cpu)
value = APIC_DM_NMI;
else
value = APIC_DM_NMI | APIC_LVT_MASKED;
if (!lapic_is_integrated()) /* 82489DX */
value |= APIC_LVT_LEVEL_TRIGGER;
apic_write(APIC_LVT1, value);

LINT1 pins of cpus other than CPU 0 are masked here.
However, at least on some of Hitachi servers, NMI caused by NMI
button doesn't seem to be delivered through LINT1. So, my `external NMI'
word may not be correct.

> > Does the problem still happen on the latest kernel?
>
> I do not have machine accessible so I have to rely on the customer to
> test and the current vanilla might be an issue.

Sure.

> > What kind of NMI is deliverd to each CPU?
>
> See the log below.
>
> > Traditionally, we should have assumed that NMI for crash dumping is
> > delivered to only one cpu. Otherwise, we should often fail to take
> > a proper crash dump.
>
> You might still get a panic on hardlockup which will happen on all CPUs
> from the NMI context so we have to be able to handle panic in NMI on
> many CPUs.

Do you say about the case of a kerne panic while other cpus locks up
in NMI context? In that case, there is no way to do things needed by
kdump procedure including saving registeres...

> > It seems that your case is another problem to be solved separately.
>
> I do not think so, quite contrary. If you want to solve the reentrancy
> then other CPUs might be spinning in NMI if there is a guarantee that at
> least one CPU can progress to finish crash_kexec().
>
> > > The
> > > crash kernel booted eventually but the log contained lockups when a
> > > CPU waited for an IPI to the CPU which was handling the NMI panic.
> >
> > Could you explain more precisely?
>
> [ 167.843761] Uhhuh. NMI received for unknown reason 3d on CPU 130.
> [ 167.843763] Do you have a strange power saving mode enabled?
> [... Mangled output ....]
> [ 167.856415] Dazed and confused, but trying to continue
> [ 167.856428] Dazed and confused, but trying to continue
> [ 167.856442] Dazed and confused, but trying to continue
> [...]
> [ 193.108440] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
> [...]
> [ 193.108586] Call Trace:
> [ 193.108595] [<ffffffff8109baeb>] smp_call_function_single+0x15b/0x170
> [ 193.108600] [<ffffffff8109bb4e>] smp_call_function_any+0x4e/0x110
> [ 193.108607] [<ffffffffa04a332c>] get_cur_val+0xbc/0x130 [acpi_cpufreq]
> [ 193.108630] [<ffffffffa04a3417>] get_cur_freq_on_cpu+0x77/0xf0 [acpi_cpufreq]
> [ 193.108638] [<ffffffff8137bc37>] cpufreq_update_policy+0x97/0x140
> [ 193.108646] [<ffffffffa00ca04b>] acpi_processor_notify+0x4b/0x145 [processor]
> [ 193.108654] [<ffffffff812d2eca>] acpi_ev_notify_dispatch+0x61/0x77
> [ 193.108659] [<ffffffff812c1785>] acpi_os_execute_deferred+0x21/0x2c
> [ 193.108667] [<ffffffff8107d03c>] process_one_work+0x16c/0x350
> [ 193.108673] [<ffffffff8107fd6a>] worker_thread+0x17a/0x410
> [ 193.108679] [<ffffffff81084136>] kthread+0x96/0xa0
> [ 193.108688] [<ffffffff8146df64>] kernel_thread_helper+0x4/0x10
> [...]
> [ 221.068390] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
> [...]
> [ 227.991235] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 130} (detected by 56, t=15002 jiffies)
> [ 227.991247] sending NMI to all CPUs:
> [ 227.991251] NMI backtrace for cpu 0
> [ 229.074091] INFO: rcu_bh_state detected stalls on CPUs/tasks: { 130} (detected by 105, t=15013 jiffies)
> [ 0.000000] Initializing cgroup subsys cpuset
> [ 0.000000] Initializing cgroup subsys cpu
> [ 0.000000] Linux version 3.0.101-0.47.55.9.8853.0.TEST-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch
> revision 152973] (SUSE Linux) ) #1 SMP Thu May 28 08:25:11 UTC 2015 (dc083ee)
> [ 0.000000] Command line: root=/dev/system/lvroot resume=/dev/system/lvswap intel_idle.max_cstate=0
> processor.max_cstate=0 elevator=deadline nmi_watchdog=1 console=tty0 console=ttyS1,115200 elevator=deadline sysrq=yes
> reset_devices irqpoll maxcpus=1 disable_cpu_apicid=0 noefi acpi_rsdp=0xba7a4014 crashkernel=1024M-:512M memmap=exactmap
> memmap=576K@64K memmap=523684K@393216K elfcorehdr=916900K memmap=32768K#3018748K memmap=3736K#3051516K
> memmap=262144K$3145728K
>
> I can provide the full log but it is quite mangled. I guess the
> CPU130 was the only one allowed to proceed with the panic while others
> returned from the unknown NMI handling. It took a lot of time until
> CPU130 managed to boot the crash kernel with soft lockups and RCU stalls
> reports. CPU0 is most probably locked up waiting for CPU130 to
> acknowledge the IPI which will not happen apparently.

There is a timeout of 1000ms in nmi_shootdown_cpus(), so I don't know
why CPU 130 waits so long. I'll try to consider for a while.

> Maybe this is not possible in the current kernels for some reason but it
> tells me that returning from panic is quite fragile so I would like to
> prevent from it as much as possible.
>
> > > Anyway, I do not thing this is really necessary to solve the panic
> > > reentrancy issue.
> > > If the missing saved state is a real problem then it
> > > should be handled separately - maybe it can be achieved without an IPI
> > > and directly from the panic context if we are in NMI.
> >
> > What I would like to do via this patchse is to solve race issues
> > among NMI, panic() and crash_kexec().
>
> Yes I fully support you in this ;) I just believe that spinning in NMI
> vs. saving registers is a separate issue.

Ok, but I'm going to address it in this series because the issue is
caused by simultaneous panic and nmis.

>
> > So, I don't think we should fix that separately, although I would need
> > to reword some descriptions and titles.
>
> I can have them tested.

Thanks a lot!

Regards,
Kawai

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2015-07-30 12:27:53

by Michal Hocko

[permalink] [raw]

Subject: Re: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Thu 30-07-15 11:55:52, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > From: Michal Hocko [mailto:[email protected]]
[...]
> > Could you point me to the code which does that, please? Maybe we are
> > missing that in our 3.0 kernel. I was quite surprised to see this
> > behavior as well.
>
> Please see the snippet below.
>
> void setup_local_APIC(void)
> {
> ...
> /*
> * only the BP should see the LINT1 NMI signal, obviously.
> */
> if (!cpu)
> value = APIC_DM_NMI;
> else
> value = APIC_DM_NMI | APIC_LVT_MASKED;
> if (!lapic_is_integrated()) /* 82489DX */
> value |= APIC_LVT_LEVEL_TRIGGER;
> apic_write(APIC_LVT1, value);
>
>
> LINT1 pins of cpus other than CPU 0 are masked here.
> However, at least on some of Hitachi servers, NMI caused by NMI
> button doesn't seem to be delivered through LINT1. So, my `external NMI'
> word may not be correct.

I am not familiar with details here but I can tell you that this
particular code snippet is the same in our 3.0 based kernel so it seems
that the HW is indeed doing something differently.

> > You might still get a panic on hardlockup which will happen on all CPUs
> > from the NMI context so we have to be able to handle panic in NMI on
> > many CPUs.
>
> Do you say about the case of a kerne panic while other cpus locks up
> in NMI context? In that case, there is no way to do things needed by
> kdump procedure including saving registeres...

I am saying that watchdog_overflow_callback might trigger on more CPUs
and panic from NMI context as well. So this is not reduced to the NMI
button sends NMI to more CPUs.

Why cannot the panic() context save all the registers if we are going to
loop in NMI context? This would be imho preferable to returning from
panic IMO.

[...]
> > I can provide the full log but it is quite mangled. I guess the
> > CPU130 was the only one allowed to proceed with the panic while others
> > returned from the unknown NMI handling. It took a lot of time until
> > CPU130 managed to boot the crash kernel with soft lockups and RCU stalls
> > reports. CPU0 is most probably locked up waiting for CPU130 to
> > acknowledge the IPI which will not happen apparently.
>
> There is a timeout of 1000ms in nmi_shootdown_cpus(), so I don't know
> why CPU 130 waits so long. I'll try to consider for a while.

Yes, I do not understand the timing here either and the fact that the
log is a complete mess in the important parts doesn't help a wee bit.

[...]

--
Michal Hocko
SUSE Labs

2015-07-31 11:23:09

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

> From: Michal Hocko [mailto:[email protected]]
>
> On Thu 30-07-15 11:55:52, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > > From: Michal Hocko [mailto:[email protected]]
> [...]
> > > Could you point me to the code which does that, please? Maybe we are
> > > missing that in our 3.0 kernel. I was quite surprised to see this
> > > behavior as well.
> >
> > Please see the snippet below.
> >
> > void setup_local_APIC(void)
> > {
> > ...
> > /*
> > * only the BP should see the LINT1 NMI signal, obviously.
> > */
> > if (!cpu)
> > value = APIC_DM_NMI;
> > else
> > value = APIC_DM_NMI | APIC_LVT_MASKED;
> > if (!lapic_is_integrated()) /* 82489DX */
> > value |= APIC_LVT_LEVEL_TRIGGER;
> > apic_write(APIC_LVT1, value);
> >
> >
> > LINT1 pins of cpus other than CPU 0 are masked here.
> > However, at least on some of Hitachi servers, NMI caused by NMI
> > button doesn't seem to be delivered through LINT1. So, my `external NMI'
> > word may not be correct.
>
> I am not familiar with details here but I can tell you that this
> particular code snippet is the same in our 3.0 based kernel so it seems
> that the HW is indeed doing something differently.

Yes, and it turned out my PATCH 3/3 doesn't work at all on some
hardware...

> > > You might still get a panic on hardlockup which will happen on all CPUs
> > > from the NMI context so we have to be able to handle panic in NMI on
> > > many CPUs.
> >
> > Do you say about the case of a kerne panic while other cpus locks up
> > in NMI context? In that case, there is no way to do things needed by
> > kdump procedure including saving registeres...
>
> I am saying that watchdog_overflow_callback might trigger on more CPUs
> and panic from NMI context as well. So this is not reduced to the NMI
> button sends NMI to more CPUs.

I understand. So, I have to also modify watchdog_overflow_callback
to call nmi_panic().

> Why cannot the panic() context save all the registers if we are going to
> loop in NMI context? This would be imho preferable to returning from
> panic IMO.

I'm not saying we cannot save registers and do some cleanups in NMI
context. I fell that it would introduce unneeded complexity.
Since watchdog_overflow_callback is defined as generic code,
we need to implement the preparation for kdump for other architectures.
I haven't checked which architectures support both nmi watchdog and
kdump, though.

Anyway, I came up with a simple solution for x86. Waiting for the
timing of nmi_shootdown_cpus() in nmi_panic(), then invoke the
callback registered by nmi_shootdown_cpus().

> > > I can provide the full log but it is quite mangled. I guess the
> > > CPU130 was the only one allowed to proceed with the panic while others
> > > returned from the unknown NMI handling. It took a lot of time until
> > > CPU130 managed to boot the crash kernel with soft lockups and RCU stalls
> > > reports. CPU0 is most probably locked up waiting for CPU130 to
> > > acknowledge the IPI which will not happen apparently.
> >
> > There is a timeout of 1000ms in nmi_shootdown_cpus(), so I don't know
> > why CPU 130 waits so long. I'll try to consider for a while.
>
> Yes, I do not understand the timing here either and the fact that the
> log is a complete mess in the important parts doesn't help a wee bit.

I'm interested in where "kernel panic -not syncing: " is.
It may give us a clue.

Regards,
Kawai

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2015-08-04 08:56:59

by Michal Hocko

[permalink] [raw]

Subject: Re: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

On Fri 31-07-15 11:23:00, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > From: Michal Hocko [mailto:[email protected]]
[...]
> > I am saying that watchdog_overflow_callback might trigger on more CPUs
> > and panic from NMI context as well. So this is not reduced to the NMI
> > button sends NMI to more CPUs.
>
> I understand. So, I have to also modify watchdog_overflow_callback
> to call nmi_panic().

yes.

[...]
> > > There is a timeout of 1000ms in nmi_shootdown_cpus(), so I don't know
> > > why CPU 130 waits so long. I'll try to consider for a while.
> >
> > Yes, I do not understand the timing here either and the fact that the
> > log is a complete mess in the important parts doesn't help a wee bit.
>
> I'm interested in where "kernel panic -not syncing: " is.
> It may give us a clue.

This one is lost in the mangled text:
[ 167.843771] U<0>[ 167.843771] hhuh. NMI received for unkn<0><0>[ 167.843765] Uh[ 16NM843774I own rea reived for unknow<0 r 16n 2d 765] Uhhuh. CPU recei11. <0known reason 7. on770] Ker<[ - not rn NMI:nic - not contt sing

<0 >[ : Not con.inu437azed and confused, b] Dtryingaed annue

fu 167.8ut trying>[ to 7.<0377 167.843775] U<0>[ 167.843776] ]hhu.ived for u3nknown rMason 3 re oived for [nk167.843781] 1.
<. N0>[ 167.843781] Uh. NMI recen 3d on CPU 0.i< >[ nowon 3d on] Chhuh.MI
eceived[ or7.843nknoUhhuh.wn rMason e3d ceCPivUd 120.
<0nk>no 167.wn843ason 3na s p120.
o<0er savi d6 e843ab88] Do yeu have a
<trange0>[ er saving mode e nabl1d?7<4][ 167 84hu94]MIuh. NceIived for unknown reas vdfor 1no3was0>[ 2d 67.84380on CI rUe 12e.
ive7d8u3800wn rveaseo f2d on CPo3.r< u>k[o 1 rea6s.o2d8 oo you hn aPve <0st>a e power 1s7.843816] Do yoauv ng moade enbslra?ng[ e 167.8438p41o]er shhuhavi.ngIroenived fbled?nknow
< reaso0> 2d on [PU1626.41]0> Uh67.h. NM387I] receihed for .nknown reason 2Nn MC U ceived for .
[son 2d on CPU 6.
< 160>7.8467.84873] Uhhuh. 3MI received 908 o knstra
[ n167.843908] Do ygo pave westrangesa pvnv mode enableng mode ed?
n<b0ed?
--
Michal Hocko
SUSE Labs

2015-08-04 11:53:35

by Hidehiro Kawai

[permalink] [raw]

Subject: RE: Re: [V2 PATCH 1/3] x86/panic: Fix re-entrance problem due to panic on NMI

Hi,

> From: Michal Hocko [mailto:[email protected]]
> On Fri 31-07-15 11:23:00, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > > From: Michal Hocko [mailto:[email protected]]
> > > > There is a timeout of 1000ms in nmi_shootdown_cpus(), so I don't know
> > > > why CPU 130 waits so long. I'll try to consider for a while.
> > >
> > > Yes, I do not understand the timing here either and the fact that the
> > > log is a complete mess in the important parts doesn't help a wee bit.
> >
> > I'm interested in where "kernel panic -not syncing: " is.
> > It may give us a clue.
>
> This one is lost in the mangled text:
> [ 167.843771] U<0>[ 167.843771] hhuh. NMI received for unkn<0><0>[ 167.843765] Uh[ 16NM843774I own rea reived for
> unknow<0 r 16n 2d 765] Uhhuh. CPU recei11. <0known reason 7. on770] Ker<[ - not rn NMI:nic - not contt sing
>
> <0 >[ : Not con.inu437azed and confused, b] Dtryingaed annue
>
> fu 167.8ut trying>[ to 7.<0377 167.843775] U<0>[ 167.843776] ]hhu.ived for u3nknown rMason 3 re oived for [nk167.843781]

Thanks for the information.

I anticipated that some lock contention on issuing messages (e.g.
locks used by network/netconsole driver) delayed the panic procedure,
but it seems not to be related because the panic message finished to
be issued early.

If I come up with something, I will post a mail. I think there
may be potential issues.

> 1.
> <. N0>[ 167.843781] Uh. NMI recen 3d on CPU 0.i< >[ nowon 3d on] Chhuh.MI
> eceived[ or7.843nknoUhhuh.wn rMason e3d ceCPivUd 120.
> <0nk>no 167.wn843ason 3na s p120.
> o<0er savi d6 e843ab88] Do yeu have a
> <trange0>[ er saving mode e nabl1d?7<4][ 167 84hu94]MIuh. NceIived for unknown reas vdfor 1no3was0>[ 2d 67.84380on CI
> rUe 12e.
> ive7d8u3800wn rveaseo f2d on CPo3.r< u>k[o 1 rea6s.o2d8 oo you hn aPve <0st>a e power 1s7.843816] Do yoauv ng moade
> enbslra?ng[ e 167.8438p41o]er shhuhavi.ngIroenived fbled?nknow
> < reaso0> 2d on [PU1626.41]0> Uh67.h. NM387I] receihed for .nknown reason 2Nn MC U ceived for .
> [son 2d on CPU 6.
> < 160>7.8467.84873] Uhhuh. 3MI received 908 o knstra
> [ n167.843908] Do ygo pave westrangesa pvnv mode enableng mode ed?
> n<b0ed?

Regards,
Kawai

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?