2008-02-06 19:28:27

by Neil Horman

Subject: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Hey all-
A hang on kdump was reported to me a while back, but only when systems died
via an NMI watchdog panic. The hang wouldn't always be in the same place, but
it would usually be somewhere down in purgatory. In looking at the code, it
occurred to me that, since we can't handle additional interrupts while
servicing an NMI, we won't be able to halt the other processors on the system
the way we try to in machine_crash_shutdown. As such, it appears that leaving
the other cpus running exposes us to the risk that another processor will
encounter an error and halt the system while we are trying to boot the kdump
kernel, which can result in a hang. I wrote the attached patch to end the NMI
interrupt prior to calling crash_kexec from within die_nmi, and testing here
has proven successful.

Regards
Neil

Signed-off-by: Neil Horman <[email protected]>


traps_32.c | 2 ++
traps_64.c | 5 ++++-
2 files changed, 6 insertions(+), 1 deletion(-)


diff --git a/arch/x86/kernel/traps_32.c b/arch/x86/kernel/traps_32.c
index 3cf7297..4443e8e 100644
--- a/arch/x86/kernel/traps_32.c
+++ b/arch/x86/kernel/traps_32.c
@@ -751,6 +751,8 @@ void __kprobes die_nmi(struct pt_regs *regs, const char *msg)
* and might aswell get out now while we still can.
*/
if (!user_mode_vm(regs)) {
+ nmi_exit();
+ local_irq_enable();
current->thread.trap_no = 2;
crash_kexec(regs);
}
diff --git a/arch/x86/kernel/traps_64.c b/arch/x86/kernel/traps_64.c
index efc66df..0a50e70 100644
--- a/arch/x86/kernel/traps_64.c
+++ b/arch/x86/kernel/traps_64.c
@@ -608,8 +608,11 @@ void __kprobes die_nmi(char *str, struct pt_regs *regs, int do_panic)
*/
printk(str, smp_processor_id());
show_registers(regs);
- if (kexec_should_crash(current))
+ if (kexec_should_crash(current)) {
+ nmi_exit();
+ local_irq_enable();
crash_kexec(regs);
+ }
if (do_panic || panic_on_oops)
panic("Non maskable interrupt");
oops_end(flags, NULL, SIGBUS);
--
/***************************************************
*Neil Horman
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/


2008-02-06 19:41:58

by Vivek Goyal

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Wed, Feb 06, 2008 at 02:25:55PM -0500, Neil Horman wrote:
> Hey all-
> A hang on kdump was reported to me a while back, but only when systems died
> via an NMI watchdog panic. The hang wouldn't always be in the same place, but
> it would usually be somewhere down in purgatory. In looking at the code, it
> occurred to me that, since we can't handle additional interrupts while
> servicing an NMI, we won't be able to halt the other processors on the system
> the way we try to in machine_crash_shutdown. As such, it appears that leaving
> the other cpus running exposes us to the risk that another processor will
> encounter an error and halt the system while we are trying to boot the kdump
> kernel, which can result in a hang. I wrote the attached patch to end the NMI
> interrupt prior to calling crash_kexec from within die_nmi, and testing here
> has proven successful.
>

Hi Neil,

Why wouldn't I be able to stop other cpus if I am inside an NMI handler? I
just need to send an NMI IPI to other cpus and they should be able to
receive and handle it?

Thanks
Vivek

2008-02-06 20:13:45

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Wed, Feb 06, 2008 at 02:40:40PM -0500, Vivek Goyal wrote:
> On Wed, Feb 06, 2008 at 02:25:55PM -0500, Neil Horman wrote:
> > Hey all-
> > A hang on kdump was reported to me a while back, but only when systems died
> > via an NMI watchdog panic. The hang wouldn't always be in the same place,
> > but it would usually be somewhere down in purgatory. In looking at the code,
> > it occurred to me that, since we can't handle additional interrupts while
> > servicing an NMI, we won't be able to halt the other processors on the
> > system the way we try to in machine_crash_shutdown. As such, it appears that
> > leaving the other cpus running exposes us to the risk that another processor
> > will encounter an error and halt the system while we are trying to boot the
> > kdump kernel, which can result in a hang. I wrote the attached patch to end
> > the NMI interrupt prior to calling crash_kexec from within die_nmi, and
> > testing here has proven successful.
> >
>
> Hi Neil,
>
> Why wouldn't I be able to stop other cpus if I am inside an NMI handler? I
> just need to send an NMI IPI to other cpus and they should be able to
> receive and handle it?
>
> Thanks
> Vivek
>
Can an APIC accept an NMI while already handling an NMI? I didn't think they
would interrupt one another, but rather pend until such time as the previous
NMI was cleared.
Neil


--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2008-02-06 20:27:16

by H. Peter Anvin

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Neil Horman wrote:
> Can an APIC accept an NMI while already handling an NMI? I didn't think they
> would interrupt one another, but rather pend until such time as the previous
> NMI was cleared.

The CPU certainly won't (there is a hidden flag that's cleared on IRET
which disables NMI; it's possible to re-enable NMI by executing a dummy
IRET inside the NMI handler.)

-hpa
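
[A minimal sketch of the dummy-IRET trick hpa describes, added for
illustration and not code from the thread. It assumes 32-bit: there, iret
pops EIP, CS and EFLAGS, so the dummy frame needs all three; on 64-bit,
iretq additionally pops RSP and SS.]

	/*
	 * Re-enable NMI delivery from inside an NMI handler by executing a
	 * dummy iret: build an EIP/CS/EFLAGS frame that simply resumes at
	 * the next instruction. Interrupts stay off, because we push the
	 * current (irqs-off) EFLAGS.
	 */
	static inline void reenable_nmi(void)
	{
		asm volatile("pushf\n\t"	/* EFLAGS */
			     "push %%cs\n\t"	/* CS */
			     "push $1f\n\t"	/* EIP: the label below */
			     "iret\n"		/* CPU drops its NMI-blocked state */
			     "1:"
			     : : : "memory", "cc");
	}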

2008-02-06 20:37:16

by Vivek Goyal

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Wed, Feb 06, 2008 at 03:12:23PM -0500, Neil Horman wrote:
> On Wed, Feb 06, 2008 at 02:40:40PM -0500, Vivek Goyal wrote:
> > On Wed, Feb 06, 2008 at 02:25:55PM -0500, Neil Horman wrote:
> > > Hey all-
> > > A hang on kdump was reported to me a while back, but only when systems
> > > died via an NMI watchdog panic. The hang wouldn't always be in the same
> > > place, but it would usually be somewhere down in purgatory. In looking at
> > > the code, it occurred to me that, since we can't handle additional
> > > interrupts while servicing an NMI, we won't be able to halt the other
> > > processors on the system the way we try to in machine_crash_shutdown. As
> > > such, it appears that leaving the other cpus running exposes us to the
> > > risk that another processor will encounter an error and halt the system
> > > while we are trying to boot the kdump kernel, which can result in a hang.
> > > I wrote the attached patch to end the NMI interrupt prior to calling
> > > crash_kexec from within die_nmi, and testing here has proven successful.
> > >
> >
> > Hi Neil,
> >
> > Why wouldn't I be able to stop other cpus if I am inside an NMI handler? I
> > just need to send an NMI IPI to other cpus and they should be able to
> > receive and handle it?
> >
> > Thanks
> > Vivek
> >
> Can an APIC accept an NMI while already handling an NMI? I didn't think they
> would interrupt one another, but rather pend until such time as the previous
> NMI was cleared.
> Neil

Hi Neil,

The CPU will not accept another NMI if it is already handling one, but in
this case only the crashing CPU is in an NMI. The other cpus are not, and
they will accept the NMI when the crashing cpu sends the NMI IPI to stop them.

So I am not able to understand how clearing the NMI on the crashing cpu is
helpful.

Thanks
Vivek

2008-02-06 21:05:35

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Wed, Feb 06, 2008 at 12:21:30PM -0800, H. Peter Anvin wrote:
> Neil Horman wrote:
> >Can an APIC accept an NMI while already handling an NMI? I didn't think they
> >would interrupt one another, but rather pend until such time as the previous
> >NMI was cleared.
>
> The CPU certainly won't (there is a hidden flag that's cleared on IRET
> which disables NMI; it's possible to re-enable NMI by executing a dummy
> IRET inside the NMI handler.)
>
> -hpa

So that would be another possible solution, I expect. Would you rather I
investigate that, or is the proposed solution more reasonable?

Neil


--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2008-02-06 22:00:58

by Ingo Molnar

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path


* Neil Horman <[email protected]> wrote:

> if (!user_mode_vm(regs)) {
> + nmi_exit();
> + local_irq_enable();
> current->thread.trap_no = 2;
> crash_kexec(regs);

looks good to me, but please move the local_irq_enable() to within
crash_kexec() instead - probably inside the "got the kexec lock"
section. That makes crash_kexec() use generally safer too, i guess: right
now it seems that die() too can call crash_kexec() with irqs disabled - and
can thus hang in smp_send_stop() [or wherever it hung before].

Ingo
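
[For context, a sketch of where Ingo's suggestion would land, patterned on
the crash_kexec() of that era in kernel/kexec.c - the xchg-based kexec_lock
is real, but the local_irq_enable() placement shown is only his proposal,
and the details are from memory:]

	static int kexec_lock;

	void crash_kexec(struct pt_regs *regs)
	{
		int locked;

		/* xchg-based trylock: only one cpu may run the crash path */
		locked = xchg(&kexec_lock, 1);
		if (!locked) {
			if (kexec_crash_image) {
				struct pt_regs fixed_regs;

				/* Ingo's suggestion: enable irqs only here,
				 * inside the "got the kexec lock" section. */
				local_irq_enable();

				crash_setup_regs(&fixed_regs, regs);
				machine_crash_shutdown(&fixed_regs);
				machine_kexec(kexec_crash_image);
			}
			locked = xchg(&kexec_lock, 0);
			BUG_ON(!locked);
		}
	}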

2008-02-06 22:49:22

by Vivek Goyal

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Wed, Feb 06, 2008 at 11:00:01PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > if (!user_mode_vm(regs)) {
> > + nmi_exit();
> > + local_irq_enable();
> > current->thread.trap_no = 2;
> > crash_kexec(regs);
>
> looks good to me, but please move the local_irq_enable() to within
> crash_kexec() instead - probably inside the "got the kexec lock"
> section. That makes crash_kexec() use generally safer too, i guess: right
> now it seems that die() too can call crash_kexec() with irqs disabled - and
> can thus hang in smp_send_stop() [or wherever it hung before].
>

In general, I think we should not be servicing interrupts once the system
has crashed and crash_kexec() has been invoked.

In fact, right now machine_crash_shutdown() explicitly disables interrupts
before sending NMIs to the other cpus to stop them, and that makes sense to
me.

I am also wondering: if interrupts are disabled on the crashing cpu, or if
the crashing cpu is inside die_nmi(), how would that stop or prevent delivery
of the NMI IPI to the other cpus?

Am I missing something obvious?

Thanks
Vivek
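
[The shutdown sequence Vivek refers to, roughly as arch/x86 implemented it
at the time - a sketch from memory; treat the helper names as approximate:]

	void machine_crash_shutdown(struct pt_regs *regs)
	{
		/* The kernel is broken, so be paranoid: irqs off first. */
		local_irq_disable();

		/* NMI the other cpus so they stop where they are. */
		nmi_shootdown_cpus();

		lapic_shutdown();
	#ifdef CONFIG_X86_IO_APIC
		disable_IO_APIC();
	#endif
		crash_save_cpu(regs, safe_smp_processor_id());
	}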

2008-02-06 22:54:57

by Ingo Molnar

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path


* Vivek Goyal <[email protected]> wrote:

> On Wed, Feb 06, 2008 at 11:00:01PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman <[email protected]> wrote:
> >
> > > if (!user_mode_vm(regs)) {
> > > + nmi_exit();
> > > + local_irq_enable();
> > > current->thread.trap_no = 2;
> > > crash_kexec(regs);
> >
> > looks good to me, but please move the local_irq_enable() to within
> > crash_kexec() instead - probably inside the "got the kexec lock"
> > section. That makes crash_kexec() use generally safer too, i guess: right
> > now it seems that die() too can call crash_kexec() with irqs disabled - and
> > can thus hang in smp_send_stop() [or wherever it hung before].
> >
>
> In general, I think we should not be servicing interrupts once the
> system has crashed and crash_kexec() has been invoked.
>
> In fact, right now machine_crash_shutdown() explicitly disables
> interrupts before sending NMIs to the other cpus to stop them, and
> that makes sense to me.
>
> I am also wondering: if interrupts are disabled on the crashing cpu,
> or if the crashing cpu is inside die_nmi(), how would that stop or
> prevent delivery of the NMI IPI to the other cpus?
>
> Am I missing something obvious?

i wondered about that too. kexec should be as atomic as it can be -
enabling interrupts only opens up a window for another crash (more
memory corruption, etc. etc.) to happen.

Ingo

2008-02-06 23:01:34

by H. Peter Anvin

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Vivek Goyal wrote:
>
> I am also wondering: if interrupts are disabled on the crashing cpu, or if
> the crashing cpu is inside die_nmi(), how would that stop or prevent
> delivery of the NMI IPI to the other cpus?
>

I don't see how it would.

-hpa

2008-02-06 23:37:51

by Ingo Molnar

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path


* H. Peter Anvin <[email protected]> wrote:

>> I am also wondering: if interrupts are disabled on the crashing cpu, or
>> if the crashing cpu is inside die_nmi(), how would that stop or prevent
>> delivery of the NMI IPI to the other cpus?
>
> I don't see how it would.

cross-CPU IPIs are a bit fragile on some PC platforms. So if the kexec
code relies on getting IPIs to all other CPUs, it might not be able to
do it reliably. There might be limits on how many APIC irqs can be
queued at a time, and if those slots are used up and the CPU is not
servicing irqs then stuff gets retried. This might even affect NMIs
sent via APIC messages - not sure about that.

Ingo

2008-02-06 23:52:22

by Vivek Goyal

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Thu, Feb 07, 2008 at 12:36:57AM +0100, Ingo Molnar wrote:
>
> * H. Peter Anvin <[email protected]> wrote:
>
> >> I am also wondering: if interrupts are disabled on the crashing cpu, or
> >> if the crashing cpu is inside die_nmi(), how would that stop or prevent
> >> delivery of the NMI IPI to the other cpus?
> >
> > I don't see how it would.
>
> cross-CPU IPIs are a bit fragile on some PC platforms. So if the kexec
> code relies on getting IPIs to all other CPUs, it might not be able to
> do it reliably. There might be limits on how many APIC irqs can be
> queued at a time, and if those slots are used up and the CPU is not
> servicing irqs then stuff gets retried. This might even affect NMIs
> sent via APIC messages - not sure about that.

- Kexec code does not wait indefinitely for the destination cpu to respond to
the NMI. If the destination cpu does not respond within a certain amount of
time, execution continues. So even if the NMI was not delivered to the
destination cpu, the kexec code should have continued. (Dangerous, though, as
we don't know what the other cpu will be doing in the meantime.)

- Even if there is a limitation on how many interrupts can be queued up
(including NMIs), I am not sure how this patch would help that situation.
This patch is not doing anything on the destination cpu (assuming the
destination cpu is not also executing die_nmi() at the same time).

In fact, even if other cpus are servicing die_nmi() they will end up
spinning on kexec_lock inside crash_kexec().

I think there is more to this problem than just the EOI stuff.

Thanks
Vivek

2008-02-07 00:48:09

by Eric W. Biederman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Ingo Molnar <[email protected]> writes:

> * H. Peter Anvin <[email protected]> wrote:
>
>>> I am also wondering: if interrupts are disabled on the crashing cpu, or
>>> if the crashing cpu is inside die_nmi(), how would that stop or prevent
>>> delivery of the NMI IPI to the other cpus?
>>
>> I don't see how it would.
>
> cross-CPU IPIs are a bit fragile on some PC platforms. So if the kexec
> code relies on getting IPIs to all other CPUs, it might not be able to
> do it reliably. There might be limits on how many APIC irqs can be
> queued at a time, and if those slots are used up and the CPU is not
> servicing irqs then stuff gets retried. This might even affect NMIs
> sent via APIC messages - not sure about that.



The design was as follows:
- Doing anything in the crashing kernel is unreliable.
- We do not have the information to do anything useful in the recovery/target
kernel.
- Having the other cpus stopped is very nice as it reduces the amount of
weirdness happening. We do not share the same text or data addresses
so stopping the other cpus is not mandatory. On some other architectures
there are cpu tables that must live at a fixed address but this is not
the case on x86.
- Having the location the other cpus were running at is potentially very
interesting debugging information.

Therefore the intent of the code is to send an NMI to each other cpu, with
a timeout of a second or so, so that if the NMIs do not get sent we continue
on.

There is certainly still room for improving the robustness by not shutting
down the ioapics and using less general infrastructure code on that path.
That said I would be a little surprised if that is what is biting us.

Looking at the patch, the local_irq_enable() is totally bogus. As soon
as we hit machine_crash_shutdown, the first thing we do is disable irqs.

I'm wondering if someone was using the switch-cpus-on-crash patch that was
floating around. That would require the IPIs to work.

I don't know if nmi_exit makes sense. There are enough layers of abstraction
in that piece of code that I can't quickly spot the part that is banging the
hardware.

The location of nmi_exit in the patch is clearly wrong: crash_kexec is a noop
if we don't have a crash kernel loaded (and also if we are not the first cpu
into it), so if we don't execute the crash code something weird may happen.
Further, the code is just more maintainable if that kind of code lives in
machine_crash_shutdown.



Eric
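
[The send-an-NMI-with-a-timeout pattern Eric describes looked roughly like
this in the x86 crash path - a sketch; the atomic counter and the ~1000 x
mdelay(1) wait match the code of that era, but treat the names as
approximate:]

	static atomic_t waiting_for_crash_ipi;

	static void nmi_shootdown_cpus(void)
	{
		unsigned long msecs = 1000;	/* "a timeout of a second or so" */

		atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
		smp_send_nmi_allbutself();

		/* Each cpu's NMI handler decrements the counter and halts;
		 * if one never answers, give up and continue anyway. */
		while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
			mdelay(1);
			msecs--;
		}
	}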

2008-02-07 00:51:13

by Ingo Molnar

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path


* Eric W. Biederman <[email protected]> wrote:

> Looking at the patch, the local_irq_enable() is totally bogus. As soon
> as we hit machine_crash_shutdown, the first thing we do is disable
> irqs.

yeah.

> I'm wondering if someone was using the switch-cpus-on-crash patch that
> was floating around. That would require the IPIs to work.
>
> I don't know if nmi_exit makes sense. There are enough layers of
> abstraction in that piece of code that I can't quickly spot the part
> that is banging the hardware.
>
> The location of nmi_exit in the patch is clearly wrong: crash_kexec
> is a noop if we don't have a crash kernel loaded (and also if we are not
> the first cpu into it), so if we don't execute the crash code
> something weird may happen. Further, the code is just more
> maintainable if that kind of code lives in machine_crash_shutdown.

nmi_exit() has no hw effects - it's just our own bookkeeping.

the hw knows that we finished the NMI when we do an iret. Perhaps that's
the bug or side-effect that made the difference: via enabling irqs we
get an irq entry, and that does an iret and clears the NMI nested state
- allowing the kexec context to proceed? I suspect kexec() will do an
iret eventually (at minimum in the booted-up kernel's context) - all
NMIs are blocked up to that point, and maybe the APIC doesn't really like
being frobbed in that state? In any case, the local_irq_enable() is just
wrong - it's the worst thing a crashing kernel can do. Perhaps doing an
intentional iret with a prepared stack-let that just restores to
still-irqs-off state and jumps to the next instruction could 'exit' the
NMI context without really having to exit it in the kernel code flow?

Ingo

2008-02-07 01:35:18

by Eric W. Biederman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Ingo Molnar <[email protected]> writes:

> * Eric W. Biederman <[email protected]> wrote:
>
>> Looking at the patch, the local_irq_enable() is totally bogus. As soon
>> as we hit machine_crash_shutdown, the first thing we do is disable
>> irqs.
>
> yeah.
>
>> I'm wondering if someone was using the switch-cpus-on-crash patch that
>> was floating around. That would require the IPIs to work.
>>
>> I don't know if nmi_exit makes sense. There are enough layers of
>> abstraction in that piece of code that I can't quickly spot the part
>> that is banging the hardware.
>>
>> The location of nmi_exit in the patch is clearly wrong: crash_kexec
>> is a noop if we don't have a crash kernel loaded (and also if we are not
>> the first cpu into it), so if we don't execute the crash code
>> something weird may happen. Further, the code is just more
>> maintainable if that kind of code lives in machine_crash_shutdown.
>
> nmi_exit() has no hw effects - it's just our own bookkeeping.

Alright so that should have no hardware effect then.

> the hw knows that we finished the NMI when we do an iret. Perhaps that's
> the bug or side-effect that made the difference: via enabling irqs we
> get an irq entry, and that does an iret and clears the NMI nested state
> - allowing the kexec context to proceed? I suspect kexec() will do an
> iret eventually (at minimum in the booted-up kernel's context) - all
> NMIs are blocked up to that point, and maybe the APIC doesn't really like
> being frobbed in that state? In any case, the local_irq_enable() is just
> wrong - it's the worst thing a crashing kernel can do. Perhaps doing an
> intentional iret with a prepared stack-let that just restores to
> still-irqs-off state and jumps to the next instruction could 'exit' the
> NMI context without really having to exit it in the kernel code flow?

Well, I think this is slightly on the wrong track. The original patch
description said we get as far as purgatory. Purgatory is the bit of C code
from /sbin/kexec that runs just before our second kernel. It sets up
arguments and verifies that the target kernel has a valid sha256sum. If
purgatory detects data corruption it spins, to prevent a corrupt
recovery kernel from doing something nasty.
It appears that the primary crash_kexec path is working fine.

The original description speculated that we had non-stopped cpus that
were telling the hardware to shut off.

I don't see what the hang is. However, the goal apparently is to make
the kexec-on-panic path more robust so that we can take crash dumps in
more strange cases.

We can get NMIs from the nmi watchdog, so it is possible this happens
on legitimate hardware, so there is a chance this is deterministic and
that we can get enough information to debug and fix the original
error.

If part of the problem is getting to crash_kexec, my inclination is to
move the call to crash_kexec up as early as possible in die_nmi, as
we may simply be hanging in printk or something stupid like that.

It is weird that only the 32-bit die_nmi path calls bust_spinlocks.

I'm not really happy with the secondary cpus taking the whole notify_die
path, as that is more general-purpose infrastructure that might go
bad. However, it doesn't appear broken, and it should not be critical
to the crash dump process.

Eric
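
[For readers unfamiliar with it, the purgatory entry Eric describes is
essentially this, patterned on kexec-tools' purgatory/purgatory.c - the
verify-or-spin behavior is as he states; treat the helper names as
approximate:]

	void purgatory(void)
	{
		printf("I'm in purgatory\n");	/* visible on the serial console */
		setup_arch();
		if (verify_sha256_digest()) {
			/* The target kernel image is corrupt: spin forever
			 * rather than run something nasty. */
			for (;;)
				;
		}
		post_verification_setup_arch();
	}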

2008-02-07 12:19:19

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Wed, Feb 06, 2008 at 05:31:11PM -0700, Eric W. Biederman wrote:
> Ingo Molnar <[email protected]> writes:
>
> > * H. Peter Anvin <[email protected]> wrote:
> >
> >>> I am also wondering: if interrupts are disabled on the crashing cpu, or
> >>> if the crashing cpu is inside die_nmi(), how would that stop or prevent
> >>> delivery of the NMI IPI to the other cpus?
> >>
> >> I don't see how it would.
> >
> > cross-CPU IPIs are a bit fragile on some PC platforms. So if the kexec
> > code relies on getting IPIs to all other CPUs, it might not be able to
> > do it reliably. There might be limits on how many APIC irqs can be
> > queued at a time, and if those slots are used up and the CPU is not
> > servicing irqs then stuff gets retried. This might even affect NMIs
> > sent via APIC messages - not sure about that.
>
>
>
> The design was as follows:
> - Doing anything in the crashing kernel is unreliable.
> - We do not have the information to do anything useful in the recovery/target
> kernel.
> - Having the other cpus stopped is very nice as it reduces the amount of
> weirdness happening. We do not share the same text or data addresses
> so stopping the other cpus is not mandatory. On some other architectures
> there are cpu tables that must live at a fixed address but this is not
> the case on x86.
> - Having the location the other cpus were running at is potentially very
> interesting debugging information.
>
> Therefore the intent of the code is to send an NMI to each other cpu, with
> a timeout of a second or so, so that if the NMIs do not get sent we
> continue on.
>
> There is certainly still room for improving the robustness by not shutting
> down the ioapics and using less general infrastructure code on that path.
> That said I would be a little surprised if that is what is biting us.
>
> Looking at the patch, the local_irq_enable() is totally bogus. As soon
> as we hit machine_crash_shutdown, the first thing we do is disable irqs.
>

Ingo noted a few posts down that nmi_exit doesn't actually write to the APIC
EOI register, so yeah, I agree, it's bogus (and I apologize, I should have
checked that more carefully). Nevertheless, this patch consistently allowed a
hanging machine to boot through an NMI lockup. So I'm forced to wonder what's
going on, then, that this patch helps with. Perhaps it's just a very fragile
timing issue; I'll need to look more closely.

> I'm wondering if someone was using the switch cpus on crash patch that was
> floating around. That would require the ipis to work.
>
Definitely not the case; I did a clean build from a cvs tree to test this and
can verify that the switch-cpu patch was not in place.

> I don't know if nmi_exit makes sense. There are enough layers of abstraction
> in that piece of code I can't quickly spot the part that is banging the hardware.
>
As Ingo mentioned, this does seem to be bogus.

> The location of nmi_exit in the patch is clearly wrong: crash_kexec is a noop
> if we don't have a crash kernel loaded (and also if we are not the first cpu
> into it), so if we don't execute the crash code something weird may happen.
> Further, the code is just more maintainable if that kind of code lives in
> machine_crash_shutdown.
>
>
>
> Eric

--
/****************************************************
* Neil Horman <[email protected]>
* Software Engineer, Red Hat
****************************************************/

2008-02-07 12:25:04

by Ingo Molnar

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path


* Neil Horman <[email protected]> wrote:

> Ingo noted a few posts down that nmi_exit doesn't actually write to the
> APIC EOI register, so yeah, I agree, it's bogus (and I apologize, I
> should have checked that more carefully). Nevertheless, this patch
> consistently allowed a hanging machine to boot through an NMI lockup.
> So I'm forced to wonder what's going on, then, that this patch helps
> with. Perhaps it's just a very fragile timing issue; I'll need to
> look more closely.

try a dummy iret, something like:

asm volatile ("pushf; push $1f; iret; 1: \n");

to get the CPU out of its 'nested NMI' state. (totally untested)

the idea is to push down an iret frame to the kernel stack that will
just jump to the next instruction and gets it out of the NMI nesting.
Note: interrupts will/must still be disabled, despite the iret. (the
ordering of the pushes might be wrong, we might need more than that for
a valid iret, etc. etc.)

Ingo
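
[For a well-formed 32-bit frame, iret also pops CS between EIP and EFLAGS,
so the snippet above likely needs a %cs push as well - Ingo's "we might need
more than that" caveat covers exactly this. A hedged completion:

	asm volatile("pushf; push %%cs; push $1f; iret; 1:" ::: "memory");

The exact frame Neil eventually used is not shown in the thread.]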

2008-02-07 20:40:33

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Thu, Feb 07, 2008 at 01:24:04PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Ingo noted a few posts down that nmi_exit doesn't actually write to the
> > APIC EOI register, so yeah, I agree, it's bogus (and I apologize, I
> > should have checked that more carefully). Nevertheless, this patch
> > consistently allowed a hanging machine to boot through an NMI lockup.
> > So I'm forced to wonder what's going on, then, that this patch helps
> > with. Perhaps it's just a very fragile timing issue; I'll need to
> > look more closely.
>
> try a dummy iret, something like:
>
> asm volatile ("pushf; push $1f; iret; 1: \n");
>
> to get the CPU out of its 'nested NMI' state. (totally untested)
>
> the idea is to push down an iret frame to the kernel stack that will
> just jump to the next instruction and get it out of the NMI nesting.
> Note: interrupts will/must still be disabled, despite the iret. (the
> ordering of the pushes might be wrong, we might need more than that for
> a valid iret, etc. etc.)
>
> Ingo

Will do. I'll report results as soon as I have them. Thanks!
Neil

--
/****************************************************
* Neil Horman <[email protected]>
* Software Engineer, Red Hat
****************************************************/

2008-02-08 16:16:17

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Thu, Feb 07, 2008 at 01:24:04PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Ingo noted a few posts down that nmi_exit doesn't actually write to the
> > APIC EOI register, so yeah, I agree, it's bogus (and I apologize, I
> > should have checked that more carefully). Nevertheless, this patch
> > consistently allowed a hanging machine to boot through an NMI lockup.
> > So I'm forced to wonder what's going on, then, that this patch helps
> > with. Perhaps it's just a very fragile timing issue; I'll need to
> > look more closely.
>
> try a dummy iret, something like:
>
> asm volatile ("pushf; push $1f; iret; 1: \n");
>
> to get the CPU out of its 'nested NMI' state. (totally untested)
>
> the idea is to push down an iret frame to the kernel stack that will
> just jump to the next instruction and get it out of the NMI nesting.
> Note: interrupts will/must still be disabled, despite the iret. (the
> ordering of the pushes might be wrong, we might need more than that for
> a valid iret, etc. etc.)
>
> Ingo

Just tried this experiment and it met with success. Executing a dummy iret
instruction got us to boot the kdump kernel successfully.

Thoughts on how we should handle this from here?


Regards
Neil

--
/****************************************************
* Neil Horman <[email protected]>
* Software Engineer, Red Hat
****************************************************/

2008-02-08 16:48:19

by Vivek Goyal

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Fri, Feb 08, 2008 at 11:14:22AM -0500, Neil Horman wrote:
> On Thu, Feb 07, 2008 at 01:24:04PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman <[email protected]> wrote:
> >
> > > Ingo noted a few posts down that nmi_exit doesn't actually write to the
> > > APIC EOI register, so yeah, I agree, it's bogus (and I apologize, I
> > > should have checked that more carefully). Nevertheless, this patch
> > > consistently allowed a hanging machine to boot through an NMI lockup.
> > > So I'm forced to wonder what's going on, then, that this patch helps
> > > with. Perhaps it's just a very fragile timing issue; I'll need to
> > > look more closely.
> >
> > try a dummy iret, something like:
> >
> > asm volatile ("pushf; push $1f; iret; 1: \n");
> >
> > to get the CPU out of its 'nested NMI' state. (totally untested)
> >
> > the idea is to push down an iret frame to the kernel stack that will
> > just jump to the next instruction and get it out of the NMI nesting.
> > Note: interrupts will/must still be disabled, despite the iret. (the
> > ordering of the pushes might be wrong, we might need more than that for
> > a valid iret, etc. etc.)
> >
> > Ingo
>
> Just tried this experiment and it met with success. Executing a dummy iret
> instruction got us to boot the kdump kernel successfully.
>

Interesting. So that means there is some operation we can't perform while
in an NMI handler (or in nested NMIs; I don't know if this is a nested-NMI
case).

Even if we initiated the crash dump in an NMI handler, the next kernel
should unlock that state as soon as interrupts are enabled in the next
kernel (an iret will be executed).

So the only issue will be if we need explicit logic to unlock the NMI
earlier (either in the crashing kernel after clearing the IDT, or in the
purgatory code). Anything earlier than that will be dangerous, though: we
would be handling another NMI while we have already crashed and are doing
the final preparations to jump to the new kernel.

Neil, is it possible to do some serial console debugging to find out where
exactly we are hanging? Beats me what operation cannot be executed while in
an NMI handler and makes the system hang. I am also curious to know whether
this is a nested-NMI case.

Thanks
Vivek

2008-02-08 17:23:36

by Andi Kleen

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path



Ingo Molnar <[email protected]> writes:
>
> try a dummy iret, something like:
>
> asm volatile ("pushf; push $1f; iret; 1: \n");
>
> to get the CPU out of its 'nested NMI' state. (totally untested)

Just note: if you do this while running on the NMI stack (and I think
you do, if you insert it at the same place as Neil's original patch)
and another NMI happens for some reason, you will be dead,
because your stack will be corrupted.

If you really want nested NMIs you would need to fix up the NMI
stacks first by pointing them below your own stack.

-Andi

2008-02-08 17:29:26

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Fri, Feb 08, 2008 at 11:45:44AM -0500, Vivek Goyal wrote:
> On Fri, Feb 08, 2008 at 11:14:22AM -0500, Neil Horman wrote:
> > On Thu, Feb 07, 2008 at 01:24:04PM +0100, Ingo Molnar wrote:
> > >
> > > * Neil Horman <[email protected]> wrote:
> > >
> > > > Ingo noted a few posts down that nmi_exit doesn't actually write to the
> > > > APIC EOI register, so yeah, I agree, it's bogus (and I apologize, I
> > > > should have checked that more carefully). Nevertheless, this patch
> > > > consistently allowed a hanging machine to boot through an NMI lockup.
> > > > So I'm forced to wonder what's going on, then, that this patch helps
> > > > with. Perhaps it's just a very fragile timing issue; I'll need to
> > > > look more closely.
> > >
> > > try a dummy iret, something like:
> > >
> > > asm volatile ("pushf; push $1f; iret; 1: \n");
> > >
> > > to get the CPU out of its 'nested NMI' state. (totally untested)
> > >
> > > the idea is to push down an iret frame to the kernel stack that will
> > > just jump to the next instruction and get it out of the NMI nesting.
> > > Note: interrupts will/must still be disabled, despite the iret. (the
> > > ordering of the pushes might be wrong, we might need more than that for
> > > a valid iret, etc. etc.)
> > >
> > > Ingo
> >
> > Just tried this experiment and it met with success. Executing a dummy iret
> > instruction got us to boot the kdump kernel successfully.
> >
>
> Interesting. So that means there is some operation we can't perform while
> in an NMI handler (or in nested NMIs; I don't know if this is a nested-NMI
> case).
>
> Even if we initiated the crash dump in an NMI handler, the next kernel
> should unlock that state as soon as interrupts are enabled in the next
> kernel (an iret will be executed).
>
> So the only issue will be if we need explicit logic to unlock the NMI
> earlier (either in the crashing kernel after clearing the IDT, or in the
> purgatory code). Anything earlier than that will be dangerous, though: we
> would be handling another NMI while we have already crashed and are doing
> the final preparations to jump to the new kernel.
>
> Neil, is it possible to do some serial console debugging to find out where
> exactly we are hanging? Beats me what operation cannot be executed while in
> an NMI handler and makes the system hang. I am also curious to know whether
> this is a nested-NMI case.
>
I can try, but my last attempts to do so found me hung in various places in
purgatory or very early in head.S. I'll try again, though, to see if I can
get some consistency.

Neil


> Thanks
> Vivek

--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2008-02-12 21:11:20

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

>
> Neil, is it possible to do some serial console debugging to find out
> where exactly we are hanging? Beats me what operation cannot be executed
> while in an NMI handler and makes the system hang. I am also curious to
> know whether this is a nested-NMI case.
>
> Thanks
> Vivek
>


Hey-
Some intermediate results:

I've instrumented head.S in the kernel with the following code:

#define SEROUT(z) \
	mov $0x3F8, %dx;	/* COM1 data port */ \
	movb z, %al; \
	outb %al, %dx

and peppered different ascii characters throughout the startup code, from
startup_32 to right before the jump to start_kernel. When I panic the system
via:
	echo c > /proc/sysrq-trigger
I see the appropriate sequence of characters on the serial console.

When I panic the box by forcing an NMI watchdog timeout, however, I see nothing.
The machine will either hang, or reset into the BIOS. I think this is
reasonably conclusive in its indication that we're not getting into the second
kernel when this problem occurs. Next I'll instrument the purgatory code in a
similar way.

Regards
Neil


--
/****************************************************
* Neil Horman <[email protected]>
* Software Engineer, Red Hat
****************************************************/
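
[A C-level equivalent of Neil's SEROUT poke, for anyone reproducing this - a
sketch; it writes straight to COM1's data register at 0x3F8 and, like the
macro, skips LSR polling on the assumption that the UART can keep up:]

	static inline void serout(char c)
	{
		/* outb: byte in %al, port in %dx (port number > 0xff) */
		asm volatile("outb %0, %1"
			     : : "a"(c), "Nd"((unsigned short)0x3F8));
	}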

2008-02-15 14:07:52

by Eric W. Biederman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Neil Horman <[email protected]> writes:

>>
>> Neil, is it possible to do some serial console debugging to find out
>> where exactly we are hanging? Beats me what operation cannot be executed
>> while in an NMI handler and makes the system hang. I am also curious to
>> know whether this is a nested-NMI case.
>>
>> Thanks
>> Vivek
>>
>
>
> Hey-
> Some intermediate results:
>
> I've instrumented head.S in the kernel with the following code:
>
> #define SEROUT(z) \
> 	mov $0x3F8, %dx;	/* COM1 data port */ \
> 	movb z, %al; \
> 	outb %al, %dx
>
> and peppered different ascii characters throughout the startup code, from
> startup_32 to right before the jump to start_kernel. When I panic the system
> via:
> 	echo c > /proc/sysrq-trigger
> I see the appropriate sequence of characters on the serial console.
>
> When I panic the box by forcing an NMI watchdog timeout, however, I see
> nothing. The machine will either hang, or reset into the BIOS. I think this
> is reasonably conclusive in its indication that we're not getting into the
> second kernel when this problem occurs. Next I'll instrument the purgatory
> code in a similar way.

My apologies for the late reply. You should be able to use the --console-serial
option to kexec to get output from purgatory, which will catch everything
except a little bit of the assembly stub. Purgatory even has a stripped-down
version of printf you can call.

Eric
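
[Concretely, loading the crash kernel with serial output from purgatory
would look something like this - a sketch; --console-serial, --serial and
--serial-baud are the x86 kexec-tools options Eric means (exact argument
forms vary by kexec-tools version), and the file names and append string are
only illustrative:]

	kexec -p /boot/vmlinuz-kdump --initrd=/boot/initrd-kdump.img \
	      --console-serial --serial=ttyS0 --serial-baud=115200 \
	      --append="console=ttyS0,115200 irqpoll maxcpus=1"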

2008-02-20 15:00:30

by Neil Horman

Subject: Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

On Tue, Feb 12, 2008 at 04:08:16PM -0500, Neil Horman wrote:
> >
> > Neil, is it possible to do some serial console debugging to find out
> > where exactly we are hanging? Beats me what operation cannot be executed
> > while in an NMI handler and makes the system hang. I am also curious to
> > know whether this is a nested-NMI case.
> >
> > Thanks
> > Vivek
> >
>
>
> Hey-
> Some intermediate results:
>
> I've instrumented head.S in the kernel with the following code:
>
> #define SEROUT(z) \
> 	mov $0x3F8, %dx;	/* COM1 data port */ \
> 	movb z, %al; \
> 	outb %al, %dx
>
> and peppered different ascii characters throughout the startup code, from
> startup_32 to right before the jump to start_kernel. When I panic the system
> via:
> 	echo c > /proc/sysrq-trigger
> I see the appropriate sequence of characters on the serial console.
>
> When I panic the box by forcing an NMI watchdog timeout, however, I see
> nothing. The machine will either hang, or reset into the BIOS. I think this
> is reasonably conclusive in its indication that we're not getting into the
> second kernel when this problem occurs. Next I'll instrument the purgatory
> code in a similar way.
>
> Regards
> Neil
>
Hey, further update. I instrumented purgatory and turned on the console
options that Eric mentioned, which resulted in no output to the console, so
I've concluded that we don't get to purgatory at all. Further instrumenting
shows that the last C-level printk I can issue is right before we call
relocate_kernel. So it appears that we hang/reset somewhere in there. I'll
instrument down that path shortly.

Regards
Neil