2012-02-12 01:04:18

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Sat, Feb 11, 2012 at 3:09 PM, tip-bot for Don Zickus
<[email protected]> wrote:
> Commit-ID: ?d9bc9be89629445758670220787683e37c93f6c1
> Gitweb: ? ? http://git.kernel.org/tip/d9bc9be89629445758670220787683e37c93f6c1
> Author: ? ? Don Zickus <[email protected]>
> AuthorDate: Thu, 9 Feb 2012 16:53:41 -0500
> Committer: ?Ingo Molnar <[email protected]>
> CommitDate: Sat, 11 Feb 2012 15:38:53 +0100
>
> x86/kdump: No need to disable ioapic/lapic in crash path
>
> A customer of ours noticed when their machine crashed, kdump did
> not work but hung instead. ?Using their firmware dumping
> solution they grabbed a vmcore and decoded the stacks on the
> cpus. ?What they noticed seemed to be a rare deadlock with the
> ioapic_lock.
>
> ?CPU4:
> ?machine_crash_shutdown
> ?-> machine_ops.crash_shutdown
> ? ?-> native_machine_crash_shutdown
> ? ? ? -> kdump_nmi_shootdown_cpus ------> Send NMI to other CPUs
> ? ? ? -> disable_IO_APIC
> ? ? ? ? ?-> clear_IO_APIC
> ? ? ? ? ? ? -> clear_IO_APIC_pin
> ? ? ? ? ? ? ? ?-> ioapic_read_entry
> ? ? ? ? ? ? ? ? ? -> spin_lock_irqsave(&ioapic_lock, flags)
> ? ? ? ? ? ? ? ? ? ---Infinite loop here---
>
> ?CPU0:
> ?do_IRQ
> ?-> handle_irq
> ? ?-> handle_edge_irq
> ? ? ? ?-> ack_apic_edge
> ? ? ? ? ? -> move_native_irq
> ? ? ? ? ? ? ? -> mask_IO_APIC_irq
> ? ? ? ? ? ? ? ? ?-> mask_IO_APIC_irq_desc
> ? ? ? ? ? ? ? ? ? ? -> spin_lock_irqsave(&ioapic_lock, flags)
> ? ? ? ? ? ? ? ? ? ? ---Receive NMI here after getting spinlock---
> ? ? ? ? ? ? ? ? ? ? ? ?-> nmi
> ? ? ? ? ? ? ? ? ? ? ? ? ? -> do_nmi
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?-> crash_nmi_callback
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?---Infinite loop here---
>
> The problem is that although kdump tries to shutdown minimal
> hardware, it still needs to disable the IO APIC. ?This requires
> spinlocks which may be held by another cpu. ?This other cpu is
> being held infinitely in an NMI context by kdump in order to
> serialize the crashing path. ?Instant deadlock.
>
> Eric brought up a point that because the boot code was
> restructured we may not need to disable the io apic any more in
> the crash path. ?The original concern that led to the
> development of disable_IO_APIC, was that the jiffies calibration
> on boot up relied on the PIT timer for reference. ?Access to the
> PIT required 8259 interrupts to be working. ?This wouldn't work
> if the ioapic needed to be configured. ?So on panic path, the
> ioapic was reconfigured to use virtual wire mode to allow the 8259 to passthrough.
>
> Those concerns don't hold true now, thanks to the jiffies
> calibration code not needing the PIT. ?As a result, we can
> remove this call and simplify the locking needed in the panic
> path.
>
> The same work allowed us to remove the need to disable the local
> apic on shutdown too. ?This should allow us to jump to the
> second a little faster.
>
> I tested kdump on an Ivy Bridge platform, a Pentium4 and an old
> athlon that did not have an ioapic. ?All three were successful.
>
> I also tested using lkdtm that would use jprobes to panic the
> system when entering do_IRQ. ?The idea was to see how the system
> reacted with an interrupt pending in the second kernel. ?My
> core2 quad successfully kdump'd 3 times in a row with no issues.
>
> v2: removed the disable lapic code too

with this commit, kdump is not working anymore on my setups with
Nehalem, Westmere, sandbridge.
these setup all have VT-d enabled.


After reverting this commit, kdump is working again.

So assume you need to drop this patch.

Thanks

Yinghai Lu


2012-02-12 03:10:31

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

Yinghai Lu <[email protected]> writes:

> On Sat, Feb 11, 2012 at 3:09 PM, tip-bot for Don Zickus
> <[email protected]> wrote:
>> Commit-ID:  d9bc9be89629445758670220787683e37c93f6c1
>> Gitweb:     http://git.kernel.org/tip/d9bc9be89629445758670220787683e37c93f6c1
>> Author:     Don Zickus <[email protected]>
>> AuthorDate: Thu, 9 Feb 2012 16:53:41 -0500
>> Committer:  Ingo Molnar <[email protected]>
>> CommitDate: Sat, 11 Feb 2012 15:38:53 +0100
>>
>> x86/kdump: No need to disable ioapic/lapic in crash path
>>
>> A customer of ours noticed when their machine crashed, kdump did
>> not work but hung instead.  Using their firmware dumping
>> solution they grabbed a vmcore and decoded the stacks on the
>> cpus.  What they noticed seemed to be a rare deadlock with the
>> ioapic_lock.
>>
>>  CPU4:
>>  machine_crash_shutdown
>>  -> machine_ops.crash_shutdown
>>    -> native_machine_crash_shutdown
>>       -> kdump_nmi_shootdown_cpus ------> Send NMI to other CPUs
>>       -> disable_IO_APIC
>>          -> clear_IO_APIC
>>             -> clear_IO_APIC_pin
>>                -> ioapic_read_entry
>>                   -> spin_lock_irqsave(&ioapic_lock, flags)
>>                   ---Infinite loop here---
>>
>>  CPU0:
>>  do_IRQ
>>  -> handle_irq
>>    -> handle_edge_irq
>>        -> ack_apic_edge
>>           -> move_native_irq
>>               -> mask_IO_APIC_irq
>>                  -> mask_IO_APIC_irq_desc
>>                     -> spin_lock_irqsave(&ioapic_lock, flags)
>>                     ---Receive NMI here after getting spinlock---
>>                        -> nmi
>>                           -> do_nmi
>>                              -> crash_nmi_callback
>>                              ---Infinite loop here---
>>
>> The problem is that although kdump tries to shutdown minimal
>> hardware, it still needs to disable the IO APIC.  This requires
>> spinlocks which may be held by another cpu.  This other cpu is
>> being held infinitely in an NMI context by kdump in order to
>> serialize the crashing path.  Instant deadlock.
>>
>> Eric brought up a point that because the boot code was
>> restructured we may not need to disable the io apic any more in
>> the crash path.  The original concern that led to the
>> development of disable_IO_APIC, was that the jiffies calibration
>> on boot up relied on the PIT timer for reference.  Access to the
>> PIT required 8259 interrupts to be working.  This wouldn't work
>> if the ioapic needed to be configured.  So on panic path, the
>> ioapic was reconfigured to use virtual wire mode to allow the 8259 to passthrough.
>>
>> Those concerns don't hold true now, thanks to the jiffies
>> calibration code not needing the PIT.  As a result, we can
>> remove this call and simplify the locking needed in the panic
>> path.
>>
>> The same work allowed us to remove the need to disable the local
>> apic on shutdown too.  This should allow us to jump to the
>> second a little faster.
>>
>> I tested kdump on an Ivy Bridge platform, a Pentium4 and an old
>> athlon that did not have an ioapic.  All three were successful.
>>
>> I also tested using lkdtm that would use jprobes to panic the
>> system when entering do_IRQ.  The idea was to see how the system
>> reacted with an interrupt pending in the second kernel.  My
>> core2 quad successfully kdump'd 3 times in a row with no issues.
>>
>> v2: removed the disable lapic code too
>
> with this commit, kdump is not working anymore on my setups with
> Nehalem, Westmere, sandbridge.
> these setup all have VT-d enabled.
>
>
> After reverting this commit, kdump is working again.
>
> So assume you need to drop this patch.

It sounds like there is a bug in ioapic initialization in the context of
VT-d. Where do you fail?

It would be much better to debug this than to blindly revert this patch,
as this change has the potential to significantly increase the
reliability of the kdump path.

Eric

2012-02-12 04:17:45

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Sat, Feb 11, 2012 at 7:13 PM, Eric W. Biederman
<[email protected]> wrote:
> Yinghai Lu <[email protected]> writes:
>> After reverting this commit, kdump is working again.
>>
>> So assume you need to drop this patch.
>
> It sounds like there is a bug in ioapic initialization in the context of
> VT-d. ?Where do you fail?
>
before get debug print out from second kernel, the system get reset.

Yinghai

2012-02-12 11:13:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path


* Yinghai Lu <[email protected]> wrote:

> On Sat, Feb 11, 2012 at 3:09 PM, tip-bot for Don Zickus
> <[email protected]> wrote:
> > Commit-ID: ?d9bc9be89629445758670220787683e37c93f6c1
> > Gitweb: ? ? http://git.kernel.org/tip/d9bc9be89629445758670220787683e37c93f6c1
> > Author: ? ? Don Zickus <[email protected]>
> > AuthorDate: Thu, 9 Feb 2012 16:53:41 -0500
> > Committer: ?Ingo Molnar <[email protected]>
> > CommitDate: Sat, 11 Feb 2012 15:38:53 +0100
> >
> > x86/kdump: No need to disable ioapic/lapic in crash path
> >
> > A customer of ours noticed when their machine crashed, kdump did
> > not work but hung instead. ?Using their firmware dumping
> > solution they grabbed a vmcore and decoded the stacks on the
> > cpus. ?What they noticed seemed to be a rare deadlock with the
> > ioapic_lock.
> >
> > ?CPU4:
> > ?machine_crash_shutdown
> > ?-> machine_ops.crash_shutdown
> > ? ?-> native_machine_crash_shutdown
> > ? ? ? -> kdump_nmi_shootdown_cpus ------> Send NMI to other CPUs
> > ? ? ? -> disable_IO_APIC
> > ? ? ? ? ?-> clear_IO_APIC
> > ? ? ? ? ? ? -> clear_IO_APIC_pin
> > ? ? ? ? ? ? ? ?-> ioapic_read_entry
> > ? ? ? ? ? ? ? ? ? -> spin_lock_irqsave(&ioapic_lock, flags)
> > ? ? ? ? ? ? ? ? ? ---Infinite loop here---
> >
> > ?CPU0:
> > ?do_IRQ
> > ?-> handle_irq
> > ? ?-> handle_edge_irq
> > ? ? ? ?-> ack_apic_edge
> > ? ? ? ? ? -> move_native_irq
> > ? ? ? ? ? ? ? -> mask_IO_APIC_irq
> > ? ? ? ? ? ? ? ? ?-> mask_IO_APIC_irq_desc
> > ? ? ? ? ? ? ? ? ? ? -> spin_lock_irqsave(&ioapic_lock, flags)
> > ? ? ? ? ? ? ? ? ? ? ---Receive NMI here after getting spinlock---
> > ? ? ? ? ? ? ? ? ? ? ? ?-> nmi
> > ? ? ? ? ? ? ? ? ? ? ? ? ? -> do_nmi
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?-> crash_nmi_callback
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?---Infinite loop here---
> >
> > The problem is that although kdump tries to shutdown minimal
> > hardware, it still needs to disable the IO APIC. ?This requires
> > spinlocks which may be held by another cpu. ?This other cpu is
> > being held infinitely in an NMI context by kdump in order to
> > serialize the crashing path. ?Instant deadlock.
> >
> > Eric brought up a point that because the boot code was
> > restructured we may not need to disable the io apic any more in
> > the crash path. ?The original concern that led to the
> > development of disable_IO_APIC, was that the jiffies calibration
> > on boot up relied on the PIT timer for reference. ?Access to the
> > PIT required 8259 interrupts to be working. ?This wouldn't work
> > if the ioapic needed to be configured. ?So on panic path, the
> > ioapic was reconfigured to use virtual wire mode to allow the 8259 to passthrough.
> >
> > Those concerns don't hold true now, thanks to the jiffies
> > calibration code not needing the PIT. ?As a result, we can
> > remove this call and simplify the locking needed in the panic
> > path.
> >
> > The same work allowed us to remove the need to disable the local
> > apic on shutdown too. ?This should allow us to jump to the
> > second a little faster.
> >
> > I tested kdump on an Ivy Bridge platform, a Pentium4 and an old
> > athlon that did not have an ioapic. ?All three were successful.
> >
> > I also tested using lkdtm that would use jprobes to panic the
> > system when entering do_IRQ. ?The idea was to see how the system
> > reacted with an interrupt pending in the second kernel. ?My
> > core2 quad successfully kdump'd 3 times in a row with no issues.
> >
> > v2: removed the disable lapic code too
>
> with this commit, kdump is not working anymore on my setups with
> Nehalem, Westmere, sandbridge.
> these setup all have VT-d enabled.
>
> After reverting this commit, kdump is working again.
>
> So assume you need to drop this patch.

Dropped the patch, thanks for reporting this.

Don, Eric?

Ingo

2012-02-13 12:49:31

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

Yinghai Lu <[email protected]> writes:

> On Sat, Feb 11, 2012 at 7:13 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Yinghai Lu <[email protected]> writes:
>>> After reverting this commit, kdump is working again.
>>>
>>> So assume you need to drop this patch.
>>
>> It sounds like there is a bug in ioapic initialization in the context of
>> VT-d.  Where do you fail?
>>
> before get debug print out from second kernel, the system get reset.

Ouch.

Don can you work with with Yinghai to figure out what is different
between your test configuration and his?

YH did you have early printk enabled?

YH Did I understand you properly that things work if you don't enable
VT-d? By VT-d are you referring to interrupt remapping?

For anyone watching. The premise of this patch was that we can boot
the kernel without resorting to legacy tricks that require us to
put the interrupt controllers in PIT mode.

Apparently there is some weird corner case that YH can reproduce that
Don did not have in his test set that YH does that causes things to
fail.

I really don't see any likely candidates when looking at the code. The
most I can see is some code that we don't run when interrupt remapping
is enabled in disable_IO_APIC.

So I suspect we have a bug in our apic initialization somewhere, but
apic initialization should happen after printk are enabled. Or at least
after early printks so the reset YH is reporting doesn't make much sense.

Eric

2012-02-13 15:28:37

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Sat, Feb 11, 2012 at 05:04:15PM -0800, Yinghai Lu wrote:
> >
> > v2: removed the disable lapic code too
>
> with this commit, kdump is not working anymore on my setups with
> Nehalem, Westmere, sandbridge.
> these setup all have VT-d enabled.

Hi Yinghai,

Thanks for the report. Can you attach the .config you were using and what
vendor you were using for Nehalem, Westmere and Sandybridge. I'll try to
reproduce it in my lab.

Cheers,
Don

2012-02-13 16:51:03

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 13, 2012 at 4:52 AM, Eric W. Biederman
<[email protected]> wrote:
> Yinghai Lu <[email protected]> writes:
>
>> On Sat, Feb 11, 2012 at 7:13 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Yinghai Lu <[email protected]> writes:
>>>> After reverting this commit, kdump is working again.
>>>>
>>>> So assume you need to drop this patch.
>>>
>>> It sounds like there is a bug in ioapic initialization in the context of
>>> VT-d. ?Where do you fail?
>>>
>> before get debug print out from second kernel, the system get reset.
>
> Ouch.
>
> Don can you work with with Yinghai to figure out what is different
> between your test configuration and his?
>
> YH did you have early printk enabled?

yes.

I always have "console=uart8250,io,0x3f8,115200n8" appended.

also i have local patches that print info from
arch/x86/boot/compressed/misc.c::decompress_kernel()
and
arch/x86/kernel/head64.c::x86_64_start_kernel()

>
> YH Did I understand you properly that things work if you don't enable
> VT-d?

that is default setting for BIOS and OS.

will check if disable that will help.

> By VT-d are you referring to interrupt remapping?

yes, include interrupt rempapping and dma remapping.

>
> For anyone watching. ?The premise of this patch was that we can boot
> the kernel without resorting to legacy tricks that require us to
> put the interrupt controllers in PIT mode.
>
> Apparently there is some weird corner case that YH can reproduce that
> Don did not have in his test set that YH does that causes things to
> fail.
>
> I really don't see any likely candidates when looking at the code. ?The
> most I can see is some code that we don't run when interrupt remapping
> is enabled in disable_IO_APIC.
>
> So I suspect we have a bug in our apic initialization somewhere, but
> apic initialization should happen after printk are enabled. ?Or at least
> after early printks so the reset YH is reporting doesn't make much sense.

will try Don's first version patch that only removing disable_IO_APIC.

Yinghai

2012-02-13 16:52:24

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 13, 2012 at 7:28 AM, Don Zickus <[email protected]> wrote:
> On Sat, Feb 11, 2012 at 05:04:15PM -0800, Yinghai Lu wrote:
>> >
>> > v2: removed the disable lapic code too
>>
>> with this commit, kdump is not working anymore on my setups with
>> Nehalem, Westmere, sandbridge.
>> these setup all have VT-d enabled.
>
> Hi Yinghai,
>
> Thanks for the report. ?Can you attach the .config you were using and what
> vendor you were using for Nehalem, Westmere and Sandybridge. ?I'll try to
> reproduce it in my lab.

Sure, please check attached one

Yinghai


Attachments:
config.2012_02_11 (89.67 kB)

2012-02-13 18:16:04

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 13, 2012 at 8:51 AM, Yinghai Lu <[email protected]> wrote:
>> So I suspect we have a bug in our apic initialization somewhere, but
>> apic initialization should happen after printk are enabled. ?Or at least
>> after early printks so the reset YH is reporting doesn't make much sense.
>
> will try Don's first version patch that only removing disable_IO_APIC.

first version patch (only removing disable_IO_APIC) is working.

Thanks

Yinghai

2012-02-13 22:13:20

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 13, 2012 at 08:52:19AM -0800, Yinghai Lu wrote:
> On Mon, Feb 13, 2012 at 7:28 AM, Don Zickus <[email protected]> wrote:
> > On Sat, Feb 11, 2012 at 05:04:15PM -0800, Yinghai Lu wrote:
> >> >
> >> > v2: removed the disable lapic code too
> >>
> >> with this commit, kdump is not working anymore on my setups with
> >> Nehalem, Westmere, sandbridge.
> >> these setup all have VT-d enabled.
> >
> > Hi Yinghai,
> >
> > Thanks for the report. ?Can you attach the .config you were using and what
> > vendor you were using for Nehalem, Westmere and Sandybridge. ?I'll try to
> > reproduce it in my lab.
>
> Sure, please check attached one

Looks like I can reproduce the hang with my original patch and things work
when we stick the disable_local_APIC() back in there. I'll keep poking.

Cheers,
Don

2012-02-13 22:51:49

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 13, 2012 at 05:12:58PM -0500, Don Zickus wrote:
> On Mon, Feb 13, 2012 at 08:52:19AM -0800, Yinghai Lu wrote:
> > On Mon, Feb 13, 2012 at 7:28 AM, Don Zickus <[email protected]> wrote:
> > > On Sat, Feb 11, 2012 at 05:04:15PM -0800, Yinghai Lu wrote:
> > >> >
> > >> > v2: removed the disable lapic code too
> > >>
> > >> with this commit, kdump is not working anymore on my setups with
> > >> Nehalem, Westmere, sandbridge.
> > >> these setup all have VT-d enabled.
> > >
> > > Hi Yinghai,
> > >
> > > Thanks for the report. ?Can you attach the .config you were using and what
> > > vendor you were using for Nehalem, Westmere and Sandybridge. ?I'll try to
> > > reproduce it in my lab.
> >
> > Sure, please check attached one
>
> Looks like I can reproduce the hang with my original patch and things work
> when we stick the disable_local_APIC() back in there. I'll keep poking.

Hmm, my config worked fine in both cases, hence why I didn't catch it (its
based on a bloated Fedora config). I'll look at the differences to see if
something pops out.

Cheers,
Don

2012-02-16 02:53:58

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 13, 2012 at 08:52:19AM -0800, Yinghai Lu wrote:
> On Mon, Feb 13, 2012 at 7:28 AM, Don Zickus <[email protected]> wrote:
> > On Sat, Feb 11, 2012 at 05:04:15PM -0800, Yinghai Lu wrote:
> >> >
> >> > v2: removed the disable lapic code too
> >>
> >> with this commit, kdump is not working anymore on my setups with
> >> Nehalem, Westmere, sandbridge.
> >> these setup all have VT-d enabled.
> >
> > Hi Yinghai,
> >
> > Thanks for the report. ?Can you attach the .config you were using and what
> > vendor you were using for Nehalem, Westmere and Sandybridge. ?I'll try to
> > reproduce it in my lab.
>
> Sure, please check attached one

For some reason using gzip instead of xz makes things work for me with
your config.

CONFIG_KERNEL_GZIP=y
CONFIG_KERNEL_XZ=n

Do you see the same thing?

Cheers,
Don

2012-02-16 18:43:09

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Wed, Feb 15, 2012 at 6:53 PM, Don Zickus <[email protected]> wrote:
>
> For some reason using gzip instead of xz makes things work for me with
> your config.
>
> CONFIG_KERNEL_GZIP=y
> CONFIG_KERNEL_XZ=n
>
> Do you see the same thing?

no.

with GZIP still get instant reset.

Yinghai

2012-02-16 19:10:55

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 13, 2012 at 10:16:00AM -0800, Yinghai Lu wrote:
> On Mon, Feb 13, 2012 at 8:51 AM, Yinghai Lu <[email protected]> wrote:
> >> So I suspect we have a bug in our apic initialization somewhere, but
> >> apic initialization should happen after printk are enabled. ?Or at least
> >> after early printks so the reset YH is reporting doesn't make much sense.
> >
> > will try Don's first version patch that only removing disable_IO_APIC.
>
> first version patch (only removing disable_IO_APIC) is working.

So I think I figured it out. I went through and commented out code in
disable_local_APIC until I narrowed it down to the piece of code that
needs to be disabled for it to work.

Surprise, surprise... its LVTPC or perf! :-) Actually it is the
nmi_watchdog which uses perf. My theory is NMIs are not disabled and one
is generated by the local apic during decompression (just bad timing) and
*splat*.

Yinghai, you can probably prove this by

echo 0 > /proc/sys/kernel/nmi_watchdog

then do your kdump crash test.

At least that test worked for me.

So either we explicitly shutdown perf or just mask off LVTPC in a modified
disable_local_APIC?

Eric, thoughts, preferences?

Cheers,
Don

2012-02-16 21:42:03

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Thu, Feb 16, 2012 at 10:43:06AM -0800, Yinghai Lu wrote:
> On Wed, Feb 15, 2012 at 6:53 PM, Don Zickus <[email protected]> wrote:
> >
> > For some reason using gzip instead of xz makes things work for me with
> > your config.
> >
> > CONFIG_KERNEL_GZIP=y
> > CONFIG_KERNEL_XZ=n
> >
> > Do you see the same thing?
>
> no.
>
> with GZIP still get instant reset.

What about my other idea, 'echo 0 > /proc/sys/kernel/nmi_watchdog'?

Cheers,
Don

2012-02-16 21:53:32

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Thu, Feb 16, 2012 at 9:27 AM, Don Zickus <[email protected]> wrote:

> So I think I figured it out. ?I went through and commented out code in
> disable_local_APIC until I narrowed it down to the piece of code that
> needs to be disabled for it to work.
>
> Surprise, surprise... its LVTPC or perf! :-) ?Actually it is the
> nmi_watchdog which uses perf. ?My theory is NMIs are not disabled and one
> is generated by the local apic during decompression (just bad timing) and
> *splat*.
>
> Yinghai, you can probably prove this by
>
> echo 0 > /proc/sys/kernel/nmi_watchdog
>
> then do your kdump crash test.

yes. that will make kdump crash working.

Yinghai

2012-02-16 22:10:56

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Thu, Feb 16, 2012 at 01:53:29PM -0800, Yinghai Lu wrote:
> On Thu, Feb 16, 2012 at 9:27 AM, Don Zickus <[email protected]> wrote:
>
> > So I think I figured it out. ?I went through and commented out code in
> > disable_local_APIC until I narrowed it down to the piece of code that
> > needs to be disabled for it to work.
> >
> > Surprise, surprise... its LVTPC or perf! :-) ?Actually it is the
> > nmi_watchdog which uses perf. ?My theory is NMIs are not disabled and one
> > is generated by the local apic during decompression (just bad timing) and
> > *splat*.
> >
> > Yinghai, you can probably prove this by
> >
> > echo 0 > /proc/sys/kernel/nmi_watchdog
> >
> > then do your kdump crash test.
>
> yes. that will make kdump crash working.

Cool. Thanks.

Eric,

Just let me know how you want to handle disabling NMIs in the kexec in
panic shutdown case.

Cheers,
Don

2012-02-17 03:35:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

Don Zickus <[email protected]> writes:

> On Thu, Feb 16, 2012 at 01:53:29PM -0800, Yinghai Lu wrote:
>> On Thu, Feb 16, 2012 at 9:27 AM, Don Zickus <[email protected]> wrote:
>>
>> > So I think I figured it out.  I went through and commented out code in
>> > disable_local_APIC until I narrowed it down to the piece of code that
>> > needs to be disabled for it to work.
>> >
>> > Surprise, surprise... its LVTPC or perf! :-)  Actually it is the
>> > nmi_watchdog which uses perf.  My theory is NMIs are not disabled and one
>> > is generated by the local apic during decompression (just bad timing) and
>> > *splat*.
>> >
>> > Yinghai, you can probably prove this by
>> >
>> > echo 0 > /proc/sys/kernel/nmi_watchdog
>> >
>> > then do your kdump crash test.
>>
>> yes. that will make kdump crash working.
>
> Cool. Thanks.
>
> Eric,
>
> Just let me know how you want to handle disabling NMIs in the kexec in
> panic shutdown case.

Interesting. Apparently we have been avoiding this problem by accident.

Thanks for hunting this down.

The options I can see are:
- Ensure we can handle and ignore exceptions like this.
- Always shutoff the lapic and ioapic entries that can generate this.

The good news is that both solutions should be lock free.

The current kernel boot code relies on the assumption that all
interrupts can be disabled. In this case with nmi's that is clearly not
the case.

The most robust solution and what we want to do long term is to
install an idt that will simply ignore all interrupts until the
idt is replaced. Since really all we need to deal with is the NMI
vector, which is vector #2, we can have a very small interrupt
descriptor table.

Unfortunately we go through some cpu mode switches in /sbin/kexec,
allowing us to enter the kernels 32bit entry point before we
run the decompresser, so at first glance both /sbin/kexec and the
kernel need to be fixed in a coordinated fashion.

There are two was I can see of removing the need for an exactly
coordinated release.
- Document that an old /sbin/kexec userspace requires you not to
use the nmi watchdog with modern kernels.
- For a short while simply retain code that stomps the nmi watchdog.
(But still leaves us open to other kinds of nmi's).

Grr. Looking a little more closely, all throughout the linux kernel's
boot there is the assumption that any interrupt during boot is a failure
of some kind, and except for an errant nmi watchdog that is a true
assumption.

Don I guess I really have to recommend disabling the nmi watchdog in the
kexec on panic path if we can do so at all reasonably.

I like the idea of ignoring nmis during boot but that seems to be a
slightly larger project and with little practical improvement in kexec
on panic quality. Other than getting what should be one or two
i/o writes out of the kexec on panic path.

Eric

2012-02-17 12:38:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

[email protected] (Eric W. Biederman) writes:

> Don Zickus <[email protected]> writes:
>
>> On Thu, Feb 16, 2012 at 01:53:29PM -0800, Yinghai Lu wrote:
>>> On Thu, Feb 16, 2012 at 9:27 AM, Don Zickus <[email protected]> wrote:
>>>
>>> > So I think I figured it out.  I went through and commented out code in
>>> > disable_local_APIC until I narrowed it down to the piece of code that
>>> > needs to be disabled for it to work.
>>> >
>>> > Surprise, surprise... its LVTPC or perf! :-)  Actually it is the
>>> > nmi_watchdog which uses perf.  My theory is NMIs are not disabled and one
>>> > is generated by the local apic during decompression (just bad timing) and
>>> > *splat*.
>>> >
>>> > Yinghai, you can probably prove this by
>>> >
>>> > echo 0 > /proc/sys/kernel/nmi_watchdog
>>> >
>>> > then do your kdump crash test.
>>>
>>> yes. that will make kdump crash working.
>>
>> Cool. Thanks.
>>
>> Eric,
>>
>> Just let me know how you want to handle disabling NMIs in the kexec in
>> panic shutdown case.
>
> Interesting. Apparently we have been avoiding this problem by accident.
>
> Thanks for hunting this down.
>
> The options I can see are:
> - Ensure we can handle and ignore exceptions like this.
> - Always shutoff the lapic and ioapic entries that can generate this.
>
> The good news is that both solutions should be lock free.
>
> The current kernel boot code relies on the assumption that all
> interrupts can be disabled. In this case with nmi's that is clearly not
> the case.
>
> The most robust solution and what we want to do long term is to
> install an idt that will simply ignore all interrupts until the
> idt is replaced. Since really all we need to deal with is the NMI
> vector, which is vector #2, we can have a very small interrupt
> descriptor table.
>
> Unfortunately we go through some cpu mode switches in /sbin/kexec,
> allowing us to enter the kernels 32bit entry point before we
> run the decompresser, so at first glance both /sbin/kexec and the
> kernel need to be fixed in a coordinated fashion.
>
> There are two was I can see of removing the need for an exactly
> coordinated release.
> - Document that an old /sbin/kexec userspace requires you not to
> use the nmi watchdog with modern kernels.
> - For a short while simply retain code that stomps the nmi watchdog.
> (But still leaves us open to other kinds of nmi's).
>
> Grr. Looking a little more closely, all throughout the linux kernel's
> boot there is the assumption that any interrupt during boot is a failure
> of some kind, and except for an errant nmi watchdog that is a true
> assumption.
>
> Don I guess I really have to recommend disabling the nmi watchdog in the
> kexec on panic path if we can do so at all reasonably.
>
> I like the idea of ignoring nmis during boot but that seems to be a
> slightly larger project and with little practical improvement in kexec
> on panic quality. Other than getting what should be one or two
> i/o writes out of the kexec on panic path.

Hmm.

Thinking about it a little more. The kernel's boot path is inconsistent
with the rest of the kernel's nmi handling. For anything exception
except an nmi stopping and giving up is fine. For an nmi it is very
rare for an NMI to signal a truly nasty failure (usually it just means
someone saw a bitflip somewhere), and we can almost always continue
without problem.

I think in practice we really should make our boot path consistent with
the rest of the kernel and ignore/log/report nmis but not fail on them.
Triple faulting (trigger a cpu reset) as we do today just seems like a
recipe for deep and confusing mystery, and not being helpful to the
user.

My preferred fix would be to fix the boot path and /sbin/kexec to ignore
and report nmis as we boot, as that is really what we want long term and
it gives us the most robust solution.

The fix with a guarantee of no more scope creep is to just disable the
nmi watchdog on the kexec on panic path.

Don if you have time please figure out is needed to ignore nmi's and
possible record and/or report them while we boot, otherwise please cook
up a patch that just disables the nmi watchdog wherever we are sending
it from (the local apic or the ioapic).

Eric

2012-02-17 15:49:19

by HATAYAMA Daisuke

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

2012/2/17 Eric W. Biederman <[email protected]>:
> [email protected] (Eric W. Biederman) writes:
>
>> Don Zickus <[email protected]> writes:
>>
>>> On Thu, Feb 16, 2012 at 01:53:29PM -0800, Yinghai Lu wrote:
>>>> On Thu, Feb 16, 2012 at 9:27 AM, Don Zickus <[email protected]> wrote:
>>>>
>>>> > So I think I figured it out. ?I went through and commented out code in
>>>> > disable_local_APIC until I narrowed it down to the piece of code that
>>>> > needs to be disabled for it to work.
>>>> >
>>>> > Surprise, surprise... its LVTPC or perf! :-) ?Actually it is the
>>>> > nmi_watchdog which uses perf. ?My theory is NMIs are not disabled and one
>>>> > is generated by the local apic during decompression (just bad timing) and
>>>> > *splat*.
>>>> >
>>>> > Yinghai, you can probably prove this by
>>>> >
>>>> > echo 0 > /proc/sys/kernel/nmi_watchdog
>>>> >
>>>> > then do your kdump crash test.
>>>>
>>>> yes. ?that will make kdump crash working.
>>>
>>> Cool. ?Thanks.
>>>
>>> Eric,
>>>
>>> Just let me know how you want to handle disabling NMIs in the kexec in
>>> panic shutdown case.
>>
>> Interesting. ?Apparently we have been avoiding this problem by accident.
>>
>> Thanks for hunting this down.
>>
>> The options I can see are:
>> - Ensure we can handle and ignore exceptions like this.
>> - Always shutoff the lapic and ioapic entries that can generate this.
>>
>> The good news is that both solutions should be lock free.
>>
>> The current kernel boot code relies on the assumption that all
>> interrupts can be disabled. ?In this case with nmi's that is clearly not
>> the case.
>>
>> The most robust solution and what we want to do long term is to
>> install an idt that will simply ignore all interrupts until the
>> idt is replaced. ?Since really all we need to deal with is the NMI
>> vector, which is vector #2, we can have a very small interrupt
>> descriptor table.
>>
>> Unfortunately we go through some cpu mode switches in /sbin/kexec,
>> allowing us to enter the kernels 32bit entry point before we
>> run the decompresser, so at first glance both /sbin/kexec and the
>> kernel need to be fixed in a coordinated fashion.
>>
>> There are two was I can see of removing the need for an exactly
>> coordinated release.
>> - Document that an old /sbin/kexec userspace requires you not to
>> ? use the nmi watchdog with modern kernels.
>> - For a short while simply retain code that stomps the nmi watchdog.
>> ? (But still leaves us open to other kinds of nmi's).
>>
>> Grr. ?Looking a little more closely, all throughout the linux kernel's
>> boot there is the assumption that any interrupt during boot is a failure
>> of some kind, and except for an errant nmi watchdog that is a true
>> assumption.
>>
>> Don I guess I really have to recommend disabling the nmi watchdog in the
>> kexec on panic path if we can do so at all reasonably.
>>
>> I like the idea of ignoring nmis during boot but that seems to be a
>> slightly larger project and with little practical improvement in kexec
>> on panic quality. ?Other than getting what should be one or two
>> i/o writes out of the kexec on panic path.
>
> Hmm.
>
> Thinking about it a little more. ?The kernel's boot path is inconsistent
> with the rest of the kernel's nmi handling. ?For anything exception
> except an nmi stopping and giving up is fine. ?For an nmi it is very
> rare for an NMI to signal a truly nasty failure (usually it just means
> someone saw a bitflip somewhere), and we can almost always continue
> without problem.
>
> I think in practice we really should make our boot path consistent with
> the rest of the kernel and ignore/log/report nmis but not fail on them.
> Triple faulting (trigger a cpu reset) as we do today just seems like a
> recipe for deep and confusing mystery, and not being helpful to the
> user.
>
> My preferred fix would be to fix the boot path and /sbin/kexec to ignore
> and report nmis as we boot, as that is really what we want long term and
> it gives us the most robust solution.
>
> The fix with a guarantee of no more scope creep is to just disable the
> nmi watchdog on the kexec on panic path.
>
> Don if you have time please figure out is needed to ignore nmi's and
> possible record and/or report them while we boot, otherwise please cook
> up a patch that just disables the nmi watchdog wherever we are sending
> it from (the local apic or the ioapic).
>
> Eric

A few days ago I investigted the case where system is reseted due to
triple fault caused by the NMI after idt is disabled in
machine_kexec. I didn't see the reset when trigering the kdump with
NMI since the NMI is masked until next iret instruction executed as
described in 6.7.2. Handling Multiple NMIs of Intel Manual Vol.3A.
The NMI mask remains untill the first iret execution on the 2nd
kernel: just the return path of the first kernel_thread invocation for
init process. The exact path is:

switch_to
-> ret_from_fork
-> int_ret_from_sys_call
-> retint_restore_args
-> irq_return

At that phase idt is already set up and kdump works.

>From the discussion I interpret kdump doesn't assume this behaviour,
right?

BTW, does anyone know the detail of the NMI mask? I couldn't figure
out about it from the Intel spec more than ``certain hardware
conditions''... I expect those who look at here are x86 NMI experts.

Thanks.
HATAYAMA, Daisuke

2012-02-17 19:54:56

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Fri, Feb 17, 2012 at 04:41:01AM -0800, Eric W. Biederman wrote:
>
> The fix with a guarantee of no more scope creep is to just disable the
> nmi watchdog on the kexec on panic path.
>
> Don if you have time please figure out is needed to ignore nmi's and
> possible record and/or report them while we boot, otherwise please cook
> up a patch that just disables the nmi watchdog wherever we are sending
> it from (the local apic or the ioapic).

Can I keep things even simpler? The original problem was the deadlock
with the ioapic lock. We fixed that by removing the call to
disable_IO_APIC(). Can we just leave the disable_local_APIC calls in
there for now? Is there any real harm?

All this rewrite is going to take time which will delay fixing a current
problem with kexec on panic, the ioapic deadlock.

Thoughts?

Cheers,
Don

2012-02-17 20:19:07

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Sat, Feb 18, 2012 at 12:49:16AM +0900, HATAYAMA Daisuke wrote:
> A few days ago I investigted the case where system is reseted due to
> triple fault caused by the NMI after idt is disabled in
> machine_kexec. I didn't see the reset when trigering the kdump with
> NMI since the NMI is masked until next iret instruction executed as
> described in 6.7.2. Handling Multiple NMIs of Intel Manual Vol.3A.
> The NMI mask remains untill the first iret execution on the 2nd
> kernel: just the return path of the first kernel_thread invocation for
> init process. The exact path is:

hmm. So even though the local apic was disabled you still got an NMI?
That could have been from an external NMI. I forget how that is wired up,
if it goes through the IOAPIC to the Local APIC or directly to the NMI pin
on the cpu.

>
> switch_to
> -> ret_from_fork
> -> int_ret_from_sys_call
> -> retint_restore_args
> -> irq_return
>
> At that phase idt is already set up and kdump works.
>
> From the discussion I interpret kdump doesn't assume this behaviour,
> right?

probably not.

>
> BTW, does anyone know the detail of the NMI mask? I couldn't figure
> out about it from the Intel spec more than ``certain hardware
> conditions''... I expect those who look at here are x86 NMI experts.

I don't understand the question.

Cheers,
Don

2012-02-18 03:19:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

Don Zickus <[email protected]> writes:

> On Fri, Feb 17, 2012 at 04:41:01AM -0800, Eric W. Biederman wrote:
>>
>> The fix with a guarantee of no more scope creep is to just disable the
>> nmi watchdog on the kexec on panic path.
>>
>> Don if you have time please figure out is needed to ignore nmi's and
>> possible record and/or report them while we boot, otherwise please cook
>> up a patch that just disables the nmi watchdog wherever we are sending
>> it from (the local apic or the ioapic).
>
> Can I keep things even simpler? The original problem was the deadlock
> with the ioapic lock. We fixed that by removing the call to
> disable_IO_APIC(). Can we just leave the disable_local_APIC calls in
> there for now? Is there any real harm?

> All this rewrite is going to take time which will delay fixing a current
> problem with kexec on panic, the ioapic deadlock.

Hmm.

My apologies I just realized that we can not disable the nmi watchdog
safely in all cases. To avoid the deadlock we fundamentally can not
write to the io_apic, because the locks are the io_apic write path.
The nmi watchdog can be sourced from either the local apics or the
io_apics. To disable the nmi_watchdog we need at least potentially
to write io_apic.

So it appears to me that the only reasonable and robust thing we can
do is to ignore nmis in the kexec on panic path.

So it looks to me that the only path forward at this point is to fix
the other bug where an unexpected nmi will kill the kexec on panic boot.

I just took a look at the code in /sbin/kexec and that code does not in
fact change the idt except when we switch to 16bit mode, which we
definitely do not do in the kexec on panic case. So it appears that we
don't need to coordinate an /sbin/kexec release with a kernel release to
ignore nmis.

In fact it looks like we only need to fix the interrupt descriptors
loaded in machine_kexec_64.c and head64.c to ignore nmis.

At which point we will have fixed two bugs and have a much more reliable
kexec on panic implementation.

Eric

2012-02-20 05:18:05

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

From: Don Zickus <[email protected]>
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path
Date: Fri, 17 Feb 2012 15:18:42 -0500

> On Sat, Feb 18, 2012 at 12:49:16AM +0900, HATAYAMA Daisuke wrote:
>> A few days ago I investigted the case where system is reseted due to
>> triple fault caused by the NMI after idt is disabled in
>> machine_kexec. I didn't see the reset when trigering the kdump with
>> NMI since the NMI is masked until next iret instruction executed as
>> described in 6.7.2. Handling Multiple NMIs of Intel Manual Vol.3A.
>> The NMI mask remains untill the first iret execution on the 2nd
>> kernel: just the return path of the first kernel_thread invocation for
>> init process. The exact path is:
>
> hmm. So even though the local apic was disabled you still got an NMI?
> That could have been from an external NMI. I forget how that is wired up,
> if it goes through the IOAPIC to the Local APIC or directly to the NMI pin
> on the cpu.
>

Please don't confused. I used RHEL kernels based on 2.6.18 and
2.6.32. I didn't use the patch disabling local apic.

>>
>> switch_to
>> -> ret_from_fork
>> -> int_ret_from_sys_call
>> -> retint_restore_args
>> -> irq_return
>>
>> At that phase idt is already set up and kdump works.
>>
>> From the discussion I interpret kdump doesn't assume this behaviour,
>> right?
>
> probably not.
>

Thanks.

>>
>> BTW, does anyone know the detail of the NMI mask? I couldn't figure
>> out about it from the Intel spec more than ``certain hardware
>> conditions''... I expect those who look at here are x86 NMI experts.
>
> I don't understand the question.
>
> Cheers,
> Don
>

Fig 10-4 explaining Local APIC Structure says INIT/NMI/SMI are
directly sent to CPU Core, but the later part of this route is not
explained formally anyware. Only the explanation is the sentence in
6.7 Nonmaskable Interrupt (NMI):

The processor also invokes certain hardware conditions to insure
that no other interrupts, including NMI interrupts, are received
until the NMI handler has completed executing.

I'm just wondering if this is explained more formally anyware.

Thanks.
HATAYAMA, Daisuke

2012-02-20 15:14:52

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Fri, Feb 17, 2012 at 07:21:52PM -0800, Eric W. Biederman wrote:
> Don Zickus <[email protected]> writes:
>
> > On Fri, Feb 17, 2012 at 04:41:01AM -0800, Eric W. Biederman wrote:
> >>
> >> The fix with a guarantee of no more scope creep is to just disable the
> >> nmi watchdog on the kexec on panic path.
> >>
> >> Don if you have time please figure out is needed to ignore nmi's and
> >> possible record and/or report them while we boot, otherwise please cook
> >> up a patch that just disables the nmi watchdog wherever we are sending
> >> it from (the local apic or the ioapic).
> >
> > Can I keep things even simpler? The original problem was the deadlock
> > with the ioapic lock. We fixed that by removing the call to
> > disable_IO_APIC(). Can we just leave the disable_local_APIC calls in
> > there for now? Is there any real harm?
>
> > All this rewrite is going to take time which will delay fixing a current
> > problem with kexec on panic, the ioapic deadlock.
>
> Hmm.
>
> My apologies I just realized that we can not disable the nmi watchdog
> safely in all cases. To avoid the deadlock we fundamentally can not
> write to the io_apic, because the locks are the io_apic write path.
> The nmi watchdog can be sourced from either the local apics or the
> io_apics. To disable the nmi_watchdog we need at least potentially
> to write io_apic.

I am curious where you see the nmi watchdog being sourced from the ioapic?
I thought I removed that code 3 or 4 releases ago.

>
> So it appears to me that the only reasonable and robust thing we can
> do is to ignore nmis in the kexec on panic path.
>
> So it looks to me that the only path forward at this point is to fix
> the other bug where an unexpected nmi will kill the kexec on panic boot.
>
> I just took a look at the code in /sbin/kexec and that code does not in
> fact change the idt except when we switch to 16bit mode, which we
> definitely do not do in the kexec on panic case. So it appears that we
> don't need to coordinate an /sbin/kexec release with a kernel release to
> ignore nmis.
>
> In fact it looks like we only need to fix the interrupt descriptors
> loaded in machine_kexec_64.c and head64.c to ignore nmis.
>
> At which point we will have fixed two bugs and have a much more reliable
> kexec on panic implementation.

Ok. I'll talk with Vivek about how the can be implemented.

Cheers.
Don

2012-02-20 15:25:18

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Mon, Feb 20, 2012 at 02:17:33PM +0900, HATAYAMA Daisuke wrote:
> From: Don Zickus <[email protected]>
> Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path
> Date: Fri, 17 Feb 2012 15:18:42 -0500
>
> > On Sat, Feb 18, 2012 at 12:49:16AM +0900, HATAYAMA Daisuke wrote:
> >> A few days ago I investigted the case where system is reseted due to
> >> triple fault caused by the NMI after idt is disabled in
> >> machine_kexec. I didn't see the reset when trigering the kdump with
> >> NMI since the NMI is masked until next iret instruction executed as
> >> described in 6.7.2. Handling Multiple NMIs of Intel Manual Vol.3A.
> >> The NMI mask remains untill the first iret execution on the 2nd
> >> kernel: just the return path of the first kernel_thread invocation for
> >> init process. The exact path is:
> >
> > hmm. So even though the local apic was disabled you still got an NMI?
> > That could have been from an external NMI. I forget how that is wired up,
> > if it goes through the IOAPIC to the Local APIC or directly to the NMI pin
> > on the cpu.
> >
>
> Please don't confused. I used RHEL kernels based on 2.6.18 and
> 2.6.32. I didn't use the patch disabling local apic.

Sure. Those kernels should be using the 'disable_local_APIC' code. My
patch just removed that call, IOW it stops disabling local apic or a
simpler way is to say it keeps the local apic enabled.

My question stills stands then, you might have experienced an external
NMI, but I am not entirely sure.


>
> >>
> >> switch_to
> >> -> ret_from_fork
> >> -> int_ret_from_sys_call
> >> -> retint_restore_args
> >> -> irq_return
> >>
> >> At that phase idt is already set up and kdump works.
> >>
> >> From the discussion I interpret kdump doesn't assume this behaviour,
> >> right?
> >
> > probably not.
> >
>
> Thanks.
>
> >>
> >> BTW, does anyone know the detail of the NMI mask? I couldn't figure
> >> out about it from the Intel spec more than ``certain hardware
> >> conditions''... I expect those who look at here are x86 NMI experts.
> >
> > I don't understand the question.
> >
> > Cheers,
> > Don
> >
>
> Fig 10-4 explaining Local APIC Structure says INIT/NMI/SMI are
> directly sent to CPU Core, but the later part of this route is not
> explained formally anyware. Only the explanation is the sentence in
> 6.7 Nonmaskable Interrupt (NMI):
>
> The processor also invokes certain hardware conditions to insure
> that no other interrupts, including NMI interrupts, are received
> until the NMI handler has completed executing.
>
> I'm just wondering if this is explained more formally anyware.

It might be I just don't know where. I just view the NMI as an exception.
Each cpu exception has a priority. NMI has a higher priority than
interrupts but a lower priority that say INIT. Therefore when the cpu
gets an exception it classifies it based on priority. Higher priorities
will interrupt the current exception, such as NMI, while lower priorities
will wait until the current exception is finished.

To me those would be the hardware conditions, but that is my
interpretation.

Cheers,
Don

>
> Thanks.
> HATAYAMA, Daisuke
>

2012-02-21 07:58:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

Don Zickus <[email protected]> writes:

2> On Fri, Feb 17, 2012 at 07:21:52PM -0800, Eric W. Biederman wrote:
>> Don Zickus <[email protected]> writes:
>>
>> > On Fri, Feb 17, 2012 at 04:41:01AM -0800, Eric W. Biederman wrote:
>> >>
>> >> The fix with a guarantee of no more scope creep is to just disable the
>> >> nmi watchdog on the kexec on panic path.
>> >>
>> >> Don if you have time please figure out is needed to ignore nmi's and
>> >> possible record and/or report them while we boot, otherwise please cook
>> >> up a patch that just disables the nmi watchdog wherever we are sending
>> >> it from (the local apic or the ioapic).
>> >
>> > Can I keep things even simpler? The original problem was the deadlock
>> > with the ioapic lock. We fixed that by removing the call to
>> > disable_IO_APIC(). Can we just leave the disable_local_APIC calls in
>> > there for now? Is there any real harm?
>>
>> > All this rewrite is going to take time which will delay fixing a current
>> > problem with kexec on panic, the ioapic deadlock.
>>
>> Hmm.
>>
>> My apologies I just realized that we can not disable the nmi watchdog
>> safely in all cases. To avoid the deadlock we fundamentally can not
>> write to the io_apic, because the locks are the io_apic write path.
>> The nmi watchdog can be sourced from either the local apics or the
>> io_apics. To disable the nmi_watchdog we need at least potentially
>> to write io_apic.
>
> I am curious where you see the nmi watchdog being sourced from the ioapic?
> I thought I removed that code 3 or 4 releases ago.

In my memory, and in references to the code in comments in various apic
related code. I couldn't figure out what the current code was doing and
assumed the implementation was equivalent. It does look like you
removed the code that used the io_apic. I still haven't figured
out just how the new implementation works yet.

So maybe in the short term we can safely just stomp the timer that
triggers the nmi watchdog in the local apic. Over the long term that
feels like it is just asking for trouble.

I wonder if the reason that we have an hpet stomp in that code is
for a similar reason. Did we ever source nmi's from the hpet timer?

>> So it appears to me that the only reasonable and robust thing we can
>> do is to ignore nmis in the kexec on panic path.
>>
>> So it looks to me that the only path forward at this point is to fix
>> the other bug where an unexpected nmi will kill the kexec on panic boot.
>>
>> I just took a look at the code in /sbin/kexec and that code does not in
>> fact change the idt except when we switch to 16bit mode, which we
>> definitely do not do in the kexec on panic case. So it appears that we
>> don't need to coordinate an /sbin/kexec release with a kernel release to
>> ignore nmis.
>>
>> In fact it looks like we only need to fix the interrupt descriptors
>> loaded in machine_kexec_64.c and head64.c to ignore nmis.
>>
>> At which point we will have fixed two bugs and have a much more reliable
>> kexec on panic implementation.
>
> Ok. I'll talk with Vivek about how the can be implemented.

Thanks. It really doesn't look very hard. Just a tiny idt with
an nmi entry that says iret.

Eric

2012-02-21 14:00:14

by Don Zickus

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

On Tue, Feb 21, 2012 at 12:01:07AM -0800, Eric W. Biederman wrote:
> Don Zickus <[email protected]> writes:
>
> 2> On Fri, Feb 17, 2012 at 07:21:52PM -0800, Eric W. Biederman wrote:
> >> Don Zickus <[email protected]> writes:
> >>
> >> > On Fri, Feb 17, 2012 at 04:41:01AM -0800, Eric W. Biederman wrote:
> >> >>
> >> >> The fix with a guarantee of no more scope creep is to just disable the
> >> >> nmi watchdog on the kexec on panic path.
> >> >>
> >> >> Don if you have time please figure out is needed to ignore nmi's and
> >> >> possible record and/or report them while we boot, otherwise please cook
> >> >> up a patch that just disables the nmi watchdog wherever we are sending
> >> >> it from (the local apic or the ioapic).
> >> >
> >> > Can I keep things even simpler? The original problem was the deadlock
> >> > with the ioapic lock. We fixed that by removing the call to
> >> > disable_IO_APIC(). Can we just leave the disable_local_APIC calls in
> >> > there for now? Is there any real harm?
> >>
> >> > All this rewrite is going to take time which will delay fixing a current
> >> > problem with kexec on panic, the ioapic deadlock.
> >>
> >> Hmm.
> >>
> >> My apologies I just realized that we can not disable the nmi watchdog
> >> safely in all cases. To avoid the deadlock we fundamentally can not
> >> write to the io_apic, because the locks are the io_apic write path.
> >> The nmi watchdog can be sourced from either the local apics or the
> >> io_apics. To disable the nmi_watchdog we need at least potentially
> >> to write io_apic.
> >
> > I am curious where you see the nmi watchdog being sourced from the ioapic?
> > I thought I removed that code 3 or 4 releases ago.
>
> In my memory, and in references to the code in comments in various apic
> related code. I couldn't figure out what the current code was doing and
> assumed the implementation was equivalent. It does look like you
> removed the code that used the io_apic. I still haven't figured
> out just how the new implementation works yet.

It is just a client of perf now. Perf controls the local apic accesses.
A lot simpler now.

>
> So maybe in the short term we can safely just stomp the timer that
> triggers the nmi watchdog in the local apic. Over the long term that
> feels like it is just asking for trouble.

I can understand that.

>
> I wonder if the reason that we have an hpet stomp in that code is
> for a similar reason. Did we ever source nmi's from the hpet timer?
>
> >> So it appears to me that the only reasonable and robust thing we can
> >> do is to ignore nmis in the kexec on panic path.
> >>
> >> So it looks to me that the only path forward at this point is to fix
> >> the other bug where an unexpected nmi will kill the kexec on panic boot.
> >>
> >> I just took a look at the code in /sbin/kexec and that code does not in
> >> fact change the idt except when we switch to 16bit mode, which we
> >> definitely do not do in the kexec on panic case. So it appears that we
> >> don't need to coordinate an /sbin/kexec release with a kernel release to
> >> ignore nmis.
> >>
> >> In fact it looks like we only need to fix the interrupt descriptors
> >> loaded in machine_kexec_64.c and head64.c to ignore nmis.
> >>
> >> At which point we will have fixed two bugs and have a much more reliable
> >> kexec on panic implementation.
> >
> > Ok. I'll talk with Vivek about how the can be implemented.
>
> Thanks. It really doesn't look very hard. Just a tiny idt with
> an nmi entry that says iret.

It probably is, except I never hacked on idt code before and my assembly
isn't that good. I have been trying to find examples to copy from to give
it a try. So far I was using early_idt_handlers with early_printk to see
if I could capture some printk messages while jumping from the first
kernel to the second kernel (when the other early_idt_handlers would kick
in for the second kernel).

Tips? Better examples?

Cheers,
Don

2012-02-29 23:16:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/debug] x86/kdump: No need to disable ioapic/ lapic in crash path

Don Zickus <[email protected]> writes:

> On Tue, Feb 21, 2012 at 12:01:07AM -0800, Eric W. Biederman wrote:
>> Don Zickus <[email protected]> writes:
>>
>> 2> On Fri, Feb 17, 2012 at 07:21:52PM -0800, Eric W. Biederman wrote:
>> >> Don Zickus <[email protected]> writes:
>> >>
>> >> > On Fri, Feb 17, 2012 at 04:41:01AM -0800, Eric W. Biederman wrote:
>> >> >>
>> >> >> The fix with a guarantee of no more scope creep is to just disable the
>> >> >> nmi watchdog on the kexec on panic path.
>> >> >>
>> >> >> Don if you have time please figure out is needed to ignore nmi's and
>> >> >> possible record and/or report them while we boot, otherwise please cook
>> >> >> up a patch that just disables the nmi watchdog wherever we are sending
>> >> >> it from (the local apic or the ioapic).
>> >> >
>> >> > Can I keep things even simpler? The original problem was the deadlock
>> >> > with the ioapic lock. We fixed that by removing the call to
>> >> > disable_IO_APIC(). Can we just leave the disable_local_APIC calls in
>> >> > there for now? Is there any real harm?
>> >>
>> >> > All this rewrite is going to take time which will delay fixing a current
>> >> > problem with kexec on panic, the ioapic deadlock.
>> >>
>> >> Hmm.
>> >>
>> >> My apologies I just realized that we can not disable the nmi watchdog
>> >> safely in all cases. To avoid the deadlock we fundamentally can not
>> >> write to the io_apic, because the locks are the io_apic write path.
>> >> The nmi watchdog can be sourced from either the local apics or the
>> >> io_apics. To disable the nmi_watchdog we need at least potentially
>> >> to write io_apic.
>> >
>> > I am curious where you see the nmi watchdog being sourced from the ioapic?
>> > I thought I removed that code 3 or 4 releases ago.
>>
>> In my memory, and in references to the code in comments in various apic
>> related code. I couldn't figure out what the current code was doing and
>> assumed the implementation was equivalent. It does look like you
>> removed the code that used the io_apic. I still haven't figured
>> out just how the new implementation works yet.
>
> It is just a client of perf now. Perf controls the local apic accesses.
> A lot simpler now.
>
>>
>> So maybe in the short term we can safely just stomp the timer that
>> triggers the nmi watchdog in the local apic. Over the long term that
>> feels like it is just asking for trouble.
>
> I can understand that.
>
>>
>> I wonder if the reason that we have an hpet stomp in that code is
>> for a similar reason. Did we ever source nmi's from the hpet timer?
>>
>> >> So it appears to me that the only reasonable and robust thing we can
>> >> do is to ignore nmis in the kexec on panic path.
>> >>
>> >> So it looks to me that the only path forward at this point is to fix
>> >> the other bug where an unexpected nmi will kill the kexec on panic boot.
>> >>
>> >> I just took a look at the code in /sbin/kexec and that code does not in
>> >> fact change the idt except when we switch to 16bit mode, which we
>> >> definitely do not do in the kexec on panic case. So it appears that we
>> >> don't need to coordinate an /sbin/kexec release with a kernel release to
>> >> ignore nmis.
>> >>
>> >> In fact it looks like we only need to fix the interrupt descriptors
>> >> loaded in machine_kexec_64.c and head64.c to ignore nmis.
>> >>
>> >> At which point we will have fixed two bugs and have a much more reliable
>> >> kexec on panic implementation.
>> >
>> > Ok. I'll talk with Vivek about how the can be implemented.
>>
>> Thanks. It really doesn't look very hard. Just a tiny idt with
>> an nmi entry that says iret.
>
> It probably is, except I never hacked on idt code before and my assembly
> isn't that good. I have been trying to find examples to copy from to give
> it a try. So far I was using early_idt_handlers with early_printk to see
> if I could capture some printk messages while jumping from the first
> kernel to the second kernel (when the other early_idt_handlers would kick
> in for the second kernel).
>
> Tips? Better examples?

That is a particularly good example. When I took a quick look earlier
that is the first place we reload the idt in the kernel boot so that is
one of two places that needs to be modified.

The other place in the reboot code page (which will be a little tricky
because of it's relocatability) needs to have code that roughly says:

kexec_nmi_handler:
; Does this vector get any other kind of parameters and saved
; registers that we need to restore?
; Different types of exceptions handlers have different rules
; that I don't remember off the top of my head. call gates, trap
; gates, etc.
iret

kexec_idt:
.long 0, 0 ; IDT entry 0
.long 0, 0 ; IDT entry 1
.long something, kexec_nmi_handler ; IDT entry 2 NMI's

You can write to the kexec_idt from C code and for the nmi handler you
just need to make certain all is good.

The code has to be tested and the proper Intel CPU manual should
probably be consulted to make certain you dot your eyes and cross your
t's but we really don't need much code to change on this path.

Eric

p.s. Apologies for the delay I overlooked this message earlier.