2008-10-07 15:45:45

by Stephane Eranian

[permalink] [raw]
Subject: NMI watchdog setup_lapic_nmi_watchdog() problem

Hello,

I was doing some more testing with perfmon when I ran into
a problem with the NMI watchdog code in 2.6.27-rc8.

Since 2.6.20, it is possible to enable/disable the NMI watchdog
on-the-fly via /proc/sys/kernel/nmi_watchdog. This is a nice option
which avoids having to reboot the kernel.

Enabling/disabling the NMI watchdog uses two internal functions
enable_lapic_nmi_watchdog() and disable_lapic_nmi_watchdog().

Enable_lapic_nmi_watchdog() uses a IPI handler to setup the
APIC on each CPU. However, it turns out that this handler, namely,
setup_apic_nmi_watchdog() relies on some explicit ordering constraint
due to suspend/resume constraints as explained in the comment
below:

void setup_apic_nmi_watchdog(void *unused)
{
if (__get_cpu_var(wd_enabled))
return;

/* cheap hack to support suspend/resume */
/* if cpu0 is not active neither should the other cpus */
if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0)
return;

switch (nmi_watchdog) {
[snip]
}

Supposing watchdog was disabled via /proc, nmi_active = 0. Then if you
re-enable, and if CPU0 is not the first to execute the IPI handler, then none
of the other CPUS will re-enable their NMI watchdog timer. On a quad-core
system, I have seen, for instance, 2 out of 4 with NMI watchdogs re-enabled.

I am not an expert at suspend/resume. I am assuming there was a race condition
there and that's why this code was added early on. The problem is that it now
conflicts with the /proc option.

It is not clear to me how this works during boot. Obviously the order
is respected
and all CPUs have their NMI watchdog enabled.

Until I understand the suspend/resume issue, it is hard to provide a
fix for this.

Any comments?


2008-10-07 20:22:34

by Andrew Morton

[permalink] [raw]
Subject: Re: NMI watchdog setup_lapic_nmi_watchdog() problem

On Tue, 7 Oct 2008 17:45:32 +0200
"stephane eranian" <[email protected]> wrote:

> Hello,
>
> I was doing some more testing with perfmon when I ran into
> a problem with the NMI watchdog code in 2.6.27-rc8.
>
> Since 2.6.20, it is possible to enable/disable the NMI watchdog
> on-the-fly via /proc/sys/kernel/nmi_watchdog. This is a nice option
> which avoids having to reboot the kernel.
>
> Enabling/disabling the NMI watchdog uses two internal functions
> enable_lapic_nmi_watchdog() and disable_lapic_nmi_watchdog().
>
> Enable_lapic_nmi_watchdog() uses a IPI handler to setup the
> APIC on each CPU. However, it turns out that this handler, namely,
> setup_apic_nmi_watchdog() relies on some explicit ordering constraint
> due to suspend/resume constraints as explained in the comment
> below:
>
> void setup_apic_nmi_watchdog(void *unused)
> {
> if (__get_cpu_var(wd_enabled))
> return;
>
> /* cheap hack to support suspend/resume */
> /* if cpu0 is not active neither should the other cpus */
> if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0)
> return;
>
> switch (nmi_watchdog) {
> [snip]
> }
>
> Supposing watchdog was disabled via /proc, nmi_active = 0. Then if you
> re-enable, and if CPU0 is not the first to execute the IPI handler, then none
> of the other CPUS will re-enable their NMI watchdog timer. On a quad-core
> system, I have seen, for instance, 2 out of 4 with NMI watchdogs re-enabled.
>
> I am not an expert at suspend/resume. I am assuming there was a race condition
> there and that's why this code was added early on. The problem is that it now
> conflicts with the /proc option.
>
> It is not clear to me how this works during boot. Obviously the order
> is respected
> and all CPUs have their NMI watchdog enabled.
>
> Until I understand the suspend/resume issue, it is hard to provide a
> fix for this.
>
> Any comments?

The "cheap hack" was added in September 2006.

2.6.20 was released in Feb 2007.

So presumably this problem has always been there, since
/proc/sys/kernel/nmi_watchdog was first added, only nobody has hit it
before.

Have you only recently started to use /proc/sys/kernel/nmi_watchdog, or
did it work OK on any earlier kernel?

2008-10-07 21:11:04

by Stephane Eranian

[permalink] [raw]
Subject: Re: NMI watchdog setup_lapic_nmi_watchdog() problem

Andrew,

On Tue, Oct 7, 2008 at 10:21 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 7 Oct 2008 17:45:32 +0200
> "stephane eranian" <[email protected]> wrote:
>
>> Hello,
>>
>> I was doing some more testing with perfmon when I ran into
>> a problem with the NMI watchdog code in 2.6.27-rc8.
>>
>> Since 2.6.20, it is possible to enable/disable the NMI watchdog
>> on-the-fly via /proc/sys/kernel/nmi_watchdog. This is a nice option
>> which avoids having to reboot the kernel.
>>
>> Enabling/disabling the NMI watchdog uses two internal functions
>> enable_lapic_nmi_watchdog() and disable_lapic_nmi_watchdog().
>>
>> Enable_lapic_nmi_watchdog() uses a IPI handler to setup the
>> APIC on each CPU. However, it turns out that this handler, namely,
>> setup_apic_nmi_watchdog() relies on some explicit ordering constraint
>> due to suspend/resume constraints as explained in the comment
>> below:
>>
>> void setup_apic_nmi_watchdog(void *unused)
>> {
>> if (__get_cpu_var(wd_enabled))
>> return;
>>
>> /* cheap hack to support suspend/resume */
>> /* if cpu0 is not active neither should the other cpus */
>> if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0)
>> return;
>>
>> switch (nmi_watchdog) {
>> [snip]
>> }
>>
>> Supposing watchdog was disabled via /proc, nmi_active = 0. Then if you
>> re-enable, and if CPU0 is not the first to execute the IPI handler, then none
>> of the other CPUS will re-enable their NMI watchdog timer. On a quad-core
>> system, I have seen, for instance, 2 out of 4 with NMI watchdogs re-enabled.
>>
>> I am not an expert at suspend/resume. I am assuming there was a race condition
>> there and that's why this code was added early on. The problem is that it now
>> conflicts with the /proc option.
>>
>> It is not clear to me how this works during boot. Obviously the order
>> is respected
>> and all CPUs have their NMI watchdog enabled.
>>
>> Until I understand the suspend/resume issue, it is hard to provide a
>> fix for this.
>>
>> Any comments?
>
> The "cheap hack" was added in September 2006.
>
> 2.6.20 was released in Feb 2007.
>
> So presumably this problem has always been there, since
> /proc/sys/kernel/nmi_watchdog was first added, only nobody has hit it
> before.
>
Yes, I believe that is true. The bug was there as soon as the /proc interface
was introduced.

The bug is not visible unless you instrument that setup_lapic_nmi_watchdog()
routine. I did that because I was tracking the value of nmi_active
within perfmon.
On a quad-core, nmi_active is equal to 4, so when I saw 2, I started
investigating.

> Have you only recently started to use /proc/sys/kernel/nmi_watchdog, or
> did it work OK on any earlier kernel?
>
I started playing with that because wanted to see whether NMI was releasing
the PMU MSR back to the free pool which it did. However, I had not
paid attention
to whether or not NMI was re-activated on all CPUs.

If you remove the 'cheap hack', disabling/enabling works. That's why
I'd like to better
understand was is going on with suspend/resume.

2008-10-07 23:39:33

by Andi Kleen

[permalink] [raw]
Subject: Re: NMI watchdog setup_lapic_nmi_watchdog() problem

> If you remove the 'cheap hack', disabling/enabling works. That's why
> I'd like to better
> understand was is going on with suspend/resume.

The hack was done because there wasn't an available CPU global enable flag
for the nmi watchdog. So resume would possible reenable it unnecessarily
If there was one it could just check that.

-Andi

--
[email protected]

2008-10-07 23:57:59

by Stephane Eranian

[permalink] [raw]
Subject: Re: NMI watchdog setup_lapic_nmi_watchdog() problem

Andi,

On Wed, Oct 8, 2008 at 1:45 AM, Andi Kleen <[email protected]> wrote:
>> If you remove the 'cheap hack', disabling/enabling works. That's why
>> I'd like to better
>> understand was is going on with suspend/resume.
>
> The hack was done because there wasn't an available CPU global enable flag
> for the nmi watchdog. So resume would possible reenable it unnecessarily
> If there was one it could just check that.
>
I am not sure what you mean by 'CPU global enable flag'.

Are you talking about a per-cpu variable or a truly global variable?

Isn't that what nmi_active is today?

As for per-cpu, there is wd_enabled.