2005-10-04 15:03:55

by Eric W. Biederman

[permalink] [raw]
Subject: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64


The per-cpu NMI watchdog timer is based on an event counter.
Idle CPUs don't generate events, so the NMI watchdog doesn't fire
and the test to see if the watchdog is working fails.

- Add nmi_cpu_busy so idle cpus don't mess up the test.
- kmalloc prev_nmi_count to keep kernel stack usage bounded.
- Improve the error message on failure so there is enough
information to debug problems.

Signed-off-by: Eric W. Biederman <[email protected]>


---

arch/i386/kernel/nmi.c | 39 +++++++++++++++++++++++++++++++++++++--
1 files changed, 37 insertions(+), 2 deletions(-)

42b4c9dd7c387ed2eb7a29c9f02c4c8ed75a736f
diff --git a/arch/i386/kernel/nmi.c b/arch/i386/kernel/nmi.c
--- a/arch/i386/kernel/nmi.c
+++ b/arch/i386/kernel/nmi.c
@@ -100,16 +100,44 @@ int nmi_active;
(P4_CCCR_OVF_PMI0|P4_CCCR_THRESHOLD(15)|P4_CCCR_COMPLEMENT| \
P4_CCCR_COMPARE|P4_CCCR_REQUIRED|P4_CCCR_ESCR_SELECT(4)|P4_CCCR_ENABLE)

+#ifdef CONFIG_SMP
+/* The performance counters used by NMI_LOCAL_APIC don't trigger when
+ * the CPU is idle. To make sure the NMI watchdog really ticks on all
+ * CPUs during the test make them busy.
+ */
+static __init void nmi_cpu_busy(void *data)
+{
+	volatile int *endflag = data;
+	local_irq_enable();
+	/* Intentionally don't use cpu_relax here. This is
+	   to make sure that the performance counter really ticks,
+	   even if there is a simulator or similar that catches the
+	   pause instruction. On a real HT machine this is fine because
+	   all other CPUs are busy with "useless" delay loops and don't
+	   care if they get somewhat less cycles. */
+	while (*endflag == 0)
+		barrier();
+}
+#endif
+
 static int __init check_nmi_watchdog(void)
 {
-	unsigned int prev_nmi_count[NR_CPUS];
+	volatile int endflag = 0;
+	unsigned int *prev_nmi_count;
 	int cpu;
 
 	if (nmi_watchdog == NMI_NONE)
 		return 0;
 
+	prev_nmi_count = kmalloc(NR_CPUS * sizeof(int), GFP_KERNEL);
+	if (!prev_nmi_count)
+		return -1;
+
 	printk(KERN_INFO "Testing NMI watchdog ... ");
 
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
+
 	for (cpu = 0; cpu < NR_CPUS; cpu++)
 		prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count;
 	local_irq_enable();
@@ -123,12 +151,18 @@ static int __init check_nmi_watchdog(voi
 			continue;
 #endif
 		if (nmi_count(cpu) - prev_nmi_count[cpu] <= 5) {
-			printk("CPU#%d: NMI appears to be stuck!\n", cpu);
+			endflag = 1;
+			printk("CPU#%d: NMI appears to be stuck (%d->%d)!\n",
+				cpu,
+				prev_nmi_count[cpu],
+				nmi_count(cpu));
 			nmi_active = 0;
 			lapic_nmi_owner &= ~LAPIC_NMI_WATCHDOG;
+			kfree(prev_nmi_count);
 			return -1;
 		}
 	}
+	endflag = 1;
 	printk("OK.\n");
 
 	/* now that we know it works we can reduce NMI frequency to
@@ -136,6 +170,7 @@ static int __init check_nmi_watchdog(voi
 	if (nmi_watchdog == NMI_LOCAL_APIC)
 		nmi_hz = 1;
 
+	kfree(prev_nmi_count);
 	return 0;
 }
 /* This needs to happen later in boot so counters are working */


2005-10-05 19:50:12

by Bill Davidsen

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

Eric W. Biederman wrote:
> The per cpu nmi watchdog timer is based on an event counter.
> idle cpus don't generate events so the NMI watchdog doesn't fire
> and the test to see if the watchdog is working fails.

Question: given all the concern about reducing clocks and all, do we
actually need nmi on more than one CPU? Are there cases where a single
CPU hangs in idle on an SMP system?

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-10-05 20:02:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

Bill Davidsen <[email protected]> writes:

> Eric W. Biederman wrote:
>> The per cpu nmi watchdog timer is based on an event counter. idle cpus don't
>> generate events so the NMI watchdog doesn't fire
>> and the test to see if the watchdog is working fails.
>
> Question: given all the concern about reducing clocks and all, do we actually
> need nmi on more than one CPU? Are there cases where a single CPU hangs in idle
> on an SMP system?

I really don't know, but the normal interrupt rate is once per second or slower
if the cpu is idle. I was just working in the vicinity and discovered that when
I enabled the NMI watchdog it failed to come on because it didn't handle its
initialization test properly; x86_64 had already fixed this.

Eric

2005-10-12 01:26:39

by Andrew Morton

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

[email protected] (Eric W. Biederman) wrote:
>
>
> The per cpu nmi watchdog timer is based on an event counter.
> idle cpus don't generate events so the NMI watchdog doesn't fire
> and the test to see if the watchdog is working fails.
>
> - Add nmi_cpu_busy so idle cpus don't mess up the test.
> - kmalloc prev_nmi_count to keep kernel stack usage bounded.
> - Improve the error message on failure so there is enough
> information to debug problems.
>
> ...
>
> static int __init check_nmi_watchdog(void)
> {
> - unsigned int prev_nmi_count[NR_CPUS];
> + volatile int endflag = 0;

I don't think this needs to be declared volatile?

2005-10-14 04:13:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

Andrew Morton <[email protected]> writes:

> [email protected] (Eric W. Biederman) wrote:
>>
>>
>> The per cpu nmi watchdog timer is based on an event counter.
>> idle cpus don't generate events so the NMI watchdog doesn't fire
>> and the test to see if the watchdog is working fails.
>>
>> - Add nmi_cpu_busy so idle cpus don't mess up the test.
>> - kmalloc prev_nmi_count to keep kernel stack usage bounded.
>> - Improve the error message on failure so there is enough
>> information to debug problems.
>>
>> ...
>>
>> static int __init check_nmi_watchdog(void)
>> {
>> - unsigned int prev_nmi_count[NR_CPUS];
>> + volatile int endflag = 0;
>
> I don't think this needs to be declared volatile?


Sorry for not replying sooner I just got back from a trip.

I haven't thought it through extremely closely, but I believe
the stores into that variable in check_nmi_watchdog could
legitimately be optimized away by the compiler if it doesn't
have a hint. As the variable is auto and is never used
after the store, without volatile it seems a reasonable
assumption that no one else will see the value.

If the variable were static the volatile would clearly be unnecessary,
as we have taken the address earlier, so at some point the compiler
would be obligated to perform the store. But with the variable being
auto the rules are a little murky.

Eric

2005-10-14 18:53:12

by Andy Isaacson

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

On Thu, Oct 13, 2005 at 10:13:12PM -0600, Eric W. Biederman wrote:
> Andrew Morton <[email protected]> writes:
> > [email protected] (Eric W. Biederman) wrote:
> >> static int __init check_nmi_watchdog(void)
> >> {
> >> + volatile int endflag = 0;
> >
> > I don't think this needs to be declared volatile?
>
> I haven't thought it through extremely closely, but I believe
> the stores into that variable in check_nmi_watchdog could
> legitimately be optimized away by the compiler if it doesn't
> have a hint. As the variable is auto and is never used
> after the store, without volatile it seems a reasonable
> assumption that no one else will see the value.
>
> If the variable were static the volatile would clearly be unnecessary,
> as we have taken the address earlier, so at some point the compiler
> would be obligated to perform the store. But with the variable being
> auto the rules are a little murky.

I don't think it's murky at all; since you took the address and passed
it to another function, the compiler has to assume that you saved the
pointer away and will be referring to it later. (Unless the compiler
can prove that you *won't* be referring to it later, such as if there
are no more function calls between the store and the variable going out
of scope, or if the compiler does whole-program optimization and can see
the entire data flow of that pointer and prove that it's not
dereferenced.)

Proving this using the standard is a bit of a challenge...

6.2.4:
4 An object whose identifier is declared with no linkage and without the
storage-class specifier static has automatic storage duration.
5 For such an object that does not have a variable length array type,
its lifetime extends from entry into the block with which it is
associated until execution of that block ends in any way. (Entering an
enclosed block or calling a function suspends, but does not end,
execution of the current block.)

6.5.3.2:
3 The unary & operator returns the address of its operand. If the
operand has type ``type'', the result has type ``pointer to type''.
4 The unary * operator denotes indirection. If the operand ... points to
an object, the result is an lvalue designating the object. If the
operand has type ``pointer to type'', the result has type ``type''.

So the object 'endflag' has lifetime extending until the end of the
function; you are permitted to take its address and pass that address to
a function (which stores it somewhere with static or allocated
duration); a later function call is then permitted to dereference the
stored pointer so long as 'endflag' has not yet passed out of scope.

Now, I think the following could hypothetically be a problem:

static int *bp;
static void *baz(void *p);

int foo(void)
{
	pthread_t tid;
	volatile int flag = 0;
	int blah = 0, count = 0;

	bp = &blah;
	pthread_create(&tid, NULL, baz, (void *)&flag);
	blah++;
	while (flag == 0)
		count++;
	return count;
}

static void *baz(void *p)
{
	volatile int *flag = p;
	int count = 1;

	while (*bp == 0)
		count++;
	*flag = count;
	return NULL;
}

An optimizing compiler could look at foo() and decide that since blah is
not volatile, and there are no function calls after it is incremented,
the increment can be discarded. But alas, baz() does check the value on
another thread.

I believe (but have not verified) that GCC simply inhibits
dead-store-elimination when the address of the variable has been taken,
so this theoretical possibility is not a real danger under gcc. And in
any case, it doesn't apply to your check_nmi_watchdog, because you've
got function calls after the assignment.

-andy

2005-10-15 12:50:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

Andy Isaacson <[email protected]> writes:

> On Thu, Oct 13, 2005 at 10:13:12PM -0600, Eric W. Biederman wrote:
>> Andrew Morton <[email protected]> writes:
>> > [email protected] (Eric W. Biederman) wrote:
>> >> static int __init check_nmi_watchdog(void)
>> >> {
>> >> + volatile int endflag = 0;
>> >
>> > I don't think this needs to be declared volatile?
>>
>> If the variable were static the volatile would clearly be unnecessary,
>> as we have taken the address earlier, so at some point the compiler
>> would be obligated to perform the store. But with the variable being
>> auto the rules are a little murky.
>
> I don't think it's murky at all; since you took the address and passed
> it to another function, the compiler has to assume that you saved the
> pointer away and will be referring to it later. (Unless the compiler
> can prove that you *won't* be referring to it later, such as if there
> are no more function calls between the store and the variable going out
> of scope, or if the compiler does whole-program optimization and can see
> the entire data flow of that pointer and prove that it's not
> dereferenced.)
>
> Now, I think the following could hypothetically be a problem:
>
> An optimizing compiler could look at foo() and decide that since blah is
> not volatile, and there are no function calls after it is incremented,
> the increment can be discarded. But alas, baz() does check the value on
> another thread.
>
> I believe (but have not verified) that GCC simply inhibits
> dead-store-elimination when the address of the variable has been taken,
> so this theoretical possibility is not a real danger under gcc. And in
> any case, it doesn't apply to your check_nmi_watchdog, because you've
> got function calls after the assignment.

It comes very close to applying to check_nmi_watchdog, as both
of the functions called after endflag is set, printk and kfree,
have nothing to do with the setting of endflag. If they
were simpler functions, an optimizing compiler could look at
them and realize that. And since those two functions have nothing
to do with endflag, someone looking at the code could remove or
reorder them with a trivial patch and the code could stop working.

GCC keeps getting more powerful optimizations, so I am not comfortable
with depending on it simply inhibiting dead-store elimination when the
address of a variable has been taken.

If there is a better idiom to synchronize the cpus which does
not mark the variable volatile I could see switching to that.

There is a theoretical race there, as some cpu might not see endflag
get set and the next function on the stack could set that same
stack slot to 0. I can see value in fixing that, if there is a simple
and clear solution. Currently I am having a failure of imagination.

Eric

2005-10-18 07:05:15

by Andy Isaacson

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

On Sat, Oct 15, 2005 at 06:49:56AM -0600, Eric W. Biederman wrote:
> Andy Isaacson <[email protected]> writes:
> > I believe (but have not verified) that GCC simply inhibits
> > dead-store-elimination when the address of the variable has been taken,
> > so this theoretical possibility is not a real danger under gcc. And in
> > any case, it doesn't apply to your check_nmi_watchdog, because you've
> > got function calls after the assignment.
>
> It comes very close to applying to check_nmi_watchdog as both
[snip]

You prodded me into looking in a bit more depth at the code around this,
and I'm actually a bit concerned that volatile may not be enough of a
guarantee that other CPUs will see the correct value. I grant that this
is a mostly theoretical concern, but let's take a look at the code:

+static __init void nmi_cpu_busy(void *data)
+{
+	volatile int *endflag = data;
+	local_irq_enable();
+	while (*endflag == 0)
+		barrier();
+}
 static int __init check_nmi_watchdog(void)
 {
+	volatile int endflag = 0;
...
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
...
+	endflag = 1;
 	printk("OK.\n");
 	if (nmi_watchdog == NMI_LOCAL_APIC)
 		nmi_hz = 1;
+	kfree(prev_nmi_count);
 	return 0;
 }

So CPU#0 does a smp_call_function, does some work, then sets endflag,
does a printk, sets a static variable, calls kfree, then leaves the
stack frame.

Meanwhile, CPU#1 - CPU#N are polling waiting for endflag to make a
transition from 0->1.

Nothing that CPU#0 does after setting endflag is a guarantee that its
store to endflag will be seen by other CPUs. In particular, if the
caller immediately zeros that stack location (not an unlikely
happenstance), then it's possible that the two stores to endflag might
be coalesced by a write buffer in CPU#0's bus interface.

Making endflag volatile does not affect the write buffer; it simply
forces the compiler to generate the "store to memory" instruction at the
appropriate place in the code flow. To affect the write buffer, the
virtual address would have to be flagged appropriately (on x86, using an
MTRR; on MIPS, using a flag on the TLB entry; similar options on other
platforms) or an appropriate store instruction such as "non-temporal
store" would have to be used. Or, a memory barrier.

In particular, a NUMA x86 system is rather likely to be unfair to remote
nodes in this case (where the other local CPUs are polling too).

> If there is a better idiom to synchronize the cpus which does
> not mark the variable volatile I could see switching to that.
>
> There is a theoretical race there as some cpu might not see endflag
> get set and the next function on the stack could set it that same
> stack slot to 0. I can see value in fixing that, if there is a simple
> and clear solution. Currently I am having a failure of imagination.

Exactly (he says; after writing the same thing above I re-read the rest
of your post).

I would imagine that some kind of write memory barrier is appropriate,
but I'm not up to date on what's in fashion in the kernel code these
days.

-andy

2005-10-18 12:44:47

by Eric W. Biederman

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

Andy Isaacson <[email protected]> writes:

> On Sat, Oct 15, 2005 at 06:49:56AM -0600, Eric W. Biederman wrote:
>> Andy Isaacson <[email protected]> writes:
>> > I believe (but have not verified) that GCC simply inhibits
>> > dead-store-elimination when the address of the variable has been taken,
>> > so this theoretical possibility is not a real danger under gcc. And in
>> > any case, it doesn't apply to your check_nmi_watchdog, because you've
>> > got function calls after the assignment.
>>
>> It comes very close to applying to check_nmi_watchdog as both
> [snip]
>
> You prodded me into looking in a bit more depth at the code around this,
> and I'm actually a bit concerned that volatile may not be enough of a
> guarantee that other CPUs will see the correct value. I grant that this
> is a mostly theoretical concern, but let's take a look at the code:
>
> +static __init void nmi_cpu_busy(void *data)
> +{
> +	volatile int *endflag = data;
> +	local_irq_enable();
> +	while (*endflag == 0)
> +		barrier();
> +}
>  static int __init check_nmi_watchdog(void)
>  {
> +	volatile int endflag = 0;
> ...
> +	if (nmi_watchdog == NMI_LOCAL_APIC)
> +		smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
> ...
> +	endflag = 1;
>  	printk("OK.\n");
>  	if (nmi_watchdog == NMI_LOCAL_APIC)
>  		nmi_hz = 1;
> +	kfree(prev_nmi_count);
>  	return 0;
>  }
>
> So CPU#0 does a smp_call_function, does some work, then sets endflag,
> does a printk, sets a static variable, calls kfree, then leaves the
> stack frame.
>
> Meanwhile, CPU#1 - CPU#N are polling waiting for endflag to make a
> transition from 0->1.
>
> Nothing that CPU#0 does after setting endflag is a guarantee that its
> store to endflag will be seen by other CPUs. In particular, if the
> caller immediately zeros that stack location (not an unlikely
> happenstance), then it's possible that the two stores to endflag might
> be coalesced by a write buffer in CPU#0's bus interface.

Yes, but if they are at all separated in time things are better. And
the constant reads of endflag are going to ping-pong the cache
line between all of the cpus, which will tend to get the store pushed
out as soon as possible. No other magic is necessary on x86.

> In particular, a NUMA x86 system is rather likely to be unfair to remote
> nodes in this case (where the other local CPUs are polling too).

In the NUMA systems I am familiar with, the system will slow to a crawl
rather than keep performing while being unfair.

>> If there is a better idiom to synchronize the cpus which does
>> not mark the variable volatile I could see switching to that.
>>
>> There is a theoretical race there, as some cpu might not see endflag
>> get set and the next function on the stack could set that same
>> stack slot to 0. I can see value in fixing that, if there is a simple
>> and clear solution. Currently I am having a failure of imagination.
>
> Exactly (he says; after writing the same thing above I re-read the rest
> of your post).
>
> I would imagine that some kind of write memory barrier is appropriate,
> but I'm not up to date on what's in fashion in the kernel code these
> days.

Probably a counter, to ensure the code exits. The code prints
a message if it fails, and possibly the boot breaks. So I don't expect
bugs in this code to persist for a long time without someone screaming.

Andrew has forwarded me the first bug report already.
http://bugzilla.kernel.org/show_bug.cgi?id=5462

It looks like we have a bug in the watchdog timers instead of one of
the expected failure modes. After I get some more information I will
be able to see what the practical consequences are.

Eric

2005-10-19 07:04:21

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64

On Tue, 18 Oct 2005, Eric W. Biederman wrote:

> Andy Isaacson <[email protected]> writes:
>
> > +static __init void nmi_cpu_busy(void *data)
> > +{
> > +	volatile int *endflag = data;
> > +	local_irq_enable();
> > +	while (*endflag == 0)
> > +		barrier();
> > +}
> >  static int __init check_nmi_watchdog(void)
> >  {
> > +	volatile int endflag = 0;
> > ...
> > +	if (nmi_watchdog == NMI_LOCAL_APIC)
> > +		smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
> > ...
> > +	endflag = 1;
> >  	printk("OK.\n");
> >  	if (nmi_watchdog == NMI_LOCAL_APIC)
> >  		nmi_hz = 1;
> > +	kfree(prev_nmi_count);
> >  	return 0;
> >  }
>
> Probably a counter, to ensure the code exits. The code prints

Why not just use the 'wait' argument to smp_call_function to at least
ensure that the lifetime of endflag isn't exceeded? Something like:

static __init void nmi_cpu_busy(void *data)
{
	volatile int *endflag = data;
	local_irq_enable();
	while (*endflag == 0) {
		cpu_relax();
		rmb();
	}
}

static int __init check_nmi_watchdog(void)
{
	volatile int endflag = 0;
	...
	if (nmi_watchdog == NMI_LOCAL_APIC)
		smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 1);
	...
	endflag = 1;
	wmb();

Thanks,
Zwane