2004-09-28 16:33:33

by Joerg Sommrey

Subject: nmi watchdog failure on dual Athlon box

Hello,

just tried Ingo's "lockupcli" nmi watchdog test - it fails to unlock the
box.

boot-parm:
...nmi_watchdog=2...

dmesg:
...
testing NMI watchdog ... OK.
...

/proc/interrupts:
...
NMI: 115 103
...

So far everything looks fine. But after running Ingo's "lockupcli" the
box is locked (surprise!) but there is no nmi watchdog killing anything.
The system gets rebooted from the w83627hf WDT after 60 s.

System:
Tyan Tiger MPX (S2466)
2 x Athlon MP 2000+
kernel 2.6.8.1

nmi_watchdog=1 has never worked for me (except 2.6.3-mm4).

I'm not really surprised at this test as I had a couple of lockups in
the past that were never resolved by the nmi watchdog.

Any ideas?

-jo

--
-rw-r--r-- 1 jo users 63 2004-09-28 17:44 /home/jo/.signature


2004-09-28 17:08:42

by Maciej W. Rozycki

Subject: Re: nmi watchdog failure on dual Athlon box

On Tue, 28 Sep 2004, Joerg Sommrey wrote:

> just tried Ingo's "lockupcli" nmi watchdog test - it fails to unlock the
> box.
>
> boot-parm:
> ...nmi_watchdog=2...

The local APIC NMI watchdog has limited capabilities. It may fail to
trigger for certain lockups because there is no available event that would
happen periodically regardless of the CPU state. I can only suspect what
"lockupcli" does (where is it available from, anyway?), but if it runs
"cli; hlt", then the watchdog *will* fail.

> nmi_watchdog=1 has never worked for me (except 2.6.3-mm4).

Too bad. The I/O APIC watchdog triggers regardless of the CPU state and
works as long as the chipset is operational.

Maciej

2004-09-28 18:31:14

by Joerg Sommrey

Subject: Re: nmi watchdog failure on dual Athlon box

On Tue, Sep 28, 2004 at 06:08:37PM +0100, Maciej W. Rozycki wrote:
> On Tue, 28 Sep 2004, Joerg Sommrey wrote:
>
> > just tried Ingo's "lockupcli" nmi watchdog test - it fails to unlock the
> > box.
> >
> > boot-parm:
> > ...nmi_watchdog=2...
>
> The local APIC NMI watchdog has limited capabilities. It may fail to
> trigger for certain lockups because there is no available event that would
> happen periodically regardless of the CPU state. I can only suspect what
> "lockupcli" does (where is it available from, anyway?), but if it runs
> "cli; hlt", then the watchdog *will* fail.

Here's the quote from Ingo's mail:
In <[email protected]> Ingo Molnar <[email protected]> writes:
|once the NMI watchdog is up and running it should catch all hard lockups
|and print backtraces to the serial console - even if you are within X
|while the lockup happens. You can test hard lockups by running the
|attached 'lockupcli' userspace code as root - it turns off interrupts
|and goes into an infinite loop => instant lockup. The NMI watchdog
|should notice this condition after a couple of seconds and should abort
|the task, printing a kernel trace as well. Your box should be back in
|working order after that point.

[...]

|--- lockupcli.c
|
|main ()
|{
| iopl(3);
| for (;;) asm("cli");
|}

Does this mean there is a good reason for further investigations on why
the IO-APIC NMI watchdog doesn't work? Until now I thought it would
be ok as long as the local APIC NMI watchdog is set up.

-jo

--
-rw-r--r-- 1 jo users 63 2004-09-28 18:42 /home/jo/.signature

2004-09-28 20:20:29

by Chris Wedgwood

Subject: Re: nmi watchdog failure on dual Athlon box

On Tue, Sep 28, 2004 at 06:33:24PM +0200, Joerg Sommrey wrote:

> nmi_watchdog=1 has never worked for me (except 2.6.3-mm4).

tyan 2466? if so then i've seen this too, i think it's a mainboard
problem


--cw

2004-09-28 21:08:32

by Maciej W. Rozycki

Subject: Re: nmi watchdog failure on dual Athlon box

On Tue, 28 Sep 2004, Joerg Sommrey wrote:

> |--- lockupcli.c
> |
> |main ()
> |{
> | iopl(3);
> | for (;;) asm("cli");
> |}
>
> Does this mean there is a good reason for further investigations on why
> the IO-APIC NMI watchdog doesn't work? Until now I thought it would
> be ok as long as the local APIC NMI watchdog is set up.

Since this program does busy looping, the local APIC NMI watchdog should
trigger indeed. It's "cli; hlt" that causes a problem with this watchdog.
Something wrong is happening in your system, indeed.

Maciej

2004-09-29 20:28:11

by Joerg Sommrey

Subject: Re: nmi watchdog failure on dual Athlon box

On Tue, Sep 28, 2004 at 10:08:21PM +0100, Maciej W. Rozycki wrote:
> On Tue, 28 Sep 2004, Joerg Sommrey wrote:
>
> > |--- lockupcli.c
> > |
> > |main ()
> > |{
> > | iopl(3);
> > | for (;;) asm("cli");
> > |}
> >
> > Does this mean there is a good reason for further investigations on why
> > the IO-APIC NMI watchdog doesn't work? Until now I thought it would
> > be ok as long as the local APIC NMI watchdog is set up.
>
> Since this program does busy looping, the local APIC NMI watchdog should
> trigger indeed. It's "cli; hlt" that causes a problem with this watchdog.
> Something wrong is happening in your system, indeed.

As I stated earlier, there *seemed* to be a working IO-APIC NMI watchdog
with 2.6.3-mm4. I never checked its functionality. Now I rebuilt that
kernel and gave it a try. Though it claims to have a running IO-APIC NMI
watchdog, the lockupcli test failed. Zwane was right when he suspected that
the nmi_watchdog=1 test was working erratically in that case. Sad but true: no
NMI watchdog on the Tyan S2466. I wonder if it's just impossible on such a
board or if it needs some "special treatment".

-jo

--
-rw-r--r-- 1 jo users 63 2004-09-29 22:10 /home/jo/.signature