2004-10-17 23:07:29

by Randy.Dunlap

Subject: NMI watchdog detected lockup


I'm seeing this often during a kernel build on AIC79xx.
I did one kernel build on SATA without seeing this.
This is on a dual-Opteron IBM Workstation A with
2 GB RAM, SATA, & SCSI.

Does this show anything? (not to me)
Maybe the buggy code is on CPU1 and its registers aren't
captured.


NMI Watchdog detected LOCKUP on CPU0, registers:
CPU 0
Modules linked in: aic79xx usbserial aic7xxx ohci1394 ieee1394
Pid: 0, comm: swapper Not tainted 2.6.9-rc4-bk3
RIP: 0010:[<ffffffff8010f5f0>] <ffffffff8010f5f0>{default_idle+32}
RSP: 0018:ffffffff805e3fb8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018
RDX: ffffffff8010f5d0 RSI: ffffffff80472b80 RDI: 0000010001e11b20
RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000001
R10: 0000000000000000 R11: ffffffff80562b60 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000002a9670fd40(0000) GS:ffffffff805de880(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002a9568a2c0 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff805e2000, task ffffffff80472b80)
Stack: ffffffff8010f9fd 0000000000000000 ffffffff805e56e5 0000000000000000
ffffffff8055fc60 0000000000000800 ffffffff805e51e0 0000000000000404
0000000000000000
Call Trace:<ffffffff8010f9fd>{cpu_idle+29}
<ffffffff805e56e5>{start_kernel+421}
<ffffffff805e51e0>{_sinittext+480}

Code: c3 fb f3 c3 66 66 66 90 66 66 66 90 66 66 66 90 48 83 ec 38
console shuts up ...
NMI Watchdog detected LOCKUP on CPU1, registers:


--
~Randy


2004-10-18 00:00:16

by Marc Bevand

Subject: Re: NMI watchdog detected lockup

On 2004-10-17, Randy.Dunlap <[email protected]> wrote:
|
| I'm seeing this often during a kernel build on AIC79xx.
| I did one kernel build on SATA without seeing this.
| This is on a dual-Opteron IBM Workstation A with
| 2 GB RAM, SATA, & SCSI.
| [...]
| NMI Watchdog detected LOCKUP on CPU0, registers:
| [...]

You are not the first one to observe frequent watchdog timeout
lockups on dual Opteron systems during intense I/O operations,
see this thread:

http://thread.gmane.org/gmane.linux.ide/1933

Note: this does *not* seem to be SATA-related.

--
Marc Bevand http://www.epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.

2004-10-18 17:22:19

by Randy.Dunlap

Subject: Re: NMI watchdog detected lockup

Marc Bevand wrote:
> On 2004-10-17, Randy.Dunlap <[email protected]> wrote:
> |
> | I'm seeing this often during a kernel build on AIC79xx.
> | I did one kernel build on SATA without seeing this.
> | This is on a dual-Opteron IBM Workstation A with
> | 2 GB RAM, SATA, & SCSI.
> | [...]
> | NMI Watchdog detected LOCKUP on CPU0, registers:
> | [...]
>
> You are not the first one to observe frequent watchdog timeout
> lockups on dual Opteron systems during intense I/O operations,
> see this thread:
>
> http://thread.gmane.org/gmane.linux.ide/1933
>
> Note: this does *not* seem to be SATA-related.

Hi,

Zwane suspected NMI spikes and advised me to disable nmi_watchdog
(nmi_watchdog=0). After doing that, a kernel build completes
successfully, although with many messages like these:

Uhhuh. NMI received for unknown reason 21.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 31.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 31.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 31.
Uhhuh. NMI received for unknown reason 31.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 21.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 21.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 21.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?


I've also seen reason == 20.
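
To see which reason codes dominate across several hundred of these messages, the log can be tallied with a few lines of script. This is only a rough sketch: the name `tally_nmi_reasons` and the sample text are made up for illustration, and it assumes the reason is always printed as two hex digits, as in the messages above.

```python
import re
from collections import Counter

# Matches the reason code in lines such as
# "Uhhuh. NMI received for unknown reason 21."
# Assumption: the code is always two hex digits.
REASON_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]{2})")


def tally_nmi_reasons(log_text):
    """Count how often each unknown-NMI reason code appears in log_text."""
    return Counter(m.group(1).lower() for m in REASON_RE.finditer(log_text))


# Shortened sample in the shape of the messages quoted in this thread.
SAMPLE_LOG = """\
Uhhuh. NMI received for unknown reason 21.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 31.
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 31.
"""

if __name__ == "__main__":
    for reason, count in sorted(tally_nmi_reasons(SAMPLE_LOG).items()):
        print(reason, count)
```

Feeding it saved `dmesg` output instead of `SAMPLE_LOG` would give the per-reason totals for a whole build.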

This is on 2.6.9-rc4.

Andi, any ideas?

I've had several hundred of these messages, with only 1 dazed &
confused that did not continue OK.

Adding show_registers(regs); in the NMI handler points to
default_idle():

Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
CPU 0
Modules linked in: aic79xx usbserial aic7xxx ohci1394 ieee1394
Pid: 0, comm: swapper Not tainted 2.6.9-rc4
RIP: 0010:[<ffffffff8010f5f0>] <ffffffff8010f5f0>{default_idle+32}
RSP: 0018:ffffffff805e3fb8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018
RDX: ffffffff8010f5d0 RSI: ffffffff80472b00 RDI: 0000010001e11b20
RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000001
R10: 0000000000000080 R11: ffffffff80562ae0 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000002a95b2e4c0(0000) GS:ffffffff805de800(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002a955a6000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff805e2000, task ffffffff80472b00)
Stack: ffffffff8010f9fd 0000000000000000 ffffffff805e56e5 0000000000000000
ffffffff8055fbe0 0000000000000800 ffffffff805e51e0 0000000000000404
0000000000000000
Call Trace:<ffffffff8010f9fd>{cpu_idle+29}
<ffffffff805e56e5>{start_kernel+421}
<ffffffff805e51e0>{_sinittext+480}

Code: c3 fb f3 c3 66 66 66 90 66 66 66 90 66 66 66 90 48 83 ec 38
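
For comparing many such dumps, the interrupted location and the interrupt flag can be pulled out mechanically. The following is a minimal sketch, not a general oops parser: `parse_oops` and its regular expressions are hypothetical helpers written against the exact format shown above, and it assumes the `+32` offset is decimal. Bit 9 of EFLAGS is the x86 interrupt-enable flag (IF).

```python
import re

# Patterns written against the 2.6.9 x86-64 dump format shown in this
# thread; other kernel versions print these lines differently.
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]{4}:\[<([0-9a-f]{16})>\]\s+<[0-9a-f]{16}>\{(\w+)\+(\d+)\}"
)
EFLAGS_RE = re.compile(r"EFLAGS:\s+([0-9a-f]{8})")

EFLAGS_IF = 1 << 9  # x86 interrupt-enable flag


def parse_oops(text):
    """Return RIP address, symbol, offset and IF state, or None if absent."""
    rip = RIP_RE.search(text)
    flags = EFLAGS_RE.search(text)
    if rip is None or flags is None:
        return None
    eflags = int(flags.group(1), 16)
    return {
        "address": int(rip.group(1), 16),
        "symbol": rip.group(2),
        "offset": int(rip.group(3)),  # assumed to be decimal
        "interrupts_enabled": bool(eflags & EFLAGS_IF),
    }


# Two lines copied from the dump above.
SAMPLE_OOPS = """\
RIP: 0010:[<ffffffff8010f5f0>] <ffffffff8010f5f0>{default_idle+32}
RSP: 0018:ffffffff805e3fb8 EFLAGS: 00000246
"""

if __name__ == "__main__":
    print(parse_oops(SAMPLE_OOPS))
```

For the dump above this reports `default_idle` with interrupts enabled, which fits a CPU sitting in the idle loop when the NMI arrived.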

--
~Randy

2004-10-18 18:00:41

by Andi Kleen

Subject: Re: NMI watchdog detected lockup

On Mon, 18 Oct 2004 10:13:11 -0700
"Randy.Dunlap" <[email protected]> wrote:

> Marc Bevand wrote:
> > On 2004-10-17, Randy.Dunlap <[email protected]> wrote:
> > |
> > | I'm seeing this often during a kernel build on AIC79xx.
> > | I did one kernel build on SATA without seeing this.
> > | This is on a dual-Opteron IBM Workstation A with
> > | 2 GB RAM, SATA, & SCSI.
> > | [...]
> > | NMI Watchdog detected LOCKUP on CPU0, registers:
> > | [...]
> >
> > You are not the first one to observe frequent watchdog timeout
> > lockups on dual Opteron systems during intense I/O operations,
> > see this thread:
> >
> > http://thread.gmane.org/gmane.linux.ide/1933
> >
> > Note: this does *not* seem to be SATA-related.
>
> Hi,
>
> Zwane suspected NMI spikes and advised me to disable nmi_watchdog
> (nmi_watchdog=0). After doing that, a kernel build completes
> successfully, although with many messages like these:
>
> Uhhuh. NMI received for unknown reason 21.

Something on your system creates bogus NMI interrupts. What chipset
are you using exactly?

Sometimes chipsets can be programmed to raise NMIs when a PCI bus
error occurs.

21 is the normal state (PIT timer running, but no errors logged)

If you have an AMD 8131 it could be in theory erratum 54, but then
normally one of the error bits in reason should be set.
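
To make the bit reasoning concrete, here is a small sketch that splits a reason byte into its flags. It assumes the conventional PC/AT layout of the port 0x61 status byte, with 0x80 and 0x40 as the two error bits; the bit names and the helper functions are illustrative only, and the chipset datasheet remains the authority for any particular board.

```python
# Conventional PC/AT meaning of the port 0x61 status byte. This layout is
# an assumption for illustration; verify it against the chipset datasheet.
NMI_REASON_BITS = {
    0x80: "memory parity / SERR# error",
    0x40: "I/O check (IOCHK#) error",
    0x20: "timer 2 output high",
    0x10: "refresh toggle",
    0x08: "IOCHK# reporting disabled",
    0x04: "parity/SERR# reporting disabled",
    0x02: "speaker data enabled",
    0x01: "timer 2 gate enabled",
}

ERROR_MASK = 0xC0  # the two bits that would indicate a logged error


def decode_nmi_reason(reason):
    """List the names of the bits set in reason, highest bit first."""
    return [name for mask, name in NMI_REASON_BITS.items() if reason & mask]


def has_error_bits(reason):
    """True if either error bit (0x80 or 0x40) is set."""
    return bool(reason & ERROR_MASK)


if __name__ == "__main__":
    # Reason values reported earlier in this thread.
    for value in (0x20, 0x21, 0x31):
        print(f"{value:02x}", has_error_bits(value), decode_nmi_reason(value))
```

Under that layout, none of the values reported so far (20, 21, 31) has an error bit set. They differ only in the refresh toggle and the timer 2 gate.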

-Andi

2004-10-18 18:10:19

by Randy.Dunlap

Subject: Re: NMI watchdog detected lockup

Andi Kleen wrote:
> On Mon, 18 Oct 2004 10:13:11 -0700
> "Randy.Dunlap" <[email protected]> wrote:
>
>
>>Marc Bevand wrote:
>>
>>>On 2004-10-17, Randy.Dunlap <[email protected]> wrote:
>>>|
>>>| I'm seeing this often during a kernel build on AIC79xx.
>>>| I did one kernel build on SATA without seeing this.
>>>| This is on a dual-Opteron IBM Workstation A with
>>>| 2 GB RAM, SATA, & SCSI.
>>>| [...]
>>>| NMI Watchdog detected LOCKUP on CPU0, registers:
>>>| [...]
>>>
>>>You are not the first one to observe frequent watchdog timeout
>>>lockups on dual Opteron systems during intense I/O operations,
>>>see this thread:
>>>
>>> http://thread.gmane.org/gmane.linux.ide/1933
>>>
>>>Note: this does *not* seem to be SATA-related.
>>
>>Hi,
>>
>>Zwane suspected NMI spikes and advised me to disable nmi_watchdog
>>(nmi_watchdog=0). After doing that, a kernel build completes
>>successfully, although with many messages like these:
>>
>>Uhhuh. NMI received for unknown reason 21.
>
>
> Something on your system creates bogus NMI interrupts. What chipset
> are you using exactly?
>
> Sometimes chipsets can be programmed to raise NMIs when a PCI bus
> error occurs.
>
> 21 is the normal state (PIT timer running, but no errors logged)
>
> If you have an AMD 8131 it could be in theory erratum 54, but then
> normally one of the error bits in reason should be set.

Yes, it's an AMD-8111 / 8131 / 8151 / K8-northbridge machine.

--
~Randy

2004-10-18 18:21:22

by Andi Kleen

Subject: Re: NMI watchdog detected lockup

On Mon, 18 Oct 2004 10:58:08 -0700
"Randy.Dunlap" <[email protected]> wrote:

> > Something on your system creates bogus NMI interrupts. What chipset
> > are you using exactly?
> >
> > Sometimes chipsets can be programmed to raise NMIs when a PCI bus
> > error occurs.
> >
> > 21 is the normal state (PIT timer running, but no errors logged)
> >
> > If you have an AMD 8131 it could be in theory erratum 54, but then
> > normally one of the error bits in reason should be set.
>
> Yes, it's an AMD-8111 / 8131 / 8151 / K8-northbridge machine.

It's probably one of your IO cards. I would remove them one by one
or possibly switch them to different slots (PCI vs PCI-X)

-Andi

2004-10-18 18:42:52

by Phil Oester

Subject: Re: NMI watchdog detected lockup

On Mon, Oct 18, 2004 at 08:16:54PM +0200, Andi Kleen wrote:
> On Mon, 18 Oct 2004 10:58:08 -0700
> "Randy.Dunlap" <[email protected]> wrote:
>
> > > Something on your system creates bogus NMI interrupts. What chipset
> > > are you using exactly?
> > >
> > > Sometimes chipsets can be programmed to raise NMIs when a PCI bus
> > > error occurs.
> > >
> > > 21 is the normal state (PIT timer running, but no errors logged)
> > >
> > > If you have an AMD 8131 it could be in theory erratum 54, but then
> > > normally one of the error bits in reason should be set.
> >
> > Yes, it's an AMD-8111 / 8131 / 8151 / K8-northbridge machine.
>
> It's probably one of your IO cards. I would remove them one by one
> or possibly switch them to different slots (PCI vs PCI-X)

Not sure if it's related, but I've noticed this with numerous 440gx
boxes on 2.6.8.1. I get reasons 2d and 3d. If I reboot with
nmi_watchdog=1 on these boxes, the errors go away. This was not
a problem on 2.6.3 interestingly enough...

Phil

2004-10-21 05:14:55

by Randy.Dunlap

Subject: Re: NMI watchdog detected lockup

Andi Kleen wrote:
> On Mon, 18 Oct 2004 10:58:08 -0700
> "Randy.Dunlap" <[email protected]> wrote:
>
>
>>>Something on your system creates bogus NMI interrupts. What chipset
>>>are you using exactly?
>>>
>>>Sometimes chipsets can be programmed to raise NMIs when a PCI bus
>>>error occurs.
>>>
>>>21 is the normal state (PIT timer running, but no errors logged)
>>>
>>>If you have an AMD 8131 it could be in theory erratum 54, but then
>>>normally one of the error bits in reason should be set.
>>
>>Yes, it's an AMD-8111 / 8131 / 8151 / K8-northbridge machine.
>
>
> It's probably one of your IO cards. I would remove them one by one
> or possibly switch them to different slots (PCI vs PCI-X)
>
> -Andi

Thanks, Andi.

I removed the Adaptec SCSI PCI card and switched to using
the onboard Adaptec controller, and now I'm seeing no problems,
so the AIC7xyz (or 79yz) card or driver doesn't seem to like
PCI-X or something here.

--
~Randy