2006-09-26 12:36:49

by Greg Schafer

Subject: 2.6.18 Nasty Lockup

Hi

This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
completely dead machine with only option the reset button. Usually happens
within a couple of minutes of desktop use but is 100% reproducible. Problem
is still there in a fresh checkout of current Linus git tree (post 2.6.18).

Dual Athlon-MP 2200's on a Tyan S2466 Tiger MPX. Config attached.

I used git-bisect and arrived at the apparent culprit below. Anything else I
should do to gather more info?

Help!

Thanks
Greg




5d0cf410e94b1f1ff852c3f210d22cc6c5a27ffa is first bad commit
commit 5d0cf410e94b1f1ff852c3f210d22cc6c5a27ffa
Author: john stultz <[email protected]>
Date: Mon Jun 26 00:25:12 2006 -0700

[PATCH] Time: i386 Clocksource Drivers

Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc).
With this patch, the conversion of the i386 arch to the generic timekeeping
code should be complete.

The patch should be fairly straight forward, only adding the new clocksources.

[[email protected]: acpi_pm cleanup]
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Paul Mundt <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: OGAWA Hirofumi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

:040000 040000 b22eef862844b7a26f4fb041fd3db2516c9aa03f 4e99b3095d242ad24ec62f770d73d0fa10ca5a79 M Documentation
:040000 040000 8c768988dc95f6ce40a69482ca54180ae8e3261c 1fbfffc0b5c4a07d08feb0e82b16f54f7bff71a6 M arch
:040000 040000 4946bbccf18a59c737c57bbc7eb387f755b69333 8885f8c635d62269ce8013c83c16a4801b0645da M drivers
:040000 040000 5d0de64ddb80d4b881de0124e5480d1967e83d0c 205e33731637b186f230cdfda3506893ff53384f M kernel


Attachments:
config (23.55 kB)

2006-09-26 13:56:41

by Michael Obster

Subject: Re: 2.6.18 Nasty Lockup

Hi Greg,

What do you mean by desktop use? An X11 system? If so, please also tell us
which graphics card and X11 driver you use. I see this behaviour with lots of
binary-only drivers, such as those from NVIDIA or ATI (perhaps they have
problems with the new 2.6.18 kernel).

Just to rule this out as the cause of your lockups.

Kind regards,
Michael Obster

Greg Schafer wrote:
> This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> completely dead machine with only option the reset button. Usually happens
> within a couple of minutes of desktop use but is 100% reproducible. Problem
> is still there in a fresh checkout of current Linus git tree (post 2.6.18).
>
> Dual Athlon-MP 2200's on a Tyan S2466 Tiger MPX. Config attached.

2006-09-26 15:51:59

by Ben Duncan

Subject: Re: 2.6.18 Nasty Lockup

I have been getting that recently as well. One time I happened to have top
running when it happened again.

Top showed pdflush consuming 100% CPU and 100% memory.

I do not know if this will help anyone, but thought I would share.

My lspci for the record:

00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev c1)
00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev c1)
00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev c1)
00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev c1)
00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev c1)
00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:04.0 Ethernet controller: nVidia Corporation nForce2 Ethernet Controller (rev a1)
00:06.0 Multimedia audio controller: nVidia Corporation nForce2 AC97 Audio Controler (MCP) (rev a1)
00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev c1)
01:09.0 Mass storage controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
02:00.0 VGA compatible controller: nVidia Corporation NV18 [GeForce4 MX 440 AGP 8x] (rev a2)


Greg Schafer wrote:
> Hi
>
> This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> completely dead machine with only option the reset button. Usually happens
> within a couple of minutes of desktop use but is 100% reproducible. Problem
> is still there in a fresh checkout of current Linus git tree (post 2.6.18).
>
<SNIP>
--
Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor

2006-09-26 18:19:09

by john stultz

Subject: Re: 2.6.18 Nasty Lockup

On Tue, 2006-09-26 at 22:36 +1000, Greg Schafer wrote:
> This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> completely dead machine with only option the reset button. Usually happens
> within a couple of minutes of desktop use but is 100% reproducible. Problem
> is still there in a fresh checkout of current Linus git tree (post 2.6.18).
>
> Dual Athlon-MP 2200's on a Tyan S2466 Tiger MPX. Config attached.

Thanks for narrowing this down. Could you send me full dmesg output?

thanks
-john


2006-09-26 20:15:55

by john stultz

Subject: Re: 2.6.18 Nasty Lockup

On Tue, 2006-09-26 at 22:36 +1000, Greg Schafer wrote:
> Hi
>
> This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> completely dead machine with only option the reset button. Usually happens
> within a couple of minutes of desktop use but is 100% reproducible. Problem
> is still there in a fresh checkout of current Linus git tree (post 2.6.18).
>
> Dual Athlon-MP 2200's on a Tyan S2466 Tiger MPX. Config attached.
>
> I used git-bisect and arrived at the apparent culprit below. Anything else I
> should do to gather more info?

Quick test: Does enabling CONFIG_ACPI change the behavior?

thanks
-john


2006-09-26 21:15:59

by S.Çağlar Onur

Subject: Re: 2.6.18 Nasty Lockup

On Tue, 26 Sep 2006 at 15:36, Greg Schafer wrote:
> This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> completely dead machine with only option the reset button. Usually happens
> within a couple of minutes of desktop use but is 100% reproducible. Problem
> is still there in a fresh checkout of current Linus git tree (post 2.6.18).

Same symptoms here, and it's reproducible after starting irqbalance (0.12 or
0.13); if I disable irqbalance, everything works fine.

--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



2006-09-26 22:02:54

by Greg Schafer

Subject: Re: 2.6.18 Nasty Lockup

On Tue, Sep 26, 2006 at 01:15:51PM -0700, john stultz wrote:
> On Tue, 2006-09-26 at 22:36 +1000, Greg Schafer wrote:
> > This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> > completely dead machine with only option the reset button. Usually happens
> > within a couple of minutes of desktop use but is 100% reproducible. Problem
> > is still there in a fresh checkout of current Linus git tree (post 2.6.18).
> >
> > Dual Athlon-MP 2200's on a Tyan S2466 Tiger MPX. Config attached.
> >
> > I used git-bisect and arrived at the apparent culprit below. Anything else I
> > should do to gather more info?
>
> Quick test: Does enabling CONFIG_ACPI change the behavior?

Yes. It doesn't lockup now, at least it hasn't yet. Should I always
configure with CONFIG_ACPI? I've usually avoided it.


On Tue, Sep 26, 2006 at 11:18:02AM -0700, john stultz wrote:
> Thanks for narrowing this down. Could you send me full dmesg output?

Sure, attached (non CONFIG_ACPI case).


On Tue, Sep 26, 2006 at 01:20:54PM -0400, James Puthukattukaran wrote:
> Did you try asserting an NMI via "nmi_watchdog" kernel boot argument?

I gave it a try on your advice (added nmi_watchdog=1 to boot args). No dice
I'm afraid (no oops data). But I did notice in the dmesg output something
possibly strange:

Testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!


On Tue, Sep 26, 2006 at 03:56:46PM +0200, Michael Obster wrote:
> What do you mean by desktop use? An X11 system? If so, please also tell us
> which graphics card and X11 driver you use. I see this behaviour with lots of
> binary-only drivers, such as those from NVIDIA or ATI (perhaps they have
> problems with the new 2.6.18 kernel).

No binary-only drivers here. Matrox card. In fact, the lockup sometimes
happens before X is even loaded (I suspect ntpdate in my bootscripts).


Regards
Greg


Attachments:
dmesg.out.gz (3.50 kB)

2006-09-26 22:51:26

by john stultz

Subject: Re: 2.6.18 Nasty Lockup

On Wed, 2006-09-27 at 00:15 +0300, S.Çağlar Onur wrote:
> On Tue, 26 Sep 2006 at 15:36, Greg Schafer wrote:
> > This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> > completely dead machine with only option the reset button. Usually happens
> > within a couple of minutes of desktop use but is 100% reproducible. Problem
> > is still there in a fresh checkout of current Linus git tree (post 2.6.18).
>
> Same symptoms here, and it's reproducible after starting irqbalance (0.12 or
> 0.13); if I disable irqbalance, everything works fine.

Hmm.. Not sure about the connection to irqbalance. You're using the TSC
clocksource, so I'm curious if your cpu TSC's are out of sync. Can you
boot w/ "clocksource=acpi_pm" to see if that resolves it?

thanks
-john


2006-09-26 22:59:44

by john stultz

Subject: Re: 2.6.18 Nasty Lockup

On Wed, 2006-09-27 at 08:02 +1000, Greg Schafer wrote:
> On Tue, Sep 26, 2006 at 01:15:51PM -0700, john stultz wrote:
> > On Tue, 2006-09-26 at 22:36 +1000, Greg Schafer wrote:
> > > This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> > > completely dead machine with only option the reset button. Usually happens
> > > within a couple of minutes of desktop use but is 100% reproducible. Problem
> > > is still there in a fresh checkout of current Linus git tree (post 2.6.18).
> > >
> > > Dual Athlon-MP 2200's on a Tyan S2466 Tiger MPX. Config attached.
> > >
> > > I used git-bisect and arrived at the apparent culprit below. Anything else I
> > > should do to gather more info?
> >
> > Quick test: Does enabling CONFIG_ACPI change the behavior?
>
> Yes. It doesn't lockup now, at least it hasn't yet. Should I always
> configure with CONFIG_ACPI? I've usually avoided it.

Yea. Dual proc AMD systems tend to not have synced TSCs, so we fall back
to whatever is available. If ACPI is not enabled, that usually means
only the PIT is left (otherwise the ACPI PM or HPET can be used).


>
> On Tue, Sep 26, 2006 at 11:18:02AM -0700, john stultz wrote:
> > Thanks for narrowing this down. Could you send me full dmesg output?
>
> Sure, attached (non CONFIG_ACPI case).

Thanks, that confirms the above theory. It seems there is a SMP race w/
the PIT clocksource. Andi was having a similar issue, and I had some
difficulty reproducing it (my dev box is UP) but I'll grab an SMP system
and try a few tests.

Thanks so much again for narrowing this down and providing quick
feedback!
-john


2006-09-27 09:46:12

by S.Çağlar Onur

Subject: Re: 2.6.18 Nasty Lockup

On Wed, 27 Sep 2006 at 01:50, john stultz wrote:
> On Wed, 2006-09-27 at 00:15 +0300, S.Çağlar Onur wrote:
> > On Tue, 26 Sep 2006 at 15:36, Greg Schafer wrote:
> > > This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> > > completely dead machine with only option the reset button. Usually
> > > happens within a couple of minutes of desktop use but is 100%
> > > reproducible. Problem is still there in a fresh checkout of current
> > > Linus git tree (post 2.6.18).
> >
> > Same symptoms here, and it's reproducible after starting irqbalance
> > (0.12 or 0.13); if I disable irqbalance, everything works fine.
>
> Hmm.. Not sure about the connection to irqbalance. You're using the TSC
> clocksource, so I'm curious if your cpu TSC's are out of sync. Can you
> boot w/ "clocksource=acpi_pm" to see if that resolves it?

Yep, it solves the problem, and the system boots normally with irqbalance enabled.

--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



2006-09-27 19:15:13

by john stultz

Subject: Re: 2.6.18 Nasty Lockup

On Wed, 2006-09-27 at 12:45 +0300, S.Çağlar Onur wrote:
> On Wed, 27 Sep 2006 at 01:50, john stultz wrote:
> > On Wed, 2006-09-27 at 00:15 +0300, S.Çağlar Onur wrote:
> > > On Tue, 26 Sep 2006 at 15:36, Greg Schafer wrote:
> > > > This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> > > > completely dead machine with only option the reset button. Usually
> > > > happens within a couple of minutes of desktop use but is 100%
> > > > reproducible. Problem is still there in a fresh checkout of current
> > > > Linus git tree (post 2.6.18).
> > >
> > > Same symptoms here, and it's reproducible after starting irqbalance
> > > (0.12 or 0.13); if I disable irqbalance, everything works fine.
> >
> > Hmm.. Not sure about the connection to irqbalance. You're using the TSC
> > clocksource, so I'm curious if your cpu TSC's are out of sync. Can you
> > boot w/ "clocksource=acpi_pm" to see if that resolves it?
>
> Yep, it solves the problem, and the system boots normally with irqbalance enabled.

Ok. Good to hear you have a workaround. Now to sort out why your TSCs
are becoming un-synced. From the dmesg you sent me privately, I noticed
that while you have 4 cpus, the following message only shows up once:

ACPI: Processor [CPU1] (supports 8 throttling states)

Does disabling cpufreq change anything?

thanks
-john


2006-09-27 20:55:34

by Andi Kleen

Subject: Re: 2.6.18 Nasty Lockup

On Wed, Sep 27, 2006 at 12:14:59PM -0700, john stultz wrote:
> On Wed, 2006-09-27 at 12:45 +0300, S.Çağlar Onur wrote:
> > On Wed, 27 Sep 2006 at 01:50, john stultz wrote:
> > > On Wed, 2006-09-27 at 00:15 +0300, S.Çağlar Onur wrote:
> > > > On Tue, 26 Sep 2006 at 15:36, Greg Schafer wrote:
> > > > > This is a _hard_ lockup. No oops, no magic sysrq, no nuthin, just a
> > > > > completely dead machine with only option the reset button. Usually
> > > > > happens within a couple of minutes of desktop use but is 100%
> > > > > reproducible. Problem is still there in a fresh checkout of current
> > > > > Linus git tree (post 2.6.18).
> > > >
> > > > Same symptoms here, and it's reproducible after starting irqbalance
> > > > (0.12 or 0.13); if I disable irqbalance, everything works fine.
> > >
> > > Hmm.. Not sure about the connection to irqbalance. You're using the TSC
> > > clocksource, so I'm curious if your cpu TSC's are out of sync. Can you
> > > boot w/ "clocksource=acpi_pm" to see if that resolves it?
> >
> > Yep, it solves the problem, and the system boots normally with irqbalance enabled.
>
> Ok. Good to hear you have a workaround. Now to sort out why your TSCs
> are becoming un-synced. From the dmesg you sent me privately, I noticed

On Intel it seems to happen when people overclock their systems.

> that while you have 4 cpus, the following message only shows up once:
>
> ACPI: Processor [CPU1] (supports 8 throttling states)
>
> Does disabling cpufreq change anything?

Throttling has nothing to do with cpufreq
(at least not until you use the broken P4 throttling cpufreq
driver, which nobody should). It is normally only used when
the CPU overheats.

-Andi

2006-09-27 21:06:09

by S.Çağlar Onur

Subject: Re: 2.6.18 Nasty Lockup

On Wed, 27 Sep 2006 at 23:55, Andi Kleen wrote:
> > Ok. Good to hear you have a workaround. Now to sort out why your TSCs
> > are becoming un-synced. From the dmesg you sent me privately, I noticed
>
> On Intel it seems to happen when people overclock their systems.

This system is not overclocked; it's a plain 2 x Intel Xeon 3GHz with HT.

> > that while you have 4 cpus, the following message only shows up once:
> >
> > ACPI: Processor [CPU1] (supports 8 throttling states)
> >
> > Does disabling cpufreq change anything?
>
> Throttling has nothing to do with cpufreq
> (at least not until you use the broken P4 throttling cpufreq
> driver, which nobody should). It is normally only used when
> the CPU overheats.

Neither of them is used on this system either:

buildfarm ~ # lsmod
Module                  Size  Used by
i2c_i801                7372  0
i2c_core               19968  1 i2c_i801
serio_raw               7012  0
e752x_edac             11364  0
edac_mc                21424  1 e752x_edac
i6300esb                7096  0
tg3                    98116  0
sd_mod                 18432  4
uhci_hcd               21900  0
ehci_hcd               29896  0
usbcore               115652  5 uhci_hcd,ehci_hcd
ata_piix               13864  2
libata                 93172  1 ata_piix
scsi_mod              127304  2 sd_mod,libata

Also if needed, you can find .config at
http://cekirdek.pardus.org.tr/~caglar/config.2.6.18

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



2006-09-28 08:40:04

by S.Çağlar Onur

Subject: Re: 2.6.18 Nasty Lockup

On Wed, 27 Sep 2006 at 22:14, john stultz wrote:
> Ok. Good to hear you have a workaround. Now to sort out why your TSCs
> are becoming un-synced. From the dmesg you sent me privately, I noticed
> that while you have 4 cpus, the following message only shows up once:
>
> ACPI: Processor [CPU1] (supports 8 throttling states)
>
> Does disabling cpufreq change anything?

By the way, I tried it, but nothing changes :(

--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



2006-09-29 08:49:56

by S.Çağlar Onur

Subject: Re: 2.6.18 Nasty Lockup

On Thu, 28 Sep 2006 at 14:39, S.Çağlar Onur wrote:
> On Wed, 27 Sep 2006 at 22:14, john stultz wrote:
> > Ok. Good to hear you have a workaround. Now to sort out why your TSCs
> > are becoming un-synced. From the dmesg you sent me privately, I noticed
> > that while you have 4 cpus, the following message only shows up once:
> >
> > ACPI: Processor [CPU1] (supports 8 throttling states)
> >
> > Does disabling cpufreq change anything?
>
> By the way, I tried it, but nothing changes :(

Is there any other advice available? Is there anything else you want me to
try?

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



2006-10-06 22:57:37

by john stultz

Subject: Re: 2.6.18 Nasty Lockup

On Fri, 2006-09-29 at 11:49 +0300, S.Çağlar Onur wrote:
> On Thu, 28 Sep 2006 at 14:39, S.Çağlar Onur wrote:
> > On Wed, 27 Sep 2006 at 22:14, john stultz wrote:
> > > Ok. Good to hear you have a workaround. Now to sort out why your TSCs
> > > are becoming un-synced. From the dmesg you sent me privately, I noticed
> > > that while you have 4 cpus, the following message only shows up once:
> > >
> > > ACPI: Processor [CPU1] (supports 8 throttling states)
> > >
> > > Does disabling cpufreq change anything?
> >
> > By the way, I tried it, but nothing changes :(
>
> Is there any other advice available? Is there anything else you want me to
> try?

Hey S.Çağlar,

So I just wrote up this test case that will show how skewed the TSCs
are. I'd be interested if you could run it a few times quickly after a
fresh boot, and then again a day or so later.

See the header comment for instructions.

And just a fair warning: this runs w/ SCHED_FIFO, and thus has the
potential to hang your system (while writing it I made a few flubs and
it hung my system). I believe I've got all of the issues fixed (tested
on a few systems), but wanted to give you a fair warning before I
suggest you run this. :)

thanks
-john


Attachments:
tsc-drift.c (3.35 kB)

2006-10-07 15:48:16

by S.Çağlar Onur

Subject: Re: 2.6.18 Nasty Lockup

On Sat, 07 Oct 2006 at 01:57, john stultz wrote:
> Hey S.Çağlar,
>
> So I just wrote up this test case that will show how skewed the TSCs
> are. I'd be interested if you could run it a few times quickly after a
> fresh boot, and then again a day or so later.
>
> See the header comment for instructions.
>
> And just a fair warning: this runs w/ SCHED_FIFO, and thus has the
> potential to hang your system (while writing it I made a few flubs and
> it hung my system). I believe I've got all of the issues fixed (tested
> on a few systems), but wanted to give you a fair warning before I
> suggest you run this. :)

OK, I'll try this on Monday, thanks!

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

