2007-09-28 20:25:33

by Frans Pop

Subject: [2.6.23-rc8-mm2] System hangs (loops?) during boot

My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot after:
Marking TSC unstable due to: possible TSC halt in C2.
Time: acpi_pm clocksource has been installed.

It may not actually hang, but rather end up in a loop as after some time the
fan goes wild.

System boots fine with 2.6.23-rc8. Have not tried earlier mm releases.

Any suggestions where to look before I do a bisect?

Cheers,
FJP


2007-09-29 00:32:55

by Frans Pop

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Friday 28 September 2007, Frans Pop wrote:
> My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot after:
> Marking TSC unstable due to: possible TSC halt in C2.
> Time: acpi_pm clocksource has been installed.

A few new boot attempts show the problem is more likely at:
Probing IDE interface ide0...

> It may not actually hang, but rather end up in a loop as after some time
> the fan goes wild.

Unfortunately I have no serial port and this seems too early for netconsole,
so cannot catch a boot log.

> Any suggestions where to look before I do a bisect?

2007-09-29 08:37:21

by Andrew Morton

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Sat, 29 Sep 2007 02:32:44 +0200 Frans Pop <[email protected]> wrote:

> On Friday 28 September 2007, Frans Pop wrote:
> > My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot after:
> > Marking TSC unstable due to: possible TSC halt in C2.
> > Time: acpi_pm clocksource has been installed.
>
> A few new boot attempts show the problem is more likely at:
> Probing IDE interface ide0...
>
> > It may not actually hang, but rather end up in a loop as after some time
> > the fan goes wild.
>
> Unfortunately I have no serial port and this seems too early for netconsole,
> so cannot catch a boot log.
>
> > Any suggestions where to look before I do a bisect?

Not really, sorry. I usually start at mm.patch when I don't have a clue.
If mm.patch fails then pop off all the x86 patches (down to
git-ipwireless_cs.patch).

If mm.patch doesn't fail then push on all the memory management patches (up
to mm-test-and-set-zone-reclaim-lock-before-starting-cleanup.patch).

Two iterations should get you into the culprit zone.

But please do bisect it.
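
The push/pop procedure described above is a binary search over the ordered patch series. Below is a minimal sketch of that idea, not the actual -mm tooling: `first_bad_patch` and the `boots_ok` callback are hypothetical names, and the sketch assumes that the tree with no patches boots, the tree with all patches hangs, and every prefix that includes the culprit hangs as well.

```python
from typing import Callable, Sequence, Tuple


def first_bad_patch(
    series: Sequence[str],
    boots_ok: Callable[[Sequence[str]], bool],
) -> Tuple[str, int]:
    """Return (patch name, number of boot tests) for the first bad patch.

    `boots_ok(prefix)` is a hypothetical callback: it stands for building
    and booting a kernel with exactly the patches in `prefix` applied.
    """
    lo, hi = 0, len(series)  # prefix of length lo boots, length hi hangs
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if boots_ok(series[:mid]):
            lo = mid  # culprit is among the patches after position mid
        else:
            hi = mid  # culprit is at or before position mid
    return series[hi - 1], tests
```

Each test halves the remaining range, so ten iterations are enough to separate up to 2**10 = 1024 candidate patches.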

2007-09-29 20:02:53

by Andrew Morton

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Sat, 29 Sep 2007 21:40:22 +0200 Frans Pop <[email protected]> wrote:

> On Saturday 29 September 2007, you wrote:
> > On Sat, 29 Sep 2007 02:32:44 +0200 Frans Pop <[email protected]> wrote:
> > > On Friday 28 September 2007, Frans Pop wrote:
> > > > My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot after:
> > > > Marking TSC unstable due to: possible TSC halt in C2.
> > > > Time: acpi_pm clocksource has been installed.
> > >
> > > A few new boot attempts show the problem is more likely at:
> > > Probing IDE interface ide0...
>
> Looks like it is both: hpet killing IDE?
>
> > Two iterations should get you into the culprit zone.
>
> Thanks for the pointers. Luckily I ended up in a quite narrow zone between
> two of the points you indicated (10 iterations).
>
> And the winner is:
>
> 3fe6c0016fd863b233097a8219a0d8577c2fd503 is first bad commit
> commit 3fe6c0016fd863b233097a8219a0d8577c2fd503
> Author: Udo A. Steinberg <...>
> hpet-force-enable-on-ich34
>
> Guess the comments about thin ice and testing were justified :-)
>
> lspci and a 2.6.23-rc6 dmesg for this system can be found in:
> http://lkml.org/lkml/2007/9/19/300

Great, thanks for doing that.

I guess I'll drop the patch for now in that case.

2007-09-29 22:15:47

by Frans Pop

Subject: Re: [2.6.23-rc8-mm2] kernel BUG at mm/slab.c:591! | invalid opcode: 0000 [#1] SMP

On Friday 28 September 2007, you wrote:
> My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot after:

With 'hpet-force-enable-on-ich34' reverted the system boots OK again.

We're not yet done though. It now fails to resume from suspend and there's
also the BUG (see subject) during power off. Possibly these are related.

Attached a diff of dmesg between 2.6.23-rc6 and 2.6.23-rc8-mm2. Nothing
spectacular AFAICT, except that I activated netconsole.

The details of that BUG are at the end of the diff.

I have some idea where to start looking for this one. If I'm not mistaken,
it should be somewhere between these two changes:
# good: [01762341418efa818f28dc69426ca7cc582cdc8c] git-wireless
# good: [ecdd2a3cb73af8b4659a4cb150bdf2a3ac908791] i386-pit-remove-the-useless-ifdefs

Cheers,
FJP


Attachments:
(No filename) (794.00 B)
2.6.23-rc6_rc8-mm2.dmesg.diff (21.49 kB)

2007-09-30 10:46:43

by Andrew Morton

Subject: Re: [2.6.23-rc8-mm2] kernel BUG at mm/slab.c:591! | invalid opcode: 0000 [#1] SMP

On Sun, 30 Sep 2007 00:15:35 +0200 Frans Pop <[email protected]> wrote:

> On Friday 28 September 2007, you wrote:
> > My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot after:
>
> With 'hpet-force-enable-on-ich34' reverted the system boots OK again.
>
> We're not yet done though. It now fails to resume from suspend

suspend-to-RAM? Can you describe this failure a bit more?

> and there's
> also the BUG (see subject) during power off. Possibly these are related.
>
> Attached a diff of dmesg between 2.6.23-rc6 and 2.6.23-rc8-mm2. Nothing
> spectacular AFAICT, except that I activated netconsole.
>
> The details of that BUG are at the end of the diff.
>
> I have some idea where to start looking for this one. If I'm not mistaken,
> it should be somewhere between these two changes:
> # good: [01762341418efa818f28dc69426ca7cc582cdc8c] git-wireless
> # good: [ecdd2a3cb73af8b4659a4cb150bdf2a3ac908791] i386-pit-remove-the-useless-ifdefs
>

> +------------[ cut here ]------------
> +kernel BUG at mm/slab.c:591!
> +invalid opcode: 0000 [#1] SMP
> +last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:01:0b.0/resource
> +Modules linked in: ipv6 fuse dm_snapshot eeprom lm90 i2c_i801 i2c_core speedstep_ich speedstep_lib toshiba_acpi joydev tsdev pcmcia firmware_class snd_intel8x0 snd_intel8x0m snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss yenta_socket snd_pcm rsrc_nonstatic pcmcia_core snd_timer iTCO_wdt video output snd battery psmouse ac watchdog_core watchdog_dev button parport_pc parport shpchp pci_hotplug soundcore snd_page_alloc intel_agp agpgart evdev serio_raw pcspkr rtc ext3 jbd mbcache dm_mirror dm_mod ide_cd cdrom ide_disk piix generic ide_core ata_generic libata scsi_mod ehci_hcd uhci_hcd usbcore thermal processor fan
> +
> +Pid: 3763, comm: rmmod Not tainted (2.6.23-rc8-mm2 #1)
> +EIP: 0060:[<c016df71>] EFLAGS: 00010046 CPU: 0
> +EIP is at kfree+0x5e/0x97
> +EAX: 00000000 EBX: dfe704b8 ECX: c2340d84 EDX: c13fcd40
> +ESI: 00000286 EDI: dfe6ac65 EBP: c7204000 ESP: c7205f14
> + DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> +Process rmmod (pid: 3763, ti=c7204000 task=c201eab0 task.ti=c7204000)
> +last branch before last exception/interrupt
> + from c016df66 (kfree+0x53/0x97)
> + to c016df6b (kfree+0x58/0x97)
> +Stack: dfe704b8 dfe6ac65 dfe704c8 c01d975c dfe7048c c01d9772 00000880 c01da45a
> + 00000000 00000880 dfe70488 00000000 dfe705c0 00000000 dfe69395 dfe6a5f2
> + dfe6aba0 c01465a0 65737566 00000000 00000000 c21e75c0 c0163e91 c18df200
> +Call Trace:
> + [<c01d975c>] kobject_cleanup+0x31/0x47
> + [<c01da45a>] kref_put+0x76/0x84
> + [<dfe69395>] fuse_sysfs_cleanup+0xa/0x14 [fuse]
> + [<dfe6a5f2>] fuse_exit+0x19/0x24 [fuse]
> + [<c01465a0>] sys_delete_module+0x1c0/0x228
> + [<c0103ece>] sysenter_past_esp+0x6b/0xa1
> + [<ffffe410>] 0xffffe410
> + =======================
> +Code: aa 43 c0 8b 02 25 00 40 02 00 3d 00 40 02 00 75 03 8b 52 0c 8b 02 25 00 40 02 00 3d 00 40 02 00 75 03 8b 52 0c 8b 02 84 c0 78 04 <0f> 0b eb fe 8b 4a 18 64 a1 08 00 40 c0 8b 1c 81 8b 03 3b 43 04
> +EIP: [<c016df71>] kfree+0x5e/0x97 SS:ESP 0068:c7205f14
>

I think we have now fixed that - Miklos, do you recall?

2007-09-30 14:24:48

by Udo A. Steinberg

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Sat, 29 Sep 2007 13:02:34 -0700 Andrew Morton (AM) wrote:

AM> On Sat, 29 Sep 2007 21:40:22 +0200 Frans Pop <[email protected]> wrote:
AM>
AM> > On Saturday 29 September 2007, you wrote:
AM> > > On Sat, 29 Sep 2007 02:32:44 +0200 Frans Pop <[email protected]>
AM> > > wrote:
AM> > > > On Friday 28 September 2007, Frans Pop wrote:
AM> > > > > My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot
AM> > > > > after: Marking TSC unstable due to: possible TSC halt in C2.
AM> > > > > Time: acpi_pm clocksource has been installed.
AM> > > >
AM> > > > A few new boot attempts show the problem is more likely at:
AM> > > > Probing IDE interface ide0...
AM> >
AM> > Looks like it is both: hpet killing IDE?
AM> >
AM> > > Two iterations should get you into the culprit zone.
AM> >
AM> > Thanks for the pointers. Luckily I ended up in a quite narrow zone
AM> > between two of the points you indicated (10 iterations).
AM> >
AM> > And the winner is:
AM> >
AM> > 3fe6c0016fd863b233097a8219a0d8577c2fd503 is first bad commit
AM> > commit 3fe6c0016fd863b233097a8219a0d8577c2fd503
AM> > Author: Udo A. Steinberg <...>
AM> > hpet-force-enable-on-ich34
AM> >
AM> > Guess the comments about thin ice and testing were justified :-)
AM> >
AM> > lspci and a 2.6.23-rc6 dmesg for this system can be found in:
AM> > http://lkml.org/lkml/2007/9/19/300
AM>
AM> Great, thanks for doing that.
AM>
AM> I guess I'll drop the patch for now in that case.

I somehow doubt that the HPET patch itself is the culprit. In fact, the
reason it shows up on git-bisect is probably because without it HPET
functionality is not enabled on the platform. So the problem could really
be anywhere in the HPET-driven timer infrastructure.

Frans, could you try out the -hrt patchset from Thomas Gleixner and see
if that works? Also, what ICH does your platform have? ICH3 or ICH4?

Cheers,

- Udo



2007-09-30 21:50:42

by Frans Pop

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Sunday 30 September 2007, you wrote:
> On Sat, 29 Sep 2007 13:02:34 -0700 Andrew Morton (AM) wrote:
> AM> On Sat, 29 Sep 2007 21:40:22 +0200 Frans Pop wrote:
> AM> > 3fe6c0016fd863b233097a8219a0d8577c2fd503 is first bad commit
> AM> > commit 3fe6c0016fd863b233097a8219a0d8577c2fd503
> AM> > Author: Udo A. Steinberg <...>
> AM> > hpet-force-enable-on-ich34
> AM> >
> AM> > Guess the comments about thin ice and testing were justified :-)
> AM> >
> AM> Great, thanks for doing that.
> AM>
> AM> I guess I'll drop the patch for now in that case.
>
> I somehow doubt that the HPET patch itself is the culprit. In fact, the
> reason it shows up on git-bisect is probably because without it HPET
> functionality is not enabled on the platform. So the problem could really
> be anywhere in the HPET-driven timer infrastructure.
>
> Frans, could you try out the -hrt patchset from Thomas Gleixner and see
> if that works?

I'm not sure what you mean. I fetched the branch I think you referred to
[1], but when I did a merge of that on top of v2.6.23-rc8-mm2 I
got "Already up-to-date", so AFAICT that branch is fully merged into mm and
I'm already running with your latest code...
Please correct me if I'm doing anything wrong.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt

> Also, what ICH does your platform have? ICH3 or ICH4?

It is ICH4.
See the link below (which I already included) for more details:
> AM> > lspci and a 2.6.23-rc6 dmesg for this system can be found in:
> AM> > http://lkml.org/lkml/2007/9/19/300

Cheers,
FJP

2007-09-30 22:00:51

by Udo A. Steinberg

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Sun, 30 Sep 2007 23:50:29 +0200 Frans Pop (FP) wrote:

FP> I'm not sure what you mean. I fetched the branch I think you referred to
FP> [1], but when I did a merge of that on top of v2.6.23-rc8-mm2 I
FP> got "Already up-to-date", so AFAICT that branch is fully merged into mm
FP> and I'm already running with your latest code...
FP> Please correct me if I'm doing anything wrong.

I was suggesting to download 2.6.23-rc8 and applying the -hrt patchset at
http://www.kernel.org/pub/linux/kernel/people/tglx/hrtimers/2.6.23-rc8/
on top of it.

That excludes all the extra stuff in -mm and should give us a good hint
whether HPET is really at fault.

Cheers,

- Udo



2007-09-30 22:30:52

by Miklos Szeredi

Subject: Re: [2.6.23-rc8-mm2] kernel BUG at mm/slab.c:591! | invalid opcode: 0000 [#1] SMP

> > +Call Trace:
> > + [<c01d975c>] kobject_cleanup+0x31/0x47
> > + [<c01da45a>] kref_put+0x76/0x84
> > + [<dfe69395>] fuse_sysfs_cleanup+0xa/0x14 [fuse]
> > + [<dfe6a5f2>] fuse_exit+0x19/0x24 [fuse]
> > + [<c01465a0>] sys_delete_module+0x1c0/0x228
> > + [<c0103ece>] sysenter_past_esp+0x6b/0xa1
> > + [<ffffe410>] 0xffffe410
> > + =======================
> > +Code: aa 43 c0 8b 02 25 00 40 02 00 3d 00 40 02 00 75 03 8b 52 0c 8b 02 25 00 40 02 00 3d 00 40 02 00 75 03 8b 52 0c 8b 02 84 c0 78 04 <0f> 0b eb fe 8b 4a 18 64 a1 08 00 40 c0 8b 1c 81 8b 03 3b 43 04
> > +EIP: [<c016df71>] kfree+0x5e/0x97 SS:ESP 0068:c7205f14
> >
>
> I think we have now fixed that - Miklos, do you recall?

Yes, Greg has a fix. I didn't see it go into -mm yet.

Miklos
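
The quoted call trace shows the BUG being reached from the fuse module's exit path during rmmod. When reading such a trace, it can help to pull out the symbol, offset and module of each frame. The following is a small sketch for doing that, assuming frames in the `[<address>] symbol+0xoff/0xsize [module]` form seen in this oops; `Frame` and `parse_call_trace` are illustrative names, not part of any kernel tool.

```python
import re
from typing import List, NamedTuple, Optional

# Matches frames such as:
#   [<dfe69395>] fuse_sysfs_cleanup+0xa/0x14 [fuse]
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)


class Frame(NamedTuple):
    address: int
    symbol: str
    offset: int            # offset of the return address into the function
    size: int              # size of the function
    module: Optional[str]  # None for code built into the kernel image


def parse_call_trace(text: str) -> List[Frame]:
    """Extract symbolized frames; lines without a symbol are skipped."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            addr, symbol, offset, size, module = match.groups()
            frames.append(
                Frame(int(addr, 16), symbol, int(offset, 16), int(size, 16), module)
            )
    return frames
```

Frames that carry a module name (here `fuse`) come from loadable modules; unsymbolized entries such as `0xffffe410` are ignored by this sketch.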

2007-09-30 22:59:38

by Frans Pop

Subject: Re: [2.6.23-rc8-mm2] Fails to resume from s2mem (was:kernel BUG at mm/slab.c:591! ...)

On Sunday 30 September 2007, you wrote:
> On Sun, 30 Sep 2007 00:15:35 +0200 Frans Pop <[email protected]> wrote:
> > On Friday 28 September 2007, you wrote:
> > > My Toshiba Satellite A40 (i386, P4 Mobile) hangs during boot after:
> >
> > With 'hpet-force-enable-on-ich34' reverted the system boots OK again.
> >
> > We're not yet done though. It now fails to resume from suspend
>
> suspend-to-RAM? Can you describe this failure a bit more?

The suspend is done from KDE by closing the lid, which runs a trivial
script. The system seems to suspend normally (correct leds at the end).

When I open the lid again, the system seems to restart (fan starts, leds
change), but the LCD only shows:
<cursor> L
System cannot be reached over the net. After some time the fans speed up.

Exactly the same happens if I 'echo mem > /sys/power/state' from console
without X running.

BTW, that partial display is normal for me: I mostly see "Linu" until the
switch to X.Org, but never a full sentence that makes sense.
*Is* that normal?

2007-10-01 00:07:44

by Frans Pop

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Monday 01 October 2007, you wrote:
> On Sun, 30 Sep 2007 23:50:29 +0200 Frans Pop (FP) wrote:
> > I'm not sure what you mean. I fetched the branch I think you referred
> > to [1], but when I did a merge of that on top of v2.6.23-rc8-mm2 I
> > got "Already up-to-date", so AFAICT that branch is fully merged into mm
> > and I'm already running with your latest code...
>
> I was suggesting to download 2.6.23-rc8 and applying the -hrt patchset at
> http://www.kernel.org/pub/linux/kernel/people/tglx/hrtimers/2.6.23-rc8/
> on top of it.

Ah, OK. I'm afraid that was not at all clear from your previous message :-/

During 'make oldconfig' I got a config question about "CPU idle support",
which does not seem to be in rc8-mm2; is that correct? I answered N.

> That excludes all the extra stuff in -mm and should give us a good hint
> whether HPET is really at fault.

The system does boot with rc8 + hrt1.

Andrew: any suggestions on how to trace the "real" culprit for the hang?


Udo: I did see one issue during boot with this rc8 + hrt1 kernel.
System is Debian unstable.
Setting the system clock..
select() to /dev/rtc to wait for clock tick timed out

Also, if the patchset really is not the same as in mm, then that at least
partially invalidates this test.

Relevant lines from dmesg diff:
-Linux version 2.6.23-rc6 [...]
+Linux version 2.6.23-rc8-hrt1 [...]
[...]
+hpet clockevent registered
+hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
+hpet0: 3 64-bit timers, 14318180 Hz
ACPI: RTC can wake from S4
Time: tsc clocksource has been installed.
[...]
-Marking TSC unstable due to: possible TSC halt in C2.
-Time: acpi_pm clocksource has been installed.
[...]
-Clocksource tsc unstable (delta = -473964036 ns)
[...]
Real Time Clock Driver v1.12ac
[...]
+Marking TSC unstable due to: cpufreq changes.
+Time: hpet clocksource has been installed.
+Clocksource tsc unstable (delta = -125526392 ns)

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm pit jiffies tsc
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

I noted that acpi_pm is no longer mentioned in dmesg, but is still present.

Cheers,
FJP
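
The two sysfs reads in the message above can also be done from a script, for instance to record the active clocksource next to each boot test. This is a minimal sketch that reads the same two files; the helper name `read_clocksources` and its `base` parameter are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Tuple

# Directory used in the message above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR) -> Tuple[str, List[str]]:
    """Return (current clocksource, list of available clocksources)."""
    available = (base / "available_clocksource").read_text().split()
    current = (base / "current_clocksource").read_text().strip()
    return current, available
```

On the rc8 + hrt1 boot reported above, this would return `hpet` as the current clocksource and a list that still includes `acpi_pm`, matching the observation that acpi_pm remains available even though dmesg no longer mentions it.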

2007-10-01 00:44:32

by Udo A. Steinberg

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Mon, 1 Oct 2007 02:07:33 +0200 Frans Pop (FP) wrote:

FP> On Monday 01 October 2007, you wrote:
FP> > I was suggesting to download 2.6.23-rc8 and applying the -hrt patchset
FP> > at
FP> > http://www.kernel.org/pub/linux/kernel/people/tglx/hrtimers/2.6.23-rc8/
FP> > on top of it.
FP>
FP> Ah, OK. I'm afraid that was not at all clear from your previous
FP> message :-/

Yeah, sorry about that.

FP> During 'make oldconfig' I got a config question about "CPU idle
FP> support", which does not seem to be in rc8-mm2; is that correct? I
FP> answered N.

Shouldn't matter either way. Answering 'Y' gives you a more sophisticated
C-state governor that improves battery life.

FP> The system does boot with rc8 + hrt1.

Good. That seems to confirm my suspicion that the real problem is caused by
something in -mm which is not in -hrt. However, I have no idea what exactly
could be going wrong.

FP> Andrew: any suggestions on how to trace the "real" culprit for the hang?
FP>
FP>
FP> Udo: I did see one issue during boot with this rc8 + hrt1 kernel.
FP> System is Debian unstable.
FP> Setting the system clock..
FP> select() to /dev/rtc to wait for clock tick timed out

Thomas and Andrew are the best people to ask about what exactly has been
merged from -hrt into -mm. Maybe they can chime in here.

Cheers,

- Udo



2007-10-01 01:13:07

by Frans Pop

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Monday 01 October 2007, Udo A. Steinberg wrote:
> On Mon, 1 Oct 2007 02:07:33 +0200 Frans Pop (FP) wrote:
> FP> On Monday 01 October 2007, you wrote:
> FP> > I was suggesting to download 2.6.23-rc8 and applying the -hrt
> FP> > patchset at
> FP> > http://www.kernel.org/pub/linux/kernel/people/tglx/hrtimers/2.6.23-rc8/
> FP> > on top of it.
>
> FP> The system does boot with rc8 + hrt1.
>
> FP> Udo: I did see one issue during boot with this rc8 + hrt1 kernel.
> FP> System is Debian unstable.
> FP> Setting the system clock..
> FP> select() to /dev/rtc to wait for clock tick timed out
>
> Thomas and Andrew are the best people to ask about what exactly has been
> merged from -hrt into -mm. Maybe they can chime in here.

I agree with you for the original issue.

But the /dev/rtc issue described here has *nothing* to do with mm: I saw
that with pure Linus' 2.6.23-rc8 + Thomas' hrt1 on top: the combination you
asked me to test.
So this is an issue in "your" hrt changes. Thomas may have some suggestions,
but maybe you can have a look as well?

I'm of course willing to do a bisect if needed to narrow this down, but I'd
like to know who I'll be talking to or where I should follow up for issues in
the hrt patch set.

2007-10-01 02:24:46

by Andrew Morton

Subject: Re: [2.6.23-rc8-mm2] System hangs (loops?) during boot

On Mon, 1 Oct 2007 02:07:33 +0200 Frans Pop <[email protected]> wrote:

> > That excludes all the extra stuff in -mm and should give us a good hint
> > whether HPET is really at fault.
>
> The system does boot with rc8 + hrt1.
>
> Andrew: any suggestions on how to trace the "real" culprit for the hang?

Not really. Ordinarily one could move hpet-force-enable-on-ich34.patch to
start-of-series, then verify that mainline+hpet-force-enable-on-ich34.patch
works correctly, then just bisect all the other patches.

But that doesn't work because hpet-force-enable-on-ich34.patch has a
dependency on other patches in the hrt-related patch series, and it could
be that the bug which you've exposed is already in mainline anyway.

If you had a minimal, standalone hpet-force-enable-on-ich34.patch against
mainline then you could test that against mainline. If that also failed
then you could git-bisect mainline, applying
hpet-force-enable-on-ich34.patch each time (I've done that before). But
this assumes that you're searching for a regression in mainline. It may
never have worked.
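
The "git-bisect mainline, applying the patch each time" procedure can be sketched as two small helpers wrapped around the manual boot test. This is only an illustration under stated assumptions: it presumes a standalone patch file exists (which, as noted above, is not currently the case), `prepare_step` and `finish_step` are hypothetical names, and the build, install and reboot happen by hand between the two calls.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


def prepare_step(patch: str, run: Runner = run_checked) -> bool:
    """Apply the patch on top of the commit git bisect checked out.

    Returns False (and tells git to skip this commit) if the patch
    does not apply, so the commit is neither marked good nor bad.
    """
    try:
        run(["git", "apply", patch])
    except subprocess.CalledProcessError:
        run(["git", "bisect", "skip"])
        return False
    return True


def finish_step(patch: str, booted: bool, run: Runner = run_checked) -> str:
    """Remove the patch again, then record the boot result."""
    # Reverse-apply first so the tree is clean before bisect moves on.
    run(["git", "apply", "-R", patch])
    verdict = "good" if booted else "bad"
    run(["git", "bisect", verdict])
    return verdict
```

The patch is reverse-applied before the verdict is recorded so that git bisect can check out the next candidate from a clean tree. As the message above points out, this only finds something if mainline plus the patch ever worked.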