2006-05-11 01:24:48

by Bernhard Rosenkraenzer

[permalink] [raw]
Subject: Total machine lockup w/ current kernels while installing from CD

Hi,
I've built a CD that installs a customized system; basically what it does is
boot from CD (iso9660) and run

mke2fs -m0 -j -O dir_index,filetype,has_journal,sparse_super /dev/hda1
mount -text3 /dev/hda1 /mnt/dest
rpm -r /mnt/dest -ivh /RPMS/*

This worked perfectly up until some recent kernel updates - with current
kernels (both Linus and -mm branch), the system locks up totally at a random
point during rpm installation (everything goes down, including the NumLock
LED etc).

I'm currently building some old kernels to see when this problem was
introduced and sort out what patch did it.
With 2.6.16-rc6, the problem occurs, but unlike later revisions it gives an
error message before freezing:

BUG: soft lockup detected on CPU#0!

Pid: 421, comm: kjournald
EIP: 0060:[<b01a2f52>] CPU: 0
EIP is at journal_commit_transaction+0x92e/0xfcc
EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
[<b02bd52e>] schedule+0x2ee/0x5b6
[<b01a6a88>] kjournald+0x201/0x213
[<b0111089>] smp_apic_timer_interrupt+0x32/0x49
[<b01a6937>] kjournald+0xb0/0x213
[<b01a5ffa>] commit_timeout+0x0/0x9
[<b012a789>] autoremove_wake_function+0x0/0x4b
[<b01a6887>] kjournald+0x0/0x213
[<b0101005>] kernel_thread_helper+0x5/0xb

I'll keep trying to isolate which patch introduced this; at the moment the
last known working kernel is 2.6.15 (and 2.6.15-mm4), the first known broken
one is 2.6.16-rc6; the ones in between are still waiting for the compile box.


2006-05-14 17:00:32

by Bernhard Rosenkraenzer

[permalink] [raw]
Subject: Re: Total machine lockup w/ current kernels while installing from CD

On Thursday, 11. May 2006 03:22, Bernhard Rosenkraenzer wrote:
> Hi,
> I've built a CD that installs a customized system; basically what it does
> is boot from CD (iso9660) and run
>
> mke2fs -m0 -j -O dir_index,filetype,has_journal,sparse_super /dev/hda1
> mount -text3 /dev/hda1 /mnt/dest
> rpm -r /mnt/dest -ivh /RPMS/*
>
> This worked perfectly up until some recent kernel updates - with current
> kernels (both Linus and -mm branch), the system locks up totally at a
> random point during rpm installation (everything goes down, including the
> NumLock LED etc).
>
> I'm currently building some old kernels to see when this problem was
> introduced and sort out what patch did it.
> With 2.6.16-rc6, the problem occurs, but unlike later revisions it gives an
> error message before freezing:
>
> BUG: soft lockup detected on CPU#0!
>
> Pid: 421, comm: kjournald
> EIP: 0060:[<b01a2f52>] CPU: 0
> EIP is at journal_commit_transaction+0x92e/0xfcc
> EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
> EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
> ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
> CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
> [<b02bd52e>] schedule+0x2ee/0x5b6
> [<b01a6a88>] kjournald+0x201/0x213
> [<b0111089>] smp_apic_timer_interrupt+0x32/0x49
> [<b01a6937>] kjournald+0xb0/0x213
> [<b01a5ffa>] commit_timeout+0x0/0x9
> [<b012a789>] autoremove_wake_function+0x0/0x4b
> [<b01a6887>] kjournald+0x0/0x213
> [<b0101005>] kernel_thread_helper+0x5/0xb

I'm finally through compiling all kernels between the last good and first bad
one (slow hardware sucks ;) ) -- the problem was introduced in the patch from
2.6.16-rc5 to 2.6.16-rc6, and is apparently not present in any -mm releases
before 2.6.16-rc6.

I'll try to isolate which change between 2.6.16-rc5 and -rc6 causes it... But
it would be nice if someone who knows a bit more about the code involved
could look at it.
l

2006-05-15 20:32:12

by Bernhard Rosenkraenzer

[permalink] [raw]
Subject: [FIXED] Re: Total machine lockup w/ current kernels while installing from CD

On Thursday, 11. May 2006 03:22, Bernhard Rosenkraenzer wrote:
> Hi,
> I've built a CD that installs a customized system
> [... crash at a random point ...]
> BUG: soft lockup detected on CPU#0!
>
> Pid: 421, comm: kjournald
> EIP: 0060:[<b01a2f52>] CPU: 0
> EIP is at journal_commit_transaction+0x92e/0xfcc
> EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
> EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
> ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
> CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
> [<b02bd52e>] schedule+0x2ee/0x5b6
> [<b01a6a88>] kjournald+0x201/0x213
> [<b0111089>] smp_apic_timer_interrupt+0x32/0x49
> [<b01a6937>] kjournald+0xb0/0x213
> [<b01a5ffa>] commit_timeout+0x0/0x9
> [<b012a789>] autoremove_wake_function+0x0/0x4b
> [<b01a6887>] kjournald+0x0/0x213
> [<b0101005>] kernel_thread_helper+0x5/0xb

After backing out lots of changes, I've figured out the problem is caused by
this bit of 2.6.16-rc6:

diff -urN linux-2.6.16-rc5/kernel/sched.c linux-2.6.16-rc6/kernel/sched.c
--- linux-2.6.16-rc5/kernel/sched.c 2006-05-11 20:04:18.000000000 +0200
+++ linux-2.6.16-rc6/kernel/sched.c 2006-05-11 20:00:00.000000000 +0200
@@ -4028,6 +4021,8 @@
*/
if (unlikely(preempt_count()))
return;
+ if (unlikely(system_state != SYSTEM_RUNNING))
+ return;
do {
add_preempt_count(PREEMPT_ACTIVE);
schedule();


The problem is that (to save a couple of bits of space), my simple installer
was running inside an initrd -- and system_state isn't set to SYSTEM_RUNNING
before linuxrc is executed --> scheduler breakage causes the oops.

Since initrds are deprecated anyway, I've used this opportunity to update the
installer to use initramfs instead, and as expected, the problem went away --
nevertheless [unless initrds are being removed in 2.6.18 anyway] we should
probably fix this or at least document it somewhe.

2006-05-15 20:43:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [FIXED] Re: Total machine lockup w/ current kernels while installing from CD

Bernhard Rosenkraenzer <[email protected]> wrote:
>
> On Thursday, 11. May 2006 03:22, Bernhard Rosenkraenzer wrote:
> > Hi,
> > I've built a CD that installs a customized system
> > [... crash at a random point ...]
> > BUG: soft lockup detected on CPU#0!
> >
> > Pid: 421, comm: kjournald
> > EIP: 0060:[<b01a2f52>] CPU: 0
> > EIP is at journal_commit_transaction+0x92e/0xfcc
> > EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
> > EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
> > ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
> > CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
> > [<b02bd52e>] schedule+0x2ee/0x5b6
> > [<b01a6a88>] kjournald+0x201/0x213
> > [<b0111089>] smp_apic_timer_interrupt+0x32/0x49
> > [<b01a6937>] kjournald+0xb0/0x213
> > [<b01a5ffa>] commit_timeout+0x0/0x9
> > [<b012a789>] autoremove_wake_function+0x0/0x4b
> > [<b01a6887>] kjournald+0x0/0x213
> > [<b0101005>] kernel_thread_helper+0x5/0xb
>
> After backing out lots of changes, I've figured out the problem is caused by
> this bit of 2.6.16-rc6:
>
> diff -urN linux-2.6.16-rc5/kernel/sched.c linux-2.6.16-rc6/kernel/sched.c
> --- linux-2.6.16-rc5/kernel/sched.c 2006-05-11 20:04:18.000000000 +0200
> +++ linux-2.6.16-rc6/kernel/sched.c 2006-05-11 20:00:00.000000000 +0200
> @@ -4028,6 +4021,8 @@
> */
> if (unlikely(preempt_count()))
> return;
> + if (unlikely(system_state != SYSTEM_RUNNING))
> + return;
> do {
> add_preempt_count(PREEMPT_ACTIVE);
> schedule();

(That's cond_resched())

>
> The problem is that (to save a couple of bits of space), my simple installer
> was running inside an initrd -- and system_state isn't set to SYSTEM_RUNNING
> before linuxrc is executed --> scheduler breakage causes the oops.

ah-hah.

It's odd that we'll run initrds in a !SYSTEM_RUNNING state.

It's not an oops - it's sort-of a warning. Did the system actually
continue to run and boot up OK?

If so, I'd assume that the ext3 filesystem was mounted on a very slow
device - perhaps an IDE disk in PIO mode?

Perhaps we should poke the softlockup detector if someone called
cond_resched() when in a reschedulable state.

2006-05-15 20:53:12

by Bernhard Rosenkraenzer

[permalink] [raw]
Subject: Re: [FIXED] Re: Total machine lockup w/ current kernels while installing from CD

On Monday, 15. May 2006 22:45, Andrew Morton wrote:
> It's odd that we'll run initrds in a !SYSTEM_RUNNING state.

True, especially because we run initramfs in SYSTEM_RUNNING state.

> It's not an oops - it's sort-of a warning. Did the system actually
> continue to run and boot up OK?

No, it was a lockup and the system just hung at the point forever, so the
lockup detection was right.

> If so, I'd assume that the ext3 filesystem was mounted on a very slow
> device - perhaps an IDE disk in PIO mode?

That too - happened with a pretty stupid 5-liner installation script that just
formats the disk and installs a set of customized rpms from CD.

The hw we installed this on is Asus Pundit-R boxes, which means, basically,
weird IDE setup (no secondary IDE --> harddisk on hda, CD drive on hdb) on an
ATI IGP chipset, not exactly the fastest out there.

2006-05-15 21:09:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [FIXED] Re: Total machine lockup w/ current kernels while installing from CD

Bernhard Rosenkraenzer <[email protected]> wrote:
>
> On Monday, 15. May 2006 22:45, Andrew Morton wrote:
> > It's odd that we'll run initrds in a !SYSTEM_RUNNING state.
>
> True, especially because we run initramfs in SYSTEM_RUNNING state.
>
> > It's not an oops - it's sort-of a warning. Did the system actually
> > continue to run and boot up OK?
>
> No, it was a lockup and the system just hung at the point forever, so the
> lockup detection was right.

This is odd. Your machine hangs if cond_resched() is a no-op. This should
not happen.

What does `grep PREEMPT .config' say?