2006-02-21 17:51:18

by Stas Sergeev

Subject: [patch] Re: 2.6.16-rc4-mm1 (bugs and lockups)

Hi.

The history is that -mm kernels have not worked for me
for a few months now. It started with crashes
somewhere after starting init, and for the
last month there has been no boot at all, just
"Uncompressing... OK, booting kernel", and silence.
The early console didn't work either.
With the latest releases this degraded into an infinite
stream of "Unknown interrupt or fault" messages.
So today my patience ran out and I started to think about how
I could collect at least some info for a bug report.
Attached is a patch that allows gathering some valuable
debug info on the problem by making the early console more
usable. I can't properly test the patch, as the kernel
still doesn't boot, so I'll explain it in detail in the
hope that someone else can justify the intrusive changes.


arch_hooks.h: added prototypes for setup_early_printk()
and early_printk().

head.S: added "hlt" to the dummy fault handler. This is
necessary because otherwise the fault retriggers infinitely,
causing an infinite stream of "Unknown interrupt or fault"
messages, which scrolls the useful info away. I don't know
if this is a safe change.

setup.c: killed wrong setup_early_printk() prototype.
Moved setup_early_printk() a bit earlier, as it was not
"early enough" to cover the bug I was fighting with.

early_printk.c: made it start printing from the bottom
of the screen; otherwise the messages interfere with
those of the boot loader, so you can't read them.
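
A minimal user-space sketch of the "print from the bottom" idea, not the
code from the attached patch. It assumes an 80x25 text screen held in an
ordinary array; the names screen, screen_row(), scroll_up() and
bottom_write() are hypothetical. New text goes only on the bottom row, and
existing lines (for example the boot loader's) scroll up instead of being
overwritten.

```c
#include <assert.h>
#include <string.h>

#define ROWS 25 /* assumption: classic 80x25 VGA text mode */
#define COLS 80

/* Stand-in for video memory; a real early console writes to the VGA buffer. */
char screen[ROWS * COLS];
int cur_col; /* cursor column on the bottom row */

/* Pointer to the first cell of row r. */
char *screen_row(int r)
{
    assert(r >= 0 && r < ROWS);
    return screen + r * COLS;
}

/* Move every row up by one and blank the bottom row. */
void scroll_up(void)
{
    memmove(screen, screen + COLS, (ROWS - 1) * COLS);
    memset(screen_row(ROWS - 1), ' ', COLS);
    cur_col = 0;
}

/* Write only on the bottom row; older text scrolls up, it is never overwritten. */
void bottom_write(const char *s)
{
    for (; *s; s++) {
        if (*s == '\n' || cur_col == COLS)
            scroll_up();
        if (*s != '\n')
            screen_row(ROWS - 1)[cur_col++] = *s;
    }
}
```

With a boot-loader line already on the screen, a call such as
bottom_write("early\n") leaves that line readable one row higher, and the
new message sits just above the blank bottom row.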

main.c: moved the smp_prepare_boot_cpu() call earlier. This
was necessary because otherwise printk() can't print:
it checks cpu_online(), which returns false. This change
is consistent with the UP case, where the boot CPU is
"online" from the very beginning, AFAICS. But again, I am
not entirely sure whether this is safe.
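
The effect described above can be modelled in user space. This is only a
sketch under assumptions: the online map is a plain bitmask, and every
model_* name is hypothetical rather than a kernel symbol. It shows why
output is silently dropped until the boot CPU is marked online.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_NR_CPUS 8 /* assumption: small fixed CPU count for the model */

unsigned long model_online_map; /* bit n set means CPU n is online */

int model_cpu_online(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return (int)((model_online_map >> cpu) & 1UL);
}

void model_set_cpu_online(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_online_map |= 1UL << cpu;
}

/* Copy msg to out only if 'cpu' is online; return 1 if emitted, 0 if dropped. */
int model_console_write(int cpu, const char *msg, char *out, size_t outlen)
{
    if (!model_cpu_online(cpu))
        return 0;
    snprintf(out, outlen, "%s", msg);
    return 1;
}
```

Calling model_console_write() for CPU 0 before model_set_cpu_online(0)
returns 0 and leaves the output buffer untouched; after it, the same call
emits the message.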


OK, so with that patch I was hoping to collect some debug
info. It turned out, though, that the main.c change also
fixes the problem itself. The lockup was happening in
__alloc_bootmem_core(): the "if (!size)" check was succeeding,
and the BUG() was triggering. After my main.c change this no
longer happens, but I don't know where the problem was.

I still can't boot the kernel because of this
http://www.uwsg.iu.edu/hypermail/linux/kernel/0602.2/1244.html
but at least I know that with the attached patch, the boot
process goes much further.

Just in case the patch is going to be applied:
Signed-off-by: Stas Sergeev <[email protected]>


Attachments:
bugearly1-16-rc4-mm1.diff (3.01 kB)

2006-02-22 01:33:45

by Andrew Morton

Subject: Re: [patch] Re: 2.6.16-rc4-mm1 (bugs and lockups)

Stas Sergeev <[email protected]> wrote:
>
> Hi.
>
> The history is that -mm kernels have not worked for me
> for a few months now. It started with crashes
> somewhere after starting init, and for the
> last month there has been no boot at all, just
> "Uncompressing... OK, booting kernel", and silence.
> The early console didn't work either.
> With the latest releases this degraded into an infinite
> stream of "Unknown interrupt or fault" messages.
> So today my patience ran out and I started to think about how
> I could collect at least some info for a bug report.
> Attached is a patch that allows gathering some valuable
> debug info on the problem by making the early console more
> usable. I can't properly test the patch, as the kernel
> still doesn't boot, so I'll explain it in detail in the
> hope that someone else can justify the intrusive changes.
>

It's unusual that the failure has been only in -mm, and for so long. That
would indicate that we have a problem in a for-mm-only patch.

And yet your patch applies OK to 2.6.16-rc4, so it's not obvious how this
got in there.

Did you never perform a bisection as per
http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt
to find out which patch in -mm was the offender?

> arch_hooks.h: added prototypes for setup_early_printk()
> and early_printk().
>
> head.S: added "hlt" to the dummy fault handler. This is
> necessary because otherwise the fault retriggers infinitely,
> causing an infinite stream of "Unknown interrupt or fault"
> messages, which scrolls the useful info away. I don't know
> if this is a safe change.
>
> setup.c: killed wrong setup_early_printk() prototype.
> Moved setup_early_printk() a bit earlier, as it was not
> "early enough" to cover the bug I was fighting with.
>
> early_printk.c: made it start printing from the bottom
> of the screen; otherwise the messages interfere with
> those of the boot loader, so you can't read them.
>
> main.c: moved the smp_prepare_boot_cpu() call earlier. This
> was necessary because otherwise printk() can't print:
> it checks cpu_online(), which returns false. This change
> is consistent with the UP case, where the boot CPU is
> "online" from the very beginning, AFAICS. But again, I am
> not entirely sure whether this is safe.
>

They all sound like good stuff - I'll take a look, thanks.

> OK, so with that patch I was hoping to collect some debug
> info. It turned out, though, that the main.c change also
> fixes the problem itself. The lockup was happening in
> __alloc_bootmem_core(): the "if (!size)" check was succeeding,
> and the BUG() was triggering. After my main.c change this no
> longer happens, but I don't know where the problem was.
>
> I still can't boot the kernel because of this
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0602.2/1244.html
> but at least I know that with the attached patch, the boot
> process goes much further.
>

Sorry, this was hot-fixed. See
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm1/hot-fixes/.
You'll want revert-register-sysfs-device-for-lp-devices.patch. You may as
well apply the others, too.

2006-02-22 01:45:48

by Andrew Morton

Subject: Re: [patch] Re: 2.6.16-rc4-mm1 (bugs and lockups)

Stas Sergeev <[email protected]> wrote:
>
> main.c: moved the smp_prepare_boot_cpu() call earlier. This
> was necessary because otherwise printk() can't print:
> it checks cpu_online(), which returns false. This change
> is consistent with the UP case, where the boot CPU is
> "online" from the very beginning, AFAICS. But again, I am
> not entirely sure whether this is safe.
>

Yeah, this is scary. Early boot is fragile and complex and architectures
might not expect to run smp_prepare_boot_cpu() before setup_arch().

umm, actually it's wrong. i386's smp_prepare_boot_cpu() diddles with
per-cpu memory, and that's not initialised at that stage. See the call to
setup_per_cpu_areas() a few lines later.
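
The ordering hazard can be illustrated with a toy user-space model. This
is a sketch only: the model_* names are hypothetical, the per-cpu area is
an ordinary heap array, and the explicit NULL check exists only to make
the problem visible. Real early-boot code has no such guard and would
simply touch storage that is not set up yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4 /* assumption: small fixed CPU count for the model */

/* Per-CPU state array; NULL until model_setup_per_cpu_areas() has run. */
int *model_cpu_state;

/* Allocate the per-CPU storage; returns 0 on success, -1 on failure. */
int model_setup_per_cpu_areas(void)
{
    model_cpu_state = calloc(MODEL_NR_CPUS, sizeof(*model_cpu_state));
    return model_cpu_state ? 0 : -1;
}

/* Mark 'cpu' as prepared; returns -1 if per-CPU storage is not ready yet. */
int model_prepare_boot_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    if (model_cpu_state == NULL)
        return -1; /* called too early: nothing to write to */
    model_cpu_state[cpu] = 1;
    return 0;
}
```

Calling model_prepare_boot_cpu(0) first fails; calling it after
model_setup_per_cpu_areas() succeeds, which mirrors the ordering
constraint in start_kernel().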

So I'll drop that hunk. How important is it in practice?

If it's purely to make printk print something then perhaps we can do
something expedient like:

#ifdef CONFIG_SMP
cpu_set(smp_processor_id(), cpu_online_map); /* comment */
#endif

right there in start_kernel()?

(That assumes that smp_processor_id() works at that stage. Surely that's
true).

2006-02-22 15:51:20

by Stas Sergeev

Subject: Re: [patch] Re: 2.6.16-rc4-mm1 (bugs and lockups)

Hello.

Andrew Morton wrote:
> umm, actually it's wrong. i386's smp_prepare_boot_cpu() diddles with
> per-cpu memory, and that's not initialised at that stage. See the call to
> setup_per_cpu_areas() a few lines later.
> So I'll drop that hunk. How important is it in practice?
It was important because it used to fix both the printk and
(completely accidentally!) the boot problem itself.

> #ifdef CONFIG_SMP
> cpu_set(smp_processor_id(), cpu_online_map); /* comment */
> #endif
I don't think the #ifdef is even needed. Doing that in the UP
case may be useless, yet it looks consistent to me.

> right there in start_kernel()?
This is enough for printk but not for the boot lockup.
The attached patch, however, is enough. And it should be
correct, as it is consistent with the UP case.

> (That assumes that smp_processor_id() works at that stage. Surely that's
> true).
Looking into the arch-specific code, I can see that some
arches determine the boot-cpu number by some other means,
not by smp_processor_id(). Still, I am pretty sure the
patch won't hurt them.

With this patch and with the hotfixes, I've got the -mm
kernel working, thanks.

----

Register the boot-cpu in the cpu maps earlier to allow the
early printk to work, and to fix an obscure deadlock at boot.

Signed-off-by: Stas Sergeev <[email protected]>


Attachments:
smpb.diff (981.00 B)