2005-05-16 21:12:08

by Martin Bligh

Subject: 2.6.12-rc4-mm2 boot failure

PPC64 NUMA box. Maybe this is the same NUMA slab problem you were
hitting before ...

Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=32 NUMA PSERIES LPAR
Modules linked in:
NIP: C000000000099624 XER: 00000000 LR: C00000000009A014 CTR: C00000000028C0D4
REGS: c00000000057ba10 TRAP: 0700 Not tainted (2.6.12-rc4-mm2-autokern1)
MSR: 8000000000029032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 24004022
DAR: 8000000000009032 DSISR: c0000000006c82bf
TASK: c0000000005e2100[0] 'swapper' THREAD: c000000000578000 CPU: 0
GPR00: 0000000000000001 C00000000057BC90 C0000000006C0568 C00000077FFD2590
GPR04: 0000000000000000 FFFFFFFFFFFFFFFF C0000000006C83D0 C0000000005E3A24
GPR08: C0000000005E3A18 0000000000000000 C0000000006C83C8 C0000000006C82E8
GPR12: 000000000000000A C0000000005CD000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000230000 0000000003A10000 0000000000000060 0000000003F143C8
GPR24: C0000000005CD000 C0000000006BE208 C000000000577D68 0000000000008000
GPR28: 0000000000000000 00000000000080D0 C0000000005E2100 0000000000000001
NIP [c000000000099624] .interleave_nodes+0x38/0xd0
LR [c00000000009a014] .alloc_pages_current+0x100/0x134
Call Trace:
[c00000000057bc90] [000000000000001d] 0x1d (unreliable)
[c00000000057bd20] [c00000000009a014] .alloc_pages_current+0x100/0x134
[c00000000057bdc0] [c00000000007abd4] .get_zeroed_page+0x28/0x90
[c00000000057be40] [c0000000004e2e68] .pidmap_init+0x24/0xa0
[c00000000057bed0] [c0000000004c7734] .start_kernel+0x21c/0x30c
[c00000000057bf90] [c00000000000c010] .__setup_cpu_power3+0x0/0x4
Instruction dump:
fba1ffe8 fbc1fff0 f8010010 f821ff71 60000000 ebcd0160 a93e0788 793f0020
7fe9fe70 7d20fa78 7c004850 54000ffe <0b000000> 3ba30010 38bf0001 38800001
<0>Kernel panic - not syncing: Attempted to kill the idle task!


2005-05-16 21:28:27

by Andrew Morton

Subject: Re: 2.6.12-rc4-mm2 boot failure

"Martin J. Bligh" <[email protected]> wrote:
>
> PPC64 NUMA box. Maybe this is the same NUMA slab problem you were
> hitting before ...

Probably. Christoph, this patch has crossed the grief threshold - I'll
drop it.

2005-05-16 21:38:11

by Martin Bligh

Subject: Re: 2.6.12-rc4-mm2 boot failure



--On Monday, May 16, 2005 14:25:04 -0700 Andrew Morton <[email protected]> wrote:

> "Martin J. Bligh" <[email protected]> wrote:
>>
>> PPC64 NUMA box. Maybe this is the same NUMA slab problem you were
>> hitting before ...
>
> Probably. Christoph, this patch has crossed the grief threshold - I'll
> drop it.

OK, fair enough. Christoph, I am interested in seeing your patch work
... is something that's needed. If you want, I can help you offline
with some testing on a variety of platforms.

M.

2005-05-16 21:49:58

by Christoph Lameter

Subject: Re: 2.6.12-rc4-mm2 boot failure

On Mon, 16 May 2005, Martin J. Bligh wrote:

> --On Monday, May 16, 2005 14:25:04 -0700 Andrew Morton <[email protected]> wrote:
>
> > "Martin J. Bligh" <[email protected]> wrote:
> >>
> >> PPC64 NUMA box. Maybe this is the same NUMA slab problem you were
> >> hitting before ...
> >
> > Probably. Christoph, this patch has crossed the grief threshold - I'll
> > drop it.
>
> OK, fair enough. Christoph, I am interested in seeing your patch work
> ... is something that's needed. If you want, I can help you offline
> with some testing on a variety of platforms.

Some description of the failure would be helpful. A boot log? .config?

Does the box have CONFIG_NUMA off and CONFIG_DISCONTIG on?

2005-05-16 22:25:01

by Martin Bligh

Subject: Re: 2.6.12-rc4-mm2 boot failure

--On Monday, May 16, 2005 14:40:57 -0700 Christoph Lameter <[email protected]> wrote:

> On Mon, 16 May 2005, Martin J. Bligh wrote:
>
>> --On Monday, May 16, 2005 14:25:04 -0700 Andrew Morton <[email protected]> wrote:
>>
>> > "Martin J. Bligh" <[email protected]> wrote:
>> >>
>> >> PPC64 NUMA box. Maybe this is the same NUMA slab problem you were
>> >> hitting before ...
>> >
>> > Probably. Christoph, this patch has crossed the grief threshold - I'll
>> > drop it.
>>
>> OK, fair enough. Christoph, I am interested in seeing your patch work
>> ... is something that's needed. If you want, I can help you offline
>> with some testing on a variety of platforms.
>
> Some description of the failure would be helpful. A boot log? .config?
>
> Does the box have CONFIG_NUMA off and CONFIG_DISCONTIG on?

Attached boot log. Config file is here:

http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/p570

M.


Attachments:
(No filename) (932.00 B)
boot_failure.log (14.22 kB)

2005-05-17 23:23:00

by Martin Bligh

Subject: Re: 2.6.12-rc4-mm2 boot failure



--On Monday, May 16, 2005 14:36:21 -0700 "Martin J. Bligh" <[email protected]> wrote:

>
>
> --On Monday, May 16, 2005 14:25:04 -0700 Andrew Morton <[email protected]> wrote:
>
>> "Martin J. Bligh" <[email protected]> wrote:
>>>
>>> PPC64 NUMA box. Maybe this is the same NUMA slab problem you were
>>> hitting before ...
>>
>> Probably. Christoph, this patch has crossed the grief threshold - I'll
>> drop it.
>
> OK, fair enough. Christoph, I am interested in seeing your patch work
> ... is something that's needed. If you want, I can help you offline
> with some testing on a variety of platforms.

OK, I backed out the slab patches from -mm2, and confirmed the problem
went away.

M.

2005-05-18 01:07:35

by Christoph Lameter

Subject: Re: 2.6.12-rc4-mm2 boot failure

On Tue, 17 May 2005, Martin J. Bligh wrote:

> > OK, fair enough. Christoph, I am interested in seeing your patch work
> > ... is something that's needed. If you want, I can help you offline
> > with some testing on a variety of platforms.
>
> OK, I backed out the slab patches from -mm2, and confirmed the problem
> went away.

Is there any way I can access the system to figure out what is wrong? The
failure is in the page allocator and it seems that a node id is wrong.
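
The "node id is wrong" reading can be roughly cross-checked against the oops at the top of the thread. The sketch below decodes the bracketed word in the instruction dump, assuming that <0b000000> is the trapping instruction and that the standard PowerPC D-form field layout applies. The helper names decode_trap_immediate and would_trap are hypothetical and exist only for this illustration; they are not kernel functions.

```python
# Decode the bracketed word from the oops instruction dump ("<0b000000>").
# Assumption: standard PowerPC D-form layout (opcode, TO, RA, 16-bit SI).

MASK64 = (1 << 64) - 1

# TO field condition bits, most significant first.
TO_FLAGS = [
    (0x10, "lt"),   # signed less than
    (0x08, "gt"),   # signed greater than
    (0x04, "eq"),   # equal
    (0x02, "ltu"),  # unsigned less than
    (0x01, "gtu"),  # unsigned greater than
]


def decode_trap_immediate(word):
    """Split a 32-bit D-form word into opcode, TO, RA and signed SI."""
    opcode = (word >> 26) & 0x3F
    to = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    si = word & 0xFFFF
    if si & 0x8000:          # sign-extend the 16-bit immediate
        si -= 0x10000
    return {
        "mnemonic": {2: "tdi", 3: "twi"}.get(opcode, "unknown"),
        "to": to,
        "ra": ra,
        "si": si,
        "conditions": [name for bit, name in TO_FLAGS if to & bit],
    }


def to_signed64(value):
    """Interpret a 64-bit register value as a signed integer."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value


def would_trap(to, reg_value, si):
    """Evaluate the tdi conditions for a 64-bit register against SI."""
    signed_a, signed_b = to_signed64(reg_value), si
    unsigned_a, unsigned_b = reg_value & MASK64, si & MASK64
    return bool(
        (to & 0x10 and signed_a < signed_b)
        or (to & 0x08 and signed_a > signed_b)
        or (to & 0x04 and signed_a == signed_b)
        or (to & 0x02 and unsigned_a < unsigned_b)
        or (to & 0x01 and unsigned_a > unsigned_b)
    )


if __name__ == "__main__":
    fields = decode_trap_immediate(0x0B000000)
    gpr00 = 0x0000000000000001   # GPR00 as printed in the oops
    print(fields, would_trap(fields["to"], gpr00, fields["si"]))
```

Under those assumptions the word decodes as a trap-if-not-equal on r0 against zero, and GPR00 holds 1, so the trap fires. That matches TRAP: 0700 and sig: 5 in the oops, and it is the pattern one would expect from an assertion such as BUG_ON() failing inside interleave_nodes. Which check failed cannot be read from the log alone.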

2005-05-18 05:06:49

by Martin Bligh

Subject: Re: 2.6.12-rc4-mm2 boot failure



--Christoph Lameter <[email protected]> wrote (on Tuesday, May 17, 2005 18:07:16 -0700):

> On Tue, 17 May 2005, Martin J. Bligh wrote:
>
>> > OK, fair enough. Christoph, I am interested in seeing your patch work
>> > ... is something that's needed. If you want, I can help you offline
>> > with some testing on a variety of platforms.
>>
>> OK, I backed out the slab patches from -mm2, and confirmed the problem
>> went away.
>
> Is there any way I can access the system to figure out what is wrong? The
> failure is in the page allocator and it seems that a node id is wrong.

Not really - IBM doesn't tend to like letting outside parties into their
network ;-) I think OSDL might have some power boxes now ... maybe it
fails on there?

M.