2006-09-07 18:20:46

by Keith Mannthey

Subject: [Bug] [2.6.18-rc5-mm1] system no boot early death x86_64

Hello,
I was booting rc4-mm3. With rc5-mm1 I am hanging early in boot. Mel, I don't
know if this is related to your code, but I will soon know. (I don't get
your debug info on the early console.)
I was working on patches for the reserve-based memory hot-add path in
srat.c (the initial error is fixed by Mel's patches, but there is more to
do) and was just moving to rc5-mm1 to sync up when I hit more trouble.
This is with reserve-based hot-add not enabled on the command line.


Linux version 2.6.18-rc5-mm1-smp (root@elm3a153) (gcc version 4.1.0 (SUSE Linux)) #2 SMP Wed Sep 6 21:04:22 EDT 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
end_pfn_map = 18219008
kernel direct mapping tables up to 1160000000 @ 8000-4f000
DMI 2.3 present.
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 38 -> Node 0
SRAT: PXM 0 -> APIC 39 -> Node 0
SRAT: PXM 0 -> APIC 36 -> Node 0
SRAT: PXM 0 -> APIC 37 -> Node 0
SRAT: PXM 1 -> APIC 64 -> Node 1
SRAT: PXM 1 -> APIC 65 -> Node 1
SRAT: PXM 1 -> APIC 66 -> Node 1
SRAT: PXM 1 -> APIC 67 -> Node 1
SRAT: PXM 1 -> APIC 102 -> Node 1
SRAT: PXM 1 -> APIC 103 -> Node 1
SRAT: PXM 1 -> APIC 100 -> Node 1
SRAT: PXM 1 -> APIC 101 -> Node 1
SRAT: Node 0 PXM 0 0-80000000
SRAT: Node 0 PXM 0 0-470000000
SRAT: Node 1 PXM 1 1070000000-1160000000
Bootmem setup node 0 0000000000000000-0000000470000000




2006-09-08 00:29:03

by Keith Mannthey

Subject: Re: [Bug] [2.6.18-rc5-mm1] system no boot early death x86_64

On Thu, 2006-09-07 at 11:20 -0700, keith mannthey wrote:
> Hello,
> I was booting rc4-mm3. With rc5-mm1 I am hanging early... Mel I don't
> know if this is related to your code but I will soon know. (I don't get
> your debug info in early console.)
> I was working on patches for the reserve based memory hot add path in
> srat.c (the initial error is fixed by Mel's patches but there is more to
> do) and was just moving to rc5-mm1 to sync up and then more trouble.
> This is with reserve based hot-add not enabled at the command line.


Well this isn't fully adding up but here is what I found.

If I drop
x86_64-mm-drop-640k-reservation.patch
x86_64-mm-remove-e820-fallback.patch
and
x86_64-mm-remove-e820-fallback-fix.patch

I build and boot. All files in the series up to
x86_64-mm-drop-640k-reservation.patch work just fine. Dropping this patch
makes things better. The e820 patches were removed to make the rest of the
series apply.

It is not clear what changes would cause me to die setting up the
bootmem allocator on my first node...

I know x86_64-mm-drop-640k-reservation.patch has been around for a
while.

Any ideas?

Thanks,
Keith

(from a working boot)

disabling early console
Linux version 2.6.18-rc5-mm1-smp (root@elm3a153) (gcc version 4.1.0 (SUSE Linux)) #13 SMP Thu Sep 7 19:15:00 EDT 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
end_pfn_map = 18219008
DMI 2.3 present.
ACPI: RSDP (v000 IBM ) @ 0x00000000000fdcf0
ACPI: RSDT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98800
ACPI: FADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98780
ACPI: MADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98600
ACPI: SRAT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff983c0
ACPI: HPET (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98380
ACPI: SSDT (v001 IBM VIGSSDT0 0x00001000 INTL 0x20030122) @ 0x000000007ff90780
ACPI: SSDT (v001 IBM VIGSSDT1 0x00001000 INTL 0x20030122) @ 0x000000007ff88bc0
ACPI: DSDT (v001 IBM EXA01ZEU 0x00001000 INTL 0x20030122) @ 0x0000000000000000
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 38 -> Node 0
SRAT: PXM 0 -> APIC 39 -> Node 0
SRAT: PXM 0 -> APIC 36 -> Node 0
SRAT: PXM 0 -> APIC 37 -> Node 0
SRAT: PXM 1 -> APIC 64 -> Node 1
SRAT: PXM 1 -> APIC 65 -> Node 1
SRAT: PXM 1 -> APIC 66 -> Node 1
SRAT: PXM 1 -> APIC 67 -> Node 1
SRAT: PXM 1 -> APIC 102 -> Node 1
SRAT: PXM 1 -> APIC 103 -> Node 1
SRAT: PXM 1 -> APIC 100 -> Node 1
SRAT: PXM 1 -> APIC 101 -> Node 1
SRAT: Node 0 PXM 0 0-80000000
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
SRAT: Node 0 PXM 0 0-470000000
Entering add_active_range(0, 0, 152) 2 entries of 3200 used
Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-1160000000
Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
NUMA: Using 36 for the hash shift.
Bootmem setup node 0 0000000000000000-0000000470000000
Bootmem setup node 1 0000001070000000-0000001160000000
Zone PFN ranges:
DMA 0 -> 4096
DMA32 4096 -> 1048576
Normal 1048576 -> 18219008
early_node_map[4] active PFN ranges
0: 0 -> 152
0: 256 -> 524165
0: 1048576 -> 4653056
1: 17235968 -> 18219008
On node 0 totalpages: 4128541
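
The PFN arguments in the add_active_range() lines above can be cross-checked against the BIOS e820 map. The following is an illustrative Python sketch, not kernel code. It assumes 4 KiB pages and assumes each usable region is trimmed to whole pages (start rounded up, end rounded down); the helper name usable_pfn_ranges is made up for this example.

```python
PAGE_SHIFT = 12              # assumption: 4 KiB pages, as on x86_64
PAGE_SIZE = 1 << PAGE_SHIFT

# (start, end, type) entries copied from the BIOS-e820 lines in the log.
E820_MAP = [
    (0x0000000000000000, 0x0000000000098400, "usable"),
    (0x0000000000098400, 0x00000000000a0000, "reserved"),
    (0x00000000000e0000, 0x0000000000100000, "reserved"),
    (0x0000000000100000, 0x000000007ff85e00, "usable"),
    (0x000000007ff85e00, 0x000000007ff98880, "ACPI data"),
    (0x000000007ff98880, 0x0000000080000000, "reserved"),
    (0x00000000fec00000, 0x0000000100000000, "reserved"),
    (0x0000000100000000, 0x0000000470000000, "usable"),
    (0x0000001070000000, 0x0000001160000000, "usable"),
]


def usable_pfn_ranges(e820_map):
    """Return (start_pfn, end_pfn) for each usable region, whole pages only."""
    ranges = []
    for start, end, kind in e820_map:
        if kind != "usable":
            continue
        start_pfn = (start + PAGE_SIZE - 1) >> PAGE_SHIFT  # round start up
        end_pfn = end >> PAGE_SHIFT                        # round end down
        if start_pfn < end_pfn:
            ranges.append((start_pfn, end_pfn))
    return ranges


ranges = usable_pfn_ranges(E820_MAP)

# The SRAT lines place node 0 at 0-470000000, so count pages below that bound.
NODE0_END_PFN = 0x470000000 >> PAGE_SHIFT
node0_pages = sum(end - start for start, end in ranges if end <= NODE0_END_PFN)

for start, end in ranges:
    print(f"{start} -> {end}")
print("node 0 pages:", node0_pages)
```

Under these assumptions the four ranges match the early_node_map entries in the log, and the node 0 sum agrees with the logged totalpages value of 4128541.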



2006-09-08 10:40:08

by Mel Gorman

Subject: Re: [Bug] [2.6.18-rc5-mm1] system no boot early death x86_64

On Thu, 7 Sep 2006, keith mannthey wrote:

> On Thu, 2006-09-07 at 11:20 -0700, keith mannthey wrote:
>> Hello,
>> I was booting rc4-mm3. With rc5-mm1 I am hanging early... Mel I don't
>> know if this is related to your code but I will soon know. (I don't get
>> your debug info in early console.)
>> I was working on patches for the reserve based memory hot add path in
>> srat.c (the initial error is fixed by Mel's patches but there is more to
>> do)

That is some good news at least.

>> and was just moving to rc5-mm1 to sync up and then more trouble.
>> This is with reserve based hot-add not enabled at the command line.
>
>
> Well this isn't fully adding up but here is what I found.
>
> If I drop
> x86_64-mm-drop-640k-reservation.patch
> x86_64-mm-remove-e820-fallback.patch
> and
> x86_64-mm-remove-e820-fallback-fix.patch
>
> I build and boot. All files in the series up to
> x86_64-mm-drop-640k-reservation.patch work just fine. Dropping this patch
> makes things better. The e820 patches were removed to make the rest of the
> series apply.
>

I am having trouble reproducing this. However, I recently got access to a
machine similar to yours. I can say that at times 2.6.18-rc4-mm3 and
2.6.18-rc5-mm1 were too unstable on it to be usable (though the symptoms
were different from yours), and the box would easily crash for reasons I
could not pin down. As stability problems had been reported on the machine
earlier by other users, I was inclined to blame the hardware. Now I'm not
sure.

> It is not clear what changes would cause me to die setting up the
> bootmem allocator on my first node...
>

Unless your machine really has something special in the low 640K that is
required and bad things happen if it's written to at a bad time.

> I know x86_64-mm-drop-640k-reservation.patch has been around for a
> while.
>
> any ideas?
>

None so far. I'll keep hitting the machine I have to see if I can find
something more useful, but I'm not very optimistic I'll pin it down.

> Thanks,
> Keith
>

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2006-09-09 14:46:09

by Andi Kleen

Subject: Re: [Bug] [2.6.18-rc5-mm1] system no boot early death x86_64

On Friday 08 September 2006 12:40, Mel Gorman wrote:
> On Thu, 7 Sep 2006, keith mannthey wrote:
> > On Thu, 2006-09-07 at 11:20 -0700, keith mannthey wrote:
> >> Hello,
> >> I was booting rc4-mm3. With rc5-mm1 I am hanging early... Mel I don't
> >> know if this is related to your code but I will soon know. (I don't get
> >> your debug info in early console.)
> >> I was working on patches for the reserve based memory hot add path in
> >> srat.c (the initial error is fixed by Mel's patches but there is more to
> >> do)
>
> That is some good news at least.
>
> >> and was just moving to rc5-mm1 to sync up and then more trouble.
> >
> >> This is with reserve based hot-add not enabled at the command line.
> >
> > Well this isn't fully adding up but here is what I found.
> >
> > If I drop
> > x86_64-mm-drop-640k-reservation.patch
> > x86_64-mm-remove-e820-fallback.patch
> > and
> > x86_64-mm-remove-e820-fallback-fix.patch
> >
> > I build and boot. All files in the series up to
> > x86_64-mm-drop-640k-reservation.patch work just fine. Dropping this patch
> > makes things better. The e820 patches were removed to make the rest of the
> > series apply.
>
> I am having trouble reproducing this. However, I recently got access to a
> machine similar to yours. I can say that sometimes the stability of
> 2.6.18-rc4-mm3 and 2.6.18-rc5-mm1 was totally useless (but the symptoms
> different to yours) and the box would easily crash for reasons I could not
> pin down. As stability problems had been reported on the machine earlier
> by other users, I was inclined to blame the hardware. Now I'm not sure.
>
> > It is not clear what changes would cause me to die setting up the
> > bootmem allocator on my first node...
>
> Unless your machine really has something special in the low 640K that is
> required and bad things happen if it's written to at a bad time.

That would be a BIOS bug then. If anything is there it has to be reserved.

But maybe it just breaks something that only worked by accident before.

-Andi
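
To make the point about the low 640K concrete, the e820 entries from the boot logs earlier in the thread can be tallied for everything below 0xa0000. This is only an illustrative Python sketch; the helper name low_memory_breakdown is hypothetical, and the two entries are copied from the BIOS-e820 lines in the logs.

```python
LOW_LIMIT = 0xA0000  # 640 KiB

# e820 entries that start below 640K, copied from the boot log.
E820_LOW = [
    (0x00000, 0x98400, "usable"),
    (0x98400, 0xA0000, "reserved"),
]


def low_memory_breakdown(entries, limit=LOW_LIMIT):
    """Sum bytes per e820 type below `limit`, clipping entries that cross it."""
    totals = {}
    for start, end, kind in entries:
        clipped_end = min(end, limit)
        if start < clipped_end:
            totals[kind] = totals.get(kind, 0) + (clipped_end - start)
    return totals


totals = low_memory_breakdown(E820_LOW)
for kind, size in totals.items():
    print(f"{kind}: {size} bytes ({size // 1024} KiB)")
```

On this machine the map marks the first 609 KiB as usable and only the last 31 KiB below 640K as reserved. If the firmware does depend on anything in the usable part, the map alone would not protect it, which is the case described above as a BIOS bug.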