2008-12-24 07:56:06

by Chandru

[permalink] [raw]
Subject: 2.6.28-rc9 panics with crashkernel=256M while booting

On a ppc machine booting linux-2.6.28-rc9 with crashkernel=256M@32M boot
parameter causes the kernel to panic while booting. ?Following are the console
messages...

=================================
<snip>
...
Calling quiesce ...
returning from prom_init
Reserving 256MB of memory at 32MB for crashkernel (System RAM: 4096MB)
Phyp-dump disabled at boot time
Using pSeries machine description
Using 1TB segments
Found initrd at 0xc000000000e00000:0xc0000000015ab50c
console [udbg0] enabled
Partition configured for 4 cpus.
CPU maps initialized for 2 threads per core
Starting Linux PPC64 #3 SMP Mon Dec 22 02:16:31 CST 2008
-----------------------------------------------------
ppc64_pft_size ? ? ? ? ? ? ? ?= 0x1a
physicalMemorySize ? ? ? ? ? ?= 0x100000000
htab_hash_mask ? ? ? ? ? ? ? ?= 0x7ffff
-----------------------------------------------------
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.28-rc9-1-ppc64 (root@rulerlp10) (gcc version 4.3.2
[gcc-4_3-branch revision 141291] (SUSE Linux) ) #3 SMP Mon Dec 22 02:16:31 CST
2008
[boot]0012 Setup Arch
------------[ cut here ]------------
kernel BUG at mm/bootmem.c:279!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=1024 NUMA pSeries
Modules linked in:
NIP: c000000000747538 LR: c000000000731d1c CTR: c000000000730cf4
REGS: c000000000993a00 TRAP: 0700 ? Not tainted ?(2.6.28-rc9-1-ppc64)
MSR: 8000000000021032 <ME,IR,DR> ?CR: 24002082 ?XER: 00000001
TASK = c0000000008dfaa0[0] 'swapper' THREAD: c000000000990000 CPU: 0
GPR00: 0000000000000001 c000000000993c80 c00000000098a768 c0000000007664d0
GPR04: 00000000000001f7 0000000000001001 0000000000000001 0000000000000000
GPR08: 0000000000000000 0000000000000000 c0000000008dd6c0 0000000000000000
GPR12: 0000000024002088 c000000000a52c00 c000000000763990 c00000000067f735
GPR16: 00000000034638c8 0000000000000000 c000000000993d98 c000000000993d90
GPR20: c000000000993da0 c000000000c32bc8 c000000001f78800 c000000000000000
GPR24: 0000000000000004 0000000010087800 0000000000000000 0000000000000001
GPR28: 00000000000001f7 0000000000001001 c000000000910cb0 c0000000007664d0
NIP [c000000000747538] .mark_bootmem_node+0xa8/0x120
LR [c000000000731d1c] .do_init_bootmem+0xc1c/0xd58
Call Trace:
[c000000000993c80] [c000000000993d20] init_thread_union+0x3d20/0x4000
(unreliable)
[c000000000993d20] [c000000000731d1c] .do_init_bootmem+0xc1c/0xd58
[c000000000993e40] [c0000000007267f0] .setup_arch+0x1a4/0x21c
[c000000000993ee0] [c000000000720868] .start_kernel+0xdc/0x56c
[c000000000993f90] [c0000000000083b8] .start_here_common+0x1c/0x64
Instruction dump:
7ca501d2 4bda924d 60000000 e93f0000 7c09e010 7c000110 7c0000d0 0b000000
e81f0008 7c1d0010 7c000110 7c0000d0 <0b000000> 2fbb0000 7ca9e850 7c89e050
---[ end trace 31fd0ba7d8756001 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
------------[ cut here ]------------
Badness at kernel/smp.c:333
NIP: c0000000000b9384 LR: c0000000000b96ac CTR: 0000000000136f8c
REGS: c000000000992eb0 TRAP: 0700 ? Tainted: G ? ? ?D ? ? (2.6.28-rc9-1-ppc64)
MSR: 8000000000021032 <ME,IR,DR> ?CR: 28002084 ?XER: 00000001
TASK = c0000000008dfaa0[0] 'swapper' THREAD: c000000000990000 CPU: 0
GPR00: 0000000000000001 c000000000993130 c00000000098a768 0000000000000001
GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000028002082 c000000000a52c00 c000000000763990 c00000000067f735
GPR16: 00000000034638c8 0000000000000000 c000000000993d98 c000000000993d90
GPR20: c000000000993da0 c000000000c32bc8 c000000001f78800 0000000000000000
GPR24: c000000000926320 0000000000000000 0000000000000000 0000000000000000
GPR28: 0000000000000000 0000000000000000 c00000000090f4f8 c000000000af8380
NIP [c0000000000b9384] .smp_call_function_mask+0x6c/0x2f4
LR [c0000000000b96ac] .smp_call_function+0xa0/0xcc
Call Trace:
[c000000000993130] [c0000000009931d0] init_thread_union+0x31d0/0x4000
(unreliable)
[c000000000993420] [c0000000000b96ac] .smp_call_function+0xa0/0xcc
[c000000000993530] [c00000000002ce0c] .smp_send_stop+0x24/0x3c
[c0000000009935b0] [c0000000004f0658] .panic+0x98/0x198
[c000000000993640] [c00000000008fe18] .do_exit+0xa0/0x900
[c000000000993730] [c000000000027984] .die+0x274/0x278
[c0000000009937d0] [c000000000027c78] ._exception+0x88/0x20c
[c000000000993990] [c000000000004f8c] program_check_common+0x10c/0x180
--- Exception: 700 at .mark_bootmem_node+0xa8/0x120
? ? LR = .do_init_bootmem+0xc1c/0xd58
[c000000000993c80] [c000000000993d20] init_thread_union+0x3d20/0x4000
(unreliable)
[c000000000993d20] [c000000000731d1c] .do_init_bootmem+0xc1c/0xd58
[c000000000993e40] [c0000000007267f0] .setup_arch+0x1a4/0x21c
[c000000000993ee0] [c000000000720868] .start_kernel+0xdc/0x56c
[c000000000993f90] [c0000000000083b8] .start_here_common+0x1c/0x64
Instruction dump:
eae103a8 eb4103b6 f8810328 f8a10330 f8c10338 f8e10340 f9010348 f9210350
f9410358 880d01da 7c000074 7800d182 <0b000000> 3b8100d8 a3ad000a e89e8028
Rebooting in 180 seconds.. ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
=================================



When booted with crashkernel=224M@32M or any memory size less than this, the
system boots properly. The following was the observation..
The system comes up with two nodes (0-256M and 256M-4GB). ?The crashkernel
memory reservation spans across these two nodes. ?The
mark_reserved_regions_for_nid() in arch/powerpc/mm/numa.c resizes the reserved
part of the memory within it as...
...
...
? ? ? ? ? ? if (end_pfn > node_ar.end_pfn)
? ? ? ? ? ? ? ? reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
? ? ? ? ? ? ? ? ? ? - (start_pfn << PAGE_SHIFT);


but the reserve_bootmem_node() in mm/bootmem.c raises the pfn value of end

? ? end = PFN_UP(physaddr + size);

This causes end to get a value past the last page in the 0-256M node. ?Again
when reserve_bootmem_node() returns, ?mark_reserved_regions_for_nid() loops
around to set the rest of the crashkernel memory in the next node as reserved. ?
It references NODE_DATA(node_ar.nid) and this causes another 'Oops: kernel
access of bad area' problem. The following changes made the system to boot
with any amount of crashkernel memory size.


Fix code for reserved memory spanning across nodes

Signed-off-by: Chandru S <[email protected]>
---

--- linux-2.6.28-rc9//arch/powerpc/mm/numa.c.orig 2008-12-22
04:23:24.000000000 -0600
+++ linux-2.6.28-rc9/arch/powerpc/mm/numa.c 2008-12-22 04:24:25.000000000 -0600
@@ -995,10 +995,11 @@ void __init do_init_bootmem(void)
start_pfn, end_pfn);

free_bootmem_with_active_regions(nid, end_pfn);
+ }
+
+ for_each_online_node(nid) {
/*
- * Be very careful about moving this around. Future
- * calls to careful_allocation() depend on this getting
- * done correctly.
+ * Be very careful about moving this around.
*/
mark_reserved_regions_for_nid(nid);
sparse_memory_present_with_active_regions(nid);
--- linux-2.6.28-rc9/mm/bootmem.c.orig 2008-12-19 10:49:24.000000000 -0600
+++ linux-2.6.28-rc9/mm/bootmem.c 2008-12-19 10:49:33.000000000 -0600
@@ -375,10 +375,14 @@ int __init reserve_bootmem_node(pg_data_
unsigned long size, int flags)
{
unsigned long start, end;
+ bootmem_data_t *bdata = pgdat->bdata;

start = PFN_DOWN(physaddr);
end = PFN_UP(physaddr + size);

+ if (end > bdata->node_low_pfn)
+ end = bdata->node_low_pfn;
+
return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
}


2008-12-25 07:36:32

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

(cc's added)

On Wed, 24 Dec 2008 13:25:49 +0530 Chandru <[email protected]> wrote:

> On a ppc machine booting linux-2.6.28-rc9 with crashkernel=256M@32M boot
> parameter causes the kernel to panic while booting. __Following are the console
> messages...

- Please put [patch] in the Subject: line of patches

- Please choose a suitable title, as per
Documentation/SubmittingPatches, section 15.

- Please cc suitable mailing lists and maintainers on bug reports and
on patches.



From: Chandru <[email protected]>

When booted with crashkernel=224M@32M or any memory size less than this,
the system boots properly. The following was the observation.. The
system comes up with two nodes (0-256M and 256M-4GB). _The crashkernel
memory reservation spans across these two nodes. _The
mark_reserved_regions_for_nid() in arch/powerpc/mm/numa.c resizes the
reserved part of the memory within it as:

_ _ _ _ _ _ if (end_pfn > node_ar.end_pfn)
_ _ _ _ _ _ _ _ reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
_ _ _ _ _ _ _ _ _ _ - (start_pfn << PAGE_SHIFT);


but the reserve_bootmem_node() in mm/bootmem.c raises the pfn value of end

_ _ end = PFN_UP(physaddr + size);

This causes end to get a value past the last page in the 0-256M node.
_Again when reserve_bootmem_node() returns,
_mark_reserved_regions_for_nid() loops around to set the rest of the
crashkernel memory in the next node as reserved. _ It references
NODE_DATA(node_ar.nid) and this causes another 'Oops: kernel access of bad
area' problem. The following changes made the system to boot with any
amount of crashkernel memory size.

Signed-off-by: Chandru S <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/powerpc/mm/numa.c | 7 ++++---
mm/bootmem.c | 4 ++++
2 files changed, 8 insertions(+), 3 deletions(-)

diff -puN arch/powerpc/mm/numa.c~2628-rc9-panics-with-crashkernel=256m-while-booting arch/powerpc/mm/numa.c
--- a/arch/powerpc/mm/numa.c~2628-rc9-panics-with-crashkernel=256m-while-booting
+++ a/arch/powerpc/mm/numa.c
@@ -995,10 +995,11 @@ void __init do_init_bootmem(void)
start_pfn, end_pfn);

free_bootmem_with_active_regions(nid, end_pfn);
+ }
+
+ for_each_online_node(nid) {
/*
- * Be very careful about moving this around. Future
- * calls to careful_allocation() depend on this getting
- * done correctly.
+ * Be very careful about moving this around.
*/
mark_reserved_regions_for_nid(nid);
sparse_memory_present_with_active_regions(nid);
diff -puN mm/bootmem.c~2628-rc9-panics-with-crashkernel=256m-while-booting mm/bootmem.c
--- a/mm/bootmem.c~2628-rc9-panics-with-crashkernel=256m-while-booting
+++ a/mm/bootmem.c
@@ -375,10 +375,14 @@ int __init reserve_bootmem_node(pg_data_
unsigned long size, int flags)
{
unsigned long start, end;
+ bootmem_data_t *bdata = pgdat->bdata;

start = PFN_DOWN(physaddr);
end = PFN_UP(physaddr + size);

+ if (end > bdata->node_low_pfn)
+ end = bdata->node_low_pfn;
+
return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
}

_

2008-12-25 08:07:55

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Wed, 24 Dec 2008 23:35:36 -0800 Andrew Morton <[email protected]> wrote:

> - Please put [patch] in the Subject: line of patches
>
> - Please choose a suitable title, as per
> Documentation/SubmittingPatches, section 15.
>
> - Please cc suitable mailing lists and maintainers on bug reports and
> on patches.

Also the patch was wordwrapped and the changelog was filled with weird
UTF8 characters.

I think I have it all cleaned up now.


From: Chandru <[email protected]>

When booted with crashkernel=224M@32M or any memory size less than this,
the system boots properly. The following was the observation.. The
system comes up with two nodes (0-256M and 256M-4GB). The crashkernel
memory reservation spans across these two nodes. The
mark_reserved_regions_for_nid() in arch/powerpc/mm/numa.c resizes the
reserved part of the memory within it as:

if (end_pfn > node_ar.end_pfn)
reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
- (start_pfn << PAGE_SHIFT);


but the reserve_bootmem_node() in mm/bootmem.c raises the pfn value of end

end = PFN_UP(physaddr + size);

This causes end to get a value past the last page in the 0-256M node.
Again when reserve_bootmem_node() returns, mark_reserved_regions_for_nid()
loops around to set the rest of the crashkernel memory in the next node as
reserved. It references NODE_DATA(node_ar.nid) and this causes another
'Oops: kernel access of bad area' problem. The following changes made the
system to boot with any amount of crashkernel memory size.

Signed-off-by: Chandru S <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/powerpc/mm/numa.c | 7 ++++---
mm/bootmem.c | 4 ++++
2 files changed, 8 insertions(+), 3 deletions(-)

diff -puN arch/powerpc/mm/numa.c~powerpc-fix-code-for-reserved-memory-spanning-across-nodes arch/powerpc/mm/numa.c
--- a/arch/powerpc/mm/numa.c~powerpc-fix-code-for-reserved-memory-spanning-across-nodes
+++ a/arch/powerpc/mm/numa.c
@@ -995,10 +995,11 @@ void __init do_init_bootmem(void)
start_pfn, end_pfn);

free_bootmem_with_active_regions(nid, end_pfn);
+ }
+
+ for_each_online_node(nid) {
/*
- * Be very careful about moving this around. Future
- * calls to careful_allocation() depend on this getting
- * done correctly.
+ * Be very careful about moving this around.
*/
mark_reserved_regions_for_nid(nid);
sparse_memory_present_with_active_regions(nid);
diff -puN mm/bootmem.c~powerpc-fix-code-for-reserved-memory-spanning-across-nodes mm/bootmem.c
--- a/mm/bootmem.c~powerpc-fix-code-for-reserved-memory-spanning-across-nodes
+++ a/mm/bootmem.c
@@ -375,10 +375,14 @@ int __init reserve_bootmem_node(pg_data_
unsigned long size, int flags)
{
unsigned long start, end;
+ bootmem_data_t *bdata = pgdat->bdata;

start = PFN_DOWN(physaddr);
end = PFN_UP(physaddr + size);

+ if (end > bdata->node_low_pfn)
+ end = bdata->node_low_pfn;
+
return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
}

_

2008-12-26 00:59:44

by Paul Mackerras

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

Andrew Morton writes:

> On Wed, 24 Dec 2008 13:25:49 +0530 Chandru <[email protected]> wrote:
>
> > On a ppc machine booting linux-2.6.28-rc9 with crashkernel=256M@32M boot
> > parameter causes the kernel to panic while booting. __Following are the console
> > messages...
>
> - Please put [patch] in the Subject: line of patches
>
> - Please choose a suitable title, as per
> Documentation/SubmittingPatches, section 15.
>
> - Please cc suitable mailing lists and maintainers on bug reports and
> on patches.

Dave Hansen was working on this code recently, and this looks a bit
similar to some changes he was making. Dave, what's your opinion on
this patch?

Paul.

> From: Chandru <[email protected]>
>
> When booted with crashkernel=224M@32M or any memory size less than this,
> the system boots properly. The following was the observation.. The
> system comes up with two nodes (0-256M and 256M-4GB). _The crashkernel
> memory reservation spans across these two nodes. _The
> mark_reserved_regions_for_nid() in arch/powerpc/mm/numa.c resizes the
> reserved part of the memory within it as:
>
> _ _ _ _ _ _ if (end_pfn > node_ar.end_pfn)
> _ _ _ _ _ _ _ _ reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
> _ _ _ _ _ _ _ _ _ _ - (start_pfn << PAGE_SHIFT);
>
>
> but the reserve_bootmem_node() in mm/bootmem.c raises the pfn value of end
>
> _ _ end = PFN_UP(physaddr + size);
>
> This causes end to get a value past the last page in the 0-256M node.
> _Again when reserve_bootmem_node() returns,
> _mark_reserved_regions_for_nid() loops around to set the rest of the
> crashkernel memory in the next node as reserved. _ It references
> NODE_DATA(node_ar.nid) and this causes another 'Oops: kernel access of bad
> area' problem. The following changes made the system to boot with any
> amount of crashkernel memory size.
>
> Signed-off-by: Chandru S <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> arch/powerpc/mm/numa.c | 7 ++++---
> mm/bootmem.c | 4 ++++
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff -puN arch/powerpc/mm/numa.c~2628-rc9-panics-with-crashkernel=256m-while-booting arch/powerpc/mm/numa.c
> --- a/arch/powerpc/mm/numa.c~2628-rc9-panics-with-crashkernel=256m-while-booting
> +++ a/arch/powerpc/mm/numa.c
> @@ -995,10 +995,11 @@ void __init do_init_bootmem(void)
> start_pfn, end_pfn);
>
> free_bootmem_with_active_regions(nid, end_pfn);
> + }
> +
> + for_each_online_node(nid) {
> /*
> - * Be very careful about moving this around. Future
> - * calls to careful_allocation() depend on this getting
> - * done correctly.
> + * Be very careful about moving this around.
> */
> mark_reserved_regions_for_nid(nid);
> sparse_memory_present_with_active_regions(nid);
> diff -puN mm/bootmem.c~2628-rc9-panics-with-crashkernel=256m-while-booting mm/bootmem.c
> --- a/mm/bootmem.c~2628-rc9-panics-with-crashkernel=256m-while-booting
> +++ a/mm/bootmem.c
> @@ -375,10 +375,14 @@ int __init reserve_bootmem_node(pg_data_
> unsigned long size, int flags)
> {
> unsigned long start, end;
> + bootmem_data_t *bdata = pgdat->bdata;
>
> start = PFN_DOWN(physaddr);
> end = PFN_UP(physaddr + size);
>
> + if (end > bdata->node_low_pfn)
> + end = bdata->node_low_pfn;
> +
> return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
> }
>
> _

2008-12-26 06:36:13

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Thursday 25 December 2008 13:37:35 Andrew Morton wrote:
> On Wed, 24 Dec 2008 23:35:36 -0800 Andrew Morton <[email protected]>
wrote:
> > - Please put [patch] in the Subject: line of patches
> >
> > - Please choose a suitable title, as per
> > Documentation/SubmittingPatches, section 15.
> >
> > - Please cc suitable mailing lists and maintainers on bug reports and
> > on patches.
>
> Also the patch was wordwrapped and the changelog was filled with weird
> UTF8 characters.
>
> I think I have it all cleaned up now.
>

Hello Andrew,

Thank you very much for cleaning the patch. It was a result of using two email
clients at my end. Thank you again!.

Chandru

2008-12-29 21:36:22

by Dave Hansen

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Fri, 2008-12-26 at 11:59 +1100, Paul Mackerras wrote:
>
> > diff -puN arch/powerpc/mm/numa.c~2628-rc9-panics-with-crashkernel=256m-while-booting arch/powerpc/mm/numa.c
> > --- a/arch/powerpc/mm/numa.c~2628-rc9-panics-with-crashkernel=256m-while-booting
> > +++ a/arch/powerpc/mm/numa.c
> > @@ -995,10 +995,11 @@ void __init do_init_bootmem(void)
> > start_pfn, end_pfn);
> >
> > free_bootmem_with_active_regions(nid, end_pfn);
> > + }
> > +
> > + for_each_online_node(nid) {
> > /*
> > - * Be very careful about moving this around. Future
> > - * calls to careful_allocation() depend on this getting
> > - * done correctly.
> > + * Be very careful about moving this around.
> > */
> > mark_reserved_regions_for_nid(nid);
> > sparse_memory_present_with_active_regions(nid);

I think this reintroduces one of the bugs that I squashed. You *have*
to call mark_reserved_regions_for_nid() right after you do
free_bootmem_with_active_regions(). Otherwise, someone else can
bootmem_alloc() a reserved region from that node.

Perhaps I need to make that comment a bit more forceful. :)

/*
* Don't break this loop out. Period. Never. Ever.
* No, seriously. Don't do it. I mean it. Really!
*/

-- Dave

2009-01-05 13:50:11

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Tuesday 30 December 2008 03:06:07 Dave Hansen wrote:
> On Fri, 2008-12-26 at 11:59 +1100, Paul Mackerras wrote:
> > > + }
> > > +
> > > + for_each_online_node(nid) {
> > > /*
> > > - * Be very careful about moving this around. Future
> > > - * calls to careful_allocation() depend on this getting
> > > - * done correctly.
> > > + * Be very careful about moving this around.
> > > */
> > > mark_reserved_regions_for_nid(nid);
> > > sparse_memory_present_with_active_regions(nid);
>
> I think this reintroduces one of the bugs that I squashed. You *have*
> to call mark_reserved_regions_for_nid() right after you do
> free_bootmem_with_active_regions(). Otherwise, someone else can
> bootmem_alloc() a reserved region from that node.

Thanks for the review comments Dave. With the commit:a4c74ddd5ea3db53fc73d29c222b22656a7d05be, I see this has been taken care in mark_reserved_regions_for_nid(). In that case we may only need the change made in reserve_bootmem_node().

Hello Andrew,
Could you please consider the following patch instead of the original patch in this thread.
Thanks,


When booted with crashkernel=224M@32M or any memory size less than this, the system boots properly. The system comes up with two nodes (0-256M and 256M-4GB). The crashkernel memory reservation spans across these two nodes. The mark_reserved_regions_for_nid() in arch/powerpc/numa.c resizes the reserved part of the memory within it as...

if (end_pfn > node_ar.end_pfn)
reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
- (start_pfn << PAGE_SHIFT);

but the reserve_bootmem_node() in mm/bootmem.c raises the pfn value of end

end = PFN_UP(physaddr + size);

This causes end to get a value past the last page in the 0-256M node. The following change restricts the value of end if it exceeds the last pfn in a given node.

Signed-off-by: Chandru S <[email protected]>
Cc: Dave Hansen <[email protected]>
---

mm/bootmem.c | 4 ++++
1 file changed, 4 insertions(+)


--- linux-2.6.28/mm/bootmem.c.orig 2009-01-05 20:42:12.000000000 +0530
+++ linux-2.6.28/mm/bootmem.c 2009-01-05 20:43:53.000000000 +0530
@@ -375,10 +375,14 @@ int __init reserve_bootmem_node(pg_data_
unsigned long size, int flags)
{
unsigned long start, end;
+ bootmem_data_t *bdata = pgdat->bdata;

start = PFN_DOWN(physaddr);
end = PFN_UP(physaddr + size);

+ if (end > bdata->node_low_pfn)
+ end = bdata->node_low_pfn;
+
return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
}

2009-01-05 16:30:49

by Dave Hansen

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Mon, 2009-01-05 at 19:19 +0530, Chandru wrote:
> On Tuesday 30 December 2008 03:06:07 Dave Hansen wrote:
> When booted with crashkernel=224M@32M or any memory size less than
> this, the system boots properly. The system comes up with two nodes
> (0-256M and 256M-4GB). The crashkernel memory reservation spans across
> these two nodes. The mark_reserved_regions_for_nid() in
> arch/powerpc/numa.c resizes the reserved part of the memory within it
> as...
>
> if (end_pfn > node_ar.end_pfn)
> reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
> - (start_pfn << PAGE_SHIFT);
>
> but the reserve_bootmem_node() in mm/bootmem.c raises the pfn value of
> end
>
> end = PFN_UP(physaddr + size);
>
> This causes end to get a value past the last page in the 0-256M node.
> The following change restricts the value of end if it exceeds the last
> pfn in a given node.

OK, I had to think about this for a good, long time. That's bad. :)

There are two things that we're dealing with here: "active regions" and
the NODE_DATA's. The if() you've pasted above resizes the reservation
so that it fits into the current active region. However, as you noted,
we haven't resized it so that it fits into the NODE_DATA() that we're
looking at. We call into the bootmem code, and BUG_ON().

The thing I don't like about this is that it might hide bugs in other
callers. This really is a ppc-specific thing and, although what you
wrote will fix the bug on ppc, it will probably cause someone in the
future to call reserve_bootmem_node() with too large a reservation and
get a silent failure (not reserving the requested size) back.

We really do need to go take a hard look at the whole interaction
between lmb's, node active regions, and the NUMA code some day. It has
kinda grown to be a bit ungainly.

How about we just consult the NODE_DATA() in
mark_reserved_regions_for_nid() instead of reserve_bootmem_node()?

> --- linux-2.6.28/mm/bootmem.c.orig 2009-01-05 20:42:12.000000000 +0530
> +++ linux-2.6.28/mm/bootmem.c 2009-01-05 20:43:53.000000000 +0530
> @@ -375,10 +375,14 @@ int __init reserve_bootmem_node(pg_data_
> unsigned long size, int flags)
> {
> unsigned long start, end;
> + bootmem_data_t *bdata = pgdat->bdata;
>
> start = PFN_DOWN(physaddr);
> end = PFN_UP(physaddr + size);
>
> + if (end > bdata->node_low_pfn)
> + end = bdata->node_low_pfn;
> +
> return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
> }
-- Dave

2009-01-07 12:58:41

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Monday 05 January 2009 22:00:33 Dave Hansen wrote:
> OK, I had to think about this for a good, long time. That's bad. :)
>
> There are two things that we're dealing with here: "active regions" and
> the NODE_DATA's. The if() you've pasted above resizes the reservation
> so that it fits into the current active region. However, as you noted,
> we haven't resized it so that it fits into the NODE_DATA() that we're
> looking at. We call into the bootmem code, and BUG_ON().
>
> The thing I don't like about this is that it might hide bugs in other
> callers. This really is a ppc-specific thing and, although what you
> wrote will fix the bug on ppc, it will probably cause someone in the
> future to call reserve_bootmem_node() with too large a reservation and
> get a silent failure (not reserving the requested size) back.
>
> We really do need to go take a hard look at the whole interaction
> between lmb's, node active regions, and the NUMA code some day. It has
> kinda grown to be a bit ungainly.
>
> How about we just consult the NODE_DATA() in
> mark_reserved_regions_for_nid() instead of reserve_bootmem_node()?

I don't know how you wanted NODE_DATA() to be consulted here. i.e before
calling reserve_bootmem_node() should we have a condition

if (PFN_UP(physbase+reserve_size) > node_end_pfn)
then
resize reserve_size again so that PFN_UP() will equate to node_end_pfn ??
end

Also I was wondering if in reserve_bootmem_node()
end = PFN_DOWN() ; will do..

With the recent changes from you that went into 2.6.28 stable
(commit:a4c74ddd5ea3db53fc73d29c222b22656a7d05be), it worked on the system
with PFN_DOWN().

Thanks,
Chandru

2009-01-07 17:25:46

by Dave Hansen

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Wed, 2009-01-07 at 18:28 +0530, Chandru wrote:
> I don't know how you wanted NODE_DATA() to be consulted here. i.e before
> calling reserve_bootmem_node() should we have a condition
>
> if (PFN_UP(physbase+reserve_size) > node_end_pfn)
> then
> resize reserve_size again so that PFN_UP() will equate to node_end_pfn ??
> end

I'm just suggesting making your fix in the ppc code instead of in
mm/bootmem.c.

-- Dave

2009-01-08 10:29:33

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Wednesday 07 January 2009 22:55:25 Dave Hansen wrote:
>
> I'm just suggesting making your fix in the ppc code instead of in
> mm/bootmem.c.
>

Here are the changes that helped to boot the kernel. Please review it.
Thanks,

Signed-off-by: Chandru S <[email protected]>
Cc: Dave Hansen <[email protected]>
---

arch/powerpc/mm/numa.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

--- linux-2.6.28/arch/powerpc/mm/numa.c.orig 2009-01-08 03:20:41.000000000 -0600
+++ linux-2.6.28/arch/powerpc/mm/numa.c 2009-01-08 03:50:41.000000000 -0600
@@ -16,6 +16,7 @@
#include <linux/module.h>
#include <linux/nodemask.h>
#include <linux/cpu.h>
+#include <linux/pfn.h>
#include <linux/notifier.h>
#include <linux/lmb.h>
#include <linux/of.h>
@@ -898,9 +899,17 @@ static void mark_reserved_regions_for_ni
* if reserved region extends past active region
* then trim size to active region
*/
- if (end_pfn > node_ar.end_pfn)
+ if (end_pfn > node_ar.end_pfn) {
reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
- (start_pfn << PAGE_SHIFT);
+ /*
+ * resize it further if the reservation could
+ * cross the last page in this node
+ */
+ if (PFN_UP(physbase+reserve_size) >
+ node_end_pfn)
+ reserve_size -= PAGE_SIZE;
+ }
/*
* Only worry about *this* node, others may not
* yet have valid NODE_DATA().

2009-01-08 20:03:35

by Dave Hansen

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Thu, 2009-01-08 at 15:59 +0530, Chandru wrote:
> @@ -898,9 +899,17 @@ static void mark_reserved_regions_for_ni
> * if reserved region extends past active region
> * then trim size to active region
> */
> - if (end_pfn > node_ar.end_pfn)
> + if (end_pfn > node_ar.end_pfn) {
> reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
> - (start_pfn << PAGE_SHIFT);
> + /*
> + * resize it further if the reservation could
> + * cross the last page in this node
> + */
> + if (PFN_UP(physbase+reserve_size) >
> + node_end_pfn)
> + reserve_size -= PAGE_SIZE;
> + }
> /*

Now I'm even more confused. Could you please send a fully changelogged
patch that describes the problem, and how this fixes it? This just
seems like an off-by-one error, which isn't what I thought we had before
at all.

I'm also horribly confused why PFN_UP is needed here. Is 'physbase' not
page aligned? reserve_size looks like it *has* to be. 'end_pfn' is
always (as far as I have ever seen in the kernel) the pfn of the page
after the area we are interested in and we treat it as such in that
function. In the case of an unaligned physbase, that wouldn't be true.

Think of the case where we have a 1-byte reservation. start_pfn will
equal end_pfn and we won't go into that while loop at *all* and won't
reserve anything.

Does 'end_pfn' need fixing?

-- Dave

2009-01-09 11:08:55

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Friday 09 January 2009 01:33:12 Dave Hansen wrote:
> Now I'm even more confused. Could you please send a fully changelogged
> patch that describes the problem, and how this fixes it? This just
> seems like an off-by-one error, which isn't what I thought we had before
> at all.
>
> I'm also horribly confused why PFN_UP is needed here. Is 'physbase' not
> page aligned? reserve_size looks like it *has* to be. 'end_pfn' is
> always (as far as I have ever seen in the kernel) the pfn of the page
> after the area we are interested in and we treat it as such in that
> function. In the case of an unaligned physbase, that wouldn't be true.
>
> Think of the case where we have a 1-byte reservation. start_pfn will
> equal end_pfn and we won't go into that while loop at *all* and won't
> reserve anything.
>
> Does 'end_pfn' need fixing?
>

Attached is the console log with debug command line parameters enabled and
with couple of more debug statements added to the code. Please take a look at it.

thanks,
Chandru


Attachments:
(No filename) (1.01 kB)
console_log.txt (5.15 kB)
Download all attachments

2009-01-15 08:13:19

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Friday 09 January 2009 16:37:24 Chandru wrote:
> On Friday 09 January 2009 01:33:12 Dave Hansen wrote:
> > Now I'm even more confused. Could you please send a fully changelogged
> > patch that describes the problem, and how this fixes it? This just
> > seems like an off-by-one error, which isn't what I thought we had before
> > at all.
> >
> > I'm also horribly confused why PFN_UP is needed here. Is 'physbase' not
> > page aligned? reserve_size looks like it *has* to be. 'end_pfn' is
> > always (as far as I have ever seen in the kernel) the pfn of the page
> > after the area we are interested in and we treat it as such in that
> > function. In the case of an unaligned physbase, that wouldn't be true.
> >
> > Think of the case where we have a 1-byte reservation. start_pfn will
> > equal end_pfn and we won't go into that while loop at *all* and won't
> > reserve anything.
> >
> > Does 'end_pfn' need fixing?
> >
>
> Attached is the console log with debug command line parameters enabled and
> with couple of more debug statements added to the code. Please take a look at it.
>
> thanks,
> Chandru
>

Hello Dave, From the debug console output, if there is anything you can add here,
pls let me know.

thanks

2009-01-16 12:36:48

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Thursday 15 January 2009 13:35:27 Chandru wrote:
> Hello Dave, From the debug console output, if there is anything you can add
> here, pls let me know.

As we can see from the console output here, physbase isn't page aligned when
the panic occurs. So we could as well send (start_pfn << PAGE_SHIFT) to
reserve_bootmem_node() instead of physbase. your thoughts ?.

Also end_pfn in mark_reserved_region_for_nid() is defined as

unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);

Does this refer to the pfn after the area that we are interested in ?. We have
atleast two fixes here,
1. Limit start and end to bdata->node_min_pfn and bdata->node_low_pfn in
reserve_bootmem_node() and add comments out in there that the caller of the
funtion should be aware of how much are they reserving.
2. send (start_pfn << PAGE_SHIFT) to reserve_bootmem_node() instead of
physbase.

Chandru

2009-01-16 17:53:40

by Dave Hansen

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Fri, 2009-01-16 at 17:46 +0530, Chandru wrote:
> On Thursday 15 January 2009 13:35:27 Chandru wrote:
> > Hello Dave, From the debug console output, if there is anything you can add
> > here, pls let me know.
>
> As we can see from the console output here, physbase isn't page aligned when
> the panic occurs. So we could as well send (start_pfn << PAGE_SHIFT) to
> reserve_bootmem_node() instead of physbase. your thoughts ?.
>
> Also end_pfn in mark_reserved_region_for_nid() is defined as
>
> unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);
>
> Does this refer to the pfn after the area that we are interested in ?. We have
> atleast two fixes here,
> 1. Limit start and end to bdata->node_min_pfn and bdata->node_low_pfn in
> reserve_bootmem_node() and add comments out in there that the caller of the
> funtion should be aware of how much are they reserving.
> 2. send (start_pfn << PAGE_SHIFT) to reserve_bootmem_node() instead of
> physbase.

Just looking at it, that calculation is OK. But, there was one in your
dmesg that looked a page too long, like page 0x1001 instead of 0x1000.
I'd find out how that happened.

-- Dave

2009-01-19 08:12:20

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Friday 16 January 2009 23:22:57 Dave Hansen wrote:
> Just looking at it, that calculation is OK. But, there was one in your
> dmesg that looked a page too long, like page 0x1001 instead of 0x1000.
> I'd find out how that happened.

That is a result of PFN_UP() in reserve_bootmem_node() for which we hit the
BUG_ON() eventually. Prior to calling reserve_bootmem_node() we have...

node_ar.end_pfn = node->node_end_pfn = PFN_DOWN(physbase+reserve_size).

Hence a PFN_UP() will raise the value of 'end'. The kernel has
CONFIG_PPC_64K_PAGES enabled in the config.

Chandru

2009-01-19 11:30:22

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

In case either physbase or reserve_size are not page aligned and in addition
if the following condition is also true

node_ar.end_pfn = node->node_end_pfn = PFN_DOWN(physbase+reserve_size).

we may hit the BUG_ON(end > bdata->node_low_pfn) in mark_bootmem_node() in
mm/bootmem.c Hence pass the pfn that the physbase is part of and align
reserve_size before calling reserve_bootmem_node().

Signed-off-by: Chandru S <[email protected]>
Cc: Dave Hansen <[email protected]>
---
arch/powerpc/mm/numa.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- linux-2.6.29-rc2/arch/powerpc/mm/numa.c.orig 2009-01-19 16:14:49.000000000 +0530
+++ linux-2.6.29-rc2/arch/powerpc/mm/numa.c 2009-01-19 16:36:38.000000000 +0530
@@ -901,7 +901,8 @@ static void mark_reserved_regions_for_ni
get_node_active_region(start_pfn, &node_ar);
while (start_pfn < end_pfn &&
node_ar.start_pfn < node_ar.end_pfn) {
- unsigned long reserve_size = size;
+ unsigned long reserve_size = (size >> PAGE_SHIFT) <<
+ PAGE_SHIFT;
/*
* if reserved region extends past active region
* then trim size to active region
@@ -917,7 +918,8 @@ static void mark_reserved_regions_for_ni
dbg("reserve_bootmem %lx %lx nid=%d\n",
physbase, reserve_size, node_ar.nid);
reserve_bootmem_node(NODE_DATA(node_ar.nid),
- physbase, reserve_size,
+ (start_pfn << PAGE_SHIFT),
+ reserve_size,
BOOTMEM_DEFAULT);
}
/*

2009-01-20 08:13:59

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

Chandru wrote:
> In case either physbase or reserve_size are not page aligned and in addition
> if the following condition is also true
>
> node_ar.end_pfn = node->node_end_pfn = PFN_DOWN(physbase+reserve_size).
>
> we may hit the BUG_ON(end > bdata->node_low_pfn) in mark_bootmem_node() in
> mm/bootmem.c Hence pass the pfn that the physbase is part of and align
> reserve_size before calling reserve_bootmem_node().
>
> Signed-off-by: Chandru S <[email protected]>
> Cc: Dave Hansen <[email protected]>
> ---
> arch/powerpc/mm/numa.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> --- linux-2.6.29-rc2/arch/powerpc/mm/numa.c.orig 2009-01-19 16:14:49.000000000 +0530
> +++ linux-2.6.29-rc2/arch/powerpc/mm/numa.c 2009-01-19 16:36:38.000000000 +0530
> @@ -901,7 +901,8 @@ static void mark_reserved_regions_for_ni
> get_node_active_region(start_pfn, &node_ar);
> while (start_pfn < end_pfn &&
> node_ar.start_pfn < node_ar.end_pfn) {
> - unsigned long reserve_size = size;
> + unsigned long reserve_size = (size >> PAGE_SHIFT) <<
> + PAGE_SHIFT;
> /*
> * if reserved region extends past active region
> * then trim size to active region
> @@ -917,7 +918,8 @@ static void mark_reserved_regions_for_ni
> dbg("reserve_bootmem %lx %lx nid=%d\n",
> physbase, reserve_size, node_ar.nid);
> reserve_bootmem_node(NODE_DATA(node_ar.nid),
> - physbase, reserve_size,
> + (start_pfn << PAGE_SHIFT),
> + reserve_size,
> BOOTMEM_DEFAULT);
> }
> /*
>
does this patch look good ?, do you concur with it ?

thanks,
Chandru

2009-01-22 00:29:55

by Dave Hansen

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Mon, 2009-01-19 at 17:00 +0530, Chandru wrote:
> --- linux-2.6.29-rc2/arch/powerpc/mm/numa.c.orig 2009-01-19 16:14:49.000000000 +0530
> +++ linux-2.6.29-rc2/arch/powerpc/mm/numa.c 2009-01-19 16:36:38.000000000 +0530
> @@ -901,7 +901,8 @@ static void mark_reserved_regions_for_ni
> get_node_active_region(start_pfn, &node_ar);
> while (start_pfn < end_pfn &&
> node_ar.start_pfn < node_ar.end_pfn) {
> - unsigned long reserve_size = size;
> + unsigned long reserve_size = (size >> PAGE_SHIFT) <<
> + PAGE_SHIFT;
> /*
> * if reserved region extends past active region
> * then trim size to active region
> @@ -917,7 +918,8 @@ static void mark_reserved_regions_for_ni
> dbg("reserve_bootmem %lx %lx nid=%d\n",
> physbase, reserve_size, node_ar.nid);
> reserve_bootmem_node(NODE_DATA(node_ar.nid),
> - physbase, reserve_size,
> + (start_pfn << PAGE_SHIFT),
> + reserve_size,
> BOOTMEM_DEFAULT);
> }
> /*

Chandru, I don't mean to keep ragging on your patches, but I really
don't think this is right, yet.

Let's take, for instance, a 1-byte reservation. With this code, you've
suddenly turned that into a 0-byte reservation, and that *can't* be
right. The same thing happens if you have a reservation that spans two
pages. If you unconditionally round it down, then you might miss the
part that spans a portion of the second page.

It needs to be rounded down like you are suggesting here, but only in
the case where we've gone over the *CURRENT* node's boundary. This is
kinda what that "if (end_pfn > node_ar.end_pfn)" check is doing. But,
it evidently screws it up if the overlap isn't by an entire page or
something.

Please also, for pete's sake, use masks (a la PAGE_MASK) or macros if
you're going to page-align something. Don't shift down and up like
that.

-- Dave

2009-01-22 08:20:41

by Chandru

[permalink] [raw]
Subject: Re: 2.6.28-rc9 panics with crashkernel=256M while booting

On Thursday 22 January 2009 05:59:39 Dave Hansen wrote:
> Let's take, for instance, a 1-byte reservation. With this code, you've
> suddenly turned that into a 0-byte reservation, and that *can't* be
> right. The same thing happens if you have a reservation that spans two
> pages. If you unconditionally round it down, then you might miss the
> part that spans a portion of the second page.
>
> It needs to be rounded down like you are suggesting here, but only in
> the case where we've gone over the *CURRENT* node's boundary. This is
> kinda what that "if (end_pfn > node_ar.end_pfn)" check is doing. But,
> it evidently screws it up if the overlap isn't by an entire page or
> something.

I assumed the condition 'while (start_pfn < end_pfn && .. )' asks for atleast
a PAGE_SIZE difference between them and hence went ahead with that patch.
My guess was a 1-byte , 2-byte or a (PAGE_SIZE -1)-byte reservations may not even
go into that loop. However we just need a fix for this problem. So if there is a
better fix that you have please post it to lkml.

Thanks,
Chandru