Subject: [PATCH] alloc_bootmem_core: fix misaligned allocation of 1G page

If memory hole remapping is enabled on an x86-NUMA system, allocation
of 1G pages on node 1 will most probably trigger an BUG_ON in
alloc_bootmem_huge_page because alloc_bootmem_core fails to properly
align the huge page on a 1G boundary.

I've observed this Oops with kernel 2.6.27-rc2-00166-gaeee90d
with a 2 socket system and activated memory hole remapping.
(Of course disabling memory hole remapping works around the problem
but this wastes a significant amount of memory.)

Here some dmesg snippet with that kernel (using "bootmem_debug" plus some
additional printk's):

...
Bootmem setup node 0 0000000000000000-0000000130000000
...
Bootmem setup node 1 0000000130000000-0000000230000000
...
Kernel command line: root=/dev/sda4 console=ttyS0,115200
hugepagesz=2M hugepages=0 hugepagesz=1G hugepages=3 bootmem_debug
debug earlyprintk=ttyS0,115200
...

bootmem::alloc_bootmem_core nid=1 size=40000000 [262144 pages]
align=40000000 goal=0 limit=0
min: 1245184, max: 2293760, step: 262144, start: 1310720
sidx: 65536, midx: 1048576
sidx: 65536
sidx: 262144, eidx: 524288
start_off: 1073741824, end_off: 2147483648, merge: 0, min_pfn: 1245184
bootmem::__reserve nid=1 start=170000 end=1b0000 flags=1
addr:ffff880170000000, paddr:0000000170000000, size: 1073741824
PANIC: early exception 06 rip 10:ffffffff807ce3b0 error 0 cr2 0
Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00166-gaeee90d-dirty #6

Call Trace:
[<ffffffff807cccbe>] ___alloc_bootmem_nopanic+0x60/0x98
[<ffffffff807bc195>] early_idt_handler+0x55/0x69
[<ffffffff807ce3b0>] alloc_bootmem_huge_page+0xa6/0xd9
[<ffffffff807ce39f>] alloc_bootmem_huge_page+0x95/0xd9
[<ffffffff807ce3fe>] hugetlb_hstate_alloc_pages+0x1b/0x3a
[<ffffffff807ce489>] hugetlb_nrpages_setup+0x6c/0x7a
[<ffffffff807bc69e>] unknown_bootoption+0xdc/0x1e2
[<ffffffff802446d6>] parse_args+0x137/0x1f5
[<ffffffff807bc5c2>] unknown_bootoption+0x0/0x1e2
[<ffffffff807bcb6e>] start_kernel+0x195/0x2b7
[<ffffffff807bc369>] x86_64_start_kernel+0xe3/0xe7

RIP 0x10

The problem in alloc_bootmem_core is that it just guarantees
proper alignment for the offset (sidx) from bdata->node_min_pfn.

A simple (ugly) fix is to add bdata->node_min_pfn to sidx and
friends. Patch is attached.

The current code in alloc_bootmem_core is based on changes introduced
with commit 5f2809e69c7128f86316048221cf45146f69a4a0 (bootmem: clean
up alloc_bootmem_core). But I didn't check whether this commit
introduced the problem.

Signed-off-by: Andreas Herrmann <[email protected]>
---
mm/bootmem.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)

With attached patch the 1G huge page gets properly aligned on node 1:

Linux version 2.6.27-rc2-00389-g10fec20-dirty ...
...
Bootmem setup node 0 0000000000000000-0000000130000000
...
Bootmem setup node 1 0000000130000000-0000000230000000
...

Kernel command line: root=/dev/sda4 console=ttyS0,115200
hugepagesz=2M hugepages=0 huge pagesz=1G hugepages=3 bootmem_debug
debug earlyprintk=ttyS0,115200
bootmem::alloc_bootmem_core nid=0 size=40000000 [262144 pages] align=40000000
goal=0 limit=0
bootmem::__reserve nid=0 start=40000 end=80000 flags=1
bootmem::alloc_bootmem_core nid=0 size=40000000 [262144 pages] align=40000000
goal=0 limit=0
bootmem::__reserve nid=0 start=80000 end=c0000 flags=1
bootmem::alloc_bootmem_core nid=0 size=40000000 [262144 pages] align=40000000
goal=0 limit=0
bootmem::alloc_bootmem_core nid=0 size=40000000 [262144 pages] align=40000000
goal=0 limit=0
bootmem::alloc_bootmem_core nid=1 size=40000000 [262144 pages] align=40000000
goal=0 limit=0
bootmem::__reserve nid=1 start=140000 end=180000 flags=1
Initializing CPU#0
...

Patch is against v2.6.27-rc2-389-g10fec20.
Please apply for 2.6.27 ... if nobody comes up with a better solution.


Regards,

Andreas

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 4af15d0..9d54244 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -441,8 +441,8 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
else
start = ALIGN(min, step);

- sidx = start - bdata->node_min_pfn;;
- midx = max - bdata->node_min_pfn;
+ sidx = start;
+ midx = max;

if (bdata->hint_idx > sidx) {
/*
@@ -458,7 +458,10 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
void *region;
unsigned long eidx, i, start_off, end_off;
find_block:
- sidx = find_next_zero_bit(bdata->node_bootmem_map, midx, sidx);
+ sidx = find_next_zero_bit(bdata->node_bootmem_map,
+ midx - bdata->node_min_pfn,
+ sidx - bdata->node_min_pfn);
+ sidx += bdata->node_min_pfn;
sidx = ALIGN(sidx, step);
eidx = sidx + PFN_UP(size);

@@ -466,7 +469,8 @@ find_block:
break;

for (i = sidx; i < eidx; i++)
- if (test_bit(i, bdata->node_bootmem_map)) {
+ if (test_bit(i - bdata->node_min_pfn,
+ bdata->node_bootmem_map)) {
sidx = ALIGN(i, step);
if (sidx == i)
sidx += step;
@@ -474,16 +478,17 @@ find_block:
}

if (bdata->last_end_off &&
- PFN_DOWN(bdata->last_end_off) + 1 == sidx)
+ (PFN_DOWN(bdata->last_end_off) + 1) ==
+ (sidx - bdata->node_min_pfn))
start_off = ALIGN(bdata->last_end_off, align);
else
- start_off = PFN_PHYS(sidx);
+ start_off = PFN_PHYS(sidx - bdata->node_min_pfn);

- merge = PFN_DOWN(start_off) < sidx;
+ merge = PFN_DOWN(start_off) < (sidx - bdata->node_min_pfn);
end_off = start_off + size;

bdata->last_end_off = end_off;
- bdata->hint_idx = PFN_UP(end_off);
+ bdata->hint_idx = PFN_UP(end_off + bdata->node_min_pfn);

/*
* Reserve the area now:
--
1.5.6.4



2008-08-12 16:59:32

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] alloc_bootmem_core: fix misaligned allocation of 1G page

Hi Andreas,

Andreas Herrmann <[email protected]> writes:

> If memory hole remapping is enabled on an x86-NUMA system, allocation
> of 1G pages on node 1 will most probably trigger an BUG_ON in
> alloc_bootmem_huge_page because alloc_bootmem_core fails to properly
> align the huge page on a 1G boundary.

Yeah, the problem of rewriting working code is that you're likely to
re-introduce old bugs :(

> I've observed this Oops with kernel 2.6.27-rc2-00166-gaeee90d
> with a 2 socket system and activated memory hole remapping.
> (Of course disabling memory hole remapping works around the problem
> but this wastes a significant amount of memory.)
>
> Here some dmesg snippet with that kernel (using "bootmem_debug" plus some
> additional printk's):
>
> ...
> Bootmem setup node 0 0000000000000000-0000000130000000
> ...
> Bootmem setup node 1 0000000130000000-0000000230000000
> ...
> Kernel command line: root=/dev/sda4 console=ttyS0,115200
> hugepagesz=2M hugepages=0 hugepagesz=1G hugepages=3 bootmem_debug
> debug earlyprintk=ttyS0,115200
> ...
>
> bootmem::alloc_bootmem_core nid=1 size=40000000 [262144 pages]
> align=40000000 goal=0 limit=0
> min: 1245184, max: 2293760, step: 262144, start: 1310720
> sidx: 65536, midx: 1048576
> sidx: 65536
> sidx: 262144, eidx: 524288
> start_off: 1073741824, end_off: 2147483648, merge: 0, min_pfn: 1245184
> bootmem::__reserve nid=1 start=170000 end=1b0000 flags=1
> addr:ffff880170000000, paddr:0000000170000000, size: 1073741824
> PANIC: early exception 06 rip 10:ffffffff807ce3b0 error 0 cr2 0
> Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00166-gaeee90d-dirty #6
>
> Call Trace:
> [<ffffffff807cccbe>] ___alloc_bootmem_nopanic+0x60/0x98
> [<ffffffff807bc195>] early_idt_handler+0x55/0x69
> [<ffffffff807ce3b0>] alloc_bootmem_huge_page+0xa6/0xd9
> [<ffffffff807ce39f>] alloc_bootmem_huge_page+0x95/0xd9
> [<ffffffff807ce3fe>] hugetlb_hstate_alloc_pages+0x1b/0x3a
> [<ffffffff807ce489>] hugetlb_nrpages_setup+0x6c/0x7a
> [<ffffffff807bc69e>] unknown_bootoption+0xdc/0x1e2
> [<ffffffff802446d6>] parse_args+0x137/0x1f5
> [<ffffffff807bc5c2>] unknown_bootoption+0x0/0x1e2
> [<ffffffff807bcb6e>] start_kernel+0x195/0x2b7
> [<ffffffff807bc369>] x86_64_start_kernel+0xe3/0xe7
>
> RIP 0x10
>
> The problem in alloc_bootmem_core is that it just guarantees
> proper alignment for the offset (sidx) from bdata->node_min_pfn.
>
> A simple (ugly) fix is to add bdata->node_min_pfn to sidx and
> friends. Patch is attached.

[...]

> The current code in alloc_bootmem_core is based on changes introduced
> with commit 5f2809e69c7128f86316048221cf45146f69a4a0 (bootmem: clean
> up alloc_bootmem_core). But I didn't check whether this commit
> introduced the problem.

It did, there were workarounds for the same problem in the earlier code,
I missed it.

The misalignment stems from the fact that the alignment requirement is
wider than the offset-pfn and the starting pfn of the node is not
aligned itself, correct?

I think, the cleaner fix would be to work with an aligned base pfn to
begin with, like the following, untested. What do you think?

Hannes

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 4af15d0..bee4dfe 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -410,7 +410,7 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
unsigned long goal, unsigned long limit)
{
unsigned long fallback = 0;
- unsigned long min, max, start, sidx, midx, step;
+ unsigned long min_pfn, min, max, start, sidx, midx, step;

BUG_ON(!size);
BUG_ON(align & (align - 1));
@@ -423,7 +423,10 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
bdata - bootmem_node_data, size, PAGE_ALIGN(size) >> PAGE_SHIFT,
align, goal, limit);

- min = bdata->node_min_pfn;
+ step = max(align >> PAGE_SHIFT, 1UL);
+ min_pfn = ALIGN(bdata->node_min_pfn, step);
+
+ min = min_pfn;
max = bdata->node_low_pfn;

goal >>= PAGE_SHIFT;
@@ -434,15 +437,13 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
if (max <= min)
return NULL;

- step = max(align >> PAGE_SHIFT, 1UL);
-
if (goal && min < goal && goal < max)
start = ALIGN(goal, step);
else
start = ALIGN(min, step);

- sidx = start - bdata->node_min_pfn;;
- midx = max - bdata->node_min_pfn;
+ sidx = start - min_pfn;
+ midx = max - min_pfn;

if (bdata->hint_idx > sidx) {
/*
@@ -492,8 +493,7 @@ find_block:
PFN_UP(end_off), BOOTMEM_EXCLUSIVE))
BUG();

- region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) +
- start_off);
+ region = phys_to_virt(PFN_PHYS(min_pfn) + start_off);
memset(region, 0, size);
return region;
}

Subject: Re: [PATCH] alloc_bootmem_core: fix misaligned allocation of 1G page

On Tue, Aug 12, 2008 at 06:58:55PM +0200, Johannes Weiner wrote:
> Andreas Herrmann <[email protected]> writes:
> > The current code in alloc_bootmem_core is based on changes introduced
> > with commit 5f2809e69c7128f86316048221cf45146f69a4a0 (bootmem: clean
> > up alloc_bootmem_core). But I didn't check whether this commit
> > introduced the problem.
>
> It did, there were workarounds for the same problem in the earlier code,
> I missed it.
>
> The misalignment stems from the fact that the alignment requirement is
> wider than the offset-pfn and the starting pfn of the node is not
> aligned itself, correct?

Yes.

> I think, the cleaner fix would be to work with an aligned base pfn to
> begin with, like the following, untested. What do you think?

This won't (completely) work.

Every time you compute the new alignment for sidx the starting point
(node_min_pfn) must be factored in.

Otherwise the function can't allocate the first possible page. For
example, assuming that

node_min_pfn = 130000
align = 0x40000000 (1GByte)

allocating a 1G page on this node will result in

sidx=0x40000
min_pfn=0x140000

Both are properly aligned. But the resulting super-page will be at
address 0x180000000 whereas the first possible 1G page would be at
address 0x140000000.

> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index 4af15d0..bee4dfe 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c

...

> @@ -492,8 +493,7 @@ find_block:
> PFN_UP(end_off), BOOTMEM_EXCLUSIVE))
> BUG();
>
> - region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) +
> - start_off);
> + region = phys_to_virt(PFN_PHYS(min_pfn) + start_off);
> memset(region, 0, size);
> return region;

Oops ...
the returned region doesn't match the reserved one as it still gets
reserved with

if (__reserve(bdata, PFN_DOWN(start_off) + merge,
PFN_UP(end_off), BOOTMEM_EXCLUSIVE))

where __reserve() will use bdata->node_min_pfn and not the properly
aligned min_pfn value. Either you have to pass the new min_pfn
value to __reserve() or you have to adapt start_off with another
offset = min_pfn - bdata->node_min_pfn ...



I thought about other solutions like introducing a "base_offset" --
the value needed to align node_min_pfn. But this value must be used
in many places to correctly compute/align sidx etc. and it doesn't
make the code better readable.

Hence I still prefer the patch posted yesterday. I just want to clean
it up somewhat. See attached patch.


Regards,

Andreas
--

alloc_bootmem_core: minor cleanup, use min instead of bdata->node_min_pfn

Signed-off-by: Andreas Herrmann <[email protected]>
---
mm/bootmem.c | 20 ++++++++------------
1 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 9d54244..11ece4b 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -459,9 +459,8 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
unsigned long eidx, i, start_off, end_off;
find_block:
sidx = find_next_zero_bit(bdata->node_bootmem_map,
- midx - bdata->node_min_pfn,
- sidx - bdata->node_min_pfn);
- sidx += bdata->node_min_pfn;
+ midx - min, sidx - min);
+ sidx += min;
sidx = ALIGN(sidx, step);
eidx = sidx + PFN_UP(size);

@@ -469,8 +468,7 @@ find_block:
break;

for (i = sidx; i < eidx; i++)
- if (test_bit(i - bdata->node_min_pfn,
- bdata->node_bootmem_map)) {
+ if (test_bit(i - min, bdata->node_bootmem_map)) {
sidx = ALIGN(i, step);
if (sidx == i)
sidx += step;
@@ -478,17 +476,16 @@ find_block:
}

if (bdata->last_end_off &&
- (PFN_DOWN(bdata->last_end_off) + 1) ==
- (sidx - bdata->node_min_pfn))
+ (PFN_DOWN(bdata->last_end_off) + 1) == (sidx - min))
start_off = ALIGN(bdata->last_end_off, align);
else
- start_off = PFN_PHYS(sidx - bdata->node_min_pfn);
+ start_off = PFN_PHYS(sidx - min);

- merge = PFN_DOWN(start_off) < (sidx - bdata->node_min_pfn);
+ merge = PFN_DOWN(start_off) < (sidx - min);
end_off = start_off + size;

bdata->last_end_off = end_off;
- bdata->hint_idx = PFN_UP(end_off + bdata->node_min_pfn);
+ bdata->hint_idx = PFN_UP(end_off + min);

/*
* Reserve the area now:
@@ -497,8 +494,7 @@ find_block:
PFN_UP(end_off), BOOTMEM_EXCLUSIVE))
BUG();

- region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) +
- start_off);
+ region = phys_to_virt(PFN_PHYS(min) + start_off);
memset(region, 0, size);
return region;
}
--
1.5.6.4



2008-08-13 18:18:39

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] alloc_bootmem_core: fix misaligned allocation of 1G page

Hi,

Andreas Herrmann <[email protected]> writes:

> On Tue, Aug 12, 2008 at 06:58:55PM +0200, Johannes Weiner wrote:
>> Andreas Herrmann <[email protected]> writes:
>> > The current code in alloc_bootmem_core is based on changes introduced
>> > with commit 5f2809e69c7128f86316048221cf45146f69a4a0 (bootmem: clean
>> > up alloc_bootmem_core). But I didn't check whether this commit
>> > introduced the problem.
>>
>> It did, there were workarounds for the same problem in the earlier code,
>> I missed it.
>>
>> The misalignment stems from the fact that the alignment requirement is
>> wider than the offset-pfn and the starting pfn of the node is not
>> aligned itself, correct?
>
> Yes.

Okay, and the real and effective problem is that the code right now
aligns the relative offsets to the globally requested alignment, which
is of course total crap to do. Because, whether or not the result will
be correct (only if the node start is aligned), the global alignment is
just not applicable to relative offsets, even if it works by accident.

>> I think, the cleaner fix would be to work with an aligned base pfn to
>> begin with, like the following, untested. What do you think?
>
> This won't (completely) work.

You are right, this was not the problem at all.

> Every time you compute the new alignment for sidx the starting point
> (node_min_pfn) must be factored in.

Yep.

> Otherwise the function can't allocate the first possible page. For
> example, assuming that
>
> node_min_pfn = 130000
> align = 0x40000000 (1GByte)
>
> allocating a 1G page on this node will result in
>
> sidx=0x40000
> min_pfn=0x140000
>
> Both are properly aligned. But the resulting super-page will be at
> address 0x180000000 whereas the first possible 1G page would be at
> address 0x140000000.

I can not follow this example. Is there a typo somewhere? In this
example, the resulting address is aligned but it shouldn't even be so.

The sidx will be aligned and when you add 0x130000 to it, the result is
not (in this case).

>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>> index 4af15d0..bee4dfe 100644
>> --- a/mm/bootmem.c
>> +++ b/mm/bootmem.c
>
> ...
>
>> @@ -492,8 +493,7 @@ find_block:
>> PFN_UP(end_off), BOOTMEM_EXCLUSIVE))
>> BUG();
>>
>> - region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) +
>> - start_off);
>> + region = phys_to_virt(PFN_PHYS(min_pfn) + start_off);
>> memset(region, 0, size);
>> return region;
>
> Oops ...
> the returned region doesn't match the reserved one as it still gets
> reserved with
>
> if (__reserve(bdata, PFN_DOWN(start_off) + merge,
> PFN_UP(end_off), BOOTMEM_EXCLUSIVE))
>
> where __reserve() will use bdata->node_min_pfn and not the properly
> aligned min_pfn value. Either you have to pass the new min_pfn
> value to __reserve() or you have to adapt start_off with another
> offset = min_pfn - bdata->node_min_pfn ...

All in favor for the latter. Keep the relative value at a local
alignment so that it yields a correct global alignment when combined
with node_min_pfn, no matter how node_min_pfn is aligned itself.

> I thought about other solutions like introducing a "base_offset" --
> the value needed to align node_min_pfn. But this value must be used
> in many places to correctly compute/align sidx etc. and it doesn't
> make the code better readable.
>
> Hence I still prefer the patch posted yesterday. I just want to clean
> it up somewhat. See attached patch.

Attached is a version that should be easier to read. Please
double-check.

It keeps the indexes/offsets aligned relative to the node so that
combined with the node start always yields a value that is aligned as
requested.

Hannes

commit ec5b91737d0be242a6a9b255fa1749962f978188
Author: Johannes Weiner <[email protected]>
Date: Wed Aug 13 19:59:43 2008 +0200

bootmem: fix aligning of node-relative indexes and offsets

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 4af15d0..8620832 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -405,6 +405,30 @@ int __init reserve_bootmem(unsigned long addr, unsigned long size,
}
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */

+static unsigned long align_idx(struct bootmem_data *bdata, unsigned long idx,
+ unsigned long step)
+{
+ unsigned long base = bdata->node_min_pfn;
+
+ /*
+ * Align the index with respect to the node start so that the
+ * resulting absolute PFN (node-start + index) is properly
+ * aligned.
+ */
+
+ return ALIGN(base + idx, step) - base;
+}
+
+static unsigned long align_off(struct bootmem_data *bdata, unsigned long off,
+ unsigned long align)
+{
+ unsigned long base = PFN_PHYS(bdata->node_min_pfn);
+
+ /* Same as align_idx for byte offsets */
+
+ return ALIGN(base + off, align) - base;
+}
+
static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
unsigned long size, unsigned long align,
unsigned long goal, unsigned long limit)
@@ -441,16 +465,23 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
else
start = ALIGN(min, step);

- sidx = start - bdata->node_min_pfn;;
+ sidx = start - bdata->node_min_pfn;
midx = max - bdata->node_min_pfn;

+ /*
+ * Because the indexes are relative to the node, all alignment
+ * below has to be done with respect to the node's start in
+ * order to have the resulting PFNs and addresses properly
+ * aligned.
+ */
+
if (bdata->hint_idx > sidx) {
/*
* Handle the valid case of sidx being zero and still
* catch the fallback below.
*/
fallback = sidx + 1;
- sidx = ALIGN(bdata->hint_idx, step);
+ sidx = align_idx(bdata, bdata->hint_idx, step);
}

while (1) {
@@ -459,7 +490,7 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
unsigned long eidx, i, start_off, end_off;
find_block:
sidx = find_next_zero_bit(bdata->node_bootmem_map, midx, sidx);
- sidx = ALIGN(sidx, step);
+ sidx = align_idx(bdata, sidx, step);
eidx = sidx + PFN_UP(size);

if (sidx >= midx || eidx > midx)
@@ -467,7 +498,7 @@ find_block:

for (i = sidx; i < eidx; i++)
if (test_bit(i, bdata->node_bootmem_map)) {
- sidx = ALIGN(i, step);
+ sidx = align_idx(bdata, i, step);
if (sidx == i)
sidx += step;
goto find_block;
@@ -475,7 +506,7 @@ find_block:

if (bdata->last_end_off &&
PFN_DOWN(bdata->last_end_off) + 1 == sidx)
- start_off = ALIGN(bdata->last_end_off, align);
+ start_off = align_off(bdata, bdata->last_end_off, align);
else
start_off = PFN_PHYS(sidx);

@@ -499,7 +530,7 @@ find_block:
}

if (fallback) {
- sidx = ALIGN(fallback - 1, step);
+ sidx = align_idx(bdata, fallback - 1, step);
fallback = 0;
goto find_block;
}

Subject: Re: [PATCH] alloc_bootmem_core: fix misaligned allocation of 1G page

On Wed, Aug 13, 2008 at 08:18:04PM +0200, Johannes Weiner wrote:
> Andreas Herrmann <[email protected]> writes:

> > Otherwise the function can't allocate the first possible page. For
> > example, assuming that
> >
> > node_min_pfn = 130000
> > align = 0x40000000 (1GByte)
> >
> > allocating a 1G page on this node will result in
> >
> > sidx=0x40000
> > min_pfn=0x140000
> >
> > Both are properly aligned. But the resulting super-page will be at
> > address 0x180000000 whereas the first possible 1G page would be at
> > address 0x140000000.
>
> I can not follow this example. Is there a typo somewhere? In this
> example, the resulting address is aligned but it shouldn't even be so.

No typo. I guess the confusion is just my ambiguous wording.
It's what your previous patch generated. At the end it "aligned" the first
1G page on node 1 at 0x180000000 (== the return value of alloc_bootmem_core)
but it reserved the region from 0x170000000:

Aug 13 16:05:49 brandon bootmem::alloc_bootmem_core nid=1 size=40000000
[262144 pages] align=40000000 goal=0 limit=0
Aug 13 16:05:49 brandon bootmem::__reserve nid=1 start=170000 end=1b0000 flags=1
Aug 13 16:05:49 brandon addr:ffff880180000000, paddr:0000000180000000, size: 1073741824

... and after some time ...

BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
IP: [<ffffffff80261a87>] get_page_from_freelist+0x403/0x5ed
PGD 22f108067 PUD 22dc3d067 PMD 0
Oops: 0000 [1] SMP
...


> Attached is a version that should be easier to read. Please
> double-check.

Nice.
That's the preferred solution and should replace my initial fix.

Reviewed-and-tested-by: Andreas Herrmann <[email protected]>


Thanks,

Andreas


> commit ec5b91737d0be242a6a9b255fa1749962f978188
> Author: Johannes Weiner <[email protected]>
> Date: Wed Aug 13 19:59:43 2008 +0200
>
> bootmem: fix aligning of node-relative indexes and offsets
>
> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index 4af15d0..8620832 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c
> @@ -405,6 +405,30 @@ int __init reserve_bootmem(unsigned long addr, unsigned long size,
> }
> #endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
>
> +static unsigned long align_idx(struct bootmem_data *bdata, unsigned long idx,
> + unsigned long step)
> +{
> + unsigned long base = bdata->node_min_pfn;
> +
> + /*
> + * Align the index with respect to the node start so that the
> + * resulting absolute PFN (node-start + index) is properly
> + * aligned.
> + */
> +
> + return ALIGN(base + idx, step) - base;
> +}
> +
> +static unsigned long align_off(struct bootmem_data *bdata, unsigned long off,
> + unsigned long align)
> +{
> + unsigned long base = PFN_PHYS(bdata->node_min_pfn);
> +
> + /* Same as align_idx for byte offsets */
> +
> + return ALIGN(base + off, align) - base;
> +}
> +
> static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
> unsigned long size, unsigned long align,
> unsigned long goal, unsigned long limit)
> @@ -441,16 +465,23 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
> else
> start = ALIGN(min, step);
>
> - sidx = start - bdata->node_min_pfn;;
> + sidx = start - bdata->node_min_pfn;
> midx = max - bdata->node_min_pfn;
>
> + /*
> + * Because the indexes are relative to the node, all alignment
> + * below has to be done with respect to the node's start in
> + * order to have the resulting PFNs and addresses properly
> + * aligned.
> + */
> +
> if (bdata->hint_idx > sidx) {
> /*
> * Handle the valid case of sidx being zero and still
> * catch the fallback below.
> */
> fallback = sidx + 1;
> - sidx = ALIGN(bdata->hint_idx, step);
> + sidx = align_idx(bdata, bdata->hint_idx, step);
> }
>
> while (1) {
> @@ -459,7 +490,7 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
> unsigned long eidx, i, start_off, end_off;
> find_block:
> sidx = find_next_zero_bit(bdata->node_bootmem_map, midx, sidx);
> - sidx = ALIGN(sidx, step);
> + sidx = align_idx(bdata, sidx, step);
> eidx = sidx + PFN_UP(size);
>
> if (sidx >= midx || eidx > midx)
> @@ -467,7 +498,7 @@ find_block:
>
> for (i = sidx; i < eidx; i++)
> if (test_bit(i, bdata->node_bootmem_map)) {
> - sidx = ALIGN(i, step);
> + sidx = align_idx(bdata, i, step);
> if (sidx == i)
> sidx += step;
> goto find_block;
> @@ -475,7 +506,7 @@ find_block:
>
> if (bdata->last_end_off &&
> PFN_DOWN(bdata->last_end_off) + 1 == sidx)
> - start_off = ALIGN(bdata->last_end_off, align);
> + start_off = align_off(bdata, bdata->last_end_off, align);
> else
> start_off = PFN_PHYS(sidx);
>
> @@ -499,7 +530,7 @@ find_block:
> }
>
> if (fallback) {
> - sidx = ALIGN(fallback - 1, step);
> + sidx = align_idx(bdata, fallback - 1, step);
> fallback = 0;
> goto find_block;
> }
>

2008-08-14 00:19:18

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH -v2] bootmem: fix aligning of node-relative indexes and offsets

Absolute alignment requirements may never be applied to node-relative
offsets. Andreas Herrmann spotted this flaw when a bootmem allocation
on an unaligned node was itself not aligned because the combination of
an unaligned node with an aligned offset into that node is not
garuanteed to be aligned itself.

This patch introduces two helper functions that align a node-relative
index or offset with respect to the node's starting address so that
the absolute PFN or virtual address that results from combining the
two satisfies the requested alignment.

Then all the broken ALIGN()s in alloc_bootmem_core() are replaced by
these helpers.

Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Andreas Herrmann <[email protected]>
Debugged-by: Andreas Herrmann <[email protected]>
Reviewed-by: Andreas Herrmann <[email protected]>
Tested-by: Andreas Herrmann <[email protected]>
---

Andrew could you pull this in? Replaces
alloc_bootmem_core-fix-misaligned-allocation-of-1g-page.patch and is
also needed in upstream.

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 4af15d0..6262d43 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -405,6 +405,29 @@ int __init reserve_bootmem(unsigned long addr, unsigned long size,
}
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */

+static unsigned long align_idx(struct bootmem_data *bdata, unsigned long idx,
+ unsigned long step)
+{
+ unsigned long base = bdata->node_min_pfn;
+
+ /*
+ * Align the index with respect to the node start so that the
+ * combination of both satisfies the requested alignment.
+ */
+
+ return ALIGN(base + idx, step) - base;
+}
+
+static unsigned long align_off(struct bootmem_data *bdata, unsigned long off,
+ unsigned long align)
+{
+ unsigned long base = PFN_PHYS(bdata->node_min_pfn);
+
+ /* Same as align_idx for byte offsets */
+
+ return ALIGN(base + off, align) - base;
+}
+
static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
unsigned long size, unsigned long align,
unsigned long goal, unsigned long limit)
@@ -441,7 +464,7 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
else
start = ALIGN(min, step);

- sidx = start - bdata->node_min_pfn;;
+ sidx = start - bdata->node_min_pfn;
midx = max - bdata->node_min_pfn;

if (bdata->hint_idx > sidx) {
@@ -450,7 +473,7 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
* catch the fallback below.
*/
fallback = sidx + 1;
- sidx = ALIGN(bdata->hint_idx, step);
+ sidx = align_idx(bdata, bdata->hint_idx, step);
}

while (1) {
@@ -459,7 +482,7 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
unsigned long eidx, i, start_off, end_off;
find_block:
sidx = find_next_zero_bit(bdata->node_bootmem_map, midx, sidx);
- sidx = ALIGN(sidx, step);
+ sidx = align_idx(bdata, sidx, step);
eidx = sidx + PFN_UP(size);

if (sidx >= midx || eidx > midx)
@@ -467,7 +490,7 @@ find_block:

for (i = sidx; i < eidx; i++)
if (test_bit(i, bdata->node_bootmem_map)) {
- sidx = ALIGN(i, step);
+ sidx = align_idx(bdata, i, step);
if (sidx == i)
sidx += step;
goto find_block;
@@ -475,7 +498,7 @@ find_block:

if (bdata->last_end_off &&
PFN_DOWN(bdata->last_end_off) + 1 == sidx)
- start_off = ALIGN(bdata->last_end_off, align);
+ start_off = align_off(bdata, bdata->last_end_off, align);
else
start_off = PFN_PHYS(sidx);

@@ -499,7 +522,7 @@ find_block:
}

if (fallback) {
- sidx = ALIGN(fallback - 1, step);
+ sidx = align_idx(bdata, fallback - 1, step);
fallback = 0;
goto find_block;
}

2008-08-18 21:18:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] alloc_bootmem_core: fix misaligned allocation of 1G page

On Wed, 13 Aug 2008 21:31:58 +0200
Andreas Herrmann <[email protected]> wrote:

> > Attached is a version that should be easier to read. Please
> > double-check.
>
> Nice.
> That's the preferred solution and should replace my initial fix.
>
> Reviewed-and-tested-by: Andreas Herrmann <[email protected]>

Johannes, I'm unsure whether that patch was the final version. Could
you please resend it (or its successor)?

With a more complete changelog, please. This:

bootmem: fix aligning of node-relative indexes and offsets

doesn't explain what needed fixing, nor how it was fixed, nor the
consequences of leaving it unfixed, etc.

Thanks.

2008-08-18 21:22:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] alloc_bootmem_core: fix misaligned allocation of 1G page

On Mon, 18 Aug 2008 14:17:18 -0700
Andrew Morton <[email protected]> wrote:

> On Wed, 13 Aug 2008 21:31:58 +0200
> Andreas Herrmann <[email protected]> wrote:
>
> > > Attached is a version that should be easier to read. Please
> > > double-check.
> >
> > Nice.
> > That's the preferred solution and should replace my initial fix.
> >
> > Reviewed-and-tested-by: Andreas Herrmann <[email protected]>
>
> Johannes, I'm unsure whether that patch was the final version. Could
> you please resend it (or its successor)?
>
> With a more complete changelog, please. This:
>
> bootmem: fix aligning of node-relative indexes and offsets
>
> doesn't explain what needed fixing, nor how it was fixed, nor the
> consequences of leaving it unfixed, etc.
>

ah, OK, I found it. please ignore.