2008-08-15 22:01:43

by Adam Litke

[permalink] [raw]
Subject: [BUG] __GFP_THISNODE is not always honored

While running the libhugetlbfs test suite on a NUMA machine with 2.6.27-rc3, I
discovered some strange behavior with __GFP_THISNODE. The hugetlb function
alloc_fresh_huge_page_node() calls alloc_pages_node() with __GFP_THISNODE but
occasionally a page that is not on the requested node is returned. Since the
hugetlb code assumes that the page will be on the requested node, badness follows
when the page is added to the wrong node's free_list.

There is clearly something wrong with the buddy allocator since __GFP_THISNODE
cannot be trusted. Until that is fixed, the hugetlb code should not assume
that the newly allocated page is on the node asked for. This patch prevents
the hugetlb pool counters from being corrupted and allows the code to cope with
unbalanced NUMA allocations.

So far my debugging has led me to get_page_from_freelist() inside the
for_each_zone_zonelist() loop. When buffered_rmqueue() returns a page I
compare the value of page_to_nid(page), zone->node and the node that the
hugetlb code requested with __GFP_THISNODE. These all match -- except when the
problem triggers. In that case, zone->node matches the node we asked for but
page_to_nid() does not.

Workaround patch:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67a7119..7a30a61 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,7 +568,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 			__free_pages(page, huge_page_order(h));
 			return NULL;
 		}
-		prep_new_huge_page(h, page, nid);
+		prep_new_huge_page(h, page, page_to_nid(page));
 	}
 
 	return page;

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


2008-08-18 10:59:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [BUG] __GFP_THISNODE is not always honored

On (15/08/08 17:01), Adam Litke didst pronounce:
> While running the libhugetlbfs test suite on a NUMA machine with 2.6.27-rc3, I
> discovered some strange behavior with __GFP_THISNODE. The hugetlb function
> alloc_fresh_huge_page_node() calls alloc_pages_node() with __GFP_THISNODE but
> occasionally a page that is not on the requested node is returned.

That's bad in itself and has wider reaching consequences than hugetlb
getting its counters wrong. I believe SLUB depends on __GFP_THISNODE
being obeyed for example. Can you boot the machine in question with
mminit_loglevel=4 and loglevel=8 set on the command line and send me the
dmesg please? It should output the zonelists and I might be able to
figure out what's going wrong. Thanks

> Since the
> hugetlb code assumes that the page will be on the requested node, badness follows
> when the page is added to the wrong node's free_list.
>
> There is clearly something wrong with the buddy allocator since __GFP_THISNODE
> cannot be trusted. Until that is fixed, the hugetlb code should not assume
> that the newly allocated page is on the node asked for. This patch prevents
> the hugetlb pool counters from being corrupted and allows the code to cope with
> unbalanced numa allocations.
>
> So far my debugging has led me to get_page_from_freelist() inside the
> for_each_zone_zonelist() loop. When buffered_rmqueue() returns a page I
> compare the value of page_to_nid(page), zone->node and the node that the
> hugetlb code requested with __GFP_THISNODE. These all match -- except when the
> problem triggers. In that case, zone->node matches the node we asked for but
> page_to_nid() does not.
>

Feels like the wrong zonelist is being used. The dmesg with
mminit_loglevel may tell.

> Workaround patch:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67a7119..7a30a61 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -568,7 +568,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> __free_pages(page, huge_page_order(h));
> return NULL;
> }
> - prep_new_huge_page(h, page, nid);
> + prep_new_huge_page(h, page, page_to_nid(page));
> }

This will mask the bug for hugetlb but I wonder if this should be a
VM_BUG_ON(page_to_nid(page) != nid) ?

>
> return page;
>
> --
> Adam Litke - (agl at us.ibm.com)
> IBM Linux Technology Center
>

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2008-08-18 18:17:14

by Adam Litke

[permalink] [raw]
Subject: Re: [BUG] __GFP_THISNODE is not always honored

On Mon, 2008-08-18 at 11:59 +0100, Mel Gorman wrote:
> On (15/08/08 17:01), Adam Litke didst pronounce:
> > While running the libhugetlbfs test suite on a NUMA machine with 2.6.27-rc3, I
> > discovered some strange behavior with __GFP_THISNODE. The hugetlb function
> > alloc_fresh_huge_page_node() calls alloc_pages_node() with __GFP_THISNODE but
> > occasionally a page that is not on the requested node is returned.
>
> That's bad in itself and has wider reaching consequences than hugetlb
> getting its counters wrong. I believe SLUB depends on __GFP_THISNODE
> being obeyed for example. Can you boot the machine in question with
> mminit_loglevel=4 and loglevel=8 set on the command line and send me the
> dmesg please? It should output the zonelists and I might be able to
> figure out what's going wrong. Thanks

dmesg output is attached.

> > Since the
> > hugetlb code assumes that the page will be on the requested node, badness follows
> > when the page is added to the wrong node's free_list.
> >
> > There is clearly something wrong with the buddy allocator since __GFP_THISNODE
> > cannot be trusted. Until that is fixed, the hugetlb code should not assume
> > that the newly allocated page is on the node asked for. This patch prevents
> > the hugetlb pool counters from being corrupted and allows the code to cope with
> > unbalanced numa allocations.
> >
> > So far my debugging has led me to get_page_from_freelist() inside the
> > for_each_zone_zonelist() loop. When buffered_rmqueue() returns a page I
> > compare the value of page_to_nid(page), zone->node and the node that the
> > hugetlb code requested with __GFP_THISNODE. These all match -- except when the
> > problem triggers. In that case, zone->node matches the node we asked for but
> > page_to_nid() does not.
> >
>
> Feels like the wrong zonelist is being used. The dmesg with
> mminit_loglevel may tell.
>
> > Workaround patch:
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 67a7119..7a30a61 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -568,7 +568,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> > __free_pages(page, huge_page_order(h));
> > return NULL;
> > }
> > - prep_new_huge_page(h, page, nid);
> > + prep_new_huge_page(h, page, page_to_nid(page));
> > }
>
> This will mask the bug for hugetlb but I wonder if this should be a
> VM_BUG_ON(page_to_nid(page) != nid) ?

Yeah, the patch was provided for illustrative purposes only.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


Attachments:
dmesg.out (38.13 kB)

2008-08-18 19:16:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: [BUG] __GFP_THISNODE is not always honored


> That's bad in itself and has wider reaching consequences than hugetlb
> getting its counters wrong. I believe SLUB depends on __GFP_THISNODE
> being obeyed for example. Can you boot the machine in question with
> mminit_loglevel=4 and loglevel=8 set on the command line and send me the
> dmesg please? It should output the zonelists and I might be able to
> figure out what's going wrong. Thanks

It's SLAB that depends on it, and it will corrupt data if the wrong node is
returned. SLAB has BUG_ONs that should trigger if anything like that occurs.


> This will mask the bug for hugetlb but I wonder if this should be a
> VM_BUG_ON(page_to_nid(page) != nid) ?

Right.

2008-08-18 19:22:46

by Christoph Lameter

[permalink] [raw]
Subject: Re: [BUG] __GFP_THISNODE is not always honored

Adam Litke wrote:
>
> So far my debugging has led me to get_page_from_freelist() inside the
> for_each_zone_zonelist() loop. When buffered_rmqueue() returns a page I
> compare the value of page_to_nid(page), zone->node and the node that the
> hugetlb code requested with __GFP_THISNODE. These all match -- except when the
> problem triggers. In that case, zone->node matches the node we asked for but
> page_to_nid() does not.

Uhhh.. A page that was just taken off the freelist? So we may have freed or
coalesced a page to the wrong zone? Looks like there is something more
fundamental that broke here.

2008-08-18 19:53:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [BUG] __GFP_THISNODE is not always honored

On (18/08/08 14:21), Christoph Lameter didst pronounce:
> Adam Litke wrote:
> >
> > So far my debugging has led me to get_page_from_freelist() inside the
> > for_each_zone_zonelist() loop. When buffered_rmqueue() returns a page I
> > compare the value of page_to_nid(page), zone->node and the node that the
> > hugetlb code requested with __GFP_THISNODE. These all match -- except when the
> > problem triggers. In that case, zone->node matches the node we asked for but
> > page_to_nid() does not.
>
> Uhhh.. A page that was just taken off the freelist? So we may have freed or
> coalesced a page to the wrong zone? Looks like there is something more
> fundamental that broke here.
>

It's still a bit hard to tell but I don't believe we are coalescing wrong
at the moment. buffered_rmqueue() is pretty high in the call chain for the
page allocator. The problem could have been explained if the zonelist walking
for __GFP_THISNODE was screwed but the dmesg output seems to show that's
ok at least. It could also be something really wacky like the page
linkages don't match the zone->node linkages.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2008-08-18 19:57:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [BUG] __GFP_THISNODE is not always honored

On (18/08/08 13:16), Adam Litke didst pronounce:
> <MUCH SNIPPAGE>
> mminit::memmap_init Initialising map node 0 zone 0 pfns 32768 -> 278528
> mminit::memmap_init Initialising map node 1 zone 0 pfns 0 -> 524288

This might be the problem here. This machine has overlapping nodes which
is a fairly rare situation. I think it's possible the page linkages for
node 0 are getting overwritten with their node 1 equivalents. If this is
happening, it would lead to some oddness.

2008-08-20 17:09:09

by Adam Litke

[permalink] [raw]
Subject: [BUG] Make setup_zone_migrate_reserve() aware of overlapping nodes

I have gotten to the root cause of the hugetlb badness I reported back
on August 15th. My system has the following memory topology (note the
overlapping node):

Node 0 Memory: 0x8000000-0x44000000
Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000
looking for a pageblock to move onto the MIGRATE_RESERVE list. Finding
no candidates, it happily continues the scan into 0x8000000-0x44000000.
When a pageblock is found, the pages are moved to the MIGRATE_RESERVE
list on the wrong zone. Oops.
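
To make the failure mode concrete, here is a minimal userspace sketch (not
kernel code; the node lookup, pageblock size, and function names are all
stand-ins hardcoded from the topology above) of how a pfn walk over node 1's
spanned range wanders into node 0's memory, and how skipping foreign
pageblocks avoids it:

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 1024UL	/* illustrative pageblock size */

/* Stand-in for the kernel's pfn-to-node lookup, using 4 KiB pages
 * (pfn = addr >> 12).  Node 1 owns 0x0-0x8000000 and
 * 0x44000000-0x80000000; node 0 owns the middle chunk. */
static int sim_pfn_to_nid(unsigned long pfn)
{
	unsigned long addr = pfn << 12;

	if (addr < 0x8000000UL || addr >= 0x44000000UL)
		return 1;
	return 0;
}

/* Walk [start_pfn, end_pfn) pageblock by pageblock, the way
 * setup_zone_migrate_reserve() walks a zone's spanned range.
 * Returns 1 if the walk touches a pageblock on a foreign node;
 * with skip_foreign set (the proposed fix) such blocks are skipped. */
static int scan_hits_foreign_node(int nid, unsigned long start_pfn,
				  unsigned long end_pfn, int skip_foreign)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGEBLOCK_NR_PAGES) {
		if (sim_pfn_to_nid(pfn) != nid) {
			if (skip_foreign)
				continue;
			return 1;
		}
	}
	return 0;
}
```

With the topology above, the unfixed walk over node 1's spanned range first
touches a node 0 block at 0x8000000; the skip check leaves only node 1 blocks.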

(Andrew: once the proper fix is agreed upon, this should also be a
candidate for -stable.)

setup_zone_migrate_reserve() should skip pageblocks in overlapping
nodes.

Signed-off-by: Adam Litke <[email protected]>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af982f7..f297a9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2512,6 +2512,10 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 							pageblock_order;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		/* Watch out for overlapping nodes */
+		if (!early_pfn_in_nid(pfn, zone->node))
+			continue;
+
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-08-20 18:12:13

by Dave Hansen

[permalink] [raw]
Subject: Re: [BUG] Make setup_zone_migrate_reserve() aware of overlapping nodes

On Wed, 2008-08-20 at 12:08 -0500, Adam Litke wrote:
> I have gotten to the root cause of the hugetlb badness I reported back
> on August 15th. My system has the following memory topology (note the
> overlapping node):
>
> Node 0 Memory: 0x8000000-0x44000000
> Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000
>
> setup_zone_migrate_reserve() scans the address range 0x0-0x8000000
> looking for a pageblock to move onto the MIGRATE_RESERVE list. Finding
> no candidates, it happily continues the scan into 0x8000000-0x44000000.
> When a pageblock is found, the pages are moved to the MIGRATE_RESERVE
> list on the wrong zone. Oops.

This eventually gets down into move_freepages() via:

->setup_zone_migrate_reserve()
->move_freepages_block()
->move_freepages()
right?

It looks like there have been bugs in this area before in
move_freepages(). Should there be a more stringent check in *there*?
Maybe a warning?
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2512,6 +2512,10 @@ static void setup_zone_migrate_reserve(struct
> zone *zone)
> pageblock_order;
>
> for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> + /* Watch out for overlapping nodes */
> + if (!early_pfn_in_nid(pfn, zone->node))
> + continue;

zone->node doesn't exist on !CONFIG_NUMA. :(

You probably want:

	if (!early_pfn_in_nid(pfn, zone_to_nid(zone)))
		continue;

Are you sure you need the "early_" variant here? We're not using
early_pfn_valid() right below it. I guess you could also use:

	if (page_to_nid(page) != zone_to_nid(zone))
		continue;

-- Dave

2008-08-20 19:56:10

by Adam Litke

[permalink] [raw]
Subject: [BUG] [PATCH v2] Make setup_zone_migrate_reserve() aware of overlapping nodes

Changes since V1
- Fix build for !NUMA
- Add VM_BUG_ON() to catch this problem at the source

I have gotten to the root cause of the hugetlb badness I reported back on
August 15th. My system has the following memory topology (note the
overlapping node):

Node 0 Memory: 0x8000000-0x44000000
Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
for a pageblock to move onto the MIGRATE_RESERVE list. Finding no
candidates, it happily continues the scan into 0x8000000-0x44000000. When
a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
the wrong zone. Oops.

(Andrew: once the proper fix is agreed upon, this should also be a
candidate for -stable.)

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

Signed-off-by: Adam Litke <[email protected]>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af982f7..feb7916 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -694,6 +694,9 @@ static int move_freepages(struct zone *zone,
 #endif
 
 	for (page = start_page; page <= end_page;) {
+		/* Make sure we are not inadvertently changing nodes */
+		VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
+
 		if (!pfn_valid_within(page_to_pfn(page))) {
 			page++;
 			continue;
@@ -2516,6 +2519,10 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 			continue;
 		page = pfn_to_page(pfn);
 
+		/* Watch out for overlapping nodes */
+		if (page_to_nid(page) != zone_to_nid(zone))
+			continue;
+
 		/* Blocks with reserved pages will never free, skip them. */
 		if (PageReserved(page))
 			continue;

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-08-21 11:33:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [BUG] [PATCH v2] Make setup_zone_migrate_reserve() aware of overlapping nodes

On (20/08/08 14:55), Adam Litke didst pronounce:
> Changes since V1
> - Fix build for !NUMA
> - Add VM_BUG_ON() to catch this problem at the source
>
> I have gotten to the root cause of the hugetlb badness I reported back on
> August 15th. My system has the following memory topology (note the
> overlapping node):
>
> Node 0 Memory: 0x8000000-0x44000000
> Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000
>
> setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
> for a pageblock to move onto the MIGRATE_RESERVE list. Finding no
> candidates, it happily continues the scan into 0x8000000-0x44000000. When
> a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
> the wrong zone. Oops.
>
> (Andrew: once the proper fix is agreed upon, this should also be a
> candidate for -stable.)
>
> setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.
>
> Signed-off-by: Adam Litke <[email protected]>
>

zone_to_nid(zone) is called every time in the loop even though it will never
change. This is less than optimal but setup_zone_migrate_reserve() is only
called during init and when min_free_kbytes is adjusted so it's not worth
worrying about. Otherwise it looks good.
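
For completeness, the hoisted form Mel alludes to just computes the invariant
once outside the loop. A userspace sketch (stub `zone` type and names, not the
kernel's real definitions):

```c
#include <assert.h>

/* Stub of the kernel's zone/nid linkage, for illustration only. */
struct zone { int node; };

static inline int zone_to_nid(const struct zone *zone)
{
	return zone->node;
}

/* Count pageblocks whose nid matches the zone's nid.  zone_to_nid()
 * is loop-invariant, so it is looked up once before the loop rather
 * than on every iteration. */
static int count_local_blocks(const struct zone *zone,
			      const int *block_nids, int nblocks)
{
	int i, local = 0;
	int nid = zone_to_nid(zone);	/* hoisted out of the loop */

	for (i = 0; i < nblocks; i++)
		if (block_nids[i] == nid)
			local++;
	return local;
}
```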

Acked-by: Mel Gorman <[email protected]>

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index af982f7..feb7916 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -694,6 +694,9 @@ static int move_freepages(struct zone *zone,
> #endif
>
> for (page = start_page; page <= end_page;) {
> + /* Make sure we are not inadvertently changing nodes */
> + VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
> +
> if (!pfn_valid_within(page_to_pfn(page))) {
> page++;
> continue;
> @@ -2516,6 +2519,10 @@ static void setup_zone_migrate_reserve(struct zone *zone)
> continue;
> page = pfn_to_page(pfn);
>
> + /* Watch out for overlapping nodes */
> + if (page_to_nid(page) != zone_to_nid(zone))
> + continue;
> +
> /* Blocks with reserved pages will never free, skip them. */
> if (PageReserved(page))
> continue;
>
> --
> Adam Litke - (agl at us.ibm.com)
> IBM Linux Technology Center
>

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2008-08-26 09:29:37

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [BUG] [PATCH v2] Make setup_zone_migrate_reserve() aware of overlapping nodes

On Thu, Aug 21, 2008 at 12:33:39PM +0100, Mel Gorman wrote:
> On (20/08/08 14:55), Adam Litke didst pronounce:
> > Changes since V1
> > - Fix build for !NUMA
> > - Add VM_BUG_ON() to catch this problem at the source
> >
> > I have gotten to the root cause of the hugetlb badness I reported back on
> > August 15th. My system has the following memory topology (note the
> > overlapping node):
> >
> > Node 0 Memory: 0x8000000-0x44000000
> > Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000
> >
> > setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
> > for a pageblock to move onto the MIGRATE_RESERVE list. Finding no
> > candidates, it happily continues the scan into 0x8000000-0x44000000. When
> > a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
> > the wrong zone. Oops.
> >
> > (Andrew: once the proper fix is agreed upon, this should also be a
> > candidate for -stable.)
> >
> > setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.
> >
> > Signed-off-by: Adam Litke <[email protected]>
> >
>
> zone_to_nid(zone) is called every time in the loop even though it will never
> change. This is less than optimal but setup_zone_migrate_reserve() is only
> called during init and when min_free_kbytes is adjusted so it's not worth
> worrying about. Otherwise it looks good.
>
> Acked-by: Mel Gorman <[email protected]>
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index af982f7..feb7916 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -694,6 +694,9 @@ static int move_freepages(struct zone *zone,
> > #endif
> >
> > for (page = start_page; page <= end_page;) {
> > + /* Make sure we are not inadvertently changing nodes */
> > + VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
> > +
> > if (!pfn_valid_within(page_to_pfn(page))) {
> > page++;
> > continue;
> > @@ -2516,6 +2519,10 @@ static void setup_zone_migrate_reserve(struct zone *zone)
> > continue;
> > page = pfn_to_page(pfn);
> >
> > + /* Watch out for overlapping nodes */
> > + if (page_to_nid(page) != zone_to_nid(zone))
> > + continue;
> > +
> > /* Blocks with reserved pages will never free, skip them. */
> > if (PageReserved(page))
> > continue;

This patch looks sane. I do note that we have a config option to tell
us whether we have any possibility of overlapping nodes, and we have an
early-boot version of a check for this, early_pfn_in_nid(), in mm.h. You
might consider having a non-early variant of it which could be optimised
away for those arches which do not have CONFIG_NODES_SPAN_OTHER_NODES.

In 'unearlifying' this to pfn_in_nid() I think we have a small naming
issue with these functions, as they are only valid for use with pfns
within an existing node. They should probably both be *pfn_in_nid_within()*
or something in line with pfn_valid_within().
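
A sketch of what such a non-early wrapper might look like (hypothetical; the
names, the stub pfn_to_nid(), and the config plumbing are assumptions, not
code from this thread):

```c
#include <assert.h>

/* Pretend this arch can have overlapping nodes; without this define
 * the check below collapses to a constant and the caller's branch is
 * optimised away entirely. */
#define CONFIG_NODES_SPAN_OTHER_NODES 1

/* Stub for the kernel's pfn-to-node lookup: node 1 owns pfns below
 * 0x8000, node 0 the rest (loosely mirrors the thread's topology). */
static inline int pfn_to_nid(unsigned long pfn)
{
	return pfn < 0x8000 ? 1 : 0;
}

#ifdef CONFIG_NODES_SPAN_OTHER_NODES
static inline int pfn_in_nid_within(unsigned long pfn, int nid)
{
	return pfn_to_nid(pfn) == nid;
}
#else
/* No overlap possible on this arch: always true, zero runtime cost. */
static inline int pfn_in_nid_within(unsigned long pfn, int nid)
{
	return 1;
}
#endif
```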

-apw