2008-10-09 12:31:14

by Brice Goglin

Subject: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

Add a radix-tree in do_move_pages() to associate each page with
the struct page_to_node that describes its migration.
new_page_node() can now easily find out the page_to_node of the
given page instead of traversing the whole page_to_node array.
So the overall complexity is linear instead of quadratic.

We still need the page_to_node array since it is allocated by the
caller (sys_move_pages()) and used by do_pages_stat() when no target
nodes are given by the application. And we need room to store all
these page_to_node entries for do_move_pages() as well anyway.

If a page is given twice by the application, the old code would
return -EBUSY (failure from the second isolate_lru_page()). Now,
radix_tree_insert() will return -EEXIST, and we convert it back
to -EBUSY to keep the user-space ABI.

The radix-tree is emptied at the end of do_move_pages() since
new_page_node() doesn't know when an entry is used for the last
time (unmap_and_move() could try another pass later).
Marking pp->page as ZERO_PAGE(0) was never actually relied upon. We now
set it to NULL whenever pp is not in the radix-tree. It is faster
than emptying the tree with a radix_tree_gang_lookup()+radix_tree_delete() loop.

Signed-off-by: Brice Goglin <[email protected]>
Signed-off-by: Nathalie Furmento <[email protected]>

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -31,6 +31,7 @@
#include <linux/security.h>
#include <linux/memcontrol.h>
#include <linux/syscalls.h>
+#include <linux/radix-tree.h>

#include "internal.h"

@@ -840,12 +841,10 @@ struct page_to_node {
static struct page *new_page_node(struct page *p, unsigned long private,
int **result)
{
- struct page_to_node *pm = (struct page_to_node *)private;
+ struct radix_tree_root *root = (struct radix_tree_root *) private;
+ struct page_to_node *pm = (struct page_to_node *) radix_tree_lookup(root, (unsigned long) p);

- while (pm->node != MAX_NUMNODES && pm->page != p)
- pm++;
-
- if (pm->node == MAX_NUMNODES)
+ if (!pm)
return NULL;

*result = &pm->status;
@@ -865,6 +864,7 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
int err;
struct page_to_node *pp;
LIST_HEAD(pagelist);
+ RADIX_TREE(pmroot, GFP_KERNEL);

down_read(&mm->mmap_sem);

@@ -876,11 +876,8 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
struct vm_area_struct *vma;
struct page *page;

- /*
- * A valid page pointer that will not match any of the
- * pages that will be moved.
- */
- pp->page = ZERO_PAGE(0);
+ /* set to NULL as long as pp is not in the radix-tree */
+ pp->page = NULL;

err = -EFAULT;
vma = find_vma(mm, pp->addr);
@@ -900,9 +897,7 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
if (PageReserved(page)) /* Check for zero page */
goto put_and_set;

- pp->page = page;
err = page_to_nid(page);
-
if (err == pp->node)
/*
* Node already in the right place
@@ -914,6 +909,23 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
!migrate_all)
goto put_and_set;

+ /*
+ * Insert pp in the radix-tree so that new_page_node() can find it
+ * while only knowing the page.
+ * There cannot be any duplicate since isolate_lru_page() would fail
+ * below anyway.
+ */
+ err = radix_tree_insert(&pmroot, (unsigned long) page, pp);
+ if (err < 0) {
+ if (err == -EEXIST)
+ /* the page cannot be migrated twice */
+ err = -EBUSY;
+ goto put_and_set;
+ }
+
+ /* set pp->page for real now that pp is in the radix-tree */
+ pp->page = page;
+
err = isolate_lru_page(page, &pagelist);
put_and_set:
/*
@@ -927,12 +939,17 @@ set_status:
}

if (!list_empty(&pagelist))
- err = migrate_pages(&pagelist, new_page_node,
- (unsigned long)pm);
+ err = migrate_pages(&pagelist, new_page_node, (unsigned long) &pmroot);
else
err = -ENOENT;

up_read(&mm->mmap_sem);
+
+ /* empty the radix-tree now that new_page_node() will not be called anymore */
+ for (pp = pm; pp->node != MAX_NUMNODES; pp++)
+ if (pp->page)
+ radix_tree_delete(&pmroot, (unsigned long) pp->page);
+
return err;
}



2008-10-10 19:50:57

by Andrew Morton

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

On Thu, 09 Oct 2008 14:32:26 +0200
Brice Goglin <[email protected]> wrote:

> Add a radix-tree in do_move_pages() to associate each page with
> the struct page_to_node that describes its migration.
> new_page_node() can now easily find out the page_to_node of the
> given page instead of traversing the whole page_to_node array.
> So the overall complexity is linear instead of quadratic.
>
> We still need the page_to_node array since it is allocated by the
> caller (sys_move_page()) and used by do_pages_stat() when no target
> nodes are given by the application. And we need room to store all
> these page_to_node entries for do_move_pages() as well anyway.
>
> If a page is given twice by the application, the old code would
> return -EBUSY (failure from the second isolate_lru_page()). Now,
> radix_tree_insert() will return -EEXIST, and we convert it back
> to -EBUSY to keep the user-space ABI.
>
> The radix-tree is emptied at the end of do_move_pages() since
> new_page_node() doesn't know when an entry is used for the last
> time (unmap_and_move() could try another pass later).
> Marking pp->page as ZERO_PAGE(0) was actually never used. We now
> set it to NULL when pp is not in the radix-tree. It is faster
> than doing a loop of radix_tree_lookup_gang()+delete().

Any O(n*n) code always catches up with us in the end. But I do think
that to merge this code we'd need some description of the problem which
we fixed.

Please send a description of the situation under which the current code
performs unacceptably. Some before-and-after quantitative measurements
would be good.

Because it could be (as far as I know) that the problem is purely
theoretical, in which case we might not want the patch at all.

Thanks.

2008-10-10 20:12:20

by Brice Goglin

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

Andrew Morton wrote:
> On Thu, 09 Oct 2008 14:32:26 +0200
> Brice Goglin <[email protected]> wrote:
>
>
>> Add a radix-tree in do_move_pages() to associate each page with
>> the struct page_to_node that describes its migration.
>> new_page_node() can now easily find out the page_to_node of the
>> given page instead of traversing the whole page_to_node array.
>> So the overall complexity is linear instead of quadratic.
>>
>> We still need the page_to_node array since it is allocated by the
>> caller (sys_move_page()) and used by do_pages_stat() when no target
>> nodes are given by the application. And we need room to store all
>> these page_to_node entries for do_move_pages() as well anyway.
>>
>> If a page is given twice by the application, the old code would
>> return -EBUSY (failure from the second isolate_lru_page()). Now,
>> radix_tree_insert() will return -EEXIST, and we convert it back
>> to -EBUSY to keep the user-space ABI.
>>
>> The radix-tree is emptied at the end of do_move_pages() since
>> new_page_node() doesn't know when an entry is used for the last
>> time (unmap_and_move() could try another pass later).
>> Marking pp->page as ZERO_PAGE(0) was actually never used. We now
>> set it to NULL when pp is not in the radix-tree. It is faster
>> than doing a loop of radix_tree_lookup_gang()+delete().
>>
>
> Any O(n*n) code always catches up with us in the end. But I do think
> that to merge this code we'd need some description of the problem which
> we fixed.
>
> Please send a description of the situation under which the current code
> performs unacceptably. Some before-and-after quantitative measurements
> would be good.
>
> Because it could be (as far as I know) that the problem is purely
> theoretical, in which case we might not want the patch at all.
>

Just try sys_move_pages() on a 10-100MB buffer, you'll get something
like 50MB/s on a recent Opteron machine. This throughput decreases
significantly with the number of pages. With this patch, we get about
350MB/s and the throughput is stable when the migrated buffer gets
larger. I don't have detailed numbers at hand; I'll send them by Monday.

Brice

2008-10-10 20:34:10

by Christoph Lameter

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

Brice Goglin wrote:

> Just try sys_move_pages() on a 10-100MB buffer, you'll get something
> like 50MB/s on a recent Opteron machine. This throughput decreases
> significantly with the number of pages. With this patch, we get about
> 350MB/s and the throughput is stable when the migrated buffer gets
> larger. I don't have detailed numbers at hand; I'll send them by Monday.

Migration throughput is optimal for sys_move_pages() and the cpuset migration.
Some comparison would be useful.

With 100MB you have ~250k pages, which will require a vmalloc of 4MB for the
pm_struct array used to control the migration of each individual page.

Would it be possible to restructure this in such a way that we work in chunks
of 100 or so pages each so that we can avoid the vmalloc?

We also could do a kmalloc for each individual struct pm_struct with the radix
tree which would also avoid the vmalloc but still keep the need to allocate
4MB for temporary struct pm_structs.

Or get rid of the pm_struct altogether by storing the address of the node
vector somewhere and retrieve the node as needed from the array. This would
allow storing the struct page * pointers in the radix tree.

2008-10-10 21:00:48

by Brice Goglin

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

Brice Goglin wrote:
> Andrew Morton wrote:
>
>> On Thu, 09 Oct 2008 14:32:26 +0200
>> Brice Goglin <[email protected]> wrote:
>>
>>
>>
>>> Add a radix-tree in do_move_pages() to associate each page with
>>> the struct page_to_node that describes its migration.
>>> new_page_node() can now easily find out the page_to_node of the
>>> given page instead of traversing the whole page_to_node array.
>>> So the overall complexity is linear instead of quadratic.
>>>
>>> We still need the page_to_node array since it is allocated by the
>>> caller (sys_move_page()) and used by do_pages_stat() when no target
>>> nodes are given by the application. And we need room to store all
>>> these page_to_node entries for do_move_pages() as well anyway.
>>>
>>> If a page is given twice by the application, the old code would
>>> return -EBUSY (failure from the second isolate_lru_page()). Now,
>>> radix_tree_insert() will return -EEXIST, and we convert it back
>>> to -EBUSY to keep the user-space ABI.
>>>
>>> The radix-tree is emptied at the end of do_move_pages() since
>>> new_page_node() doesn't know when an entry is used for the last
>>> time (unmap_and_move() could try another pass later).
>>> Marking pp->page as ZERO_PAGE(0) was actually never used. We now
>>> set it to NULL when pp is not in the radix-tree. It is faster
>>> than doing a loop of radix_tree_lookup_gang()+delete().
>>>
>>>
>> Any O(n*n) code always catches up with us in the end. But I do think
>> that to merge this code we'd need some description of the problem which
>> we fixed.
>>
>> Please send a description of the situation under which the current code
>> performs unacceptably. Some before-and-after quantitative measurements
>> would be good.
>>
>> Because it could be (as far as I know) that the problem is purely
>> theoretical, in which case we might not want the patch at all.
>>
>>
>
> Just try sys_move_pages() on a 10-100MB buffer, you'll get something
> like 50MB/s on a recent Opteron machine. This throughput decreases
> significantly with the number of pages. With this patch, we get about
> 350MB/s and the throughput is stable when the migrated buffer gets
> larger. I don't have detailed numbers at hand; I'll send them by Monday.
>

Here are some quickly-gathered numbers for the duration of move_pages().
It's between nodes #2 and #3 of a quad-quad-core opteron 2347HE with
2.6.27-rc5 + perfmon2:

buffer (kB)   move_pages (us)   move_pages with patch (us)
4000          12351             12580
40000         223975            123024

As you can see, with the patch applied, migrating 10x more pages takes
about 10x longer. Without the patch, it takes 18x longer.

I'll see if I can implement Christoph's ideas.

Brice

2008-10-11 08:55:24

by Brice Goglin

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

Christoph Lameter wrote:
> Would it be possible to restructure this in such a way that we work in chunks
> of 100 or so pages each so that we can avoid the vmalloc?
>
> We also could do a kmalloc for each individual struct pm_struct with the radix
> tree which would also avoid the vmalloc but still keep the need to allocate
> 4MB for temporary struct pm_structs.
>
> Or get rid of the pm_struct altogether by storing the address of the node
> vector somewhere and retrieve the node as needed from the array. This would
> allow storing the struct page * pointers in the radix tree.
>

I have been thinking about all this; here are some ideas:
* do_pages_stat() may easily be rewritten without the huge pm array. It
only needs the user-space pointers to the page and status arrays: it
traverses the page array and, for each entry, does a get_user() of the
address, retrieves the page's node, and put_user()s the result into the
status array. No page_to_node structure needs to be allocated at all.
* If we split the migration into small chunks (such as one page worth of
pm's), the quadratic complexity doesn't matter much. There will be at
most several dozen pm's in the chunk array, so the linear search in
new_page_node() won't be that slow; it may even be faster than the
overall cost of maintaining a radix-tree to speed up the search. So we
can keep the internal code unchanged and just add the chunking around it.
* One thing that bothers me is move_pages() returning -ENOENT when no
pages at all end up being passed to migrate_pages(). I don't see why
having 100/100 pages not migrated should return a different error than
having only 99/100 pages not migrated. We have the status array to place
-ENOENT for each of these pages. If the user doesn't know where his
pages are allocated, he shouldn't get a different return value depending
on how many pages were already on the right node. This convention
actually makes user-space applications harder to write, since you need
to treat -ENOENT as a success unless you already knew for sure where
your pages were allocated. And the big thing is that this convention
makes the chunking painfully and uselessly more complex. Breaking the
user ABI is bad, but fixing a crazy ABI...

Here are some other numbers from the above (dirty) implementation,
migrating from node #2 to #3 on a quad-quad-core opteron 2347HE with
vanilla 2.6.27:

length   move_pages (us)   move_pages with patch (us)
4kB      126               98
40kB     198               168
400kB    963               937
4MB      12503             11930
40MB     246867            11848

It seems to be even slightly better than the previous patch (but the
kernels are a bit different). And I quickly checked that it scales well
up to 4GB buffers.

Brice

2008-10-11 08:58:32

by Nick Piggin

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

On Saturday 11 October 2008 19:54, Brice Goglin wrote:
> Christoph Lameter wrote:
> > Would it be possible to restructure this in such a way that we work in
> > chunks of 100 or so pages each so that we can avoid the vmalloc?
> >
> > We also could do a kmalloc for each individual struct pm_struct with the
> > radix tree which would also avoid the vmalloc but still keep the need to
> > allocate 4MB for temporary struct pm_structs.
> >
> > Or get rid of the pm_struct altogether by storing the address of the node
> > vector somewhere and retrieve the node as needed from the array. This
> > would allow storing the struct page * pointers in the radix tree.
>
> I have been thinking about all this, and here's some ideas:
> * do_pages_stat() may easily be rewritten without the huge pm array. It
> just need to user-space pointers to the page and status arrays. It
> traverses the first array , and for each page does: doing get_user() the
> address, retrieve the page node, and put_user() the result in the status
> array. No need to allocate any single page_to_node structure.
> * If we split the migration in small chunks (such as one page of pm's),
> the quadratic complexity doesn't matter that much. There will be at most
> several dozens of pm in the chunk array, so the linear search in
> new_page_node() won't be that slow, it may even be faster than the
> overall cost of adding a radix-tree to improve this search. So we can
> keep the internal code unchanged and just add the chunking around it.
> * One thing that bothers me is move_pages() returning -ENOENT when no
> page are given to migrate_pages(). I don't see why having 100/100 pages
> not migrated would return a different error than having only 99/100
> pages not migrated. We have the status array to place -ENOENT for all
> these pages. If the user doesn't know where his pages are allocated, he
> shouldn't get a different return value depending on how many pages were
> already on the right node. And actually, this convention makes
> user-space application harder to write since you need to treat -ENOENT
> as a success unless you already knew for sure where your pages were
> allocated. And the big thing is that this convention makes the chunking
> painfully/uselessly more complex. Breaking user-ABI is bad, but fixing
> crazy ABI...
>
> Here are some other numbers from the above (dirty) implementation,
> migrating from node #2 to #3 on a quad-quad-core opteron 2347HE with
> vanilla 2.6.27:
>
> length move_pages (us) move_pages with patch (us)
> 4kB 126 98
> 40kB 198 168
> 400kB 963 937
> 4MB 12503 11930
> 40MB 246867 11848
>
> It seems to be even slightly better than the previous patch (but the
> kernel are a bit different). And I quickly checked that it scales well
> up to 4GB buffers.

If you are worried about vmalloc overhead, I'd suggest testing with -mm.
I've rewritten the vmap code so it is now slightly scalable and sane to
use.

2008-10-11 09:20:31

by Brice Goglin

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

Nick Piggin wrote:
> If you are worried about vmalloc overhead, I'd suggest testing with -mm.
> I've rewritten the vmap code so it is now slightly scalable and sane to
> use.
>

I am actually only worried about move_pages() performance for large
buffers :) The vmalloc overhead is probably negligible compared to the
quadratic cost of move_pages() on dozens of MB.

Brice

2008-10-13 14:10:31

by Christoph Lameter

Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

Brice Goglin wrote:
> * One thing that bothers me is move_pages() returning -ENOENT when no
> page are given to migrate_pages(). I don't see why having 100/100 pages
> not migrated would return a different error than having only 99/100
> pages not migrated. We have the status array to place -ENOENT for all
> these pages. If the user doesn't know where his pages are allocated, he
> shouldn't get a different return value depending on how many pages were
> already on the right node. And actually, this convention makes
> user-space application harder to write since you need to treat -ENOENT
> as a success unless you already knew for sure where your pages were
> allocated. And the big thing is that this convention makes the chunking
> painfully/uselessly more complex. Breaking user-ABI is bad, but fixing
> crazy ABI...
>
I do not think that move_pages() is used that frequently. Changing the
API slightly as you suggest would not be that big of a deal.