DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws;
  s=s1024; d=yahoo.com.au;
  h=Received:X-YMail-OSG:X-Yahoo-Newman-Property:From:To:Subject:Date:User-Agent:Cc:References:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding:Content-Disposition:Message-Id;
  b=xP8rIsU0ARN5BLlG/p4/Olr/BWzpYLkT2TMckjOcKbO+617YL5Ok5MkfdMUf8tGebBe08XlFqMJ3kXPIfmyCFSW9+lL6w699/LRwSKTBLpOzvfUoF2iR9kE6TB2PkieqWsL5Q4vx0Vy3Y1A5Agtl69TnbvRLIwYBdykXUnCH4g4=  ;
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Brice Goglin <Brice.Goglin@inria.fr>
Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear
Date: Sat, 11 Oct 2008 19:58:12 +1100
User-Agent: KMail/1.9.5
Cc: Christoph Lameter <cl@linux-foundation.org>,
       Andrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org,
       linux-mm@kvack.org, nathalie.furmento@labri.fr
References: <48EDF9DA.7000508@inria.fr> <48EFBBE9.5000703@linux-foundation.org> <48F069B8.6050709@inria.fr>
In-Reply-To: <48F069B8.6050709@inria.fr>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200810111958.12848.nickpiggin@yahoo.com.au>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3187
Lines: 62

On Saturday 11 October 2008 19:54, Brice Goglin wrote:
> Christoph Lameter wrote:
> > Would it be possible to restructure this in such a way that we work in
> > chunks of 100 or so pages each so that we can avoid the vmalloc?
> >
> > We also could do a kmalloc for each individual struct pm_struct with the
> > radix tree which would also avoid the vmalloc but still keep the need to
> > allocate 4MB for temporary struct pm_structs.
> >
> > Or get rid of the pm_struct altogether by storing the address of the node
> > vector somewhere and retrieve the node as needed from the array. This
> > would allow storing the struct page * pointers in the radix tree.
>
> I have been thinking about all this, and here's some ideas:
> * do_pages_stat() may easily be rewritten without the huge pm array. It
> just need to user-space pointers to the page and status arrays. It
> traverses the first array , and for each page does: doing get_user() the
> address, retrieve the page node, and put_user() the result in the status
> array. No need to allocate any single page_to_node structure.
> * If we split the migration in small chunks (such as one page of pm's),
> the quadratic complexity doesn't matter that much. There will be at most
> several dozens of pm in the chunk array, so the linear search in
> new_page_node() won't be that slow, it may even be faster than the
> overall cost of adding a radix-tree to improve this search. So we can
> keep the internal code unchanged and just add the chunking around it.
> * One thing that bothers me is move_pages() returning -ENOENT when no
> page are given to migrate_pages(). I don't see why having 100/100 pages
> not migrated would return a different error than having only 99/100
> pages not migrated. We have the status array to place -ENOENT for all
> these pages. If the user doesn't know where his pages are allocated, he
> shouldn't get a different return value depending on how many pages were
> already on the right node. And actually, this convention makes
> user-space application harder to write since you need to treat -ENOENT
> as a success unless you already knew for sure where your pages were
> allocated. And the big thing is that this convention makes the chunking
> painfully/uselessly more complex. Breaking user-ABI is bad, but fixing
> crazy ABI...
>
> Here are some other numbers from the above (dirty) implementation,
> migrating from node #2 to #3 on a quad-quad-core opteron 2347HE with
> vanilla 2.6.27:
>
> length		move_pages (us)	move_pages with patch (us)
> 4kB		126		98
> 40kB		198		168
> 400kB		963		937
> 4MB		12503		11930
> 40MB		246867		11848
>
> It seems to be even slightly better than the previous patch (but the
> kernel are a bit different). And I quickly checked that it scales well
> up to 4GB buffers.

If you are worried about vmalloc overhead, I'd suggest testing with -mm.
I've rewritten the vmap code so it is now slightly scalable and sane to
use.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/