Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756078AbYJKI6c (ORCPT ); Sat, 11 Oct 2008 04:58:32 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1750933AbYJKI6Y (ORCPT ); Sat, 11 Oct 2008 04:58:24 -0400 Received: from smtp119.mail.mud.yahoo.com ([209.191.84.76]:33773 "HELO smtp119.mail.mud.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1750756AbYJKI6Y (ORCPT ); Sat, 11 Oct 2008 04:58:24 -0400 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com.au; h=Received:X-YMail-OSG:X-Yahoo-Newman-Property:From:To:Subject:Date:User-Agent:Cc:References:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding:Content-Disposition:Message-Id; b=xP8rIsU0ARN5BLlG/p4/Olr/BWzpYLkT2TMckjOcKbO+617YL5Ok5MkfdMUf8tGebBe08XlFqMJ3kXPIfmyCFSW9+lL6w699/LRwSKTBLpOzvfUoF2iR9kE6TB2PkieqWsL5Q4vx0Vy3Y1A5Agtl69TnbvRLIwYBdykXUnCH4g4= ; X-YMail-OSG: I2RRe3YVM1l2P_6yF9o5zipTMScNCD.es1XS46QdTYEOvd7.irzZ4Mu3EmWkSgUUuwY4jjiyEgmOM5TscmvtvaxKVOjeAyJk0SHx1gC9b_zL.1ycZRtHQHqGq7nhAi3ydfBNmdHW8ZdFcoZfBLREuId15Y3eWTFTcZPyfDcw X-Yahoo-Newman-Property: ymail-3 From: Nick Piggin To: Brice Goglin Subject: Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear Date: Sat, 11 Oct 2008 19:58:12 +1100 User-Agent: KMail/1.9.5 Cc: Christoph Lameter , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org, nathalie.furmento@labri.fr References: <48EDF9DA.7000508@inria.fr> <48EFBBE9.5000703@linux-foundation.org> <48F069B8.6050709@inria.fr> In-Reply-To: <48F069B8.6050709@inria.fr> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200810111958.12848.nickpiggin@yahoo.com.au> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3187 Lines: 62 On Saturday 11 October 2008 19:54, Brice Goglin wrote: > Christoph Lameter wrote: > > Would it be possible to restructure this in such a way that we work in > > chunks of 100 or so pages each so that we can avoid the vmalloc? > > > > We also could do a kmalloc for each individual struct pm_struct with the > > radix tree which would also avoid the vmalloc but still keep the need to > > allocate 4MB for temporary struct pm_structs. > > > > Or get rid of the pm_struct altogether by storing the address of the node > > vector somewhere and retrieve the node as needed from the array. This > > would allow storing the struct page * pointers in the radix tree. > > I have been thinking about all this, and here's some ideas: > * do_pages_stat() may easily be rewritten without the huge pm array. It > just need to user-space pointers to the page and status arrays. It > traverses the first array , and for each page does: doing get_user() the > address, retrieve the page node, and put_user() the result in the status > array. No need to allocate any single page_to_node structure. > * If we split the migration in small chunks (such as one page of pm's), > the quadratic complexity doesn't matter that much. There will be at most > several dozens of pm in the chunk array, so the linear search in > new_page_node() won't be that slow, it may even be faster than the > overall cost of adding a radix-tree to improve this search. So we can > keep the internal code unchanged and just add the chunking around it. > * One thing that bothers me is move_pages() returning -ENOENT when no > page are given to migrate_pages(). I don't see why having 100/100 pages > not migrated would return a different error than having only 99/100 > pages not migrated. We have the status array to place -ENOENT for all > these pages. If the user doesn't know where his pages are allocated, he > shouldn't get a different return value depending on how many pages were > already on the right node. And actually, this convention makes > user-space application harder to write since you need to treat -ENOENT > as a success unless you already knew for sure where your pages were > allocated. And the big thing is that this convention makes the chunking > painfully/uselessly more complex. Breaking user-ABI is bad, but fixing > crazy ABI... > > Here are some other numbers from the above (dirty) implementation, > migrating from node #2 to #3 on a quad-quad-core opteron 2347HE with > vanilla 2.6.27: > > length move_pages (us) move_pages with patch (us) > 4kB 126 98 > 40kB 198 168 > 400kB 963 937 > 4MB 12503 11930 > 40MB 246867 11848 > > It seems to be even slightly better than the previous patch (but the > kernel are a bit different). And I quickly checked that it scales well > up to 4GB buffers. If you are worried about vmalloc overhead, I'd suggest testing with -mm. I've rewritten the vmap code so it is now slightly scalable and sane to use. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/