Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754583AbZI3OlM (ORCPT ); Wed, 30 Sep 2009 10:41:12 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754334AbZI3OlM (ORCPT ); Wed, 30 Sep 2009 10:41:12 -0400 Received: from gir.skynet.ie ([193.1.99.77]:37134 "EHLO gir.skynet.ie" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754016AbZI3OlL (ORCPT ); Wed, 30 Sep 2009 10:41:11 -0400 Date: Wed, 30 Sep 2009 15:41:17 +0100 From: Mel Gorman To: Pekka Enberg Cc: Nick Piggin , Christoph Lameter , heiko.carstens@de.ibm.com, sachinp@in.ibm.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tejun Heo , Benjamin Herrenschmidt Subject: Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu Message-ID: <20090930144117.GA17906@csn.ul.ie> References: <1253624054-10882-1-git-send-email-mel@csn.ul.ie> <1253624054-10882-3-git-send-email-mel@csn.ul.ie> <84144f020909220638l79329905sf9a35286130e88d0@mail.gmail.com> <20090922135453.GF25965@csn.ul.ie> <84144f020909221154x820b287r2996480225692fad@mail.gmail.com> <20090922185608.GH25965@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090922185608.GH25965@csn.ul.ie> User-Agent: Mutt/1.5.17+20080114 (2008-01-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5290 Lines: 127 On Tue, Sep 22, 2009 at 07:56:08PM +0100, Mel Gorman wrote: > On Tue, Sep 22, 2009 at 09:54:33PM +0300, Pekka Enberg wrote: > > Hi Mel, > > > > On Tue, Sep 22, 2009 at 4:54 PM, Mel Gorman wrote: > > >> I don't understand how the memory leak happens from the above > > >> description (or reading the code). page_to_nid() returns some crazy > > >> value at free time? > > > > > > Nope, it isn't a leak as such, the allocator knows where the memory is. > > > The problem is that is always frees remote but on allocation, it sees > > > the per-cpu list is empty and calls the page allocator again. The remote > > > lists just grow. > > > > > >> The remote list isn't drained properly? > > > > > > That is another way of looking at it. When the remote lists get to a > > > watermark, they should drain. However, it's worth pointing out if it's > > > repaired in this fashion, the performance of SLQB will suffer as it'll > > > never reuse the local list of pages and instead always get cold pages > > > from the allocator. > > > > I worry about setting c->local_nid to the node of the allocated struct > > kmem_cache_cpu. It seems like an arbitrary policy decision that's not > > necessarily the best option and I'm not totally convinced it's correct > > when cpusets are configured. SLUB seems to do the sane thing here by > > using page allocator fallback (which respects cpusets AFAICT) and > > recycling one slab slab at a time. > > > > Can I persuade you into sending me a patch that fixes remote list > > draining to get things working on PPC? I'd much rather wait for Nick's > > input on the allocation policy and performance. > > > > It'll be at least next week before I can revisit this again. I'm afraid > I'm going offline from tomorrow until Tuesday. > Ok, so I spent today looking at this again. The problem is not with faulty drain logic as such. As frees always place an object on a remote list and the allocation side is often (but not always) allocating a new page, a significant number of objects in the free list are the only object in a page. SLQB drains based on the number of objects on the free list, not the number of pages. With many of the pages having only one object, the freelists are pinning a lot more memory than expected. For example, a watermark to drain of 512 could be pinning 2MB of pages. The drain logic could be extended to track not only the number of objects on the free list but also the number of pages but I really don't think that is desirable behaviour. I'm somewhat running out of sensible ideas for dealing with this but here is another go anyway that might be more palatable than tracking what a "local" node is within the slab. This boots on 2.6.32-rc1 with the latest slqb-core git tree with Kconfig modified to allow SLQB to be set on ppc64. ==== CUT HERE ==== SLQB: Allocate from the remote lists when the local node is memoryless and has no free objects When SLQB is freeing an object, it checks if the object belongs to a page within the local node. If it is not, the object is freed to a remote list. When the remote list has too many objects, the list is drained. On allocation, the remote list is only used if a specific node is specified and that node is not the local node. On memoryless nodes, there is a problem in that the specified node will often not be the local node. The impact is that many objects on the free list are the only object in the page. This bloats SLQB's memory requirements and causes OOM to trigger. This patch alters the allocation path. If the allocation from local lists fails and the local node is memoryless, an attempt will be made to allocate from the remote lists before going to the page allocator. Signed-off-by: Mel Gorman --- mm/slqb.c | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/mm/slqb.c b/mm/slqb.c index 4d72be2..b73e7d0 100644 --- a/mm/slqb.c +++ b/mm/slqb.c @@ -1513,16 +1513,30 @@ try_remote: l = &c->list; object = __cache_list_get_object(s, l); if (unlikely(!object)) { - object = cache_list_get_page(s, l); - if (unlikely(!object)) { - object = __slab_alloc_page(s, gfpflags, node); -#ifdef CONFIG_NUMA + int thisnode = numa_node_id(); + + /* + * If the local node is memoryless, try remote alloc before + * trying the page allocator. Otherwise, what happens is + * objects are always freed to remote lists but the allocation + * side always allocates a new page with only one object + * used in each page + */ + if (unlikely(!node_state(thisnode, N_HIGH_MEMORY))) + object = __remote_slab_alloc(s, gfpflags, thisnode); + + if (!object) { + object = cache_list_get_page(s, l); if (unlikely(!object)) { - node = numa_node_id(); - goto try_remote; - } + object = __slab_alloc_page(s, gfpflags, node); +#ifdef CONFIG_NUMA + if (unlikely(!object)) { + node = numa_node_id(); + goto try_remote; + } #endif - return object; + return object; + } } } if (likely(object)) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/