Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760539AbYFGBIl (ORCPT ); Fri, 6 Jun 2008 21:08:41 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757189AbYFGBG2 (ORCPT ); Fri, 6 Jun 2008 21:06:28 -0400 Received: from smtp1.linux-foundation.org ([140.211.169.13]:58409 "EHLO smtp1.linux-foundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758393AbYFGBG0 (ORCPT ); Fri, 6 Jun 2008 21:06:26 -0400 Date: Fri, 6 Jun 2008 18:05:06 -0700 From: Andrew Morton To: Rik van Riel Cc: linux-kernel@vger.kernel.org, lee.schermerhorn@hp.com, kosaki.motohiro@jp.fujitsu.com, linux-mm@kvack.org, eric.whitney@hp.com Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure Message-Id: <20080606180506.081f686a.akpm@linux-foundation.org> In-Reply-To: <20080606202859.291472052@redhat.com> References: <20080606202838.390050172@redhat.com> <20080606202859.291472052@redhat.com> X-Mailer: Sylpheed version 2.2.4 (GTK+ 2.8.20; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 10796 Lines: 305 On Fri, 06 Jun 2008 16:28:51 -0400 Rik van Riel wrote: > > From: Lee Schermerhorn > > Infrastructure to manage pages excluded from reclaim--i.e., hidden > from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked > to maintain "nonreclaimable" pages on a separate per-zone LRU list, > to "hide" them from vmscan. > > Kosaki Motohiro added the support for the memory controller noreclaim > lru list. > > Pages on the noreclaim list have both PG_noreclaim and PG_lru set. > Thus, PG_noreclaim is analogous to and mutually exclusive with > PG_active--it specifies which LRU list the page is on. > > The noreclaim infrastructure is enabled by a new mm Kconfig option > [CONFIG_]NORECLAIM_LRU. Having a config option for this really sucks, and needs extra-special justification, rather than none. Plus.. akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM_LRU ./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU ./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU ./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU ./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/pagemap.h:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/swap.h:#ifdef CONFIG_NORECLAIM_LRU ./include/linux/vmstat.h:#ifdef CONFIG_NORECLAIM_LRU ./kernel/sysctl.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/internal.h:#ifdef CONFIG_NORECLAIM_LRU ./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU ./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU > A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether > or not a page is reclaimable. Subsequent patches will add the various > !reclaimable tests. We'll want to keep these tests light-weight for > use in shrink_active_list() and, possibly, the fault path. > > To avoid races between tasks putting pages [back] onto an LRU list and > tasks that might be moving the page from nonreclaimable to reclaimable > state, one should test reclaimability under page lock and place > nonreclaimable pages directly on the noreclaim list before dropping the > lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim > list. It's OK to use the pagevec caches for reclaimable pages. The new > function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles > this transition, including potential page truncation while the page is > unlocked. > The changelog doesn't even mention, let alone explain and justify the fact that this feature is not available on 32-bit systems. This is a large drawback - it means that a (hopefully useful) feature is unavailable to the large majority of Linux systems and that it reduces the testing coverage and that it adversely impacts MM maintainability. > Index: linux-2.6.26-rc2-mm1/mm/Kconfig > =================================================================== > --- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-05-29 16:21:04.000000000 -0400 > +++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400 > @@ -205,3 +205,13 @@ config NR_QUICK > config VIRT_TO_BUS > def_bool y > depends on !ARCH_NO_VIRT_TO_BUS > + > +config NORECLAIM_LRU > + bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)" > + depends on EXPERIMENTAL && 64BIT > + help > + Supports tracking of non-reclaimable pages off the [in]active lists > + to avoid excessive reclaim overhead on large memory systems. Pages > + may be non-reclaimable because: they are locked into memory, they > + are anonymous pages for which no swap space exists, or they are anon > + pages that are expensive to unmap [long anon_vma "related vma" list.] Aunt Tillie might be struggling with some of that. > Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h > =================================================================== > --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400 > +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400 > @@ -94,6 +94,9 @@ enum pageflags { > PG_reclaim, /* To be reclaimed asap */ > PG_buddy, /* Page is free, on buddy lists */ > PG_swapbacked, /* Page is backed by RAM/swap */ > +#ifdef CONFIG_NORECLAIM_LRU > + PG_noreclaim, /* Page is "non-reclaimable" */ > +#endif I fear that we're messing up the terminology here. Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term already means a few different things, but in the vmscan context, "reclaimable" means that the page is unreferenced, clean and can be stolen. "reclaimable" also means a lot of other things, and we just made that worse. Can we think of a new term which uniquely describes this new concept and use that, rather than flogging the old horse? > > ... > > +/** > + * add_page_to_noreclaim_list > + * @page: the page to be added to the noreclaim list > + * > + * Add page directly to its zone's noreclaim list. To avoid races with > + * tasks that might be making the page reclaimble while it's not on the > + * lru, we want to add the page while it's locked or otherwise "invisible" > + * to other tasks. This is difficult to do when using the pagevec cache, > + * so bypass that. > + */ How does a task "make a page reclaimable"? munlock()? fsync()? exit()? Choice of terminology matters... > +void add_page_to_noreclaim_list(struct page *page) > +{ > + struct zone *zone = page_zone(page); > + > + spin_lock_irq(&zone->lru_lock); > + SetPageNoreclaim(page); > + SetPageLRU(page); > + add_page_to_lru_list(zone, page, LRU_NORECLAIM); > + spin_unlock_irq(&zone->lru_lock); > +} > + > /* > * Drain pages out of the cpu's pagevecs. > * Either "cpu" is the current CPU, and preemption has already been > @@ -339,6 +370,7 @@ void release_pages(struct page **pages, > > if (PageLRU(page)) { > struct zone *pagezone = page_zone(page); > + > if (pagezone != zone) { > if (zone) > spin_unlock_irqrestore(&zone->lru_lock, > @@ -415,6 +447,7 @@ void ____pagevec_lru_add(struct pagevec > { > int i; > struct zone *zone = NULL; > + VM_BUG_ON(is_noreclaim_lru(lru)); > > for (i = 0; i < pagevec_count(pvec); i++) { > struct page *page = pvec->pages[i]; > @@ -426,6 +459,7 @@ void ____pagevec_lru_add(struct pagevec > zone = pagezone; > spin_lock_irq(&zone->lru_lock); > } > + VM_BUG_ON(PageActive(page) || PageNoreclaim(page)); If this ever triggers, you'll wish that it had been coded with two separate assertions. > VM_BUG_ON(PageLRU(page)); > SetPageLRU(page); > if (is_active_lru(lru)) > > ... > > +/** > + * putback_lru_page > + * @page to be put back to appropriate lru list > + * > + * Add previously isolated @page to appropriate LRU list. > + * Page may still be non-reclaimable for other reasons. > + * > + * lru_lock must not be held, interrupts must be enabled. > + * Must be called with page locked. > + * > + * return 1 if page still locked [not truncated], else 0 > + */ The kerneldoc function description is missing. > +int putback_lru_page(struct page *page) > +{ > + int lru; > + int ret = 1; > + > + VM_BUG_ON(!PageLocked(page)); > + VM_BUG_ON(PageLRU(page)); > + > + lru = !!TestClearPageActive(page); > + ClearPageNoreclaim(page); /* for page_reclaimable() */ > + > + if (unlikely(!page->mapping)) { > + /* > + * page truncated. drop lock as put_page() will > + * free the page. > + */ > + VM_BUG_ON(page_count(page) != 1); > + unlock_page(page); > + ret = 0; > + } else if (page_reclaimable(page, NULL)) { > + /* > + * For reclaimable pages, we can use the cache. > + * In event of a race, worst case is we end up with a > + * non-reclaimable page on [in]active list. > + * We know how to handle that. > + */ > + lru += page_file_cache(page); > + lru_cache_add_lru(page, lru); > + mem_cgroup_move_lists(page, lru); > + } else { > + /* > + * Put non-reclaimable pages directly on zone's noreclaim > + * list. > + */ > + add_page_to_noreclaim_list(page); > + mem_cgroup_move_lists(page, LRU_NORECLAIM); > + } > + > + put_page(page); /* drop ref from isolate */ > + return ret; /* ret => "page still locked" */ > +} So THAT'S what the magical "return 2" is doing in page_file_cache()! OK, after all the patches are applied, the "2" becomes LRU_FILE and the enumeration of `enum lru_list' reflects that. > +/* > + * Cull page that shrink_*_list() has detected to be non-reclaimable > + * under page lock to close races with other tasks that might be making > + * the page reclaimable. Avoid stranding a reclaimable page on the > + * noreclaim list. > + */ > +static inline void cull_nonreclaimable_page(struct page *page) > +{ > + lock_page(page); > + if (putback_lru_page(page)) > + unlock_page(page); > +} Again, the terminology is quite overloaded and confusing. What does "non-reclaimable" mean in this context? _Any_ page which was dirty or which had an elevated refcount? Surely not referenced pages, which the scanner also can treat as non-reclaimable. Did you check whether all these inlined functions really should have been inlined? Even ones like this are probably too large. > /* > * shrink_page_list() returns the number of reclaimed pages > */ > > ... > > @@ -647,6 +721,14 @@ int __isolate_lru_page(struct page *page > if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file)) > return ret; > > + /* > + * Non-reclaimable pages shouldn't make it onto either the active > + * nor the inactive list. However, when doing lumpy reclaim of > + * higher order pages we can still run into them. I guess that something along the lines of "when this function is being called for lumpy reclaim we can still .." would be clearer. > + */ > + if (PageNoreclaim(page)) > + return ret; > + > ret = -EBUSY; > if (likely(get_page_unless_zero(page))) { > /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/