2008-06-06 20:36:15

by Rik van Riel

Subject: [PATCH -mm 13/25] Noreclaim LRU Infrastructure


From: Lee Schermerhorn <[email protected]>

Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
to "hide" them from vmscan.

KOSAKI Motohiro added support for the memory controller's noreclaim
LRU list.

Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
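
As a rough illustration of the flag-to-list mapping described above, here
is a user-space sketch. The names toy_page, toy_lru and toy_page_lru() are
hypothetical and are not part of the patch, and the numeric offsets
(active = +1, file = +2) are an assumption about the enum layout:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy mirror of the patch's enum lru_list; numeric values are assumed. */
enum toy_lru {
	TOY_LRU_INACTIVE_ANON,
	TOY_LRU_ACTIVE_ANON,
	TOY_LRU_INACTIVE_FILE,
	TOY_LRU_ACTIVE_FILE,
	TOY_LRU_NORECLAIM,
	TOY_NR_LRU_LISTS
};

/* Hypothetical stand-in for the page flags that select an LRU list. */
struct toy_page {
	bool active;	/* models PG_active */
	bool noreclaim;	/* models PG_noreclaim */
	bool file;	/* models page_file_cache() != 0 */
};

/* Pick the list a page belongs on, in the spirit of page_lru(). */
static enum toy_lru toy_page_lru(const struct toy_page *page)
{
	int lru = TOY_LRU_INACTIVE_ANON;

	/* The two flags are mutually exclusive. */
	assert(!(page->active && page->noreclaim));

	if (page->noreclaim)
		return TOY_LRU_NORECLAIM;
	if (page->active)
		lru += 1;	/* assumed LRU_ACTIVE offset */
	if (page->file)
		lru += 2;	/* assumed LRU_FILE offset */
	return (enum toy_lru)lru;
}
```

The noreclaim check comes first, so the file/anon distinction is
irrelevant once a page is marked non-reclaimable.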

The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM_LRU.

A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable. Subsequent patches will add the various
!reclaimable tests. We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.

To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from nonreclaimable to reclaimable
state, one should test reclaimability under page lock and place
nonreclaimable pages directly on the noreclaim list before dropping the
lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim
list. It's OK to use the pagevec caches for reclaimable pages. The new
function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles
this transition, including potential page truncation while the page is
unlocked.
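
The calling convention described above can be sketched in user-space C.
This is only a simplified model under stated assumptions: toy_page,
toy_putback_lru_page() and toy_cull_page() are hypothetical names, the
"reclaimable" field stands in for page_reclaimable(), and real locking,
pagevecs and reference counting are reduced to plain fields:

```c
#include <assert.h>
#include <stdbool.h>

enum toy_list { TOY_NONE, TOY_NORMAL_LRU, TOY_NORECLAIM_LRU };

/* Hypothetical model of an isolated page. */
struct toy_page {
	bool locked;		/* models the page lock */
	bool has_mapping;	/* false models a truncated page */
	bool reclaimable;	/* stand-in for page_reclaimable() */
	enum toy_list list;	/* which list the page sits on */
	int refcount;		/* includes the reference from isolation */
};

/*
 * Put an isolated, locked page back on a list.
 * Returns 1 if the page is still locked, 0 if it was unlocked here
 * because it had been truncated.
 */
static int toy_putback_lru_page(struct toy_page *page)
{
	int still_locked = 1;

	assert(page->locked);
	assert(page->list == TOY_NONE);

	if (!page->has_mapping) {
		/* Truncated: drop the lock, the final put frees the page. */
		page->locked = false;
		still_locked = 0;
	} else if (page->reclaimable) {
		page->list = TOY_NORMAL_LRU;
	} else {
		/* Decided under the page lock, so no stranding. */
		page->list = TOY_NORECLAIM_LRU;
	}

	page->refcount--;	/* drop the reference taken at isolation */
	return still_locked;
}

/* Caller pattern: lock, put back, unlock only if still locked. */
static void toy_cull_page(struct toy_page *page)
{
	page->locked = true;
	if (toy_putback_lru_page(page))
		page->locked = false;
}
```

The return value exists because the callee may already have dropped the
lock on a truncated page; the caller must not unlock a second time.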

Signed-off-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

---

include/linux/memcontrol.h | 2
include/linux/mm_inline.h | 13 ++-
include/linux/mmzone.h | 24 ++++++
include/linux/page-flags.h | 13 +++
include/linux/pagevec.h | 1
include/linux/swap.h | 12 +++
mm/Kconfig | 10 ++
mm/internal.h | 26 +++++++
mm/memcontrol.c | 73 ++++++++++++--------
mm/mempolicy.c | 2
mm/migrate.c | 68 ++++++++++++------
mm/page_alloc.c | 9 ++
mm/swap.c | 52 +++++++++++---
mm/vmscan.c | 164 +++++++++++++++++++++++++++++++++++++++------
14 files changed, 382 insertions(+), 87 deletions(-)

Index: linux-2.6.26-rc2-mm1/mm/Kconfig
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
@@ -205,3 +205,13 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config NORECLAIM_LRU
+ bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
+ depends on EXPERIMENTAL && 64BIT
+ help
+ Supports tracking of non-reclaimable pages off the [in]active lists
+ to avoid excessive reclaim overhead on large memory systems. Pages
+ may be non-reclaimable because: they are locked into memory, they
+ are anonymous pages for which no swap space exists, or they are anon
+ pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
@@ -94,6 +94,9 @@ enum pageflags {
PG_reclaim, /* To be reclaimed asap */
PG_buddy, /* Page is free, on buddy lists */
PG_swapbacked, /* Page is backed by RAM/swap */
+#ifdef CONFIG_NORECLAIM_LRU
+ PG_noreclaim, /* Page is "non-reclaimable" */
+#endif
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
@@ -167,6 +170,7 @@ PAGEFLAG(Referenced, referenced) TESTCLE
PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
+ TESTCLEARFLAG(Active, active)
__PAGEFLAG(Slab, slab)
PAGEFLAG(Checked, owner_priv_1) /* Used by some filesystems */
PAGEFLAG(Pinned, owner_priv_1) TESTSCFLAG(Pinned, owner_priv_1) /* Xen */
@@ -203,6 +207,15 @@ PAGEFLAG(SwapCache, swapcache)
PAGEFLAG_FALSE(SwapCache)
#endif

+#ifdef CONFIG_NORECLAIM_LRU
+PAGEFLAG(Noreclaim, noreclaim) __CLEARPAGEFLAG(Noreclaim, noreclaim)
+ TESTCLEARFLAG(Noreclaim, noreclaim)
+#else
+PAGEFLAG_FALSE(Noreclaim) TESTCLEARFLAG_FALSE(Noreclaim)
+ SETPAGEFLAG_NOOP(Noreclaim) CLEARPAGEFLAG_NOOP(Noreclaim)
+ __CLEARPAGEFLAG_NOOP(Noreclaim)
+#endif
+
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PAGEFLAG(Uncached, uncached)
#else
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-06-06 16:05:15.000000000 -0400
@@ -85,6 +85,11 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
+#ifdef CONFIG_NORECLAIM_LRU
+ NR_NORECLAIM, /* " " " " " */
+#else
+ NR_NORECLAIM = NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
@@ -124,10 +129,18 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
- NR_LRU_LISTS };
+#ifdef CONFIG_NORECLAIM_LRU
+ LRU_NORECLAIM,
+#else
+ LRU_NORECLAIM = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
+ NR_LRU_LISTS
+};

#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)

+#define for_each_reclaimable_lru(l) for (l = 0; l <= LRU_ACTIVE_FILE; l++)
+
static inline int is_file_lru(enum lru_list l)
{
return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE);
@@ -138,6 +151,15 @@ static inline int is_active_lru(enum lru
return (l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE);
}

+static inline int is_noreclaim_lru(enum lru_list l)
+{
+#ifdef CONFIG_NORECLAIM_LRU
+ return l == LRU_NORECLAIM;
+#else
+ return 0;
+#endif
+}
+
enum lru_list page_lru(struct page *page);

struct per_cpu_pages {
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-06-06 16:05:15.000000000 -0400
@@ -256,6 +256,9 @@ static void bad_page(struct page *page)
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -491,6 +494,9 @@ static inline int free_pages_check(struc
1 << PG_swapcache |
1 << PG_writeback |
1 << PG_reserved |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_buddy ))))
bad_page(page);
if (PageDirty(page))
@@ -642,6 +648,9 @@ static int prep_new_page(struct page *pa
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_dirty |
1 << PG_slab |
1 << PG_swapcache |
Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-06-06 16:05:15.000000000 -0400
@@ -89,11 +89,16 @@ del_page_from_lru(struct zone *zone, str
enum lru_list l = LRU_INACTIVE_ANON;

list_del(&page->lru);
- if (PageActive(page)) {
- __ClearPageActive(page);
- l += LRU_ACTIVE;
+ if (PageNoreclaim(page)) {
+ __ClearPageNoreclaim(page);
+ l = LRU_NORECLAIM;
+ } else {
+ if (PageActive(page)) {
+ __ClearPageActive(page);
+ l += LRU_ACTIVE;
+ }
+ l += page_file_cache(page);
}
- l += page_file_cache(page);
__dec_zone_state(zone, NR_INACTIVE_ANON + l);
}

Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:05:15.000000000 -0400
@@ -180,6 +180,8 @@ extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);

+extern void add_page_to_noreclaim_list(struct page *page);
+
/**
* lru_cache_add: add a page to the page lists
* @page: the page to add
@@ -228,6 +230,16 @@ static inline int zone_reclaim(struct zo
}
#endif

+#ifdef CONFIG_NORECLAIM_LRU
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#else
+static inline int page_reclaimable(struct page *page,
+ struct vm_area_struct *vma)
+{
+ return 1;
+}
+#endif
+
extern int kswapd_run(int nid);

#ifdef CONFIG_MMU
Index: linux-2.6.26-rc2-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagevec.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagevec.h 2008-06-06 16:05:15.000000000 -0400
@@ -101,7 +101,6 @@ static inline void __pagevec_lru_add_act
____pagevec_lru_add(pvec, LRU_ACTIVE_FILE);
}

-
static inline void pagevec_lru_add_file(struct pagevec *pvec)
{
if (pagevec_count(pvec))
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-06-06 16:05:15.000000000 -0400
@@ -106,9 +106,13 @@ enum lru_list page_lru(struct page *page
{
enum lru_list lru = LRU_BASE;

- if (PageActive(page))
- lru += LRU_ACTIVE;
- lru += page_file_cache(page);
+ if (PageNoreclaim(page))
+ lru = LRU_NORECLAIM;
+ else {
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
+ lru += page_file_cache(page);
+ }

return lru;
}
@@ -133,7 +137,8 @@ static void pagevec_move_tail(struct pag
zone = pagezone;
spin_lock(&zone->lru_lock);
}
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) &&
+ !PageNoreclaim(page)) {
int lru = page_file_cache(page);
list_move_tail(&page->lru, &zone->list[lru]);
pgmoved++;
@@ -154,7 +159,7 @@ static void pagevec_move_tail(struct pag
void rotate_reclaimable_page(struct page *page)
{
if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
- PageLRU(page)) {
+ !PageNoreclaim(page) && PageLRU(page)) {
struct pagevec *pvec;
unsigned long flags;

@@ -175,7 +180,7 @@ void activate_page(struct page *page)
struct zone *zone = page_zone(page);

spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
int file = page_file_cache(page);
int lru = LRU_BASE + file;
del_page_from_lru_list(zone, page, lru);
@@ -184,7 +189,7 @@ void activate_page(struct page *page)
lru += LRU_ACTIVE;
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
- mem_cgroup_move_lists(page, true);
+ mem_cgroup_move_lists(page, lru);

if (file) {
zone->recent_scanned_file++;
@@ -207,7 +212,8 @@ void activate_page(struct page *page)
*/
void mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+ if (!PageActive(page) && !PageNoreclaim(page) &&
+ PageReferenced(page) && PageLRU(page)) {
activate_page(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
@@ -235,13 +241,38 @@ void __lru_cache_add(struct page *page,
void lru_cache_add_lru(struct page *page, enum lru_list lru)
{
if (PageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
ClearPageActive(page);
+ } else if (PageNoreclaim(page)) {
+ VM_BUG_ON(PageActive(page));
+ ClearPageNoreclaim(page);
}

- VM_BUG_ON(PageLRU(page) || PageActive(page));
+ VM_BUG_ON(PageLRU(page) || PageActive(page) || PageNoreclaim(page));
__lru_cache_add(page, lru);
}

+/**
+ * add_page_to_noreclaim_list
+ * @page: the page to be added to the noreclaim list
+ *
+ * Add page directly to its zone's noreclaim list. To avoid races with
+ * tasks that might be making the page reclaimable while it's not on the
+ * lru, we want to add the page while it's locked or otherwise "invisible"
+ * to other tasks. This is difficult to do when using the pagevec cache,
+ * so bypass that.
+ */
+void add_page_to_noreclaim_list(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ SetPageNoreclaim(page);
+ SetPageLRU(page);
+ add_page_to_lru_list(zone, page, LRU_NORECLAIM);
+ spin_unlock_irq(&zone->lru_lock);
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -339,6 +370,7 @@ void release_pages(struct page **pages,

if (PageLRU(page)) {
struct zone *pagezone = page_zone(page);
+
if (pagezone != zone) {
if (zone)
spin_unlock_irqrestore(&zone->lru_lock,
@@ -415,6 +447,7 @@ void ____pagevec_lru_add(struct pagevec
{
int i;
struct zone *zone = NULL;
+ VM_BUG_ON(is_noreclaim_lru(lru));

for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
@@ -426,6 +459,7 @@ void ____pagevec_lru_add(struct pagevec
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
+ VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
if (is_active_lru(lru))
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-06-06 16:05:15.000000000 -0400
@@ -53,14 +53,9 @@ int migrate_prep(void)
return 0;
}

-static inline void move_to_lru(struct page *page)
-{
- lru_cache_add_lru(page, page_lru(page));
- put_page(page);
-}
-
/*
- * Add isolated pages on the list back to the LRU.
+ * Add isolated pages on the list back to the LRU under page lock
+ * to avoid leaking reclaimable pages back onto noreclaim list.
*
* returns the number of pages put back.
*/
@@ -72,7 +67,9 @@ int putback_lru_pages(struct list_head *

list_for_each_entry_safe(page, page2, l, lru) {
list_del(&page->lru);
- move_to_lru(page);
+ lock_page(page);
+ if (putback_lru_page(page))
+ unlock_page(page);
count++;
}
return count;
@@ -340,8 +337,11 @@ static void migrate_page_copy(struct pag
SetPageReferenced(newpage);
if (PageUptodate(page))
SetPageUptodate(newpage);
- if (PageActive(page))
+ if (TestClearPageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
SetPageActive(newpage);
+ } else
+ noreclaim_migrate_page(newpage, page);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
@@ -362,7 +362,6 @@ static void migrate_page_copy(struct pag
#ifdef CONFIG_SWAP
ClearPageSwapCache(page);
#endif
- ClearPageActive(page);
ClearPagePrivate(page);
set_page_private(page, 0);
page->mapping = NULL;
@@ -541,10 +540,15 @@ static int fallback_migrate_page(struct
*
* The new page will have replaced the old page if this function
* is successful.
+ *
+ * Return value:
+ * < 0 - error code
+ * == 0 - success
*/
static int move_to_new_page(struct page *newpage, struct page *page)
{
struct address_space *mapping;
+ int unlock = 1;
int rc;

/*
@@ -579,10 +583,16 @@ static int move_to_new_page(struct page

if (!rc) {
remove_migration_ptes(page, newpage);
+ /*
+ * Put back on LRU while holding page locked to
+ * handle potential race with, e.g., munlock()
+ */
+ unlock = putback_lru_page(newpage);
} else
newpage->mapping = NULL;

- unlock_page(newpage);
+ if (unlock)
+ unlock_page(newpage);

return rc;
}
@@ -599,18 +609,19 @@ static int unmap_and_move(new_page_t get
struct page *newpage = get_new_page(page, private, &result);
int rcu_locked = 0;
int charge = 0;
+ int unlock = 1;

if (!newpage)
return -ENOMEM;

if (page_count(page) == 1)
/* page was freed from under us. So we are done. */
- goto move_newpage;
+ goto end_migration;

charge = mem_cgroup_prepare_migration(page, newpage);
if (charge == -ENOMEM) {
rc = -ENOMEM;
- goto move_newpage;
+ goto end_migration;
}
/* prepare cgroup just returns 0 or -ENOMEM */
BUG_ON(charge);
@@ -618,7 +629,7 @@ static int unmap_and_move(new_page_t get
rc = -EAGAIN;
if (TestSetPageLocked(page)) {
if (!force)
- goto move_newpage;
+ goto end_migration;
lock_page(page);
}

@@ -680,8 +691,6 @@ rcu_unlock:

unlock:

- unlock_page(page);
-
if (rc != -EAGAIN) {
/*
* A page that has been migrated has all references
@@ -690,17 +699,30 @@ unlock:
* restored.
*/
list_del(&page->lru);
- move_to_lru(page);
+ if (!page->mapping) {
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
+ put_page(page); /* just free the old page */
+ goto end_migration;
+ } else
+ unlock = putback_lru_page(page);
}

-move_newpage:
+ if (unlock)
+ unlock_page(page);
+
+end_migration:
if (!charge)
mem_cgroup_end_migration(newpage);
- /*
- * Move the new page to the LRU. If migration was not successful
- * then this will free the page.
- */
- move_to_lru(newpage);
+
+ if (!newpage->mapping) {
+ /*
+ * Migration failed or was never attempted.
+ * Free the newpage.
+ */
+ VM_BUG_ON(page_count(newpage) != 1);
+ put_page(newpage);
+ }
if (result) {
if (rc)
*result = rc;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:05:50.000000000 -0400
@@ -437,6 +437,73 @@ cannot_free:
return 0;
}

+/**
+ * putback_lru_page
+ * @page: page to be put back to the appropriate LRU list
+ *
+ * Add previously isolated @page to appropriate LRU list.
+ * Page may still be non-reclaimable for other reasons.
+ *
+ * lru_lock must not be held, interrupts must be enabled.
+ * Must be called with page locked.
+ *
+ * return 1 if page still locked [not truncated], else 0
+ */
+int putback_lru_page(struct page *page)
+{
+ int lru;
+ int ret = 1;
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ lru = !!TestClearPageActive(page);
+ ClearPageNoreclaim(page); /* for page_reclaimable() */
+
+ if (unlikely(!page->mapping)) {
+ /*
+ * page truncated. drop lock as put_page() will
+ * free the page.
+ */
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
+ ret = 0;
+ } else if (page_reclaimable(page, NULL)) {
+ /*
+ * For reclaimable pages, we can use the cache.
+ * In event of a race, worst case is we end up with a
+ * non-reclaimable page on [in]active list.
+ * We know how to handle that.
+ */
+ lru += page_file_cache(page);
+ lru_cache_add_lru(page, lru);
+ mem_cgroup_move_lists(page, lru);
+ } else {
+ /*
+ * Put non-reclaimable pages directly on zone's noreclaim
+ * list.
+ */
+ add_page_to_noreclaim_list(page);
+ mem_cgroup_move_lists(page, LRU_NORECLAIM);
+ }
+
+ put_page(page); /* drop ref from isolate */
+ return ret; /* ret => "page still locked" */
+}
+
+/*
+ * Cull page that shrink_*_list() has detected to be non-reclaimable
+ * under page lock to close races with other tasks that might be making
+ * the page reclaimable. Avoid stranding a reclaimable page on the
+ * noreclaim list.
+ */
+static inline void cull_nonreclaimable_page(struct page *page)
+{
+ lock_page(page);
+ if (putback_lru_page(page))
+ unlock_page(page);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -470,6 +537,12 @@ static unsigned long shrink_page_list(st

sc->nr_scanned++;

+ if (unlikely(!page_reclaimable(page, NULL))) {
+ if (putback_lru_page(page))
+ unlock_page(page);
+ continue;
+ }
+
if (!sc->may_swap && page_mapped(page))
goto keep_locked;

@@ -566,7 +639,7 @@ static unsigned long shrink_page_list(st
* possible for a page to have PageDirty set, but it is actually
* clean (all its buffers are clean). This happens if the
* buffers were written out directly, with submit_bh(). ext3
- * will do this, as well as the blockdev mapping.
+ * will do this, as well as the blockdev mapping.
* try_to_release_page() will discover that cleanness and will
* drop the buffers and mark the page clean - it can be freed.
*
@@ -598,6 +671,7 @@ activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
if (PageSwapCache(page) && vm_swap_full())
remove_exclusive_swap_page_ref(page);
+ VM_BUG_ON(PageActive(page));
SetPageActive(page);
pgactivate++;
keep_locked:
@@ -647,6 +721,14 @@ int __isolate_lru_page(struct page *page
if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
return ret;

+ /*
+ * Non-reclaimable pages shouldn't make it onto either the active
+ * or the inactive list. However, when doing lumpy reclaim of
+ * higher order pages we can still run into them.
+ */
+ if (PageNoreclaim(page))
+ return ret;
+
ret = -EBUSY;
if (likely(get_page_unless_zero(page))) {
/*
@@ -758,7 +840,7 @@ static unsigned long isolate_lru_pages(u
/* else it is being freed elsewhere */
list_move(&cursor_page->lru, src);
default:
- break;
+ break; /* ! on LRU or wrong list */
}
}
}
@@ -818,8 +900,9 @@ static unsigned long clear_active_flags(
* Returns -EBUSY if the page was not on an LRU list.
*
* The returned page will have PageLRU() cleared. If it was found on
- * the active list, it will have PageActive set. That flag may need
- * to be cleared by the caller before letting the page go.
+ * the active list, it will have PageActive set. If it was found on
+ * the noreclaim list, it will have the PageNoreclaim bit set. That flag
+ * may need to be cleared by the caller before letting the page go.
*
* The vmstat statistic corresponding to the list on which the page was
* found will be decremented.
@@ -844,7 +927,13 @@ int isolate_lru_page(struct page *page)
ret = 0;
ClearPageLRU(page);

+ /* Calculate the LRU list for normal pages ... */
lru += page_file_cache(page) + !!PageActive(page);
+
+ /* ... except NoReclaim, which has its own list. */
+ if (PageNoreclaim(page))
+ lru = LRU_NORECLAIM;
+
del_page_from_lru_list(zone, page, lru);
}
spin_unlock_irq(&zone->lru_lock);
@@ -959,19 +1048,27 @@ static unsigned long shrink_inactive_lis
int lru = LRU_BASE;
page = lru_to_page(&page_list);
VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
list_del(&page->lru);
- if (page_file_cache(page))
- lru += LRU_FILE;
- if (scan_global_lru(sc)) {
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ cull_nonreclaimable_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
+ } else {
if (page_file_cache(page))
- zone->recent_rotated_file++;
- else
- zone->recent_rotated_anon++;
+ lru += LRU_FILE;
+ if (scan_global_lru(sc)) {
+ if (page_file_cache(page))
+ zone->recent_rotated_file++;
+ else
+ zone->recent_rotated_anon++;
+ }
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
}
- if (PageActive(page))
- lru += LRU_ACTIVE;
+ SetPageLRU(page);
add_page_to_lru_list(zone, page, lru);
+ mem_cgroup_move_lists(page, lru);
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -1065,6 +1162,12 @@ static void shrink_active_list(unsigned
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
+
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ cull_nonreclaimable_page(page);
+ continue;
+ }
+
if (page_referenced(page, 0, sc->mem_cgroup)) {
if (file) {
/* Referenced file pages stay active. */
@@ -1107,7 +1210,7 @@ static void shrink_active_list(unsigned
ClearPageActive(page);

list_move(&page->lru, &zone->list[lru]);
- mem_cgroup_move_lists(page, false);
+ mem_cgroup_move_lists(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
@@ -1139,7 +1242,7 @@ static void shrink_active_list(unsigned
VM_BUG_ON(!PageActive(page));

list_move(&page->lru, &zone->list[lru]);
- mem_cgroup_move_lists(page, true);
+ mem_cgroup_move_lists(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
@@ -1277,7 +1380,7 @@ static unsigned long shrink_zone(int pri

get_scan_ratio(zone, sc, percent);

- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
if (scan_global_lru(sc)) {
int file = is_file_lru(l);
int scan;
@@ -1308,7 +1411,7 @@ static unsigned long shrink_zone(int pri

while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l],
(unsigned long)sc->swap_cluster_max);
@@ -1859,8 +1962,8 @@ static unsigned long shrink_all_zones(un
if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
continue;

- for_each_lru(l) {
- /* For pass = 0 we don't shrink the active list */
+ for_each_reclaimable_lru(l) {
+ /* For pass = 0, we don't shrink the active list */
if (pass == 0 &&
(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
continue;
@@ -2197,3 +2300,26 @@ int zone_reclaim(struct zone *zone, gfp_
return ret;
}
#endif
+
+#ifdef CONFIG_NORECLAIM_LRU
+/*
+ * page_reclaimable - test whether a page is reclaimable
+ * @page: the page to test
+ * @vma: the VMA in which the page is or will be mapped, may be NULL
+ *
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+
+ VM_BUG_ON(PageNoreclaim(page));
+
+ /* TODO: test page [!]reclaimable conditions */
+
+ return 1;
+}
+#endif
Index: linux-2.6.26-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mempolicy.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mempolicy.c 2008-06-06 16:05:15.000000000 -0400
@@ -2199,7 +2199,7 @@ static void gather_stats(struct page *pa
if (PageSwapCache(page))
md->swapcache++;

- if (PageActive(page))
+ if (PageActive(page) || PageNoreclaim(page))
md->active++;

if (PageWriteback(page))
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:05:15.000000000 -0400
@@ -34,8 +34,15 @@ static inline void __put_page(struct pag
atomic_dec(&page->_count);
}

+/*
+ * in mm/vmscan.c:
+ */
extern int isolate_lru_page(struct page *page);
+extern int putback_lru_page(struct page *page);

+/*
+ * in mm/page_alloc.c
+ */
extern void __free_pages_bootmem(struct page *page, unsigned int order);

/*
@@ -49,6 +56,25 @@ static inline unsigned long page_order(s
return page_private(page);
}

+#ifdef CONFIG_NORECLAIM_LRU
+/*
+ * noreclaim_migrate_page() called only from migrate_page_copy() to
+ * migrate noreclaim flag to new page.
+ * Note that the old page has been isolated from the LRU lists at this
+ * point so we don't need to worry about LRU statistics.
+ */
+static inline void noreclaim_migrate_page(struct page *new, struct page *old)
+{
+ if (TestClearPageNoreclaim(old))
+ SetPageNoreclaim(new);
+}
+#else
+static inline void noreclaim_migrate_page(struct page *new, struct page *old)
+{
+}
+#endif
+
+
/*
* FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
* so all functions starting at paging_init should be marked __init
Index: linux-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memcontrol.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memcontrol.c 2008-06-06 16:05:15.000000000 -0400
@@ -161,9 +161,10 @@ struct page_cgroup {
int ref_cnt; /* cached, mapped, migrating */
int flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
+#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
+#define PAGE_CGROUP_FLAG_NORECLAIM (0x8) /* page is non-reclaimable */

static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -283,10 +284,14 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;

- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
- lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
- lru += LRU_FILE;
+ if (pc->flags & PAGE_CGROUP_FLAG_NORECLAIM)
+ lru = LRU_NORECLAIM;
+ else {
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
+ }

MEM_CGROUP_ZSTAT(mz, lru) -= 1;

@@ -299,10 +304,14 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;

- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
- lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
- lru += LRU_FILE;
+ if (pc->flags & PAGE_CGROUP_FLAG_NORECLAIM)
+ lru = LRU_NORECLAIM;
+ else {
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
+ }

MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);
@@ -310,21 +319,31 @@ static void __mem_cgroup_add_list(struct
mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
}

-static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
+static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int lru = LRU_FILE * !!file + !!from;
+ int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+ int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
+ int noreclaim = pc->flags & PAGE_CGROUP_FLAG_NORECLAIM;
+ enum lru_list from = noreclaim ? LRU_NORECLAIM :
+ (LRU_FILE * !!file + !!active);

- MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+ if (lru == from)
+ return;

- if (active)
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
- else
+ MEM_CGROUP_ZSTAT(mz, from) -= 1;
+
+ if (is_noreclaim_lru(lru)) {
pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
+ pc->flags |= PAGE_CGROUP_FLAG_NORECLAIM;
+ } else {
+ if (is_active_lru(lru))
+ pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ else
+ pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
+ pc->flags &= ~PAGE_CGROUP_FLAG_NORECLAIM;
+ }

- lru = LRU_FILE * !!file + !!active;
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_move(&pc->lru, &mz->lists[lru]);
}
@@ -342,7 +361,7 @@ int task_in_mem_cgroup(struct task_struc
/*
* This routine assumes that the appropriate zone's lru lock is already held
*/
-void mem_cgroup_move_lists(struct page *page, bool active)
+void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
struct mem_cgroup_per_zone *mz;
@@ -362,7 +381,7 @@ void mem_cgroup_move_lists(struct page *
if (pc) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, active);
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
unlock_page_cgroup(page);
@@ -460,12 +479,10 @@ unsigned long mem_cgroup_isolate_pages(u
/*
* TODO: play better with lumpy reclaim, grabbing anything.
*/
- if (PageActive(page) && !active) {
- __mem_cgroup_move_lists(pc, true);
- continue;
- }
- if (!PageActive(page) && active) {
- __mem_cgroup_move_lists(pc, false);
+ if (PageNoreclaim(page) ||
+ (PageActive(page) && !active) ||
+ (!PageActive(page) && active)) {
+ __mem_cgroup_move_lists(pc, page_lru(page));
continue;
}

Index: linux-2.6.26-rc2-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/memcontrol.h 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/memcontrol.h 2008-06-06 16:05:15.000000000 -0400
@@ -35,7 +35,7 @@ extern int mem_cgroup_charge(struct page
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page, bool active);
+extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,

--
All Rights Reversed


2008-06-07 01:08:41

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Fri, 06 Jun 2008 16:28:51 -0400
Rik van Riel <[email protected]> wrote:

>
> From: Lee Schermerhorn <[email protected]>
>
> Infrastructure to manage pages excluded from reclaim--i.e., hidden
> from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
> to maintain "nonreclaimable" pages on a separate per-zone LRU list,
> to "hide" them from vmscan.
>
> Kosaki Motohiro added the support for the memory controller noreclaim
> lru list.
>
> Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
> Thus, PG_noreclaim is analogous to and mutually exclusive with
> PG_active--it specifies which LRU list the page is on.
>
> The noreclaim infrastructure is enabled by a new mm Kconfig option
> [CONFIG_]NORECLAIM_LRU.

Having a config option for this really sucks, and needs extra-special
justification, rather than none.

Plus..

akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM_LRU
./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU
./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU
./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU
./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/pagemap.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/swap.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/vmstat.h:#ifdef CONFIG_NORECLAIM_LRU
./kernel/sysctl.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/internal.h:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU


> A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
> or not a page is reclaimable. Subsequent patches will add the various
> !reclaimable tests. We'll want to keep these tests light-weight for
> use in shrink_active_list() and, possibly, the fault path.
>
> To avoid races between tasks putting pages [back] onto an LRU list and
> tasks that might be moving the page from nonreclaimable to reclaimable
> state, one should test reclaimability under page lock and place
> nonreclaimable pages directly on the noreclaim list before dropping the
> lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim
> list. It's OK to use the pagevec caches for reclaimable pages. The new
> function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles
> this transition, including potential page truncation while the page is
> unlocked.
>

The changelog doesn't even mention, let alone explain and justify the
fact that this feature is not available on 32-bit systems. This is a
large drawback - it means that a (hopefully useful) feature is
unavailable to the large majority of Linux systems and that it reduces
the testing coverage and that it adversely impacts MM maintainability.

> Index: linux-2.6.26-rc2-mm1/mm/Kconfig
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
> @@ -205,3 +205,13 @@ config NR_QUICK
> config VIRT_TO_BUS
> def_bool y
> depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config NORECLAIM_LRU
> + bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
> + depends on EXPERIMENTAL && 64BIT
> + help
> + Supports tracking of non-reclaimable pages off the [in]active lists
> + to avoid excessive reclaim overhead on large memory systems. Pages
> + may be non-reclaimable because: they are locked into memory, they
> + are anonymous pages for which no swap space exists, or they are anon
> + pages that are expensive to unmap [long anon_vma "related vma" list.]

Aunt Tillie might be struggling with some of that.

> Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> @@ -94,6 +94,9 @@ enum pageflags {
> PG_reclaim, /* To be reclaimed asap */
> PG_buddy, /* Page is free, on buddy lists */
> PG_swapbacked, /* Page is backed by RAM/swap */
> +#ifdef CONFIG_NORECLAIM_LRU
> + PG_noreclaim, /* Page is "non-reclaimable" */
> +#endif

I fear that we're messing up the terminology here.

Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
already means a few different things, but in the vmscan context,
"reclaimable" means that the page is unreferenced, clean and can be
stolen. "reclaimable" also means a lot of other things, and we just
made that worse.

Can we think of a new term which uniquely describes this new concept
and use that, rather than flogging the old horse?

>
> ...
>
> +/**
> + * add_page_to_noreclaim_list
> + * @page: the page to be added to the noreclaim list
> + *
> + * Add page directly to its zone's noreclaim list. To avoid races with
> + * tasks that might be making the page reclaimable while it's not on the
> + * lru, we want to add the page while it's locked or otherwise "invisible"
> + * to other tasks. This is difficult to do when using the pagevec cache,
> + * so bypass that.
> + */

How does a task "make a page reclaimable"? munlock()? fsync()?
exit()?

Choice of terminology matters...

> +void add_page_to_noreclaim_list(struct page *page)
> +{
> + struct zone *zone = page_zone(page);
> +
> + spin_lock_irq(&zone->lru_lock);
> + SetPageNoreclaim(page);
> + SetPageLRU(page);
> + add_page_to_lru_list(zone, page, LRU_NORECLAIM);
> + spin_unlock_irq(&zone->lru_lock);
> +}
> +
> /*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> @@ -339,6 +370,7 @@ void release_pages(struct page **pages,
>
> if (PageLRU(page)) {
> struct zone *pagezone = page_zone(page);
> +
> if (pagezone != zone) {
> if (zone)
> spin_unlock_irqrestore(&zone->lru_lock,
> @@ -415,6 +447,7 @@ void ____pagevec_lru_add(struct pagevec
> {
> int i;
> struct zone *zone = NULL;
> + VM_BUG_ON(is_noreclaim_lru(lru));
>
> for (i = 0; i < pagevec_count(pvec); i++) {
> struct page *page = pvec->pages[i];
> @@ -426,6 +459,7 @@ void ____pagevec_lru_add(struct pagevec
> zone = pagezone;
> spin_lock_irq(&zone->lru_lock);
> }
> + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));

If this ever triggers, you'll wish that it had been coded with two
separate assertions.
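
To illustrate the point, here is a minimal user-space sketch. The flag
bits and helper names are hypothetical and not part of the patch, and
assert() stands in for VM_BUG_ON() with the sense inverted. Checking each
flag on its own means the failure report identifies which flag was set:

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
enum { PG_ACTIVE_BIT = 1u << 0, PG_NORECLAIM_BIT = 1u << 1 };

/* Which check failed, if any. */
enum lru_add_violation { LRU_ADD_OK, LRU_ADD_ACTIVE, LRU_ADD_NORECLAIM };

/* Test each flag on its own so the result names the offender. */
static enum lru_add_violation classify_lru_add(unsigned int flags)
{
	if (flags & PG_ACTIVE_BIT)
		return LRU_ADD_ACTIVE;
	if (flags & PG_NORECLAIM_BIT)
		return LRU_ADD_NORECLAIM;
	return LRU_ADD_OK;
}

/*
 * One assertion per flag: when one fires, the reported line number
 * tells you which flag was unexpectedly set.
 */
static void assert_lru_add_ok(unsigned int flags)
{
	assert(!(flags & PG_ACTIVE_BIT));
	assert(!(flags & PG_NORECLAIM_BIT));
}
```

In the patch itself this would presumably become one VM_BUG_ON() per
flag in place of the combined test.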

> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> if (is_active_lru(lru))
>
> ...
>
> +/**
> + * putback_lru_page
> + * @page to be put back to appropriate lru list
> + *
> + * Add previously isolated @page to appropriate LRU list.
> + * Page may still be non-reclaimable for other reasons.
> + *
> + * lru_lock must not be held, interrupts must be enabled.
> + * Must be called with page locked.
> + *
> + * return 1 if page still locked [not truncated], else 0
> + */

The kerneldoc function description is missing.

> +int putback_lru_page(struct page *page)
> +{
> + int lru;
> + int ret = 1;
> +
> + VM_BUG_ON(!PageLocked(page));
> + VM_BUG_ON(PageLRU(page));
> +
> + lru = !!TestClearPageActive(page);
> + ClearPageNoreclaim(page); /* for page_reclaimable() */
> +
> + if (unlikely(!page->mapping)) {
> + /*
> + * page truncated. drop lock as put_page() will
> + * free the page.
> + */
> + VM_BUG_ON(page_count(page) != 1);
> + unlock_page(page);
> + ret = 0;
> + } else if (page_reclaimable(page, NULL)) {
> + /*
> + * For reclaimable pages, we can use the cache.
> + * In event of a race, worst case is we end up with a
> + * non-reclaimable page on [in]active list.
> + * We know how to handle that.
> + */
> + lru += page_file_cache(page);
> + lru_cache_add_lru(page, lru);
> + mem_cgroup_move_lists(page, lru);
> + } else {
> + /*
> + * Put non-reclaimable pages directly on zone's noreclaim
> + * list.
> + */
> + add_page_to_noreclaim_list(page);
> + mem_cgroup_move_lists(page, LRU_NORECLAIM);
> + }
> +
> + put_page(page); /* drop ref from isolate */
> + return ret; /* ret => "page still locked" */
> +}

<stares for a while>

<penny drops>

So THAT'S what the magical "return 2" is doing in page_file_cache()!

<looks>

OK, after all the patches are applied, the "2" becomes LRU_FILE and the
enumeration of `enum lru_list' reflects that.
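
A user-space sketch of that index arithmetic. The enum layout below is
an assumption, reconstructed from the "2 becomes LRU_FILE" remark and
the old "lru = LRU_FILE * !!file + !!active" line; the model_* helpers
are hypothetical stand-ins for the kernel functions:

```c
#include <assert.h>

/* Assumed layout: bit 0 selects active, LRU_FILE (2) selects file. */
#define LRU_BASE   0
#define LRU_ACTIVE 1
#define LRU_FILE   2

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
	LRU_NORECLAIM,
	NR_LRU_LISTS
};

/* Stand-in for page_file_cache(): LRU_FILE for file pages, else 0. */
static int model_page_file_cache(int is_file)
{
	return is_file ? LRU_FILE : 0;
}

/* Same arithmetic as putback_lru_page(): active bit plus file offset. */
static enum lru_list model_lru_index(int is_active, int is_file)
{
	int lru = !!is_active;

	lru += model_page_file_cache(is_file);
	assert(lru < LRU_NORECLAIM);	/* never yields the noreclaim list */
	return (enum lru_list)lru;
}
```

With this layout the two-step computation in putback_lru_page() and the
single expression "LRU_FILE * !!file + !!active" select the same list.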

> +/*
> + * Cull page that shrink_*_list() has detected to be non-reclaimable
> + * under page lock to close races with other tasks that might be making
> + * the page reclaimable. Avoid stranding a reclaimable page on the
> + * noreclaim list.
> + */
> +static inline void cull_nonreclaimable_page(struct page *page)
> +{
> + lock_page(page);
> + if (putback_lru_page(page))
> + unlock_page(page);
> +}

Again, the terminology is quite overloaded and confusing. What does
"non-reclaimable" mean in this context? _Any_ page which was dirty or
which had an elevated refcount? Surely not referenced pages, which the
scanner also can treat as non-reclaimable.

Did you check whether all these inlined functions really should have
been inlined? Even ones like this are probably too large.

> /*
> * shrink_page_list() returns the number of reclaimed pages
> */
>
> ...
>
> @@ -647,6 +721,14 @@ int __isolate_lru_page(struct page *page
> if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
> return ret;
>
> + /*
> + * Non-reclaimable pages shouldn't make it onto either the active
> + * or the inactive list. However, when doing lumpy reclaim of
> + * higher order pages we can still run into them.

I guess that something along the lines of "when this function is being
called for lumpy reclaim we can still .." would be clearer.

> + */
> + if (PageNoreclaim(page))
> + return ret;
> +
> ret = -EBUSY;
> if (likely(get_page_unless_zero(page))) {
> /*

2008-06-08 20:34:44

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Fri, 6 Jun 2008 18:05:06 -0700
Andrew Morton <[email protected]> wrote:
> On Fri, 06 Jun 2008 16:28:51 -0400
> Rik van Riel <[email protected]> wrote:
>
> >
> > From: Lee Schermerhorn <[email protected]>

> > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > [CONFIG_]NORECLAIM_LRU.
>
> Having a config option for this really sucks, and needs extra-special
> justification, rather than none.

I believe the justification is that it uses a page flag.

PG_noreclaim would be the 20th page flag used, meaning there are
4 more free if 8 bits are used for zone and node info, which would
give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
for 32 bit x86.
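
The arithmetic behind those numbers, as a small sketch. The constants
are the ones quoted above; the macro names are made up for illustration,
the 2-bit zone field is inferred as 8 - 6, and any sparsemem section
bits are ignored:

```c
#include <assert.h>

#define BITS_PER_FLAGS_WORD 32	/* width of page->flags on 32-bit */
#define NR_PAGEFLAGS_USED   20	/* PG_noreclaim would be the 20th */
#define ZONE_NODE_BITS       8	/* zone + node field */
#define NODE_SHIFT_BITS      6	/* node part of that field */

/* Flag bits still unallocated after the flags and the zone/node field. */
static int free_flag_bits(void)
{
	int bits = BITS_PER_FLAGS_WORD - NR_PAGEFLAGS_USED - ZONE_NODE_BITS;

	assert(bits >= 0);
	return bits;
}

/* Bits left for the zone index inside the zone/node field. */
static int zone_bits(void)
{
	return ZONE_NODE_BITS - NODE_SHIFT_BITS;
}

/* Number of NUMA nodes addressable with NODE_SHIFT_BITS. */
static int max_numa_nodes(void)
{
	return 1 << NODE_SHIFT_BITS;
}
```

That is 32 - 20 - 8 = 4 spare bits, and 2^6 = 64 nodes.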

If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
just compile in always.

Please let me know what your preference is.

> > --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> > +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> > @@ -94,6 +94,9 @@ enum pageflags {
> > PG_reclaim, /* To be reclaimed asap */
> > PG_buddy, /* Page is free, on buddy lists */
> > PG_swapbacked, /* Page is backed by RAM/swap */
> > +#ifdef CONFIG_NORECLAIM_LRU
> > + PG_noreclaim, /* Page is "non-reclaimable" */
> > +#endif
>
> I fear that we're messing up the terminology here.
>
> Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> already means a few different things, but in the vmscan context,
> "reclaimable" means that the page is unreferenced, clean and can be
> stolen. "reclaimable" also means a lot of other things, and we just
> made that worse.
>
> Can we think of a new term which uniquely describes this new concept
> and use that, rather than flogging the old horse?

Want to reuse the BSD term "pinned" instead?

> > +/**
> > + * add_page_to_noreclaim_list
> > + * @page: the page to be added to the noreclaim list
> > + *
> > + * Add page directly to its zone's noreclaim list. To avoid races with
> > + * tasks that might be making the page reclaimable while it's not on the
> > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > + * to other tasks. This is difficult to do when using the pagevec cache,
> > + * so bypass that.
> > + */
>
> How does a task "make a page reclaimable"? munlock()? fsync()?
> exit()?
>
> Choice of terminology matters...

Lee? Kosaki-san?


--
All rights reversed.

2008-06-08 20:58:32

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 16:34:13 -0400 Rik van Riel <[email protected]> wrote:

> On Fri, 6 Jun 2008 18:05:06 -0700
> Andrew Morton <[email protected]> wrote:
> > On Fri, 06 Jun 2008 16:28:51 -0400
> > Rik van Riel <[email protected]> wrote:
> >
> > >
> > > From: Lee Schermerhorn <[email protected]>
>
> > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > [CONFIG_]NORECLAIM_LRU.
> >
> > Having a config option for this really sucks, and needs extra-special
> > justification, rather than none.
>
> I believe the justification is that it uses a page flag.
>
> PG_noreclaim would be the 20th page flag used, meaning there are
> 4 more free if 8 bits are used for zone and node info, which would
> give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> for 32 bit x86.
>
> If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> just compile in always.

Seems unlikely to be useful? The only way in which this would be an
advantage is if we have some other feature which also needs a page flag
but which will never be concurrently enabled with this one.

> Please let me know what your preference is.

Don't use another page flag?

> > > --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> > > +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> > > @@ -94,6 +94,9 @@ enum pageflags {
> > > PG_reclaim, /* To be reclaimed asap */
> > > PG_buddy, /* Page is free, on buddy lists */
> > > PG_swapbacked, /* Page is backed by RAM/swap */
> > > +#ifdef CONFIG_NORECLAIM_LRU
> > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > +#endif
> >
> > I fear that we're messing up the terminology here.
> >
> > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > already means a few different things, but in the vmscan context,
> > "reclaimable" means that the page is unreferenced, clean and can be
> > stolen. "reclaimable" also means a lot of other things, and we just
> > made that worse.
> >
> > Can we think of a new term which uniquely describes this new concept
> > and use that, rather than flogging the old horse?
>
> Want to reuse the BSD term "pinned" instead?

mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
from being reclaimed".

As a starting point: what, in your english-language-paragraph-length
words, does this flag mean?

> > > +/**
> > > + * add_page_to_noreclaim_list
> > > + * @page: the page to be added to the noreclaim list
> > > + *
> > > + * Add page directly to its zone's noreclaim list. To avoid races with
> > > + * tasks that might be making the page reclaimable while it's not on the
> > > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > > + * to other tasks. This is difficult to do when using the pagevec cache,
> > > + * so bypass that.
> > > + */
> >
> > How does a task "make a page reclaimable"? munlock()? fsync()?
> > exit()?
> >
> > Choice of terminology matters...
>
> Lee? Kosaki-san?

2008-06-08 21:07:25

by KOSAKI Motohiro

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

>> > +#ifdef CONFIG_NORECLAIM_LRU
>> > + PG_noreclaim, /* Page is "non-reclaimable" */
>> > +#endif
>>
>> I fear that we're messing up the terminology here.
>>
>> Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
>> already means a few different things, but in the vmscan context,
>> "reclaimable" means that the page is unreferenced, clean and can be
>> stolen. "reclaimable" also means a lot of other things, and we just
>> made that worse.
>>
>> Can we think of a new term which uniquely describes this new concept
>> and use that, rather than flogging the old horse?
>
> Want to reuse the BSD term "pinned" instead?

I like this term :)
But I'm afraid somebody will confuse it with the Xen/KVM notion of a
pinned page. IOW, I guess "pinned page" will make people think of the
flag below.

#define PG_pinned PG_owner_priv_1 /* Xen pinned pagetable */

I have no idea....


>> > +/**
>> > + * add_page_to_noreclaim_list
>> > + * @page: the page to be added to the noreclaim list
>> > + *
>> > + * Add page directly to its zone's noreclaim list. To avoid races with
>> > + * tasks that might be making the page reclaimable while it's not on the
>> > + * lru, we want to add the page while it's locked or otherwise "invisible"
>> > + * to other tasks. This is difficult to do when using the pagevec cache,
>> > + * so bypass that.
>> > + */
>>
>> How does a task "make a page reclaimable"? munlock()? fsync()?
>> exit()?
>>
>> Choice of terminology matters...
>
> Lee? Kosaki-san?

AFAIK, a page moves from the noreclaim list back to a reclaimable list
in the situations below.

mlock'ed page
- all processes that mlocked it exit.
- all processes that mlocked it call munlock().
- the vma the page belongs to vanishes
(e.g. munmap, mmap, remap_file_pages)

SHM_LOCKed page
- shmctl(SHM_UNLOCK) is called.

2008-06-08 21:33:37

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 13:57:04 -0700
Andrew Morton <[email protected]> wrote:

> > > > From: Lee Schermerhorn <[email protected]>
> >
> > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > > [CONFIG_]NORECLAIM_LRU.
> > >
> > > Having a config option for this really sucks, and needs extra-special
> > > justification, rather than none.
> >
> > I believe the justification is that it uses a page flag.
> >
> > PG_noreclaim would be the 20th page flag used, meaning there are
> > 4 more free if 8 bits are used for zone and node info, which would
> > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> > for 32 bit x86.
> >
> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > just compile in always.
>
> Seems unlikely to be useful? The only way in which this would be an
> advantage is if we have some other feature which also needs a page flag
> but which will never be concurrently enabled with this one.
>
> > Please let me know what your preference is.
>
> Don't use another page flag?

I don't see how that would work. We need a way to identify
the status of the page.

> > > > +#ifdef CONFIG_NORECLAIM_LRU
> > > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > > +#endif
> > >
> > > I fear that we're messing up the terminology here.
> > >
> > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > > already means a few different things, but in the vmscan context,
> > > "reclaimable" means that the page is unreferenced, clean and can be
> > > stolen. "reclaimable" also means a lot of other things, and we just
> > > made that worse.
> > >
> > > Can we think of a new term which uniquely describes this new concept
> > > and use that, rather than flogging the old horse?
> >
> > Want to reuse the BSD term "pinned" instead?
>
> mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
> from being reclaimed".
>
> As a starting point: what, in your english-language-paragraph-length
> words, does this flag mean?

"Cannot be reclaimed because someone has it locked in memory
through mlock, or the page belongs to something that cannot
be evicted like ramfs."

--
All rights reversed.

2008-06-08 21:43:42

by Ray Lee

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, Jun 8, 2008 at 2:32 PM, Rik van Riel <[email protected]> wrote:
> On Sun, 8 Jun 2008 13:57:04 -0700
> Andrew Morton <[email protected]> wrote:
>
>> > > > From: Lee Schermerhorn <[email protected]>
>> >
>> > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
>> > > > [CONFIG_]NORECLAIM_LRU.
>> > >
>> > > Having a config option for this really sucks, and needs extra-special
>> > > justification, rather than none.
>> >
>> > I believe the justification is that it uses a page flag.
>> >
>> > PG_noreclaim would be the 20th page flag used, meaning there are
>> > 4 more free if 8 bits are used for zone and node info, which would
>> > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
>> > for 32 bit x86.
>> >
>> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
>> > just compile in always.
>>
>> Seems unlikely to be useful? The only way in which this would be an
>> advantage is if we have some other feature which also needs a page flag
>> but which will never be concurrently enabled with this one.
>>
>> > Please let me know what your preference is.
>>
>> Don't use another page flag?
>
> I don't see how that would work. We need a way to identify
> the status of the page.
>
>> > > > +#ifdef CONFIG_NORECLAIM_LRU
>> > > > + PG_noreclaim, /* Page is "non-reclaimable" */
>> > > > +#endif
>> > >
>> > > I fear that we're messing up the terminology here.
>> > >
>> > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
>> > > already means a few different things, but in the vmscan context,
>> > > "reclaimable" means that the page is unreferenced, clean and can be
>> > > stolen. "reclaimable" also means a lot of other things, and we just
>> > > made that worse.
>> > >
>> > > Can we think of a new term which uniquely describes this new concept
>> > > and use that, rather than flogging the old horse?
>> >
>> > Want to reuse the BSD term "pinned" instead?
>>
>> mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
>> from being reclaimed".
>>
>> As a starting point: what, in your english-language-paragraph-length
>> words, does this flag mean?
>
> "Cannot be reclaimed because someone has it locked in memory
> through mlock, or the page belongs to something that cannot
> be evicted like ramfs."

"Unevictable"

2008-06-08 22:04:28

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 13:57:04 -0700
Andrew Morton <[email protected]> wrote:

> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > just compile in always.
>
> Seems unlikely to be useful? The only way in which this would be an
> advantage is if we have some other feature which also needs a page flag
> but which will never be concurrently enabled with this one.
>
> > Please let me know what your preference is.
>
> Don't use another page flag?

To explain in more detail why we need the page flag:

When we move a page from the active or inactive list onto the
noreclaim list, we need to know what list it was on, in order
to adjust the zone counts for that list (NR_ACTIVE_ANON, etc).

For the same reason, we need to be able to identify whether
a page is already on the noreclaim list, so we can adjust
the statistics for the noreclaim pages, too. We cannot afford
to accidentally move a page onto the noreclaim list twice, or
try to remove it from the noreclaim list twice.

We need to know how many pages of each type there are in
each zone, and we need a way to specify that a page has
just become noreclaim. If a page is sitting in a pagevec
somewhere, and it has just become unreclaimable, we want
that page to end up on the noreclaim list once that
pagevec is flushed.

As far as I can see, this requires a page flag.
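
A toy model of the accounting argument, not kernel code: the structures
and model_* helpers are hypothetical, and the lists are reduced to three
counters. The flag tells us which counter to decrement when a page
leaves its old list, and lets us refuse a second move:

```c
#include <assert.h>

enum lru_list { LRU_INACTIVE, LRU_ACTIVE, LRU_NORECLAIM, NR_LRU_LISTS };

/* Hypothetical page model: two flag bits select one of three lists. */
struct model_page {
	unsigned int active : 1;
	unsigned int noreclaim : 1;
};

struct model_zone {
	long nr[NR_LRU_LISTS];	/* per-list page counts */
};

/* Which list the flags say the page is on. */
static enum lru_list model_page_lru(const struct model_page *page)
{
	if (page->noreclaim)
		return LRU_NORECLAIM;
	return page->active ? LRU_ACTIVE : LRU_INACTIVE;
}

/*
 * Move a page to the noreclaim list.  Returns 1 if it moved, 0 if the
 * flag shows it is already there, so counters are never adjusted twice.
 */
static int model_cull_page(struct model_zone *zone, struct model_page *page)
{
	enum lru_list old;

	if (page->noreclaim)
		return 0;

	old = model_page_lru(page);
	assert(zone->nr[old] > 0);
	zone->nr[old]--;		/* leave the old list */
	page->active = 0;
	page->noreclaim = 1;
	zone->nr[LRU_NORECLAIM]++;	/* join the noreclaim list */
	return 1;
}
```

Without a per-page marker, the second call could not tell that the page
had already been counted on the noreclaim list.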

--
All rights reversed.

2008-06-08 23:23:16

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 17:32:44 -0400 Rik van Riel <[email protected]> wrote:

> On Sun, 8 Jun 2008 13:57:04 -0700
> Andrew Morton <[email protected]> wrote:
>
> > > > > From: Lee Schermerhorn <[email protected]>
> > >
> > > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > > > [CONFIG_]NORECLAIM_LRU.
> > > >
> > > > Having a config option for this really sucks, and needs extra-special
> > > > justification, rather than none.
> > >
> > > I believe the justification is that it uses a page flag.
> > >
> > > PG_noreclaim would be the 20th page flag used, meaning there are
> > > 4 more free if 8 bits are used for zone and node info, which would
> > > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> > > for 32 bit x86.

This feature isn't available on 32-bit cpus is it?

> > > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > > just compile in always.
> >
> > Seems unlikely to be useful? The only way in which this would be an
> > advantage is if we have some other feature which also needs a page flag
> > but which will never be concurrently enabled with this one.

^^this?

> > > Please let me know what your preference is.
> >
> > Don't use another page flag?
>
> I don't see how that would work. We need a way to identify
> the status of the page.

We'll run out one day. Then we will have little choice but to increase
the size of the pageframe.

This is a direct downside of adding more lru lists.

The this-is-64-bit-only problem really sucks, IMO. We still don't know
the reason for that decision. Presumably it was because we've already
run out of page flags? If so, the time for the larger pageframe is
upon us.

> > > > > +#ifdef CONFIG_NORECLAIM_LRU
> > > > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > > > +#endif
> > > >
> > > > I fear that we're messing up the terminology here.
> > > >
> > > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > > > already means a few different things, but in the vmscan context,
> > > > "reclaimable" means that the page is unreferenced, clean and can be
> > > > stolen. "reclaimable" also means a lot of other things, and we just
> > > > made that worse.
> > > >
> > > > Can we think of a new term which uniquely describes this new concept
> > > > and use that, rather than flogging the old horse?
> > >
> > > Want to reuse the BSD term "pinned" instead?
> >
> > mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
> > from being reclaimed".
> >
> > As a starting point: what, in your english-language-paragraph-length
> > words, does this flag mean?
>
> "Cannot be reclaimed because someone has it locked in memory
> through mlock, or the page belongs to something that cannot
> be evicted like ramfs."

Ray's "unevictable" sounds good. It's not a term we've used elsewhere.

It's all a bit arbitrary, but it's just a label which maps onto a
concept and if we all honour that mapping carefully in our code and
writings, VM maintenance becomes that bit easier.

2008-06-08 23:35:27

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 16:22:08 -0700
Andrew Morton <[email protected]> wrote:

> The this-is-64-bit-only problem really sucks, IMO. We still don't know
> the reason for that decision. Presumably it was because we've already
> run out of page flags? If so, the time for the larger pageframe is
> upon us.

32 bit machines are unlikely to have so much memory that they run
into big scalability issues with mlocked memory.

The obvious exception to that are large PAE systems, which run
into other bottlenecks already and will probably hit the wall in
some other way before suffering greatly from the "kswapd is
scanning unevictable pages" problem.

I'll leave it up to you to decide whether you want this feature
64 bit only, or whether you want to use up the page flag on 32
bit systems too.

Please let me know which direction I should take, so I can fix
up the patch set accordingly.

> > > As a starting point: what, in your english-language-paragraph-length
> > > words, does this flag mean?
> >
> > "Cannot be reclaimed because someone has it locked in memory
> > through mlock, or the page belongs to something that cannot
> > be evicted like ramfs."
>
> Ray's "unevictable" sounds good. It's not a term we've used elsewhere.
>
> It's all a bit arbitrary, but it's just a label which maps onto a
> concept and if we all honour that mapping carefully in our code and
> writings, VM maintenance becomes that bit easier.

OK, I'll rename everything to unevictable and will add documentation
to clear up the meaning.

--
All rights reversed.

2008-06-08 23:55:43

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <[email protected]> wrote:

> On Sun, 8 Jun 2008 16:22:08 -0700
> Andrew Morton <[email protected]> wrote:
>
> > The this-is-64-bit-only problem really sucks, IMO. We still don't know
> > the reason for that decision. Presumably it was because we've already
> > run out of page flags? If so, the time for the larger pageframe is
> > upon us.
>
> 32 bit machines are unlikely to have so much memory that they run
> into big scalability issues with mlocked memory.
>
> The obvious exception to that are large PAE systems, which run
> into other bottlenecks already and will probably hit the wall in
> some other way before suffering greatly from the "kswapd is
> scanning unevictable pages" problem.
>
> I'll leave it up to you to decide whether you want this feature
> 64 bit only, or whether you want to use up the page flag on 32
> bit systems too.
>
> Please let me know which direction I should take, so I can fix
> up the patch set accordingly.

I'm getting rather wobbly about all of this.

This is, afair, by far the most intrusive and high-risk change we've
looked at doing since 2.5.x, for small values of x.

I mean, it's taken many years of work to get reclaim into its current
state (and the reduction in reported problems will in part be due to
the quadrupling-odd of memory over that time). And we're now proposing
radical changes which again will take years to sort out, all on behalf
of a small number of workloads upon a minority of 64-bit machines which
themselves are a minority of the Linux base.

And it will take longer to get those problems sorted out if 32-bit
machines aren't even compiling the new code in.

Are all of these changes really justified?

ho hum. Can you remind us what problems this patchset actually
addresses? Preferably in order of seriousness? (The [0/n] description
told us about the implementation but forgot to tell us anything about
what it was fixing). Because I guess we should have a think about
alternative approaches.

2008-06-09 00:56:58

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 16:54:34 -0700
Andrew Morton <[email protected]> wrote:
> On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <[email protected]> wrote:

> > Please let me know which direction I should take, so I can fix
> > up the patch set accordingly.
>
> I'm getting rather wobbly about all of this.
>
> This is, afair, by far the most intrusive and high-risk change we've
> looked at doing since 2.5.x, for small values of x.

Nowhere near as intrusive or risky as eg. the timer changes that went
in a few releases ago.

> I mean, it's taken many years of work to get reclaim into its current
> state (and the reduction in reported problems will in part be due to
> the quadrupling-odd of memory over that time).

Actually, memory is now getting so large that the current code no
longer works right. On machines 16GB and up, we have discovered
really pathetic behaviour by the VM currently upstream.

Things like the VM scanning over the (locked) shared memory segment
over and over and over again, to get at the 1GB of freeable pagecache
memory in the system. Or the system scanning over all anonymous
memory over and over again, despite the fact that there is no more
swap space left.

With heavy anonymous memory workloads, Linux can stall for minutes
once memory runs low and something needs to be swapped out, because
pretty much all memory is anonymous and everything has the referenced
bit set. We have seen systems with 128GB of RAM hang overnight, once
every CPU got wedged in the pageout scanning code. Typically the VM
decides on a first page to swap out in 2-3 minutes though, and then
it will start several gigabytes of swap IO at once...

Definitely not acceptable behaviour.

> And we're now proposing radical changes which again will take years to sort
> out, all on behalf of a small number of workloads upon a minority of 64-bit
> machines which themselves are a minority of the Linux base.

Hardware gets larger. 4 years ago few people cared about systems
with more than 4GB of memory, but nowadays people have that in their
desktops.

> And it will take longer to get those problems sorted out if 32-bit
> machines aren't even compiling the new code in.

32 bit systems will still get the file/anon LRU split. The only
thing that is 64 bit only in the current patch set is keeping the
unevictable pages off of the LRU lists.

This means that balancing between file and anon eviction will be
the same on 32 and 64 bit systems and things should get sorted out
on both systems at the same time.

> Are all of these changes really justified?

People with large Linux servers are experiencing system stalls
of several minutes, or at worst complete livelocks, with the
current VM.

I believe that those issues need to be fixed.

After discussing this for a long time with Larry Woodman,
Lee Schermerhorn and others, I am convinced that they can
not be fixed by putting a bandaid on the current code.

After all, the fundamental problem often is that the file backed
and mem/swap backed pages are on the same LRU.

Think of a case that is becoming more and more common: a database
server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
locked shared memory segment, 30GB of other anonymous memory and
5GB of page cache.

Do you think it is reasonable for the VM to have to scan over
110GB of essentially unevictable memory, just to get at the 5GB
of page cache?

> Because I guess we should have a think about alternative approaches.

We have. We failed to come up with anything that avoids the
problem without actually fixing the fundamental issues.

If you have an idea, please let us know.

Otherwise, please give us a chance to shake things out in -mm.

I will prepare kernel RPMs for Fedora so users in the community can
easily test these patches too, and help find scenarios where these
patches do not perform as well as what the current kernel has.

I have time to track down and fix any issues that people find.

--
All rights reversed.

2008-06-09 02:58:28

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 16:54:34 -0700
Andrew Morton <[email protected]> wrote:

> ho hum. Can you remind us what problems this patchset actually
> addresses? Preferably in order of seriousness?

Here are some other problems that my patch series can easily fix,
because file cache and anon/swap backed pages live on separate
LRUs:

http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity/

http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/

I do not know for sure whether the patch set does fix it yet for
everyone, or whether it needs some more tuning first, but it is
fairly easily fixable by tweaking the relative pressure on both
sets of LRU lists.

No tricks of skipping over one type of pages while scanning, or
treating the referenced bits differently when the moon is in some
particular phase required - one set of lists for each type of
pages, and variable pressure between the two.

--
All rights reversed.

2008-06-09 05:44:52

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 22:58:00 -0400 Rik van Riel <[email protected]> wrote:

> On Sun, 8 Jun 2008 16:54:34 -0700
> Andrew Morton <[email protected]> wrote:
>
> > ho hum. Can you remind us what problems this patchset actually
> > addresses? Preferably in order of seriousness?
>
> Here are some other problems that my patch series can easily fix,
> because file cache and anon/swap backed pages live on separate
> LRUs:
>
> http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity/
>
> http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/

Sorry, but sending us off to look at random bug reports (from people
who didn't report a bug) is not how we discuss or changelog kernel
patches.

It is for good reasons that we like to see an accurate and detailed
analysis of the problems which are being addressed, and a description
of the means by which they were solved.


> I do not know for sure whether the patch set does fix it yet for
> everyone, or whether it needs some more tuning first, but it is
> fairly easily fixable by tweaking the relative pressure on both
> sets of LRU lists.

I expect it will help, yes. On 64-bit systems. It's unclear whether
mlock or SHM_LOCK is part of the issue here - if it is then 32-bit
systems will still be exposed to these things.

I also expect that it will introduce new problems, ones which can take a
very long time to diagnose and fix. Inevitable, but hopefully acceptable,
if the benefit is there.

> No tricks of skipping over one type of pages while scanning, or
> treating the referenced bits differently when the moon is in some
> particular phase required - one set of lists for each type of
> pages, and variable pressure between the two.

For the unevictable pages we have previously considered just taking
them off the LRU and leaving them off - reattach them at
SHM_UNLOCK-time and at munlock()-time (potentially subject to
reexamination of any other vmas which map each page).

I believe that Andrea had code which leaves the anon pages off the LRU
as well, but I forget the details.

2008-06-09 06:11:53

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 20:56:29 -0400 Rik van Riel <[email protected]> wrote:

> On Sun, 8 Jun 2008 16:54:34 -0700
> Andrew Morton <[email protected]> wrote:
> > On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <[email protected]> wrote:
>
> > > Please let me know which direction I should take, so I can fix
> > > up the patch set accordingly.
> >
> > I'm getting rather wobbly about all of this.
> >
> > This is, afair, by far the most intrusive and high-risk change we've
> > looked at doing since 2.5.x, for small values of x.
>
> Nowhere near as intrusive or risky as eg. the timer changes that went
> in a few releases ago.

Well. Intrusiveness doesn't matter much. But no, you're dead wrong -
this stuff is far more risky than timer changes. Because things like
the timer changes are trivial to detect errors in - it either works or
it doesn't.

Whereas reclaim problems can take *years* to identify and are often
very hard for the programmers to understand, reproduce and diagnose.

> > I mean, it's taken many years of work to get reclaim into its current
> > state (and the reduction in reported problems will in part be due to
> > the quadrupling-odd of memory over that time).
>
> Actually, memory is now getting so large that the current code no
> longer works right. On machines 16GB and up, we have discovered
> really pathetic behaviour by the VM currently upstream.
>
> Things like the VM scanning over the (locked) shared memory segment
> over and over and over again, to get at the 1GB of freeable pagecache
> memory in the system.

Earlier discussion about removing these pages from ALL LRUs reached a
quite detailed stage, but nobody seemed to finish any code.

> Or the system scanning over all anonymous
> memory over and over again, despite the fact that there is no more
> swap space left.

We shouldn't rewrite core VM to cater for incorrectly configured
systems.

> With heavy anonymous memory workloads, Linux can stall for minutes
> once memory runs low and something needs to be swapped out, because
> pretty much all memory is anonymous and everything has the referenced
> bit set. We have seen systems with 128GB of RAM hang overnight, once
> every CPU got wedged in the pageout scanning code. Typically the VM
> decides on a first page to swap out in 2-3 minutes though, and then
> it will start several gigabytes of swap IO at once...
>
> Definitely not acceptable behaviour.

I see handwavy non-bug-reports loosely associated with a vast pile of
code and vague expressions of hope that one will fix the other.

Where's the meat in this, Rik? This is engineering.

Do you or do you not have a test case which demonstrates this problem?
It doesn't sound terribly hard. Where are the before-and-after test
results?

> > And we're now proposing radical changes which again will take years to sort
> > out, all on behalf of a small number of workloads upon a minority of 64-bit
> > machines which themselves are a minority of the Linux base.
>
> Hardware gets larger. 4 years ago few people cared about systems
> with more than 4GB of memory, but nowadays people have that in their
> desktops.
>
> > And it will take longer to get those problems sorted out if 32-bit
> > machines aren't even compiling the new code in.
>
> 32 bit systems will still get the file/anon LRU split. The only
> thing that is 64 bit only in the current patch set is keeping the
> unevictable pages off of the LRU lists.
>
> This means that balancing between file and anon eviction will be
> the same on 32 and 64 bit systems and things should get sorted out
> on both systems at the same time.
>
> > Are all of these changes really justified?
>
> People with large Linux servers are experiencing system stalls
> of several minutes, or at worst complete livelocks, with the
> current VM.
>
> I believe that those issues need to be fixed.

I'd love to see hard evidence that they have been. And that doesn't
mean getting palmed off on wikis and random blog pages.

Also, it is incumbent upon us to consider the other design proposals,
such as removing anon pages from the LRUs, removing mlocked pages from
the LRUs.

> After discussing this for a long time with Larry Woodman,
> Lee Schermerhorn and others, I am convinced that they can
> not be fixed by putting a bandaid on the current code.
>
> After all, the fundamental problem often is that the file backed
> and mem/swap backed pages are on the same LRU.

That actually isn't a fundamental problem.

It _becomes_ a problem because we try to treat the two types of pages
differently.

Stupid question: did anyone try setting swappiness=100? What happened?

> Think of a case that is becoming more and more common: a database
> server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
> locked shared memory segment, 30GB of other anonymous memory and
> 5GB of page cache.
>
> Do you think it is reasonable for the VM to have to scan over
> 110GB of essentially unevictable memory, just to get at the 5GB
> of page cache?

Well for starters that system was grossly misconfigured. It is
incumbent upon you, in your design document (that thing we call a
changelog) to justify why the VM design needs to be altered to cater
for such misconfigured systems. It just drives me up the wall having
to engage in a 20-email discussion to be able to squeeze these little
revelations out. Only to have them lost again later.

Secondly, I expect that removal of mlocked pages from the LRU (as was
discussed a year or two ago and perhaps implemented by Andrea) along
with swappiness=100 might get us towards a fix. Don't know.

> > Because I guess we should have a think about alternative approaches.
>
> We have. We failed to come up with anything that avoids the
> problem without actually fixing the fundamental issues.

Unless I missed it, none of your patch descriptions even attempt to
describe these fundamental issues. It's all buried in 20-deep email
threads.

> If you have an idea, please let us know.

I see no fundamental reason why we need to put mlocked or SHM_LOCKED
pages onto a VM LRU at all.

One cause of problems is that we attempt to prioritise anon pages over
file-backed pagecache. And we prioritise mmapped pages, which your patches
don't address, do they? Stopping doing that would, I expect, prevent a
range of these problems. It would introduce others, probably.

> Otherwise, please give us a chance to shake things out in -mm.

-mm isn't a very useful testing place any more, I'm afraid. The
patches would be better off in linux-next, but then they would screw up
all the other pending MM patches, and it's probably a bit early for
getting them into linux-next.

Once I get sections of -mm feeding into linux-next, things will be better.

> I will prepare kernel RPMs for Fedora so users in the community can
> easily test these patches too, and help find scenarios where these
> patches do not perform as well as what the current kernel has.
>
> I have time to track down and fix any issues that people find.

That helps.

2008-06-09 13:44:34

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008 23:10:53 -0700
Andrew Morton <[email protected]> wrote:

> Also, it is incumbent upon us to consider the other design proposals,
> such as removing anon pages from the LRUs, removing mlocked pages from
> the LRUs.

That is certainly an option. We'll still need to keep track of
what kind of page the page is, though, otherwise we won't know
whether or not we can put it back onto the LRU lists at munlock
time.

> > After discussing this for a long time with Larry Woodman,
> > Lee Schermerhorn and others, I am convinced that they can
> > not be fixed by putting a bandaid on the current code.
> >
> > After all, the fundamental problem often is that the file backed
> > and mem/swap backed pages are on the same LRU.
>
> That actually isn't a fundamental problem.
>
> It _becomes_ a problem because we try to treat the two types of pages
> differently.
>
> Stupid question: did anyone try setting swappiness=100? What happened?

The database shared memory segment got swapped out and the
system crawled to a halt.

Swap IO usually is less efficient than page cache IO, because
page cache IO happens in larger chunks and does not involve
a swap-out first and a swap-in later - the data is just read,
which at least halves the disk IO compared to swap.

Readahead tilts the IO cost even more in favor of evicting
page cache pages, vs. swapping something out.

> > Think of a case that is becoming more and more common: a database
> > server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
> > locked shared memory segment, 30GB of other anonymous memory and
> > 5GB of page cache.
> >
> > Do you think it is reasonable for the VM to have to scan over
> > 110GB of essentially unevictable memory, just to get at the 5GB
> > of page cache?
>
> Well for starters that system was grossly misconfigured.

Swapping out the database shared memory segment is not an option,
because it is mlocked. Even if it was an option, swapping it out
would be a bad idea because swap IO is simply less efficient than
page cache IO (see above).

> Secondly, I expect that removal of mlocked pages from the LRU (as was
> discussed a year or two ago and perhaps implemented by Andrea) along
> with swappiness=100 might get us towards a fix. Don't know.

Removing mlocked pages from the LRU can be done, but I suspect
we'll still want to keep track of how many of these pages there
are, right?

> > > Because I guess we should have a think about alternative approaches.
> >
> > We have. We failed to come up with anything that avoids the
> > problem without actually fixing the fundamental issues.
>
> Unless I missed it, none of your patch descriptions even attempt to
> describe these fundamental issues. It's all buried in 20-deep email
> threads.

I'll add more problem descriptions to the next patch submission.
I'm halfway through the patch series, making all the cleanups and
changes you suggested.

> One cause of problems is that we attempt to prioritise anon pages over
> file-backed pagecache. And we prioritise mmapped pages, which your patches
> don't address, do they? Stopping doing that would, I expect, prevent a
> range of these problems. It would introduce others, probably.

Try running a database with swappiness=100 and then doing a
backup of the system simultaneously. The database will end
up being swapped out, which slows down the database, causes
extra IO and ends up slowing down the backup, too.

The backup does not benefit from having its data cached,
since it only reads everything once.

> > Otherwise, please give us a chance to shake things out in -mm.
>
> -mm isn't a very useful testing place any more, I'm afraid.

That's a problem. I can run tests on the VM patches, but you know
as well as I do that the code needs to be shaken out by lots of
users before we can be truly confident in it...

> > I will prepare kernel RPMs for Fedora so users in the community can
> > easily test these patches too, and help find scenarios where these
> > patches do not perform as well as what the current kernel has.
> >
> > I have time to track down and fix any issues that people find.
>
> That helps.

I sure hope so.

I'll send you a cleaned-up patch series soon. Hopefully tonight
or tomorrow.

--
All rights reversed.

2008-06-10 19:17:33

by Christoph Lameter

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Sun, 8 Jun 2008, Andrew Morton wrote:

> And it will take longer to get those problems sorted out if 32-bit
> machines aren't even compiling the new code in.

The problem is going to be less if we depend on
CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
32bit NUMA/sparsemem configs cannot do this due to lack of page flags.

I did the pageflags rework in part because of Rik's project.

> ho hum. Can you remind us what problems this patchset actually
> addresses? Preferably in order of seriousness? (The [0/n] description
> told us about the implementation but forgot to tell us anything about
> what it was fixing). Because I guess we should have a think about
> alternative approaches.

It solves the livelock while reclaiming issues that we see more and more.

There are loads that have lots of unreclaimable pages. These are
frequently and uselessly scanned under memory pressure.

The larger the memory the more problems.

2008-06-10 19:37:31

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
Christoph Lameter <[email protected]> wrote:

> On Sun, 8 Jun 2008, Andrew Morton wrote:
>
> > And it will take longer to get those problems sorted out if 32-bit
> > machines aren't even compiling the new code in.
>
> The problem is going to be less if we depend on
> CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
>
> I did the pageflags rework in part because of Rik's project.

I think your pageflags work freed up a number of bits on 32
bit systems, unless someone compiles a 32 bit system with
support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
nodes (6 bits NODE_SHIFT), in which case we should still
have 24 bits for flags.

Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
a 32 bit system is probably total insanity already. I
suspect very few people compile 32 bit with NUMA at all,
except if it is an architecture that uses DISCONTIGMEM
instead of zones, in which case ZONE_SHIFT is 0, which
will free up space too :)

--
All Rights Reversed

2008-06-10 20:10:18

by Rik van Riel

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Fri, 6 Jun 2008 18:05:06 -0700
Andrew Morton <[email protected]> wrote:

> > +config NORECLAIM_LRU
> > + bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
> > + depends on EXPERIMENTAL && 64BIT
> > + help
> > + Supports tracking of non-reclaimable pages off the [in]active lists
> > + to avoid excessive reclaim overhead on large memory systems. Pages
> > + may be non-reclaimable because: they are locked into memory, they
> > + are anonymous pages for which no swap space exists, or they are anon
> > + pages that are expensive to unmap [long anon_vma "related vma" list.]
>
> Aunt Tillie might be struggling with some of that.

I have now Aunt Tillified the description:

+++ linux-2.6.26-rc5-mm2/mm/Kconfig 2008-06-10 14:56:19.000000000 -0400
@@ -205,3 +205,13 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config UNEVICTABLE_LRU
+	bool "Add LRU list to track non-evictable pages"
+	default y
+	help
+	  Keeps unevictable pages off of the active and inactive pageout
+	  lists, so kswapd will not waste CPU time or have its balancing
+	  algorithms thrown off by scanning these pages. Selecting this
+	  will use one page flag and increase the code size a little,
+	  say Y unless you know what you are doing.

> Can we think of a new term which uniquely describes this new concept
> and use that, rather than flogging the old horse?

I have also switched to "unevictable".

> > +/**
> > + * add_page_to_noreclaim_list
> > + * @page: the page to be added to the noreclaim list
> > + *
> > + * Add page directly to its zone's noreclaim list. To avoid races with
> > + * tasks that might be making the page reclaimable while it's not on the
> > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > + * to other tasks. This is difficult to do when using the pagevec cache,
> > + * so bypass that.
> > + */
>
> How does a task "make a page reclaimable"? munlock()? fsync()?
> exit()?
>
> Choice of terminology matters...

I have added a kerneldoc function description here and
amended the comment to specify the ways in which a task
can make a page evictable.

> > + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
>
> If this ever triggers, you'll wish that it had been coded with two
> separate assertions.

Good catch. I separated these.

> > +/**
> > + * putback_lru_page
> > + * @page to be put back to appropriate lru list

> The kerneldoc function description is missing.

Added this one, as well as a few others that were missing.

> > + } else if (page_reclaimable(page, NULL)) {
> > + /*
> > + * For reclaimable pages, we can use the cache.
> > + * In event of a race, worst case is we end up with a
> > + * non-reclaimable page on [in]active list.
> > + * We know how to handle that.
> > + */
> > + lru += page_file_cache(page);
> > + lru_cache_add_lru(page, lru);
> > + mem_cgroup_move_lists(page, lru);

> <stares for a while>
>
> <penny drops>
>
> So THAT'S what the magical "return 2" is doing in page_file_cache()!
>
> <looks>
>
> OK, after all the patches are applied, the "2" becomes LRU_FILE and the
> enumeration of `enum lru_list' reflects that.

In most places I have turned this into a call to page_lru(page).

> > +static inline void cull_nonreclaimable_page(struct page *page)

> Did you check whether all these inlined functions really should have
> been inlined? Even ones like this are probably too large.

Turned this into just a "static void" and renamed it
to cull_unevictable_page.

> > + /*
> > + * Non-reclaimable pages shouldn't make it onto either the active
> > + * nor the inactive list. However, when doing lumpy reclaim of
> > + * higher order pages we can still run into them.
>
> I guess that something along the lines of "when this function is being
> called for lumpy reclaim we can still .." would be clearer.

+	/*
+	 * When this function is being called for lumpy reclaim, we
+	 * initially look into all LRU pages, active, inactive and
+	 * unreclaimable; only give shrink_page_list evictable pages.
+	 */
+	if (PageUnevictable(page))
+		return ret;

... on to the next patch!

--
All Rights Reversed

2008-06-10 21:34:49

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Tue, 10 Jun 2008 15:37:02 -0400
Rik van Riel <[email protected]> wrote:

> On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> Christoph Lameter <[email protected]> wrote:
>
> > On Sun, 8 Jun 2008, Andrew Morton wrote:
> >
> > > And it will take longer to get those problems sorted out if 32-bit
> > > machines aren't even compiling the new code in.
> >
> > The problem is going to be less if we depend on
> > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> >
> > I did the pageflags rework in part because of Rik's project.
>
> I think your pageflags work freed up a number of bits on 32
> bit systems, unless someone compiles a 32 bit system with
> support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> nodes (6 bits NODE_SHIFT), in which case we should still
> have 24 bits for flags.
>
> Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> a 32 bit system is probably total insanity already. I
> suspect very few people compile 32 bit with NUMA at all,
> except if it is an architecture that uses DISCONTIGMEM
> instead of zones, in which case ZONE_SHIFT is 0, which
> will free up space too :)

Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)

arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
maximum node count is. The default for sh NODES_SHIFT is 3.

2008-06-10 21:48:47

by Andi Kleen

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure


>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)

Actually much more (most 64bit NUMA systems can run 32bit too), it just
doesn't work well because the code is not very good, undertested, many
bugs, weird design and in general 32bit NUMA has a lot of limitations
that don't make it a good idea.

But you don't need to kill it only for this (although imho there are
lots of other good reasons). Just use a different way to look up the
node. Encoding it into the flags is just an optimization.
But a separate hash or similar would also work. It seemed like a good
idea back then.

In fact there's already a hash for this (the pa->node hash) that
can do it. It's just some more instructions and one cache line
more accessed, but since i386 NUMA is a fringe application
that doesn't seem like a big issue.

-Andi

2008-06-10 22:05:49

by Dave Hansen

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Tue, 2008-06-10 at 14:33 -0700, Andrew Morton wrote:
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)

Yeah, IBM sold a couple of these "interesting" 32-bit NUMA machines:

https://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/tips0267.html?Open

I think those maxed out at 8 nodes, ever. But, no distro ever turned
NUMA on for i386, so no one actually depends on it working. We do have
a bunch of systems that we use for testing and so forth. It'd be a
shame to make these suck *too* much. The NUMA-Q is probably also so
intertwined with CONFIG_NUMA that we'd likely never get it running
again.

I'd rather just bloat page->flags on these platforms or move the
sparsemem/zone/node bits elsewhere than kill NUMA support.

-- Dave

2008-06-11 05:11:55

by Paul Mundt

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 15:37:02 -0400
> Rik van Riel <[email protected]> wrote:
>
> > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > Christoph Lameter <[email protected]> wrote:
> >
> > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > >
> > > > And it will take longer to get those problems sorted out if 32-bit
> > > > machines aren't even compiling the new code in.
> > >
> > > The problem is going to be less if we depend on
> > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > >
> > > I did the pageflags rework in part because of Rik's project.
> >
> > I think your pageflags work freed up a number of bits on 32
> > bit systems, unless someone compiles a 32 bit system with
> > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > nodes (6 bits NODE_SHIFT), in which case we should still
> > have 24 bits for flags.
> >
> > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > a 32 bit system is probably total insanity already. I
> > suspect very few people compile 32 bit with NUMA at all,
> > except if it is an architecture that uses DISCONTIGMEM
> > instead of zones, in which case ZONE_SHIFT is 0, which
> > will free up space too :)
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
>
> arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> maximum node count is. The default for sh NODES_SHIFT is 3.

In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
gradually more complex in the SMP cases where we are 3-4 levels deep in
various types of memories that we expose as nodes (ie, 4-8 CPUs with a
dozen different memories or so at various interconnect levels).

As far as testing goes, it's part of the regular build and regression
testing for a number of boards, which we verify on a daily basis
(although admittedly -mm gets far less testing, even though that's where
most of the churn in this area tends to be).

2008-06-11 06:18:00

by Andrew Morton

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Wed, 11 Jun 2008 14:09:15 +0900 Paul Mundt <[email protected]> wrote:

> On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> > On Tue, 10 Jun 2008 15:37:02 -0400
> > Rik van Riel <[email protected]> wrote:
> >
> > > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > > Christoph Lameter <[email protected]> wrote:
> > >
> > > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > > >
> > > > > And it will take longer to get those problems sorted out if 32-bit
> > > > > machines aren't even compiling the new code in.
> > > >
> > > > The problem is going to be less if we depend on
> > > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > > >
> > > > I did the pageflags rework in part because of Rik's project.
> > >
> > > I think your pageflags work freed up a number of bits on 32
> > > bit systems, unless someone compiles a 32 bit system with
> > > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > > nodes (6 bits NODE_SHIFT), in which case we should still
> > > have 24 bits for flags.
> > >
> > > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > > a 32 bit system is probably total insanity already. I
> > > suspect very few people compile 32 bit with NUMA at all,
> > > except if it is an architecture that uses DISCONTIGMEM
> > > instead of zones, in which case ZONE_SHIFT is 0, which
> > > will free up space too :)
> >
> > Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> > it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
> >
> > arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> > maximum node count is. The default for sh NODES_SHIFT is 3.
>
> In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
> gradually more complex in the SMP cases where we are 3-4 levels deep in
> various types of memories that we expose as nodes (ie, 4-8 CPUs with a
> dozen different memories or so at various interconnect levels).

Thanks.

Andi has suggested that we can remove the node-ID encoding from
page.flags on x86 because that info is available elsewhere, although a
bit more slowly.

<looks at page_zone(), wonders whether we care about performance anyway>

There wouldn't be much point in doing that unless we did it for all
32-bit architectures. How much trouble would it cause sh?

> As far as testing goes, it's part of the regular build and regression
> testing for a number of boards, which we verify on a daily basis
> (although admittedly -mm gets far less testing, even though that's where
> most of the churn in this area tends to be).

Oh well, that's what -rc is for :(

It would be good if someone over there could start testing linux-next.
Once I get my act together that will include most-of-mm anyway.

2008-06-11 06:31:58

by Paul Mundt

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Tue, Jun 10, 2008 at 11:16:42PM -0700, Andrew Morton wrote:
> On Wed, 11 Jun 2008 14:09:15 +0900 Paul Mundt <[email protected]> wrote:
> > On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> > > Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> > > it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
> > >
> > > arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> > > maximum node count is. The default for sh NODES_SHIFT is 3.
> >
> > In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
> > gradually more complex in the SMP cases where we are 3-4 levels deep in
> > various types of memories that we expose as nodes (ie, 4-8 CPUs with a
> > dozen different memories or so at various interconnect levels).
>
> Thanks.
>
> Andi has suggested that we can remove the node-ID encoding from
> page.flags on x86 because that info is available elsewhere, although a
> bit more slowly.
>
> <looks at page_zone(), wonders whether we care about performance anyway>
>
> There wouldn't be much point in doing that unless we did it for all
> 32-bit architectures. How much trouble would it cause sh?
>
At first glance I don't think that should be too bad. We only do NUMA
through sparsemem anyways, and we have pretty much no overlap in any of
the ranges, so simply setting NODE_NOT_IN_PAGE_FLAGS should be ok there.
Given the relatively small number of pages we have, the added cost of
page_to_nid() referencing section_to_node_table should still be
tolerable. I'll give it a go and see what the numbers look like.

> > As far as testing goes, it's part of the regular build and regression
> > testing for a number of boards, which we verify on a daily basis
> > (although admittedly -mm gets far less testing, even though that's where
> > most of the churn in this area tends to be).
>
> Oh well, that's what -rc is for :(
>
> It would be good if someone over there could start testing linux-next.
> Once I get my act together that will include most-of-mm anyway.
>
Agreed. This is something we're attempting to add in to our automated
testing at present.

2008-06-11 12:06:55

by Andi Kleen

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure


> Andi has suggested that we can remove the node-ID encoding from
> page.flags on x86 because that info is available elsewhere, although a
> bit more slowly.
>
> <looks at page_zone(), wonders whether we care about performance anyway>

It would be just pfn_to_nid(page_pfn(page)) for 32bit && CONFIG_NUMA.
-sh should have that too.

Only trouble is that it needs some reordering because right now page_pfn
is not defined early enough.

> There wouldn't be much point in doing that unless we did it for all
> 32-bit architectures. How much trouble would it cause sh?

Probably very little from a quick look at the source.

-Andi

2008-06-11 14:10:04

by Andi Kleen

Subject: Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II


After some contemplation I don't think we need to do anything for this.
Just add more page flags. The ifdef jungle in mm.h should handle it already.

#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define NODES_WIDTH NODES_SHIFT
#else
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#error "Vmemmap: No space for nodes field in page flags"
#endif
#define NODES_WIDTH 0
#endif


[btw the vmemmap case could be handled easily too by going through
the zone, but it's not used on 32bit]

and then

#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
#define NODE_NOT_IN_PAGE_FLAGS
#endif


and then

#ifdef NODE_NOT_IN_PAGE_FLAGS
extern int page_to_nid(struct page *page);
#else
static inline int page_to_nid(struct page *page)
{
	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}
#endif

and the sparse.c page_to_nid does a hash lookup.

So if NR_PAGEFLAGS is big enough it should work.

-Andi

2008-06-11 19:04:34

by Andy Whitcroft

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 15:37:02 -0400
> Rik van Riel <[email protected]> wrote:
>
> > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > Christoph Lameter <[email protected]> wrote:
> >
> > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > >
> > > > And it will take longer to get those problems sorted out if 32-bit
> > > > machines aren't even compiling the new code in.
> > >
> > > The problem is going to be less if we depend on
> > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > >
> > > I did the pageflags rework in part because of Rik's project.
> >
> > I think your pageflags work freed up a number of bits on 32
> > bit systems, unless someone compiles a 32 bit system with
> > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > nodes (6 bits NODE_SHIFT), in which case we should still
> > have 24 bits for flags.
> >
> > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > a 32 bit system is probably total insanity already. I
> > suspect very few people compile 32 bit with NUMA at all,
> > except if it is an architecture that uses DISCONTIGMEM
> > instead of zones, in which case ZONE_SHIFT is 0, which
> > will free up space too :)
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
>
> arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> maximum node count is. The default for sh NODES_SHIFT is 3.

Although NUMAQ can have up to 64 NUMA nodes, I don't think there are
any machines with more than 4 nodes left. From the other discussion it
sounds like we have a maximum of 8 nodes on other sub-arches, so it
would not be unreasonable to reduce the shift to 3, which might allow
us to reduce the size of the reserve.

The problem will come with SPARSEMEM as that stores the section number
in the reserved field. Which can mean we need the whole reserve, and
there is currently no simple way to remove that.

I have been wondering whether we could make more use of the dynamic
nature of the page bits. As bits only need to exist when used, we could
consider letting the page flags grow to 64 bits if necessary. However,
at a quick count we are still only using about 19 bits, and if memory
serves we have 23/24 after the reserve on 32 bit.

-apw

2008-06-11 20:53:17

by Andi Kleen

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure


> The problem will come with SPARSEMEM as that stores the section number
> in the reserved field. Which can mean we need the whole reserve, and
> there is currently no simple way to remove that.

Why do you need that many sections on i386?

-Andi

2008-06-11 23:25:17

by Christoph Lameter

Subject: Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure

On Wed, 11 Jun 2008, Andy Whitcroft wrote:

> I think we can say that although NUMAQ can have up to 64 NUMA nodes, in
> fact I don't think we have any more with more than 4 nodes left. From
> the other discussion it sounds like we have a maximum of 8 nodes on
> other sub-arches. So it would not be unreasonable to reduce the shift
> to 3. Which might allow us to reduce the size of the reserve.
>
> The problem will come with SPARSEMEM as that stores the section number
> in the reserved field. Which can mean we need the whole reserve, and
> there is currently no simple way to remove that.

But in that case we can use the section number to look up the node number.
That is done automatically if we have too many page flags.