2013-05-16 20:34:29

by Dave Hansen

Subject: [RFCv2][PATCH 0/5] mm: Batch page reclamation under shrink_page_list

These are an update of Tim Chen's earlier work:

http://lkml.kernel.org/r/1347293960.9977.70.camel@schen9-DESK

I broke the patches up a bit more, and tried to incorporate some
changes based on some feedback from Mel and Andrew.

Changes for v2:
* use the page_mapping() accessor instead of direct access
to page->mapping (direct access could cause crashes when
running into swap cache pages)
* group the batch function's introduction patch with
its first use
* rename a few functions as suggested by Mel
* Ran some single-threaded tests to look for regressions
caused by the batching. If there is overhead, it is only
in the worst-case scenarios, and then only in hundredths of
a percent of CPU time.

If you're curious how effective the batching is, I have a quick
and dirty patch to keep some stats:

https://www.sr71.net/~dave/intel/rmb-stats-only.patch

--

Page reclamation in the shrink_page_list() function takes two
locks on a page-by-page basis. One is the tree_lock protecting
the radix tree of the page's mapping, and the other is
mapping->i_mmap_mutex, which protects the mapped pages. This
set deals only with mapping->tree_lock.
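
To make that concrete, here is a rough before/after sketch of the
tree_lock pattern (simplified pseudo-kernel code, not lifted from
the patches; the real implementation is in patches 3 and 5 below):

/* today: every page pays its own lock/unlock round trip */
spin_lock_irq(&mapping->tree_lock);
__delete_from_page_cache(page);	/* or __delete_from_swap_cache() */
spin_unlock_irq(&mapping->tree_lock);

/* batched: one lock/unlock covers all pages sharing 'mapping' */
spin_lock_irq(&mapping->tree_lock);
list_for_each_entry_safe(page, next, &batch_list, lru)
	__delete_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);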

Tim managed to get a 14% throughput improvement with a workload
putting heavy pressure on the page cache by reading many large
mmapped files simultaneously on an 8-socket Westmere server.

I've been testing these by running large parallel kernel compiles
on systems that are under memory pressure. During development,
I caught quite a few races on smaller setups, and it has been
quite stable on a large (160 logical CPU / 1TB) system.


2013-05-16 20:34:34

by Dave Hansen

Subject: [RFCv2][PATCH 2/5] make 'struct page' and swp_entry_t variants of swapcache_free().


From: Dave Hansen <[email protected]>

swapcache_free() takes two arguments:

void swapcache_free(swp_entry_t entry, struct page *page)

Most of its callers (5/7) are error handling paths that haven't
even instantiated a page, so they pass page=NULL. Both of the
callers that do pass in a 'struct page' create and pass in a
temporary swp_entry_t.

Now that we are deferring clearing page_private() until after
swapcache_free() has been called, we can just create a variant
that takes a 'struct page' and does the temporary variable in
the helper.

That leaves all the other callers doing

swapcache_free(entry, NULL)

so create another helper for them that makes it clear that they
need only pass in a swp_entry_t.

One downside here is that delete_from_swap_cache() now does
an extra swap_address_space() call. But, those are pretty
cheap (just some array index arithmetic).
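
(For reference, swap_address_space() in kernels of this vintage
boils down to an array lookup keyed on the swap type. Sketched
from memory, so check the tree for the exact definition:)

/* sketch of swap_address_space(), from memory */
#define swap_address_space(entry) \
	(&swapper_spaces[swp_type(entry)])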

Signed-off-by: Dave Hansen <[email protected]>
---

linux.git-davehans/drivers/staging/zcache/zcache-main.c | 2 +-
linux.git-davehans/include/linux/swap.h | 3 ++-
linux.git-davehans/mm/shmem.c | 2 +-
linux.git-davehans/mm/swap_state.c | 15 +++++----------
linux.git-davehans/mm/swapfile.c | 15 ++++++++++++++-
linux.git-davehans/mm/vmscan.c | 5 +----
6 files changed, 24 insertions(+), 18 deletions(-)

diff -puN drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants drivers/staging/zcache/zcache-main.c
--- linux.git/drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants 2013-05-16 13:27:24.954149254 -0700
+++ linux.git-davehans/drivers/staging/zcache/zcache-main.c 2013-05-16 13:27:24.966149785 -0700
@@ -961,7 +961,7 @@ static int zcache_get_swap_cache_page(in
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
* clear SWAP_HAS_CACHE flag.
*/
- swapcache_free(entry, NULL);
+ swapcache_free_entry(entry);
/* FIXME: is it possible to get here without err==-ENOMEM?
* If not, we can dispense with the do loop, use goto retry */
} while (err != -ENOMEM);
diff -puN include/linux/swap.h~make-page-and-swp_entry_t-variants include/linux/swap.h
--- linux.git/include/linux/swap.h~make-page-and-swp_entry_t-variants 2013-05-16 13:27:24.956149343 -0700
+++ linux.git-davehans/include/linux/swap.h 2013-05-16 13:27:24.967149829 -0700
@@ -382,7 +382,8 @@ extern void swap_shmem_alloc(swp_entry_t
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free_entry(swp_entry_t entry);
+extern void swapcache_free_page_entry(struct page *page);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
diff -puN mm/shmem.c~make-page-and-swp_entry_t-variants mm/shmem.c
--- linux.git/mm/shmem.c~make-page-and-swp_entry_t-variants 2013-05-16 13:27:24.957149387 -0700
+++ linux.git-davehans/mm/shmem.c 2013-05-16 13:27:24.968149873 -0700
@@ -872,7 +872,7 @@ static int shmem_writepage(struct page *
}

mutex_unlock(&shmem_swaplist_mutex);
- swapcache_free(swap, NULL);
+ swapcache_free_entry(swap);
redirty:
set_page_dirty(page);
if (wbc->for_reclaim)
diff -puN mm/swapfile.c~make-page-and-swp_entry_t-variants mm/swapfile.c
--- linux.git/mm/swapfile.c~make-page-and-swp_entry_t-variants 2013-05-16 13:27:24.959149475 -0700
+++ linux.git-davehans/mm/swapfile.c 2013-05-16 13:27:24.969149917 -0700
@@ -637,7 +637,7 @@ void swap_free(swp_entry_t entry)
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
-void swapcache_free(swp_entry_t entry, struct page *page)
+static void __swapcache_free(swp_entry_t entry, struct page *page)
{
struct swap_info_struct *p;
unsigned char count;
@@ -651,6 +651,19 @@ void swapcache_free(swp_entry_t entry, s
}
}

+void swapcache_free_entry(swp_entry_t entry)
+{
+ __swapcache_free(entry, NULL);
+}
+
+void swapcache_free_page_entry(struct page *page)
+{
+ swp_entry_t entry = { .val = page_private(page) };
+ __swapcache_free(entry, page);
+ set_page_private(page, 0);
+ ClearPageSwapCache(page);
+}
+
/*
* How many references to page are currently swapped out?
* This does not give an exact answer when swap count is continued,
diff -puN mm/swap_state.c~make-page-and-swp_entry_t-variants mm/swap_state.c
--- linux.git/mm/swap_state.c~make-page-and-swp_entry_t-variants 2013-05-16 13:27:24.961149563 -0700
+++ linux.git-davehans/mm/swap_state.c 2013-05-16 13:27:24.970149961 -0700
@@ -172,7 +172,7 @@ int add_to_swap(struct page *page, struc

if (unlikely(PageTransHuge(page)))
if (unlikely(split_huge_page_to_list(page, list))) {
- swapcache_free(entry, NULL);
+ swapcache_free_entry(entry);
return 0;
}

@@ -198,7 +198,7 @@ int add_to_swap(struct page *page, struc
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
* clear SWAP_HAS_CACHE flag.
*/
- swapcache_free(entry, NULL);
+ swapcache_free_entry(entry);
return 0;
}
}
@@ -211,19 +211,14 @@ int add_to_swap(struct page *page, struc
*/
void delete_from_swap_cache(struct page *page)
{
- swp_entry_t entry;
struct address_space *address_space;

- entry.val = page_private(page);
-
- address_space = swap_address_space(entry);
+ address_space = page_mapping(page);
spin_lock_irq(&address_space->tree_lock);
__delete_from_swap_cache(page);
spin_unlock_irq(&address_space->tree_lock);

- swapcache_free(entry, page);
- set_page_private(page, 0);
- ClearPageSwapCache(page);
+ swapcache_free_page_entry(page);
page_cache_release(page);
}

@@ -365,7 +360,7 @@ struct page *read_swap_cache_async(swp_e
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
* clear SWAP_HAS_CACHE flag.
*/
- swapcache_free(entry, NULL);
+ swapcache_free_entry(entry);
} while (err != -ENOMEM);

if (new_page)
diff -puN mm/vmscan.c~make-page-and-swp_entry_t-variants mm/vmscan.c
--- linux.git/mm/vmscan.c~make-page-and-swp_entry_t-variants 2013-05-16 13:27:24.963149651 -0700
+++ linux.git-davehans/mm/vmscan.c 2013-05-16 13:27:24.971150005 -0700
@@ -490,12 +490,9 @@ static int __remove_mapping(struct addre
}

if (PageSwapCache(page)) {
- swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
spin_unlock_irq(&mapping->tree_lock);
- swapcache_free(swap, page);
- set_page_private(page, 0);
- ClearPageSwapCache(page);
+ swapcache_free_page_entry(page);
} else {
void (*freepage)(struct page *);

_

2013-05-16 20:34:37

by Dave Hansen

Subject: [RFCv2][PATCH 4/5] break out mapping "freepage" code


From: Dave Hansen <[email protected]>

__remove_mapping() only deals with pages with mappings, meaning
page cache and swap cache.

At this point, the page has been removed from the mapping's radix
tree, and we need to ensure that any fs-specific (or swap-
specific) resources are freed up.

We will be using this function from a second location in a
following patch.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---

linux.git-davehans/mm/vmscan.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)

diff -puN mm/vmscan.c~free_mapping_page mm/vmscan.c
--- linux.git/mm/vmscan.c~free_mapping_page 2013-05-16 13:27:25.520174273 -0700
+++ linux.git-davehans/mm/vmscan.c 2013-05-16 13:27:25.525174493 -0700
@@ -497,6 +497,24 @@ static int __remove_mapping(struct addre
return 1;
}

+/*
+ * Release any resources the mapping had tied up in
+ * the page.
+ */
+static void mapping_release_page(struct address_space *mapping,
+ struct page *page)
+{
+ if (PageSwapCache(page)) {
+ swapcache_free_page_entry(page);
+ } else {
+ void (*freepage)(struct page *);
+ freepage = mapping->a_ops->freepage;
+ mem_cgroup_uncharge_cache_page(page);
+ if (freepage != NULL)
+ freepage(page);
+ }
+}
+
static int lock_remove_mapping(struct address_space *mapping, struct page *page)
{
int ret;
@@ -510,15 +528,7 @@ static int lock_remove_mapping(struct ad
if (!ret)
return 0;

- if (PageSwapCache(page)) {
- swapcache_free_page_entry(page);
- } else {
- void (*freepage)(struct page *);
- freepage = mapping->a_ops->freepage;
- mem_cgroup_uncharge_cache_page(page);
- if (freepage != NULL)
- freepage(page);
- }
+ mapping_release_page(mapping, page);
return ret;
}

_

2013-05-16 20:35:00

by Dave Hansen

Subject: [RFCv2][PATCH 3/5] break up __remove_mapping()


From: Dave Hansen <[email protected]>

Our goal here is to eventually reduce the number of repetitive
acquire/release operations on mapping->tree_lock.

Logically, this patch has two steps:
1. rename __remove_mapping() to lock_remove_mapping(), since
"__" usually means "this is the unlocked version".
2. Recreate __remove_mapping() to _be_ lock_remove_mapping(),
but without the locks.

I think this actually makes the code flow around the locking
_much_ more straightforward, since the locking just becomes:

spin_lock_irq(&mapping->tree_lock);
ret = __remove_mapping(mapping, page);
spin_unlock_irq(&mapping->tree_lock);

One non-obvious part of this patch: the

freepage = mapping->a_ops->freepage;

used to happen under the mapping->tree_lock, but this patch
moves it outside of the lock. All of the other
a_ops->freepage users do it outside the lock, and we only
assign it when we create inodes, so that makes it safe.
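
As a purely illustrative sketch (the filesystem and function names
here are made up), the reason this is safe is that ->freepage lives
in a static ops table that a mapping is pointed at once, at inode
creation, and never reassigned:

/* hypothetical filesystem, for illustration only */
static void foo_freepage(struct page *page)
{
	/* release fs-private state attached to the page */
}

static const struct address_space_operations foo_aops = {
	.freepage	= foo_freepage,
};

/* at inode creation time: */
inode->i_mapping->a_ops = &foo_aops;
/* ->a_ops (and thus ->freepage) never changes after this, so
 * reading the pointer without tree_lock is fine. */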

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Mel Gorman <[email protected]>

---

linux.git-davehans/mm/vmscan.c | 43 ++++++++++++++++++++++++-----------------
1 file changed, 26 insertions(+), 17 deletions(-)

diff -puN mm/vmscan.c~make-remove-mapping-without-locks mm/vmscan.c
--- linux.git/mm/vmscan.c~make-remove-mapping-without-locks 2013-05-16 13:27:25.268163133 -0700
+++ linux.git-davehans/mm/vmscan.c 2013-05-16 13:27:25.273163355 -0700
@@ -450,12 +450,12 @@ static pageout_t pageout(struct page *pa
* Same as remove_mapping, but if the page is removed from the mapping, it
* gets returned with a refcount of 0.
*/
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping,
+ struct page *page)
{
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));

- spin_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
@@ -482,35 +482,44 @@ static int __remove_mapping(struct addre
* and thus under tree_lock, then this ordering is not required.
*/
if (!page_freeze_refs(page, 2))
- goto cannot_free;
+ return 0;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page))) {
page_unfreeze_refs(page, 2);
- goto cannot_free;
+ return 0;
}

if (PageSwapCache(page)) {
__delete_from_swap_cache(page);
- spin_unlock_irq(&mapping->tree_lock);
+ } else {
+ __delete_from_page_cache(page);
+ }
+ return 1;
+}
+
+static int lock_remove_mapping(struct address_space *mapping, struct page *page)
+{
+ int ret;
+ BUG_ON(!PageLocked(page));
+
+ spin_lock_irq(&mapping->tree_lock);
+ ret = __remove_mapping(mapping, page);
+ spin_unlock_irq(&mapping->tree_lock);
+
+ /* unable to free */
+ if (!ret)
+ return 0;
+
+ if (PageSwapCache(page)) {
swapcache_free_page_entry(page);
} else {
void (*freepage)(struct page *);
-
freepage = mapping->a_ops->freepage;
-
- __delete_from_page_cache(page);
- spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
-
if (freepage != NULL)
freepage(page);
}
-
- return 1;
-
-cannot_free:
- spin_unlock_irq(&mapping->tree_lock);
- return 0;
+ return ret;
}

/*
@@ -521,7 +530,7 @@ cannot_free:
*/
int remove_mapping(struct address_space *mapping, struct page *page)
{
- if (__remove_mapping(mapping, page)) {
+ if (lock_remove_mapping(mapping, page)) {
/*
* Unfreezing the refcount with 1 rather than 2 effectively
* drops the pagecache ref for us without requiring another
_

2013-05-16 20:35:07

by Dave Hansen

Subject: [RFCv2][PATCH 5/5] batch shrink_page_list() locking operations


From: Dave Hansen <[email protected]>
changes for v2:
* remove batch_has_same_mapping() helper. A local variable makes
the check cheaper and cleaner
* Move batch draining later to where we already know
page_mapping(). This probably fixes a truncation race anyway
* rename batch_for_mapping_removal -> batch_for_mapping_rm. It
caused a line over 80 chars and needed shortening anyway.
* Note: we only set 'batch_mapping' when there are pages in the
batch_for_mapping_rm list

--

We batch like this so that several pages can be freed with a
single mapping->tree_lock acquisition/release pair. This reduces
the number of atomic operations and ensures that we do not bounce
cachelines around.

Tim Chen's earlier version of these patches just unconditionally
created large batches of pages, even if they did not share a
page_mapping(). This is a bit suboptimal for a few reasons:
1. if we cannot consolidate lock acquisitions, it makes little
sense to batch
2. The page locks are held for long periods of time, so we only
want to do this when we are sure that we will gain a
substantial throughput improvement because we pay a latency
cost by holding the locks.

This patch makes sure to only batch when all the pages on
'batch_for_mapping_rm' continue to share a page_mapping().
This only happens in practice in cases where pages in the same
file are close to each other on the LRU. That seems like a
reasonable assumption.

In a 128MB virtual machine doing kernel compiles, the average
batch size when calling __remove_mapping_batch() is around 5,
so this does seem to do some good in practice.

On a 160-cpu system doing kernel compiles, I still saw an
average batch length of about 2.8. One promising sign:
as the memory pressure went up, the average batch size
seemed to get larger.

It has shown some substantial performance benefits on
microbenchmarks.

Signed-off-by: Dave Hansen <[email protected]>
---

linux.git-davehans/mm/vmscan.c | 95 +++++++++++++++++++++++++++++++++++++----
1 file changed, 86 insertions(+), 9 deletions(-)

diff -puN mm/vmscan.c~create-remove_mapping_batch mm/vmscan.c
--- linux.git/mm/vmscan.c~create-remove_mapping_batch 2013-05-16 13:27:25.775185544 -0700
+++ linux.git-davehans/mm/vmscan.c 2013-05-16 13:27:25.779185721 -0700
@@ -552,6 +552,61 @@ int remove_mapping(struct address_space
return 0;
}

+/*
+ * pages come in here (via remove_list) locked and leave unlocked
+ * (on either ret_pages or free_pages)
+ *
+ * We do this batching so that we free batches of pages with a
+ * single mapping->tree_lock acquisition/release. This optimization
+ * only makes sense when the pages on remove_list all share a
+ * page_mapping(). If this is violated you will BUG_ON().
+ */
+static int __remove_mapping_batch(struct list_head *remove_list,
+ struct list_head *ret_pages,
+ struct list_head *free_pages)
+{
+ int nr_reclaimed = 0;
+ struct address_space *mapping;
+ struct page *page;
+ LIST_HEAD(need_free_mapping);
+
+ if (list_empty(remove_list))
+ return 0;
+
+ mapping = page_mapping(lru_to_page(remove_list));
+ spin_lock_irq(&mapping->tree_lock);
+ while (!list_empty(remove_list)) {
+ page = lru_to_page(remove_list);
+ BUG_ON(!PageLocked(page));
+ BUG_ON(page_mapping(page) != mapping);
+ list_del(&page->lru);
+
+ if (!__remove_mapping(mapping, page)) {
+ unlock_page(page);
+ list_add(&page->lru, ret_pages);
+ continue;
+ }
+ list_add(&page->lru, &need_free_mapping);
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+
+ while (!list_empty(&need_free_mapping)) {
+ page = lru_to_page(&need_free_mapping);
+ list_move(&page->lru, free_pages);
+ mapping_release_page(mapping, page);
+ /*
+ * At this point, we have no other references and there is
+ * no way to pick any more up (removed from LRU, removed
+ * from pagecache). Can use non-atomic bitops now (and
+ * we obviously don't have to worry about waking up a process
+ * waiting on the page lock, because there are no references.
+ */
+ __clear_page_locked(page);
+ nr_reclaimed++;
+ }
+ return nr_reclaimed;
+}
+
/**
* putback_lru_page - put previously isolated page onto appropriate LRU list
* @page: page to be put back to appropriate lru list
@@ -700,6 +755,8 @@ static unsigned long shrink_page_list(st
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
+ LIST_HEAD(batch_for_mapping_rm);
+ struct address_space *batch_mapping = NULL;
int pgactivate = 0;
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
@@ -718,6 +775,7 @@ static unsigned long shrink_page_list(st
cond_resched();

page = lru_to_page(page_list);
+
list_del(&page->lru);

if (!trylock_page(page))
@@ -776,6 +834,10 @@ static unsigned long shrink_page_list(st
nr_writeback++;
goto keep_locked;
}
+ /*
+ * batch_for_mapping_rm could be drained here
+ * if its lock_page()s hurt latency elsewhere.
+ */
wait_on_page_writeback(page);
}

@@ -805,6 +867,18 @@ static unsigned long shrink_page_list(st
}

mapping = page_mapping(page);
+ /*
+ * batching only makes sense when we can save lock
+ * acquisitions, so drain the previously-batched
+ * pages when we move over to a different mapping
+ */
+ if (batch_mapping && (batch_mapping != mapping)) {
+ nr_reclaimed +=
+ __remove_mapping_batch(&batch_for_mapping_rm,
+ &ret_pages,
+ &free_pages);
+ batch_mapping = NULL;
+ }

/*
* The page is mapped into the page tables of one or more
@@ -922,17 +996,18 @@ static unsigned long shrink_page_list(st
}
}

- if (!mapping || !__remove_mapping(mapping, page))
+ if (!mapping)
goto keep_locked;
-
/*
- * At this point, we have no other references and there is
- * no way to pick any more up (removed from LRU, removed
- * from pagecache). Can use non-atomic bitops now (and
- * we obviously don't have to worry about waking up a process
- * waiting on the page lock, because there are no references.
+ * This list contains pages all in the same mapping, but
+ * in effectively random order and we hold lock_page()
+ * on *all* of them. This can potentially cause lock
+ * ordering issues, but the reclaim code only trylocks
+ * them which saves us.
*/
- __clear_page_locked(page);
+ list_add(&page->lru, &batch_for_mapping_rm);
+ batch_mapping = mapping;
+ continue;
free_it:
nr_reclaimed++;

@@ -963,7 +1038,9 @@ keep:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
-
+ nr_reclaimed += __remove_mapping_batch(&batch_for_mapping_rm,
+ &ret_pages,
+ &free_pages);
/*
* Tag a zone as congested if all the dirty pages encountered were
* backed by a congested BDI. In this case, reclaimers should just
_

2013-05-16 20:35:35

by Dave Hansen

Subject: [RFCv2][PATCH 1/5] defer clearing of page_private() for swap cache pages


From: Dave Hansen <[email protected]>

This patch defers the destruction of swapcache-specific data in
'struct page'. This simplifies the code because we do not have
to keep extra copies of the data during the removal of a page
from the swap cache.

There are only two callers of swapcache_free() which actually
pass in a non-NULL 'struct page'. Both of them (__remove_mapping
and delete_from_swap_cache()) create a temporary on-stack
'swp_entry_t' and set entry.val to page_private().

They need to do this since __delete_from_swap_cache() does
set_page_private(page, 0) and destroys the information.

However, I'd like to batch a few of these operations on several
pages together in a new version of __remove_mapping(), and I
would prefer not to have to allocate temporary storage for each
page. The code is pretty ugly, and it's a bit silly to create
these on-stack 'swp_entry_t's when it is so easy to just keep the
information around in 'struct page'.

There should not be any danger in doing this since we are
absolutely on the path of freeing these pages. There is no
turning back, and no other references can be obtained after the
page comes out of the radix tree.

Note: This patch is separate from the next one since it
introduces the behavior change. I've seen issues with this patch
by itself in various forms and I think having it separate like
this aids bisection.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---

linux.git-davehans/mm/swap_state.c | 4 ++--
linux.git-davehans/mm/vmscan.c | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)

diff -puN mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private mm/swap_state.c
--- linux.git/mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private 2013-05-16 13:27:24.686137407 -0700
+++ linux.git-davehans/mm/swap_state.c 2013-05-16 13:27:24.691137629 -0700
@@ -146,8 +146,6 @@ void __delete_from_swap_cache(struct pag
entry.val = page_private(page);
address_space = swap_address_space(entry);
radix_tree_delete(&address_space->page_tree, page_private(page));
- set_page_private(page, 0);
- ClearPageSwapCache(page);
address_space->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
@@ -224,6 +222,8 @@ void delete_from_swap_cache(struct page
spin_unlock_irq(&address_space->tree_lock);

swapcache_free(entry, page);
+ set_page_private(page, 0);
+ ClearPageSwapCache(page);
page_cache_release(page);
}

diff -puN mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private mm/vmscan.c
--- linux.git/mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private 2013-05-16 13:27:24.687137451 -0700
+++ linux.git-davehans/mm/vmscan.c 2013-05-16 13:27:24.692137673 -0700
@@ -494,6 +494,8 @@ static int __remove_mapping(struct addre
__delete_from_swap_cache(page);
spin_unlock_irq(&mapping->tree_lock);
swapcache_free(swap, page);
+ set_page_private(page, 0);
+ ClearPageSwapCache(page);
} else {
void (*freepage)(struct page *);

_

2013-05-17 13:27:21

by Mel Gorman

Subject: Re: [RFCv2][PATCH 2/5] make 'struct page' and swp_entry_t variants of swapcache_free().

On Thu, May 16, 2013 at 01:34:29PM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> swapcache_free() takes two arguments:
>
> void swapcache_free(swp_entry_t entry, struct page *page)
>
> Most of its callers (5/7) are from error handling paths haven't even
> instantiated a page, so they pass page=NULL. Both of the callers
> that call in with a 'struct page' create and pass in a temporary
> swp_entry_t.
>
> Now that we are deferring clearing page_private() until after
> swapcache_free() has been called, we can just create a variant
> that takes a 'struct page' and does the temporary variable in
> the helper.
>
> That leaves all the other callers doing
>
> swapcache_free(entry, NULL)
>
> so create another helper for them that makes it clear that they
> need only pass in a swp_entry_t.
>
> One downside here is that delete_from_swap_cache() now does
> an extra swap_address_space() call. But, those are pretty
> cheap (just some array index arithmetic).
>
> Signed-off-by: Dave Hansen <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2013-05-17 13:35:32

by Mel Gorman

Subject: Re: [RFCv2][PATCH 5/5] batch shrink_page_list() locking operations

On Thu, May 16, 2013 at 01:34:34PM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
> changes for v2:
> * remove batch_has_same_mapping() helper. A local varible makes
> the check cheaper and cleaner
> * Move batch draining later to where we already know
> page_mapping(). This probably fixes a truncation race anyway
> * rename batch_for_mapping_removal -> batch_for_mapping_rm. It
> caused a line over 80 chars and needed shortening anyway.
> * Note: we only set 'batch_mapping' when there are pages in the
> batch_for_mapping_rm list
>
> --
>
> We batch like this so that several pages can be freed with a
> single mapping->tree_lock acquisition/release pair. This reduces
> the number of atomic operations and ensures that we do not bounce
> cachelines around.
>
> Tim Chen's earlier version of these patches just unconditionally
> created large batches of pages, even if they did not share a
> page_mapping(). This is a bit suboptimal for a few reasons:
> 1. if we can not consolidate lock acquisitions, it makes little
> sense to batch
> 2. The page locks are held for long periods of time, so we only
> want to do this when we are sure that we will gain a
> substantial throughput improvement because we pay a latency
> cost by holding the locks.
>
> This patch makes sure to only batch when all the pages on
> 'batch_for_mapping_rm' continue to share a page_mapping().
> This only happens in practice in cases where pages in the same
> file are close to each other on the LRU. That seems like a
> reasonable assumption.
>
> In a 128MB virtual machine doing kernel compiles, the average
> batch size when calling __remove_mapping_batch() is around 5,
> so this does seem to do some good in practice.
>
> On a 160-cpu system doing kernel compiles, I still saw an
> average batch length of about 2.8. One promising feature:
> as the memory pressure went up, the average batches seem to
> have gotten larger.
>
> It has shown some substantial performance benefits on
> microbenchmarks.
>
> Signed-off-by: Dave Hansen <[email protected]>
>
> <SNIP>
>
> @@ -718,6 +775,7 @@ static unsigned long shrink_page_list(st
> cond_resched();
>
> page = lru_to_page(page_list);
> +
> list_del(&page->lru);
>
> if (!trylock_page(page))

Can drop this hunk :/

> @@ -776,6 +834,10 @@ static unsigned long shrink_page_list(st
> nr_writeback++;
> goto keep_locked;
> }
> + /*
> + * batch_for_mapping_rm could be drained here
> + * if its lock_page()s hurt latency elsewhere.
> + */
> wait_on_page_writeback(page);
> }
>
> @@ -805,6 +867,18 @@ static unsigned long shrink_page_list(st
> }
>
> mapping = page_mapping(page);
> + /*
> + * batching only makes sense when we can save lock
> + * acquisitions, so drain the previously-batched
> + * pages when we move over to a different mapping
> + */
> + if (batch_mapping && (batch_mapping != mapping)) {
> + nr_reclaimed +=
> + __remove_mapping_batch(&batch_for_mapping_rm,
> + &ret_pages,
> + &free_pages);
> + batch_mapping = NULL;
> + }
>
> /*
> * The page is mapped into the page tables of one or more

As a heads-up, Andrew picked up a reclaim-related series from me. It
adds a new wait_on_page_writeback() with a revised patch making it a
congestion_wait() inside shrink_page_list. Watch when these two series
are integrated because you almost certainly want to do a follow-up patch
that drains before that congestion_wait too.

Otherwise

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2013-05-20 21:56:22

by Seth Jennings

Subject: Re: [RFCv2][PATCH 0/5] mm: Batch page reclamation under shrink_page_list

On Thu, May 16, 2013 at 01:34:27PM -0700, Dave Hansen wrote:
> These are an update of Tim Chen's earlier work:
>
> http://lkml.kernel.org/r/1347293960.9977.70.camel@schen9-DESK
>
> I broke the patches up a bit more, and tried to incorporate some
> changes based on some feedback from Mel and Andrew.
>
> Changes for v2:
> * use page_mapping() accessor instead of direct access
> to page->mapping (could cause crashes when running in
> to swap cache pages.
> * group the batch function's introduction patch with
> its first use
> * rename a few functions as suggested by Mel
> * Ran some single-threaded tests to look for regressions
> caused by the batching. If there is overhead, it is only
> in the worst-case scenarios, and then only in hundreths of
> a percent of CPU time.
>
> If you're curious how effective the batching is, I have a quick
> and dirty patch to keep some stats:
>
> https://www.sr71.net/~dave/intel/rmb-stats-only.patch
>

Didn't do any performance comparison, but did a kernel build with 2 make threads
per core in a memory-constrained situation w/ zswap and got an average batch
size of 6.6 pages, with the batch being empty on ~10% of calls.

rmb call: 423464
rmb pages: 2790332
rmb empty: 41408

The WARN_ONCE only gave me one stack for the first empty batch and, for what
it's worth, it was from kswapd.

Tested-by: Seth Jennings <[email protected]>