2013-08-06 06:42:58

by Krzysztof Kozlowski

Subject: [RFC PATCH 0/4] mm: reclaim zbud pages on migration and compaction

Hi,

Currently zbud pages are not movable and cannot be allocated from the CMA
region. This patch set addresses the problem by:
1. Adding a new form of reclaim of zbud pages.
2. Reclaiming zbud pages during migration and compaction.
3. Allocating zbud pages with __GFP_RECLAIMABLE flag.

This reclaim process is different from zbud_reclaim_page(). It acts more
like swapoff() by trying to unuse pages stored in the zbud page and bring
them back to memory. The standard zbud_reclaim_page(), on the other hand,
tries to write them back.

One of the patches introduces a new page flag: PageZbud. This flag is used
in isolate_migratepages_range() to grab zbud pages and pass them on
for reclaim. It could probably be replaced with something
smarter than a flag used in only one case.
Any ideas for a better solution are welcome.

This patch set is based on Linux 3.11-rc4.

TODOs:
1. Replace PageZbud flag with other solution.

Best regards,
Krzysztof Kozlowski


Krzysztof Kozlowski (4):
zbud: use page ref counter for zbud pages
mm: split code for unusing swap entries from try_to_unuse
mm: add zbud flag to page flags
mm: reclaim zbud pages on migration and compaction

include/linux/page-flags.h | 12 ++
include/linux/swapfile.h | 2 +
include/linux/zbud.h | 11 +-
mm/compaction.c | 20 ++-
mm/internal.h | 1 +
mm/page_alloc.c | 9 ++
mm/swapfile.c | 354 +++++++++++++++++++++++---------------------
mm/zbud.c | 301 +++++++++++++++++++++++++------------
mm/zswap.c | 57 ++++++-
9 files changed, 499 insertions(+), 268 deletions(-)

--
1.7.9.5


2013-08-06 06:43:09

by Krzysztof Kozlowski

Subject: [RFC PATCH 1/4] zbud: use page ref counter for zbud pages

Use a page reference counter for zbud pages. The ref counter replaces the
zbud_header.under_reclaim flag and ensures that a zbud page won't be freed
when zbud_free() is called during reclaim. This allows additional reclaim
paths to be implemented.

The page count is incremented when:
- a handle is created and passed to zswap (in zbud_alloc()),
- user-supplied eviction callback is called (in zbud_reclaim_page()).

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Tomasz Stanislawski <[email protected]>
---
mm/zbud.c | 150 +++++++++++++++++++++++++++++++++++--------------------------
1 file changed, 86 insertions(+), 64 deletions(-)

diff --git a/mm/zbud.c b/mm/zbud.c
index ad1e781..a8e986f 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -109,7 +109,6 @@ struct zbud_header {
struct list_head lru;
unsigned int first_chunks;
unsigned int last_chunks;
- bool under_reclaim;
};

/*****************
@@ -138,16 +137,9 @@ static struct zbud_header *init_zbud_page(struct page *page)
zhdr->last_chunks = 0;
INIT_LIST_HEAD(&zhdr->buddy);
INIT_LIST_HEAD(&zhdr->lru);
- zhdr->under_reclaim = 0;
return zhdr;
}

-/* Resets the struct page fields and frees the page */
-static void free_zbud_page(struct zbud_header *zhdr)
-{
- __free_page(virt_to_page(zhdr));
-}
-
/*
* Encodes the handle of a particular buddy within a zbud page
* Pool lock should be held as this function accesses first|last_chunks
@@ -188,6 +180,65 @@ static int num_free_chunks(struct zbud_header *zhdr)
return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks - 1;
}

+/*
+ * Called after zbud_free() or zbud_alloc().
+ * Checks whether a given zbud page has to be:
+ * - removed from the buddied/unbuddied/LRU lists completely (zbud_free),
+ * - moved from the buddied to the unbuddied list
+ * and to the beginning of the LRU (zbud_alloc, zbud_free),
+ * - added to the buddied list and the LRU (zbud_alloc).
+ *
+ * The page must already be removed from the buddied/unbuddied lists.
+ * Must be called under pool->lock.
+ */
+static void rebalance_lists(struct zbud_pool *pool, struct zbud_header *zhdr)
+{
+ if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
+ /* zbud_free() */
+ list_del(&zhdr->lru);
+ return;
+ } else if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
+ /* zbud_free() or zbud_alloc() */
+ int freechunks = num_free_chunks(zhdr);
+ list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
+ } else {
+ /* zbud_alloc() */
+ list_add(&zhdr->buddy, &pool->buddied);
+ }
+ /* Add/move zbud page to beginning of LRU */
+ if (!list_empty(&zhdr->lru))
+ list_del(&zhdr->lru);
+ list_add(&zhdr->lru, &pool->lru);
+}
+
+/*
+ * Increases ref count for zbud page.
+ */
+static void get_zbud_page(struct zbud_header *zhdr)
+{
+ get_page(virt_to_page(zhdr));
+}
+
+/*
+ * Decreases ref count for zbud page and frees the page if it reaches 0
+ * (no external references, e.g. handles).
+ *
+ * Must be called under pool->lock.
+ *
+ * Returns 1 if page was freed and 0 otherwise.
+ */
+static int put_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr)
+{
+ struct page *page = virt_to_page(zhdr);
+ if (put_page_testzero(page)) {
+ free_hot_cold_page(page, 0);
+ pool->pages_nr--;
+ return 1;
+ }
+ return 0;
+}
+
+
/*****************
* API Functions
*****************/
@@ -250,7 +301,7 @@ void zbud_destroy_pool(struct zbud_pool *pool)
int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
unsigned long *handle)
{
- int chunks, i, freechunks;
+ int chunks, i;
struct zbud_header *zhdr = NULL;
enum buddy bud;
struct page *page;
@@ -273,6 +324,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
bud = FIRST;
else
bud = LAST;
+ get_zbud_page(zhdr);
goto found;
}
}
@@ -284,6 +336,10 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
return -ENOMEM;
spin_lock(&pool->lock);
pool->pages_nr++;
+ /*
+ * We will be using zhdr instead of page, so
+ * don't increase the page count.
+ */
zhdr = init_zbud_page(page);
bud = FIRST;

@@ -293,19 +349,7 @@ found:
else
zhdr->last_chunks = chunks;

- if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
- /* Add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
- list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
- } else {
- /* Add to buddied list */
- list_add(&zhdr->buddy, &pool->buddied);
- }
-
- /* Add/move zbud page to beginning of LRU */
- if (!list_empty(&zhdr->lru))
- list_del(&zhdr->lru);
- list_add(&zhdr->lru, &pool->lru);
+ rebalance_lists(pool, zhdr);

*handle = encode_handle(zhdr, bud);
spin_unlock(&pool->lock);
@@ -326,10 +370,10 @@ found:
void zbud_free(struct zbud_pool *pool, unsigned long handle)
{
struct zbud_header *zhdr;
- int freechunks;

spin_lock(&pool->lock);
zhdr = handle_to_zbud_header(handle);
+ BUG_ON(zhdr->last_chunks == 0 && zhdr->first_chunks == 0);

/* If first buddy, handle will be page aligned */
if ((handle - ZHDR_SIZE_ALIGNED) & ~PAGE_MASK)
@@ -337,26 +381,9 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
else
zhdr->first_chunks = 0;

- if (zhdr->under_reclaim) {
- /* zbud page is under reclaim, reclaim will free */
- spin_unlock(&pool->lock);
- return;
- }
-
- /* Remove from existing buddy list */
list_del(&zhdr->buddy);
-
- if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
- /* zbud page is empty, free */
- list_del(&zhdr->lru);
- free_zbud_page(zhdr);
- pool->pages_nr--;
- } else {
- /* Add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
- list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
- }
-
+ rebalance_lists(pool, zhdr);
+ put_zbud_page(pool, zhdr);
spin_unlock(&pool->lock);
}

@@ -400,7 +427,7 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
*/
int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
{
- int i, ret, freechunks;
+ int i, ret;
struct zbud_header *zhdr;
unsigned long first_handle = 0, last_handle = 0;

@@ -411,11 +438,24 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
return -EINVAL;
}
for (i = 0; i < retries; i++) {
+ if (list_empty(&pool->lru)) {
+ /*
+ * LRU was emptied during evict calls in previous
+ * iteration but put_zbud_page() returned 0 meaning
+ * that someone still holds the page. This may
+ * happen when some other mm mechanism increased
+ * the page count.
+ * In such a case we succeeded with reclaim.
+ */
+ return 0;
+ }
zhdr = list_tail_entry(&pool->lru, struct zbud_header, lru);
+ BUG_ON(zhdr->first_chunks == 0 && zhdr->last_chunks == 0);
+ /* Move this last element to beginning of LRU */
list_del(&zhdr->lru);
- list_del(&zhdr->buddy);
+ list_add(&zhdr->lru, &pool->lru);
/* Protect zbud page against free */
- zhdr->under_reclaim = true;
+ get_zbud_page(zhdr);
/*
+ * We need to encode the handles before unlocking, since we can
* race with free that will set (first|last)_chunks to 0
@@ -441,28 +481,10 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
}
next:
spin_lock(&pool->lock);
- zhdr->under_reclaim = false;
- if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
- /*
- * Both buddies are now free, free the zbud page and
- * return success.
- */
- free_zbud_page(zhdr);
- pool->pages_nr--;
+ if (put_zbud_page(pool, zhdr)) {
spin_unlock(&pool->lock);
return 0;
- } else if (zhdr->first_chunks == 0 ||
- zhdr->last_chunks == 0) {
- /* add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
- list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
- } else {
- /* add to buddied list */
- list_add(&zhdr->buddy, &pool->buddied);
}
-
- /* add to beginning of LRU */
- list_add(&zhdr->lru, &pool->lru);
}
spin_unlock(&pool->lock);
return -EAGAIN;
--
1.7.9.5

2013-08-06 06:43:16

by Krzysztof Kozlowski

Subject: [RFC PATCH 2/4] mm: split code for unusing swap entries from try_to_unuse

Move the code for unusing swap entries out of the loop in try_to_unuse()
and into a separate function: try_to_unuse_swp_entry(). Export the new
function in swapfile.h, just like try_to_unuse() is exported.

Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/swapfile.h | 2 +
mm/swapfile.c | 354 ++++++++++++++++++++++++----------------------
2 files changed, 187 insertions(+), 169 deletions(-)

diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index e282624..68c24a7 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -9,5 +9,7 @@ extern spinlock_t swap_lock;
extern struct swap_list_t swap_list;
extern struct swap_info_struct *swap_info[];
extern int try_to_unuse(unsigned int, bool, unsigned long);
+extern int try_to_unuse_swp_entry(struct mm_struct **start_mm,
+ struct swap_info_struct *si, swp_entry_t entry);

#endif /* _LINUX_SWAPFILE_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 36af6ee..331d0b8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1100,6 +1100,189 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
}

/*
+ * Returns:
+ * - negative on error,
+ * - 0 on success (entry unused)
+ */
+int try_to_unuse_swp_entry(struct mm_struct **start_mm,
+ struct swap_info_struct *si, swp_entry_t entry)
+{
+ pgoff_t offset = swp_offset(entry);
+ unsigned char *swap_map;
+ unsigned char swcount;
+ struct page *page;
+ int retval = 0;
+
+ if (signal_pending(current)) {
+ retval = -EINTR;
+ goto out;
+ }
+
+ /*
+ * Get a page for the entry, using the existing swap
+ * cache page if there is one. Otherwise, get a clean
+ * page and read the swap into it.
+ */
+ swap_map = &si->swap_map[offset];
+ page = read_swap_cache_async(entry,
+ GFP_HIGHUSER_MOVABLE, NULL, 0);
+ if (!page) {
+ /*
+ * Either swap_duplicate() failed because entry
+ * has been freed independently, and will not be
+ * reused since sys_swapoff() already disabled
+ * allocation from here, or alloc_page() failed.
+ */
+ if (!*swap_map)
+ retval = 0;
+ else
+ retval = -ENOMEM;
+ goto out;
+ }
+
+ /*
+ * Don't hold on to start_mm if it looks like exiting.
+ */
+ if (atomic_read(&(*start_mm)->mm_users) == 1) {
+ mmput(*start_mm);
+ *start_mm = &init_mm;
+ atomic_inc(&init_mm.mm_users);
+ }
+
+ /*
+ * Wait for and lock page. When do_swap_page races with
+ * try_to_unuse, do_swap_page can handle the fault much
+ * faster than try_to_unuse can locate the entry. This
+ * apparently redundant "wait_on_page_locked" lets try_to_unuse
+ * defer to do_swap_page in such a case - in some tests,
+ * do_swap_page and try_to_unuse repeatedly compete.
+ */
+ wait_on_page_locked(page);
+ wait_on_page_writeback(page);
+ lock_page(page);
+ wait_on_page_writeback(page);
+
+ /*
+ * Remove all references to entry.
+ */
+ swcount = *swap_map;
+ if (swap_count(swcount) == SWAP_MAP_SHMEM) {
+ retval = shmem_unuse(entry, page);
+ VM_BUG_ON(retval > 0);
+ /* page has already been unlocked and released */
+ goto out;
+ }
+ if (swap_count(swcount) && *start_mm != &init_mm)
+ retval = unuse_mm(*start_mm, entry, page);
+
+ if (swap_count(*swap_map)) {
+ int set_start_mm = (*swap_map >= swcount);
+ struct list_head *p = &(*start_mm)->mmlist;
+ struct mm_struct *new_start_mm = *start_mm;
+ struct mm_struct *prev_mm = *start_mm;
+ struct mm_struct *mm;
+
+ atomic_inc(&new_start_mm->mm_users);
+ atomic_inc(&prev_mm->mm_users);
+ spin_lock(&mmlist_lock);
+ while (swap_count(*swap_map) && !retval &&
+ (p = p->next) != &(*start_mm)->mmlist) {
+ mm = list_entry(p, struct mm_struct, mmlist);
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ continue;
+ spin_unlock(&mmlist_lock);
+ mmput(prev_mm);
+ prev_mm = mm;
+
+ cond_resched();
+
+ swcount = *swap_map;
+ if (!swap_count(swcount)) /* any usage ? */
+ ;
+ else if (mm == &init_mm)
+ set_start_mm = 1;
+ else
+ retval = unuse_mm(mm, entry, page);
+
+ if (set_start_mm && *swap_map < swcount) {
+ mmput(new_start_mm);
+ atomic_inc(&mm->mm_users);
+ new_start_mm = mm;
+ set_start_mm = 0;
+ }
+ spin_lock(&mmlist_lock);
+ }
+ spin_unlock(&mmlist_lock);
+ mmput(prev_mm);
+ mmput(*start_mm);
+ *start_mm = new_start_mm;
+ }
+ if (retval) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto out;
+ }
+
+ /*
+ * If a reference remains (rare), we would like to leave
+ * the page in the swap cache; but try_to_unmap could
+ * then re-duplicate the entry once we drop page lock,
+ * so we might loop indefinitely; also, that page could
+ * not be swapped out to other storage meanwhile. So:
+ * delete from cache even if there's another reference,
+ * after ensuring that the data has been saved to disk -
+ * since if the reference remains (rarer), it will be
+ * read from disk into another page. Splitting into two
+ * pages would be incorrect if swap supported "shared
+ * private" pages, but they are handled by tmpfs files.
+ *
+ * Given how unuse_vma() targets one particular offset
+ * in an anon_vma, once the anon_vma has been determined,
+ * this splitting happens to be just what is needed to
+ * handle where KSM pages have been swapped out: re-reading
+ * is unnecessarily slow, but we can fix that later on.
+ */
+ if (swap_count(*swap_map) &&
+ PageDirty(page) && PageSwapCache(page)) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ };
+
+ swap_writepage(page, &wbc);
+ lock_page(page);
+ wait_on_page_writeback(page);
+ }
+
+ /*
+ * It is conceivable that a racing task removed this page from
+ * swap cache just before we acquired the page lock at the top,
+ * or while we dropped it in unuse_mm(). The page might even
+ * be back in swap cache on another swap area: that we must not
+ * delete, since it may not have been written out to swap yet.
+ */
+ if (PageSwapCache(page) &&
+ likely(page_private(page) == entry.val))
+ delete_from_swap_cache(page);
+
+ /*
+ * So we could skip searching mms once swap count went
+ * to 1, we did not mark any present ptes as dirty: must
+ * mark page dirty so shrink_page_list will preserve it.
+ */
+ SetPageDirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * Make sure that we aren't completely killing
+ * interactive performance.
+ */
+ cond_resched();
+out:
+ return retval;
+}
+
+/*
* We completely avoid races by reading each swap page in advance,
* and then search for the process using it. All the necessary
* page table adjustments can then be made atomically.
@@ -1112,10 +1295,6 @@ int try_to_unuse(unsigned int type, bool frontswap,
{
struct swap_info_struct *si = swap_info[type];
struct mm_struct *start_mm;
- unsigned char *swap_map;
- unsigned char swcount;
- struct page *page;
- swp_entry_t entry;
unsigned int i = 0;
int retval = 0;

@@ -1142,172 +1321,9 @@ int try_to_unuse(unsigned int type, bool frontswap,
* there are races when an instance of an entry might be missed.
*/
while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
- if (signal_pending(current)) {
- retval = -EINTR;
- break;
- }
-
- /*
- * Get a page for the entry, using the existing swap
- * cache page if there is one. Otherwise, get a clean
- * page and read the swap into it.
- */
- swap_map = &si->swap_map[i];
- entry = swp_entry(type, i);
- page = read_swap_cache_async(entry,
- GFP_HIGHUSER_MOVABLE, NULL, 0);
- if (!page) {
- /*
- * Either swap_duplicate() failed because entry
- * has been freed independently, and will not be
- * reused since sys_swapoff() already disabled
- * allocation from here, or alloc_page() failed.
- */
- if (!*swap_map)
- continue;
- retval = -ENOMEM;
- break;
- }
-
- /*
- * Don't hold on to start_mm if it looks like exiting.
- */
- if (atomic_read(&start_mm->mm_users) == 1) {
- mmput(start_mm);
- start_mm = &init_mm;
- atomic_inc(&init_mm.mm_users);
- }
-
- /*
- * Wait for and lock page. When do_swap_page races with
- * try_to_unuse, do_swap_page can handle the fault much
- * faster than try_to_unuse can locate the entry. This
- * apparently redundant "wait_on_page_locked" lets try_to_unuse
- * defer to do_swap_page in such a case - in some tests,
- * do_swap_page and try_to_unuse repeatedly compete.
- */
- wait_on_page_locked(page);
- wait_on_page_writeback(page);
- lock_page(page);
- wait_on_page_writeback(page);
-
- /*
- * Remove all references to entry.
- */
- swcount = *swap_map;
- if (swap_count(swcount) == SWAP_MAP_SHMEM) {
- retval = shmem_unuse(entry, page);
- /* page has already been unlocked and released */
- if (retval < 0)
- break;
- continue;
- }
- if (swap_count(swcount) && start_mm != &init_mm)
- retval = unuse_mm(start_mm, entry, page);
-
- if (swap_count(*swap_map)) {
- int set_start_mm = (*swap_map >= swcount);
- struct list_head *p = &start_mm->mmlist;
- struct mm_struct *new_start_mm = start_mm;
- struct mm_struct *prev_mm = start_mm;
- struct mm_struct *mm;
-
- atomic_inc(&new_start_mm->mm_users);
- atomic_inc(&prev_mm->mm_users);
- spin_lock(&mmlist_lock);
- while (swap_count(*swap_map) && !retval &&
- (p = p->next) != &start_mm->mmlist) {
- mm = list_entry(p, struct mm_struct, mmlist);
- if (!atomic_inc_not_zero(&mm->mm_users))
- continue;
- spin_unlock(&mmlist_lock);
- mmput(prev_mm);
- prev_mm = mm;
-
- cond_resched();
-
- swcount = *swap_map;
- if (!swap_count(swcount)) /* any usage ? */
- ;
- else if (mm == &init_mm)
- set_start_mm = 1;
- else
- retval = unuse_mm(mm, entry, page);
-
- if (set_start_mm && *swap_map < swcount) {
- mmput(new_start_mm);
- atomic_inc(&mm->mm_users);
- new_start_mm = mm;
- set_start_mm = 0;
- }
- spin_lock(&mmlist_lock);
- }
- spin_unlock(&mmlist_lock);
- mmput(prev_mm);
- mmput(start_mm);
- start_mm = new_start_mm;
- }
- if (retval) {
- unlock_page(page);
- page_cache_release(page);
+ if (try_to_unuse_swp_entry(&start_mm, si,
+ swp_entry(type, i)) != 0)
break;
- }
-
- /*
- * If a reference remains (rare), we would like to leave
- * the page in the swap cache; but try_to_unmap could
- * then re-duplicate the entry once we drop page lock,
- * so we might loop indefinitely; also, that page could
- * not be swapped out to other storage meanwhile. So:
- * delete from cache even if there's another reference,
- * after ensuring that the data has been saved to disk -
- * since if the reference remains (rarer), it will be
- * read from disk into another page. Splitting into two
- * pages would be incorrect if swap supported "shared
- * private" pages, but they are handled by tmpfs files.
- *
- * Given how unuse_vma() targets one particular offset
- * in an anon_vma, once the anon_vma has been determined,
- * this splitting happens to be just what is needed to
- * handle where KSM pages have been swapped out: re-reading
- * is unnecessarily slow, but we can fix that later on.
- */
- if (swap_count(*swap_map) &&
- PageDirty(page) && PageSwapCache(page)) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_NONE,
- };
-
- swap_writepage(page, &wbc);
- lock_page(page);
- wait_on_page_writeback(page);
- }
-
- /*
- * It is conceivable that a racing task removed this page from
- * swap cache just before we acquired the page lock at the top,
- * or while we dropped it in unuse_mm(). The page might even
- * be back in swap cache on another swap area: that we must not
- * delete, since it may not have been written out to swap yet.
- */
- if (PageSwapCache(page) &&
- likely(page_private(page) == entry.val))
- delete_from_swap_cache(page);
-
- /*
- * So we could skip searching mms once swap count went
- * to 1, we did not mark any present ptes as dirty: must
- * mark page dirty so shrink_page_list will preserve it.
- */
- SetPageDirty(page);
- unlock_page(page);
- page_cache_release(page);
-
- /*
- * Make sure that we aren't completely killing
- * interactive performance.
- */
- cond_resched();
if (frontswap && pages_to_unuse > 0) {
if (!--pages_to_unuse)
break;
--
1.7.9.5

2013-08-06 06:43:13

by Krzysztof Kozlowski

Subject: [RFC PATCH 3/4] mm: add zbud flag to page flags

Add a PageZbud flag to the page flags to distinguish pages allocated by
zbud. Currently these pages do not have any flags set.

Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/page-flags.h | 12 ++++++++++++
mm/page_alloc.c | 3 +++
mm/zbud.c | 4 ++++
3 files changed, 19 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..5b8b61a6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,12 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#ifdef CONFIG_ZBUD
+ /* Allocated by zbud. Flag is necessary to find zbud pages to unuse
+ * during migration/compaction.
+ */
+ PG_zbud,
+#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -275,6 +281,12 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+#ifdef CONFIG_ZBUD
+PAGEFLAG(Zbud, zbud)
+#else
+PAGEFLAG_FALSE(Zbud)
+#endif
+
u64 stable_page_flags(struct page *page);

static inline int PageUptodate(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..1a120fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6345,6 +6345,9 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+#ifdef CONFIG_ZBUD
+ {1UL << PG_zbud, "zbud" },
+#endif
};

static void dump_page_flags(unsigned long flags)
diff --git a/mm/zbud.c b/mm/zbud.c
index a8e986f..a452949 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -230,7 +230,10 @@ static void get_zbud_page(struct zbud_header *zhdr)
static int put_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr)
{
struct page *page = virt_to_page(zhdr);
+ BUG_ON(!PageZbud(page));
+
if (put_page_testzero(page)) {
+ ClearPageZbud(page);
free_hot_cold_page(page, 0);
pool->pages_nr--;
return 1;
@@ -341,6 +344,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
* don't increase the page count.
*/
zhdr = init_zbud_page(page);
+ SetPageZbud(page);
bud = FIRST;

found:
--
1.7.9.5

2013-08-06 06:43:35

by Krzysztof Kozlowski

Subject: [RFC PATCH 4/4] mm: reclaim zbud pages on migration and compaction

Reclaim zbud pages during migration and compaction by unusing the stored
data. This allows adding the __GFP_RECLAIMABLE flag when allocating zbud
pages, so the CMA pool can effectively be used for zswap.

zbud pages are not movable and are not kept on any LRU list (except
zbud's own LRU). The PageZbud flag is used in isolate_migratepages_range()
to grab zbud pages and pass them on for reclaim.

This reclaim process is different from zbud_reclaim_page(). It acts more
like swapoff() by trying to unuse pages stored in the zbud page and bring
them back to memory. The standard zbud_reclaim_page(), on the other hand,
tries to write them back.

Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/zbud.h | 11 +++-
mm/compaction.c | 20 ++++++-
mm/internal.h | 1 +
mm/page_alloc.c | 6 ++
mm/zbud.c | 163 +++++++++++++++++++++++++++++++++++++++-----------
mm/zswap.c | 57 ++++++++++++++++--
6 files changed, 215 insertions(+), 43 deletions(-)

diff --git a/include/linux/zbud.h b/include/linux/zbud.h
index 2571a5c..57ee85d 100644
--- a/include/linux/zbud.h
+++ b/include/linux/zbud.h
@@ -5,8 +5,14 @@

struct zbud_pool;

+/**
+ * Template for functions called during reclaim.
+ */
+typedef int (*evict_page_t)(struct zbud_pool *pool, unsigned long handle);
+
struct zbud_ops {
- int (*evict)(struct zbud_pool *pool, unsigned long handle);
+ evict_page_t evict; /* callback for zbud_reclaim_lru_page() */
+ evict_page_t unuse; /* callback for zbud_reclaim_pages() */
};

struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops);
@@ -14,7 +20,8 @@ void zbud_destroy_pool(struct zbud_pool *pool);
int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
unsigned long *handle);
void zbud_free(struct zbud_pool *pool, unsigned long handle);
-int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries);
+int zbud_reclaim_lru_page(struct zbud_pool *pool, unsigned int retries);
+void zbud_reclaim_pages(struct list_head *zbud_pages);
void *zbud_map(struct zbud_pool *pool, unsigned long handle);
void zbud_unmap(struct zbud_pool *pool, unsigned long handle);
u64 zbud_get_pool_size(struct zbud_pool *pool);
diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..9bbf412 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -16,6 +16,7 @@
#include <linux/sysfs.h>
#include <linux/balloon_compaction.h>
#include <linux/page-isolation.h>
+#include <linux/zbud.h>
#include "internal.h"

#ifdef CONFIG_COMPACTION
@@ -534,6 +535,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
goto next_pageblock;
}

+ if (PageZbud(page)) {
+ /*
+ * Zbud pages are never on an LRU list, so we must
+ * check the Zbud flag before PageLRU() below.
+ */
+ BUG_ON(PageLRU(page));
+ get_page(page);
+ list_add(&page->lru, &cc->zbudpages);
+ continue;
+ }
+
/*
* Check may be lockless but that's ok as we recheck later.
* It's possible to migrate LRU pages and balloon pages
@@ -810,7 +822,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
if (!low_pfn || cc->contended)
return ISOLATE_ABORT;
-
+#ifdef CONFIG_ZBUD
+ if (!list_empty(&cc->zbudpages))
+ zbud_reclaim_pages(&cc->zbudpages);
+#endif
cc->migrate_pfn = low_pfn;

return ISOLATE_SUCCESS;
@@ -1023,11 +1038,13 @@ static unsigned long compact_zone_order(struct zone *zone,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
+ INIT_LIST_HEAD(&cc.zbudpages);

ret = compact_zone(zone, &cc);

VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
+ VM_BUG_ON(!list_empty(&cc.zbudpages));

*contended = cc.contended;
return ret;
@@ -1105,6 +1122,7 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
cc->zone = zone;
INIT_LIST_HEAD(&cc->freepages);
INIT_LIST_HEAD(&cc->migratepages);
+ INIT_LIST_HEAD(&cc->zbudpages);

if (cc->order == -1 || !compaction_deferred(zone, cc->order))
compact_zone(zone, cc);
diff --git a/mm/internal.h b/mm/internal.h
index 4390ac6..eaf5c884 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -119,6 +119,7 @@ struct compact_control {
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
+ struct list_head zbudpages; /* List of pages belonging to zbud */
bool sync; /* Synchronous migration */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
bool finished_update_free; /* True when the zone cached pfns are
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a120fb..e482876 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
#include <linux/page-debug-flags.h>
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>
+#include <linux/zbud.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -6031,6 +6032,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
ret = -EINTR;
break;
}
+#ifdef CONFIG_ZBUD
+ if (!list_empty(&cc.zbudpages))
+ zbud_reclaim_pages(&cc.zbudpages);
+#endif
tries = 0;
} else if (++tries == 5) {
ret = ret < 0 ? ret : -EBUSY;
@@ -6085,6 +6090,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
.ignore_skip_hint = true,
};
INIT_LIST_HEAD(&cc.migratepages);
+ INIT_LIST_HEAD(&cc.zbudpages);

/*
* What we do here is we mark all pageblocks in range as
diff --git a/mm/zbud.c b/mm/zbud.c
index a452949..98a04c8 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -103,12 +103,14 @@ struct zbud_pool {
* @lru: links the zbud page into the lru list in the pool
* @first_chunks: the size of the first buddy in chunks, 0 if free
* @last_chunks: the size of the last buddy in chunks, 0 if free
+ * @pool: pool to which this zbud page belongs
*/
struct zbud_header {
struct list_head buddy;
struct list_head lru;
unsigned int first_chunks;
unsigned int last_chunks;
+ struct zbud_pool *pool;
};

/*****************
@@ -137,6 +139,7 @@ static struct zbud_header *init_zbud_page(struct page *page)
zhdr->last_chunks = 0;
INIT_LIST_HEAD(&zhdr->buddy);
INIT_LIST_HEAD(&zhdr->lru);
+ zhdr->pool = NULL;
return zhdr;
}

@@ -241,7 +244,6 @@ static int put_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr)
return 0;
}

-
/*****************
* API Functions
*****************/
@@ -345,6 +347,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
*/
zhdr = init_zbud_page(page);
SetPageZbud(page);
+ zhdr->pool = pool;
bud = FIRST;

found:
@@ -394,8 +397,57 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
#define list_tail_entry(ptr, type, member) \
list_entry((ptr)->prev, type, member)

+/*
+ * Pool lock must be held when calling this function, and at least
+ * one handle must not be free.
+ * On return the pool lock is still held; however, during execution
+ * it is released for the duration of each evict callback and then
+ * re-acquired.
+ *
+ * Returns 1 if page was freed here, 0 otherwise (still in use)
+ */
+static int do_reclaim(struct zbud_pool *pool, struct zbud_header *zhdr,
+ evict_page_t evict_cb)
+{
+ int ret;
+ unsigned long first_handle = 0, last_handle = 0;
+
+ BUG_ON(zhdr->first_chunks == 0 && zhdr->last_chunks == 0);
+ /* Move this last element to beginning of LRU */
+ list_del(&zhdr->lru);
+ list_add(&zhdr->lru, &pool->lru);
+ /* Protect zbud page against free */
+ get_zbud_page(zhdr);
+ /*
+ * We need to encode the handles before unlocking, since we can
+ * race with free that will set (first|last)_chunks to 0
+ */
+ first_handle = 0;
+ last_handle = 0;
+ if (zhdr->first_chunks)
+ first_handle = encode_handle(zhdr, FIRST);
+ if (zhdr->last_chunks)
+ last_handle = encode_handle(zhdr, LAST);
+ spin_unlock(&pool->lock);
+
+ /* Issue the eviction callback(s) */
+ if (first_handle) {
+ ret = evict_cb(pool, first_handle);
+ if (ret)
+ goto next;
+ }
+ if (last_handle) {
+ ret = evict_cb(pool, last_handle);
+ if (ret)
+ goto next;
+ }
+next:
+ spin_lock(&pool->lock);
+ return put_zbud_page(pool, zhdr);
+}
+
/**
- * zbud_reclaim_page() - evicts allocations from a pool page and frees it
+ * zbud_reclaim_lru_page() - evicts allocations from a pool page and frees it
* @pool: pool from which a page will attempt to be evicted
* @retires: number of pages on the LRU list for which eviction will
* be attempted before failing
@@ -429,11 +481,10 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
* no pages to evict or an eviction handler is not registered, -EAGAIN if
* the retry limit was hit.
*/
-int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
+int zbud_reclaim_lru_page(struct zbud_pool *pool, unsigned int retries)
{
- int i, ret;
+ int i;
struct zbud_header *zhdr;
- unsigned long first_handle = 0, last_handle = 0;

spin_lock(&pool->lock);
if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) ||
@@ -454,44 +505,84 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
return 0;
}
zhdr = list_tail_entry(&pool->lru, struct zbud_header, lru);
- BUG_ON(zhdr->first_chunks == 0 && zhdr->last_chunks == 0);
- /* Move this last element to beginning of LRU */
- list_del(&zhdr->lru);
- list_add(&zhdr->lru, &pool->lru);
- /* Protect zbud page against free */
- get_zbud_page(zhdr);
- /*
- * We need encode the handles before unlocking, since we can
- * race with free that will set (first|last)_chunks to 0
- */
- first_handle = 0;
- last_handle = 0;
- if (zhdr->first_chunks)
- first_handle = encode_handle(zhdr, FIRST);
- if (zhdr->last_chunks)
- last_handle = encode_handle(zhdr, LAST);
- spin_unlock(&pool->lock);
-
- /* Issue the eviction callback(s) */
- if (first_handle) {
- ret = pool->ops->evict(pool, first_handle);
- if (ret)
- goto next;
+ if (do_reclaim(pool, zhdr, pool->ops->evict)) {
+ spin_unlock(&pool->lock);
+ return 0;
}
- if (last_handle) {
- ret = pool->ops->evict(pool, last_handle);
- if (ret)
- goto next;
+ }
+ spin_unlock(&pool->lock);
+ return -EAGAIN;
+}
+
+/**
+ * zbud_reclaim_pages() - reclaims zbud pages by unusing stored pages
+ * @zbud_pages: list of zbud pages to reclaim
+ *
+ * zbud reclaim is different from normal system reclaim in that the reclaim is
+ * done from the bottom, up. This is because only the bottom layer, zbud, has
+ * information on how the allocations are organized within each zbud page. This
+ * has the potential to create interesting locking situations between zbud and
+ * the user, however.
+ *
+ * To avoid these, this is how zbud_reclaim_pages() should be called:
+ *
+ * The user detects that some pages should be reclaimed and calls
+ * zbud_reclaim_pages(), which will remove zbud pages from the pool
+ * LRU list and call the user-defined unuse handler with the pool and
+ * handle as arguments.
+ *
+ * If the handle cannot be unused, the unuse handler should return
+ * non-zero. zbud_reclaim_pages() will add the zbud page back to the
+ * appropriate list and try the next zbud page on the list.
+ *
+ * If the handle is successfully unused, the unuse handler should
+ * return 0. The zbud page will be freed later by the unuse code
+ * (e.g. frontswap_invalidate_page()).
+ *
+ * If all buddies in the zbud page are successfully unused, then the
+ * zbud page can be freed.
+ */
+void zbud_reclaim_pages(struct list_head *zbud_pages)
+{
+ struct page *page;
+ struct page *page2;
+
+ list_for_each_entry_safe(page, page2, zbud_pages, lru) {
+ struct zbud_header *zhdr;
+ struct zbud_pool *pool;
+
+ list_del(&page->lru);
+ if (!PageZbud(page)) {
+ /*
+ * Drop page count from isolate_migratepages_range()
+ */
+ put_page(page);
+ continue;
}
-next:
+ zhdr = page_address(page);
+ BUG_ON(!zhdr->pool);
+ pool = zhdr->pool;
+
spin_lock(&pool->lock);
+ /* Drop page count from isolate_migratepages_range() */
if (put_zbud_page(pool, zhdr)) {
+ /*
+ * zbud_free() could free the handles before acquiring
+ * pool lock above. No need to reclaim.
+ */
spin_unlock(&pool->lock);
- return 0;
+ continue;
+ }
+ if (!pool->ops || !pool->ops->unuse || list_empty(&pool->lru)) {
+ spin_unlock(&pool->lock);
+ continue;
}
+ BUG_ON(!PageZbud(page));
+ do_reclaim(pool, zhdr, pool->ops->unuse);
+ spin_unlock(&pool->lock);
}
- spin_unlock(&pool->lock);
- return -EAGAIN;
}

/**
diff --git a/mm/zswap.c b/mm/zswap.c
index deda2b6..846649b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,9 @@
#include <linux/crypto.h>
#include <linux/mempool.h>
#include <linux/zbud.h>
+#include <linux/swapfile.h>
+#include <linux/mman.h>
+#include <linux/security.h>

#include <linux/mm_types.h>
#include <linux/page-flags.h>
@@ -61,6 +64,8 @@ static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
static u64 zswap_pool_limit_hit;
/* Pages written back when pool limit was reached */
static u64 zswap_written_back_pages;
+/* Pages unused due to reclaim */
+static u64 zswap_unused_pages;
/* Store failed due to a reclaim failure after pool limit was reached */
static u64 zswap_reject_reclaim_fail;
/* Compressed page was too big for the allocator to (optimally) store */
@@ -596,6 +601,47 @@ fail:
return ret;
}

+/**
+ * Tries to unuse swap entries by uncompressing them.
+ * This function is a stripped-down version of swapfile.c::try_to_unuse().
+ *
+ * Returns 0 on success or negative on error.
+ */
+static int zswap_unuse_entry(struct zbud_pool *pool, unsigned long handle)
+{
+ struct zswap_header *zhdr;
+ swp_entry_t swpentry;
+ struct zswap_tree *tree;
+ pgoff_t offset;
+ struct mm_struct *start_mm;
+ struct swap_info_struct *si;
+ int ret;
+
+ /* extract swpentry from data */
+ zhdr = zbud_map(pool, handle);
+ swpentry = zhdr->swpentry; /* here */
+ zbud_unmap(pool, handle);
+ tree = zswap_trees[swp_type(swpentry)];
+ offset = swp_offset(swpentry);
+ BUG_ON(pool != tree->pool);
+
+ /*
+ * We cannot hold swap_lock here, but swap_info may
+ * change (e.g. due to swapoff). In the case of swapoff,
+ * check for SWP_WRITEOK.
+ */
+ si = swap_info[swp_type(swpentry)];
+ if (!(si->flags & SWP_WRITEOK))
+ return -ECANCELED;
+
+ start_mm = &init_mm;
+ atomic_inc(&init_mm.mm_users);
+ ret = try_to_unuse_swp_entry(&start_mm, si, swpentry);
+ mmput(start_mm);
+ zswap_unused_pages++;
+ return ret;
+}
+
/*********************************
* frontswap hooks
**********************************/
@@ -620,7 +666,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
/* reclaim space if needed */
if (zswap_is_full()) {
zswap_pool_limit_hit++;
- if (zbud_reclaim_page(tree->pool, 8)) {
+ if (zbud_reclaim_lru_page(tree->pool, 8)) {
zswap_reject_reclaim_fail++;
ret = -ENOMEM;
goto reject;
@@ -647,8 +693,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,

/* store */
len = dlen + sizeof(struct zswap_header);
- ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN,
- &handle);
+ ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN |
+ __GFP_RECLAIMABLE, &handle);
if (ret == -ENOSPC) {
zswap_reject_compress_poor++;
goto freepage;
@@ -819,7 +865,8 @@ static void zswap_frontswap_invalidate_area(unsigned type)
}

static struct zbud_ops zswap_zbud_ops = {
- .evict = zswap_writeback_entry
+ .evict = zswap_writeback_entry,
+ .unuse = zswap_unuse_entry
};

static void zswap_frontswap_init(unsigned type)
@@ -880,6 +927,8 @@ static int __init zswap_debugfs_init(void)
zswap_debugfs_root, &zswap_reject_compress_poor);
debugfs_create_u64("written_back_pages", S_IRUGO,
zswap_debugfs_root, &zswap_written_back_pages);
+ debugfs_create_u64("unused_pages", S_IRUGO,
+ zswap_debugfs_root, &zswap_unused_pages);
debugfs_create_u64("duplicate_entry", S_IRUGO,
zswap_debugfs_root, &zswap_duplicate_entry);
debugfs_create_u64("pool_pages", S_IRUGO,
--
1.7.9.5

2013-08-06 09:00:33

by Bob Liu

Subject: Re: [RFC PATCH 1/4] zbud: use page ref counter for zbud pages

Hi Krzysztof,

On 08/06/2013 02:42 PM, Krzysztof Kozlowski wrote:
> Use page reference counter for zbud pages. The ref counter replaces
> zbud_header.under_reclaim flag and ensures that zbud page won't be freed
> when zbud_free() is called during reclaim. It allows implementation of
> additional reclaim paths.
>
> The page count is incremented when:
> - a handle is created and passed to zswap (in zbud_alloc()),
> - user-supplied eviction callback is called (in zbud_reclaim_page()).
>
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
> Signed-off-by: Tomasz Stanislawski <[email protected]>

Looks good to me.
Reviewed-by: Bob Liu <[email protected]>

> ---
> mm/zbud.c | 150 +++++++++++++++++++++++++++++++++++--------------------------
> 1 file changed, 86 insertions(+), 64 deletions(-)
>
> diff --git a/mm/zbud.c b/mm/zbud.c
> index ad1e781..a8e986f 100644
> --- a/mm/zbud.c
> +++ b/mm/zbud.c
> @@ -109,7 +109,6 @@ struct zbud_header {
> struct list_head lru;
> unsigned int first_chunks;
> unsigned int last_chunks;
> - bool under_reclaim;
> };
>
> /*****************
> @@ -138,16 +137,9 @@ static struct zbud_header *init_zbud_page(struct page *page)
> zhdr->last_chunks = 0;
> INIT_LIST_HEAD(&zhdr->buddy);
> INIT_LIST_HEAD(&zhdr->lru);
> - zhdr->under_reclaim = 0;
> return zhdr;
> }
>
> -/* Resets the struct page fields and frees the page */
> -static void free_zbud_page(struct zbud_header *zhdr)
> -{
> - __free_page(virt_to_page(zhdr));
> -}
> -
> /*
> * Encodes the handle of a particular buddy within a zbud page
> * Pool lock should be held as this function accesses first|last_chunks
> @@ -188,6 +180,65 @@ static int num_free_chunks(struct zbud_header *zhdr)
> return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks - 1;
> }
>
> +/*
> + * Called after zbud_free() or zbud_alloc().
> + * Checks whether given zbud page has to be:
> + * - removed from buddied/unbuddied/LRU lists completely (zbud_free),
> + * - moved from buddied to unbuddied list
> + * and to beginning of LRU (zbud_alloc, zbud_free),
> + * - added to buddied list and LRU (zbud_alloc),
> + *
> + * The page must be already removed from buddied/unbuddied lists.
> + * Must be called under pool->lock.
> + */
> +static void rebalance_lists(struct zbud_pool *pool, struct zbud_header *zhdr)
> +{

Nitpicking, but how about changing the name to adjust_lists() or
something like that, since we don't do any rebalancing?

--
Regards,
-Bob

2013-08-06 09:16:53

by Bob Liu

Subject: Re: [RFC PATCH 0/4] mm: reclaim zbud pages on migration and compaction

On 08/06/2013 02:42 PM, Krzysztof Kozlowski wrote:
> Hi,
>
> Currently zbud pages are not movable and they cannot be allocated from CMA
> region. These patches try to address the problem by:
> 1. Adding a new form of reclaim of zbud pages.
> 2. Reclaiming zbud pages during migration and compaction.
> 3. Allocating zbud pages with __GFP_RECLAIMABLE flag.
>
> This reclaim process is different than zbud_reclaim_page(). It acts more
> like swapoff() by trying to unuse pages stored in zbud page and bring
> them back to memory. The standard zbud_reclaim_page() on the other hand
> tries to write them back.

I would prefer migrating zbud pages directly, if possible, rather than
reclaiming them during compaction.

>
> One of patches introduces a new flag: PageZbud. This flag is used in
> isolate_migratepages_range() to grab zbud pages and pass them later
> for reclaim. Probably this could be replaced with something
> smarter than a flag used only in one case.
> Any ideas for a better solution are welcome.
>
> This patch set is based on Linux 3.11-rc4.
>
> TODOs:
> 1. Replace PageZbud flag with other solution.
>
> Best regards,
> Krzysztof Kozlowski
>
>
> Krzysztof Kozlowski (4):
> zbud: use page ref counter for zbud pages
> mm: split code for unusing swap entries from try_to_unuse
> mm: add zbud flag to page flags
> mm: reclaim zbud pages on migration and compaction
>
> include/linux/page-flags.h | 12 ++
> include/linux/swapfile.h | 2 +
> include/linux/zbud.h | 11 +-
> mm/compaction.c | 20 ++-
> mm/internal.h | 1 +
> mm/page_alloc.c | 9 ++
> mm/swapfile.c | 354 +++++++++++++++++++++++---------------------
> mm/zbud.c | 301 +++++++++++++++++++++++++------------
> mm/zswap.c | 57 ++++++-
> 9 files changed, 499 insertions(+), 268 deletions(-)
>

--
Regards,
-Bob

2013-08-06 09:25:37

by Krzysztof Kozlowski

Subject: Re: [RFC PATCH 1/4] zbud: use page ref counter for zbud pages

Hi Bob,

Thank you for review.

On wto, 2013-08-06 at 17:00 +0800, Bob Liu wrote:
> Nit picker, how about change the name to adjust_lists() or something
> like this because we don't do any rebalancing.

OK, I'll change it.

Best regards,
Krzysztof

2013-08-06 13:05:20

by Krzysztof Kozlowski

Subject: Re: [RFC PATCH 0/4] mm: reclaim zbud pages on migration and compaction

On wto, 2013-08-06 at 17:16 +0800, Bob Liu wrote:
> On 08/06/2013 02:42 PM, Krzysztof Kozlowski wrote:
> > This reclaim process is different than zbud_reclaim_page(). It acts more
> > like swapoff() by trying to unuse pages stored in zbud page and bring
> > them back to memory. The standard zbud_reclaim_page() on the other hand
> > tries to write them back.
>
> I prefer to migrate zbud pages directly if it's possible than reclaiming
> them during compaction.

I think it is possible; however, it would definitely be more complex. In
the case of migration the zswap handles would have to be updated, as they
are just virtual addresses. Am I right?

Best regards,
Krzysztof

2013-08-06 16:58:50

by Dave Hansen

Subject: Re: [RFC PATCH 3/4] mm: add zbud flag to page flags

On 08/05/2013 11:42 PM, Krzysztof Kozlowski wrote:
> +#ifdef CONFIG_ZBUD
> + /* Allocated by zbud. Flag is necessary to find zbud pages to unuse
> + * during migration/compaction.
> + */
> + PG_zbud,
> +#endif

Do you _really_ need an absolutely new, unshared page flag?
The zbud code doesn't really look like it uses any of the space in
'struct page'.

I think you could pretty easily alias PG_zbud=PG_slab, then use the
page->{private,slab_cache} (or some other unused field) in 'struct page'
to store a cookie to differentiate slab and zbud pages.

2013-08-07 07:04:09

by Krzysztof Kozlowski

Subject: Re: [RFC PATCH 3/4] mm: add zbud flag to page flags

On wto, 2013-08-06 at 09:58 -0700, Dave Hansen wrote:
> On 08/05/2013 11:42 PM, Krzysztof Kozlowski wrote:
> > +#ifdef CONFIG_ZBUD
> > + /* Allocated by zbud. Flag is necessary to find zbud pages to unuse
> > + * during migration/compaction.
> > + */
> > + PG_zbud,
> > +#endif
>
> Do you _really_ need an absolutely new, unshared page flag?
> The zbud code doesn't really look like it uses any of the space in
> 'struct page'.
>
> I think you could pretty easily alias PG_zbud=PG_slab, then use the
> page->{private,slab_cache} (or some other unused field) in 'struct page'
> to store a cookie to differentiate slab and zbud pages.

Thanks for idea, I will try that.

Best regards,
Krzysztof

2013-08-07 07:31:55

by Krzysztof Kozlowski

Subject: Re: [RFC PATCH 1/4] zbud: use page ref counter for zbud pages

Hi Seth,

On wto, 2013-08-06 at 13:51 -0500, Seth Jennings wrote:
> I like the idea. A few things below. Also agree with Bob on the
> s/rebalance/adjust/ for rebalance_lists().
OK.

> s/else if/if/ since the if above returns if true.
Sure.

> > + /* zbud_free() or zbud_alloc() */
> > + int freechunks = num_free_chunks(zhdr);
> > + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
> > + } else {
> > + /* zbud_alloc() */
> > + list_add(&zhdr->buddy, &pool->buddied);
> > + }
> > + /* Add/move zbud page to beginning of LRU */
> > + if (!list_empty(&zhdr->lru))
> > + list_del(&zhdr->lru);
>
> We don't want to reinsert to the LRU list if we have called zbud_free()
> on a zbud page that previously had two buddies. This code causes the
> zbud page to move to the front of the LRU list which is not what we want.

Right, I'll fix it.


> > @@ -326,10 +370,10 @@ found:
> > void zbud_free(struct zbud_pool *pool, unsigned long handle)
> > {
> > struct zbud_header *zhdr;
> > - int freechunks;
> >
> > spin_lock(&pool->lock);
> > zhdr = handle_to_zbud_header(handle);
> > + BUG_ON(zhdr->last_chunks == 0 && zhdr->first_chunks == 0);
>
> Not sure we need this. Maybe, at most, VM_BUG_ON()?

Actually it is somewhat of a leftover from debugging, so I don't mind
removing it completely.


> > @@ -411,11 +438,24 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
> > return -EINVAL;
> > }
> > for (i = 0; i < retries; i++) {
> > + if (list_empty(&pool->lru)) {
> > + /*
> > + * LRU was emptied during evict calls in previous
> > + * iteration but put_zbud_page() returned 0 meaning
> > + * that someone still holds the page. This may
> > + * happen when some other mm mechanism increased
> > + * the page count.
> > + * In such a case we succeeded with reclaim.
> > + */
> > + return 0;
> > + }
> > zhdr = list_tail_entry(&pool->lru, struct zbud_header, lru);
> > + BUG_ON(zhdr->first_chunks == 0 && zhdr->last_chunks == 0);
>
> Again here.
I agree.


Thanks for comments,
Krzysztof

2013-08-08 07:26:41

by Krzysztof Kozlowski

Subject: Re: [RFC PATCH 3/4] mm: add zbud flag to page flags

Hi,

On wto, 2013-08-06 at 09:58 -0700, Dave Hansen wrote:
> On 08/05/2013 11:42 PM, Krzysztof Kozlowski wrote:
> > +#ifdef CONFIG_ZBUD
> > + /* Allocated by zbud. Flag is necessary to find zbud pages to unuse
> > + * during migration/compaction.
> > + */
> > + PG_zbud,
> > +#endif
>
> Do you _really_ need an absolutely new, unshared page flag?
> The zbud code doesn't really look like it uses any of the space in
> 'struct page'.
>
> I think you could pretty easily alias PG_zbud=PG_slab, then use the
> page->{private,slab_cache} (or some other unused field) in 'struct page'
> to store a cookie to differentiate slab and zbud pages.

How about using page->_mapcount with a negative value (-129), just like
PageBuddy()?


Best regards,
Krzysztof