2013-08-09 10:22:36

by Krzysztof Kozlowski

Subject: [RFC PATCH v2 0/4] mm: reclaim zbud pages on migration and compaction

Hi,

Currently zbud pages are not movable and they cannot be allocated from the
CMA region. These patches try to address the problem by:
1. Adding a new form of reclaim of zbud pages.
2. Reclaiming zbud pages during migration and compaction.
3. Allocating zbud pages with __GFP_RECLAIMABLE flag.

This reclaim process differs from zbud_reclaim_page(). It acts more
like swapoff() by trying to unuse the pages stored in a zbud page and bring
them back to memory. The standard zbud_reclaim_page(), on the other hand,
tries to write them back.

One of the patches introduces a PageZbud() function which identifies zbud
pages by page->_mapcount. Dave Hansen proposed aliasing PG_zbud=PG_slab, but
in that case the patch would be more intrusive.

Any ideas for a better solution are welcome.

TODO-s:
1. Migrate zbud pages directly instead of reclaiming.

Changes since v1:
1. Rebased against v3.11-rc4-103-g6c2580c.
2. Remove rebalance_lists() to fix reinserting zbud page after zbud_free.
This function was added because similar code was present in
zbud_free/zbud_alloc/zbud_reclaim_page but it turns out that there
is no benefit in generalizing this code.
(suggested by Seth Jennings)
3. Remove BUG_ON checks for first/last chunks during free and reclaim.
(suggested by Seth Jennings)
4. Use page->_mapcount==-127 instead of new PG_zbud flag.
(suggested by Dave Hansen)
5. Fix invalid dereference of pointer to compact_control in page_alloc.c.
6. Fix lost return value in try_to_unuse() in swapfile.c (this fixes
hang when swapoff was interrupted e.g. by CTRL+C).


Best regards,
Krzysztof Kozlowski


Krzysztof Kozlowski (4):
zbud: use page ref counter for zbud pages
mm: split code for unusing swap entries from try_to_unuse
mm: use mapcount for identifying zbud pages
mm: reclaim zbud pages on migration and compaction

include/linux/mm.h | 23 +++
include/linux/swapfile.h | 2 +
include/linux/zbud.h | 11 +-
mm/compaction.c | 20 ++-
mm/internal.h | 1 +
mm/page_alloc.c | 6 +
mm/swapfile.c | 356 ++++++++++++++++++++++++----------------------
mm/zbud.c | 247 +++++++++++++++++++++++---------
mm/zswap.c | 57 +++++++-
9 files changed, 476 insertions(+), 247 deletions(-)

--
1.7.9.5


2013-08-09 10:22:39

by Krzysztof Kozlowski

Subject: [RFC PATCH v2 2/4] mm: split code for unusing swap entries from try_to_unuse

Move the code for unusing a swap entry out of the loop in try_to_unuse()
into a separate function, try_to_unuse_swp_entry(). Export this new function
in swapfile.h, just as try_to_unuse() is exported.

This new function will be used by other subsystems (e.g. zswap) to unuse
swap entries.
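
The control-flow side of this split can be illustrated outside the kernel.
The sketch below is not the kernel code: demo_state, demo_unuse_one() and
demo_unuse_all() are hypothetical names invented for illustration. It only
shows how the "continue" and "break" statements of the original loop body
map onto return values of the extracted helper.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-entry states, standing in for real swap entries. */
enum demo_state { DEMO_IN_USE, DEMO_ALREADY_FREE, DEMO_NO_MEMORY };

/*
 * Extracted loop body. Returns 0 on success (entry unused or already
 * free) and a negative errno value on error.
 */
static int demo_unuse_one(enum demo_state *entry)
{
	switch (*entry) {
	case DEMO_ALREADY_FREE:
		return 0;		/* was "continue" inside the loop */
	case DEMO_NO_MEMORY:
		return -ENOMEM;		/* was "retval = -ENOMEM; break;" */
	default:
		*entry = DEMO_ALREADY_FREE;
		return 0;
	}
}

/* The caller keeps only the iteration and stops on the first error. */
static int demo_unuse_all(enum demo_state *entries, size_t n, size_t *done)
{
	int retval = 0;
	size_t i;

	*done = 0;
	for (i = 0; i < n; i++) {
		retval = demo_unuse_one(&entries[i]);
		if (retval != 0)
			break;
		(*done)++;
	}
	return retval;
}
```

Because the helper reports "nothing to do" as 0, a second caller can reuse
it for a single entry without knowing about the surrounding loop.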

Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/swapfile.h | 2 +
mm/swapfile.c | 356 ++++++++++++++++++++++++----------------------
2 files changed, 189 insertions(+), 169 deletions(-)

diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index e282624..68c24a7 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -9,5 +9,7 @@ extern spinlock_t swap_lock;
extern struct swap_list_t swap_list;
extern struct swap_info_struct *swap_info[];
extern int try_to_unuse(unsigned int, bool, unsigned long);
+extern int try_to_unuse_swp_entry(struct mm_struct **start_mm,
+ struct swap_info_struct *si, swp_entry_t entry);

#endif /* _LINUX_SWAPFILE_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 36af6ee..4ba21ec 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1100,6 +1100,190 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
}

/*
+ * Returns:
+ * - negative on error,
+ * - 0 on success (entry unused, freed independently or shmem entry
+ * already released)
+ */
+int try_to_unuse_swp_entry(struct mm_struct **start_mm,
+ struct swap_info_struct *si, swp_entry_t entry)
+{
+ pgoff_t offset = swp_offset(entry);
+ unsigned char *swap_map;
+ unsigned char swcount;
+ struct page *page;
+ int retval = 0;
+
+ if (signal_pending(current)) {
+ retval = -EINTR;
+ goto out;
+ }
+
+ /*
+ * Get a page for the entry, using the existing swap
+ * cache page if there is one. Otherwise, get a clean
+ * page and read the swap into it.
+ */
+ swap_map = &si->swap_map[offset];
+ page = read_swap_cache_async(entry,
+ GFP_HIGHUSER_MOVABLE, NULL, 0);
+ if (!page) {
+ /*
+ * Either swap_duplicate() failed because entry
+ * has been freed independently, and will not be
+ * reused since sys_swapoff() already disabled
+ * allocation from here, or alloc_page() failed.
+ */
+ if (!*swap_map)
+ retval = 0;
+ else
+ retval = -ENOMEM;
+ goto out;
+ }
+
+ /*
+ * Don't hold on to start_mm if it looks like exiting.
+ */
+ if (atomic_read(&(*start_mm)->mm_users) == 1) {
+ mmput(*start_mm);
+ *start_mm = &init_mm;
+ atomic_inc(&init_mm.mm_users);
+ }
+
+ /*
+ * Wait for and lock page. When do_swap_page races with
+ * try_to_unuse, do_swap_page can handle the fault much
+ * faster than try_to_unuse can locate the entry. This
+ * apparently redundant "wait_on_page_locked" lets try_to_unuse
+ * defer to do_swap_page in such a case - in some tests,
+ * do_swap_page and try_to_unuse repeatedly compete.
+ */
+ wait_on_page_locked(page);
+ wait_on_page_writeback(page);
+ lock_page(page);
+ wait_on_page_writeback(page);
+
+ /*
+ * Remove all references to entry.
+ */
+ swcount = *swap_map;
+ if (swap_count(swcount) == SWAP_MAP_SHMEM) {
+ retval = shmem_unuse(entry, page);
+ VM_BUG_ON(retval > 0);
+ /* page has already been unlocked and released */
+ goto out;
+ }
+ if (swap_count(swcount) && *start_mm != &init_mm)
+ retval = unuse_mm(*start_mm, entry, page);
+
+ if (swap_count(*swap_map)) {
+ int set_start_mm = (*swap_map >= swcount);
+ struct list_head *p = &(*start_mm)->mmlist;
+ struct mm_struct *new_start_mm = *start_mm;
+ struct mm_struct *prev_mm = *start_mm;
+ struct mm_struct *mm;
+
+ atomic_inc(&new_start_mm->mm_users);
+ atomic_inc(&prev_mm->mm_users);
+ spin_lock(&mmlist_lock);
+ while (swap_count(*swap_map) && !retval &&
+ (p = p->next) != &(*start_mm)->mmlist) {
+ mm = list_entry(p, struct mm_struct, mmlist);
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ continue;
+ spin_unlock(&mmlist_lock);
+ mmput(prev_mm);
+ prev_mm = mm;
+
+ cond_resched();
+
+ swcount = *swap_map;
+ if (!swap_count(swcount)) /* any usage ? */
+ ;
+ else if (mm == &init_mm)
+ set_start_mm = 1;
+ else
+ retval = unuse_mm(mm, entry, page);
+
+ if (set_start_mm && *swap_map < swcount) {
+ mmput(new_start_mm);
+ atomic_inc(&mm->mm_users);
+ new_start_mm = mm;
+ set_start_mm = 0;
+ }
+ spin_lock(&mmlist_lock);
+ }
+ spin_unlock(&mmlist_lock);
+ mmput(prev_mm);
+ mmput(*start_mm);
+ *start_mm = new_start_mm;
+ }
+ if (retval) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto out;
+ }
+
+ /*
+ * If a reference remains (rare), we would like to leave
+ * the page in the swap cache; but try_to_unmap could
+ * then re-duplicate the entry once we drop page lock,
+ * so we might loop indefinitely; also, that page could
+ * not be swapped out to other storage meanwhile. So:
+ * delete from cache even if there's another reference,
+ * after ensuring that the data has been saved to disk -
+ * since if the reference remains (rarer), it will be
+ * read from disk into another page. Splitting into two
+ * pages would be incorrect if swap supported "shared
+ * private" pages, but they are handled by tmpfs files.
+ *
+ * Given how unuse_vma() targets one particular offset
+ * in an anon_vma, once the anon_vma has been determined,
+ * this splitting happens to be just what is needed to
+ * handle where KSM pages have been swapped out: re-reading
+ * is unnecessarily slow, but we can fix that later on.
+ */
+ if (swap_count(*swap_map) &&
+ PageDirty(page) && PageSwapCache(page)) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ };
+
+ swap_writepage(page, &wbc);
+ lock_page(page);
+ wait_on_page_writeback(page);
+ }
+
+ /*
+ * It is conceivable that a racing task removed this page from
+ * swap cache just before we acquired the page lock at the top,
+ * or while we dropped it in unuse_mm(). The page might even
+ * be back in swap cache on another swap area: that we must not
+ * delete, since it may not have been written out to swap yet.
+ */
+ if (PageSwapCache(page) &&
+ likely(page_private(page) == entry.val))
+ delete_from_swap_cache(page);
+
+ /*
+ * So we could skip searching mms once swap count went
+ * to 1, we did not mark any present ptes as dirty: must
+ * mark page dirty so shrink_page_list will preserve it.
+ */
+ SetPageDirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * Make sure that we aren't completely killing
+ * interactive performance.
+ */
+ cond_resched();
+out:
+ return retval;
+}
+
+/*
* We completely avoid races by reading each swap page in advance,
* and then search for the process using it. All the necessary
* page table adjustments can then be made atomically.
@@ -1112,10 +1296,6 @@ int try_to_unuse(unsigned int type, bool frontswap,
{
struct swap_info_struct *si = swap_info[type];
struct mm_struct *start_mm;
- unsigned char *swap_map;
- unsigned char swcount;
- struct page *page;
- swp_entry_t entry;
unsigned int i = 0;
int retval = 0;

@@ -1142,172 +1322,10 @@ int try_to_unuse(unsigned int type, bool frontswap,
* there are races when an instance of an entry might be missed.
*/
while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
- if (signal_pending(current)) {
- retval = -EINTR;
- break;
- }
-
- /*
- * Get a page for the entry, using the existing swap
- * cache page if there is one. Otherwise, get a clean
- * page and read the swap into it.
- */
- swap_map = &si->swap_map[i];
- entry = swp_entry(type, i);
- page = read_swap_cache_async(entry,
- GFP_HIGHUSER_MOVABLE, NULL, 0);
- if (!page) {
- /*
- * Either swap_duplicate() failed because entry
- * has been freed independently, and will not be
- * reused since sys_swapoff() already disabled
- * allocation from here, or alloc_page() failed.
- */
- if (!*swap_map)
- continue;
- retval = -ENOMEM;
- break;
- }
-
- /*
- * Don't hold on to start_mm if it looks like exiting.
- */
- if (atomic_read(&start_mm->mm_users) == 1) {
- mmput(start_mm);
- start_mm = &init_mm;
- atomic_inc(&init_mm.mm_users);
- }
-
- /*
- * Wait for and lock page. When do_swap_page races with
- * try_to_unuse, do_swap_page can handle the fault much
- * faster than try_to_unuse can locate the entry. This
- * apparently redundant "wait_on_page_locked" lets try_to_unuse
- * defer to do_swap_page in such a case - in some tests,
- * do_swap_page and try_to_unuse repeatedly compete.
- */
- wait_on_page_locked(page);
- wait_on_page_writeback(page);
- lock_page(page);
- wait_on_page_writeback(page);
-
- /*
- * Remove all references to entry.
- */
- swcount = *swap_map;
- if (swap_count(swcount) == SWAP_MAP_SHMEM) {
- retval = shmem_unuse(entry, page);
- /* page has already been unlocked and released */
- if (retval < 0)
- break;
- continue;
- }
- if (swap_count(swcount) && start_mm != &init_mm)
- retval = unuse_mm(start_mm, entry, page);
-
- if (swap_count(*swap_map)) {
- int set_start_mm = (*swap_map >= swcount);
- struct list_head *p = &start_mm->mmlist;
- struct mm_struct *new_start_mm = start_mm;
- struct mm_struct *prev_mm = start_mm;
- struct mm_struct *mm;
-
- atomic_inc(&new_start_mm->mm_users);
- atomic_inc(&prev_mm->mm_users);
- spin_lock(&mmlist_lock);
- while (swap_count(*swap_map) && !retval &&
- (p = p->next) != &start_mm->mmlist) {
- mm = list_entry(p, struct mm_struct, mmlist);
- if (!atomic_inc_not_zero(&mm->mm_users))
- continue;
- spin_unlock(&mmlist_lock);
- mmput(prev_mm);
- prev_mm = mm;
-
- cond_resched();
-
- swcount = *swap_map;
- if (!swap_count(swcount)) /* any usage ? */
- ;
- else if (mm == &init_mm)
- set_start_mm = 1;
- else
- retval = unuse_mm(mm, entry, page);
-
- if (set_start_mm && *swap_map < swcount) {
- mmput(new_start_mm);
- atomic_inc(&mm->mm_users);
- new_start_mm = mm;
- set_start_mm = 0;
- }
- spin_lock(&mmlist_lock);
- }
- spin_unlock(&mmlist_lock);
- mmput(prev_mm);
- mmput(start_mm);
- start_mm = new_start_mm;
- }
- if (retval) {
- unlock_page(page);
- page_cache_release(page);
+ retval = try_to_unuse_swp_entry(&start_mm, si,
+ swp_entry(type, i));
+ if (retval != 0)
break;
- }
-
- /*
- * If a reference remains (rare), we would like to leave
- * the page in the swap cache; but try_to_unmap could
- * then re-duplicate the entry once we drop page lock,
- * so we might loop indefinitely; also, that page could
- * not be swapped out to other storage meanwhile. So:
- * delete from cache even if there's another reference,
- * after ensuring that the data has been saved to disk -
- * since if the reference remains (rarer), it will be
- * read from disk into another page. Splitting into two
- * pages would be incorrect if swap supported "shared
- * private" pages, but they are handled by tmpfs files.
- *
- * Given how unuse_vma() targets one particular offset
- * in an anon_vma, once the anon_vma has been determined,
- * this splitting happens to be just what is needed to
- * handle where KSM pages have been swapped out: re-reading
- * is unnecessarily slow, but we can fix that later on.
- */
- if (swap_count(*swap_map) &&
- PageDirty(page) && PageSwapCache(page)) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_NONE,
- };
-
- swap_writepage(page, &wbc);
- lock_page(page);
- wait_on_page_writeback(page);
- }
-
- /*
- * It is conceivable that a racing task removed this page from
- * swap cache just before we acquired the page lock at the top,
- * or while we dropped it in unuse_mm(). The page might even
- * be back in swap cache on another swap area: that we must not
- * delete, since it may not have been written out to swap yet.
- */
- if (PageSwapCache(page) &&
- likely(page_private(page) == entry.val))
- delete_from_swap_cache(page);
-
- /*
- * So we could skip searching mms once swap count went
- * to 1, we did not mark any present ptes as dirty: must
- * mark page dirty so shrink_page_list will preserve it.
- */
- SetPageDirty(page);
- unlock_page(page);
- page_cache_release(page);
-
- /*
- * Make sure that we aren't completely killing
- * interactive performance.
- */
- cond_resched();
if (frontswap && pages_to_unuse > 0) {
if (!--pages_to_unuse)
break;
--
1.7.9.5

2013-08-09 10:22:57

by Krzysztof Kozlowski

Subject: [RFC PATCH v2 4/4] mm: reclaim zbud pages on migration and compaction

Reclaim zbud pages during migration and compaction by unusing stored
data. This allows adding the __GFP_RECLAIMABLE flag when allocating zbud
pages, so that the CMA pool can effectively be used for zswap.

zbud pages are not movable and are not stored under any LRU (except
zbud's LRU). PageZbud flag is used in isolate_migratepages_range() to
grab zbud pages and pass them later for reclaim.

This reclaim process differs from zbud_reclaim_page(). It acts more
like swapoff() by trying to unuse the pages stored in a zbud page and bring
them back to memory. The standard zbud_reclaim_page(), on the other hand,
tries to write them back.
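
The two reclaim flavours share one helper that takes the callback as a
parameter. A minimal user-space sketch of that pattern follows; all demo_*
names are hypothetical, the locking and page refcounting of the real
do_reclaim() are left out, and the return value is simplified to "0 or the
first callback error".

```c
#include <errno.h>
#include <stddef.h>

struct demo_pool;

/* One callback type shared by both reclaim flavours. */
typedef int (*demo_evict_t)(struct demo_pool *pool, unsigned long handle);

struct demo_ops {
	demo_evict_t evict;	/* write the data back */
	demo_evict_t unuse;	/* bring the data back to memory */
};

struct demo_pool {
	const struct demo_ops *ops;
	int writebacks;		/* counters used only by this sketch */
	int unuses;
};

/*
 * Calls cb for each non-zero handle of a page and stops at the first
 * failure. A handle value of 0 means "this buddy is free".
 */
static int demo_reclaim(struct demo_pool *pool, unsigned long first,
			unsigned long last, demo_evict_t cb)
{
	int ret;

	if (!cb)
		return -EINVAL;
	if (first) {
		ret = cb(pool, first);
		if (ret)
			return ret;
	}
	if (last) {
		ret = cb(pool, last);
		if (ret)
			return ret;
	}
	return 0;
}

static int demo_writeback(struct demo_pool *pool, unsigned long handle)
{
	(void)handle;
	pool->writebacks++;
	return 0;
}

static int demo_unuse(struct demo_pool *pool, unsigned long handle)
{
	(void)handle;
	pool->unuses++;
	return 0;
}
```

Passing pool->ops->evict or pool->ops->unuse selects the behaviour at the
call site, so the handle bookkeeping is written only once.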

Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/zbud.h | 11 +++-
mm/compaction.c | 20 ++++++-
mm/internal.h | 1 +
mm/page_alloc.c | 6 ++
mm/zbud.c | 161 +++++++++++++++++++++++++++++++++++++++-----------
mm/zswap.c | 57 ++++++++++++++++--
6 files changed, 214 insertions(+), 42 deletions(-)

diff --git a/include/linux/zbud.h b/include/linux/zbud.h
index 2571a5c..57ee85d 100644
--- a/include/linux/zbud.h
+++ b/include/linux/zbud.h
@@ -5,8 +5,14 @@

struct zbud_pool;

+/**
+ * Template for functions called during reclaim.
+ */
+typedef int (*evict_page_t)(struct zbud_pool *pool, unsigned long handle);
+
struct zbud_ops {
- int (*evict)(struct zbud_pool *pool, unsigned long handle);
+ evict_page_t evict; /* callback for zbud_reclaim_lru_page() */
+ evict_page_t unuse; /* callback for zbud_reclaim_pages() */
};

struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops);
@@ -14,7 +20,8 @@ void zbud_destroy_pool(struct zbud_pool *pool);
int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
unsigned long *handle);
void zbud_free(struct zbud_pool *pool, unsigned long handle);
-int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries);
+int zbud_reclaim_lru_page(struct zbud_pool *pool, unsigned int retries);
+void zbud_reclaim_pages(struct list_head *zbud_pages);
void *zbud_map(struct zbud_pool *pool, unsigned long handle);
void zbud_unmap(struct zbud_pool *pool, unsigned long handle);
u64 zbud_get_pool_size(struct zbud_pool *pool);
diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..9bbf412 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -16,6 +16,7 @@
#include <linux/sysfs.h>
#include <linux/balloon_compaction.h>
#include <linux/page-isolation.h>
+#include <linux/zbud.h>
#include "internal.h"

#ifdef CONFIG_COMPACTION
@@ -534,6 +535,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
goto next_pageblock;
}

+ if (PageZbud(page)) {
+ /*
+ * Zbud pages do not exist in LRU so we must
+ * check for Zbud flag before PageLRU() below.
+ */
+ BUG_ON(PageLRU(page));
+ get_page(page);
+ list_add(&page->lru, &cc->zbudpages);
+ continue;
+ }
+
/*
* Check may be lockless but that's ok as we recheck later.
* It's possible to migrate LRU pages and balloon pages
@@ -810,7 +822,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
if (!low_pfn || cc->contended)
return ISOLATE_ABORT;
-
+#ifdef CONFIG_ZBUD
+ if (!list_empty(&cc->zbudpages))
+ zbud_reclaim_pages(&cc->zbudpages);
+#endif
cc->migrate_pfn = low_pfn;

return ISOLATE_SUCCESS;
@@ -1023,11 +1038,13 @@ static unsigned long compact_zone_order(struct zone *zone,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
+ INIT_LIST_HEAD(&cc.zbudpages);

ret = compact_zone(zone, &cc);

VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
+ VM_BUG_ON(!list_empty(&cc.zbudpages));

*contended = cc.contended;
return ret;
@@ -1105,6 +1122,7 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
cc->zone = zone;
INIT_LIST_HEAD(&cc->freepages);
INIT_LIST_HEAD(&cc->migratepages);
+ INIT_LIST_HEAD(&cc->zbudpages);

if (cc->order == -1 || !compaction_deferred(zone, cc->order))
compact_zone(zone, cc);
diff --git a/mm/internal.h b/mm/internal.h
index 4390ac6..eaf5c884 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -119,6 +119,7 @@ struct compact_control {
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
+ struct list_head zbudpages; /* List of pages belonging to zbud */
bool sync; /* Synchronous migration */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
bool finished_update_free; /* True when the zone cached pfns are
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..5290e1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
#include <linux/page-debug-flags.h>
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>
+#include <linux/zbud.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -6031,6 +6032,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
ret = -EINTR;
break;
}
+#ifdef CONFIG_ZBUD
+ if (!list_empty(&cc->zbudpages))
+ zbud_reclaim_pages(&cc->zbudpages);
+#endif
tries = 0;
} else if (++tries == 5) {
ret = ret < 0 ? ret : -EBUSY;
@@ -6085,6 +6090,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
.ignore_skip_hint = true,
};
INIT_LIST_HEAD(&cc.migratepages);
+ INIT_LIST_HEAD(&cc.zbudpages);

/*
* What we do here is we mark all pageblocks in range as
diff --git a/mm/zbud.c b/mm/zbud.c
index 24c9ba0..0bc0299 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -103,12 +103,14 @@ struct zbud_pool {
* @lru: links the zbud page into the lru list in the pool
* @first_chunks: the size of the first buddy in chunks, 0 if free
* @last_chunks: the size of the last buddy in chunks, 0 if free
+ * @pool: pool to which this zbud page belongs
*/
struct zbud_header {
struct list_head buddy;
struct list_head lru;
unsigned int first_chunks;
unsigned int last_chunks;
+ struct zbud_pool *pool;
};

/*****************
@@ -137,6 +139,7 @@ static struct zbud_header *init_zbud_page(struct page *page)
zhdr->last_chunks = 0;
INIT_LIST_HEAD(&zhdr->buddy);
INIT_LIST_HEAD(&zhdr->lru);
+ zhdr->pool = NULL;
return zhdr;
}

@@ -210,7 +213,6 @@ static int put_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr)
return 0;
}

-
/*****************
* API Functions
*****************/
@@ -314,6 +316,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
*/
zhdr = init_zbud_page(page);
SetPageZbud(page);
+ zhdr->pool = pool;
bud = FIRST;

found:
@@ -383,8 +386,56 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
#define list_tail_entry(ptr, type, member) \
list_entry((ptr)->prev, type, member)

+/*
+ * Pool lock must be held when calling this function and at least
+ * one handle must not be free.
+ * On return the pool lock will be still held however during the
+ * execution it will be unlocked and locked for the time of calling
+ * the evict callback.
+ *
+ * Returns 1 if page was freed here, 0 otherwise (still in use)
+ */
+static int do_reclaim(struct zbud_pool *pool, struct zbud_header *zhdr,
+ evict_page_t evict_cb)
+{
+ int ret;
+ unsigned long first_handle = 0, last_handle = 0;
+
+ /* Move this last element to beginning of LRU */
+ list_del(&zhdr->lru);
+ list_add(&zhdr->lru, &pool->lru);
+ /* Protect zbud page against free */
+ get_zbud_page(zhdr);
+ /*
+ * We need encode the handles before unlocking, since we can
+ * race with free that will set (first|last)_chunks to 0
+ */
+ first_handle = 0;
+ last_handle = 0;
+ if (zhdr->first_chunks)
+ first_handle = encode_handle(zhdr, FIRST);
+ if (zhdr->last_chunks)
+ last_handle = encode_handle(zhdr, LAST);
+ spin_unlock(&pool->lock);
+
+ /* Issue the eviction callback(s) */
+ if (first_handle) {
+ ret = evict_cb(pool, first_handle);
+ if (ret)
+ goto next;
+ }
+ if (last_handle) {
+ ret = evict_cb(pool, last_handle);
+ if (ret)
+ goto next;
+ }
+next:
+ spin_lock(&pool->lock);
+ return put_zbud_page(pool, zhdr);
+}
+
/**
- * zbud_reclaim_page() - evicts allocations from a pool page and frees it
+ * zbud_reclaim_lru_page() - evicts allocations from a pool page and frees it
* @pool: pool from which a page will attempt to be evicted
* @retires: number of pages on the LRU list for which eviction will
* be attempted before failing
@@ -418,11 +469,10 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
* no pages to evict or an eviction handler is not registered, -EAGAIN if
* the retry limit was hit.
*/
-int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
+int zbud_reclaim_lru_page(struct zbud_pool *pool, unsigned int retries)
{
- int i, ret;
+ int i;
struct zbud_header *zhdr;
- unsigned long first_handle = 0, last_handle = 0;

spin_lock(&pool->lock);
if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) ||
@@ -443,43 +493,84 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
return 0;
}
zhdr = list_tail_entry(&pool->lru, struct zbud_header, lru);
- /* Move this last element to beginning of LRU */
- list_del(&zhdr->lru);
- list_add(&zhdr->lru, &pool->lru);
- /* Protect zbud page against free */
- get_zbud_page(zhdr);
- /*
- * We need encode the handles before unlocking, since we can
- * race with free that will set (first|last)_chunks to 0
- */
- first_handle = 0;
- last_handle = 0;
- if (zhdr->first_chunks)
- first_handle = encode_handle(zhdr, FIRST);
- if (zhdr->last_chunks)
- last_handle = encode_handle(zhdr, LAST);
- spin_unlock(&pool->lock);
-
- /* Issue the eviction callback(s) */
- if (first_handle) {
- ret = pool->ops->evict(pool, first_handle);
- if (ret)
- goto next;
+ if (do_reclaim(pool, zhdr, pool->ops->evict)) {
+ spin_unlock(&pool->lock);
+ return 0;
}
- if (last_handle) {
- ret = pool->ops->evict(pool, last_handle);
- if (ret)
- goto next;
+ }
+ spin_unlock(&pool->lock);
+ return -EAGAIN;
+}
+
+
+/**
+ * zbud_reclaim_pages() - reclaims zbud pages by unusing stored pages
+ * @zbud_pages: list of zbud pages to reclaim
+ *
+ * zbud reclaim is different from normal system reclaim in that the reclaim is
+ * done from the bottom, up. This is because only the bottom layer, zbud, has
+ * information on how the allocations are organized within each zbud page. This
+ * has the potential to create interesting locking situations between zbud and
+ * the user, however.
+ *
+ * To avoid these, this is how zbud_reclaim_pages() should be called:
+ *
+ * The user detects some pages should be reclaimed and calls
+ * zbud_reclaim_pages(). The zbud_reclaim_pages() will remove zbud
+ * pages from the pool LRU list and call the user-defined unuse handler with
+ * the pool and handle as arguments.
+ *
+ * If the handle can not be unused, the unuse handler should return
+ * non-zero. zbud_reclaim_pages() will add the zbud page back to the
+ * appropriate list and try the next zbud page on the list.
+ *
+ * If the handle is successfully unused, the unuse handler should
+ * return 0.
+ * The zbud page will be freed later by unuse code
+ * (e.g. frontswap_invalidate_page()).
+ *
+ * If all buddies in the zbud page are successfully unused, then the
+ * zbud page can be freed.
+ */
+void zbud_reclaim_pages(struct list_head *zbud_pages)
+{
+ struct page *page;
+ struct page *page2;
+
+ list_for_each_entry_safe(page, page2, zbud_pages, lru) {
+ struct zbud_header *zhdr;
+ struct zbud_pool *pool;
+
+ list_del(&page->lru);
+ if (!PageZbud(page)) {
+ /*
+ * Drop page count from isolate_migratepages_range()
+ */
+ put_page(page);
+ continue;
}
-next:
+ zhdr = page_address(page);
+ BUG_ON(!zhdr->pool);
+ pool = zhdr->pool;
+
spin_lock(&pool->lock);
+ /* Drop page count from isolate_migratepages_range() */
if (put_zbud_page(pool, zhdr)) {
+ /*
+ * zbud_free() could free the handles before acquiring
+ * pool lock above. No need to reclaim.
+ */
spin_unlock(&pool->lock);
- return 0;
+ continue;
+ }
+ if (!pool->ops || !pool->ops->unuse || list_empty(&pool->lru)) {
+ spin_unlock(&pool->lock);
+ continue;
}
+ BUG_ON(!PageZbud(page));
+ do_reclaim(pool, zhdr, pool->ops->unuse);
+ spin_unlock(&pool->lock);
}
- spin_unlock(&pool->lock);
- return -EAGAIN;
}

/**
diff --git a/mm/zswap.c b/mm/zswap.c
index deda2b6..846649b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,9 @@
#include <linux/crypto.h>
#include <linux/mempool.h>
#include <linux/zbud.h>
+#include <linux/swapfile.h>
+#include <linux/mman.h>
+#include <linux/security.h>

#include <linux/mm_types.h>
#include <linux/page-flags.h>
@@ -61,6 +64,8 @@ static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
static u64 zswap_pool_limit_hit;
/* Pages written back when pool limit was reached */
static u64 zswap_written_back_pages;
+/* Pages unused due to reclaim */
+static u64 zswap_unused_pages;
/* Store failed due to a reclaim failure after pool limit was reached */
static u64 zswap_reject_reclaim_fail;
/* Compressed page was too big for the allocator to (optimally) store */
@@ -596,6 +601,47 @@ fail:
return ret;
}

+/**
+ * Tries to unuse swap entries by uncompressing them.
+ * Function is a stripped swapfile.c::try_to_unuse().
+ *
+ * Returns 0 on success or negative on error.
+ */
+static int zswap_unuse_entry(struct zbud_pool *pool, unsigned long handle)
+{
+ struct zswap_header *zhdr;
+ swp_entry_t swpentry;
+ struct zswap_tree *tree;
+ pgoff_t offset;
+ struct mm_struct *start_mm;
+ struct swap_info_struct *si;
+ int ret;
+
+ /* extract swpentry from data */
+ zhdr = zbud_map(pool, handle);
+ swpentry = zhdr->swpentry; /* here */
+ zbud_unmap(pool, handle);
+ tree = zswap_trees[swp_type(swpentry)];
+ offset = swp_offset(swpentry);
+ BUG_ON(pool != tree->pool);
+
+ /*
+ * We cannot hold swap_lock here but swap_info may
+ * change (e.g. by swapoff). In case of swapoff
+ * check for SWP_WRITEOK.
+ */
+ si = swap_info[swp_type(swpentry)];
+ if (!(si->flags & SWP_WRITEOK))
+ return -ECANCELED;
+
+ start_mm = &init_mm;
+ atomic_inc(&init_mm.mm_users);
+ ret = try_to_unuse_swp_entry(&start_mm, si, swpentry);
+ mmput(start_mm);
+ zswap_unused_pages++;
+ return ret;
+}
+
/*********************************
* frontswap hooks
**********************************/
@@ -620,7 +666,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
/* reclaim space if needed */
if (zswap_is_full()) {
zswap_pool_limit_hit++;
- if (zbud_reclaim_page(tree->pool, 8)) {
+ if (zbud_reclaim_lru_page(tree->pool, 8)) {
zswap_reject_reclaim_fail++;
ret = -ENOMEM;
goto reject;
@@ -647,8 +693,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,

/* store */
len = dlen + sizeof(struct zswap_header);
- ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN,
- &handle);
+ ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN |
+ __GFP_RECLAIMABLE, &handle);
if (ret == -ENOSPC) {
zswap_reject_compress_poor++;
goto freepage;
@@ -819,7 +865,8 @@ static void zswap_frontswap_invalidate_area(unsigned type)
}

static struct zbud_ops zswap_zbud_ops = {
- .evict = zswap_writeback_entry
+ .evict = zswap_writeback_entry,
+ .unuse = zswap_unuse_entry
};

static void zswap_frontswap_init(unsigned type)
@@ -880,6 +927,8 @@ static int __init zswap_debugfs_init(void)
zswap_debugfs_root, &zswap_reject_compress_poor);
debugfs_create_u64("written_back_pages", S_IRUGO,
zswap_debugfs_root, &zswap_written_back_pages);
+ debugfs_create_u64("unused_pages", S_IRUGO,
+ zswap_debugfs_root, &zswap_unused_pages);
debugfs_create_u64("duplicate_entry", S_IRUGO,
zswap_debugfs_root, &zswap_duplicate_entry);
debugfs_create_u64("pool_pages", S_IRUGO,
--
1.7.9.5

2013-08-09 10:22:54

by Krzysztof Kozlowski

Subject: [RFC PATCH v2 3/4] mm: use mapcount for identifying zbud pages

Currently zbud pages do not have any flags set so it is not possible to
identify them during migration or compaction.

Implement PageZbud() by comparing page->_mapcount to -127 to distinguish
pages allocated by zbud, just as PageBuddy() is implemented with -128.
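
The sentinel-value idea can be tried in user space. The following is only a
sketch under stated assumptions: struct demo_page is a hypothetical stand-in
for struct page with a single field, C11 atomics replace the kernel's
atomic_t helpers, and assert() plays the role of VM_BUG_ON().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the field used here. */
struct demo_page {
	atomic_int mapcount;	/* -1 means "no special owner" */
};

#define DEMO_BUDDY_MAPCOUNT_VALUE	(-128)
#define DEMO_ZBUD_MAPCOUNT_VALUE	(-127)

static inline void demo_page_init(struct demo_page *page)
{
	atomic_init(&page->mapcount, -1);
}

static inline bool demo_page_is_buddy(struct demo_page *page)
{
	return atomic_load(&page->mapcount) == DEMO_BUDDY_MAPCOUNT_VALUE;
}

static inline bool demo_page_is_zbud(struct demo_page *page)
{
	return atomic_load(&page->mapcount) == DEMO_ZBUD_MAPCOUNT_VALUE;
}

/* Marking is only valid on a page that carries no other sentinel. */
static inline void demo_set_zbud(struct demo_page *page)
{
	assert(atomic_load(&page->mapcount) == -1);
	atomic_store(&page->mapcount, DEMO_ZBUD_MAPCOUNT_VALUE);
}

static inline void demo_clear_zbud(struct demo_page *page)
{
	assert(demo_page_is_zbud(page));
	atomic_store(&page->mapcount, -1);
}
```

Since the two sentinels differ, a page marked as zbud is never mistaken for
a free buddy-allocator page, and no new page flag bit is consumed.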

Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/mm.h | 23 +++++++++++++++++++++++
mm/zbud.c | 4 ++++
2 files changed, 27 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..b9ae6f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -440,6 +440,7 @@ static inline void init_page_count(struct page *page)
* efficiently by most CPU architectures.
*/
#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+#define PAGE_ZBUD_MAPCOUNT_VALUE (-127)

static inline int PageBuddy(struct page *page)
{
@@ -458,6 +459,28 @@ static inline void __ClearPageBuddy(struct page *page)
atomic_set(&page->_mapcount, -1);
}

+#ifdef CONFIG_ZBUD
+static inline int PageZbud(struct page *page)
+{
+ return atomic_read(&page->_mapcount) == PAGE_ZBUD_MAPCOUNT_VALUE;
+}
+
+static inline void SetPageZbud(struct page *page)
+{
+ VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
+ atomic_set(&page->_mapcount, PAGE_ZBUD_MAPCOUNT_VALUE);
+}
+
+static inline void ClearPageZbud(struct page *page)
+{
+ VM_BUG_ON(!PageZbud(page));
+ atomic_set(&page->_mapcount, -1);
+}
+#else
+PAGEFLAG_FALSE(Zbud)
+#endif
+
+
void put_page(struct page *page);
void put_pages_list(struct list_head *pages);

diff --git a/mm/zbud.c b/mm/zbud.c
index 52f6ba1..24c9ba0 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -199,7 +199,10 @@ static void get_zbud_page(struct zbud_header *zhdr)
static int put_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr)
{
struct page *page = virt_to_page(zhdr);
+ BUG_ON(!PageZbud(page));
+
if (put_page_testzero(page)) {
+ ClearPageZbud(page);
free_hot_cold_page(page, 0);
pool->pages_nr--;
return 1;
@@ -310,6 +313,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
* don't increase the page count.
*/
zhdr = init_zbud_page(page);
+ SetPageZbud(page);
bud = FIRST;

found:
--
1.7.9.5
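
The tagging scheme in the patch above can be modelled outside the kernel. The following is only a user-space sketch, assuming a stand-in struct with a plain int in place of the atomic page->_mapcount; every fake_* name is hypothetical and not part of the patch. It shows why -127 works as a type tag: -1 is the "unmapped" value, -128 is already taken by PageBuddy(), and a real mapcount never goes below -1.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space stand-in for the one field of struct page that matters here. */
struct fake_page {
	int mapcount;	/* -1 means "not mapped, no special type" */
};

#define FAKE_BUDDY_MAPCOUNT_VALUE (-128)
#define FAKE_ZBUD_MAPCOUNT_VALUE  (-127)

/* Mirrors PageZbud(): the page is a zbud page iff it carries the tag. */
static bool fake_page_zbud(const struct fake_page *page)
{
	return page->mapcount == FAKE_ZBUD_MAPCOUNT_VALUE;
}

/* Mirrors SetPageZbud(): only an unmapped, untyped page may be tagged. */
static void fake_set_page_zbud(struct fake_page *page)
{
	assert(page->mapcount == -1);
	page->mapcount = FAKE_ZBUD_MAPCOUNT_VALUE;
}

/* Mirrors ClearPageZbud(): restore the "unmapped" value before freeing. */
static void fake_clear_page_zbud(struct fake_page *page)
{
	assert(fake_page_zbud(page));
	page->mapcount = -1;
}
```

In this model the tag is set once after the page is initialised and cleared just before it is freed, matching the zbud_alloc() and put_zbud_page() hunks.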

2013-08-09 10:23:31

by Krzysztof Kozlowski

Subject: [RFC PATCH v2 1/4] zbud: use page ref counter for zbud pages

Use page reference counter for zbud pages. The ref counter replaces
zbud_header.under_reclaim flag and ensures that zbud page won't be freed
when zbud_free() is called during reclaim. It allows implementation of
additional reclaim paths.

The page count is incremented when:
- a handle is created and passed to zswap (in zbud_alloc()),
- user-supplied eviction callback is called (in zbud_reclaim_page()).

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Tomasz Stanislawski <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
---
mm/zbud.c | 96 ++++++++++++++++++++++++++++++++++---------------------------
1 file changed, 53 insertions(+), 43 deletions(-)

diff --git a/mm/zbud.c b/mm/zbud.c
index ad1e781..52f6ba1 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -109,7 +109,6 @@ struct zbud_header {
struct list_head lru;
unsigned int first_chunks;
unsigned int last_chunks;
- bool under_reclaim;
};

/*****************
@@ -138,16 +137,9 @@ static struct zbud_header *init_zbud_page(struct page *page)
zhdr->last_chunks = 0;
INIT_LIST_HEAD(&zhdr->buddy);
INIT_LIST_HEAD(&zhdr->lru);
- zhdr->under_reclaim = 0;
return zhdr;
}

-/* Resets the struct page fields and frees the page */
-static void free_zbud_page(struct zbud_header *zhdr)
-{
- __free_page(virt_to_page(zhdr));
-}
-
/*
* Encodes the handle of a particular buddy within a zbud page
* Pool lock should be held as this function accesses first|last_chunks
@@ -188,6 +180,34 @@ static int num_free_chunks(struct zbud_header *zhdr)
return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks - 1;
}

+/*
+ * Increases ref count for zbud page.
+ */
+static void get_zbud_page(struct zbud_header *zhdr)
+{
+ get_page(virt_to_page(zhdr));
+}
+
+/*
+ * Decreases ref count for zbud page and frees the page if it reaches 0
+ * (no external references, e.g. handles).
+ *
+ * Must be called under pool->lock.
+ *
+ * Returns 1 if page was freed and 0 otherwise.
+ */
+static int put_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr)
+{
+ struct page *page = virt_to_page(zhdr);
+ if (put_page_testzero(page)) {
+ free_hot_cold_page(page, 0);
+ pool->pages_nr--;
+ return 1;
+ }
+ return 0;
+}
+
+
/*****************
* API Functions
*****************/
@@ -250,7 +270,7 @@ void zbud_destroy_pool(struct zbud_pool *pool)
int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
unsigned long *handle)
{
- int chunks, i, freechunks;
+ int chunks, i;
struct zbud_header *zhdr = NULL;
enum buddy bud;
struct page *page;
@@ -273,6 +293,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
bud = FIRST;
else
bud = LAST;
+ get_zbud_page(zhdr);
goto found;
}
}
@@ -284,6 +305,10 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
return -ENOMEM;
spin_lock(&pool->lock);
pool->pages_nr++;
+ /*
+ * We will be using zhdr instead of page, so
+ * don't increase the page count.
+ */
zhdr = init_zbud_page(page);
bud = FIRST;

@@ -295,7 +320,7 @@ found:

if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
/* Add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
+ int freechunks = num_free_chunks(zhdr);
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
} else {
/* Add to buddied list */
@@ -326,7 +351,6 @@ found:
void zbud_free(struct zbud_pool *pool, unsigned long handle)
{
struct zbud_header *zhdr;
- int freechunks;

spin_lock(&pool->lock);
zhdr = handle_to_zbud_header(handle);
@@ -337,26 +361,18 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
else
zhdr->first_chunks = 0;

- if (zhdr->under_reclaim) {
- /* zbud page is under reclaim, reclaim will free */
- spin_unlock(&pool->lock);
- return;
- }
-
/* Remove from existing buddy list */
list_del(&zhdr->buddy);

if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
- /* zbud page is empty, free */
list_del(&zhdr->lru);
- free_zbud_page(zhdr);
- pool->pages_nr--;
} else {
/* Add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
+ int freechunks = num_free_chunks(zhdr);
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
}

+ put_zbud_page(pool, zhdr);
spin_unlock(&pool->lock);
}

@@ -400,7 +416,7 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
*/
int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
{
- int i, ret, freechunks;
+ int i, ret;
struct zbud_header *zhdr;
unsigned long first_handle = 0, last_handle = 0;

@@ -411,11 +427,23 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
return -EINVAL;
}
for (i = 0; i < retries; i++) {
+ if (list_empty(&pool->lru)) {
+ /*
+ * LRU was emptied during evict calls in previous
+ * iteration but put_zbud_page() returned 0 meaning
+ * that someone still holds the page. This may
+ * happen when some other mm mechanism increased
+ * the page count.
+ * In such case we succeeded with reclaim.
+ */
+ return 0;
+ }
zhdr = list_tail_entry(&pool->lru, struct zbud_header, lru);
+ /* Move this last element to beginning of LRU */
list_del(&zhdr->lru);
- list_del(&zhdr->buddy);
+ list_add(&zhdr->lru, &pool->lru);
/* Protect zbud page against free */
- zhdr->under_reclaim = true;
+ get_zbud_page(zhdr);
/*
* We need encode the handles before unlocking, since we can
* race with free that will set (first|last)_chunks to 0
@@ -441,28 +469,10 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
}
next:
spin_lock(&pool->lock);
- zhdr->under_reclaim = false;
- if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
- /*
- * Both buddies are now free, free the zbud page and
- * return success.
- */
- free_zbud_page(zhdr);
- pool->pages_nr--;
+ if (put_zbud_page(pool, zhdr)) {
spin_unlock(&pool->lock);
return 0;
- } else if (zhdr->first_chunks == 0 ||
- zhdr->last_chunks == 0) {
- /* add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
- list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
- } else {
- /* add to buddied list */
- list_add(&zhdr->buddy, &pool->buddied);
}
-
- /* add to beginning of LRU */
- list_add(&zhdr->lru, &pool->lru);
}
spin_unlock(&pool->lock);
return -EAGAIN;
--
1.7.9.5
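
The reference counting rule described in the patch above can be sketched in user space. This is a simplified model under stated assumptions: a plain int stands in for the struct page reference count, no locking is shown (the patch requires pool->lock), and all fake_* names are hypothetical. It illustrates how an extra reference taken by reclaim keeps the page alive when the last handle is freed during eviction.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_zbud_page {
	int refcount;	/* models the struct page reference count */
	bool freed;	/* set once the last reference is dropped */
};

struct fake_pool {
	int pages_nr;
};

/* A freshly allocated page starts with one reference, owned by its first handle. */
static void fake_init_page(struct fake_pool *pool, struct fake_zbud_page *page)
{
	page->refcount = 1;
	page->freed = false;
	pool->pages_nr++;
}

/* Models get_zbud_page(): taken for a second handle or by reclaim. */
static void fake_get_page(struct fake_zbud_page *page)
{
	assert(!page->freed);
	page->refcount++;
}

/* Models put_zbud_page(): returns 1 if this call freed the page, else 0. */
static int fake_put_page(struct fake_pool *pool, struct fake_zbud_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0) {
		page->freed = true;
		pool->pages_nr--;
		return 1;
	}
	return 0;
}
```

With one handle outstanding and reclaim holding a second reference, the put from the free path returns 0 and the put at the end of the reclaim iteration returns 1, which is the case where zbud_reclaim_page() reports success.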

2013-08-12 02:25:39

by Minchan Kim

Subject: Re: [RFC PATCH v2 0/4] mm: reclaim zbud pages on migration and compaction

Hello,

On Fri, Aug 09, 2013 at 12:22:16PM +0200, Krzysztof Kozlowski wrote:
> Hi,
>
> Currently zbud pages are not movable and they cannot be allocated from CMA
> region. These patches try to address the problem by:

zcache, zram and GUP pages are in the same situation with respect to
memory-hotplug and/or CMA.

> 1. Adding a new form of reclaim of zbud pages.
> 2. Reclaiming zbud pages during migration and compaction.
> 3. Allocating zbud pages with __GFP_RECLAIMABLE flag.

So I'd like to solve it with a general approach.

Each subsystem or GUP caller that wants to pin pages for a long time should
create its own migration handler and register the page with a pin-page
control subsystem, like this.

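driver/foo.c

int foo_migrate(struct page *page, void *private);

/* Named foo_owner so it does not clash with the foo_migrate() handler. */
static struct pin_page_owner foo_owner = {
	.migrate = foo_migrate,
};

int foo_allocate(void)
{
	struct page *newpage = alloc_page(GFP_KERNEL);

	if (!newpage)
		return -ENOMEM;
	/* Argument order follows set_pinned_page() in the patch below. */
	return set_pinned_page(&foo_owner, newpage, NULL);
}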

Then in compaction.c, or wherever the VM wants to move/reclaim the page,
the general VM code can ask the owner when it finds a pinned page.

mm/compaction.c

if (PinnedPage(page)) {
	struct pin_page_info *info = get_pin_page_info(page);

	if (info)
		info->owner->migrate(page, info->private);
}

The only hurdle is that we would have to introduce a new page flag, but
I believe that if we all agree on this approach, we can find a solution in the end.

What do you think?

From 9a4f652006b7d0c750933d738e1bd6f53754bcf6 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Sun, 11 Aug 2013 00:31:57 +0900
Subject: [RFC] pin page control subsystem


Signed-off-by: Minchan Kim <[email protected]>
---
mm/Makefile | 2 +-
mm/pin-page.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 102 insertions(+), 1 deletion(-)
create mode 100644 mm/pin-page.c

diff --git a/mm/Makefile b/mm/Makefile
index f008033..245c2f7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o pgtable-generic.o
+ vmalloc.o pagewalk.o pgtable-generic.o pin-page.o

ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff --git a/mm/pin-page.c b/mm/pin-page.c
new file mode 100644
index 0000000..74b07f8
--- /dev/null
+++ b/mm/pin-page.c
@@ -0,0 +1,101 @@
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/hashtable.h>
+
+#define PPAGE_HASH_BITS 10
+
+static DEFINE_SPINLOCK(hash_lock);
+/*
+ * TODO: reconsider which data structure to use.
+ * A radix tree would be better if we pin many contiguous pages,
+ * but it would not be a good fit if the pinned pages are scattered.
+ */
+static DEFINE_HASHTABLE(pin_page_hash, PPAGE_HASH_BITS);
+
+/*
+ * Each subsystem should provide its own page migration handler
+ */
+struct pin_page_owner {
+ int (*migrate)(struct page *page, void *private);
+};
+
+struct pin_page_info {
+ struct pin_page_owner *owner;
+ struct hlist_node hlist;
+
+ unsigned long pfn;
+ void *private;
+};
+
+/* TODO : Introduce new page flags */
+void SetPinnedPage(struct page *page)
+{
+
+}
+
+int PinnedPage(struct page *page)
+{
+ return 0;
+}
+
+/*
+ * GUP caller or subsystems which pin the page should call this function
+ * to register @page in pin-page control subsystem so that VM can ask us
+ * when it wants to migrate @page.
+ *
+ * Each pinned page would have some private key to identify itself
+ * like custom-allocator-returned handle.
+ */
+int set_pinned_page(struct pin_page_owner *owner,
+ struct page *page, void *private)
+{
+ struct pin_page_info *pinfo = kmalloc(sizeof(*pinfo), GFP_KERNEL);
+
+ INIT_HLIST_NODE(&pinfo->hlist);
+ pinfo->owner = owner;
+
+ pinfo->pfn = page_to_pfn(page);
+ pinfo->private = private;
+
+ spin_lock(&hash_lock);
+ hash_add(pin_page_hash, &pinfo->hlist, pinfo->pfn);
+ spin_unlock(&hash_lock);
+
+ SetPinnedPage(page);
+ return 0;
+}
+
+struct pin_page_info *get_pin_page_info(struct page *page)
+{
+ struct pin_page_info *tmp;
+ unsigned long pfn = page_to_pfn(page);
+
+ spin_lock(&hash_lock);
+ hash_for_each_possible(pin_page_hash, tmp, hlist, pfn) {
+ if (tmp->pfn == pfn) {
+ spin_unlock(&hash_lock);
+ return tmp;
+ }
+ }
+ spin_unlock(&hash_lock);
+ return NULL;
+}
+
+/* Used in compaction.c */
+int migrate_pinned_page(struct page *page)
+{
+ int ret = 1;
+ struct pin_page_info *pinfo = NULL;
+
+ if (PinnedPage(page)) {
+ while ((pinfo = get_pin_page_info(page))) {
+ /* If one of owners failed, bail out */
+ if (pinfo->owner->migrate(page, pinfo->private))
+ break;
+ }
+
+ ret = 0;
+ }
+ return ret;
+}
--
1.7.9.5

--
Kind regards,
Minchan Kim
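
The registry idea in the RFC above can be tried out in user space. What follows is only a sketch under several assumptions, not the proposed kernel code: a fixed-size chained hash stands in for DEFINE_HASHTABLE(), malloc() stands in for kmalloc(), a pfn is passed instead of a struct page, there is no locking, and every pin_* and demo_* name is hypothetical. Unlike the prototype, it checks the allocation result and removes an entry after a successful migrate call (an assumption about the intended behaviour), so repeated lookups of the same pfn eventually terminate.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define PIN_HASH_BITS 4
#define PIN_HASH_SIZE (1u << PIN_HASH_BITS)

struct pin_owner {
	int (*migrate)(unsigned long pfn, void *private);
};

struct pin_info {
	const struct pin_owner *owner;
	struct pin_info *next;	/* bucket chain */
	unsigned long pfn;
	void *private;
};

static struct pin_info *pin_hash[PIN_HASH_SIZE];

static unsigned int pin_bucket(unsigned long pfn)
{
	return (unsigned int)(pfn & (PIN_HASH_SIZE - 1));
}

/* Register @pfn with @owner. Returns 0 on success or -ENOMEM. */
static int pin_register(const struct pin_owner *owner, unsigned long pfn,
			void *private)
{
	/* sizeof(*info) is the size of the struct, not of the pointer. */
	struct pin_info *info = malloc(sizeof(*info));
	unsigned int b = pin_bucket(pfn);

	if (!info)
		return -ENOMEM;
	info->owner = owner;
	info->pfn = pfn;
	info->private = private;
	info->next = pin_hash[b];
	pin_hash[b] = info;
	return 0;
}

static struct pin_info *pin_lookup(unsigned long pfn)
{
	struct pin_info *info;

	for (info = pin_hash[pin_bucket(pfn)]; info; info = info->next)
		if (info->pfn == pfn)
			return info;
	return NULL;
}

/*
 * Ask the owner to migrate @pfn. Returns 1 if @pfn is not registered,
 * otherwise the owner's result (0 on success). On success the entry is
 * unlinked and freed.
 */
static int pin_migrate(unsigned long pfn)
{
	struct pin_info **link = &pin_hash[pin_bucket(pfn)];
	struct pin_info *info;
	int ret;

	while ((info = *link) && info->pfn != pfn)
		link = &info->next;
	if (!info)
		return 1;
	ret = info->owner->migrate(pfn, info->private);
	if (ret == 0) {
		*link = info->next;
		free(info);
	}
	return ret;
}

/* Hypothetical owner callback: counts calls through @private, always succeeds. */
static int demo_migrate(unsigned long pfn, void *private)
{
	(void)pfn;
	(*(int *)private)++;
	return 0;
}
```

A caller would fill a struct pin_owner with demo_migrate (or its own handler), register a few pfns, and call pin_migrate() from the code path that needs the page moved.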

2013-08-12 03:16:52

by Benjamin LaHaise

Subject: Re: [RFC PATCH v2 0/4] mm: reclaim zbud pages on migration and compaction

Hello Minchan,

On Mon, Aug 12, 2013 at 11:25:35AM +0900, Minchan Kim wrote:
> Hello,
>
> On Fri, Aug 09, 2013 at 12:22:16PM +0200, Krzysztof Kozlowski wrote:
> > Hi,
> >
> > Currently zbud pages are not movable and they cannot be allocated from CMA
> > region. These patches try to address the problem by:
>
> The zcache, zram and GUP pages for memory-hotplug and/or CMA are
> same situation.
>
> > 1. Adding a new form of reclaim of zbud pages.
> > 2. Reclaiming zbud pages during migration and compaction.
> > 3. Allocating zbud pages with __GFP_RECLAIMABLE flag.
>
> So I'd like to solve it with general approach.
>
> Each subsystem or GUP caller who want to pin pages long time should
> create own migration handler and register the page into pin-page
> control subsystem like this.
>
> driver/foo.c
>
> int foo_migrate(struct page *page, void *private);
>
> static struct pin_page_owner foo_migrate = {
> .migrate = foo_migrate;
> };
>
> int foo_allocate()
> {
> struct page *newpage = alloc_pages();
> set_pinned_page(newpage, &foo_migrate);
> }
>
> And in compaction.c or somewhere where want to move/reclaim the page,
> general VM can ask to owner if it founds it's pinned page.
>
> mm/compaction.c
>
> if (PagePinned(page)) {
> struct pin_page_info *info = get_page_pin_info(page);
> info->migrate(page);
>
> }
>
> Only hurdle for that is that we should introduce a new page flag and
> I believe if we all agree this approch, we can find a solution at last.
>
> What do you think?

I don't like this approach. There will be too many collisions in the
hash that's been implemented (read: I don't think you can get away with
a naive implementation for core infrastructure that has to suit all
users), you've got a global spin lock, and it doesn't take into account
NUMA issues. The address space migratepage method doesn't have those
issues (at least where it is usable as in aio's use-case).

If you're going to go down this path, you'll have to decide if *all* users
of pinned pages are going to have to subscribe to supporting the un-pinning
of pages, and that means taking a real hard look at how O_DIRECT pins pages.
Once you start thinking about that, you'll find that addressing the
performance concerns is going to be an essential part of any design work to
be done in this area.

-ben
--
"Thought is the essence of where you are now."

2013-08-12 03:49:53

by Minchan Kim

Subject: Re: [RFC PATCH v2 0/4] mm: reclaim zbud pages on migration and compaction

Hello Benjamin,

On Sun, Aug 11, 2013 at 11:16:47PM -0400, Benjamin LaHaise wrote:
> Hello Minchan,
>
> On Mon, Aug 12, 2013 at 11:25:35AM +0900, Minchan Kim wrote:
> > Hello,
> >
> > On Fri, Aug 09, 2013 at 12:22:16PM +0200, Krzysztof Kozlowski wrote:
> > > Hi,
> > >
> > > Currently zbud pages are not movable and they cannot be allocated from CMA
> > > region. These patches try to address the problem by:
> >
> > The zcache, zram and GUP pages for memory-hotplug and/or CMA are
> > same situation.
> >
> > > 1. Adding a new form of reclaim of zbud pages.
> > > 2. Reclaiming zbud pages during migration and compaction.
> > > 3. Allocating zbud pages with __GFP_RECLAIMABLE flag.
> >
> > So I'd like to solve it with general approach.
> >
> > Each subsystem or GUP caller who want to pin pages long time should
> > create own migration handler and register the page into pin-page
> > control subsystem like this.
> >
> > driver/foo.c
> >
> > int foo_migrate(struct page *page, void *private);
> >
> > static struct pin_page_owner foo_migrate = {
> > .migrate = foo_migrate;
> > };
> >
> > int foo_allocate()
> > {
> > struct page *newpage = alloc_pages();
> > set_pinned_page(newpage, &foo_migrate);
> > }
> >
> > And in compaction.c or somewhere where want to move/reclaim the page,
> > general VM can ask to owner if it founds it's pinned page.
> >
> > mm/compaction.c
> >
> > if (PagePinned(page)) {
> > struct pin_page_info *info = get_page_pin_info(page);
> > info->migrate(page);
> >
> > }
> >
> > Only hurdle for that is that we should introduce a new page flag and
> > I believe if we all agree this approch, we can find a solution at last.
> >
> > What do you think?
>
> I don't like this approach. There will be too many collisions in the
> hash that's been implemented (read: I don't think you can get away with

Yep. That's why I'd like to replace it with a radix tree keyed by pfn, as I
mentioned in the comment (I only used a hash for fast prototyping, without
much consideration).

> a naive implementation for core infrastructure that has to suite all
> users), you've got a global spin lock, and it doesn't take into account

I think batched draining of pinned pages would be sufficient to avoid the
global spinlock problem, because we already use that technique in the page
allocator, which is one of the most critical hot paths.

> NUMA issues. The address space migratepage method doesn't have those

NUMA issues? Could you elaborate on that a bit?

> issues (at least where it is usable as in aio's use-case).
>
> If you're going to go down this path, you'll have to decide if *all* users
> of pinned pages are going to have to subscribe to supporting the un-pinning
> of pages, and that means taking a real hard look at how O_DIRECT pins pages.
> Once you start thinking about that, you'll find that addressing the
> performance concerns is going to be an essential part of any design work to
> be done in this area.

True. The patch I included just shows the concept, so I didn't consider any
performance-critical parts. But if we all agree that this approach makes sense
and that it can be implemented with little overhead, I will move on to the next
phase and work on performance.

Thanks for the input, Ben!

>
> -ben
> --
> "Thought is the essence of where you are now."
>

--
Kind regards,
Minchan Kim
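
The batched-drain idea mentioned in the reply above can be illustrated with a small user-space model. It is only a sketch under assumptions: the staging buffer stands in for a per-CPU structure, a counter stands in for taking the global spinlock, the shared table is a flat array, and all pin_* names and sizes are hypothetical. The point it shows is that N registrations cost roughly N / PIN_BATCH_SIZE lock acquisitions instead of N.

```c
#include <assert.h>
#include <stddef.h>

#define PIN_BATCH_SIZE 8
#define PIN_TABLE_CAP  64

/* Shared table; every access is meant to happen under the global lock. */
struct pin_table {
	unsigned long pfns[PIN_TABLE_CAP];
	size_t count;
	unsigned long lock_acquisitions;	/* stands in for spin_lock() calls */
};

/* Per-CPU (here: per-caller) staging buffer, filled without locking. */
struct pin_batch {
	unsigned long pfns[PIN_BATCH_SIZE];
	size_t count;
};

/* Move all staged entries into the table under one lock acquisition. */
static void pin_batch_drain(struct pin_table *table, struct pin_batch *batch)
{
	size_t i;

	if (batch->count == 0)
		return;
	assert(table->count + batch->count <= PIN_TABLE_CAP);
	table->lock_acquisitions++;	/* where a kernel would take the lock */
	for (i = 0; i < batch->count; i++)
		table->pfns[table->count++] = batch->pfns[i];
	batch->count = 0;		/* where a kernel would drop the lock */
}

/* Stage one pfn; drain automatically once the batch is full. */
static void pin_batch_add(struct pin_table *table, struct pin_batch *batch,
			  unsigned long pfn)
{
	batch->pfns[batch->count++] = pfn;
	if (batch->count == PIN_BATCH_SIZE)
		pin_batch_drain(table, batch);
}
```

One consequence visible in the model is that staged entries are not in the shared table until a drain happens, so any code that must see every pinned page would first have to drain all batches.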

2013-08-12 16:48:18

by Dave Hansen

Subject: Re: [RFC PATCH v2 0/4] mm: reclaim zbud pages on migration and compaction

On 08/11/2013 07:25 PM, Minchan Kim wrote:
> +int set_pinned_page(struct pin_page_owner *owner,
> + struct page *page, void *private)
> +{
> + struct pin_page_info *pinfo = kmalloc(sizeof(pinfo), GFP_KERNEL);
> +
> + INIT_HLIST_NODE(&pinfo->hlist);
> + pinfo->owner = owner;
> +
> + pinfo->pfn = page_to_pfn(page);
> + pinfo->private = private;
> +
> + spin_lock(&hash_lock);
> + hash_add(pin_page_hash, &pinfo->hlist, pinfo->pfn);
> + spin_unlock(&hash_lock);
> +
> + SetPinnedPage(page);
> + return 0;
> +};

I definitely agree that we're getting to the point where we need to look
at this more generically. We've got at least four use-cases that have a
need for deterministically relocating memory:

1. CMA (many sub use cases)
2. Memory hot-remove
3. Memory power management
4. Runtime hugetlb-GB page allocations

Whatever we do, it _should_ be good enough to largely let us replace
PG_slab with this new bit.