Hi,
Currently zbud pages are not movable and cannot be allocated from the CMA
(Contiguous Memory Allocator) region. These patches add migration of zbud pages.
The zbud migration code uses the page's mapping field, so a number of
exceptions had to be added to the migration code. These could be replaced,
for example, with the pin page control subsystem:
http://article.gmane.org/gmane.linux.kernel.mm/105308
In such case the zbud migration code (zbud_migrate_page()) can be safely
re-used.
Patch "[PATCH 3/5] mm: use mapcount for identifying zbud pages" introduces
a PageZbud() function which identifies zbud pages by page->_mapcount.
Dave Hansen proposed aliasing PG_zbud=PG_slab, but in that case the patch
would be more intrusive.
Any ideas for a better solution are welcome.
Patch "[PATCH 4/5] mm: use indirect zbud handle and radix tree" changes the zbud
handle to support migration. The handle is now an index into a radix tree, and
zbud_map() maps it to the proper virtual address. This exposes race conditions,
some of which are already discussed here:
The races are fixed by adding an internal map count for each zbud handle.
The map count is increased on each zbud_map() call.
Some races between writeback and invalidate still exist. In that case a message
like the following can be seen in the logs:
zbud: error: could not lookup handle 13810 in tree
The patches from the discussion above may resolve this.
I considered using "pgoff_t offset" as the handle, but that prevented storing
duplicate pages in zswap.
This patch set is based on v3.11.
Changes since v1:
-----------------
1. Rebased against v3.11.
2. Updated documentation of zbud_reclaim_page() to match usage of zbud page
reference counters.
3. Split the trivial change of the freechunks variable's scope out of patch 2/4
into a separate patch (3/5) (suggested by Seth Jennings).
Changes and relation to patches "reclaiming zbud pages on migration and
compaction":
-----------------
This is a continuation of my previous work: reclaiming zbud pages on migration
and compaction. However, the current solution is completely different, so I am
not attaching the previous changelog.
Previous patches can be found here:
* [RFC PATCH v2 0/4] mm: reclaim zbud pages on migration and compaction
http://article.gmane.org/gmane.linux.kernel.mm/105153
* [RFC PATCH 0/4] mm: reclaim zbud pages on migration and compaction
http://article.gmane.org/gmane.linux.kernel.mm/104801
One patch from the previous work is re-used, with minor changes:
"[PATCH 1/4] zbud: use page ref counter for zbud pages"
* Add missing spin_unlock in zbud_reclaim_page().
* Decrease pool->pages_nr in zbud_free(), not when putting the page. This also
removes the need to hold the lock while calling put_zbud_page().
Best regards,
Krzysztof Kozlowski
Krzysztof Kozlowski (5):
zbud: use page ref counter for zbud pages
zbud: make freechunks a block local variable
mm: use mapcount for identifying zbud pages
mm: use indirect zbud handle and radix tree
mm: migrate zbud pages
include/linux/mm.h | 23 ++
include/linux/zbud.h | 3 +-
mm/compaction.c | 7 +
mm/migrate.c | 17 +-
mm/zbud.c | 573 ++++++++++++++++++++++++++++++++++++++++----------
mm/zswap.c | 28 ++-
6 files changed, 537 insertions(+), 114 deletions(-)
--
1.7.9.5
Move the freechunks variable in zbud_free() and zbud_alloc() from function
scope to block-level scope.
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
mm/zbud.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/zbud.c b/mm/zbud.c
index 3f4be72..1d5b26b 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -267,7 +267,7 @@ void zbud_destroy_pool(struct zbud_pool *pool)
int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
unsigned long *handle)
{
- int chunks, i, freechunks;
+ int chunks, i;
struct zbud_header *zhdr = NULL;
enum buddy bud;
struct page *page;
@@ -317,7 +317,7 @@ found:
if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
/* Add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
+ int freechunks = num_free_chunks(zhdr);
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
} else {
/* Add to buddied list */
@@ -349,7 +349,6 @@ found:
void zbud_free(struct zbud_pool *pool, unsigned long handle)
{
struct zbud_header *zhdr;
- int freechunks;
spin_lock(&pool->lock);
zhdr = handle_to_zbud_header(handle);
@@ -368,7 +367,7 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
pool->pages_nr--;
} else {
/* Add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
+ int freechunks = num_free_chunks(zhdr);
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
}
--
1.7.9.5
Add a radix tree to the zbud pool and use an indirect zbud handle as the radix
tree index.
This allows migration of zbud pages while the handle used by zswap
remains untouched. Previously zbud handles were virtual addresses, which
posed a problem when a page was migrated.
This change also exposes and fixes race condition between:
- zbud_reclaim_page() (called from zswap_frontswap_store())
and
- zbud_free() (called from zswap_frontswap_invalidate_page()).
This race was already present, but the additional locking and indirect handle
use make it frequent under high memory pressure.
The race typically looks like this:
- thread 1: zbud_reclaim_page()
- thread 1: zswap_writeback_entry()
- zbud_map()
- thread 0: zswap_frontswap_invalidate_page()
- zbud_free()
- thread 1: read zswap_entry from memory or call zbud_unmap(), now on an
invalid memory address
zbud_reclaim_page() calls the evict handler (zswap_writeback_entry())
without holding the pool lock. zswap_writeback_entry() reads the
zswap_header from memory obtained from zbud_map() without holding the
tree's lock. If an invalidate happens during this time, zbud_free()
will remove the handle from the tree.
The new map_count fields in zbud_header address this problem by
protecting handles from being freed.
The call to zbud_unmap() in zswap_writeback_entry() was also moved later,
to a point where the tree's lock can be obtained.
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/zbud.h | 2 +-
mm/zbud.c | 313 +++++++++++++++++++++++++++++++++++++++++---------
mm/zswap.c | 24 +++-
3 files changed, 280 insertions(+), 59 deletions(-)
diff --git a/include/linux/zbud.h b/include/linux/zbud.h
index 2571a5c..12d72df 100644
--- a/include/linux/zbud.h
+++ b/include/linux/zbud.h
@@ -16,7 +16,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
void zbud_free(struct zbud_pool *pool, unsigned long handle);
int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries);
void *zbud_map(struct zbud_pool *pool, unsigned long handle);
-void zbud_unmap(struct zbud_pool *pool, unsigned long handle);
+int zbud_unmap(struct zbud_pool *pool, unsigned long handle);
u64 zbud_get_pool_size(struct zbud_pool *pool);
#endif /* _ZBUD_H_ */
diff --git a/mm/zbud.c b/mm/zbud.c
index 9be160c..795d56a 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -50,6 +50,7 @@
#include <linux/preempt.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/radix-tree.h>
#include <linux/zbud.h>
/*****************
@@ -69,6 +70,9 @@
#define NCHUNKS (PAGE_SIZE >> CHUNK_SHIFT)
#define ZHDR_SIZE_ALIGNED CHUNK_SIZE
+/* Empty handle, not yet allocated */
+#define HANDLE_EMPTY 0
+
/**
* struct zbud_pool - stores metadata for each zbud pool
* @lock: protects all pool fields and first|last_chunk fields of any
@@ -83,6 +87,10 @@
* @pages_nr: number of zbud pages in the pool.
* @ops: pointer to a structure of user defined operations specified at
* pool creation time.
+ * @page_tree: mapping handle->zbud_header for zbud_map and migration;
+ * many pools may exist so do not use the mapping->page_tree
+ * @last_handle: last handle calculated; used as starting point when searching
+ * for next handle in page_tree in zbud_alloc().
*
* This structure is allocated at pool creation time and maintains metadata
* pertaining to a particular zbud pool.
@@ -94,6 +102,8 @@ struct zbud_pool {
struct list_head lru;
u64 pages_nr;
struct zbud_ops *ops;
+ struct radix_tree_root page_tree;
+ unsigned long last_handle;
};
/*
@@ -103,12 +113,23 @@ struct zbud_pool {
* @lru: links the zbud page into the lru list in the pool
* @first_chunks: the size of the first buddy in chunks, 0 if free
* @last_chunks: the size of the last buddy in chunks, 0 if free
+ * @first_handle: handle to page stored in first buddy
+ * @last_handle: handle to page stored in last buddy
+ * @first_map_count: mapped count of page stored in first buddy
+ * @last_map_count: mapped count of page stored in last buddy
+ *
+ * When map count reaches zero the corresponding handle is removed
+ * from radix tree and cannot be used any longer.
*/
struct zbud_header {
struct list_head buddy;
struct list_head lru;
+ unsigned long first_handle;
+ unsigned long last_handle;
unsigned int first_chunks;
unsigned int last_chunks;
+ short int first_map_count;
+ short int last_map_count;
};
/*****************
@@ -135,38 +156,34 @@ static struct zbud_header *init_zbud_page(struct page *page)
struct zbud_header *zhdr = page_address(page);
zhdr->first_chunks = 0;
zhdr->last_chunks = 0;
+ zhdr->first_handle = HANDLE_EMPTY;
+ zhdr->last_handle = HANDLE_EMPTY;
+ zhdr->first_map_count = 0;
+ zhdr->last_map_count = 0;
INIT_LIST_HEAD(&zhdr->buddy);
INIT_LIST_HEAD(&zhdr->lru);
return zhdr;
}
/*
- * Encodes the handle of a particular buddy within a zbud page
+ * Encodes the address of a particular buddy within a zbud page
* Pool lock should be held as this function accesses first|last_chunks
*/
-static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud)
+static unsigned long calc_addr(struct zbud_header *zhdr, unsigned long handle)
{
- unsigned long handle;
+ unsigned long addr;
/*
- * For now, the encoded handle is actually just the pointer to the data
- * but this might not always be the case. A little information hiding.
* Add CHUNK_SIZE to the handle if it is the first allocation to jump
* over the zbud header in the first chunk.
*/
- handle = (unsigned long)zhdr;
- if (bud == FIRST)
+ addr = (unsigned long)zhdr;
+ if (handle == zhdr->first_handle)
/* skip over zbud header */
- handle += ZHDR_SIZE_ALIGNED;
- else /* bud == LAST */
- handle += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT);
- return handle;
-}
-
-/* Returns the zbud page where a given handle is stored */
-static struct zbud_header *handle_to_zbud_header(unsigned long handle)
-{
- return (struct zbud_header *)(handle & PAGE_MASK);
+ addr += ZHDR_SIZE_ALIGNED;
+ else /* handle == zhdr->last_handle */
+ addr += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT);
+ return addr;
}
/* Returns the number of free chunks in a zbud page */
@@ -207,6 +224,109 @@ static int put_zbud_page(struct zbud_header *zhdr)
return 0;
}
+/*
+ * Increases map count for given handle.
+ *
+ * The map count is used to prevent any races between zbud_reclaim()
+ * and zbud_free().
+ *
+ * Must be called under pool->lock.
+ */
+static void get_map_count(struct zbud_header *zhdr, unsigned long handle)
+{
+ VM_BUG_ON(handle == HANDLE_EMPTY);
+ if (zhdr->first_handle == handle)
+ zhdr->first_map_count++;
+ else
+ zhdr->last_map_count++;
+}
+
+/*
+ * Decreases map count for given handle.
+ *
+ * Must be called under pool->lock.
+ *
+ * Returns 1 if no more map counts for handle exist and 0 otherwise.
+ */
+static int put_map_count(struct zbud_header *zhdr, unsigned long handle)
+{
+ VM_BUG_ON(handle == HANDLE_EMPTY);
+ if (zhdr->first_handle == handle) {
+ VM_BUG_ON(!zhdr->first_map_count);
+ return ((--zhdr->first_map_count) == 0);
+ } else {
+ VM_BUG_ON(!zhdr->last_map_count);
+ return ((--zhdr->last_map_count) == 0);
+ }
+}
+
+/*
+ * Searches for zbud header in radix tree.
+ * Returns NULL if handle could not be found.
+ *
+ * Handle could not be found in case of race between:
+ * - zswap_writeback_entry() (called from zswap_frontswap_store())
+ * and
+ * - zbud_free() (called from zswap_frontswap_invalidate())
+ *
+ */
+static struct zbud_header *handle_to_zbud_header(struct zbud_pool *pool,
+ unsigned long handle)
+{
+ struct zbud_header *zhdr;
+
+ rcu_read_lock();
+ zhdr = radix_tree_lookup(&pool->page_tree, handle);
+ rcu_read_unlock();
+ if (unlikely(!zhdr)) {
+ /* race: zswap_writeback_entry() and zswap_invalidate() */
+ pr_err("error: could not lookup handle %lu in tree\n", handle);
+ }
+ return zhdr;
+}
+
+/*
+ * Scans radix tree for next free handle and returns it.
+ * Returns HANDLE_EMPTY (0) if no free handle could be found.
+ *
+ * Must be called under pool->lock to be sure that there
+ * won't be other users of found handle.
+ */
+static unsigned long search_next_handle(struct zbud_pool *pool)
+{
+ unsigned long handle = pool->last_handle;
+ unsigned int retries = 0;
+ do {
+ /* 0 will be returned in case of search failure as we'll hit
+ * the max index.
+ */
+ handle = radix_tree_next_hole(&pool->page_tree,
+ handle + 1, ULONG_MAX);
+ } while ((handle == HANDLE_EMPTY) && (++retries < 2));
+ WARN((retries == 2), "%s: reached max number of retries (%u) when " \
+ "searching for next handle (last handle: %lu)\n",
+ __func__, retries, pool->last_handle);
+ return handle;
+}
+
+/*
+ * Searches for next free handle in page_tree and inserts zbud_header
+ * under it. Stores the handle under given pointer and updates
+ * pool->last_handle.
+ *
+ * Must be called under pool->lock.
+ *
+ * Returns 0 on success or negative otherwise.
+ */
+static int tree_insert_zbud_header(struct zbud_pool *pool,
+ struct zbud_header *zhdr, unsigned long *handle)
+{
+ *handle = search_next_handle(pool);
+ if (unlikely(*handle == HANDLE_EMPTY))
+ return -ENOSPC;
+ pool->last_handle = *handle;
+ return radix_tree_insert(&pool->page_tree, *handle, zhdr);
+}
/*****************
* API Functions
@@ -232,8 +352,10 @@ struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops)
INIT_LIST_HEAD(&pool->unbuddied[i]);
INIT_LIST_HEAD(&pool->buddied);
INIT_LIST_HEAD(&pool->lru);
+ INIT_RADIX_TREE(&pool->page_tree, GFP_ATOMIC);
pool->pages_nr = 0;
pool->ops = ops;
+ pool->last_handle = HANDLE_EMPTY;
return pool;
}
@@ -270,10 +392,11 @@ void zbud_destroy_pool(struct zbud_pool *pool)
int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
unsigned long *handle)
{
- int chunks, i;
+ int chunks, i, err;
struct zbud_header *zhdr = NULL;
enum buddy bud;
struct page *page;
+ unsigned long next_handle;
if (size <= 0 || gfp & __GFP_HIGHMEM)
return -EINVAL;
@@ -288,6 +411,11 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
if (!list_empty(&pool->unbuddied[i])) {
zhdr = list_first_entry(&pool->unbuddied[i],
struct zbud_header, buddy);
+ err = tree_insert_zbud_header(pool, zhdr, &next_handle);
+ if (unlikely(err)) {
+ spin_unlock(&pool->lock);
+ return err;
+ }
list_del(&zhdr->buddy);
if (zhdr->first_chunks == 0)
bud = FIRST;
@@ -313,11 +441,22 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
SetPageZbud(page);
bud = FIRST;
+ err = tree_insert_zbud_header(pool, zhdr, &next_handle);
+ if (unlikely(err)) {
+ put_zbud_page(zhdr);
+ spin_unlock(&pool->lock);
+ return err;
+ }
+
found:
- if (bud == FIRST)
+ if (bud == FIRST) {
zhdr->first_chunks = chunks;
- else
+ zhdr->first_handle = next_handle;
+ } else {
zhdr->last_chunks = chunks;
+ zhdr->last_handle = next_handle;
+ }
+ get_map_count(zhdr, next_handle);
if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
/* Add to unbuddied list */
@@ -333,12 +472,45 @@ found:
list_del(&zhdr->lru);
list_add(&zhdr->lru, &pool->lru);
- *handle = encode_handle(zhdr, bud);
+ *handle = next_handle;
spin_unlock(&pool->lock);
return 0;
}
+/*
+ * Real code for freeing handle.
+ * Removes handle from radix tree, empties chunks and handle in zbud header,
+ * removes buddy from lists and finally puts page.
+ */
+static void zbud_header_free(struct zbud_pool *pool, struct zbud_header *zhdr,
+ unsigned long handle)
+{
+ struct zbud_header *old = radix_tree_delete(&pool->page_tree, handle);
+ VM_BUG_ON(old != zhdr);
+
+ if (zhdr->first_handle == handle) {
+ zhdr->first_chunks = 0;
+ zhdr->first_handle = HANDLE_EMPTY;
+ } else {
+ zhdr->last_chunks = 0;
+ zhdr->last_handle = HANDLE_EMPTY;
+ }
+
+ /* Remove from existing buddy list */
+ list_del(&zhdr->buddy);
+
+ if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
+ list_del(&zhdr->lru);
+ pool->pages_nr--;
+ } else {
+ /* Add to unbuddied list */
+ int freechunks = num_free_chunks(zhdr);
+ list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
+ }
+ put_zbud_page(zhdr);
+}
+
/**
* zbud_free() - frees the allocation associated with the given handle
* @pool: pool in which the allocation resided
@@ -355,27 +527,18 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
struct zbud_header *zhdr;
spin_lock(&pool->lock);
- zhdr = handle_to_zbud_header(handle);
-
- /* If first buddy, handle will be page aligned */
- if ((handle - ZHDR_SIZE_ALIGNED) & ~PAGE_MASK)
- zhdr->last_chunks = 0;
- else
- zhdr->first_chunks = 0;
-
- /* Remove from existing buddy list */
- list_del(&zhdr->buddy);
+ zhdr = handle_to_zbud_header(pool, handle);
+ VM_BUG_ON(!zhdr);
- if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
- list_del(&zhdr->lru);
- pool->pages_nr--;
+ if (!put_map_count(zhdr, handle)) {
+ /*
+ * Still mapped, so just put page count and
+ * zbud_unmap() will free later.
+ */
+ put_zbud_page(zhdr);
} else {
- /* Add to unbuddied list */
- int freechunks = num_free_chunks(zhdr);
- list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
+ zbud_header_free(pool, zhdr, handle);
}
-
- put_zbud_page(zhdr);
spin_unlock(&pool->lock);
}
@@ -451,15 +614,11 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
/* Protect zbud page against free */
get_zbud_page(zhdr);
/*
- * We need encode the handles before unlocking, since we can
+ * Grab handles before unlocking, since we can
* race with free that will set (first|last)_chunks to 0
*/
- first_handle = 0;
- last_handle = 0;
- if (zhdr->first_chunks)
- first_handle = encode_handle(zhdr, FIRST);
- if (zhdr->last_chunks)
- last_handle = encode_handle(zhdr, LAST);
+ first_handle = zhdr->first_handle;
+ last_handle = zhdr->last_handle;
spin_unlock(&pool->lock);
/* Issue the eviction callback(s) */
@@ -485,27 +644,69 @@ next:
/**
* zbud_map() - maps the allocation associated with the given handle
* @pool: pool in which the allocation resides
- * @handle: handle associated with the allocation to be mapped
+ * @handle: handle to be mapped
*
- * While trivial for zbud, the mapping functions for others allocators
- * implementing this allocation API could have more complex information encoded
- * in the handle and could create temporary mappings to make the data
- * accessible to the user.
+ * Increases the page ref count and map count for handle.
*
- * Returns: a pointer to the mapped allocation
+ * Returns: a pointer to the mapped allocation or NULL if page could
+ * not be found in radix tree for given handle.
*/
void *zbud_map(struct zbud_pool *pool, unsigned long handle)
{
- return (void *)(handle);
+ struct zbud_header *zhdr;
+ void *addr;
+
+ /*
+ * Grab lock to prevent races with zbud_free or migration.
+ */
+ spin_lock(&pool->lock);
+ zhdr = handle_to_zbud_header(pool, handle);
+ if (!zhdr) {
+ spin_unlock(&pool->lock);
+ return NULL;
+ }
+ /*
+ * Get page so zbud_free or migration could detect that it is
+ * mapped by someone.
+ */
+ get_zbud_page(zhdr);
+ get_map_count(zhdr, handle);
+ addr = (void *)calc_addr(zhdr, handle);
+ spin_unlock(&pool->lock);
+
+ return addr;
}
/**
- * zbud_unmap() - maps the allocation associated with the given handle
+ * zbud_unmap() - unmaps the allocation associated with the given handle
* @pool: pool in which the allocation resides
- * @handle: handle associated with the allocation to be unmapped
+ * @handle: handle to be unmapped
+ *
+ * Decreases the page ref count and map count for handle.
+ * If map count reaches 0 then handle is freed (it must be freed because
+ * zbud_free() was called already on it) and -EFAULT is returned.
+ *
+ * Returns: 0 on successful unmap, negative on error indicating that handle
+ * was invalidated already by zbud_free() and cannot be used anymore
*/
-void zbud_unmap(struct zbud_pool *pool, unsigned long handle)
+int zbud_unmap(struct zbud_pool *pool, unsigned long handle)
{
+ struct zbud_header *zhdr;
+
+ zhdr = handle_to_zbud_header(pool, handle);
+ if (unlikely(!zhdr))
+ return -ENOENT;
+ spin_lock(&pool->lock);
+ if (put_map_count(zhdr, handle)) {
+ /* racing zbud_free() could not free the handle because
+ * we were still using it so it is our job to free */
+ zbud_header_free(pool, zhdr, handle);
+ spin_unlock(&pool->lock);
+ return -EFAULT;
+ }
+ put_zbud_page(zhdr);
+ spin_unlock(&pool->lock);
+ return 0;
}
/**
diff --git a/mm/zswap.c b/mm/zswap.c
index deda2b6..706046a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -509,8 +509,15 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
/* extract swpentry from data */
zhdr = zbud_map(pool, handle);
+ if (!zhdr) {
+ /*
+ * Race with zbud_free() (called from invalidate).
+ * Entry was invalidated already.
+ */
+ return 0;
+ }
swpentry = zhdr->swpentry; /* here */
- zbud_unmap(pool, handle);
+ VM_BUG_ON(swp_type(swpentry) >= MAX_SWAPFILES);
tree = zswap_trees[swp_type(swpentry)];
offset = swp_offset(swpentry);
BUG_ON(pool != tree->pool);
@@ -520,10 +527,20 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
entry = zswap_rb_search(&tree->rbroot, offset);
if (!entry) {
/* entry was invalidated */
+ zbud_unmap(pool, handle);
spin_unlock(&tree->lock);
return 0;
}
zswap_entry_get(entry);
+ /*
+ * Unmap zbud after obtaining tree lock and entry ref to prevent
+ * any races with invalidate.
+ */
+ if (zbud_unmap(pool, handle)) {
+ zswap_entry_put(entry);
+ spin_unlock(&tree->lock);
+ return 0;
+ }
spin_unlock(&tree->lock);
BUG_ON(offset != entry->offset);
@@ -544,6 +561,7 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
dlen = PAGE_SIZE;
src = (u8 *)zbud_map(tree->pool, entry->handle) +
sizeof(struct zswap_header);
+ VM_BUG_ON(!src);
dst = kmap_atomic(page);
ret = zswap_comp_op(ZSWAP_COMPOP_DECOMPRESS, src,
entry->length, dst, &dlen);
@@ -661,8 +679,9 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
zhdr->swpentry = swp_entry(type, offset);
buf = (u8 *)(zhdr + 1);
memcpy(buf, dst, dlen);
- zbud_unmap(tree->pool, handle);
+ ret = zbud_unmap(tree->pool, handle);
put_cpu_var(zswap_dstmem);
+ VM_BUG_ON(ret);
/* populate entry */
entry->offset = offset;
@@ -726,6 +745,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
dlen = PAGE_SIZE;
src = (u8 *)zbud_map(tree->pool, entry->handle) +
sizeof(struct zswap_header);
+ VM_BUG_ON(!src);
dst = kmap_atomic(page);
ret = zswap_comp_op(ZSWAP_COMPOP_DECOMPRESS, src, entry->length,
dst, &dlen);
--
1.7.9.5
Currently zbud pages do not have any flags set, so it is not possible to
identify them during migration or compaction.
Implement PageZbud() by comparing page->_mapcount to -127 to distinguish
pages allocated by zbud, just as PageBuddy() is implemented.
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Reviewed-by: Seth Jennings <[email protected]>
---
include/linux/mm.h | 23 +++++++++++++++++++++++
mm/zbud.c | 4 ++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..b9ae6f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -440,6 +440,7 @@ static inline void init_page_count(struct page *page)
* efficiently by most CPU architectures.
*/
#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+#define PAGE_ZBUD_MAPCOUNT_VALUE (-127)
static inline int PageBuddy(struct page *page)
{
@@ -458,6 +459,28 @@ static inline void __ClearPageBuddy(struct page *page)
atomic_set(&page->_mapcount, -1);
}
+#ifdef CONFIG_ZBUD
+static inline int PageZbud(struct page *page)
+{
+ return atomic_read(&page->_mapcount) == PAGE_ZBUD_MAPCOUNT_VALUE;
+}
+
+static inline void SetPageZbud(struct page *page)
+{
+ VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
+ atomic_set(&page->_mapcount, PAGE_ZBUD_MAPCOUNT_VALUE);
+}
+
+static inline void ClearPageZbud(struct page *page)
+{
+ VM_BUG_ON(!PageZbud(page));
+ atomic_set(&page->_mapcount, -1);
+}
+#else
+PAGEFLAG_FALSE(Zbud)
+#endif
+
+
void put_page(struct page *page);
void put_pages_list(struct list_head *pages);
diff --git a/mm/zbud.c b/mm/zbud.c
index 1d5b26b..9be160c 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -197,7 +197,10 @@ static void get_zbud_page(struct zbud_header *zhdr)
static int put_zbud_page(struct zbud_header *zhdr)
{
struct page *page = virt_to_page(zhdr);
+ VM_BUG_ON(!PageZbud(page));
+
if (put_page_testzero(page)) {
+ ClearPageZbud(page);
free_hot_cold_page(page, 0);
return 1;
}
@@ -307,6 +310,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
* don't increase the page count.
*/
zhdr = init_zbud_page(page);
+ SetPageZbud(page);
bud = FIRST;
found:
--
1.7.9.5
Use a page reference counter for zbud pages. The ref counter replaces the
zbud_header.under_reclaim flag and ensures that a zbud page won't be freed
when zbud_free() is called during reclaim. It allows the implementation of
additional reclaim paths.
The page count is incremented when:
- a handle is created and passed to zswap (in zbud_alloc()),
- user-supplied eviction callback is called (in zbud_reclaim_page()).
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Tomasz Stanislawski <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
---
mm/zbud.c | 117 +++++++++++++++++++++++++++++++++----------------------------
1 file changed, 64 insertions(+), 53 deletions(-)
diff --git a/mm/zbud.c b/mm/zbud.c
index ad1e781..3f4be72 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -109,7 +109,6 @@ struct zbud_header {
struct list_head lru;
unsigned int first_chunks;
unsigned int last_chunks;
- bool under_reclaim;
};
/*****************
@@ -138,16 +137,9 @@ static struct zbud_header *init_zbud_page(struct page *page)
zhdr->last_chunks = 0;
INIT_LIST_HEAD(&zhdr->buddy);
INIT_LIST_HEAD(&zhdr->lru);
- zhdr->under_reclaim = 0;
return zhdr;
}
-/* Resets the struct page fields and frees the page */
-static void free_zbud_page(struct zbud_header *zhdr)
-{
- __free_page(virt_to_page(zhdr));
-}
-
/*
* Encodes the handle of a particular buddy within a zbud page
* Pool lock should be held as this function accesses first|last_chunks
@@ -188,6 +180,31 @@ static int num_free_chunks(struct zbud_header *zhdr)
return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks - 1;
}
+/*
+ * Increases ref count for zbud page.
+ */
+static void get_zbud_page(struct zbud_header *zhdr)
+{
+ get_page(virt_to_page(zhdr));
+}
+
+/*
+ * Decreases ref count for zbud page and frees the page if it reaches 0
+ * (no external references, e.g. handles).
+ *
+ * Returns 1 if page was freed and 0 otherwise.
+ */
+static int put_zbud_page(struct zbud_header *zhdr)
+{
+ struct page *page = virt_to_page(zhdr);
+ if (put_page_testzero(page)) {
+ free_hot_cold_page(page, 0);
+ return 1;
+ }
+ return 0;
+}
+
+
/*****************
* API Functions
*****************/
@@ -273,6 +290,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
bud = FIRST;
else
bud = LAST;
+ get_zbud_page(zhdr);
goto found;
}
}
@@ -284,6 +302,10 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
return -ENOMEM;
spin_lock(&pool->lock);
pool->pages_nr++;
+ /*
+ * We will be using zhdr instead of page, so
+ * don't increase the page count.
+ */
zhdr = init_zbud_page(page);
bud = FIRST;
@@ -318,10 +340,11 @@ found:
* @pool: pool in which the allocation resided
* @handle: handle associated with the allocation returned by zbud_alloc()
*
- * In the case that the zbud page in which the allocation resides is under
- * reclaim, as indicated by the PG_reclaim flag being set, this function
- * only sets the first|last_chunks to 0. The page is actually freed
- * once both buddies are evicted (see zbud_reclaim_page() below).
+ * This function sets first|last_chunks to 0, removes zbud header from
+ * appropriate lists (LRU, buddied/unbuddied) and puts the reference count
+ * for it. The page is actually freed once both buddies are evicted
+ * (zbud_free() called on both handles or page reclaim in zbud_reclaim_page()
+ * below).
*/
void zbud_free(struct zbud_pool *pool, unsigned long handle)
{
@@ -337,19 +360,11 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
else
zhdr->first_chunks = 0;
- if (zhdr->under_reclaim) {
- /* zbud page is under reclaim, reclaim will free */
- spin_unlock(&pool->lock);
- return;
- }
-
/* Remove from existing buddy list */
list_del(&zhdr->buddy);
if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
- /* zbud page is empty, free */
list_del(&zhdr->lru);
- free_zbud_page(zhdr);
pool->pages_nr--;
} else {
/* Add to unbuddied list */
@@ -357,6 +372,7 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
}
+ put_zbud_page(zhdr);
spin_unlock(&pool->lock);
}
@@ -378,21 +394,23 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
* To avoid these, this is how zbud_reclaim_page() should be called:
* The user detects a page should be reclaimed and calls zbud_reclaim_page().
- * zbud_reclaim_page() will remove a zbud page from the pool LRU list and call
- * the user-defined eviction handler with the pool and handle as arguments.
+ * zbud_reclaim_page() will move zbud page to the beginning of the pool
+ * LRU list, increase the page reference count and call the user-defined
+ * eviction handler with the pool and handle as arguments.
*
* If the handle can not be evicted, the eviction handler should return
- * non-zero. zbud_reclaim_page() will add the zbud page back to the
- * appropriate list and try the next zbud page on the LRU up to
+ * non-zero. zbud_reclaim_page() will drop the reference count for page
+ * obtained earlier and try the next zbud page on the LRU up to
* a user defined number of retries.
*
* If the handle is successfully evicted, the eviction handler should
* return 0 _and_ should have called zbud_free() on the handle. zbud_free()
- * contains logic to delay freeing the page if the page is under reclaim,
- * as indicated by the setting of the PG_reclaim flag on the underlying page.
+ * will remove the page from appropriate lists (LRU, buddied/unbuddied) and
+ * drop the reference count associated with given handle.
+ * Then the zbud_reclaim_page() will drop reference count obtained earlier.
*
- * If all buddies in the zbud page are successfully evicted, then the
- * zbud page can be freed.
+ * If all buddies in the zbud page are successfully evicted, then dropping
+ * this last reference count will free the page.
*
* Returns: 0 if page is successfully freed, otherwise -EINVAL if there are
* no pages to evict or an eviction handler is not registered, -EAGAIN if
@@ -400,7 +418,7 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
*/
int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
{
- int i, ret, freechunks;
+ int i, ret;
struct zbud_header *zhdr;
unsigned long first_handle = 0, last_handle = 0;
@@ -411,11 +429,24 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
return -EINVAL;
}
for (i = 0; i < retries; i++) {
+ if (list_empty(&pool->lru)) {
+ /*
+ * LRU was emptied during evict calls in previous
+ * iteration but put_zbud_page() returned 0 meaning
+ * that someone still holds the page. This may
+ * happen when some other mm mechanism increased
+ * the page count.
+ * In such a case reclaim succeeded.
+ */
+ spin_unlock(&pool->lock);
+ return 0;
+ }
zhdr = list_tail_entry(&pool->lru, struct zbud_header, lru);
+ /* Move this last element to beginning of LRU */
list_del(&zhdr->lru);
- list_del(&zhdr->buddy);
+ list_add(&zhdr->lru, &pool->lru);
/* Protect zbud page against free */
- zhdr->under_reclaim = true;
+ get_zbud_page(zhdr);
/*
* We need encode the handles before unlocking, since we can
* race with free that will set (first|last)_chunks to 0
@@ -440,29 +471,9 @@ int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
goto next;
}
next:
- spin_lock(&pool->lock);
- zhdr->under_reclaim = false;
- if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
- /*
- * Both buddies are now free, free the zbud page and
- * return success.
- */
- free_zbud_page(zhdr);
- pool->pages_nr--;
- spin_unlock(&pool->lock);
+ if (put_zbud_page(zhdr))
return 0;
- } else if (zhdr->first_chunks == 0 ||
- zhdr->last_chunks == 0) {
- /* add to unbuddied list */
- freechunks = num_free_chunks(zhdr);
- list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
- } else {
- /* add to buddied list */
- list_add(&zhdr->buddy, &pool->buddied);
- }
-
- /* add to beginning of LRU */
- list_add(&zhdr->lru, &pool->lru);
+ spin_lock(&pool->lock);
}
spin_unlock(&pool->lock);
return -EAGAIN;
--
1.7.9.5
Add migration support for zbud. This allows adding __GFP_MOVABLE flag
when allocating zbud pages and effectively CMA pool can be used for
zswap.
zbud pages are not movable and are not stored under any LRU (except
zbud's LRU). PageZbud flag is used in isolate_migratepages_range() to
grab zbud pages and pass them later for migration.
The page->private field is used for storing a pointer to the zbud_pool.
This pointer is needed during migration for locking the pool and
accessing the radix tree.
The zbud migration code utilizes the page mapping, so many exceptions
to the migrate code were added. This can be replaced for example with a
pin page control subsystem:
http://article.gmane.org/gmane.linux.kernel.mm/105308
In such case the zbud migration code (zbud_migrate_page()) can be safely
re-used.
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
include/linux/zbud.h | 1 +
mm/compaction.c | 7 +++
mm/migrate.c | 17 +++++-
mm/zbud.c | 164 +++++++++++++++++++++++++++++++++++++++++++++++---
mm/zswap.c | 4 +-
5 files changed, 179 insertions(+), 14 deletions(-)
diff --git a/include/linux/zbud.h b/include/linux/zbud.h
index 12d72df..3bc2e38 100644
--- a/include/linux/zbud.h
+++ b/include/linux/zbud.h
@@ -11,6 +11,7 @@ struct zbud_ops {
struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops);
void zbud_destroy_pool(struct zbud_pool *pool);
+int zbud_put_page(struct page *page);
int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
unsigned long *handle);
void zbud_free(struct zbud_pool *pool, unsigned long handle);
diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..8acd198 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -534,6 +534,12 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
goto next_pageblock;
}
+ if (PageZbud(page)) {
+ BUG_ON(PageLRU(page));
+ get_page(page);
+ goto isolated;
+ }
+
/*
* Check may be lockless but that's ok as we recheck later.
* It's possible to migrate LRU pages and balloon pages
@@ -601,6 +607,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
/* Successfully isolated */
cc->finished_update_migrate = true;
del_page_from_lru_list(page, lruvec, page_lru(page));
+isolated:
list_add(&page->lru, migratelist);
cc->nr_migratepages++;
nr_isolated++;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f0c244..5254eb2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
#include <linux/hugetlb_cgroup.h>
#include <linux/gfp.h>
#include <linux/balloon_compaction.h>
+#include <linux/zbud.h>
#include <asm/tlbflush.h>
@@ -105,6 +106,8 @@ void putback_movable_pages(struct list_head *l)
page_is_file_cache(page));
if (unlikely(balloon_page_movable(page)))
balloon_page_putback(page);
+ else if (unlikely(PageZbud(page)))
+ zbud_put_page(page);
else
putback_lru_page(page);
}
@@ -832,6 +835,10 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
goto skip_unmap;
}
+ if (unlikely(PageZbud(page))) {
+ remap_swapcache = 0;
+ goto skip_unmap;
+ }
/* Establish migration ptes or remove ptes */
try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
@@ -902,13 +909,19 @@ out:
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
- putback_lru_page(page);
+ if (unlikely(PageZbud(page)))
+ zbud_put_page(page);
+ else
+ putback_lru_page(page);
}
/*
* Move the new page to the LRU. If migration was not successful
* then this will free the page.
*/
- putback_lru_page(newpage);
+ if (unlikely(PageZbud(newpage)))
+ zbud_put_page(newpage);
+ else
+ putback_lru_page(newpage);
if (result) {
if (rc)
*result = rc;
diff --git a/mm/zbud.c b/mm/zbud.c
index 795d56a..de8132d 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -51,6 +51,8 @@
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/radix-tree.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
#include <linux/zbud.h>
/*****************
@@ -211,17 +213,9 @@ static void get_zbud_page(struct zbud_header *zhdr)
*
* Returns 1 if page was freed and 0 otherwise.
*/
-static int put_zbud_page(struct zbud_header *zhdr)
+static inline int put_zbud_page(struct zbud_header *zhdr)
{
- struct page *page = virt_to_page(zhdr);
- VM_BUG_ON(!PageZbud(page));
-
- if (put_page_testzero(page)) {
- ClearPageZbud(page);
- free_hot_cold_page(page, 0);
- return 1;
- }
- return 0;
+ return zbud_put_page(virt_to_page(zhdr));
}
/*
@@ -261,6 +255,27 @@ static int put_map_count(struct zbud_header *zhdr, unsigned long handle)
}
/*
+ * Replaces item expected in radix tree by a new item, while holding pool lock.
+ */
+static int tree_replace(struct zbud_pool *pool,
+ unsigned long handle, void *expected, void *replacement)
+{
+ void **pslot;
+ void *item = NULL;
+
+ VM_BUG_ON(!expected);
+ VM_BUG_ON(!replacement);
+ pslot = radix_tree_lookup_slot(&pool->page_tree, handle);
+ if (pslot)
+ item = radix_tree_deref_slot_protected(pslot,
+ &pool->lock);
+ if (item != expected)
+ return -ENOENT;
+ radix_tree_replace_slot(pslot, replacement);
+ return 0;
+}
+
+/*
* Searches for zbud header in radix tree.
* Returns NULL if handle could not be found.
*
@@ -328,6 +343,110 @@ static int tree_insert_zbud_header(struct zbud_pool *pool,
return radix_tree_insert(&pool->page_tree, *handle, zhdr);
}
+/*
+ * Copy page during migration.
+ *
+ * Must be called under pool->lock.
+ */
+static void copy_zbud_page(struct page *newpage, struct page *page)
+{
+ void *to, *from;
+ SetPageZbud(newpage);
+ newpage->mapping = page->mapping;
+ set_page_private(newpage, page_private(page));
+ /* copy data */
+ to = kmap_atomic(newpage);
+ from = kmap_atomic(page);
+ memcpy(to, from, PAGE_SIZE);
+ kunmap_atomic(from);
+ kunmap_atomic(to);
+}
+
+/*
+ * Replaces old zbud header in radix tree with new, updates page
+ * count (puts old, gets new) and puts map_count for old page.
+ *
+ * The map_count for new page is not increased because it was already
+ * copied by copy_zbud_page().
+ *
+ * Must be called under pool->lock.
+ */
+static void replace_page_handles(struct zbud_pool *pool,
+ struct zbud_header *zhdr, struct zbud_header *newzhdr)
+{
+ if (zhdr->first_handle) {
+ int ret = tree_replace(pool, zhdr->first_handle, zhdr,
+ newzhdr);
+ VM_BUG_ON(ret);
+ get_zbud_page(newzhdr);
+ /* get_map_count() skipped, already copied */
+ put_map_count(zhdr, zhdr->first_handle);
+ put_zbud_page(zhdr);
+ }
+ if (zhdr->last_handle) {
+ int ret = tree_replace(pool, zhdr->last_handle, zhdr,
+ newzhdr);
+ VM_BUG_ON(ret);
+ get_zbud_page(newzhdr);
+ /* get_map_count() skipped, already copied */
+ put_map_count(zhdr, zhdr->last_handle);
+ put_zbud_page(zhdr);
+ }
+}
+
+
+/*
+ * Migrates zbud page.
+ * Returns 0 on success or -EAGAIN if page was dirtied by zbud_map during
+ * migration.
+ */
+static int zbud_migrate_page(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
+{
+ int rc = 0;
+ struct zbud_header *zhdr, *newzhdr;
+ struct zbud_pool *pool;
+ int expected_cnt = 1; /* one page reference from isolate */
+
+ VM_BUG_ON(!PageLocked(page));
+ zhdr = page_address(page);
+ newzhdr = page_address(newpage);
+ pool = (struct zbud_pool *)page_private(page);
+ VM_BUG_ON(!pool);
+
+ spin_lock(&pool->lock);
+ if (zhdr->first_handle)
+ expected_cnt++;
+ if (zhdr->last_handle)
+ expected_cnt++;
+
+ if (page_count(page) != expected_cnt) {
+ /* Still used by someone (parallel compaction) or dirtied
+ * by zbud_map() before we acquired spin_lock. */
+ rc = -EAGAIN;
+ goto out;
+ }
+ copy_zbud_page(newpage, page);
+ replace_page_handles(pool, zhdr, newzhdr);
+ /* Update lists */
+ list_replace(&zhdr->lru, &newzhdr->lru);
+ list_replace(&zhdr->buddy, &newzhdr->buddy);
+ memset(zhdr, 0, sizeof(struct zbud_header));
+
+out:
+ spin_unlock(&pool->lock);
+ return rc;
+}
+
+static const struct address_space_operations zbud_aops = {
+ .migratepage = zbud_migrate_page,
+};
+const struct address_space zbud_space = {
+ .a_ops = &zbud_aops,
+};
+EXPORT_SYMBOL_GPL(zbud_space);
+
/*****************
* API Functions
*****************/
@@ -370,6 +489,29 @@ void zbud_destroy_pool(struct zbud_pool *pool)
kfree(pool);
}
+/*
+ * zbud_put_page() - decreases ref count for zbud page
+ * @page: zbud page to put
+ *
+ * Page is freed if reference count reaches 0.
+ *
+ * Returns 1 if page was freed and 0 otherwise.
+ */
+int zbud_put_page(struct page *page)
+{
+ VM_BUG_ON(!PageZbud(page));
+
+ if (put_page_testzero(page)) {
+ VM_BUG_ON(PageLocked(page));
+ page->mapping = NULL;
+ set_page_private(page, 0);
+ ClearPageZbud(page);
+ free_hot_cold_page(page, 0);
+ return 1;
+ }
+ return 0;
+}
+
/**
* zbud_alloc() - allocates a region of a given size
* @pool: zbud pool from which to allocate
@@ -439,6 +581,8 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
*/
zhdr = init_zbud_page(page);
SetPageZbud(page);
+ page->mapping = (struct address_space *)&zbud_space;
+ set_page_private(page, (unsigned long)pool);
bud = FIRST;
err = tree_insert_zbud_header(pool, zhdr, &next_handle);
diff --git a/mm/zswap.c b/mm/zswap.c
index 706046a..ac8b768 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -665,8 +665,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
/* store */
len = dlen + sizeof(struct zswap_header);
- ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN,
- &handle);
+ ret = zbud_alloc(tree->pool, len,
+ __GFP_NORETRY | __GFP_NOWARN | __GFP_MOVABLE, &handle);
if (ret == -ENOSPC) {
zswap_reject_compress_poor++;
goto freepage;
--
1.7.9.5
Hi Krzysztof,
On 09/11/2013 04:58 PM, Krzysztof Kozlowski wrote:
> Hi,
>
> Currently zbud pages are not movable and they cannot be allocated from CMA
> (Contiguous Memory Allocator) region. These patches add migration of zbud pages.
>
I agree that the migration of zbud pages is important so that the
system will not suffer from order-0 page fragmentation, and it can be
helpful for page compaction/huge pages etc.
But after I looked at [patch 4/5], I found it will make zbud very
complicated.
I'd prefer to add this migration feature later, once the current
version of zswap/zbud becomes good enough and more stable.
Mel mentioned several problems about zswap/zbud in thread "[PATCH v6
0/5] zram/zsmalloc promotion".
Like "it's clunky as hell and the layering between zswap and zbud is
twisty" and "I think I brought up its stalling behaviour during review
when it was being merged. It would have been preferable if writeback
could be initiated in batches and then waited on at the very least..
It's worse that it uses _swap_writepage directly instead of going
through a writepage ops. It would have been better if zbud pages
existed on the LRU and written back with an address space ops and
properly handled asynchonous writeback."
So I think it would be better if we could address those issues first;
it would be easier to address them before adding more new
features. Any ideas are welcome.
--
Regards,
-Bob
On Tue, Sep 17, 2013 at 02:59:24PM +0800, Bob Liu wrote:
> Hi Krzysztof,
>
> On 09/11/2013 04:58 PM, Krzysztof Kozlowski wrote:
> > Hi,
> >
> > Currently zbud pages are not movable and they cannot be allocated from CMA
> > (Contiguous Memory Allocator) region. These patches add migration of zbud pages.
> >
>
> I agree that the migration of zbud pages is important so that system
> will not enter order-0 page fragmentation and can be helpful for page
> compaction/huge pages etc..
>
> But after I looked at the [patch 4/5], I found it will make zbud very
> complicated.
> I'd prefer to add this migration feature later until current version
> zswap/zbud becomes better enough and more stable.
I agree with this. We are also looking to add zsmalloc as an option. It
would be nice to come up with a solution that works for both (any) allocators
that zswap uses.
>
> Mel mentioned several problems about zswap/zbud in thread "[PATCH v6
> 0/5] zram/zsmalloc promotion".
>
> Like "it's clunky as hell and the layering between zswap and zbud is
> twisty" and "I think I brought up its stalling behaviour during review
> when it was being merged. It would have been preferable if writeback
> could be initiated in batches and then waited on at the very least..
> It's worse that it uses _swap_writepage directly instead of going
> through a writepage ops. It would have been better if zbud pages
> existed on the LRU and written back with an address space ops and
> properly handled asynchonous writeback."
Yes, the layering in zswap vs zbud is wonky and should be addressed before adding
new layers.
>
> So I think it would be better if we can address those issues at first
> and it would be easier to address these issues before adding more new
> features. Welcome any ideas.
Agreed.
Seth
On Tue, Sep 17, 2013 at 02:59:24PM +0800, Bob Liu wrote:
> Mel mentioned several problems about zswap/zbud in thread "[PATCH v6
> 0/5] zram/zsmalloc promotion".
>
> Like "it's clunky as hell and the layering between zswap and zbud is
> twisty" and "I think I brought up its stalling behaviour during review
> when it was being merged. It would have been preferable if writeback
> could be initiated in batches and then waited on at the very least..
> It's worse that it uses _swap_writepage directly instead of going
> through a writepage ops. It would have been better if zbud pages
> existed on the LRU and written back with an address space ops and
> properly handled asynchonous writeback."
>
> So I think it would be better if we can address those issues at first
> and it would be easier to address these issues before adding more new
> features. Welcome any ideas.
I just had an idea this afternoon to potentially kill both these birds with one
stone: Replace the rbtree in zswap with an address_space.
Each swap type would have its own page_tree to organize the compressed objects
by type and offset (radix tree is more suited for this anyway) and a_ops that
could be called by shrink_page_list() (writepage) or the migration code
(migratepage).
Then zbud pages could be put on the normal LRU list, maybe at the beginning of
the inactive LRU so they would live for another cycle through the list, then be
reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
zswap_writepage() function that would decompress the pages and call
__swap_writepage() on them.
This might actually do away with the explicit pool size too as the compressed
pool pages wouldn't be outside the control of the MM anymore.
I'm just starting to explore this but I think it has promise.
Seth
Hi,
On pon, 2013-09-23 at 17:07 -0500, Seth Jennings wrote:
> On Tue, Sep 17, 2013 at 02:59:24PM +0800, Bob Liu wrote:
> > Mel mentioned several problems about zswap/zbud in thread "[PATCH v6
> > 0/5] zram/zsmalloc promotion".
> >
> > Like "it's clunky as hell and the layering between zswap and zbud is
> > twisty" and "I think I brought up its stalling behaviour during review
> > when it was being merged. It would have been preferable if writeback
> > could be initiated in batches and then waited on at the very least..
> > It's worse that it uses _swap_writepage directly instead of going
> > through a writepage ops. It would have been better if zbud pages
> > existed on the LRU and written back with an address space ops and
> > properly handled asynchonous writeback."
> >
> > So I think it would be better if we can address those issues at first
> > and it would be easier to address these issues before adding more new
> > features. Welcome any ideas.
>
> I just had an idea this afternoon to potentially kill both these birds with one
> stone: Replace the rbtree in zswap with an address_space.
>
> Each swap type would have its own page_tree to organize the compressed objects
> by type and offset (radix tree is more suited for this anyway) and a_ops that
> could be called by shrink_page_list() (writepage) or the migration code
> (migratepage).
>
> Then zbud pages could be put on the normal LRU list, maybe at the beginning of
> the inactive LRU so they would live for another cycle through the list, then be
> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
> zswap_writepage() function that would decompress the pages and call
> __swap_writepage() on them.
How exactly can the address space be used here? Do you want to point to
zbud pages in address_space.page_tree? If so, which index should be
used?
Best regards,
Krzysztof
On Tue, Sep 24, 2013 at 5:20 PM, Krzysztof Kozlowski
<[email protected]> wrote:
> Hi,
>
> On pon, 2013-09-23 at 17:07 -0500, Seth Jennings wrote:
>> On Tue, Sep 17, 2013 at 02:59:24PM +0800, Bob Liu wrote:
>> > Mel mentioned several problems about zswap/zbud in thread "[PATCH v6
>> > 0/5] zram/zsmalloc promotion".
>> >
>> > Like "it's clunky as hell and the layering between zswap and zbud is
>> > twisty" and "I think I brought up its stalling behaviour during review
>> > when it was being merged. It would have been preferable if writeback
>> > could be initiated in batches and then waited on at the very least..
>> > It's worse that it uses _swap_writepage directly instead of going
>> > through a writepage ops. It would have been better if zbud pages
>> > existed on the LRU and written back with an address space ops and
>> > properly handled asynchonous writeback."
>> >
>> > So I think it would be better if we can address those issues at first
>> > and it would be easier to address these issues before adding more new
>> > features. Welcome any ideas.
>>
>> I just had an idea this afternoon to potentially kill both these birds with one
>> stone: Replace the rbtree in zswap with an address_space.
>>
>> Each swap type would have its own page_tree to organize the compressed objects
>> by type and offset (radix tree is more suited for this anyway) and a_ops that
>> could be called by shrink_page_list() (writepage) or the migration code
>> (migratepage).
>>
>> Then zbud pages could be put on the normal LRU list, maybe at the beginning of
>> the inactive LRU so they would live for another cycle through the list, then be
>> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
>> zswap_writepage() function that would decompress the pages and call
>> __swap_writepage() on them.
>
> How exactly the address space can be used here? Do you want to point to
> zbud pages in address_space.page_tree? If yes then which index should be
> used?
>
I didn't get the point either. I think introducing an address_space is enough.
1. zbud.c:

static struct address_space_operations zbud_aops = {
	.writepage = zswap_writepage,
};
struct address_space zbud_space = {
	.a_ops = &zbud_aops,
};

zbud_alloc() {
	zbud_page = alloc_page();
	zbud_page->mapping = (struct address_space *)&zbud_space;
	set_page_private(zbud_page, (unsigned long)pool);
	lru_add_anon(zbud_page);
}

2. zswap.c:

static int zswap_writepage(struct page *page, struct writeback_control *wbc)
{
	handle = encode_handle(page_address(page), FIRST);
	zswap_writeback_entry(pool, handle);
	handle = encode_handle(page_address(page), LAST);
	zswap_writeback_entry(pool, handle);
}
Of course it may need lots of work before the core MM subsystem can
maintain zbud pages.
But in this way, we can get rid of the clunky reclaiming layer and
integrate zswap closely with the core MM subsystem, which knows better
how many zbud pages can be used and when to trigger zbud page
reclaim.
--
Regards,
--Bob
> I just had an idea this afternoon to potentially kill both these birds with one
> stone: Replace the rbtree in zswap with an address_space.
>
> Each swap type would have its own page_tree to organize the compressed objects
> by type and offset (radix tree is more suited for this anyway) and a_ops that
> could be called by shrink_page_list() (writepage) or the migration code
> (migratepage).
>
> Then zbud pages could be put on the normal LRU list, maybe at the beginning of
> the inactive LRU so they would live for another cycle through the list, then be
> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
> zswap_writepage() function that would decompress the pages and call
> __swap_writepage() on them.
>
> This might actually do away with the explicit pool size too as the compressed
> pool pages wouldn't be outside the control of the MM anymore.
>
> I'm just starting to explore this but I think it has promise.
>
> Seth
>
Hi Seth,
There is a problem with the proposed idea.
The radix tree used in 'struct address_space' is part of
a bigger data structure.
The radix tree is used to translate an offset to a page.
That is OK for zswap. But struct page has a field named 'index'.
The MM assumes that this index is the offset in the radix tree
where one can find the page. A lot is done by the MM to sustain
this consistency.
In the case of zbud, there are two swap offsets pointing to
the same page. There might be more if zsmalloc is used.
What is worse, it is possible that one swap entry could
point to data that crosses a page boundary.
Of course, one could try to modify MM to support
multiple mapping of a page in the radix tree.
But I think that MM guys will consider this as a hack
and they will not accept it.
Regards,
Tomasz Stanislawski
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
On Wed, Sep 25, 2013 at 07:09:50PM +0200, Tomasz Stanislawski wrote:
> > I just had an idea this afternoon to potentially kill both these birds with one
> > stone: Replace the rbtree in zswap with an address_space.
> >
> > Each swap type would have its own page_tree to organize the compressed objects
> > by type and offset (radix tree is more suited for this anyway) and a_ops that
> > could be called by shrink_page_list() (writepage) or the migration code
> > (migratepage).
> >
> > Then zbud pages could be put on the normal LRU list, maybe at the beginning of
> > the inactive LRU so they would live for another cycle through the list, then be
> > reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
> > zswap_writepage() function that would decompress the pages and call
> > __swap_writepage() on them.
> >
> > This might actually do away with the explicit pool size too as the compressed
> > pool pages wouldn't be outside the control of the MM anymore.
> >
> > I'm just starting to explore this but I think it has promise.
> >
> > Seth
> >
>
> Hi Seth,
> There is a problem with the proposed idea.
> The radix tree used 'struct address_space' is a part of
> a bigger data structure.
> The radix tree is used to translate an offset to a page.
> That is ok for zswap. But struct page has a field named 'index'.
> The MM assumes that this index is an offset in radix tree
> where one can find the page. A lot is done by MM to sustain
> this consistency.
Yes, this is how it is for page cache pages. However, the MM is able to
work differently with anonymous pages. In the case of an anonymous
page, the mapping field points to an anon_vma struct, or, if ksm is
enabled and dedup'ing the page, a private ksm tracking structure. If
the anonymous page is fully unmapped and resides only in the swap cache,
the page mapping is NULL. So there is precedent for the fields to mean
other things.
The question is how to mark and identify zbud pages among the other page
types that will be on the LRU. There are many ways; the question is
which is the best and most acceptable.
>
> In case of zbud, there are two swap offset pointing to
> the same page. There might be more if zsmalloc is used.
> What is worse it is possible that one swap entry could
> point to data that cross a page boundary.
We just won't set page->index since it doesn't have a good meaning in
our case. Swap cache pages also don't use index, although it seems to
me that they could since there is a 1:1 mapping of a swap cache page to
a swap offset and the index field isn't being used for anything else.
But I digress...
>
> Of course, one could try to modify MM to support
> multiple mapping of a page in the radix tree.
> But I think that MM guys will consider this as a hack
> and they will not accept it.
Yes, it will require some changes to the MM to handle zbud pages on the
LRU. I'm thinking that it won't be too intrusive, depending on how we
choose to mark zbud pages.
Seth
>
> Regards,
> Tomasz Stanislawski
>
>
On 09/25/2013 11:57 PM, Seth Jennings wrote:
> On Wed, Sep 25, 2013 at 07:09:50PM +0200, Tomasz Stanislawski wrote:
>>> I just had an idea this afternoon to potentially kill both these birds with one
>>> stone: Replace the rbtree in zswap with an address_space.
>>>
>>> Each swap type would have its own page_tree to organize the compressed objects
>>> by type and offset (radix tree is more suited for this anyway) and a_ops that
>>> could be called by shrink_page_list() (writepage) or the migration code
>>> (migratepage).
>>>
>>> Then zbud pages could be put on the normal LRU list, maybe at the beginning of
>>> the inactive LRU so they would live for another cycle through the list, then be
>>> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
>>> zswap_writepage() function that would decompress the pages and call
>>> __swap_writepage() on them.
>>>
>>> This might actually do away with the explicit pool size too as the compressed
>>> pool pages wouldn't be outside the control of the MM anymore.
>>>
>>> I'm just starting to explore this but I think it has promise.
>>>
>>> Seth
>>>
>>
>> Hi Seth,
>> There is a problem with the proposed idea.
>> The radix tree used 'struct address_space' is a part of
>> a bigger data structure.
>> The radix tree is used to translate an offset to a page.
>> That is ok for zswap. But struct page has a field named 'index'.
>> The MM assumes that this index is an offset in radix tree
>> where one can find the page. A lot is done by MM to sustain
>> this consistency.
>
> Yes, this is how it is for page cache pages. However, the MM is able to
> work differently with anonymous pages. In the case of an anonymous
> page, the mapping field points to an anon_vma struct, or, if ksm in
> enabled and dedup'ing the page, a private ksm tracking structure. If
> the anonymous page is fully unmapped and resides only in the swap cache,
> the page mapping is NULL. So there is precedent for the fields to mean
> other things.
Hi Seth,
You are right that page->mapping is NULL for pages in swap_cache but
page_mapping() is not NULL in such a case. The mapping is taken from
struct address_space swapper_spaces[]. It is still an address space,
and it should preserve constraints for struct address_space.
The same happens for page->index and page_index().
>
> The question is how to mark and identify zbud pages among the other page
> types that will be on the LRU. There are many ways. The question is
> what is the best and most acceptable way.
>
If you consider hacking, I have some ideas how an address_space could be utilized for ZBUD.
One solution would be using tags in a radix tree. Every entry in a radix tree
can have a few bits assigned to it. Currently 3 bits are supported:

From include/linux/fs.h:
#define PAGECACHE_TAG_DIRTY 0
#define PAGECACHE_TAG_WRITEBACK 1
#define PAGECACHE_TAG_TOWRITE 2

You could add a new bit or utilize one of the existing ones.
The other idea is to use a trick from RB trees and scatter-gather lists.
I mean using the last bits of pointers to keep some metadata.
Values of 'struct page *' variables are aligned to the pointer alignment, which is
4 for 32-bit CPUs and 8 for 64-bit ones (not sure). This means that one
could use the last bit of the page pointer in a radix tree to track whether a swap entry
refers to the lower or the higher part of a ZBUD page.
I think it is serious hacking/obfuscation but it may work with a minimal
amount of changes to the MM. Adding just (x & ~3) while extracting the page pointer is
probably enough.
What do you think about this idea?
>>
>> In case of zbud, there are two swap offset pointing to
>> the same page. There might be more if zsmalloc is used.
>> What is worse it is possible that one swap entry could
>> point to data that cross a page boundary.
>
> We just won't set page->index since it doesn't have a good meaning in
> our case. Swap cache pages also don't use index, although is seems to
> me that they could since there is a 1:1 mapping of a swap cache page to
> a swap offset and the index field isn't being used for anything else.
> But I digress...
OK.
>
>>
>> Of course, one could try to modify MM to support
>> multiple mapping of a page in the radix tree.
>> But I think that MM guys will consider this as a hack
>> and they will not accept it.
>
> Yes, it will require some changes to the MM to handle zbud pages on the
> LRU. I'm thinking that it won't be too intrusive, depending on how we
> choose to mark zbud pages.
>
Anyway, I think that zswap should use two indexes
(index in the database sense).
One index is used to translate a swap_entry to a compressed page.
The other one is to be used for reclaim and migration by the MM;
an address_space is probably the best choice.
Zbud would be responsible for keeping consistency
between the mentioned indexes.
Regards,
Tomasz Stanislawski
> Seth
>
>>
>> Regards,
>> Tomasz Stanislawski
>>
>>
On Fri, Sep 27, 2013 at 12:16:37PM +0200, Tomasz Stanislawski wrote:
> On 09/25/2013 11:57 PM, Seth Jennings wrote:
> > On Wed, Sep 25, 2013 at 07:09:50PM +0200, Tomasz Stanislawski wrote:
> >>> I just had an idea this afternoon to potentially kill both these birds with one
> >>> stone: Replace the rbtree in zswap with an address_space.
> >>>
> >>> Each swap type would have its own page_tree to organize the compressed objects
> >>> by type and offset (radix tree is more suited for this anyway) and a_ops that
> >>> could be called by shrink_page_list() (writepage) or the migration code
> >>> (migratepage).
> >>>
> >>> Then zbud pages could be put on the normal LRU list, maybe at the beginning of
> >>> the inactive LRU so they would live for another cycle through the list, then be
> >>> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
> >>> zswap_writepage() function that would decompress the pages and call
> >>> __swap_writepage() on them.
> >>>
> >>> This might actually do away with the explicit pool size too as the compressed
> >>> pool pages wouldn't be outside the control of the MM anymore.
> >>>
> >>> I'm just starting to explore this but I think it has promise.
> >>>
> >>> Seth
> >>>
> >>
> >> Hi Seth,
> >> There is a problem with the proposed idea.
> >> The radix tree used 'struct address_space' is a part of
> >> a bigger data structure.
> >> The radix tree is used to translate an offset to a page.
> >> That is ok for zswap. But struct page has a field named 'index'.
> >> The MM assumes that this index is an offset in radix tree
> >> where one can find the page. A lot is done by MM to sustain
> >> this consistency.
> >
> > Yes, this is how it is for page cache pages. However, the MM is able to
> > work differently with anonymous pages. In the case of an anonymous
> > page, the mapping field points to an anon_vma struct, or, if ksm in
> > enabled and dedup'ing the page, a private ksm tracking structure. If
> > the anonymous page is fully unmapped and resides only in the swap cache,
> > the page mapping is NULL. So there is precedent for the fields to mean
> > other things.
>
> Hi Seth,
> You are right that page->mapping is NULL for pages in swap_cache but
> page_mapping() is not NULL in such a case. The mapping is taken from
> struct address_space swapper_spaces[]. It is still an address space,
> and it should preserve constraints for struct address_space.
> The same happens for page->index and page_index().
>
> >
> > The question is how to mark and identify zbud pages among the other page
> > types that will be on the LRU. There are many ways. The question is
> > what is the best and most acceptable way.
> >
>
> If you consider hacking, I have some ideas how an address_space could be
> utilized for ZBUD.
> One solution would be to use tags in the radix tree. Every entry in a radix
> tree can have a few bits assigned to it. Currently 3 bits are supported:
>
> From include/linux/fs.h
> #define PAGECACHE_TAG_DIRTY 0
> #define PAGECACHE_TAG_WRITEBACK 1
> #define PAGECACHE_TAG_TOWRITE 2
>
> You could add a new bit or utilize one of existing ones.
>
> The other idea is to use a trick from RB trees and scatter-gather lists.
> I mean using the last bits of pointers to keep some metadata.
> Values of 'struct page *' variables are aligned to the pointer alignment,
> which is 4 on 32-bit CPUs and 8 on 64-bit ones (not sure). This means that
> one could use the last bit of the page pointer in a radix tree to track
> whether a swap entry refers to the lower or the higher part of a ZBUD page.
> I think it is serious hacking/obfuscation but it may work with a minimal
> amount of changes to MM. Adding only (x & ~3) while extracting the page
> pointer is probably enough.
>
> What do you think about this idea?
I think it is a good one.
I have to say that when I first came up with the idea, I was thinking
the address space would be at the zswap layer and the radix slots would
hold zbud handles, not struct page pointers.
However, as I have discovered today, this is problematic when it comes
to reclaim and migration and serializing access.
I wanted to do as much as possible in the zswap layer since anything
done in the zbud layer would need to be duplicated in any other future
allocator that zswap wanted to support.
Unfortunately, zbud abstracts away the struct page and that visibility
is needed to properly do what we are talking about.
So maybe it is inevitable that this will need to be in the zbud code
with the radix tree slots pointing to struct pages after all.
I like the idea of masking the bit into the struct page pointer to
indicate which buddy maps to the offset.
There is a twist here in that, unlike a normal page cache tree, we can
have two offsets pointing at different buddies in the same frame
which means we'll have to do some custom stuff for migration.
The rabbit hole I was going down today has come to an end so I'll take a
fresh look next week.
Thanks for your ideas and discussion! Maybe we can make zswap/zbud an
upstanding MM citizen yet!
Seth
>
> >>
> >> In case of zbud, there are two swap offset pointing to
> >> the same page. There might be more if zsmalloc is used.
> >> What is worse it is possible that one swap entry could
> >> point to data that cross a page boundary.
> >
> > We just won't set page->index since it doesn't have a good meaning in
> > our case. Swap cache pages also don't use index, although is seems to
> > me that they could since there is a 1:1 mapping of a swap cache page to
> > a swap offset and the index field isn't being used for anything else.
> > But I digress...
>
> OK.
>
> >
> >>
> >> Of course, one could try to modify MM to support
> >> multiple mapping of a page in the radix tree.
> >> But I think that MM guys will consider this as a hack
> >> and they will not accept it.
> >
> > Yes, it will require some changes to the MM to handle zbud pages on the
> > LRU. I'm thinking that it won't be too intrusive, depending on how we
> > choose to mark zbud pages.
> >
>
> Anyway, I think that zswap should use two index engines.
> I mean index in Data Base meaning.
> One index is used to translate swap_entry to compressed page.
> And another one to be used by reclaim and migration by MM,
> probably address_space is a best choice.
> Zbud would responsible for keeping consistency
> between mentioned indexes.
>
> Regards,
> Tomasz Stanislawski
>
> > Seth
> >
> >>
> >> Regards,
> >> Tomasz Stanislawski
> >>
> >>
>
On 09/28/2013 06:00 AM, Seth Jennings wrote:
> On Fri, Sep 27, 2013 at 12:16:37PM +0200, Tomasz Stanislawski wrote:
>> On 09/25/2013 11:57 PM, Seth Jennings wrote:
>>> On Wed, Sep 25, 2013 at 07:09:50PM +0200, Tomasz Stanislawski wrote:
>>>>> I just had an idea this afternoon to potentially kill both these birds with one
>>>>> stone: Replace the rbtree in zswap with an address_space.
>>>>>
>>>>> Each swap type would have its own page_tree to organize the compressed objects
>>>>> by type and offset (radix tree is more suited for this anyway) and a_ops that
>>>>> could be called by shrink_page_list() (writepage) or the migration code
>>>>> (migratepage).
>>>>>
>>>>> Then zbud pages could be put on the normal LRU list, maybe at the beginning of
>>>>> the inactive LRU so they would live for another cycle through the list, then be
>>>>> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a
>>>>> zswap_writepage() function that would decompress the pages and call
>>>>> __swap_writepage() on them.
>>>>>
>>>>> This might actually do away with the explicit pool size too as the compressed
>>>>> pool pages wouldn't be outside the control of the MM anymore.
>>>>>
>>>>> I'm just starting to explore this but I think it has promise.
>>>>>
>>>>> Seth
>>>>>
>>>>
>>>> Hi Seth,
>>>> There is a problem with the proposed idea.
>>>> The radix tree used 'struct address_space' is a part of
>>>> a bigger data structure.
>>>> The radix tree is used to translate an offset to a page.
>>>> That is ok for zswap. But struct page has a field named 'index'.
>>>> The MM assumes that this index is an offset in radix tree
>>>> where one can find the page. A lot is done by MM to sustain
>>>> this consistency.
>>>
>>> Yes, this is how it is for page cache pages. However, the MM is able to
>>> work differently with anonymous pages. In the case of an anonymous
>>> page, the mapping field points to an anon_vma struct, or, if ksm in
>>> enabled and dedup'ing the page, a private ksm tracking structure. If
>>> the anonymous page is fully unmapped and resides only in the swap cache,
>>> the page mapping is NULL. So there is precedent for the fields to mean
>>> other things.
>>
>> Hi Seth,
>> You are right that page->mapping is NULL for pages in swap_cache but
>> page_mapping() is not NULL in such a case. The mapping is taken from
>> struct address_space swapper_spaces[]. It is still an address space,
>> and it should preserve constraints for struct address_space.
>> The same happen for page->index and page_index().
>>
>>>
>>> The question is how to mark and identify zbud pages among the other page
>>> types that will be on the LRU. There are many ways. The question is
>>> what is the best and most acceptable way.
>>>
>>
>> If you consider hacking I have some idea how address_space could utilized for ZBUD.
>> One solution whould be using tags in a radix tree. Every entry in a radix tree
>> can have a few bits assigned to it. Currently 3 bits are supported:
>>
>> From include/linux/fs.h
>> #define PAGECACHE_TAG_DIRTY 0
>> #define PAGECACHE_TAG_WRITEBACK 1
>> #define PAGECACHE_TAG_TOWRITE 2
>>
>> You could add a new bit or utilize one of existing ones.
>>
>> The other idea is use a trick from a RB trees and scatter-gather lists.
>> I mean using the last bits of pointers to keep some metadata.
>> Values of 'struct page *' variables are aligned to a pointer alignment which is
>> 4 for 32-bit CPUs and 8 for 64-bit ones (not sure). This means that one could
>> could use the last bit of page pointer in a radix tree to track if a swap entry
>> refers to a lower or a higher part of a ZBUD page.
>> I think it is a serious hacking/obfuscation but it may work with the minimal
>> amount of changes to MM. Adding only (x&~3) while extracting page pointer is
>> probably enough.
>>
>> What do you think about this idea?
>
> I think it is a good one.
>
> I have to say that when I first came up with the idea, I was thinking
> the address space would be at the zswap layer and the radix slots would
> hold zbud handles, not struct page pointers.
>
> However, as I have discovered today, this is problematic when it comes
> to reclaim and migration and serializing access.
>
> I wanted to do as much as possible in the zswap layer since anything
> done in the zbud layer would need to be duplicated in any other future
> allocator that zswap wanted to support.
>
> Unfortunately, zbud abstracts away the struct page and that visibility
> is needed to properly do what we are talking about.
>
> So maybe it is inevitable that this will need to be in the zbud code
> with the radix tree slots pointing to struct pages after all.
>
But in this way, zswap_frontswap_load() can't find the zswap_entry. We
would still need the rbtree that zswap currently has.
> I like the idea of masking the bit into the struct page pointer to
> indicate which buddy maps to the offset.
>
I have no idea why we need this.
My idea is to connect the zbud page with an address_space and add the
zbud page to the LRU list only, without any radix tree.
The zswap_entry can still live in the rbtree, or maybe be changed to a
radix tree.
There is sample code in my previous email.
--
Regards,
-Bob
On pią, 2013-09-27 at 17:00 -0500, Seth Jennings wrote:
> I have to say that when I first came up with the idea, I was thinking
> the address space would be at the zswap layer and the radix slots would
> hold zbud handles, not struct page pointers.
>
> However, as I have discovered today, this is problematic when it comes
> to reclaim and migration and serializing access.
>
> I wanted to do as much as possible in the zswap layer since anything
> done in the zbud layer would need to be duplicated in any other future
> allocator that zswap wanted to support.
>
> Unfortunately, zbud abstracts away the struct page and that visibility
> is needed to properly do what we are talking about.
>
> So maybe it is inevitable that this will need to be in the zbud code
> with the radix tree slots pointing to struct pages after all.
To me it looks very similar to the solution proposed in my patches. The
difference is that you wish to use the offset as the radix tree index.
I thought about this earlier but it poses two problems:
1. A generalized handle (instead of offset) may be more suitable when
zbud will be used in other drivers (e.g. zram).
2. It requires redesigning the zswap architecture around
zswap_frontswap_store() for the case of duplicated insertion. Currently,
when storing a page, zswap:
- allocates a zbud page,
- stores the new data in it,
- checks whether it is a duplicated page (the same offset already present
in the rbtree),
- if yes (duplicated), frees the previous entry.
The problem here lies in allocating a zbud page under the same offset:
this step would replace the old data (because we would be using the same
offset in the radix tree).
In my opinion, using a zbud handle is more flexible in this case.
Best regards,
Krzysztof
> I like the idea of masking the bit into the struct page pointer to
> indicate which buddy maps to the offset.
>
> There is a twist here in that, unlike a normal page cache tree, we can
> have two offsets pointing at different buddies in the same frame
> which means we'll have to do some custom stuff for migration.
>
> The rabbit hole I was going down today has come to an end so I'll take a
> fresh look next week.
>
> Thanks for your ideas and discussion! Maybe we can make zswap/zbud an
> upstanding MM citizen yet!
>
> Seth
>
> >
> > >>
> > >> In case of zbud, there are two swap offset pointing to
> > >> the same page. There might be more if zsmalloc is used.
> > >> What is worse it is possible that one swap entry could
> > >> point to data that cross a page boundary.
> > >
> > > We just won't set page->index since it doesn't have a good meaning in
> > > our case. Swap cache pages also don't use index, although is seems to
> > > me that they could since there is a 1:1 mapping of a swap cache page to
> > > a swap offset and the index field isn't being used for anything else.
> > > But I digress...
> >
> > OK.
> >
> > >
> > >>
> > >> Of course, one could try to modify MM to support
> > >> multiple mapping of a page in the radix tree.
> > >> But I think that MM guys will consider this as a hack
> > >> and they will not accept it.
> > >
> > > Yes, it will require some changes to the MM to handle zbud pages on the
> > > LRU. I'm thinking that it won't be too intrusive, depending on how we
> > > choose to mark zbud pages.
> > >
> >
> > Anyway, I think that zswap should use two index engines.
> > I mean index in Data Base meaning.
> > One index is used to translate swap_entry to compressed page.
> > And another one to be used by reclaim and migration by MM,
> > probably address_space is a best choice.
> > Zbud would responsible for keeping consistency
> > between mentioned indexes.
> >
> > Regards,
> > Tomasz Stanislawski
> >
> > > Seth
> > >
> > >>
> > >> Regards,
> > >> Tomasz Stanislawski
> > >>
> > >>
> >
On Mon, Sep 30, 2013 at 10:28:46AM +0200, Krzysztof Kozlowski wrote:
> On pią, 2013-09-27 at 17:00 -0500, Seth Jennings wrote:
> > I have to say that when I first came up with the idea, I was thinking
> > the address space would be at the zswap layer and the radix slots would
> > hold zbud handles, not struct page pointers.
> >
> > However, as I have discovered today, this is problematic when it comes
> > to reclaim and migration and serializing access.
> >
> > I wanted to do as much as possible in the zswap layer since anything
> > done in the zbud layer would need to be duplicated in any other future
> > allocator that zswap wanted to support.
> >
> > Unfortunately, zbud abstracts away the struct page and that visibility
> > is needed to properly do what we are talking about.
> >
> > So maybe it is inevitable that this will need to be in the zbud code
> > with the radix tree slots pointing to struct pages after all.
>
> To me it looks very similar to the solution proposed in my patches.
Yes, it is very similar. I'm beginning to like aspects of this patch
more as I explore this issue more.
At first, I balked at the idea of yet another abstraction layer, but it
is very hard to avoid unless you want to completely collapse zswap and
zbud into one another and dissolve the layering. Then you could do a
direct swap_offset -> address mapping.
> The
> difference is that you wish to use offset as radix tree index.
> I thought about this earlier but it imposed two problems:
>
> 1. A generalized handle (instead of offset) may be more suitable when
> zbud will be used in other drivers (e.g. zram).
>
> 2. It requires redesigning of zswap architecture around
> zswap_frontswap_store() in case of duplicated insertion. Currently when
> storing a page the zswap:
> - allocates zbud page,
> - stores new data in it,
> - checks whether it is a duplicated page (same offset present in
> rbtree),
> - if yes (duplicated) then zswap frees previous entry.
> The problem here lies in allocating zbud page under the same offset.
> This step would replace old data (because we are using the same offset
> in radix tree).
Yes, but the offset is always going to be the key at the top layer
because that is what the swap subsystem uses. So we'd have to have a
swap_offset -> handle -> address translation (2 abstraction layers), the
first of which would need to deal with the duplicate-store issue.
Seth
>
> In my opinion using zbud handle is in this case more flexible.
>
>
> Best regards,
> Krzysztof
>
> > I like the idea of masking the bit into the struct page pointer to
> > indicate which buddy maps to the offset.
> >
> > There is a twist here in that, unlike a normal page cache tree, we can
> > have two offsets pointing at different buddies in the same frame
> > which means we'll have to do some custom stuff for migration.
> >
> > The rabbit hole I was going down today has come to an end so I'll take a
> > fresh look next week.
> >
> > Thanks for your ideas and discussion! Maybe we can make zswap/zbud an
> > upstanding MM citizen yet!
> >
> > Seth
> >
> > >
> > > >>
> > > >> In case of zbud, there are two swap offset pointing to
> > > >> the same page. There might be more if zsmalloc is used.
> > > >> What is worse it is possible that one swap entry could
> > > >> point to data that cross a page boundary.
> > > >
> > > > We just won't set page->index since it doesn't have a good meaning in
> > > > our case. Swap cache pages also don't use index, although is seems to
> > > > me that they could since there is a 1:1 mapping of a swap cache page to
> > > > a swap offset and the index field isn't being used for anything else.
> > > > But I digress...
> > >
> > > OK.
> > >
> > > >
> > > >>
> > > >> Of course, one could try to modify MM to support
> > > >> multiple mapping of a page in the radix tree.
> > > >> But I think that MM guys will consider this as a hack
> > > >> and they will not accept it.
> > > >
> > > > Yes, it will require some changes to the MM to handle zbud pages on the
> > > > LRU. I'm thinking that it won't be too intrusive, depending on how we
> > > > choose to mark zbud pages.
> > > >
> > >
> > > Anyway, I think that zswap should use two index engines.
> > > I mean index in Data Base meaning.
> > > One index is used to translate swap_entry to compressed page.
> > > And another one to be used by reclaim and migration by MM,
> > > probably address_space is a best choice.
> > > Zbud would responsible for keeping consistency
> > > between mentioned indexes.
> > >
> > > Regards,
> > > Tomasz Stanislawski
> > >
> > > > Seth
> > > >
> > > >>
> > > >> Regards,
> > > >> Tomasz Stanislawski
> > > >>
> > > >>
> > >
>
On wto, 2013-10-01 at 16:04 -0500, Seth Jennings wrote:
> Yes, it is very similar. I'm beginning to like aspects of this patch
> more as I explore this issue more.
>
> At first, I balked at the idea of yet another abstraction layer, but it
> is very hard to avoid unless you want to completely collapse zswap and
> zbud into one another and dissolve the layering. Then you could do a
> direct swap_offset -> address mapping.
After discussion with Tomasz Stanislawski we had the idea of merging the
trees (zswap's rbtree and the radix tree added to zbud in these patches)
into one tree in the zbud layer.
This would simplify the design (once migration is added, of course).
The idea looks like:
1. Get rid of the red-black tree in zswap.
2. Add a radix tree to zbud (or use the radix tree from an address_space).
- Use the offset (from the swp_entry) as the index into the radix tree.
- Store the zbud page (struct page) in the tree.
3. With both buddies filled, one zbud page would be put in the radix tree
twice.
4. The zbud API would look like:
zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp, pgoff_t offset)
zbud_free(struct zbud_pool *pool, pgoff_t offset)
zbud_map(struct zbud_pool *pool, pgoff_t offset)
etc.
5. zbud_map/unmap() would be a little more complex than now, as it would
take over some code from zswap (finding the offset in the tree).
6. The radix tree would be used for:
- finding an entry by offset (for zswap_frontswap_load() and others),
- migration.
7. In case of migration colliding with zbud_map/unmap(), the locking
could be limited (in comparison to my patch). Calling zbud_map() would
mark a page "dirty". During migration, if the page was "dirtied", the
migration would fail with EAGAIN. Of course, migration won't start if
a zbud buddy is mapped.
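The dirty-flag scheme in the last point could be sketched as follows. This is a hypothetical userspace model; the type and function names, the single flag, and the EBUSY return for the mapped case are illustrative assumptions, not from any posted patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: zbud_map() marks the page dirty; migration refuses to
 * start while a buddy is mapped, and fails with EAGAIN if a racing
 * map dirtied the page after the copy began. */
struct toy_zbud_page {
	int  mapcount;		/* nonzero while a buddy is mapped */
	bool dirty;		/* set by any map since migration started */
};

static void toy_map(struct toy_zbud_page *p)
{
	p->mapcount++;
	p->dirty = true;
}

static void toy_unmap(struct toy_zbud_page *p)
{
	p->mapcount--;
}

static int toy_migrate(struct toy_zbud_page *p)
{
	if (p->mapcount)	/* a buddy is mapped: don't even start */
		return -EBUSY;
	p->dirty = false;	/* migration begins */
	/* ... contents would be copied to the new page here ... */
	if (p->dirty)		/* a concurrent toy_map() intervened */
		return -EAGAIN;
	return 0;
}
```

The point of the scheme is that zbud_map()/zbud_unmap() never block on a migration lock; migration simply re-checks the flag after copying and retries later if it lost the race.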
What do you think about this?
Best regards,
Krzysztof