Changelog:
v6:
* Move the move-to-front logic into zs_map_object (patch 4)
(suggested by Minchan Kim).
* Small cleanup of free_zspage at the free_handles() call site
(patch 6) (suggested by Minchan Kim).
v5:
* Add a new patch that eliminates unused code in zpool and simplifies
the logic for storing the evict handler in zbud/z3fold (patch 2)
* Remove redundant fields in zs_pool (previously required by zpool)
(patch 3)
* Wrap under_reclaim and deferred handle freeing logic in CONFIG_ZPOOL
(patch 6) (suggested by Minchan Kim)
* Move a small piece of refactoring from patch 6 to patch 4.
v4:
* Wrap the new LRU logic in CONFIG_ZPOOL (patch 3).
(suggested by Minchan Kim)
v3:
* Set pool->ops = NULL when pool->zpool_ops is null (patch 4).
* Stop holding pool's lock when calling lock_zspage() (patch 5).
(suggested by Sergey Senozhatsky)
* Stop holding pool's lock when checking pool->ops and retries.
(patch 5) (suggested by Sergey Senozhatsky)
* Fix formatting issues (.shrink, extra spaces in casting removed).
(patch 5) (suggested by Sergey Senozhatsky)
v2:
* Add missing CONFIG_ZPOOL ifdefs (patch 5)
(detected by kernel test robot).
Unlike zswap's other allocators, such as zbud or z3fold, zsmalloc
currently lacks a writeback mechanism. This means that when the zswap
pool is full, it will simply reject further allocations, and the pages
will be written directly to swap.
This series of patches implements writeback for zsmalloc. When the zswap
pool becomes full, zsmalloc will attempt to evict all the compressed
objects in the least recently used zspages.
There are 6 patches in this series:
Johannes Weiner (2):
zswap: fix writeback lock ordering for zsmalloc
zpool: clean out dead code
Nhat Pham (4):
zsmalloc: Consolidate zs_pool's migrate_lock and size_class's locks
zsmalloc: Add a LRU to zs_pool to keep track of zspages in LRU order
zsmalloc: Add zpool_ops field to zs_pool to store evict handlers
zsmalloc: Implement writeback mechanism for zsmalloc
mm/z3fold.c | 36 +-----
mm/zbud.c | 32 +----
mm/zpool.c | 10 +-
mm/zsmalloc.c | 325 ++++++++++++++++++++++++++++++++++++++++----------
mm/zswap.c | 37 +++---
5 files changed, 295 insertions(+), 145 deletions(-)
--
2.30.2
From: Johannes Weiner <[email protected]>
zswap's customary lock order is tree->lock before pool->lock, because
the tree->lock protects the entries' refcount, and the free callbacks in
the backends acquire their respective pool locks to dispatch the backing
object. zsmalloc's map callback takes the pool lock, so zswap must not
grab the tree->lock while a handle is mapped. This currently only
happens during writeback, which isn't implemented for zsmalloc. In
preparation for it, move the tree->lock section out of the mapped entry
section.
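Roughly, the reordered flow in zswap_writeback_entry() becomes the
following (an abridged sketch of the hunks below; error handling and
the compress/decompress details are omitted):

    zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
    swpentry = zhdr->swpentry;
    tree = zswap_trees[swp_type(swpentry)];
    offset = swp_offset(swpentry);
    zpool_unmap_handle(pool, handle);   /* unmap before taking tree->lock */

    spin_lock(&tree->lock);             /* safe: no handle mapped here */
    entry = zswap_entry_find_get(&tree->rbroot, offset);
    spin_unlock(&tree->lock);

    /* ... allocate the swap cache page ... */

    zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);  /* re-map for the copy */
    src = (u8 *)zhdr + sizeof(struct zswap_header);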
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
mm/zswap.c | 37 ++++++++++++++++++++-----------------
1 file changed, 20 insertions(+), 17 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 2d48fd59cc7a..2d69c1d678fe 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -958,7 +958,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
};
if (!zpool_can_sleep_mapped(pool)) {
- tmp = kmalloc(PAGE_SIZE, GFP_ATOMIC);
+ tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!tmp)
return -ENOMEM;
}
@@ -968,6 +968,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
swpentry = zhdr->swpentry; /* here */
tree = zswap_trees[swp_type(swpentry)];
offset = swp_offset(swpentry);
+ zpool_unmap_handle(pool, handle);
/* find and ref zswap entry */
spin_lock(&tree->lock);
@@ -975,20 +976,12 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
if (!entry) {
/* entry was invalidated */
spin_unlock(&tree->lock);
- zpool_unmap_handle(pool, handle);
kfree(tmp);
return 0;
}
spin_unlock(&tree->lock);
BUG_ON(offset != entry->offset);
- src = (u8 *)zhdr + sizeof(struct zswap_header);
- if (!zpool_can_sleep_mapped(pool)) {
- memcpy(tmp, src, entry->length);
- src = tmp;
- zpool_unmap_handle(pool, handle);
- }
-
/* try to allocate swap cache page */
switch (zswap_get_swap_cache_page(swpentry, &page)) {
case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
@@ -1006,6 +999,14 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
dlen = PAGE_SIZE;
+ zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
+ src = (u8 *)zhdr + sizeof(struct zswap_header);
+ if (!zpool_can_sleep_mapped(pool)) {
+ memcpy(tmp, src, entry->length);
+ src = tmp;
+ zpool_unmap_handle(pool, handle);
+ }
+
mutex_lock(acomp_ctx->mutex);
sg_init_one(&input, src, entry->length);
sg_init_table(&output, 1);
@@ -1015,6 +1016,11 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
dlen = acomp_ctx->req->dlen;
mutex_unlock(acomp_ctx->mutex);
+ if (!zpool_can_sleep_mapped(pool))
+ kfree(tmp);
+ else
+ zpool_unmap_handle(pool, handle);
+
BUG_ON(ret);
BUG_ON(dlen != PAGE_SIZE);
@@ -1045,7 +1051,11 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
zswap_entry_put(tree, entry);
spin_unlock(&tree->lock);
- goto end;
+ return ret;
+
+fail:
+ if (!zpool_can_sleep_mapped(pool))
+ kfree(tmp);
/*
* if we get here due to ZSWAP_SWAPCACHE_EXIST
@@ -1054,17 +1064,10 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
* if we free the entry in the following put
* it is also okay to return !0
*/
-fail:
spin_lock(&tree->lock);
zswap_entry_put(tree, entry);
spin_unlock(&tree->lock);
-end:
- if (zpool_can_sleep_mapped(pool))
- zpool_unmap_handle(pool, handle);
- else
- kfree(tmp);
-
return ret;
}
--
2.30.2
From: Johannes Weiner <[email protected]>
There is a lot of provision for flexibility that isn't actually needed
or used. Zswap (the only zpool user) always passes zpool_ops with an
.evict method set. The backends who reclaim only do so for zswap, so
they can also directly call zpool_ops without indirection or checks.
Finally, there is no need to check the retries parameter and bail
with -EINVAL in the reclaim function, when that's called just a few
lines below with a hard-coded 8. There is no need to duplicate the
evictable and sleep_mapped attrs from the driver in struct zpool.
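Concretely, the per-backend indirection (e.g. zbud_zpool_evict()) goes
away and the reclaim paths call the zswap-provided callback directly,
roughly:

    ret = pool->zpool_ops->evict(pool->zpool, handle);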
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
mm/z3fold.c | 36 +++++-------------------------------
mm/zbud.c | 32 +++++---------------------------
mm/zpool.c | 10 ++--------
3 files changed, 12 insertions(+), 66 deletions(-)
diff --git a/mm/z3fold.c b/mm/z3fold.c
index cf71da10d04e..a4de0c317ac7 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -68,9 +68,6 @@
* Structures
*****************/
struct z3fold_pool;
-struct z3fold_ops {
- int (*evict)(struct z3fold_pool *pool, unsigned long handle);
-};
enum buddy {
HEADLESS = 0,
@@ -138,8 +135,6 @@ struct z3fold_header {
* @stale: list of pages marked for freeing
* @pages_nr: number of z3fold pages in the pool.
* @c_handle: cache for z3fold_buddy_slots allocation
- * @ops: pointer to a structure of user defined operations specified at
- * pool creation time.
* @zpool: zpool driver
* @zpool_ops: zpool operations structure with an evict callback
* @compact_wq: workqueue for page layout background optimization
@@ -158,7 +153,6 @@ struct z3fold_pool {
struct list_head stale;
atomic64_t pages_nr;
struct kmem_cache *c_handle;
- const struct z3fold_ops *ops;
struct zpool *zpool;
const struct zpool_ops *zpool_ops;
struct workqueue_struct *compact_wq;
@@ -907,13 +901,11 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
* z3fold_create_pool() - create a new z3fold pool
* @name: pool name
* @gfp: gfp flags when allocating the z3fold pool structure
- * @ops: user-defined operations for the z3fold pool
*
* Return: pointer to the new z3fold pool or NULL if the metadata allocation
* failed.
*/
-static struct z3fold_pool *z3fold_create_pool(const char *name, gfp_t gfp,
- const struct z3fold_ops *ops)
+static struct z3fold_pool *z3fold_create_pool(const char *name, gfp_t gfp)
{
struct z3fold_pool *pool = NULL;
int i, cpu;
@@ -949,7 +941,6 @@ static struct z3fold_pool *z3fold_create_pool(const char *name, gfp_t gfp,
if (!pool->release_wq)
goto out_wq;
INIT_WORK(&pool->work, free_pages_work);
- pool->ops = ops;
return pool;
out_wq:
@@ -1230,10 +1221,6 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries)
slots.pool = (unsigned long)pool | (1 << HANDLES_NOFREE);
spin_lock(&pool->lock);
- if (!pool->ops || !pool->ops->evict || retries == 0) {
- spin_unlock(&pool->lock);
- return -EINVAL;
- }
for (i = 0; i < retries; i++) {
if (list_empty(&pool->lru)) {
spin_unlock(&pool->lock);
@@ -1319,17 +1306,17 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries)
}
/* Issue the eviction callback(s) */
if (middle_handle) {
- ret = pool->ops->evict(pool, middle_handle);
+ ret = pool->zpool_ops->evict(pool->zpool, middle_handle);
if (ret)
goto next;
}
if (first_handle) {
- ret = pool->ops->evict(pool, first_handle);
+ ret = pool->zpool_ops->evict(pool->zpool, first_handle);
if (ret)
goto next;
}
if (last_handle) {
- ret = pool->ops->evict(pool, last_handle);
+ ret = pool->zpool_ops->evict(pool->zpool, last_handle);
if (ret)
goto next;
}
@@ -1593,26 +1580,13 @@ static const struct movable_operations z3fold_mops = {
* zpool
****************/
-static int z3fold_zpool_evict(struct z3fold_pool *pool, unsigned long handle)
-{
- if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict)
- return pool->zpool_ops->evict(pool->zpool, handle);
- else
- return -ENOENT;
-}
-
-static const struct z3fold_ops z3fold_zpool_ops = {
- .evict = z3fold_zpool_evict
-};
-
static void *z3fold_zpool_create(const char *name, gfp_t gfp,
const struct zpool_ops *zpool_ops,
struct zpool *zpool)
{
struct z3fold_pool *pool;
- pool = z3fold_create_pool(name, gfp,
- zpool_ops ? &z3fold_zpool_ops : NULL);
+ pool = z3fold_create_pool(name, gfp);
if (pool) {
pool->zpool = zpool;
pool->zpool_ops = zpool_ops;
diff --git a/mm/zbud.c b/mm/zbud.c
index 6348932430b8..3acd26193920 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -74,10 +74,6 @@
struct zbud_pool;
-struct zbud_ops {
- int (*evict)(struct zbud_pool *pool, unsigned long handle);
-};
-
/**
* struct zbud_pool - stores metadata for each zbud pool
* @lock: protects all pool fields and first|last_chunk fields of any
@@ -90,8 +86,6 @@ struct zbud_ops {
* @lru: list tracking the zbud pages in LRU order by most recently
* added buddy.
* @pages_nr: number of zbud pages in the pool.
- * @ops: pointer to a structure of user defined operations specified at
- * pool creation time.
* @zpool: zpool driver
* @zpool_ops: zpool operations structure with an evict callback
*
@@ -110,7 +104,6 @@ struct zbud_pool {
};
struct list_head lru;
u64 pages_nr;
- const struct zbud_ops *ops;
struct zpool *zpool;
const struct zpool_ops *zpool_ops;
};
@@ -212,12 +205,11 @@ static int num_free_chunks(struct zbud_header *zhdr)
/**
* zbud_create_pool() - create a new zbud pool
* @gfp: gfp flags when allocating the zbud pool structure
- * @ops: user-defined operations for the zbud pool
*
* Return: pointer to the new zbud pool or NULL if the metadata allocation
* failed.
*/
-static struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops)
+static struct zbud_pool *zbud_create_pool(gfp_t gfp)
{
struct zbud_pool *pool;
int i;
@@ -231,7 +223,6 @@ static struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops)
INIT_LIST_HEAD(&pool->buddied);
INIT_LIST_HEAD(&pool->lru);
pool->pages_nr = 0;
- pool->ops = ops;
return pool;
}
@@ -419,8 +410,7 @@ static int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
unsigned long first_handle = 0, last_handle = 0;
spin_lock(&pool->lock);
- if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) ||
- retries == 0) {
+ if (list_empty(&pool->lru)) {
spin_unlock(&pool->lock);
return -EINVAL;
}
@@ -444,12 +434,12 @@ static int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
/* Issue the eviction callback(s) */
if (first_handle) {
- ret = pool->ops->evict(pool, first_handle);
+ ret = pool->zpool_ops->evict(pool->zpool, first_handle);
if (ret)
goto next;
}
if (last_handle) {
- ret = pool->ops->evict(pool, last_handle);
+ ret = pool->zpool_ops->evict(pool->zpool, last_handle);
if (ret)
goto next;
}
@@ -524,25 +514,13 @@ static u64 zbud_get_pool_size(struct zbud_pool *pool)
* zpool
****************/
-static int zbud_zpool_evict(struct zbud_pool *pool, unsigned long handle)
-{
- if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict)
- return pool->zpool_ops->evict(pool->zpool, handle);
- else
- return -ENOENT;
-}
-
-static const struct zbud_ops zbud_zpool_ops = {
- .evict = zbud_zpool_evict
-};
-
static void *zbud_zpool_create(const char *name, gfp_t gfp,
const struct zpool_ops *zpool_ops,
struct zpool *zpool)
{
struct zbud_pool *pool;
- pool = zbud_create_pool(gfp, zpool_ops ? &zbud_zpool_ops : NULL);
+ pool = zbud_create_pool(gfp);
if (pool) {
pool->zpool = zpool;
pool->zpool_ops = zpool_ops;
diff --git a/mm/zpool.c b/mm/zpool.c
index 68facc193496..fc3a9893e107 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -21,9 +21,6 @@
struct zpool {
struct zpool_driver *driver;
void *pool;
- const struct zpool_ops *ops;
- bool evictable;
- bool can_sleep_mapped;
};
static LIST_HEAD(drivers_head);
@@ -177,9 +174,6 @@ struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp,
zpool->driver = driver;
zpool->pool = driver->create(name, gfp, ops, zpool);
- zpool->ops = ops;
- zpool->evictable = driver->shrink && ops && ops->evict;
- zpool->can_sleep_mapped = driver->sleep_mapped;
if (!zpool->pool) {
pr_err("couldn't create %s pool\n", type);
@@ -380,7 +374,7 @@ u64 zpool_get_total_size(struct zpool *zpool)
*/
bool zpool_evictable(struct zpool *zpool)
{
- return zpool->evictable;
+ return zpool->driver->shrink;
}
/**
@@ -391,7 +385,7 @@ bool zpool_evictable(struct zpool *zpool)
*/
bool zpool_can_sleep_mapped(struct zpool *zpool)
{
- return zpool->can_sleep_mapped;
+ return zpool->driver->sleep_mapped;
}
MODULE_LICENSE("GPL");
--
2.30.2
This commit adds the writeback mechanism for zsmalloc, analogous to the
zbud allocator. Zsmalloc will attempt to determine the coldest zspage
(i.e., least recently used) in the pool, and attempt to write back all the
stored compressed objects via the pool's evict handler.
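At a high level, the new zs_reclaim_page() below takes the LRU tail
zspage and hands every allocated object to the zpool evict handler; a
condensed sketch (locking, the page walk, fullness bookkeeping and the
deferred-handle list are all in the actual diff):

    zspage = list_last_entry(&pool->lru, struct zspage, lru);
    /* zs_free() may free objects, but not the zspage or handles */
    zspage->under_reclaim = true;
    ...
    while ((handle = find_alloced_obj(class, page, &obj_idx))) {
            /* the evict handler writes the object back and calls zs_free() */
            ret = pool->zpool_ops->evict(pool->zpool, handle);
            if (ret)
                    break;
            obj_idx++;
    }
    ...
    if (!get_zspage_inuse(zspage))
            __free_zspage(pool, class, zspage);     /* also frees deferred handles */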
Signed-off-by: Nhat Pham <[email protected]>
---
mm/zsmalloc.c | 193 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 182 insertions(+), 11 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 9920f3584511..3fba04e10227 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -271,12 +271,13 @@ struct zspage {
#ifdef CONFIG_ZPOOL
/* links the zspage to the lru list in the pool */
struct list_head lru;
+ bool under_reclaim;
+ /* list of unfreed handles whose objects have been reclaimed */
+ unsigned long *deferred_handles;
#endif
struct zs_pool *pool;
-#ifdef CONFIG_COMPACTION
rwlock_t lock;
-#endif
};
struct mapping_area {
@@ -297,10 +298,11 @@ static bool ZsHugePage(struct zspage *zspage)
return zspage->huge;
}
-#ifdef CONFIG_COMPACTION
static void migrate_lock_init(struct zspage *zspage);
static void migrate_read_lock(struct zspage *zspage);
static void migrate_read_unlock(struct zspage *zspage);
+
+#ifdef CONFIG_COMPACTION
static void migrate_write_lock(struct zspage *zspage);
static void migrate_write_lock_nested(struct zspage *zspage);
static void migrate_write_unlock(struct zspage *zspage);
@@ -308,9 +310,6 @@ static void kick_deferred_free(struct zs_pool *pool);
static void init_deferred_free(struct zs_pool *pool);
static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
#else
-static void migrate_lock_init(struct zspage *zspage) {}
-static void migrate_read_lock(struct zspage *zspage) {}
-static void migrate_read_unlock(struct zspage *zspage) {}
static void migrate_write_lock(struct zspage *zspage) {}
static void migrate_write_lock_nested(struct zspage *zspage) {}
static void migrate_write_unlock(struct zspage *zspage) {}
@@ -413,6 +412,27 @@ static void zs_zpool_free(void *pool, unsigned long handle)
zs_free(pool, handle);
}
+static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries);
+
+static int zs_zpool_shrink(void *pool, unsigned int pages,
+ unsigned int *reclaimed)
+{
+ unsigned int total = 0;
+ int ret = -EINVAL;
+
+ while (total < pages) {
+ ret = zs_reclaim_page(pool, 8);
+ if (ret < 0)
+ break;
+ total++;
+ }
+
+ if (reclaimed)
+ *reclaimed = total;
+
+ return ret;
+}
+
static void *zs_zpool_map(void *pool, unsigned long handle,
enum zpool_mapmode mm)
{
@@ -451,6 +471,7 @@ static struct zpool_driver zs_zpool_driver = {
.malloc_support_movable = true,
.malloc = zs_zpool_malloc,
.free = zs_zpool_free,
+ .shrink = zs_zpool_shrink,
.map = zs_zpool_map,
.unmap = zs_zpool_unmap,
.total_size = zs_zpool_total_size,
@@ -924,6 +945,25 @@ static int trylock_zspage(struct zspage *zspage)
return 0;
}
+#ifdef CONFIG_ZPOOL
+/*
+ * Free all the deferred handles whose objects are freed in zs_free.
+ */
+static void free_handles(struct zs_pool *pool, struct zspage *zspage)
+{
+ unsigned long handle = (unsigned long)zspage->deferred_handles;
+
+ while (handle) {
+ unsigned long nxt_handle = handle_to_obj(handle);
+
+ cache_free_handle(pool, handle);
+ handle = nxt_handle;
+ }
+}
+#else
+static inline void free_handles(struct zs_pool *pool, struct zspage *zspage) {}
+#endif
+
static void __free_zspage(struct zs_pool *pool, struct size_class *class,
struct zspage *zspage)
{
@@ -938,6 +978,9 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
VM_BUG_ON(get_zspage_inuse(zspage));
VM_BUG_ON(fg != ZS_EMPTY);
+ /* Free all deferred handles from zs_free */
+ free_handles(pool, zspage);
+
next = page = get_first_page(zspage);
do {
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1023,6 +1066,8 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
#ifdef CONFIG_ZPOOL
INIT_LIST_HEAD(&zspage->lru);
+ zspage->under_reclaim = false;
+ zspage->deferred_handles = NULL;
#endif
set_freeobj(zspage, 0);
@@ -1535,12 +1580,26 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
obj_free(class->size, obj);
class_stat_dec(class, OBJ_USED, 1);
+
+#ifdef CONFIG_ZPOOL
+ if (zspage->under_reclaim) {
+ /*
+ * Reclaim needs the handles during writeback. It'll free
+ * them along with the zspage when it's done with them.
+ *
+ * Record current deferred handle at the memory location
+ * whose address is given by handle.
+ */
+ record_obj(handle, (unsigned long)zspage->deferred_handles);
+ zspage->deferred_handles = (unsigned long *)handle;
+ spin_unlock(&pool->lock);
+ return;
+ }
+#endif
fullness = fix_fullness_group(class, zspage);
- if (fullness != ZS_EMPTY)
- goto out;
+ if (fullness == ZS_EMPTY)
+ free_zspage(pool, class, zspage);
- free_zspage(pool, class, zspage);
-out:
spin_unlock(&pool->lock);
cache_free_handle(pool, handle);
}
@@ -1740,7 +1799,7 @@ static enum fullness_group putback_zspage(struct size_class *class,
return fullness;
}
-#ifdef CONFIG_COMPACTION
+#if defined(CONFIG_ZPOOL) || defined(CONFIG_COMPACTION)
/*
* To prevent zspage destroy during migration, zspage freeing should
* hold locks of all pages in the zspage.
@@ -1782,6 +1841,24 @@ static void lock_zspage(struct zspage *zspage)
}
migrate_read_unlock(zspage);
}
+#endif /* defined(CONFIG_ZPOOL) || defined(CONFIG_COMPACTION) */
+
+#ifdef CONFIG_ZPOOL
+/*
+ * Unlocks all the pages of the zspage.
+ *
+ * pool->lock must be held before this function is called
+ * to prevent the underlying pages from migrating.
+ */
+static void unlock_zspage(struct zspage *zspage)
+{
+ struct page *page = get_first_page(zspage);
+
+ do {
+ unlock_page(page);
+ } while ((page = get_next_page(page)) != NULL);
+}
+#endif /* CONFIG_ZPOOL */
static void migrate_lock_init(struct zspage *zspage)
{
@@ -1798,6 +1875,7 @@ static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
read_unlock(&zspage->lock);
}
+#ifdef CONFIG_COMPACTION
static void migrate_write_lock(struct zspage *zspage)
{
write_lock(&zspage->lock);
@@ -2362,6 +2440,99 @@ void zs_destroy_pool(struct zs_pool *pool)
}
EXPORT_SYMBOL_GPL(zs_destroy_pool);
+#ifdef CONFIG_ZPOOL
+static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
+{
+ int i, obj_idx, ret = 0;
+ unsigned long handle;
+ struct zspage *zspage;
+ struct page *page;
+ enum fullness_group fullness;
+
+ /* Lock LRU and fullness list */
+ spin_lock(&pool->lock);
+ if (list_empty(&pool->lru)) {
+ spin_unlock(&pool->lock);
+ return -EINVAL;
+ }
+
+ for (i = 0; i < retries; i++) {
+ struct size_class *class;
+
+ zspage = list_last_entry(&pool->lru, struct zspage, lru);
+ list_del(&zspage->lru);
+
+ /* zs_free may free objects, but not the zspage and handles */
+ zspage->under_reclaim = true;
+
+ class = zspage_class(pool, zspage);
+ fullness = get_fullness_group(class, zspage);
+
+ /* Lock out object allocations and object compaction */
+ remove_zspage(class, zspage, fullness);
+
+ spin_unlock(&pool->lock);
+
+ /* Lock backing pages into place */
+ lock_zspage(zspage);
+
+ obj_idx = 0;
+ page = zspage->first_page;
+ while (1) {
+ handle = find_alloced_obj(class, page, &obj_idx);
+ if (!handle) {
+ page = get_next_page(page);
+ if (!page)
+ break;
+ obj_idx = 0;
+ continue;
+ }
+
+ /*
+ * This will write the object and call zs_free.
+ *
+ * zs_free will free the object, but the
+ * under_reclaim flag prevents it from freeing
+ * the zspage altogether. This is necessary so
+ * that we can continue working with the
+ * zspage potentially after the last object
+ * has been freed.
+ */
+ ret = pool->zpool_ops->evict(pool->zpool, handle);
+ if (ret)
+ goto next;
+
+ obj_idx++;
+ }
+
+next:
+ /* For freeing the zspage, or putting it back in the pool and LRU list. */
+ spin_lock(&pool->lock);
+ zspage->under_reclaim = false;
+
+ if (!get_zspage_inuse(zspage)) {
+ /*
+ * Fullness went stale as zs_free() won't touch it
+ * while the page is removed from the pool. Fix it
+ * up for the check in __free_zspage().
+ */
+ zspage->fullness = ZS_EMPTY;
+
+ __free_zspage(pool, class, zspage);
+ spin_unlock(&pool->lock);
+ return 0;
+ }
+
+ putback_zspage(class, zspage);
+ list_add(&zspage->lru, &pool->lru);
+ unlock_zspage(zspage);
+ }
+
+ spin_unlock(&pool->lock);
+ return -EAGAIN;
+}
+#endif /* CONFIG_ZPOOL */
+
static int __init zs_init(void)
{
int ret;
--
2.30.2
On Fri, Nov 18, 2022 at 04:15:36PM -0800, Nhat Pham wrote:
> This commit adds the writeback mechanism for zsmalloc, analogous to the
> zbud allocator. Zsmalloc will attempt to determine the coldest zspage
> (i.e least recently used) in the pool, and attempt to write back all the
> stored compressed objects via the pool's evict handler.
>
> Signed-off-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Excellent!
On Fri, Nov 18, 2022 at 04:15:36PM -0800, Nhat Pham wrote:
> This commit adds the writeback mechanism for zsmalloc, analogous to the
> zbud allocator. Zsmalloc will attempt to determine the coldest zspage
> (i.e least recently used) in the pool, and attempt to write back all the
> stored compressed objects via the pool's evict handler.
>
> Signed-off-by: Nhat Pham <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Thanks.
Hi Andrew, looks like Minchan is on board with the series - the concerns
with the latter patches have been resolved. Feel free to cherry-pick
this series back to your mm-unstable branch!
On (22/11/18 16:15), Nhat Pham wrote:
> +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries);
> +
> +static int zs_zpool_shrink(void *pool, unsigned int pages,
> + unsigned int *reclaimed)
> +{
> + unsigned int total = 0;
> + int ret = -EINVAL;
> +
> + while (total < pages) {
> + ret = zs_reclaim_page(pool, 8);
Just curious why 8 retries and how was 8 picked?
> + if (ret < 0)
> + break;
> + total++;
> + }
> +
> + if (reclaimed)
> + *reclaimed = total;
> +
> + return ret;
> +}
On (22/11/18 16:15), Nhat Pham wrote:
> +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
> +{
> + int i, obj_idx, ret = 0;
> + unsigned long handle;
> + struct zspage *zspage;
> + struct page *page;
> + enum fullness_group fullness;
> +
> + /* Lock LRU and fullness list */
> + spin_lock(&pool->lock);
> + if (list_empty(&pool->lru)) {
> + spin_unlock(&pool->lock);
> + return -EINVAL;
> + }
> +
> + for (i = 0; i < retries; i++) {
> + struct size_class *class;
> +
> + zspage = list_last_entry(&pool->lru, struct zspage, lru);
> + list_del(&zspage->lru);
> +
> + /* zs_free may free objects, but not the zspage and handles */
> + zspage->under_reclaim = true;
> +
> + class = zspage_class(pool, zspage);
> + fullness = get_fullness_group(class, zspage);
> +
> + /* Lock out object allocations and object compaction */
> + remove_zspage(class, zspage, fullness);
> +
> + spin_unlock(&pool->lock);
> +
> + /* Lock backing pages into place */
> + lock_zspage(zspage);
> +
> + obj_idx = 0;
> + page = zspage->first_page;
A nit: we usually call get_first_page() in such cases.
> + while (1) {
> + handle = find_alloced_obj(class, page, &obj_idx);
> + if (!handle) {
> + page = get_next_page(page);
> + if (!page)
> + break;
> + obj_idx = 0;
> + continue;
> + }
> +
> + /*
> + * This will write the object and call zs_free.
> + *
> + * zs_free will free the object, but the
> + * under_reclaim flag prevents it from freeing
> + * the zspage altogether. This is necessary so
> + * that we can continue working with the
> + * zspage potentially after the last object
> + * has been freed.
> + */
> + ret = pool->zpool_ops->evict(pool->zpool, handle);
> + if (ret)
> + goto next;
> +
> + obj_idx++;
> + }
On (22/11/18 16:15), Nhat Pham wrote:
> There is a lot of provision for flexibility that isn't actually needed
> or used. Zswap (the only zpool user) always passes zpool_ops with an
> .evict method set. The backends who reclaim only do so for zswap, so
> they can also directly call zpool_ops without indirection or checks.
>
> Finally, there is no need to check the retries parameters and bail
> with -EINVAL in the reclaim function, when that's called just a few
> lines below with a hard-coded 8. There is no need to duplicate the
> evictable and sleep_mapped attrs from the driver in zpool_ops.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
On (22/11/18 16:15), Nhat Pham wrote:
> +
> +static int zs_zpool_shrink(void *pool, unsigned int pages,
> + unsigned int *reclaimed)
> +{
> + unsigned int total = 0;
> + int ret = -EINVAL;
> +
> + while (total < pages) {
> + ret = zs_reclaim_page(pool, 8);
> + if (ret < 0)
> + break;
> + total++;
> + }
> +
> + if (reclaimed)
> + *reclaimed = total;
> +
> + return ret;
> +}
A silly question: why do we need a retry loop in zs_reclaim_page()?
On (22/11/18 16:15), Nhat Pham wrote:
> @@ -958,7 +958,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
> };
>
> if (!zpool_can_sleep_mapped(pool)) {
> - tmp = kmalloc(PAGE_SIZE, GFP_ATOMIC);
> + tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
> if (!tmp)
> return -ENOMEM;
> }
I guess this chunk is not related to the zsmalloc lock ordering fix.
Should it be a separate patch? And feel free to squash my patch,
which does a similar thing:
https://lore.kernel.org/all/[email protected]/
On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> On (22/11/18 16:15), Nhat Pham wrote:
> > +
> > +static int zs_zpool_shrink(void *pool, unsigned int pages,
> > + unsigned int *reclaimed)
> > +{
> > + unsigned int total = 0;
> > + int ret = -EINVAL;
> > +
> > + while (total < pages) {
> > + ret = zs_reclaim_page(pool, 8);
> > + if (ret < 0)
> > + break;
> > + total++;
> > + }
> > +
> > + if (reclaimed)
> > + *reclaimed = total;
> > +
> > + return ret;
> > +}
>
> A silly question: why do we need a retry loop in zs_reclaim_page()?
Individual objects in a zspage can be busy (swapped in simultaneously
for example), which will prevent the zspage from being freed. Zswap
currently requests reclaim of one backend page at a time (another
project...), so if we don't retry we won't meet the reclaim goal and
will cause rejections for new stores. Rejections are worse than moving
on to the adjacent LRU item, because a rejected page, which should be
at the head of the LRU, bypasses the list and goes straight to swap.
The number 8 is cribbed from zbud and z3fold. It works well in
practice, but is as arbitrary as MAX_RECLAIM_RETRIES used all over MM.
We may want to revisit it at some point, but we should probably do it
for all backends then.
On (22/11/21 22:12), Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > On (22/11/18 16:15), Nhat Pham wrote:
> > > +
> > > +static int zs_zpool_shrink(void *pool, unsigned int pages,
> > > + unsigned int *reclaimed)
> > > +{
> > > + unsigned int total = 0;
> > > + int ret = -EINVAL;
> > > +
> > > + while (total < pages) {
> > > + ret = zs_reclaim_page(pool, 8);
> > > + if (ret < 0)
> > > + break;
> > > + total++;
> > > + }
> > > +
> > > + if (reclaimed)
> > > + *reclaimed = total;
> > > +
> > > + return ret;
> > > +}
> >
> > A silly question: why do we need a retry loop in zs_reclaim_page()?
>
> Individual objects in a zspage can be busy (swapped in simultaneously
> for example), which will prevent the zspage from being freed. Zswap
> currently requests reclaim of one backend page at a time (another
> project...), so if we don't retry we're not meeting the reclaim goal
> and cause rejections for new stores.
What I meant was: if zs_reclaim_page() makes only partial progress
with the current LRU tail zspage and returns -EAGAIN, then we just
don't increment `total` and continue looping in zs_zpool_shrink().
On each iteration zs_reclaim_page() picks the new LRU tail (if any)
and tries to write it back.
> The number 8 is cribbed from zbud and z3fold.
OK.
On Tue, Nov 22, 2022 at 12:42:20PM +0900, Sergey Senozhatsky wrote:
> On (22/11/21 22:12), Johannes Weiner wrote:
> > On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > > On (22/11/18 16:15), Nhat Pham wrote:
> > > > +
> > > > +static int zs_zpool_shrink(void *pool, unsigned int pages,
> > > > + unsigned int *reclaimed)
> > > > +{
> > > > + unsigned int total = 0;
> > > > + int ret = -EINVAL;
> > > > +
> > > > + while (total < pages) {
> > > > + ret = zs_reclaim_page(pool, 8);
> > > > + if (ret < 0)
> > > > + break;
> > > > + total++;
> > > > + }
> > > > +
> > > > + if (reclaimed)
> > > > + *reclaimed = total;
> > > > +
> > > > + return ret;
> > > > +}
> > >
> > > A silly question: why do we need a retry loop in zs_reclaim_page()?
> >
> > Individual objects in a zspage can be busy (swapped in simultaneously
> > for example), which will prevent the zspage from being freed. Zswap
> > currently requests reclaim of one backend page at a time (another
> > project...), so if we don't retry we're not meeting the reclaim goal
> > and cause rejections for new stores.
>
> What I meant was: if zs_reclaim_page() makes only partial progress
> with the current LRU tail zspage and returns -EAGAIN, then we just
> don't increment `total` and continue looping in zs_zpool_shrink().
Hm, but it breaks on -EAGAIN, it doesn't continue.
This makes sense, IMO. zs_reclaim_page() will try to reclaim one page,
but considers up to 8 LRU tail pages until one succeeds. If it does,
it continues (total++). But if one of these calls fails, we exit the
loop, give up and return failure from zs_zpool_shrink().
On (22/11/22 01:09), Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 12:42:20PM +0900, Sergey Senozhatsky wrote:
> > On (22/11/21 22:12), Johannes Weiner wrote:
> > > On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > > > On (22/11/18 16:15), Nhat Pham wrote:
[..]
> > What I meant was: if zs_reclaim_page() makes only partial progress
> > with the current LRU tail zspage and returns -EAGAIN, then we just
> > don't increment `total` and continue looping in zs_zpool_shrink().
>
> Hm, but it breaks on -EAGAIN, it doesn't continue.
Yes. "What if it would continue". Would it make sense to not
break on EAGAIN?
while (total < pages) {
ret = zs_reclaim_page(pool);
if (ret == -EAGAIN)
continue;
if (ret < 0)
break;
total++;
}
Then we don't need retry loop in zs_reclaim_page().
On (22/11/18 16:15), Nhat Pham wrote:
[..]
> +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
> +{
> + int i, obj_idx, ret = 0;
> + unsigned long handle;
> + struct zspage *zspage;
> + struct page *page;
> + enum fullness_group fullness;
> +
> + /* Lock LRU and fullness list */
> + spin_lock(&pool->lock);
> + if (list_empty(&pool->lru)) {
> + spin_unlock(&pool->lock);
> + return -EINVAL;
> + }
> +
> + for (i = 0; i < retries; i++) {
> + struct size_class *class;
> +
> + zspage = list_last_entry(&pool->lru, struct zspage, lru);
> + list_del(&zspage->lru);
> +
> + /* zs_free may free objects, but not the zspage and handles */
> + zspage->under_reclaim = true;
> +
> + class = zspage_class(pool, zspage);
> + fullness = get_fullness_group(class, zspage);
> +
> + /* Lock out object allocations and object compaction */
> + remove_zspage(class, zspage, fullness);
> +
> + spin_unlock(&pool->lock);
> +
> + /* Lock backing pages into place */
> + lock_zspage(zspage);
> +
> + obj_idx = 0;
> + page = zspage->first_page;
> + while (1) {
> + handle = find_alloced_obj(class, page, &obj_idx);
> + if (!handle) {
> + page = get_next_page(page);
> + if (!page)
> + break;
> + obj_idx = 0;
> + continue;
> + }
> +
> + /*
> + * This will write the object and call zs_free.
> + *
> + * zs_free will free the object, but the
> + * under_reclaim flag prevents it from freeing
> + * the zspage altogether. This is necessary so
> + * that we can continue working with the
> + * zspage potentially after the last object
> + * has been freed.
> + */
> + ret = pool->zpool_ops->evict(pool->zpool, handle);
> + if (ret)
> + goto next;
> +
> + obj_idx++;
> + }
> +
> +next:
> + /* For freeing the zspage, or putting it back in the pool and LRU list. */
> + spin_lock(&pool->lock);
> + zspage->under_reclaim = false;
> +
> + if (!get_zspage_inuse(zspage)) {
> + /*
> + * Fullness went stale as zs_free() won't touch it
> + * while the page is removed from the pool. Fix it
> + * up for the check in __free_zspage().
> + */
> + zspage->fullness = ZS_EMPTY;
> +
> + __free_zspage(pool, class, zspage);
> + spin_unlock(&pool->lock);
> + return 0;
> + }
> +
> + putback_zspage(class, zspage);
> + list_add(&zspage->lru, &pool->lru);
> + unlock_zspage(zspage);
We probably better to cond_resched() somewhere here. Or in zs_zpool_shrink()
loop.
> + }
> +
> + spin_unlock(&pool->lock);
> + return -EAGAIN;
> +}
On Tue, Nov 22, 2022 at 03:35:18PM +0900, Sergey Senozhatsky wrote:
> On (22/11/22 01:09), Johannes Weiner wrote:
> > On Tue, Nov 22, 2022 at 12:42:20PM +0900, Sergey Senozhatsky wrote:
> > > On (22/11/21 22:12), Johannes Weiner wrote:
> > > > On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > > > > On (22/11/18 16:15), Nhat Pham wrote:
>
> [..]
>
> > > What I meant was: if zs_reclaim_page() makes only partial progress
> > > with the current LRU tail zspage and returns -EAGAIN, then we just
> > > don't increment `total` and continue looping in zs_zpool_shrink().
> >
> > Hm, but it breaks on -EAGAIN, it doesn't continue.
>
> Yes. "What if it would continue". Would it make sense to not
> break on EAGAIN?
>
> while (total < pages) {
> ret = zs_reclaim_page(pool);
> if (ret == -EAGAIN)
> continue;
> if (ret < 0)
> break;
> total++;
> }
>
> Then we don't need retry loop in zs_reclaim_page().
But that's an indefinite busy-loop?
I don't see what the problem with limited retrying in
zs_reclaim_page() is. It's robust and has worked for years.
On (22/11/22 02:10), Johannes Weiner wrote:
> > Yes. "What if it would continue". Would it make sense to not
> > break on EAGAIN?
> >
> > while (total < pages) {
> > ret = zs_reclaim_page(pool);
> > if (ret == -EAGAIN)
> > continue;
> > if (ret < 0)
> > break;
> > total++;
> > }
> >
> > Then we don't need retry loop in zs_reclaim_page().
>
> But that's an indefinite busy-loop?
That would mean that all lru pages constantly have locked objects
and we can only make partial progress.
> I don't see what the problem with limited retrying in
> zs_reclaim_page() is. It's robust and has worked for years.
No problem with it, just asking.
Thanks for the comments, Sergey!
> A nit: we usually call get_first_page() in such cases.
I'll use get_first_page() here in v7.
> We probably better to cond_resched() somewhere here. Or in zs_zpool_shrink()
> loop.
I'll put it right after releasing the pool's lock in the retry loop:
/* Lock out object allocations and object compaction */
remove_zspage(class, zspage, fullness);
spin_unlock(&pool->lock);
cond_resched();
/* Lock backing pages into place */
lock_zspage(zspage);
This will also appear in v7. In the meantime, please feel free to discuss all
the patches - I'll try to batch the changes to minimize churn.
On Tue, Nov 22, 2022 at 03:37:29PM +0900, Sergey Senozhatsky wrote:
> On (22/11/18 16:15), Nhat Pham wrote:
> [..]
> > +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
> > +{
> > + int i, obj_idx, ret = 0;
> > + unsigned long handle;
> > + struct zspage *zspage;
> > + struct page *page;
> > + enum fullness_group fullness;
> > +
> > + /* Lock LRU and fullness list */
> > + spin_lock(&pool->lock);
> > + if (list_empty(&pool->lru)) {
> > + spin_unlock(&pool->lock);
> > + return -EINVAL;
> > + }
> > +
> > + for (i = 0; i < retries; i++) {
> > + struct size_class *class;
> > +
> > + zspage = list_last_entry(&pool->lru, struct zspage, lru);
> > + list_del(&zspage->lru);
> > +
> > + /* zs_free may free objects, but not the zspage and handles */
> > + zspage->under_reclaim = true;
> > +
> > + class = zspage_class(pool, zspage);
> > + fullness = get_fullness_group(class, zspage);
> > +
> > + /* Lock out object allocations and object compaction */
> > + remove_zspage(class, zspage, fullness);
> > +
> > + spin_unlock(&pool->lock);
> > +
> > + /* Lock backing pages into place */
> > + lock_zspage(zspage);
> > +
> > + obj_idx = 0;
> > + page = zspage->first_page;
> > + while (1) {
> > + handle = find_alloced_obj(class, page, &obj_idx);
> > + if (!handle) {
> > + page = get_next_page(page);
> > + if (!page)
> > + break;
> > + obj_idx = 0;
> > + continue;
> > + }
> > +
> > + /*
> > + * This will write the object and call zs_free.
> > + *
> > + * zs_free will free the object, but the
> > + * under_reclaim flag prevents it from freeing
> > + * the zspage altogether. This is necessary so
> > + * that we can continue working with the
> > + * zspage potentially after the last object
> > + * has been freed.
> > + */
> > + ret = pool->zpool_ops->evict(pool->zpool, handle);
> > + if (ret)
> > + goto next;
> > +
> > + obj_idx++;
> > + }
> > +
> > +next:
> > + /* For freeing the zspage, or putting it back in the pool and LRU list. */
> > + spin_lock(&pool->lock);
> > + zspage->under_reclaim = false;
> > +
> > + if (!get_zspage_inuse(zspage)) {
> > + /*
> > + * Fullness went stale as zs_free() won't touch it
> > + * while the page is removed from the pool. Fix it
> > + * up for the check in __free_zspage().
> > + */
> > + zspage->fullness = ZS_EMPTY;
> > +
> > + __free_zspage(pool, class, zspage);
> > + spin_unlock(&pool->lock);
> > + return 0;
> > + }
> > +
> > + putback_zspage(class, zspage);
> > + list_add(&zspage->lru, &pool->lru);
> > + unlock_zspage(zspage);
>
> We probably better to cond_resched() somewhere here. Or in zs_zpool_shrink()
> loop.
Hm, yeah I suppose that could make sense if we try more than one page.
We always hold either the pool lock or the page locks, and we probably
don't want to schedule with the page locks held. So it would need to
actually lockbreak the pool lock. And then somebody can steal the page
and empty the LRU under us, so we need to check that on looping, too.
Something like this?
for (i = 0; i < retries; i++) {
spin_lock(&pool->lock);
if (list_empty(&pool->lru)) {
spin_unlock(&pool->lock);
return -EINVAL;
}
zspage = list_last_entry(&pool->lru, ...);
...
putback_zspage(class, zspage);
list_add(&zspage->lru, &pool->lru);
unlock_zspage(zspage);
spin_unlock(&pool->lock);
cond_resched();
}
return -EAGAIN;
On Wed, Nov 23, 2022 at 08:30:44AM -0800, Nhat Pham wrote:
> I'll put it right after releasing the pool's lock in the retry loop:
>
> /* Lock out object allocations and object compaction */
> remove_zspage(class, zspage, fullness);
>
> spin_unlock(&pool->lock);
> cond_resched();
>
> /* Lock backing pages into place */
> lock_zspage(zspage);
>
> This will also appear in v7. In the meantime, please feel free to discuss all
> the patches - I'll try to batch the changes to minimize the churning.
Oh, our emails collided. This is easier than my version :)
The suggested changes seem relatively minor, so instead of sending a v7
series of patches, I've just sent the two fixes in a separate thread.
Andrew, would you mind applying those fixes on top of patch 4 and patch
6 respectively? Thanks!