2023-11-30 19:41:30

by Nhat Pham

Subject: [PATCH v8 0/6] workload-specific and memory pressure-driven zswap writeback

Changelog:
v8:
* Fixed a couple of build errors in the case of !CONFIG_MEMCG
* Simplified the online memcg selection scheme for the zswap global
limit reclaim (suggested by Michal Hocko and Johannes Weiner)
(patch 2 and patch 3)
* Added a new kconfig to allow users to enable the zswap shrinker by
default. (suggested by Johannes Weiner) (patch 6)
v7:
* Added the mem_cgroup_iter_online() function to the API for the new
behavior (suggested by Andrew Morton) (patch 2)
* Fixed a missing list_lru_del -> list_lru_del_obj (patch 1)
v6:
* Rebase on top of latest mm-unstable.
* Fix/improve the in-code documentation of the new list_lru
manipulation functions (patch 1)
v5:
* Replace reference getting with an rcu_read_lock() section for
zswap lru modifications (suggested by Yosry)
* Add a new prep patch that allows mem_cgroup_iter() to return an
online cgroup.
* Add a callback that updates pool->next_shrink when the cgroup is
offlined (suggested by Yosry Ahmed, Johannes Weiner)
v4:
* Rename list_lru_add to list_lru_add_obj and __list_lru_add to
list_lru_add (patch 1) (suggested by Johannes Weiner and
Yosry Ahmed)
* Some cleanups on the memcg aware LRU patch (patch 2)
(suggested by Yosry Ahmed)
* Use event interface for the new per-cgroup writeback counters.
(patch 3) (suggested by Yosry Ahmed)
* Abstract zswap's lruvec states and handling into
zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
v3:
* Add a patch to export per-cgroup zswap writeback counters
* Add a patch to update zswap's kselftest
* Separate the new list_lru functions into its own prep patch
* Do not start from the top of the hierarchy when encountering a memcg
that is not online for the global limit zswap writeback (patch 2)
(suggested by Yosry Ahmed)
* Do not remove the swap entry from list_lru in
__read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
* Removed a redundant zswap pool getting (patch 2)
(reported by Ryan Roberts)
* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
(patch 5) (suggested by Yosry Ahmed)
* Remove the per-cgroup zswap shrinker knob (patch 5)
(suggested by Yosry Ahmed)
v2:
* Fix loongarch compiler errors
* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KMEM

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
perform workload-specific shrinking - a memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing back pages that belong to other memcgs. This issue has been
previously observed in practice and mitigated by simply disabling
memcg-initiated shrinking:

https://lore.kernel.org/all/[email protected]/T/#u

But this solution leaves a lot to be desired, as we still do not
have an avenue for a memcg to free up its own memory locked up in
the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).

This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e. memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.
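
Concretely, zswap's LRU becomes keyed by (memcg, NUMA node) through the
extended list_lru API added in patch 1. The fragment below is adapted
(and simplified) from zswap_lru_add() in patch 6 of this series, shown
only to illustrate the shape of the interface; see the full patch below
for the real version:

static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
{
        int nid = entry_to_nid(entry);
        struct mem_cgroup *memcg;

        rcu_read_lock();
        memcg = mem_cgroup_from_entry(entry);
        /* The entry goes on the LRU of its owning memcg and NUMA node. */
        list_lru_add(list_lru, &entry->lru, nid, memcg);
        rcu_read_unlock();
}

The shrinker side can then walk exactly the (memcg, node) slice it was
asked to reclaim from via list_lru_shrink_walk(), as done in patch 6.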

As a proof of concept, we ran the following synthetic benchmark:
build the Linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write it out and
improve the overall performance. Depending on the amount of cold data
generated, we observe a 14% to 35% reduction in kernel CPU time used
in the kernel builds.

Domenico Cerasuolo (3):
zswap: make shrinking memcg-aware
mm: memcg: add per-memcg zswap writeback stat
selftests: cgroup: update per-memcg zswap writeback selftest

Nhat Pham (3):
list_lru: allows explicit memcg and NUMA node selection
memcontrol: implement mem_cgroup_tryget_online()
zswap: shrinks zswap pool based on memory pressure

Documentation/admin-guide/mm/zswap.rst | 10 +
drivers/android/binder_alloc.c | 7 +-
fs/dcache.c | 8 +-
fs/gfs2/quota.c | 6 +-
fs/inode.c | 4 +-
fs/nfs/nfs42xattr.c | 8 +-
fs/nfsd/filecache.c | 4 +-
fs/xfs/xfs_buf.c | 6 +-
fs/xfs/xfs_dquot.c | 2 +-
fs/xfs/xfs_qm.c | 2 +-
include/linux/list_lru.h | 54 ++-
include/linux/memcontrol.h | 15 +
include/linux/mmzone.h | 2 +
include/linux/vm_event_item.h | 1 +
include/linux/zswap.h | 27 +-
mm/Kconfig | 14 +
mm/list_lru.c | 48 ++-
mm/memcontrol.c | 3 +
mm/mmzone.c | 1 +
mm/swap.h | 3 +-
mm/swap_state.c | 26 +-
mm/vmstat.c | 1 +
mm/workingset.c | 4 +-
mm/zswap.c | 456 +++++++++++++++++---
tools/testing/selftests/cgroup/test_zswap.c | 74 ++--
25 files changed, 661 insertions(+), 125 deletions(-)


base-commit: 5cdba94229e58a39ca389ad99763af29e6b0c5a5
--
2.34.1


2023-11-30 19:41:52

by Nhat Pham

Subject: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such
as memory access patterns and compressibility of the memory pages).

This patch implements a memcg- and NUMA-aware shrinker for zswap that
is initiated when there is memory pressure. The shrinker does not
have any parameter that must be tuned by the user, and can be opted in
or out on a per-memcg basis.

Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e. evicting warm pages that might be refaulted into
memory), we build in the following heuristics (a small numeric sketch of
the first two follows this list):

* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
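
For intuition, here is a small userspace model of the first two heuristics
(illustrative only, with made-up numbers; this is not kernel code): the
protected-count decay applied when entries enter the zswap LRU, and the
compression-ratio scaling of the freeable-object count reported to the
shrinker.

#include <stdio.h>

/* Models the nr_zswap_protected update in zswap_lru_add(): bump the count,
 * then decay it once it exceeds a quarter of the LRU size, so the
 * protection adapts to the workload and cannot grow without bound. */
static unsigned long protect_one_page(unsigned long nr_protected,
                                      unsigned long lru_size)
{
        nr_protected++;
        if (nr_protected > lru_size / 4)
                nr_protected /= 2;
        return nr_protected;
}

/* Models the mult_frac(nr_freeable, nr_backing, nr_stored) scaling in
 * zswap_shrinker_count(), ignoring the overflow care mult_frac() takes. */
static unsigned long scale_freeable(unsigned long nr_freeable,
                                    unsigned long nr_backing,
                                    unsigned long nr_stored)
{
        return nr_freeable * nr_backing / nr_stored;
}

int main(void)
{
        /* Hypothetical: 3000 stored pages compressed into 1000 backing pages (3:1). */
        unsigned long nr_stored = 3000, nr_backing = 1000, nr_freeable = 900;

        printf("protected after one add: %lu\n", protect_one_page(260, 1000));
        printf("freeable reported to shrinker: %lu\n",
               scale_freeable(nr_freeable, nr_backing, nr_stored));
        return 0;
}

With a 3:1 compression ratio, only about a third of the unprotected entries
are reported as freeable, so the shrinker applies less pressure exactly when
zswap is saving the most memory.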

As a proof of concept, we ran the following synthetic benchmark:
build the Linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write it out and
improve the overall performance. Depending on the amount of cold data
generated, we observe a 14% to 35% reduction in kernel CPU time used
in the kernel builds.

Signed-off-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
Documentation/admin-guide/mm/zswap.rst | 10 ++
include/linux/mmzone.h | 2 +
include/linux/zswap.h | 25 +++-
mm/Kconfig | 14 ++
mm/mmzone.c | 1 +
mm/swap_state.c | 2 +
mm/zswap.c | 185 ++++++++++++++++++++++++-
7 files changed, 233 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 45b98390e938..62fc244ec702 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -153,6 +153,16 @@ attribute, e. g.::

Setting this parameter to 100 will disable the hysteresis.

+When there is a sizable amount of cold memory residing in the zswap pool, it
+can be advantageous to proactively write these cold pages to swap and reclaim
+the memory for other use cases. By default, the zswap shrinker is disabled.
+Users can enable it as follows:
+
+ echo Y > /sys/module/zswap/parameters/shrinker_enabled
+
+This can be enabled at boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
+selected.
+
A debugfs interface is provided for various statistic about pool size, number
of pages stored, same-value filled pages and various counters for the reasons
pages are rejected.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7b1816450bfc..b23bc5390240 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -22,6 +22,7 @@
#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <linux/local_lock.h>
+#include <linux/zswap.h>
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
@@ -641,6 +642,7 @@ struct lruvec {
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
+ struct zswap_lruvec_state zswap_lruvec_state;
};

/* Isolate for asynchronous migration */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index e571e393669b..08c240e16a01 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -5,20 +5,40 @@
#include <linux/types.h>
#include <linux/mm_types.h>

+struct lruvec;
+
extern u64 zswap_pool_total_size;
extern atomic_t zswap_stored_pages;

#ifdef CONFIG_ZSWAP

+struct zswap_lruvec_state {
+ /*
+ * Number of pages in zswap that should be protected from the shrinker.
+ * This number is an estimate of the following counts:
+ *
+ * a) Recent page faults.
+ * b) Recent insertion to the zswap LRU. This includes new zswap stores,
+ * as well as recent zswap LRU rotations.
+ *
+ * These pages are likely to be warm, and might incur IO if they are written
+ * to swap.
+ */
+ atomic_long_t nr_zswap_protected;
+};
+
bool zswap_store(struct folio *folio);
bool zswap_load(struct folio *folio);
void zswap_invalidate(int type, pgoff_t offset);
void zswap_swapon(int type);
void zswap_swapoff(int type);
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
-
+void zswap_lruvec_state_init(struct lruvec *lruvec);
+void zswap_page_swapin(struct page *page);
#else

+struct zswap_lruvec_state {};
+
static inline bool zswap_store(struct folio *folio)
{
return false;
@@ -33,7 +53,8 @@ static inline void zswap_invalidate(int type, pgoff_t offset) {}
static inline void zswap_swapon(int type) {}
static inline void zswap_swapoff(int type) {}
static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
-
+static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
+static inline void zswap_page_swapin(struct page *page) {}
#endif

#endif /* _LINUX_ZSWAP_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 57cd378c73d6..ca87cdb72f11 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -61,6 +61,20 @@ config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
The cost is that if the page was never dirtied and needs to be
swapped out again, it will be re-compressed.

+config ZSWAP_SHRINKER_DEFAULT_ON
+ bool "Shrink the zswap pool on memory pressure"
+ depends on ZSWAP
+ default n
+ help
+ If selected, the zswap shrinker will be enabled, and the pages
+ stored in the zswap pool will become available for reclaim (i.e
+ written back to the backing swap device) on memory pressure.
+
+ This means that zswap writeback could happen even if the pool is
+ not yet full, or the cgroup zswap limit has not been reached,
+ reducing the chance that cold pages will reside in the zswap pool
+ and consume memory indefinitely.
+
choice
prompt "Default compressor"
depends on ZSWAP
diff --git a/mm/mmzone.c b/mm/mmzone.c
index b594d3f268fe..c01896eca736 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -78,6 +78,7 @@ void lruvec_init(struct lruvec *lruvec)

memset(lruvec, 0, sizeof(struct lruvec));
spin_lock_init(&lruvec->lru_lock);
+ zswap_lruvec_state_init(lruvec);

for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6c84236382f3..c597cec606e4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -687,6 +687,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
&page_allocated, false);
if (unlikely(page_allocated))
swap_readpage(page, false, NULL);
+ zswap_page_swapin(page);
return page;
}

@@ -862,6 +863,7 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
&page_allocated, false);
if (unlikely(page_allocated))
swap_readpage(page, false, NULL);
+ zswap_page_swapin(page);
return page;
}

diff --git a/mm/zswap.c b/mm/zswap.c
index 49b79393e472..0f086ffd7b6a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -148,6 +148,11 @@ module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
/* Number of zpools in zswap_pool (empirically determined for scalability) */
#define ZSWAP_NR_ZPOOLS 32

+/* Enable/disable memory pressure-based shrinker. */
+static bool zswap_shrinker_enabled = IS_ENABLED(
+ CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
+module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
+
/*********************************
* data structures
**********************************/
@@ -177,6 +182,8 @@ struct zswap_pool {
char tfm_name[CRYPTO_MAX_ALG_NAME];
struct list_lru list_lru;
struct mem_cgroup *next_shrink;
+ struct shrinker *shrinker;
+ atomic_t nr_stored;
};

/*
@@ -275,17 +282,26 @@ static bool zswap_can_accept(void)
DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
}

+static u64 get_zswap_pool_size(struct zswap_pool *pool)
+{
+ u64 pool_size = 0;
+ int i;
+
+ for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
+ pool_size += zpool_get_total_size(pool->zpools[i]);
+
+ return pool_size;
+}
+
static void zswap_update_total_size(void)
{
struct zswap_pool *pool;
u64 total = 0;
- int i;

rcu_read_lock();

list_for_each_entry_rcu(pool, &zswap_pools, list)
- for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
- total += zpool_get_total_size(pool->zpools[i]);
+ total += get_zswap_pool_size(pool);

rcu_read_unlock();

@@ -344,13 +360,34 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
kmem_cache_free(zswap_entry_cache, entry);
}

+/*********************************
+* zswap lruvec functions
+**********************************/
+void zswap_lruvec_state_init(struct lruvec *lruvec)
+{
+ atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
+}
+
+void zswap_page_swapin(struct page *page)
+{
+ struct lruvec *lruvec;
+
+ if (page) {
+ lruvec = folio_lruvec(page_folio(page));
+ atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+ }
+}
+
/*********************************
* lru functions
**********************************/
static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
{
+ atomic_long_t *nr_zswap_protected;
+ unsigned long lru_size, old, new;
int nid = entry_to_nid(entry);
struct mem_cgroup *memcg;
+ struct lruvec *lruvec;

/*
* Note that it is safe to use rcu_read_lock() here, even in the face of
@@ -368,6 +405,19 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
memcg = mem_cgroup_from_entry(entry);
/* will always succeed */
list_lru_add(list_lru, &entry->lru, nid, memcg);
+
+ /* Update the protection area */
+ lru_size = list_lru_count_one(list_lru, nid, memcg);
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
+ old = atomic_long_inc_return(nr_zswap_protected);
+ /*
+ * Decay to avoid overflow and adapt to changing workloads.
+ * This is based on LRU reclaim cost decaying heuristics.
+ */
+ do {
+ new = old > lru_size / 4 ? old / 2 : old;
+ } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
rcu_read_unlock();
}

@@ -389,6 +439,7 @@ static void zswap_lru_putback(struct list_lru *list_lru,
int nid = entry_to_nid(entry);
spinlock_t *lock = &list_lru->node[nid].lock;
struct mem_cgroup *memcg;
+ struct lruvec *lruvec;

rcu_read_lock();
memcg = mem_cgroup_from_entry(entry);
@@ -396,6 +447,10 @@ static void zswap_lru_putback(struct list_lru *list_lru,
/* we cannot use list_lru_add here, because it increments node's lru count */
list_lru_putback(list_lru, &entry->lru, nid, memcg);
spin_unlock(lock);
+
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
+ /* increment the protection area to account for the LRU rotation. */
+ atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
rcu_read_unlock();
}

@@ -485,6 +540,7 @@ static void zswap_free_entry(struct zswap_entry *entry)
else {
zswap_lru_del(&entry->pool->list_lru, entry);
zpool_free(zswap_find_zpool(entry), entry->handle);
+ atomic_dec(&entry->pool->nr_stored);
zswap_pool_put(entry->pool);
}
zswap_entry_cache_free(entry);
@@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
return entry;
}

+/*********************************
+* shrinker functions
+**********************************/
+static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
+ spinlock_t *lock, void *arg);
+
+static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
+ unsigned long shrink_ret, nr_protected, lru_size;
+ struct zswap_pool *pool = shrinker->private_data;
+ bool encountered_page_in_swapcache = false;
+
+ nr_protected =
+ atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+ lru_size = list_lru_shrink_count(&pool->list_lru, sc);
+
+ /*
+ * Abort if the shrinker is disabled or if we are shrinking into the
+ * protected region.
+ *
+ * This short-circuiting is necessary because if we have too many
+ * concurrent reclaimers getting the freeable zswap object counts at the
+ * same time (before any of them made reasonable progress), the total
+ * number of reclaimed objects might be more than the number of unprotected
+ * objects (i.e the reclaimers will reclaim into the protected area of the
+ * zswap LRU).
+ */
+ if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
+ sc->nr_scanned = 0;
+ return SHRINK_STOP;
+ }
+
+ shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
+ &encountered_page_in_swapcache);
+
+ if (encountered_page_in_swapcache)
+ return SHRINK_STOP;
+
+ return shrink_ret ? shrink_ret : SHRINK_STOP;
+}
+
+static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ struct zswap_pool *pool = shrinker->private_data;
+ struct mem_cgroup *memcg = sc->memcg;
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
+ unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
+
+#ifdef CONFIG_MEMCG_KMEM
+ cgroup_rstat_flush(memcg->css.cgroup);
+ nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
+ nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
+#else
+ /* use pool stats instead of memcg stats */
+ nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
+ nr_stored = atomic_read(&pool->nr_stored);
+#endif
+
+ if (!zswap_shrinker_enabled || !nr_stored)
+ return 0;
+
+ nr_protected =
+ atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+ nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
+ /*
+ * Subtract from the lru size an estimate of the number of pages
+ * that should be protected.
+ */
+ nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
+
+ /*
+ * Scale the number of freeable pages by the memory saving factor.
+ * This ensures that the better zswap compresses memory, the fewer
+ * pages we will evict to swap (as it will otherwise incur IO for
+ * relatively small memory saving).
+ */
+ return mult_frac(nr_freeable, nr_backing, nr_stored);
+}
+
+static void zswap_alloc_shrinker(struct zswap_pool *pool)
+{
+ pool->shrinker =
+ shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
+ if (!pool->shrinker)
+ return;
+
+ pool->shrinker->private_data = pool;
+ pool->shrinker->scan_objects = zswap_shrinker_scan;
+ pool->shrinker->count_objects = zswap_shrinker_count;
+ pool->shrinker->batch = 0;
+ pool->shrinker->seeks = DEFAULT_SEEKS;
+}
+
/*********************************
* per-cpu code
**********************************/
@@ -721,6 +873,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
spinlock_t *lock, void *arg)
{
struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
+ bool *encountered_page_in_swapcache = (bool *)arg;
struct zswap_tree *tree;
pgoff_t swpoffset;
enum lru_status ret = LRU_REMOVED_RETRY;
@@ -756,6 +909,17 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
zswap_reject_reclaim_fail++;
zswap_lru_putback(&entry->pool->list_lru, entry);
ret = LRU_RETRY;
+
+ /*
+ * Encountering a page already in swap cache is a sign that we are shrinking
+ * into the warmer region. We should terminate shrinking (if we're in the dynamic
+ * shrinker context).
+ */
+ if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
+ ret = LRU_SKIP;
+ *encountered_page_in_swapcache = true;
+ }
+
goto put_unlock;
}
zswap_written_back_pages++;
@@ -913,6 +1077,11 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
&pool->node);
if (ret)
goto error;
+
+ zswap_alloc_shrinker(pool);
+ if (!pool->shrinker)
+ goto error;
+
pr_debug("using %s compressor\n", pool->tfm_name);

/* being the current pool takes 1 ref; this func expects the
@@ -920,13 +1089,19 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
*/
kref_init(&pool->kref);
INIT_LIST_HEAD(&pool->list);
- list_lru_init_memcg(&pool->list_lru, NULL);
+ if (list_lru_init_memcg(&pool->list_lru, pool->shrinker))
+ goto lru_fail;
+ shrinker_register(pool->shrinker);
INIT_WORK(&pool->shrink_work, shrink_worker);
+ atomic_set(&pool->nr_stored, 0);

zswap_pool_debug("created", pool);

return pool;

+lru_fail:
+ list_lru_destroy(&pool->list_lru);
+ shrinker_free(pool->shrinker);
error:
if (pool->acomp_ctx)
free_percpu(pool->acomp_ctx);
@@ -984,6 +1159,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool)

zswap_pool_debug("destroying", pool);

+ shrinker_free(pool->shrinker);
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
free_percpu(pool->acomp_ctx);
list_lru_destroy(&pool->list_lru);
@@ -1540,6 +1716,7 @@ bool zswap_store(struct folio *folio)
if (entry->length) {
INIT_LIST_HEAD(&entry->lru);
zswap_lru_add(&entry->pool->list_lru, entry);
+ atomic_inc(&entry->pool->nr_stored);
}
spin_unlock(&tree->lock);

--
2.34.1

2023-11-30 19:42:20

by Nhat Pham

Subject: [PATCH v8 5/6] selftests: cgroup: update per-memcg zswap writeback selftest

From: Domenico Cerasuolo <[email protected]>

The memcg-zswap selftest is updated to adjust to the behavior change
implemented by commit 87730b165089 ("zswap: make shrinking memcg-aware"),
where zswap performs writeback for a specific memcg.

Signed-off-by: Domenico Cerasuolo <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
tools/testing/selftests/cgroup/test_zswap.c | 74 ++++++++++++++-------
1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
index c99d2adaca3f..47fdaa146443 100644
--- a/tools/testing/selftests/cgroup/test_zswap.c
+++ b/tools/testing/selftests/cgroup/test_zswap.c
@@ -50,9 +50,9 @@ static int get_zswap_stored_pages(size_t *value)
return read_int("/sys/kernel/debug/zswap/stored_pages", value);
}

-static int get_zswap_written_back_pages(size_t *value)
+static int get_cg_wb_count(const char *cg)
{
- return read_int("/sys/kernel/debug/zswap/written_back_pages", value);
+ return cg_read_key_long(cg, "memory.stat", "zswp_wb");
}

static long get_zswpout(const char *cgroup)
@@ -73,6 +73,24 @@ static int allocate_bytes(const char *cgroup, void *arg)
return 0;
}

+static char *setup_test_group_1M(const char *root, const char *name)
+{
+ char *group_name = cg_name(root, name);
+
+ if (!group_name)
+ return NULL;
+ if (cg_create(group_name))
+ goto fail;
+ if (cg_write(group_name, "memory.max", "1M")) {
+ cg_destroy(group_name);
+ goto fail;
+ }
+ return group_name;
+fail:
+ free(group_name);
+ return NULL;
+}
+
/*
* Sanity test to check that pages are written into zswap.
*/
@@ -117,43 +135,51 @@ static int test_zswap_usage(const char *root)

/*
* When trying to store a memcg page in zswap, if the memcg hits its memory
- * limit in zswap, writeback should not be triggered.
- *
- * This was fixed with commit 0bdf0efa180a("zswap: do not shrink if cgroup may
- * not zswap"). Needs to be revised when a per memcg writeback mechanism is
- * implemented.
+ * limit in zswap, writeback should affect only the zswapped pages of that
+ * memcg.
*/
static int test_no_invasive_cgroup_shrink(const char *root)
{
- size_t written_back_before, written_back_after;
int ret = KSFT_FAIL;
- char *test_group;
+ size_t control_allocation_size = MB(10);
+ char *control_allocation, *wb_group = NULL, *control_group = NULL;

/* Set up */
- test_group = cg_name(root, "no_shrink_test");
- if (!test_group)
- goto out;
- if (cg_create(test_group))
+ wb_group = setup_test_group_1M(root, "per_memcg_wb_test1");
+ if (!wb_group)
+ return KSFT_FAIL;
+ if (cg_write(wb_group, "memory.zswap.max", "10K"))
goto out;
- if (cg_write(test_group, "memory.max", "1M"))
+ control_group = setup_test_group_1M(root, "per_memcg_wb_test2");
+ if (!control_group)
goto out;
- if (cg_write(test_group, "memory.zswap.max", "10K"))
+
+ /* Push some control_group memory into zswap */
+ if (cg_enter_current(control_group))
goto out;
- if (get_zswap_written_back_pages(&written_back_before))
+ control_allocation = malloc(control_allocation_size);
+ for (int i = 0; i < control_allocation_size; i += 4095)
+ control_allocation[i] = 'a';
+ if (cg_read_key_long(control_group, "memory.stat", "zswapped") < 1)
goto out;

- /* Allocate 10x memory.max to push memory into zswap */
- if (cg_run(test_group, allocate_bytes, (void *)MB(10)))
+ /* Allocate 10x memory.max to push wb_group memory into zswap and trigger wb */
+ if (cg_run(wb_group, allocate_bytes, (void *)MB(10)))
goto out;

- /* Verify that no writeback happened because of the memcg allocation */
- if (get_zswap_written_back_pages(&written_back_after))
- goto out;
- if (written_back_after == written_back_before)
+ /* Verify that only zswapped memory from wb_group has been written back */
+ if (get_cg_wb_count(wb_group) > 0 && get_cg_wb_count(control_group) == 0)
ret = KSFT_PASS;
out:
- cg_destroy(test_group);
- free(test_group);
+ cg_enter_current(root);
+ if (control_group) {
+ cg_destroy(control_group);
+ free(control_group);
+ }
+ cg_destroy(wb_group);
+ free(wb_group);
+ if (control_allocation)
+ free(control_allocation);
return ret;
}

--
2.34.1

2023-11-30 19:43:23

by Nhat Pham

Subject: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat

From: Domenico Cerasuolo <[email protected]>

Since zswap now writes back pages from memcg-specific LRUs, we need a
new stat to show the writeback count for each memcg.
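
For illustration, the new count is exported as a per-memcg key in memory.stat
(and as a global event in /proc/vmstat). A minimal userspace sketch that reads
the per-memcg value, assuming cgroup v2 and a hypothetical cgroup named
"test", could look like this:

#include <stdio.h>
#include <string.h>

/* Hypothetical cgroup v2 path; substitute the memcg of interest. */
#define MEMORY_STAT "/sys/fs/cgroup/test/memory.stat"

int main(void)
{
        char key[64];
        long long val;
        FILE *f = fopen(MEMORY_STAT, "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* memory.stat is a flat "key value" list; pick out the writeback count.
         * The key is "zswp_wb" in this patch ("zswpwb" after the rename later
         * in this thread). */
        while (fscanf(f, "%63s %lld", key, &val) == 2) {
                if (!strcmp(key, "zswp_wb"))
                        printf("zswap writebacks for this memcg: %lld\n", val);
        }
        fclose(f);
        return 0;
}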

Suggested-by: Nhat Pham <[email protected]>
Signed-off-by: Domenico Cerasuolo <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/memcontrol.c | 1 +
mm/vmstat.c | 1 +
mm/zswap.c | 4 ++++
4 files changed, 7 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d1b847502f09..f4569ad98edf 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -142,6 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_ZSWAP
ZSWPIN,
ZSWPOUT,
+ ZSWP_WB,
#endif
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 792ca21c5815..21d79249c8b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -703,6 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
ZSWPIN,
ZSWPOUT,
+ ZSWP_WB,
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index afa5a38fcc9c..2249f85e4a87 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_ZSWAP
"zswpin",
"zswpout",
+ "zswp_wb",
#endif
#ifdef CONFIG_X86
"direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index f323e45cbdc7..49b79393e472 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -760,6 +760,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
}
zswap_written_back_pages++;

+ if (entry->objcg)
+ count_objcg_event(entry->objcg, ZSWP_WB);
+
+ count_vm_event(ZSWP_WB);
/*
* Writeback started successfully, the page now belongs to the
* swapcache. Drop the entry from zswap - unless invalidate already
--
2.34.1

2023-11-30 21:19:44

by Andrew Morton

Subject: Re: [PATCH v8 0/6] workload-specific and memory pressure-driven zswap writeback

On Thu, 30 Nov 2023 11:40:17 -0800 Nhat Pham <[email protected]> wrote:

> This patch series solves these issues by separating the global zswap
> LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
> (i.e memcg- and NUMA-aware) zswap writeback under memory pressure.

Thanks, I've updated mm-unstable to this version.

2023-12-05 18:22:54

by Yosry Ahmed

Subject: Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat

On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham <[email protected]> wrote:
>
> From: Domenico Cerasuolo <[email protected]>
>
> Since zswap now writes back pages from memcg-specific LRUs, we now need a
> new stat to show writebacks count for each memcg.
>
> Suggested-by: Nhat Pham <[email protected]>
> Signed-off-by: Domenico Cerasuolo <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>
> ---
> include/linux/vm_event_item.h | 1 +
> mm/memcontrol.c | 1 +
> mm/vmstat.c | 1 +
> mm/zswap.c | 4 ++++
> 4 files changed, 7 insertions(+)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index d1b847502f09..f4569ad98edf 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -142,6 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #ifdef CONFIG_ZSWAP
> ZSWPIN,
> ZSWPOUT,
> + ZSWP_WB,

I think you dismissed Johannes's comment from v7 about ZSWPWB and
"zswpwb" being more consistent with the existing events.

> #endif
> #ifdef CONFIG_X86
> DIRECT_MAP_LEVEL2_SPLIT,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 792ca21c5815..21d79249c8b4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -703,6 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
> #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> ZSWPIN,
> ZSWPOUT,
> + ZSWP_WB,
> #endif
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> THP_FAULT_ALLOC,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index afa5a38fcc9c..2249f85e4a87 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = {
> #ifdef CONFIG_ZSWAP
> "zswpin",
> "zswpout",
> + "zswp_wb",
> #endif
> #ifdef CONFIG_X86
> "direct_map_level2_splits",
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f323e45cbdc7..49b79393e472 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -760,6 +760,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> }
> zswap_written_back_pages++;
>
> + if (entry->objcg)
> + count_objcg_event(entry->objcg, ZSWP_WB);
> +
> + count_vm_event(ZSWP_WB);
> /*
> * Writeback started successfully, the page now belongs to the
> * swapcache. Drop the entry from zswap - unless invalidate already
> --
> 2.34.1

2023-12-05 18:56:49

by Nhat Pham

Subject: Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat

On Tue, Dec 5, 2023 at 10:22 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham <[email protected]> wrote:
> >
> > From: Domenico Cerasuolo <[email protected]>
> >
> > Since zswap now writes back pages from memcg-specific LRUs, we now need a
> > new stat to show writebacks count for each memcg.
> >
> > Suggested-by: Nhat Pham <[email protected]>
> > Signed-off-by: Domenico Cerasuolo <[email protected]>
> > Signed-off-by: Nhat Pham <[email protected]>
> > ---
> > include/linux/vm_event_item.h | 1 +
> > mm/memcontrol.c | 1 +
> > mm/vmstat.c | 1 +
> > mm/zswap.c | 4 ++++
> > 4 files changed, 7 insertions(+)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index d1b847502f09..f4569ad98edf 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -142,6 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > #ifdef CONFIG_ZSWAP
> > ZSWPIN,
> > ZSWPOUT,
> > + ZSWP_WB,
>
> I think you dismissed Johannes's comment from v7 about ZSWPWB and
> "zswpwb" being more consistent with the existing events.

I missed that entirely. Oops. Yeah I prefer ZSWPWB too. Let me send a fix.

>
> > #endif
> > #ifdef CONFIG_X86
> > DIRECT_MAP_LEVEL2_SPLIT,
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 792ca21c5815..21d79249c8b4 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -703,6 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
> > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > ZSWPIN,
> > ZSWPOUT,
> > + ZSWP_WB,
> > #endif
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > THP_FAULT_ALLOC,
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index afa5a38fcc9c..2249f85e4a87 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = {
> > #ifdef CONFIG_ZSWAP
> > "zswpin",
> > "zswpout",
> > + "zswp_wb",
> > #endif
> > #ifdef CONFIG_X86
> > "direct_map_level2_splits",
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f323e45cbdc7..49b79393e472 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -760,6 +760,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> > }
> > zswap_written_back_pages++;
> >
> > + if (entry->objcg)
> > + count_objcg_event(entry->objcg, ZSWP_WB);
> > +
> > + count_vm_event(ZSWP_WB);
> > /*
> > * Writeback started successfully, the page now belongs to the
> > * swapcache. Drop the entry from zswap - unless invalidate already
> > --
> > 2.34.1

2023-12-05 19:33:50

by Nhat Pham

Subject: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)

Rename ZSWP_WB to ZSWPWB to better match the existing counter naming
scheme.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
include/linux/vm_event_item.h | 2 +-
mm/memcontrol.c | 2 +-
mm/vmstat.c | 2 +-
mm/zswap.c | 4 ++--
4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f4569ad98edf..747943bc8cc2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_ZSWAP
ZSWPIN,
ZSWPOUT,
- ZSWP_WB,
+ ZSWPWB,
#endif
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21d79249c8b4..0286b7d38832 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
ZSWPIN,
ZSWPOUT,
- ZSWP_WB,
+ ZSWPWB,
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2249f85e4a87..cfd8d8256f8e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_ZSWAP
"zswpin",
"zswpout",
- "zswp_wb",
+ "zswpwb",
#endif
#ifdef CONFIG_X86
"direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index c65b8ccc6b72..0fb0945c0031 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
zswap_written_back_pages++;

if (entry->objcg)
- count_objcg_event(entry->objcg, ZSWP_WB);
+ count_objcg_event(entry->objcg, ZSWPWB);

- count_vm_event(ZSWP_WB);
+ count_vm_event(ZSWPWB);
/*
* Writeback started successfully, the page now belongs to the
* swapcache. Drop the entry from zswap - unless invalidate already
--
2.34.1

2023-12-05 20:06:43

by Yosry Ahmed

Subject: Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)

On Tue, Dec 5, 2023 at 11:33 AM Nhat Pham <[email protected]> wrote:
>
> Rename ZSWP_WB to ZSWPWB to better match the existing counters naming
> scheme.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>

For the original patch + this fix:

Reviewed-by: Yosry Ahmed <[email protected]>

> ---
> include/linux/vm_event_item.h | 2 +-
> mm/memcontrol.c | 2 +-
> mm/vmstat.c | 2 +-
> mm/zswap.c | 4 ++--
> 4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f4569ad98edf..747943bc8cc2 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #ifdef CONFIG_ZSWAP
> ZSWPIN,
> ZSWPOUT,
> - ZSWP_WB,
> + ZSWPWB,
> #endif
> #ifdef CONFIG_X86
> DIRECT_MAP_LEVEL2_SPLIT,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21d79249c8b4..0286b7d38832 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
> #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> ZSWPIN,
> ZSWPOUT,
> - ZSWP_WB,
> + ZSWPWB,
> #endif
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> THP_FAULT_ALLOC,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2249f85e4a87..cfd8d8256f8e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = {
> #ifdef CONFIG_ZSWAP
> "zswpin",
> "zswpout",
> - "zswp_wb",
> + "zswpwb",
> #endif
> #ifdef CONFIG_X86
> "direct_map_level2_splits",
> diff --git a/mm/zswap.c b/mm/zswap.c
> index c65b8ccc6b72..0fb0945c0031 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> zswap_written_back_pages++;
>
> if (entry->objcg)
> - count_objcg_event(entry->objcg, ZSWP_WB);
> + count_objcg_event(entry->objcg, ZSWPWB);
>
> - count_vm_event(ZSWP_WB);
> + count_vm_event(ZSWPWB);
> /*
> * Writeback started successfully, the page now belongs to the
> * swapcache. Drop the entry from zswap - unless invalidate already
> --
> 2.34.1

2023-12-06 04:11:31

by Bagas Sanjaya

Subject: Re: [PATCH v8 0/6] workload-specific and memory pressure-driven zswap writeback

On Thu, Nov 30, 2023 at 11:40:17AM -0800, Nhat Pham wrote:
> [...]

Carrying from v7,

Tested-by: Bagas Sanjaya <[email protected]>

--
An old man doll... just what I always wanted! - Clara



2023-12-06 05:52:38

by Chengming Zhou

Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

On 2023/12/1 03:40, Nhat Pham wrote:
> Currently, we only shrink the zswap pool when the user-defined limit is
> hit. This means that if we set the limit too high, cold data that are
> unlikely to be used again will reside in the pool, wasting precious
> memory. It is hard to predict how much zswap space will be needed ahead
> of time, as this depends on the workload (specifically, on factors such
> as memory access patterns and compressibility of the memory pages).
>
> This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> is initiated when there is memory pressure. The shrinker does not
> have any parameter that must be tuned by the user, and can be opted in
> or out on a per-memcg basis.
>
> Furthermore, to make it more robust for many workloads and prevent
> overshrinking (i.e evicting warm pages that might be refaulted into
> memory), we build in the following heuristics:
>
> * Estimate the number of warm pages residing in zswap, and attempt to
> protect this region of the zswap LRU.
> * Scale the number of freeable objects by an estimate of the memory
> saving factor. The better zswap compresses the data, the fewer pages
> we will evict to swap (as we will otherwise incur IO for relatively
> small memory saving).
> * During reclaim, if the shrinker encounters a page that is also being
> brought into memory, the shrinker will cautiously terminate its
> shrinking action, as this is a sign that it is touching the warmer
> region of the zswap LRU.
>
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improved the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
>
> Signed-off-by: Nhat Pham <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> ---
> Documentation/admin-guide/mm/zswap.rst | 10 ++
> include/linux/mmzone.h | 2 +
> include/linux/zswap.h | 25 +++-
> mm/Kconfig | 14 ++
> mm/mmzone.c | 1 +
> mm/swap_state.c | 2 +
> mm/zswap.c | 185 ++++++++++++++++++++++++-
> 7 files changed, 233 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 45b98390e938..62fc244ec702 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,16 @@ attribute, e. g.::
>
> Setting this parameter to 100 will disable the hysteresis.
>
> +When there is a sizable amount of cold memory residing in the zswap pool, it
> +can be advantageous to proactively write these cold pages to swap and reclaim
> +the memory for other use cases. By default, the zswap shrinker is disabled.
> +User can enable it as follows:
> +
> + echo Y > /sys/module/zswap/parameters/shrinker_enabled
> +
> +This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
> +selected.
> +
> A debugfs interface is provided for various statistic about pool size, number
> of pages stored, same-value filled pages and various counters for the reasons
> pages are rejected.
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7b1816450bfc..b23bc5390240 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -22,6 +22,7 @@
> #include <linux/mm_types.h>
> #include <linux/page-flags.h>
> #include <linux/local_lock.h>
> +#include <linux/zswap.h>
> #include <asm/page.h>
>
> /* Free memory management - zoned buddy allocator. */
> @@ -641,6 +642,7 @@ struct lruvec {
> #ifdef CONFIG_MEMCG
> struct pglist_data *pgdat;
> #endif
> + struct zswap_lruvec_state zswap_lruvec_state;
> };
>
> /* Isolate for asynchronous migration */
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index e571e393669b..08c240e16a01 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -5,20 +5,40 @@
> #include <linux/types.h>
> #include <linux/mm_types.h>
>
> +struct lruvec;
> +
> extern u64 zswap_pool_total_size;
> extern atomic_t zswap_stored_pages;
>
> #ifdef CONFIG_ZSWAP
>
> +struct zswap_lruvec_state {
> + /*
> + * Number of pages in zswap that should be protected from the shrinker.
> + * This number is an estimate of the following counts:
> + *
> + * a) Recent page faults.
> + * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> + * as well as recent zswap LRU rotations.
> + *
> + * These pages are likely to be warm, and might incur IO if the are written
> + * to swap.
> + */
> + atomic_long_t nr_zswap_protected;
> +};
> +
> bool zswap_store(struct folio *folio);
> bool zswap_load(struct folio *folio);
> void zswap_invalidate(int type, pgoff_t offset);
> void zswap_swapon(int type);
> void zswap_swapoff(int type);
> void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> -
> +void zswap_lruvec_state_init(struct lruvec *lruvec);
> +void zswap_page_swapin(struct page *page);
> #else
>
> +struct zswap_lruvec_state {};
> +
> static inline bool zswap_store(struct folio *folio)
> {
> return false;
> @@ -33,7 +53,8 @@ static inline void zswap_invalidate(int type, pgoff_t offset) {}
> static inline void zswap_swapon(int type) {}
> static inline void zswap_swapoff(int type) {}
> static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
> -
> +static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
> +static inline void zswap_page_swapin(struct page *page) {}
> #endif
>
> #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 57cd378c73d6..ca87cdb72f11 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -61,6 +61,20 @@ config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
> The cost is that if the page was never dirtied and needs to be
> swapped out again, it will be re-compressed.
>
> +config ZSWAP_SHRINKER_DEFAULT_ON
> + bool "Shrink the zswap pool on memory pressure"
> + depends on ZSWAP
> + default n
> + help
> + If selected, the zswap shrinker will be enabled, and the pages
> + stored in the zswap pool will become available for reclaim (i.e
> + written back to the backing swap device) on memory pressure.
> +
> + This means that zswap writeback could happen even if the pool is
> + not yet full, or the cgroup zswap limit has not been reached,
> + reducing the chance that cold pages will reside in the zswap pool
> + and consume memory indefinitely.
> +
> choice
> prompt "Default compressor"
> depends on ZSWAP
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index b594d3f268fe..c01896eca736 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -78,6 +78,7 @@ void lruvec_init(struct lruvec *lruvec)
>
> memset(lruvec, 0, sizeof(struct lruvec));
> spin_lock_init(&lruvec->lru_lock);
> + zswap_lruvec_state_init(lruvec);
>
> for_each_lru(lru)
> INIT_LIST_HEAD(&lruvec->lists[lru]);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 6c84236382f3..c597cec606e4 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -687,6 +687,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> &page_allocated, false);
> if (unlikely(page_allocated))
> swap_readpage(page, false, NULL);
> + zswap_page_swapin(page);
> return page;
> }
>
> @@ -862,6 +863,7 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> &page_allocated, false);
> if (unlikely(page_allocated))
> swap_readpage(page, false, NULL);
> + zswap_page_swapin(page);
> return page;
> }
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 49b79393e472..0f086ffd7b6a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -148,6 +148,11 @@ module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
> /* Number of zpools in zswap_pool (empirically determined for scalability) */
> #define ZSWAP_NR_ZPOOLS 32
>
> +/* Enable/disable memory pressure-based shrinker. */
> +static bool zswap_shrinker_enabled = IS_ENABLED(
> + CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> +module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> +
> /*********************************
> * data structures
> **********************************/
> @@ -177,6 +182,8 @@ struct zswap_pool {
> char tfm_name[CRYPTO_MAX_ALG_NAME];
> struct list_lru list_lru;
> struct mem_cgroup *next_shrink;
> + struct shrinker *shrinker;
> + atomic_t nr_stored;
> };
>
> /*
> @@ -275,17 +282,26 @@ static bool zswap_can_accept(void)
> DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
> }
>
> +static u64 get_zswap_pool_size(struct zswap_pool *pool)
> +{
> + u64 pool_size = 0;
> + int i;
> +
> + for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
> + pool_size += zpool_get_total_size(pool->zpools[i]);
> +
> + return pool_size;
> +}
> +
> static void zswap_update_total_size(void)
> {
> struct zswap_pool *pool;
> u64 total = 0;
> - int i;
>
> rcu_read_lock();
>
> list_for_each_entry_rcu(pool, &zswap_pools, list)
> - for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
> - total += zpool_get_total_size(pool->zpools[i]);
> + total += get_zswap_pool_size(pool);
>
> rcu_read_unlock();
>
> @@ -344,13 +360,34 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> kmem_cache_free(zswap_entry_cache, entry);
> }
>
> +/*********************************
> +* zswap lruvec functions
> +**********************************/
> +void zswap_lruvec_state_init(struct lruvec *lruvec)
> +{
> + atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
> +}
> +
> +void zswap_page_swapin(struct page *page)
> +{
> + struct lruvec *lruvec;
> +
> + if (page) {
> + lruvec = folio_lruvec(page_folio(page));
> + atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> + }
> +}
> +
> /*********************************
> * lru functions
> **********************************/
> static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> {
> + atomic_long_t *nr_zswap_protected;
> + unsigned long lru_size, old, new;
> int nid = entry_to_nid(entry);
> struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
>
> /*
> * Note that it is safe to use rcu_read_lock() here, even in the face of
> @@ -368,6 +405,19 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> memcg = mem_cgroup_from_entry(entry);
> /* will always succeed */
> list_lru_add(list_lru, &entry->lru, nid, memcg);
> +
> + /* Update the protection area */
> + lru_size = list_lru_count_one(list_lru, nid, memcg);
> + lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> + nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
> + old = atomic_long_inc_return(nr_zswap_protected);
> + /*
> + * Decay to avoid overflow and adapt to changing workloads.
> + * This is based on LRU reclaim cost decaying heuristics.
> + */
> + do {
> + new = old > lru_size / 4 ? old / 2 : old;
> + } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
> rcu_read_unlock();
> }
>
> @@ -389,6 +439,7 @@ static void zswap_lru_putback(struct list_lru *list_lru,
> int nid = entry_to_nid(entry);
> spinlock_t *lock = &list_lru->node[nid].lock;
> struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
>
> rcu_read_lock();
> memcg = mem_cgroup_from_entry(entry);
> @@ -396,6 +447,10 @@ static void zswap_lru_putback(struct list_lru *list_lru,
> /* we cannot use list_lru_add here, because it increments node's lru count */
> list_lru_putback(list_lru, &entry->lru, nid, memcg);
> spin_unlock(lock);
> +
> + lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> + /* increment the protection area to account for the LRU rotation. */
> + atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> rcu_read_unlock();
> }
>
> @@ -485,6 +540,7 @@ static void zswap_free_entry(struct zswap_entry *entry)
> else {
> zswap_lru_del(&entry->pool->list_lru, entry);
> zpool_free(zswap_find_zpool(entry), entry->handle);
> + atomic_dec(&entry->pool->nr_stored);
> zswap_pool_put(entry->pool);
> }
> zswap_entry_cache_free(entry);
> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> return entry;
> }
>
> +/*********************************
> +* shrinker functions
> +**********************************/
> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> + spinlock_t *lock, void *arg);
> +
> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> + struct shrink_control *sc)
> +{
> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> + unsigned long shrink_ret, nr_protected, lru_size;
> + struct zswap_pool *pool = shrinker->private_data;
> + bool encountered_page_in_swapcache = false;
> +
> + nr_protected =
> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> +
> + /*
> + * Abort if the shrinker is disabled or if we are shrinking into the
> + * protected region.
> + *
> + * This short-circuiting is necessary because if we have too many multiple
> + * concurrent reclaimers getting the freeable zswap object counts at the
> + * same time (before any of them made reasonable progress), the total
> + * number of reclaimed objects might be more than the number of unprotected
> + * objects (i.e the reclaimers will reclaim into the protected area of the
> + * zswap LRU).
> + */
> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> + sc->nr_scanned = 0;
> + return SHRINK_STOP;
> + }
> +
> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> + &encountered_page_in_swapcache);
> +
> + if (encountered_page_in_swapcache)
> + return SHRINK_STOP;
> +
> + return shrink_ret ? shrink_ret : SHRINK_STOP;
> +}
> +
> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> + struct shrink_control *sc)
> +{
> + struct zswap_pool *pool = shrinker->private_data;
> + struct mem_cgroup *memcg = sc->memcg;
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> +
> +#ifdef CONFIG_MEMCG_KMEM
> + cgroup_rstat_flush(memcg->css.cgroup);
> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> +#else
> + /* use pool stats instead of memcg stats */
> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> + nr_stored = atomic_read(&pool->nr_stored);
> +#endif
> +
> + if (!zswap_shrinker_enabled || !nr_stored)
When I tested with this series, with !zswap_shrinker_enabled in the default case,
I found the performance is much worse than that without this patch.

Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.

The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
check above the cgroup_rstat_flush(), the performance becomes much better.

Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?

Thanks!

> + return 0;
> +
> + nr_protected =
> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> + nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> + /*
> + * Subtract the lru size by an estimate of the number of pages
> + * that should be protected.
> + */
> + nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> +
> + /*
> + * Scale the number of freeable pages by the memory saving factor.
> + * This ensures that the better zswap compresses memory, the fewer
> + * pages we will evict to swap (as it will otherwise incur IO for
> + * relatively small memory saving).
> + */
> + return mult_frac(nr_freeable, nr_backing, nr_stored);
> +}
> +
> +static void zswap_alloc_shrinker(struct zswap_pool *pool)
> +{
> + pool->shrinker =
> + shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
> + if (!pool->shrinker)
> + return;
> +
> + pool->shrinker->private_data = pool;
> + pool->shrinker->scan_objects = zswap_shrinker_scan;
> + pool->shrinker->count_objects = zswap_shrinker_count;
> + pool->shrinker->batch = 0;
> + pool->shrinker->seeks = DEFAULT_SEEKS;
> +}
> +
> /*********************************
> * per-cpu code
> **********************************/
> @@ -721,6 +873,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> spinlock_t *lock, void *arg)
> {
> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> + bool *encountered_page_in_swapcache = (bool *)arg;
> struct zswap_tree *tree;
> pgoff_t swpoffset;
> enum lru_status ret = LRU_REMOVED_RETRY;
> @@ -756,6 +909,17 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> zswap_reject_reclaim_fail++;
> zswap_lru_putback(&entry->pool->list_lru, entry);
> ret = LRU_RETRY;
> +
> + /*
> + * Encountering a page already in swap cache is a sign that we are shrinking
> + * into the warmer region. We should terminate shrinking (if we're in the dynamic
> + * shrinker context).
> + */
> + if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
> + ret = LRU_SKIP;
> + *encountered_page_in_swapcache = true;
> + }
> +
> goto put_unlock;
> }
> zswap_written_back_pages++;
> @@ -913,6 +1077,11 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> &pool->node);
> if (ret)
> goto error;
> +
> + zswap_alloc_shrinker(pool);
> + if (!pool->shrinker)
> + goto error;
> +
> pr_debug("using %s compressor\n", pool->tfm_name);
>
> /* being the current pool takes 1 ref; this func expects the
> @@ -920,13 +1089,19 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> */
> kref_init(&pool->kref);
> INIT_LIST_HEAD(&pool->list);
> - list_lru_init_memcg(&pool->list_lru, NULL);
> + if (list_lru_init_memcg(&pool->list_lru, pool->shrinker))
> + goto lru_fail;
> + shrinker_register(pool->shrinker);
> INIT_WORK(&pool->shrink_work, shrink_worker);
> + atomic_set(&pool->nr_stored, 0);
>
> zswap_pool_debug("created", pool);
>
> return pool;
>
> +lru_fail:
> + list_lru_destroy(&pool->list_lru);
> + shrinker_free(pool->shrinker);
> error:
> if (pool->acomp_ctx)
> free_percpu(pool->acomp_ctx);
> @@ -984,6 +1159,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
>
> zswap_pool_debug("destroying", pool);
>
> + shrinker_free(pool->shrinker);
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> free_percpu(pool->acomp_ctx);
> list_lru_destroy(&pool->list_lru);
> @@ -1540,6 +1716,7 @@ bool zswap_store(struct folio *folio)
> if (entry->length) {
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&entry->pool->list_lru, entry);
> + atomic_inc(&entry->pool->nr_stored);
> }
> spin_unlock(&tree->lock);
>
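
The nr_zswap_protected arithmetic in zswap_lru_add() above can be hard to follow
at a glance: each add increments the counter, and the cmpxchg loop halves it once
it exceeds a quarter of the per-memcg LRU size. A standalone sketch of that decay
rule, with made-up numbers and plain unsigned longs instead of the kernel's
atomic_long_t (illustrative only, not kernel code):

#include <stdio.h>

/*
 * Decay rule from the zswap_lru_add() hunk quoted above: after each
 * increment, halve the protected count once it exceeds a quarter of
 * the LRU size, so the protection adapts to a changing workload.
 */
static unsigned long decay_protected(unsigned long count, unsigned long lru_size)
{
	return count > lru_size / 4 ? count / 2 : count;
}

int main(void)
{
	unsigned long protected = 0, lru_size = 1000;	/* hypothetical LRU size */

	for (int i = 0; i < 2000; i++) {
		protected++;	/* one new entry added to the zswap LRU */
		protected = decay_protected(protected, lru_size);
	}
	/* The count oscillates between roughly lru_size / 8 and lru_size / 4. */
	printf("protected = %lu\n", protected);
	return 0;
}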

2023-12-06 06:02:15

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

[..]
> > @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> > return entry;
> > }
> >
> > +/*********************************
> > +* shrinker functions
> > +**********************************/
> > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> > + spinlock_t *lock, void *arg);
> > +
> > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > + struct shrink_control *sc)
> > +{
> > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > + unsigned long shrink_ret, nr_protected, lru_size;
> > + struct zswap_pool *pool = shrinker->private_data;
> > + bool encountered_page_in_swapcache = false;
> > +
> > + nr_protected =
> > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > +
> > + /*
> > + * Abort if the shrinker is disabled or if we are shrinking into the
> > + * protected region.
> > + *
> > + * This short-circuiting is necessary because if we have too many multiple
> > + * concurrent reclaimers getting the freeable zswap object counts at the
> > + * same time (before any of them made reasonable progress), the total
> > + * number of reclaimed objects might be more than the number of unprotected
> > + * objects (i.e the reclaimers will reclaim into the protected area of the
> > + * zswap LRU).
> > + */
> > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> > + sc->nr_scanned = 0;
> > + return SHRINK_STOP;
> > + }
> > +
> > + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> > + &encountered_page_in_swapcache);
> > +
> > + if (encountered_page_in_swapcache)
> > + return SHRINK_STOP;
> > +
> > + return shrink_ret ? shrink_ret : SHRINK_STOP;
> > +}
> > +
> > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > + struct shrink_control *sc)
> > +{
> > + struct zswap_pool *pool = shrinker->private_data;
> > + struct mem_cgroup *memcg = sc->memcg;
> > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> > +
> > +#ifdef CONFIG_MEMCG_KMEM
> > + cgroup_rstat_flush(memcg->css.cgroup);
> > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > +#else
> > + /* use pool stats instead of memcg stats */
> > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> > + nr_stored = atomic_read(&pool->nr_stored);
> > +#endif
> > +
> > + if (!zswap_shrinker_enabled || !nr_stored)
> When I tested with this series, with !zswap_shrinker_enabled in the default case,
> I found the performance is much worse than that without this patch.
>
> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
>
> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
> check above the cgroup_rstat_flush(), the performance becomes much better.
>
> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?

Yes, we should do nothing if !zswap_shrinker_enabled. We should also
use mem_cgroup_flush_stats() here like other places unless accuracy is
crucial, which I doubt given that reclaim uses
mem_cgroup_flush_stats().

mem_cgroup_flush_stats() has some thresholding to make sure we don't
do flushes unnecessarily, and I have a pending series in mm-unstable
that makes that thresholding per-memcg. Keep in mind that adding a
call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
because the series there adds a memcg argument to
mem_cgroup_flush_stats(). That should be easily amendable though, I can
post a fixlet for my series to add the memcg argument there on top of
users if needed.

>
> Thanks!
>
> > + return 0;
> > +
> > + nr_protected =
> > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > + nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> > + /*
> > + * Subtract the lru size by an estimate of the number of pages
> > + * that should be protected.
> > + */
> > + nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > +
> > + /*
> > + * Scale the number of freeable pages by the memory saving factor.
> > + * This ensures that the better zswap compresses memory, the fewer
> > + * pages we will evict to swap (as it will otherwise incur IO for
> > + * relatively small memory saving).
> > + */
> > + return mult_frac(nr_freeable, nr_backing, nr_stored);
> > +}
> > +
> > +static void zswap_alloc_shrinker(struct zswap_pool *pool)
> > +{
> > + pool->shrinker =
> > + shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
> > + if (!pool->shrinker)
> > + return;
> > +
> > + pool->shrinker->private_data = pool;
> > + pool->shrinker->scan_objects = zswap_shrinker_scan;
> > + pool->shrinker->count_objects = zswap_shrinker_count;
> > + pool->shrinker->batch = 0;
> > + pool->shrinker->seeks = DEFAULT_SEEKS;
> > +}
> > +
> > /*********************************
> > * per-cpu code
> > **********************************/
[..]

2023-12-06 06:44:12

by Chengming Zhou

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

On 2023/12/6 13:59, Yosry Ahmed wrote:
> [..]
>>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>>> return entry;
>>> }
>>>
>>> +/*********************************
>>> +* shrinker functions
>>> +**********************************/
>>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
>>> + spinlock_t *lock, void *arg);
>>> +
>>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>>> + struct shrink_control *sc)
>>> +{
>>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>>> + unsigned long shrink_ret, nr_protected, lru_size;
>>> + struct zswap_pool *pool = shrinker->private_data;
>>> + bool encountered_page_in_swapcache = false;
>>> +
>>> + nr_protected =
>>> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
>>> +
>>> + /*
>>> + * Abort if the shrinker is disabled or if we are shrinking into the
>>> + * protected region.
>>> + *
>>> + * This short-circuiting is necessary because if we have too many multiple
>>> + * concurrent reclaimers getting the freeable zswap object counts at the
>>> + * same time (before any of them made reasonable progress), the total
>>> + * number of reclaimed objects might be more than the number of unprotected
>>> + * objects (i.e the reclaimers will reclaim into the protected area of the
>>> + * zswap LRU).
>>> + */
>>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
>>> + sc->nr_scanned = 0;
>>> + return SHRINK_STOP;
>>> + }
>>> +
>>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
>>> + &encountered_page_in_swapcache);
>>> +
>>> + if (encountered_page_in_swapcache)
>>> + return SHRINK_STOP;
>>> +
>>> + return shrink_ret ? shrink_ret : SHRINK_STOP;
>>> +}
>>> +
>>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>>> + struct shrink_control *sc)
>>> +{
>>> + struct zswap_pool *pool = shrinker->private_data;
>>> + struct mem_cgroup *memcg = sc->memcg;
>>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>>> +
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> + cgroup_rstat_flush(memcg->css.cgroup);
>>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
>>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
>>> +#else
>>> + /* use pool stats instead of memcg stats */
>>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
>>> + nr_stored = atomic_read(&pool->nr_stored);
>>> +#endif
>>> +
>>> + if (!zswap_shrinker_enabled || !nr_stored)
>> When I tested with this series, with !zswap_shrinker_enabled in the default case,
>> I found the performance is much worse than that without this patch.
>>
>> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
>>
>> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
>> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
>> check above the cgroup_rstat_flush(), the performance becomes much better.
>>
>> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
>
> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> use mem_cgroup_flush_stats() here like other places unless accuracy is
> crucial, which I doubt given that reclaim uses
> mem_cgroup_flush_stats().
>

Yes. After changing to use mem_cgroup_flush_stats() here, the performance
became much better.

> mem_cgroup_flush_stats() has some thresholding to make sure we don't
> do flushes unnecessarily, and I have a pending series in mm-unstable
> that makes that thresholding per-memcg. Keep in mind that adding a
> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,

My test branch is linux-next 20231205, and it's all good after changing
to use mem_cgroup_flush_stats(memcg).

> because the series there adds a memcg argument to
> mem_cgroup_flush_stats(). That should be easily amendable though, I can
> post a fixlet for my series to add the memcg argument there on top of
> users if needed.
>

It's great. Thanks!

2023-12-06 07:37:25

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou <[email protected]> wrote:
>
> On 2023/12/6 13:59, Yosry Ahmed wrote:
> > [..]
> >>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >>> return entry;
> >>> }
> >>>
> >>> +/*********************************
> >>> +* shrinker functions
> >>> +**********************************/
> >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> >>> + spinlock_t *lock, void *arg);
> >>> +
> >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >>> + struct shrink_control *sc)
> >>> +{
> >>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> >>> + unsigned long shrink_ret, nr_protected, lru_size;
> >>> + struct zswap_pool *pool = shrinker->private_data;
> >>> + bool encountered_page_in_swapcache = false;
> >>> +
> >>> + nr_protected =
> >>> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> >>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> >>> +
> >>> + /*
> >>> + * Abort if the shrinker is disabled or if we are shrinking into the
> >>> + * protected region.
> >>> + *
> >>> + * This short-circuiting is necessary because if we have too many multiple
> >>> + * concurrent reclaimers getting the freeable zswap object counts at the
> >>> + * same time (before any of them made reasonable progress), the total
> >>> + * number of reclaimed objects might be more than the number of unprotected
> >>> + * objects (i.e the reclaimers will reclaim into the protected area of the
> >>> + * zswap LRU).
> >>> + */
> >>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> >>> + sc->nr_scanned = 0;
> >>> + return SHRINK_STOP;
> >>> + }
> >>> +
> >>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> >>> + &encountered_page_in_swapcache);
> >>> +
> >>> + if (encountered_page_in_swapcache)
> >>> + return SHRINK_STOP;
> >>> +
> >>> + return shrink_ret ? shrink_ret : SHRINK_STOP;
> >>> +}
> >>> +
> >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> >>> + struct shrink_control *sc)
> >>> +{
> >>> + struct zswap_pool *pool = shrinker->private_data;
> >>> + struct mem_cgroup *memcg = sc->memcg;
> >>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> >>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> >>> +
> >>> +#ifdef CONFIG_MEMCG_KMEM
> >>> + cgroup_rstat_flush(memcg->css.cgroup);
> >>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> >>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> >>> +#else
> >>> + /* use pool stats instead of memcg stats */
> >>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> >>> + nr_stored = atomic_read(&pool->nr_stored);
> >>> +#endif
> >>> +
> >>> + if (!zswap_shrinker_enabled || !nr_stored)
> >> When I tested with this series, with !zswap_shrinker_enabled in the default case,
> >> I found the performance is much worse than that without this patch.
> >>
> >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
> >>
> >> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
> >> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
> >> check above the cgroup_rstat_flush(), the performance becomes much better.
> >>
> >> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
> >
> > Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> > use mem_cgroup_flush_stats() here like other places unless accuracy is
> > crucial, which I doubt given that reclaim uses
> > mem_cgroup_flush_stats().
> >
>
> Yes. After changing to use mem_cgroup_flush_stats() here, the performance
> became much better.
>
> > mem_cgroup_flush_stats() has some thresholding to make sure we don't
> > do flushes unnecessarily, and I have a pending series in mm-unstable
> > that makes that thresholding per-memcg. Keep in mind that adding a
> > call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>
> My test branch is linux-next 20231205, and it's all good after changing
> to use mem_cgroup_flush_stats(memcg).

Thanks for reporting back. We should still move the
zswap_shrinker_enabled check ahead, no need to even call
mem_cgroup_flush_stats() if we will do nothing anyway.

>
> > because the series there adds a memcg argument to
> > mem_cgroup_flush_stats(). That should be easily amendable though, I can
> > post a fixlet for my series to add the memcg argument there on top of
> > users if needed.
> >
>
> It's great. Thanks!
>

2023-12-06 07:53:58

by Chengming Zhou

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

On 2023/12/6 15:36, Yosry Ahmed wrote:
> On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou <[email protected]> wrote:
>>
>> On 2023/12/6 13:59, Yosry Ahmed wrote:
>>> [..]
>>>>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>>>>> return entry;
>>>>> }
>>>>>
>>>>> +/*********************************
>>>>> +* shrinker functions
>>>>> +**********************************/
>>>>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
>>>>> + spinlock_t *lock, void *arg);
>>>>> +
>>>>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>>>>> + struct shrink_control *sc)
>>>>> +{
>>>>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>>>>> + unsigned long shrink_ret, nr_protected, lru_size;
>>>>> + struct zswap_pool *pool = shrinker->private_data;
>>>>> + bool encountered_page_in_swapcache = false;
>>>>> +
>>>>> + nr_protected =
>>>>> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>>>>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
>>>>> +
>>>>> + /*
>>>>> + * Abort if the shrinker is disabled or if we are shrinking into the
>>>>> + * protected region.
>>>>> + *
>>>>> + * This short-circuiting is necessary because if we have too many multiple
>>>>> + * concurrent reclaimers getting the freeable zswap object counts at the
>>>>> + * same time (before any of them made reasonable progress), the total
>>>>> + * number of reclaimed objects might be more than the number of unprotected
>>>>> + * objects (i.e the reclaimers will reclaim into the protected area of the
>>>>> + * zswap LRU).
>>>>> + */
>>>>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
>>>>> + sc->nr_scanned = 0;
>>>>> + return SHRINK_STOP;
>>>>> + }
>>>>> +
>>>>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
>>>>> + &encountered_page_in_swapcache);
>>>>> +
>>>>> + if (encountered_page_in_swapcache)
>>>>> + return SHRINK_STOP;
>>>>> +
>>>>> + return shrink_ret ? shrink_ret : SHRINK_STOP;
>>>>> +}
>>>>> +
>>>>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>>>>> + struct shrink_control *sc)
>>>>> +{
>>>>> + struct zswap_pool *pool = shrinker->private_data;
>>>>> + struct mem_cgroup *memcg = sc->memcg;
>>>>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>>>>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>>>>> +
>>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>>> + cgroup_rstat_flush(memcg->css.cgroup);
>>>>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
>>>>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
>>>>> +#else
>>>>> + /* use pool stats instead of memcg stats */
>>>>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
>>>>> + nr_stored = atomic_read(&pool->nr_stored);
>>>>> +#endif
>>>>> +
>>>>> + if (!zswap_shrinker_enabled || !nr_stored)
>>>> When I tested with this series, with !zswap_shrinker_enabled in the default case,
>>>> I found the performance is much worse than that without this patch.
>>>>
>>>> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
>>>>
>>>> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
>>>> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
>>>> check above the cgroup_rstat_flush(), the performance becomes much better.
>>>>
>>>> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
>>>
>>> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
>>> use mem_cgroup_flush_stats() here like other places unless accuracy is
>>> crucial, which I doubt given that reclaim uses
>>> mem_cgroup_flush_stats().
>>>
>>
>> Yes. After changing to use mem_cgroup_flush_stats() here, the performance
>> became much better.
>>
>>> mem_cgroup_flush_stats() has some thresholding to make sure we don't
>>> do flushes unnecessarily, and I have a pending series in mm-unstable
>>> that makes that thresholding per-memcg. Keep in mind that adding a
>>> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>>
>> My test branch is linux-next 20231205, and it's all good after changing
>> to use mem_cgroup_flush_stats(memcg).
>
> Thanks for reporting back. We should still move the
> zswap_shrinker_enabled check ahead, no need to even call
> mem_cgroup_flush_stats() if we will do nothing anyway.
>

Yes, agree!

2023-12-06 16:57:17

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

On Tue, Dec 5, 2023 at 10:00 PM Yosry Ahmed <[email protected]> wrote:
>
> [..]
> > > @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> > > return entry;
> > > }
> > >
> > > +/*********************************
> > > +* shrinker functions
> > > +**********************************/
> > > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> > > + spinlock_t *lock, void *arg);
> > > +
> > > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > > + struct shrink_control *sc)
> > > +{
> > > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > > + unsigned long shrink_ret, nr_protected, lru_size;
> > > + struct zswap_pool *pool = shrinker->private_data;
> > > + bool encountered_page_in_swapcache = false;
> > > +
> > > + nr_protected =
> > > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > > + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > > +
> > > + /*
> > > + * Abort if the shrinker is disabled or if we are shrinking into the
> > > + * protected region.
> > > + *
> > > + * This short-circuiting is necessary because if we have too many multiple
> > > + * concurrent reclaimers getting the freeable zswap object counts at the
> > > + * same time (before any of them made reasonable progress), the total
> > > + * number of reclaimed objects might be more than the number of unprotected
> > > + * objects (i.e the reclaimers will reclaim into the protected area of the
> > > + * zswap LRU).
> > > + */
> > > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> > > + sc->nr_scanned = 0;
> > > + return SHRINK_STOP;
> > > + }
> > > +
> > > + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> > > + &encountered_page_in_swapcache);
> > > +
> > > + if (encountered_page_in_swapcache)
> > > + return SHRINK_STOP;
> > > +
> > > + return shrink_ret ? shrink_ret : SHRINK_STOP;
> > > +}
> > > +
> > > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > > + struct shrink_control *sc)
> > > +{
> > > + struct zswap_pool *pool = shrinker->private_data;
> > > + struct mem_cgroup *memcg = sc->memcg;
> > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> > > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> > > +
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > + cgroup_rstat_flush(memcg->css.cgroup);
> > > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > > +#else
> > > + /* use pool stats instead of memcg stats */
> > > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> > > + nr_stored = atomic_read(&pool->nr_stored);
> > > +#endif
> > > +
> > > + if (!zswap_shrinker_enabled || !nr_stored)
> > When I tested with this series, with !zswap_shrinker_enabled in the default case,
> > I found the performance is much worse than that without this patch.
> >
> > Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
> >
> > The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
> > lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
> > check above the cgroup_rstat_flush(), the performance becomes much better.
> >
> > Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
>
> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> use mem_cgroup_flush_stats() here like other places unless accuracy is
> crucial, which I doubt given that reclaim uses
> mem_cgroup_flush_stats().

Ah, good points on both suggestions. We should not do extra work for
non-users. And this is a best-effort approximation of the memory
saving factor, so as long as it is not *too* far off I think it's
acceptable.
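
As a rough illustration of that scaling (made-up numbers; mult_frac() is
open-coded here instead of using the kernel macro):

#include <stdio.h>

int main(void)
{
	/* Hypothetical per-memcg numbers for one zswap_shrinker_count() call. */
	unsigned long nr_freeable = 1000;	/* unprotected entries on the zswap LRU */
	unsigned long nr_backing = 200;		/* pages of compressed backing memory */
	unsigned long nr_stored = 800;		/* pages stored in zswap */

	/* mult_frac(nr_freeable, nr_backing, nr_stored) == 1000 * 200 / 800 */
	unsigned long reported = nr_freeable * nr_backing / nr_stored;

	/*
	 * With a 4:1 compression ratio only 250 of the 1000 freeable entries
	 * are reported, so the better zswap compresses, the less it writes back.
	 */
	printf("objects reported to the shrinker: %lu\n", reported);
	return 0;
}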

>
> mem_cgroup_flush_stats() has some thresholding to make sure we don't
> do flushes unnecessarily, and I have a pending series in mm-unstable
> that makes that thresholding per-memcg. Keep in mind that adding a
> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
> because the series there adds a memcg argument to
> mem_cgroup_flush_stats(). That should be easily amendable though, I can
> post a fixlet for my series to add the memcg argument there on top of
> users if needed.

Hmm so how should we proceed from here? How about this:

a) I can send a fixlet to move the enablement check above the stats
flushing + use mem_cgroup_flush_stats
b) Then maybe, you can send a fixlet to update this new callsite?

Does that sound reasonable?

>
> >
> > Thanks!
> >
> > > + return 0;
> > > +
> > > + nr_protected =
> > > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > > + nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> > > + /*
> > > + * Subtract the lru size by an estimate of the number of pages
> > > + * that should be protected.
> > > + */
> > > + nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > > +
> > > + /*
> > > + * Scale the number of freeable pages by the memory saving factor.
> > > + * This ensures that the better zswap compresses memory, the fewer
> > > + * pages we will evict to swap (as it will otherwise incur IO for
> > > + * relatively small memory saving).
> > > + */
> > > + return mult_frac(nr_freeable, nr_backing, nr_stored);
> > > +}
> > > +
> > > +static void zswap_alloc_shrinker(struct zswap_pool *pool)
> > > +{
> > > + pool->shrinker =
> > > + shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
> > > + if (!pool->shrinker)
> > > + return;
> > > +
> > > + pool->shrinker->private_data = pool;
> > > + pool->shrinker->scan_objects = zswap_shrinker_scan;
> > > + pool->shrinker->count_objects = zswap_shrinker_count;
> > > + pool->shrinker->batch = 0;
> > > + pool->shrinker->seeks = DEFAULT_SEEKS;
> > > +}
> > > +
> > > /*********************************
> > > * per-cpu code
> > > **********************************/
> [..]

2023-12-06 19:45:04

by Nhat Pham

[permalink] [raw]
Subject: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure (fix)

Check shrinker enablement early, and use a less costly stat flush.

Suggested-by: Yosry Ahmed <[email protected]>
Suggested-by: Chengming Zhou <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
mm/zswap.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 27c749f6c1ba..d8ecd79120f3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -596,13 +596,17 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
struct zswap_pool *pool = shrinker->private_data;
bool encountered_page_in_swapcache = false;

+ if (!zswap_shrinker_enabled) {
+ sc->nr_scanned = 0;
+ return SHRINK_STOP;
+ }
+
nr_protected =
atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
lru_size = list_lru_shrink_count(&pool->list_lru, sc);

/*
- * Abort if the shrinker is disabled or if we are shrinking into the
- * protected region.
+ * Abort if we are shrinking into the protected region.
*
* This short-circuiting is necessary because if we have too many multiple
* concurrent reclaimers getting the freeable zswap object counts at the
@@ -611,7 +615,7 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
* objects (i.e the reclaimers will reclaim into the protected area of the
* zswap LRU).
*/
- if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
+ if (nr_protected >= lru_size - sc->nr_to_scan) {
sc->nr_scanned = 0;
return SHRINK_STOP;
}
@@ -633,8 +637,11 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;

+ if (!zswap_shrinker_enabled)
+ return 0;
+
#ifdef CONFIG_MEMCG_KMEM
- cgroup_rstat_flush(memcg->css.cgroup);
+ mem_cgroup_flush_stats();
nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
#else
@@ -643,7 +650,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
nr_stored = atomic_read(&pool->nr_stored);
#endif

- if (!zswap_shrinker_enabled || !nr_stored)
+ if (!nr_stored)
return 0;

nr_protected =
--
2.34.1
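
The protected-region guard that this fixlet keeps in zswap_shrinker_scan() is
easiest to see with concrete numbers; a standalone sketch with made-up values
(illustrative only, not kernel code):

#include <stdbool.h>
#include <stdio.h>

/*
 * The check kept in zswap_shrinker_scan(): bail out if scanning
 * nr_to_scan more objects would eat into the protected (recently
 * swapped-in) part of the zswap LRU.
 */
static bool should_stop(unsigned long nr_protected, unsigned long lru_size,
			unsigned long nr_to_scan)
{
	return nr_protected >= lru_size - nr_to_scan;
}

int main(void)
{
	/* 1000 entries, 250 protected: 250 >= 1000 - 128 is false, keep scanning. */
	printf("%d\n", should_stop(250, 1000, 128));
	/* 300 entries, 250 protected: 250 >= 300 - 128 is true, stop. */
	printf("%d\n", should_stop(250, 300, 128));
	return 0;
}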

2023-12-06 19:48:11

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

[...]
>
> Hmm so how should we proceed from here? How about this:
>
> a) I can send a fixlet to move the enablement check above the stats
> flushing + use mem_cgroup_flush_stats
> b) Then maybe, you can send a fixlet to update this new callsite?
>
> Does that sound reasonable?

I just sent out the fixlet. Yosry and Chengming, let me know if that
looks good. Thank you both for detecting this issue and proposing the
fix!

2023-12-06 21:14:24

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

On Wed, Dec 6, 2023 at 11:47 AM Nhat Pham <[email protected]> wrote:
>
> [...]
> >
> > Hmm so how should we proceed from here? How about this:
> >
> > a) I can send a fixlet to move the enablement check above the stats
> > flushing + use mem_cgroup_flush_stats
> > b) Then maybe, you can send a fixlet to update this new callsite?
> >
> > Does that sound reasonable?
>
> I just sent out the fixlet. Yosry and Chengming, let me know if that
> looks good. Thank you both for detecting this issue and proposing the
> fix!

The fixlet looks good, and Andrew already took care of (b) before I
could send a followup fixlet out :)

2023-12-07 02:33:10

by Chengming Zhou

[permalink] [raw]
Subject: Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

On 2023/12/7 03:47, Nhat Pham wrote:
> [...]
>>
>> Hmm so how should we proceed from here? How about this:
>>
>> a) I can send a fixlet to move the enablement check above the stats
>> flushing + use mem_cgroup_flush_stats
>> b) Then maybe, you can send a fixlet to update this new callsite?
>>
>> Does that sound reasonable?
>
> I just sent out the fixlet. Yosry and Chengming, let me know if that
> looks good. Thank you both for detecting this issue and proposing the
> fix!

Yeah, also looks good to me. Thanks!

--
Best regards,
Chengming Zhou

2023-12-08 00:28:55

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)

Acked-by: Chris Li <[email protected]> (Google)

Chris

On Tue, Dec 5, 2023 at 11:33 AM Nhat Pham <[email protected]> wrote:
>
> Rename ZSWP_WB to ZSWPWB to better match the existing counters naming
> scheme.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>
> ---
> include/linux/vm_event_item.h | 2 +-
> mm/memcontrol.c | 2 +-
> mm/vmstat.c | 2 +-
> mm/zswap.c | 4 ++--
> 4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f4569ad98edf..747943bc8cc2 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #ifdef CONFIG_ZSWAP
> ZSWPIN,
> ZSWPOUT,
> - ZSWP_WB,
> + ZSWPWB,
> #endif
> #ifdef CONFIG_X86
> DIRECT_MAP_LEVEL2_SPLIT,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21d79249c8b4..0286b7d38832 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
> #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> ZSWPIN,
> ZSWPOUT,
> - ZSWP_WB,
> + ZSWPWB,
> #endif
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> THP_FAULT_ALLOC,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2249f85e4a87..cfd8d8256f8e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = {
> #ifdef CONFIG_ZSWAP
> "zswpin",
> "zswpout",
> - "zswp_wb",
> + "zswpwb",
> #endif
> #ifdef CONFIG_X86
> "direct_map_level2_splits",
> diff --git a/mm/zswap.c b/mm/zswap.c
> index c65b8ccc6b72..0fb0945c0031 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> zswap_written_back_pages++;
>
> if (entry->objcg)
> - count_objcg_event(entry->objcg, ZSWP_WB);
> + count_objcg_event(entry->objcg, ZSWPWB);
>
> - count_vm_event(ZSWP_WB);
> + count_vm_event(ZSWPWB);
> /*
> * Writeback started successfully, the page now belongs to the
> * swapcache. Drop the entry from zswap - unless invalidate already
> --
> 2.34.1
>

2023-12-08 00:43:43

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH v8 5/6] selftests: cgroup: update per-memcg zswap writeback selftest

Hi Nhat,

Thanks for the self test.

Acked-by: Chris Li <[email protected]> (Google)

Chris

On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham <[email protected]> wrote:
>
> From: Domenico Cerasuolo <[email protected]>
>
> The memcg-zswap self test is updated to adjust to the behavior change
> implemented by commit 87730b165089 ("zswap: make shrinking memcg-aware"),
> where zswap performs writeback for a specific memcg.
>
> Signed-off-by: Domenico Cerasuolo <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>
> ---
> tools/testing/selftests/cgroup/test_zswap.c | 74 ++++++++++++++-------
> 1 file changed, 50 insertions(+), 24 deletions(-)
>
> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
> index c99d2adaca3f..47fdaa146443 100644
> --- a/tools/testing/selftests/cgroup/test_zswap.c
> +++ b/tools/testing/selftests/cgroup/test_zswap.c
> @@ -50,9 +50,9 @@ static int get_zswap_stored_pages(size_t *value)
> return read_int("/sys/kernel/debug/zswap/stored_pages", value);
> }
>
> -static int get_zswap_written_back_pages(size_t *value)
> +static int get_cg_wb_count(const char *cg)
> {
> - return read_int("/sys/kernel/debug/zswap/written_back_pages", value);
> + return cg_read_key_long(cg, "memory.stat", "zswp_wb");
> }
>
> static long get_zswpout(const char *cgroup)
> @@ -73,6 +73,24 @@ static int allocate_bytes(const char *cgroup, void *arg)
> return 0;
> }
>
> +static char *setup_test_group_1M(const char *root, const char *name)
> +{
> + char *group_name = cg_name(root, name);
> +
> + if (!group_name)
> + return NULL;
> + if (cg_create(group_name))
> + goto fail;
> + if (cg_write(group_name, "memory.max", "1M")) {
> + cg_destroy(group_name);
> + goto fail;
> + }
> + return group_name;
> +fail:
> + free(group_name);
> + return NULL;
> +}
> +
> /*
> * Sanity test to check that pages are written into zswap.
> */
> @@ -117,43 +135,51 @@ static int test_zswap_usage(const char *root)
>
> /*
> * When trying to store a memcg page in zswap, if the memcg hits its memory
> - * limit in zswap, writeback should not be triggered.
> - *
> - * This was fixed with commit 0bdf0efa180a("zswap: do not shrink if cgroup may
> - * not zswap"). Needs to be revised when a per memcg writeback mechanism is
> - * implemented.
> + * limit in zswap, writeback should affect only the zswapped pages of that
> + * memcg.
> */
> static int test_no_invasive_cgroup_shrink(const char *root)
> {
> - size_t written_back_before, written_back_after;
> int ret = KSFT_FAIL;
> - char *test_group;
> + size_t control_allocation_size = MB(10);
> + char *control_allocation, *wb_group = NULL, *control_group = NULL;
>
> /* Set up */
> - test_group = cg_name(root, "no_shrink_test");
> - if (!test_group)
> - goto out;
> - if (cg_create(test_group))
> + wb_group = setup_test_group_1M(root, "per_memcg_wb_test1");
> + if (!wb_group)
> + return KSFT_FAIL;
> + if (cg_write(wb_group, "memory.zswap.max", "10K"))
> goto out;
> - if (cg_write(test_group, "memory.max", "1M"))
> + control_group = setup_test_group_1M(root, "per_memcg_wb_test2");
> + if (!control_group)
> goto out;
> - if (cg_write(test_group, "memory.zswap.max", "10K"))
> +
> + /* Push some control_group memory into zswap */
> + if (cg_enter_current(control_group))
> goto out;
> - if (get_zswap_written_back_pages(&written_back_before))
> + control_allocation = malloc(control_allocation_size);
> + for (int i = 0; i < control_allocation_size; i += 4095)
> + control_allocation[i] = 'a';
> + if (cg_read_key_long(control_group, "memory.stat", "zswapped") < 1)
> goto out;
>
> - /* Allocate 10x memory.max to push memory into zswap */
> - if (cg_run(test_group, allocate_bytes, (void *)MB(10)))
> + /* Allocate 10x memory.max to push wb_group memory into zswap and trigger wb */
> + if (cg_run(wb_group, allocate_bytes, (void *)MB(10)))
> goto out;
>
> - /* Verify that no writeback happened because of the memcg allocation */
> - if (get_zswap_written_back_pages(&written_back_after))
> - goto out;
> - if (written_back_after == written_back_before)
> + /* Verify that only zswapped memory from wb_group has been written back */
> + if (get_cg_wb_count(wb_group) > 0 && get_cg_wb_count(control_group) == 0)
> ret = KSFT_PASS;
> out:
> - cg_destroy(test_group);
> - free(test_group);
> + cg_enter_current(root);
> + if (control_group) {
> + cg_destroy(control_group);
> + free(control_group);
> + }
> + cg_destroy(wb_group);
> + free(wb_group);
> + if (control_allocation)
> + free(control_allocation);
> return ret;
> }
>
> --
> 2.34.1
>