2024-05-28 04:34:51

by Takero Funaki

Subject: [PATCH 0/3] mm: zswap: global shrinker fix and proactive shrink

Hello,

This series addresses two issues and introduces a minor improvement in
the zswap global shrinker:

1. Fix the memcg iteration logic that breaks iteration on offline
memcgs.
2. Fix the error path that aborts on expected error codes.
3. Add proactive shrinking that starts at 95% pool usage for the default
90% accept threshold.

These patches need to be applied in this order to avoid potential loops
caused by the first issue. Patch 3 can be applied independently, but the
two issues must be resolved to ensure the shrinker can evict pages.

With this series applied, the shrinker proactively evicts pages until
the pool drops to accept_threshold_percent, as documented in patch 3.
As a side effect of the changes to the hysteresis logic, zswap no
longer rejects pages while the pool is below the max limit.

Regards,
Takero Funaki <[email protected]>

Takero Funaki (3):
mm: zswap: fix global shrinker memcg iteration
mm: zswap: fix global shrinker error handling logic
mm: zswap: proactive shrinking before pool size limit is hit

Documentation/admin-guide/mm/zswap.rst | 17 ++-
mm/zswap.c | 156 ++++++++++++++++++-------
2 files changed, 119 insertions(+), 54 deletions(-)

--
2.43.0



2024-05-28 04:34:58

by Takero Funaki

Subject: [PATCH 1/3] mm: zswap: fix global shrinker memcg iteration

This patch fixes an issue where the zswap global shrinker stopped
iterating through the memcg tree.

The problem was that `shrink_worker()` would stop iterating when a memcg
was being offlined and restart from the tree root. Now, it properly
handles the offlining memcg and continues shrinking with the next memcg.

Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <[email protected]>
---
mm/zswap.c | 76 ++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 56 insertions(+), 20 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index a50e2986cd2f..0b1052cee36c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -775,12 +775,27 @@ void zswap_folio_swapin(struct folio *folio)
}
}

+/*
+ * This function should be called when a memcg is being offlined.
+ *
+ * Since the global shrinker shrink_worker() may hold a reference
+ * of the memcg, we must check and release the reference in
+ * zswap_next_shrink.
+ *
+ * shrink_worker() must handle the case where this function releases
+ * the reference of memcg being shrunk.
+ */
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
/* lock out zswap shrinker walking memcg tree */
spin_lock(&zswap_shrink_lock);
- if (zswap_next_shrink == memcg)
- zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+
+ if (READ_ONCE(zswap_next_shrink) == memcg) {
+ /* put back reference and advance the cursor */
+ memcg = mem_cgroup_iter(NULL, memcg, NULL);
+ WRITE_ONCE(zswap_next_shrink, memcg);
+ }
+
spin_unlock(&zswap_shrink_lock);
}

@@ -1312,25 +1327,38 @@ static int shrink_memcg(struct mem_cgroup *memcg)

static void shrink_worker(struct work_struct *w)
{
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
+ struct mem_cgroup *next_memcg;
int ret, failures = 0;
unsigned long thr;

/* Reclaim down to the accept threshold */
thr = zswap_accept_thr_pages();

- /* global reclaim will select cgroup in a round-robin fashion. */
+ /* global reclaim will select cgroup in a round-robin fashion.
+ *
+ * We save iteration cursor memcg into zswap_next_shrink,
+ * which can be modified by the offline memcg cleaner
+ * zswap_memcg_offline_cleanup().
+ */
do {
spin_lock(&zswap_shrink_lock);
- zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
- memcg = zswap_next_shrink;
+ next_memcg = READ_ONCE(zswap_next_shrink);
+
+ if (memcg != next_memcg) {
+ /*
+ * Ours was released by offlining.
+ * Use the saved memcg reference.
+ */
+ memcg = next_memcg;
+ } else {
+iternext:
+ /* advance cursor */
+ memcg = mem_cgroup_iter(NULL, memcg, NULL);
+ WRITE_ONCE(zswap_next_shrink, memcg);
+ }

/*
- * We need to retry if we have gone through a full round trip, or if we
- * got an offline memcg (or else we risk undoing the effect of the
- * zswap memcg offlining cleanup callback). This is not catastrophic
- * per se, but it will keep the now offlined memcg hostage for a while.
- *
* Note that if we got an online memcg, we will keep the extra
* reference in case the original reference obtained by mem_cgroup_iter
* is dropped by the zswap memcg offlining callback, ensuring that the
@@ -1345,16 +1373,18 @@ static void shrink_worker(struct work_struct *w)
}

if (!mem_cgroup_tryget_online(memcg)) {
- /* drop the reference from mem_cgroup_iter() */
- mem_cgroup_iter_break(NULL, memcg);
- zswap_next_shrink = NULL;
- spin_unlock(&zswap_shrink_lock);
-
- if (++failures == MAX_RECLAIM_RETRIES)
- break;
-
- goto resched;
+ /*
+ * It is an offline memcg which we cannot shrink
+ * until its pages are reparented.
+ * Put back the memcg reference before cleanup
+ * function reads it from zswap_next_shrink.
+ */
+ goto iternext;
}
+ /*
+ * We got an extra memcg reference before unlocking.
+ * The cleaner cannot free it using zswap_next_shrink.
+ */
spin_unlock(&zswap_shrink_lock);

ret = shrink_memcg(memcg);
@@ -1368,6 +1398,12 @@ static void shrink_worker(struct work_struct *w)
resched:
cond_resched();
} while (zswap_total_pages() > thr);
+
+ /*
+ * We can still hold the original memcg reference.
+ * The reference is stored in zswap_next_shrink, and then reused
+ * by the next shrink_worker().
+ */
}

/*********************************
--
2.43.0


2024-05-28 04:35:11

by Takero Funaki

Subject: [PATCH 2/3] mm: zswap: fix global shrinker error handling logic

This patch fixes the zswap global shrinker, which did not shrink the
zpool as expected.

The issue it addresses is that `shrink_worker()` did not distinguish
between unexpected errors and expected error codes that should be
skipped, such as when a memcg has no stored pages. This led to the
shrinking process being aborted on the expected error codes.

The shrinker should ignore these cases and skip to the next memcg.
However, if every memcg ends up being skipped, the walk makes no
progress at all. To detect this, this patch tracks progress while
walking the memcg tree and checks whether any progress was made once
the tree walk is completed.

To handle the empty memcg case, the helper function `shrink_memcg()` is
modified to return -ENOENT when the memcg stores no pages.

Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <[email protected]>
---
mm/zswap.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 0b1052cee36c..08a6f5a6bf62 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1304,7 +1304,7 @@ static struct shrinker *zswap_alloc_shrinker(void)

static int shrink_memcg(struct mem_cgroup *memcg)
{
- int nid, shrunk = 0;
+ int nid, shrunk = 0, stored = 0;

if (!mem_cgroup_zswap_writeback_enabled(memcg))
return -EINVAL;
@@ -1319,9 +1319,16 @@ static int shrink_memcg(struct mem_cgroup *memcg)
for_each_node_state(nid, N_NORMAL_MEMORY) {
unsigned long nr_to_walk = 1;

+ if (!list_lru_count_one(&zswap_list_lru, nid, memcg))
+ continue;
+ ++stored;
shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
&shrink_memcg_cb, NULL, &nr_to_walk);
}
+
+ if (!stored)
+ return -ENOENT;
+
return shrunk ? 0 : -EAGAIN;
}

@@ -1329,12 +1336,18 @@ static void shrink_worker(struct work_struct *w)
{
struct mem_cgroup *memcg = NULL;
struct mem_cgroup *next_memcg;
- int ret, failures = 0;
+ int ret, failures = 0, progress;
unsigned long thr;

/* Reclaim down to the accept threshold */
thr = zswap_accept_thr_pages();

+ /*
+ * We might start from the last memcg.
+ * That is not a failure.
+ */
+ progress = 1;
+
/* global reclaim will select cgroup in a round-robin fashion.
*
* We save iteration cursor memcg into zswap_next_shrink,
@@ -1366,9 +1379,12 @@ static void shrink_worker(struct work_struct *w)
*/
if (!memcg) {
spin_unlock(&zswap_shrink_lock);
- if (++failures == MAX_RECLAIM_RETRIES)
+
+ /* tree walk completed but no progress */
+ if (!progress && ++failures == MAX_RECLAIM_RETRIES)
break;

+ progress = 0;
goto resched;
}

@@ -1391,10 +1407,15 @@ static void shrink_worker(struct work_struct *w)
/* drop the extra reference */
mem_cgroup_put(memcg);

- if (ret == -EINVAL)
- break;
+ /* not a writeback candidate memcg */
+ if (ret == -EINVAL || ret == -ENOENT)
+ continue;
+
if (ret && ++failures == MAX_RECLAIM_RETRIES)
break;
+
+ ++progress;
+ /* reschedule as we performed some IO */
resched:
cond_resched();
} while (zswap_total_pages() > thr);
--
2.43.0


2024-05-28 04:35:19

by Takero Funaki

Subject: [PATCH 3/3] mm: zswap: proactive shrinking before pool size limit is hit

This patch implements proactive shrinking of the zswap pool before the
max pool size limit is reached. It also changes zswap to accept new
pages while the shrinker is running.

To prevent zswap from rejecting new pages and incurring latency when
zswap is full, this patch queues the global shrinker at a pool usage
threshold halfway between 100% and accept_thr_percent, instead of at
the max pool size. With the default accept_thr_percent=90, the pool
size is kept between 90% and 95% of the limit. Since the global
shrinker now keeps evicting until accept_thr_percent is reached, we no
longer need to maintain the hysteresis variable that tracked the pool
limit overage in zswap_store().
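
For the default accept_thr_percent=90, the start threshold computed by
the new zswap_shrink_start_pages() helper works out to

    start = max_pages * (100 - (100 - 90) / 2) / 100
          = max_pages * 95 / 100

i.e. proactive shrinking starts at 95% of the max pool size and stops
at the 90% accept threshold.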

Before this patch, zswap rejected pages while the shrinker was running
without incrementing the zswap_pool_limit_hit counter. This could be
one reason why zswap wrote new pages through to the swap device before
writing back old pages. With this patch, zswap accepts new pages while
shrinking, and the counter is incremented only when a page is rejected
because the max pool size has been reached.

The name of the sysfs tunable accept_threshold_percent is unchanged, as
it is still the stop condition of the shrinker.
The respective documentation is updated to describe the new behavior.

Signed-off-by: Takero Funaki <[email protected]>
---
Documentation/admin-guide/mm/zswap.rst | 17 +++++----
mm/zswap.c | 49 +++++++++++++++-----------
2 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..a1d8f167a27a 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -111,18 +111,17 @@ checked if it is a same-value filled page before compressing it. If true, the
compressed length of the page is set to zero and the pattern or same-filled
value is stored.

-To prevent zswap from shrinking pool when zswap is full and there's a high
-pressure on swap (this will result in flipping pages in and out zswap pool
-without any real benefit but with a performance drop for the system), a
-special parameter has been introduced to implement a sort of hysteresis to
-refuse taking pages into zswap pool until it has sufficient space if the limit
-has been hit. To set the threshold at which zswap would start accepting pages
-again after it became full, use the sysfs ``accept_threshold_percent``
-attribute, e. g.::
+To prevent zswap from rejecting new pages and incurring latency when zswap is
+full, zswap initiates a worker called global shrinker that proactively evicts
+some pages from the pool to swap devices while the pool is reaching the limit.
+The global shrinker continues to evict pages until there is sufficient space to
+accept new pages. To control how many pages should remain in the pool, use the
+sysfs ``accept_threshold_percent`` attribute as a percentage of the max pool
+size, e. g.::

echo 80 > /sys/module/zswap/parameters/accept_threshold_percent

-Setting this parameter to 100 will disable the hysteresis.
+Setting this parameter to 100 will disable the proactive shrinking.

Some users cannot tolerate the swapping that comes with zswap store failures
and zswap writebacks. Swapping can be disabled entirely (without disabling
diff --git a/mm/zswap.c b/mm/zswap.c
index 08a6f5a6bf62..0186224be8fc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -71,8 +71,6 @@ static u64 zswap_reject_kmemcache_fail;

/* Shrinker work queue */
static struct workqueue_struct *shrink_wq;
-/* Pool limit was hit, we need to calm down */
-static bool zswap_pool_reached_full;

/*********************************
* tunables
@@ -118,7 +116,10 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
static unsigned int zswap_max_pool_percent = 20;
module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);

-/* The threshold for accepting new pages after the max_pool_percent was hit */
+/*
+ * The percentage of pool size that the global shrinker keeps in memory.
+ * It does not protect old pages from the dynamic shrinker.
+ */
static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
uint, 0644);
@@ -487,6 +488,14 @@ static unsigned long zswap_accept_thr_pages(void)
return zswap_max_pages() * zswap_accept_thr_percent / 100;
}

+/*
+ * Returns threshold to start proactive global shrinking.
+ */
+static inline unsigned long zswap_shrink_start_pages(void)
+{
+ return zswap_max_pages() * (100 - (100 - zswap_accept_thr_percent)/2) / 100;
+}
+
unsigned long zswap_total_pages(void)
{
struct zswap_pool *pool;
@@ -504,21 +513,6 @@ unsigned long zswap_total_pages(void)
return total;
}

-static bool zswap_check_limits(void)
-{
- unsigned long cur_pages = zswap_total_pages();
- unsigned long max_pages = zswap_max_pages();
-
- if (cur_pages >= max_pages) {
- zswap_pool_limit_hit++;
- zswap_pool_reached_full = true;
- } else if (zswap_pool_reached_full &&
- cur_pages <= zswap_accept_thr_pages()) {
- zswap_pool_reached_full = false;
- }
- return zswap_pool_reached_full;
-}
-
/*********************************
* param callbacks
**********************************/
@@ -1475,6 +1469,8 @@ bool zswap_store(struct folio *folio)
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
unsigned long value;
+ unsigned long cur_pages;
+ bool need_global_shrink = false;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1497,8 +1493,18 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}

- if (zswap_check_limits())
+ cur_pages = zswap_total_pages();
+
+ if (cur_pages >= zswap_max_pages()) {
+ zswap_pool_limit_hit++;
+ need_global_shrink = true;
goto reject;
+ }
+
+ /* schedule shrink for incoming pages */
+ if (cur_pages >= zswap_shrink_start_pages()
+ && !work_pending(&zswap_shrink_work))
+ queue_work(shrink_wq, &zswap_shrink_work);

/* allocate entry */
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
@@ -1541,6 +1547,9 @@ bool zswap_store(struct folio *folio)

WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
zswap_reject_alloc_fail++;
+
+ /* reduce entry in array */
+ need_global_shrink = true;
goto store_failed;
}

@@ -1590,7 +1599,7 @@ bool zswap_store(struct folio *folio)
zswap_entry_cache_free(entry);
reject:
obj_cgroup_put(objcg);
- if (zswap_pool_reached_full)
+ if (need_global_shrink && !work_pending(&zswap_shrink_work))
queue_work(shrink_wq, &zswap_shrink_work);
check_old:
/*
--
2.43.0


2024-05-28 15:12:29

by Nhat Pham

Subject: Re: [PATCH 1/3] mm: zswap: fix global shrinker memcg iteration

On Mon, May 27, 2024 at 9:34 PM Takero Funaki <[email protected]> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.

Did you observe this problem in practice?

>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root. Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <[email protected]>
> ---
> mm/zswap.c | 76 ++++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 56 insertions(+), 20 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..0b1052cee36c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -775,12 +775,27 @@ void zswap_folio_swapin(struct folio *folio)
> }
> }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
> void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> {
> /* lock out zswap shrinker walking memcg tree */
> spin_lock(&zswap_shrink_lock);
> - if (zswap_next_shrink == memcg)
> - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +
> + if (READ_ONCE(zswap_next_shrink) == memcg) {
> + /* put back reference and advance the cursor */
> + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> + WRITE_ONCE(zswap_next_shrink, memcg);
> + }

Hmm could you expand on how your version differs from what's existing?
What's the point of all these extra steps? The whole thing is done
under a big spin lock anyway.

> +
> spin_unlock(&zswap_shrink_lock);
> }
>
> @@ -1312,25 +1327,38 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
> static void shrink_worker(struct work_struct *w)
> {
> - struct mem_cgroup *memcg;
> + struct mem_cgroup *memcg = NULL;
> + struct mem_cgroup *next_memcg;
> int ret, failures = 0;
> unsigned long thr;
>
> /* Reclaim down to the accept threshold */
> thr = zswap_accept_thr_pages();
>
> - /* global reclaim will select cgroup in a round-robin fashion. */
> + /* global reclaim will select cgroup in a round-robin fashion.
> + *
> + * We save iteration cursor memcg into zswap_next_shrink,
> + * which can be modified by the offline memcg cleaner
> + * zswap_memcg_offline_cleanup().
> + */

I feel like the only difference between this loop and the old loop is
that if we fail to get an online reference to the memcg, we retry from
the next one (given by mem_cgroup_iter()) rather than from the top
(NULL).

For instance, consider the first two steps:

1. First, we check if memcg is the same as zswap_next_shrink. if not,
then reset it to zswap_next_shrink.
2. Advance memcg, then store the result at zswap_next_shrink.

Under the big zswap_shrink_lock, this is the same as:

1. Advance zswap_next_shrink.
2. Look up zswap_next_shrink, then store it under the local memcg variable.

which is what we were previously doing.

The next step is shared - check for null memcg (which again, is the
same as zswap_next_shrink now), and attempt to get an online
reference.
The only difference is when we fail to get the online reference -
instead of starting from the top, we advance memcg again.

Can't we just drop the lock, then add a continue statement? That will
reacquire the lock, advance zswap_next_shrink, look up the result and
store it at memcg, which is what you're trying to achieve?

> do {
> spin_lock(&zswap_shrink_lock);
> - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> - memcg = zswap_next_shrink;
> + next_memcg = READ_ONCE(zswap_next_shrink);
> +
> + if (memcg != next_memcg) {
> + /*
> + * Ours was released by offlining.
> + * Use the saved memcg reference.
> + */
> + memcg = next_memcg;
> + } else {
> +iternext:
> + /* advance cursor */
> + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> + WRITE_ONCE(zswap_next_shrink, memcg);
> + }
>
> /*
> - * We need to retry if we have gone through a full round trip, or if we
> - * got an offline memcg (or else we risk undoing the effect of the
> - * zswap memcg offlining cleanup callback). This is not catastrophic
> - * per se, but it will keep the now offlined memcg hostage for a while.
> - *

Why are we removing this comment?

> * Note that if we got an online memcg, we will keep the extra
> * reference in case the original reference obtained by mem_cgroup_iter
> * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1345,16 +1373,18 @@ static void shrink_worker(struct work_struct *w)
> }
>
> if (!mem_cgroup_tryget_online(memcg)) {
> - /* drop the reference from mem_cgroup_iter() */
> - mem_cgroup_iter_break(NULL, memcg);
> - zswap_next_shrink = NULL;
> - spin_unlock(&zswap_shrink_lock);
> -
> - if (++failures == MAX_RECLAIM_RETRIES)
> - break;
> -
> - goto resched;

I think we should still increment the failure counter, to guard
against long running/infinite loops.

> + /*
> + * It is an offline memcg which we cannot shrink
> + * until its pages are reparented.
> + * Put back the memcg reference before cleanup
> + * function reads it from zswap_next_shrink.
> + */
> + goto iternext;
> }
> + /*
> + * We got an extra memcg reference before unlocking.
> + * The cleaner cannot free it using zswap_next_shrink.
> + */
> spin_unlock(&zswap_shrink_lock);
>
> ret = shrink_memcg(memcg);
> @@ -1368,6 +1398,12 @@ static void shrink_worker(struct work_struct *w)
> resched:
> cond_resched();
> } while (zswap_total_pages() > thr);
> +
> + /*
> + * We can still hold the original memcg reference.
> + * The reference is stored in zswap_next_shrink, and then reused
> + * by the next shrink_worker().
> + */
> }
>
> /*********************************
> --
> 2.43.0
>

2024-05-28 15:12:50

by Nhat Pham

Subject: Re: [PATCH 2/3] mm: zswap: fix global shrinker error handling logic

On Mon, May 27, 2024 at 9:34 PM Takero Funaki <[email protected]> wrote:
>
> This patch fixes the zswap global shrinker, which did not shrink the
> zpool as expected.
>
> The issue it addresses is that `shrink_worker()` did not distinguish
> between unexpected errors and expected error codes that should be
> skipped, such as when a memcg has no stored pages. This led to the
> shrinking process being aborted on the expected error codes.
>
> The shrinker should ignore these cases and skip to the next memcg.
> However, if every memcg ends up being skipped, the walk makes no
> progress at all. To detect this, this patch tracks progress while
> walking the memcg tree and checks whether any progress was made once
> the tree walk is completed.
>
> To handle the empty memcg case, the helper function `shrink_memcg()` is
> modified to return -ENOENT when the memcg stores no pages.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <[email protected]>
> ---
> mm/zswap.c | 31 ++++++++++++++++++++++++++-----
> 1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0b1052cee36c..08a6f5a6bf62 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1304,7 +1304,7 @@ static struct shrinker *zswap_alloc_shrinker(void)
>
> static int shrink_memcg(struct mem_cgroup *memcg)
> {
> - int nid, shrunk = 0;
> + int nid, shrunk = 0, stored = 0;
>
> if (!mem_cgroup_zswap_writeback_enabled(memcg))
> return -EINVAL;
> @@ -1319,9 +1319,16 @@ static int shrink_memcg(struct mem_cgroup *memcg)
> for_each_node_state(nid, N_NORMAL_MEMORY) {
> unsigned long nr_to_walk = 1;
>
> + if (!list_lru_count_one(&zswap_list_lru, nid, memcg))
> + continue;
> + ++stored;
> shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> &shrink_memcg_cb, NULL, &nr_to_walk);
> }
> +
> + if (!stored)
> + return -ENOENT;
> +
> return shrunk ? 0 : -EAGAIN;
> }
>
> @@ -1329,12 +1336,18 @@ static void shrink_worker(struct work_struct *w)
> {
> struct mem_cgroup *memcg = NULL;
> struct mem_cgroup *next_memcg;
> - int ret, failures = 0;
> + int ret, failures = 0, progress;
> unsigned long thr;
>
> /* Reclaim down to the accept threshold */
> thr = zswap_accept_thr_pages();
>
> + /*
> + * We might start from the last memcg.
> + * That is not a failure.
> + */
> + progress = 1;
> +
> /* global reclaim will select cgroup in a round-robin fashion.
> *
> * We save iteration cursor memcg into zswap_next_shrink,
> @@ -1366,9 +1379,12 @@ static void shrink_worker(struct work_struct *w)
> */
> if (!memcg) {
> spin_unlock(&zswap_shrink_lock);
> - if (++failures == MAX_RECLAIM_RETRIES)
> +
> + /* tree walk completed but no progress */
> + if (!progress && ++failures == MAX_RECLAIM_RETRIES)
> break;
>
> + progress = 0;
> goto resched;
> }
>
> @@ -1391,10 +1407,15 @@ static void shrink_worker(struct work_struct *w)
> /* drop the extra reference */
> mem_cgroup_put(memcg);
>
> - if (ret == -EINVAL)
> - break;
> + /* not a writeback candidate memcg */
> + if (ret == -EINVAL || ret == -ENOENT)
> + continue;
> +

Can we get into an infinite loop or a really long-running loop here if
all memcgs have their writeback disabled?

> if (ret && ++failures == MAX_RECLAIM_RETRIES)
> break;
> +
> + ++progress;
> + /* reschedule as we performed some IO */
> resched:
> cond_resched();
> } while (zswap_total_pages() > thr);
> --
> 2.43.0
>

2024-05-28 16:02:29

by Nhat Pham

Subject: Re: [PATCH 3/3] mm: zswap: proactive shrinking before pool size limit is hit

On Mon, May 27, 2024 at 9:34 PM Takero Funaki <[email protected]> wrote:
>
> This patch implements proactive shrinking of the zswap pool before the
> max pool size limit is reached. It also changes zswap to accept new
> pages while the shrinker is running.
>
> To prevent zswap from rejecting new pages and incurring latency when
> zswap is full, this patch queues the global shrinker at a pool usage
> threshold halfway between 100% and accept_thr_percent, instead of at
> the max pool size. With the default accept_thr_percent=90, the pool
> size is kept between 90% and 95% of the limit. Since the global
> shrinker now keeps evicting until accept_thr_percent is reached, we no
> longer need to maintain the hysteresis variable that tracked the pool
> limit overage in zswap_store().
>
> Before this patch, zswap rejected pages while the shrinker was running
> without incrementing the zswap_pool_limit_hit counter. This could be
> one reason why zswap wrote new pages through to the swap device before
> writing back old pages. With this patch, zswap accepts new pages while
> shrinking, and the counter is incremented only when a page is rejected
> because the max pool size has been reached.
>
> The name of the sysfs tunable accept_threshold_percent is unchanged, as
> it is still the stop condition of the shrinker.
> The respective documentation is updated to describe the new behavior.

I'm a bit unsure about using this tunable. How would the user
determine the level at which the zswap pool should be kept empty?

I was actually thinking of removing this knob altogether :)

>
> Signed-off-by: Takero Funaki <[email protected]>
> ---
> Documentation/admin-guide/mm/zswap.rst | 17 +++++----
> mm/zswap.c | 49 +++++++++++++++-----------
> 2 files changed, 37 insertions(+), 29 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 3598dcd7dbe7..a1d8f167a27a 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -111,18 +111,17 @@ checked if it is a same-value filled page before compressing it. If true, the
> compressed length of the page is set to zero and the pattern or same-filled
> value is stored.
>
> -To prevent zswap from shrinking pool when zswap is full and there's a high
> -pressure on swap (this will result in flipping pages in and out zswap pool
> -without any real benefit but with a performance drop for the system), a
> -special parameter has been introduced to implement a sort of hysteresis to
> -refuse taking pages into zswap pool until it has sufficient space if the limit
> -has been hit. To set the threshold at which zswap would start accepting pages
> -again after it became full, use the sysfs ``accept_threshold_percent``
> -attribute, e. g.::
> +To prevent zswap from rejecting new pages and incurring latency when zswap is
> +full, zswap initiates a worker called global shrinker that proactively evicts
> +some pages from the pool to swap devices while the pool is reaching the limit.
> +The global shrinker continues to evict pages until there is sufficient space to
> +accept new pages. To control how many pages should remain in the pool, use the
> +sysfs ``accept_threshold_percent`` attribute as a percentage of the max pool
> +size, e. g.::
>
> echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
>
> -Setting this parameter to 100 will disable the hysteresis.
> +Setting this parameter to 100 will disable the proactive shrinking.
>
> Some users cannot tolerate the swapping that comes with zswap store failures
> and zswap writebacks. Swapping can be disabled entirely (without disabling
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 08a6f5a6bf62..0186224be8fc 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -71,8 +71,6 @@ static u64 zswap_reject_kmemcache_fail;
>
> /* Shrinker work queue */
> static struct workqueue_struct *shrink_wq;
> -/* Pool limit was hit, we need to calm down */
> -static bool zswap_pool_reached_full;
>
> /*********************************
> * tunables
> @@ -118,7 +116,10 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
> static unsigned int zswap_max_pool_percent = 20;
> module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
>
> -/* The threshold for accepting new pages after the max_pool_percent was hit */
> +/*
> + * The percentage of pool size that the global shrinker keeps in memory.
> + * It does not protect old pages from the dynamic shrinker.
> + */
> static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
> module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
> uint, 0644);
> @@ -487,6 +488,14 @@ static unsigned long zswap_accept_thr_pages(void)
> return zswap_max_pages() * zswap_accept_thr_percent / 100;
> }
>
> +/*
> + * Returns threshold to start proactive global shrinking.
> + */
> +static inline unsigned long zswap_shrink_start_pages(void)
> +{
> + return zswap_max_pages() * (100 - (100 - zswap_accept_thr_percent)/2) / 100;
> +}
> +
> unsigned long zswap_total_pages(void)
> {
> struct zswap_pool *pool;
> @@ -504,21 +513,6 @@ unsigned long zswap_total_pages(void)
> return total;
> }
>
> -static bool zswap_check_limits(void)
> -{
> - unsigned long cur_pages = zswap_total_pages();
> - unsigned long max_pages = zswap_max_pages();
> -
> - if (cur_pages >= max_pages) {
> - zswap_pool_limit_hit++;
> - zswap_pool_reached_full = true;
> - } else if (zswap_pool_reached_full &&
> - cur_pages <= zswap_accept_thr_pages()) {
> - zswap_pool_reached_full = false;
> - }
> - return zswap_pool_reached_full;
> -}
> -
> /*********************************
> * param callbacks
> **********************************/
> @@ -1475,6 +1469,8 @@ bool zswap_store(struct folio *folio)
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> unsigned long value;
> + unsigned long cur_pages;
> + bool need_global_shrink = false;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1497,8 +1493,18 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - if (zswap_check_limits())
> + cur_pages = zswap_total_pages();
> +
> + if (cur_pages >= zswap_max_pages()) {
> + zswap_pool_limit_hit++;
> + need_global_shrink = true;
> goto reject;
> + }
> +
> + /* schedule shrink for incoming pages */
> + if (cur_pages >= zswap_shrink_start_pages()
> + && !work_pending(&zswap_shrink_work))
> + queue_work(shrink_wq, &zswap_shrink_work);

I think work_pending() check here is redundant. If you look at the
documentation, queue_work only succeeds if zswap_shrink_work is not
already on the shrink_wq workqueue.

More specifically, if you check the code, queue_work calls
queue_work_on, which has this check:

if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
!clear_pending_if_disabled(work)) {

This is the same bit-check as work_pending, which is defined as:

#define work_pending(work) \
test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))


>
> /* allocate entry */
> entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> @@ -1541,6 +1547,9 @@ bool zswap_store(struct folio *folio)
>
> WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> zswap_reject_alloc_fail++;
> +
> + /* reduce entry in array */
> + need_global_shrink = true;
> goto store_failed;
> }
>
> @@ -1590,7 +1599,7 @@ bool zswap_store(struct folio *folio)
> zswap_entry_cache_free(entry);
> reject:
> obj_cgroup_put(objcg);
> - if (zswap_pool_reached_full)
> + if (need_global_shrink && !work_pending(&zswap_shrink_work))
> queue_work(shrink_wq, &zswap_shrink_work);
> check_old:
> /*
> --
> 2.43.0
>

2024-05-29 12:42:56

by Takero Funaki

Subject: Re: [PATCH 1/3] mm: zswap: fix global shrinker memcg iteration

On Wed, May 29, 2024 at 0:10 Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, May 27, 2024 at 9:34 PM Takero Funaki <[email protected]> wrote:
> >
> > This patch fixes an issue where the zswap global shrinker stopped
> > iterating through the memcg tree.
>
> Did you observe this problem in practice?

Thank you for your comments.

I think this situation is rare, but I need to address it to ensure that
patch 2 will not stop at an offline memcg.
The main issue I am seeing on 6.9.0 through 6.10.0-rc1 is that the
shrinker does not shrink down to accept_thr_percent, and
written_back_pages stays smaller than pool_limit_hit.
This can be explained by the shrinker stopping too early or looping
over only part of the tree.

> >
> > The problem was that `shrink_worker()` would stop iterating when a memcg
> > was being offlined and restart from the tree root. Now, it properly
> > handles the offlining memcg and continues shrinking with the next memcg.
> >
> > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > Signed-off-by: Takero Funaki <[email protected]>
> > ---
> > mm/zswap.c | 76 ++++++++++++++++++++++++++++++++++++++++--------------
> > 1 file changed, 56 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index a50e2986cd2f..0b1052cee36c 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -775,12 +775,27 @@ void zswap_folio_swapin(struct folio *folio)
> > }
> > }
> >
> > +/*
> > + * This function should be called when a memcg is being offlined.
> > + *
> > + * Since the global shrinker shrink_worker() may hold a reference
> > + * of the memcg, we must check and release the reference in
> > + * zswap_next_shrink.
> > + *
> > + * shrink_worker() must handle the case where this function releases
> > + * the reference of memcg being shrunk.
> > + */
> > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > {
> > /* lock out zswap shrinker walking memcg tree */
> > spin_lock(&zswap_shrink_lock);
> > - if (zswap_next_shrink == memcg)
> > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > +
> > + if (READ_ONCE(zswap_next_shrink) == memcg) {
> > + /* put back reference and advance the cursor */
> > + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > + WRITE_ONCE(zswap_next_shrink, memcg);
> > + }
>
> Hmm could you expand on how your version differs from what's existing?
> What's the point of all these extra steps? The whole thing is done
> under a big spin lock anyway.

I agree that the extra code is not strictly necessary. The
READ_ONCE/WRITE_ONCE are there to clarify what is expected and to align
with shrink_worker(), where READ_ONCE is required to reload the shared
variable.
They also act as a safeguard for me, since I sometimes forget that we
must not read zswap_next_shrink before acquiring the lock.

> > +
> > spin_unlock(&zswap_shrink_lock);
> > }
> >
> > @@ -1312,25 +1327,38 @@ static int shrink_memcg(struct mem_cgroup *memcg)
> >
> > static void shrink_worker(struct work_struct *w)
> > {
> > - struct mem_cgroup *memcg;
> > + struct mem_cgroup *memcg = NULL;
> > + struct mem_cgroup *next_memcg;
> > int ret, failures = 0;
> > unsigned long thr;
> >
> > /* Reclaim down to the accept threshold */
> > thr = zswap_accept_thr_pages();
> >
> > - /* global reclaim will select cgroup in a round-robin fashion. */
> > + /* global reclaim will select cgroup in a round-robin fashion.
> > + *
> > + * We save iteration cursor memcg into zswap_next_shrink,
> > + * which can be modified by the offline memcg cleaner
> > + * zswap_memcg_offline_cleanup().
> > + */
>
> I feel like the only difference between this loop and the old loop, is
> that if we fail to get an online reference to memcg, we're trying from
> the next one (given by mem_cgroup_iter()) rather than from the top
> (NULL).
>
> For instance, consider the first two steps:
>
> 1. First, we check if memcg is the same as zswap_next_shrink. if not,
> then reset it to zswap_next_shrink.
> 2. Advance memcg, then store the result at zswap_next_shrink.
>
> Under the big zswap_shrink_lock, this is the same as:
>
> 1. Advance zswap_next_shrink.
> 2. Look up zswap_next_shrink, then store it under the local memcg variable.
>
> which is what we were previously doing.
>
> The next step is shared - check for null memcg (which again, is the
> same as zswap_next_shrink now), and attempt to get an online
> reference.
> The only difference is when we fail to get the online reference -
> instead of starting from the top, we advance memcg again.
>
> Can't we just drop the lock, then add a continue statement? That will
> reacquire the lock, advance zswap_next_shrink, look up the result and
> store it at memcg, which is what you're trying to achieve?
>

If I understand correctly, in this offlining pattern we are not allowed
to leave an offline memcg in zswap_next_shrink more than once.
While an offline memcg can appear in mem_cgroup_iter() repeatedly, the
cleaner is called only once per offline memcg. (Or is there some retry
logic?)

If shrink_worker() finds an offline memcg in its iteration AFTER the
cleaner was called, we must put back the offline memcg reference inside
shrink_worker() BEFORE returning/sleeping.
Otherwise, the offline memcg reference will be kept in
zswap_next_shrink until the next shrink_worker() is queued, and there
is no second chance for the cleaner to release it.

This means shrink_worker() has to check:
1. Take the lock, then check whether the memcg is online while locked.
2a. If it is online, it can still start offlining right after the
check, but the lock assures us that the cleaner is waiting for the lock
just behind us. We can rely on this.
2b. If it is offline, we cannot tell whether the cleaner has already
run or is about to run. We have to put back the offline memcg
reference ourselves, assuming no cleaner is available.

If we get an offline memcg AND abort the shrinker because of the max
failure limit, that breaks condition 2b. Thus we need to
unconditionally restart from the beginning of the do block.
I will add these expected situations to the comments.

For unlocking, should we rewrite,

goto iternext;

to

spin_unlock();
goto before_spin_lock; /* just after do{ */

I think that will minimize the critical section and satisfy condition 2b.
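
To make sure we are talking about the same structure, here is a rough
sketch of the loop shape I have in mind (failure and progress
accounting omitted; the label name is only illustrative):

        do {
before_spin_lock:
                spin_lock(&zswap_shrink_lock);
                next_memcg = READ_ONCE(zswap_next_shrink);

                if (memcg != next_memcg) {
                        /* ours was released by offlining; use the saved one */
                        memcg = next_memcg;
                } else {
                        /* advance the cursor and publish it for the cleaner */
                        memcg = mem_cgroup_iter(NULL, memcg, NULL);
                        WRITE_ONCE(zswap_next_shrink, memcg);
                }

                if (!memcg) {
                        spin_unlock(&zswap_shrink_lock);
                        /* failure check as in the patch */
                        goto resched;
                }

                if (!mem_cgroup_tryget_online(memcg)) {
                        /*
                         * Offline memcg: leave the reference in
                         * zswap_next_shrink so the cleaner can release
                         * it, drop the lock, and retry from the top.
                         */
                        spin_unlock(&zswap_shrink_lock);
                        goto before_spin_lock;
                }
                /* we hold an extra reference to an online memcg */
                spin_unlock(&zswap_shrink_lock);

                ret = shrink_memcg(memcg);
                /* ... */
resched:
                cond_resched();
        } while (zswap_total_pages() > thr);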

For the memcg iteration,
> 2. Advance memcg, then store the result at zswap_next_shrink.
should be done only if `(memcg == zswap_next_shrink)`.

Say we are iterating A -> B -> C and A is being offlined.
While we hold memcg=A and drop the lock, the cleaner may advance
zswap_next_shrink from A to B.
If we do not check `memcg != next_memcg`, we advance
zswap_next_shrink from B to C again, forgetting B.

That is the reason I added the `(memcg != next_memcg)` check.
Although it can be negligible, as it only skips one memcg per cleaner
callback.

This is my understanding. Am I missing any crucial points? Any
comments or advice would be greatly appreciated.

> > do {
> > spin_lock(&zswap_shrink_lock);
> > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > - memcg = zswap_next_shrink;
> > + next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > + if (memcg != next_memcg) {
> > + /*
> > + * Ours was released by offlining.
> > + * Use the saved memcg reference.
> > + */
> > + memcg = next_memcg;
> > + } else {
> > +iternext:
> > + /* advance cursor */
> > + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > + WRITE_ONCE(zswap_next_shrink, memcg);
> > + }
> >
> > /*
> > - * We need to retry if we have gone through a full round trip, or if we
> > - * got an offline memcg (or else we risk undoing the effect of the
> > - * zswap memcg offlining cleanup callback). This is not catastrophic
> > - * per se, but it will keep the now offlined memcg hostage for a while.
> > - *
>
> Why are we removing this comment?

The existing comment about aborting the loop on an offlining memcg is
no longer valid with this patch, since the offline memcg is simply
skipped.
I think the new behavior is described in the comment at the beginning
of the loop and in the !mem_cgroup_tryget_online branch. Please let me
know if you have any suggestions.


> > * Note that if we got an online memcg, we will keep the extra
> > * reference in case the original reference obtained by mem_cgroup_iter
> > * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1345,16 +1373,18 @@ static void shrink_worker(struct work_struct *w)
> > }
> >
> > if (!mem_cgroup_tryget_online(memcg)) {
> > - /* drop the reference from mem_cgroup_iter() */
> > - mem_cgroup_iter_break(NULL, memcg);
> > - zswap_next_shrink = NULL;
> > - spin_unlock(&zswap_shrink_lock);
> > -
> > - if (++failures == MAX_RECLAIM_RETRIES)
> > - break;
> > -
> > - goto resched;
>
> I think we should still increment the failure counter, to guard
> against long running/infinite loops.

Since we skip the offline memcg instead of restarting from the root,
the new iteration will terminate when it reaches the tree root, in a
finite number of steps. I am afraid that counting this as a failure
would terminate the shrinker too easily.

I think shrink_worker() running longer is not an issue because the
amount of work per while-loop iteration is limited by rescheduling. As
it has a dedicated WQ_MEM_RECLAIM workqueue, returning from the
function is not necessary. cond_resched() should allow other
workqueues to run.
The next patch also adds a progress check per tree walk.


> > + /*
> > + * It is an offline memcg which we cannot shrink
> > + * until its pages are reparented.
> > + * Put back the memcg reference before cleanup
> > + * function reads it from zswap_next_shrink.
> > + */
> > + goto iternext;
> > }
> > + /*
> > + * We got an extra memcg reference before unlocking.
> > + * The cleaner cannot free it using zswap_next_shrink.
> > + */
> > spin_unlock(&zswap_shrink_lock);
> >
> > ret = shrink_memcg(memcg);
> > @@ -1368,6 +1398,12 @@ static void shrink_worker(struct work_struct *w)
> > resched:
> > cond_resched();
> > } while (zswap_total_pages() > thr);
> > +
> > + /*
> > + * We can still hold the original memcg reference.
> > + * The reference is stored in zswap_next_shrink, and then reused
> > + * by the next shrink_worker().
> > + */
> > }
> >
> > /*********************************
> > --
> > 2.43.0
> >



--

<[email protected]>

2024-05-29 13:08:07

by Takero Funaki

Subject: Re: [PATCH 2/3] mm: zswap: fix global shrinker error handling logic

On Wed, May 29, 2024 at 0:11 Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, May 27, 2024 at 9:34 PM Takero Funaki <[email protected]> wrote:
> >
> > This patch fixes the zswap global shrinker, which did not shrink the
> > zpool as expected.
> >
> > The issue it addresses is that `shrink_worker()` did not distinguish
> > between unexpected errors and expected error codes that should be
> > skipped, such as when a memcg has no stored pages. This led to the
> > shrinking process being aborted on the expected error codes.
> >
> > The shrinker should ignore these cases and skip to the next memcg.
> > However, if every memcg ends up being skipped, the walk makes no
> > progress at all. To detect this, this patch tracks progress while
> > walking the memcg tree and checks whether any progress was made once
> > the tree walk is completed.
> >
> > To handle the empty memcg case, the helper function `shrink_memcg()` is
> > modified to return -ENOENT when the memcg stores no pages.
> >
> > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > Signed-off-by: Takero Funaki <[email protected]>
> > ---
> > mm/zswap.c | 31 ++++++++++++++++++++++++++-----
> > 1 file changed, 26 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 0b1052cee36c..08a6f5a6bf62 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1304,7 +1304,7 @@ static struct shrinker *zswap_alloc_shrinker(void)
> >
> > static int shrink_memcg(struct mem_cgroup *memcg)
> > {
> > - int nid, shrunk = 0;
> > + int nid, shrunk = 0, stored = 0;
> >
> > if (!mem_cgroup_zswap_writeback_enabled(memcg))
> > return -EINVAL;
> > @@ -1319,9 +1319,16 @@ static int shrink_memcg(struct mem_cgroup *memcg)
> > for_each_node_state(nid, N_NORMAL_MEMORY) {
> > unsigned long nr_to_walk = 1;
> >
> > + if (!list_lru_count_one(&zswap_list_lru, nid, memcg))
> > + continue;
> > + ++stored;
> > shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> > &shrink_memcg_cb, NULL, &nr_to_walk);
> > }
> > +
> > + if (!stored)
> > + return -ENOENT;
> > +
> > return shrunk ? 0 : -EAGAIN;
> > }
> >
> > @@ -1329,12 +1336,18 @@ static void shrink_worker(struct work_struct *w)
> > {
> > struct mem_cgroup *memcg = NULL;
> > struct mem_cgroup *next_memcg;
> > - int ret, failures = 0;
> > + int ret, failures = 0, progress;
> > unsigned long thr;
> >
> > /* Reclaim down to the accept threshold */
> > thr = zswap_accept_thr_pages();
> >
> > + /*
> > + * We might start from the last memcg.
> > + * That is not a failure.
> > + */
> > + progress = 1;
> > +
> > /* global reclaim will select cgroup in a round-robin fashion.
> > *
> > * We save iteration cursor memcg into zswap_next_shrink,
> > @@ -1366,9 +1379,12 @@ static void shrink_worker(struct work_struct *w)
> > */
> > if (!memcg) {
> > spin_unlock(&zswap_shrink_lock);
> > - if (++failures == MAX_RECLAIM_RETRIES)
> > +
> > + /* tree walk completed but no progress */
> > + if (!progress && ++failures == MAX_RECLAIM_RETRIES)
> > break;
> >
> > + progress = 0;
> > goto resched;
> > }

Here, the `progress` counter tracks how many memcgs successfully
evicted a page during one walk of the tree (not per while-loop
iteration); it is reset to 0 whenever the walk wraps around to the tree
root. progress > 0 means the walk made some progress.
If we reach the tree root (NULL) without any progress, that counts as a
failure.

Before the loop starts, the progress counter is initialized to 1
because the first tree walk might not cover all the memcgs, e.g. if the
previous worker terminated at the very last memcg.


> >
> > @@ -1391,10 +1407,15 @@ static void shrink_worker(struct work_struct *w)
> > /* drop the extra reference */
> > mem_cgroup_put(memcg);
> >
> > - if (ret == -EINVAL)
> > - break;
> > + /* not a writeback candidate memcg */
> > + if (ret == -EINVAL || ret == -ENOENT)
> > + continue;
> > +
>
> Can we get into an infinite loop or a really long-running loop here if
> all memcgs have their writeback disabled?
>
> > if (ret && ++failures == MAX_RECLAIM_RETRIES)
> > break;
> > +
> > + ++progress;
> > + /* reschedule as we performed some IO */
> > resched:
> > cond_resched();
> > } while (zswap_total_pages() > thr);
> > --
> > 2.43.0
> >



--

<[email protected]>

2024-05-29 13:28:18

by Takero Funaki

Subject: Re: [PATCH 3/3] mm: zswap: proactive shrinking before pool size limit is hit

On Wed, May 29, 2024 at 1:01 Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, May 27, 2024 at 9:34 PM Takero Funaki <[email protected]> wrote:
> >
> > This patch implements proactive shrinking of the zswap pool before the
> > max pool size limit is reached. It also changes zswap to accept new
> > pages while the shrinker is running.
> >
> > To prevent zswap from rejecting new pages and incurring latency when
> > zswap is full, this patch queues the global shrinker at a pool usage
> > threshold halfway between 100% and accept_thr_percent, instead of at
> > the max pool size. With the default accept_thr_percent=90, the pool
> > size is kept between 90% and 95% of the limit. Since the global
> > shrinker now keeps evicting until accept_thr_percent is reached, we no
> > longer need to maintain the hysteresis variable that tracked the pool
> > limit overage in zswap_store().
> >
> > Before this patch, zswap rejected pages while the shrinker was running
> > without incrementing the zswap_pool_limit_hit counter. This could be
> > one reason why zswap wrote new pages through to the swap device before
> > writing back old pages. With this patch, zswap accepts new pages while
> > shrinking, and the counter is incremented only when a page is rejected
> > because the max pool size has been reached.
> >
> > The name of the sysfs tunable accept_threshold_percent is unchanged, as
> > it is still the stop condition of the shrinker.
> > The respective documentation is updated to describe the new behavior.
>
> I'm a bit unsure about using this tunable. How would the user
> determine the level at which the zswap pool should be kept empty?
>
> I was actually thinking of removing this knob altogether :)
>

If we see a large pool_limit_hit, that indicates we should lower the
accept threshold to make more space proactively, so that new pages from
active processes are stored rather than rejected.
If not, we can set a higher accept threshold to keep more pages in the
pool for swapin, e.g. for low-activity background processes.
That depends on the workload and can be tuned by the admin, I think.

> >
> > Signed-off-by: Takero Funaki <[email protected]>
> > ---
> > Documentation/admin-guide/mm/zswap.rst | 17 +++++----
> > mm/zswap.c | 49 +++++++++++++++-----------
> > 2 files changed, 37 insertions(+), 29 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> > index 3598dcd7dbe7..a1d8f167a27a 100644
> > --- a/Documentation/admin-guide/mm/zswap.rst
> > +++ b/Documentation/admin-guide/mm/zswap.rst
> > @@ -111,18 +111,17 @@ checked if it is a same-value filled page before compressing it. If true, the
> > compressed length of the page is set to zero and the pattern or same-filled
> > value is stored.
> >
> > -To prevent zswap from shrinking pool when zswap is full and there's a high
> > -pressure on swap (this will result in flipping pages in and out zswap pool
> > -without any real benefit but with a performance drop for the system), a
> > -special parameter has been introduced to implement a sort of hysteresis to
> > -refuse taking pages into zswap pool until it has sufficient space if the limit
> > -has been hit. To set the threshold at which zswap would start accepting pages
> > -again after it became full, use the sysfs ``accept_threshold_percent``
> > -attribute, e. g.::
> > +To prevent zswap from rejecting new pages and incurring latency when zswap is
> > +full, zswap initiates a worker called global shrinker that proactively evicts
> > +some pages from the pool to swap devices while the pool is reaching the limit.
> > +The global shrinker continues to evict pages until there is sufficient space to
> > +accept new pages. To control how many pages should remain in the pool, use the
> > +sysfs ``accept_threshold_percent`` attribute as a percentage of the max pool
> > +size, e. g.::
> >
> > echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
> >
> > -Setting this parameter to 100 will disable the hysteresis.
> > +Setting this parameter to 100 will disable the proactive shrinking.
> >
> > Some users cannot tolerate the swapping that comes with zswap store failures
> > and zswap writebacks. Swapping can be disabled entirely (without disabling
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 08a6f5a6bf62..0186224be8fc 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -71,8 +71,6 @@ static u64 zswap_reject_kmemcache_fail;
> >
> > /* Shrinker work queue */
> > static struct workqueue_struct *shrink_wq;
> > -/* Pool limit was hit, we need to calm down */
> > -static bool zswap_pool_reached_full;
> >
> > /*********************************
> > * tunables
> > @@ -118,7 +116,10 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
> > static unsigned int zswap_max_pool_percent = 20;
> > module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
> >
> > -/* The threshold for accepting new pages after the max_pool_percent was hit */
> > +/*
> > + * The percentage of pool size that the global shrinker keeps in memory.
> > + * It does not protect old pages from the dynamic shrinker.
> > + */
> > static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
> > module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
> > uint, 0644);
> > @@ -487,6 +488,14 @@ static unsigned long zswap_accept_thr_pages(void)
> > return zswap_max_pages() * zswap_accept_thr_percent / 100;
> > }
> >
> > +/*
> > + * Returns threshold to start proactive global shrinking.
> > + */
> > +static inline unsigned long zswap_shrink_start_pages(void)
> > +{
> > + return zswap_max_pages() * (100 - (100 - zswap_accept_thr_percent)/2) / 100;
> > +}
> > +
> > unsigned long zswap_total_pages(void)
> > {
> > struct zswap_pool *pool;
> > @@ -504,21 +513,6 @@ unsigned long zswap_total_pages(void)
> > return total;
> > }
> >
> > -static bool zswap_check_limits(void)
> > -{
> > - unsigned long cur_pages = zswap_total_pages();
> > - unsigned long max_pages = zswap_max_pages();
> > -
> > - if (cur_pages >= max_pages) {
> > - zswap_pool_limit_hit++;
> > - zswap_pool_reached_full = true;
> > - } else if (zswap_pool_reached_full &&
> > - cur_pages <= zswap_accept_thr_pages()) {
> > - zswap_pool_reached_full = false;
> > - }
> > - return zswap_pool_reached_full;
> > -}
> > -
> > /*********************************
> > * param callbacks
> > **********************************/
> > @@ -1475,6 +1469,8 @@ bool zswap_store(struct folio *folio)
> > struct obj_cgroup *objcg = NULL;
> > struct mem_cgroup *memcg = NULL;
> > unsigned long value;
> > + unsigned long cur_pages;
> > + bool need_global_shrink = false;
> >
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1497,8 +1493,18 @@ bool zswap_store(struct folio *folio)
> > mem_cgroup_put(memcg);
> > }
> >
> > - if (zswap_check_limits())
> > + cur_pages = zswap_total_pages();
> > +
> > + if (cur_pages >= zswap_max_pages()) {
> > + zswap_pool_limit_hit++;
> > + need_global_shrink = true;
> > goto reject;
> > + }
> > +
> > + /* schedule shrink for incoming pages */
> > + if (cur_pages >= zswap_shrink_start_pages()
> > + && !work_pending(&zswap_shrink_work))
> > + queue_work(shrink_wq, &zswap_shrink_work);
>
> I think work_pending() check here is redundant. If you look at the
> documentation, queue_work only succeeds if zswap_shrink_work is not
> already on the shrink_wq workqueue.
>
> More specifically, if you check the code, queue_work calls
> queue_work_on, which has this check:
>
> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
> !clear_pending_if_disabled(work)) {
>
> This is the same bit-check as work_pending, which is defined as:
>
> #define work_pending(work) \
> test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
>

Thanks for the review and the info. I will remove the redundant
work_pending() checks.

>
> >
> > /* allocate entry */
> > entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > @@ -1541,6 +1547,9 @@ bool zswap_store(struct folio *folio)
> >
> > WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > zswap_reject_alloc_fail++;
> > +
> > + /* reduce entry in array */
> > + need_global_shrink = true;
> > goto store_failed;
> > }
> >
> > @@ -1590,7 +1599,7 @@ bool zswap_store(struct folio *folio)
> > zswap_entry_cache_free(entry);
> > reject:
> > obj_cgroup_put(objcg);
> > - if (zswap_pool_reached_full)
> > + if (need_global_shrink && !work_pending(&zswap_shrink_work))
> > queue_work(shrink_wq, &zswap_shrink_work);
> > check_old:
> > /*
> > --
> > 2.43.0
> >



--

<[email protected]>