This series addresses two issues and introduces a minor improvement in
the zswap global shrinker:
1. Fix the memcg iteration logic that breaks iteration on offline memcgs.
2. Fix the error path that aborts on expected error codes.
3. Add proactive shrinking that starts at 91% full for the default 90% accept threshold.
These patches need to be applied in this order to avoid potential loops
caused by the first issue. Patch 3 can be applied independently, but the
two issues must be resolved to ensure the shrinker can evict pages.
Previously, the zswap pool could be filled with old pages that the
shrinker failed to evict, leading to zswap rejecting new pages. With
this series applied, the shrinker proactively continues to evict pages
until the pool falls below the accept_thr_percent threshold, as
documented, keeping recent pages in the pool.
As a side effect of the changes to the hysteresis logic, zswap will no
longer reject pages while under the max pool limit.
With this series, reclaims smaller than the proactive shrinking amount
finish instantly and trigger background shrinking. Admins can check
whether new pages are being buffered by zswap by monitoring the
pool_limit_hit counter.
Changes since v0:
mm: zswap: fix global shrinker memcg iteration
- Drop and reacquire spinlock before skipping a memcg.
- Add some comment to clarify the locking mechanism.
mm: zswap: proactive shrinking before pool size limit is hit
- Remove unneeded check before scheduling work.
- Change shrink start threshold to accept_thr_percent + 1%.
Now it starts shrinking at accept_thr_percent + 1%. Previously, the
threshold was at the midpoint between 100% and accept_thr_percent.
If a workload needs 10% of the pool as headroom to buffer the average
reclaim amount, the previous patch required setting accept_thr_percent
to 80%; for 50% headroom it became 0%, which is neither acceptable nor
intuitive for admins. We could use the accept percentage as the shrink
threshold directly, but then the shrinker would be queued too frequently
around the accept threshold, so I added a 1% minimum gap to the shrink
start threshold.
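For the default tunables, the thresholds work out as below (a minimal
sketch of the arithmetic only, mirroring zswap_max_pages(),
zswap_accept_thr_pages() and the new zswap_shrink_start_pages() from
patch 3; it assumes the defaults max_pool_percent=20 and
accept_thr_percent=90):

```c
/* illustration only, in-kernel context */
unsigned long max_pages    = totalram_pages() * 20 / 100;          /* max pool size           */
unsigned long accept_thr   = max_pages * 90 / 100;                 /* shrinker stops at 90%   */
unsigned long shrink_start = max_pages * min(100, 90 + 1) / 100;   /* shrinking starts at 91% */
```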
----
Takero Funaki (3):
mm: zswap: fix global shrinker memcg iteration
mm: zswap: fix global shrinker error handling logic
mm: zswap: proactive shrinking before pool size limit is hit
Documentation/admin-guide/mm/zswap.rst | 17 ++-
mm/zswap.c | 172 ++++++++++++++++++-------
2 files changed, 136 insertions(+), 53 deletions(-)
--
2.43.0
This patch fixes an issue where the zswap global shrinker stopped
iterating through the memcg tree.
The problem was that `shrink_worker()` would stop iterating when a memcg
was being offlined and restart from the tree root. Now, it properly
handles the offlining memcg and continues shrinking with the next memcg.
This patch also modifies the handling of the lock for the offlined memcg
cleaner to adapt to the change in the iteration, and avoids a negligibly
rare case where a memcg could be skipped in the shrink iteration.
Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <[email protected]>
---
mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 68 insertions(+), 19 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 80c634acb8d5..d720a42069b6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio)
}
}
+/*
+ * This function should be called when a memcg is being offlined.
+ *
+ * Since the global shrinker shrink_worker() may hold a reference
+ * of the memcg, we must check and release the reference in
+ * zswap_next_shrink.
+ *
+ * shrink_worker() must handle the case where this function releases
+ * the reference of memcg being shrunk.
+ */
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
/* lock out zswap shrinker walking memcg tree */
spin_lock(&zswap_shrink_lock);
- if (zswap_next_shrink == memcg)
- zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+
+ if (READ_ONCE(zswap_next_shrink) == memcg) {
+ /* put back reference and advance the cursor */
+ memcg = mem_cgroup_iter(NULL, memcg, NULL);
+ WRITE_ONCE(zswap_next_shrink, memcg);
+ }
+
spin_unlock(&zswap_shrink_lock);
}
@@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg)
static void shrink_worker(struct work_struct *w)
{
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
+ struct mem_cgroup *next_memcg;
int ret, failures = 0;
unsigned long thr;
/* Reclaim down to the accept threshold */
thr = zswap_accept_thr_pages();
- /* global reclaim will select cgroup in a round-robin fashion. */
+ /* global reclaim will select cgroup in a round-robin fashion.
+ *
+ * We save iteration cursor memcg into zswap_next_shrink,
+ * which can be modified by the offline memcg cleaner
+ * zswap_memcg_offline_cleanup().
+ *
+ * Since the offline cleaner is called only once, we cannot abandon
+ * the offline memcg reference in zswap_next_shrink.
+ * We can rely on the cleaner only if we get an online memcg under lock.
+ * If we get an offline memcg, we cannot determine whether the cleaner will be
+ * called later. We must put it back before returning from this function.
+ */
do {
+iternext:
spin_lock(&zswap_shrink_lock);
- zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
- memcg = zswap_next_shrink;
+ next_memcg = READ_ONCE(zswap_next_shrink);
+
+ if (memcg != next_memcg) {
+ /*
+ * Ours was released by offlining.
+ * Use the saved memcg reference.
+ */
+ memcg = next_memcg;
+ } else {
+ /* advance cursor */
+ memcg = mem_cgroup_iter(NULL, memcg, NULL);
+ WRITE_ONCE(zswap_next_shrink, memcg);
+ }
/*
- * We need to retry if we have gone through a full round trip, or if we
- * got an offline memcg (or else we risk undoing the effect of the
- * zswap memcg offlining cleanup callback). This is not catastrophic
- * per se, but it will keep the now offlined memcg hostage for a while.
- *
* Note that if we got an online memcg, we will keep the extra
* reference in case the original reference obtained by mem_cgroup_iter
* is dropped by the zswap memcg offlining callback, ensuring that the
@@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
}
if (!mem_cgroup_tryget_online(memcg)) {
- /* drop the reference from mem_cgroup_iter() */
- mem_cgroup_iter_break(NULL, memcg);
- zswap_next_shrink = NULL;
+ /*
+ * It is an offline memcg which we cannot shrink
+ * until its pages are reparented.
+ *
+ * Since we cannot determine if the offline cleaner has
+ * already been called or not, the offline memcg must be
+ * put back unconditionally. We cannot abort the loop while
+ * zswap_next_shrink has a reference of this offline memcg.
+ */
spin_unlock(&zswap_shrink_lock);
-
- if (++failures == MAX_RECLAIM_RETRIES)
- break;
-
- goto resched;
+ goto iternext;
}
+ /*
+ * We got an extra memcg reference before unlocking.
+ * The cleaner cannot free it using zswap_next_shrink.
+ *
+ * Our memcg can be offlined after we get an online memcg here.
+ * In this case, the cleaner is waiting for the lock just behind us.
+ */
spin_unlock(&zswap_shrink_lock);
ret = shrink_memcg(memcg);
@@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w)
resched:
cond_resched();
} while (zswap_total_pages() > thr);
+
+ /*
+ * We can still hold the original memcg reference.
+ * The reference is stored in zswap_next_shrink, and then reused
+ * by the next shrink_worker().
+ */
}
/*********************************
--
2.43.0
This patch fixes the zswap global shrinker, which did not shrink the
zpool as expected.
The issue it addresses is that `shrink_worker()` did not distinguish
between unexpected errors and expected error codes that should be
skipped, such as when there is no stored page in a memcg. This led to
the shrinking process being aborted on the expected error codes.
The shrinker should ignore these cases and skip to the next memcg.
However, skipping all memcgs presents another problem. To address this,
this patch tracks progress while walking the memcg tree and checks for
progress once the tree walk is completed.
To handle the empty memcg case, the helper function `shrink_memcg()` is
modified to check if the memcg is empty and then return -ENOENT.
Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <[email protected]>
---
mm/zswap.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index d720a42069b6..1a90f434f247 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1393,7 +1393,7 @@ static struct shrinker *zswap_alloc_shrinker(void)
static int shrink_memcg(struct mem_cgroup *memcg)
{
- int nid, shrunk = 0;
+ int nid, shrunk = 0, stored = 0;
if (!mem_cgroup_zswap_writeback_enabled(memcg))
return -EINVAL;
@@ -1408,9 +1408,16 @@ static int shrink_memcg(struct mem_cgroup *memcg)
for_each_node_state(nid, N_NORMAL_MEMORY) {
unsigned long nr_to_walk = 1;
+ if (!list_lru_count_one(&zswap_list_lru, nid, memcg))
+ continue;
+ ++stored;
shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
&shrink_memcg_cb, NULL, &nr_to_walk);
}
+
+ if (!stored)
+ return -ENOENT;
+
return shrunk ? 0 : -EAGAIN;
}
@@ -1418,12 +1425,18 @@ static void shrink_worker(struct work_struct *w)
{
struct mem_cgroup *memcg = NULL;
struct mem_cgroup *next_memcg;
- int ret, failures = 0;
+ int ret, failures = 0, progress;
unsigned long thr;
/* Reclaim down to the accept threshold */
thr = zswap_accept_thr_pages();
+ /*
+ * We might start from the last memcg.
+ * That is not a failure.
+ */
+ progress = 1;
+
/* global reclaim will select cgroup in a round-robin fashion.
*
* We save iteration cursor memcg into zswap_next_shrink,
@@ -1461,9 +1474,12 @@ static void shrink_worker(struct work_struct *w)
*/
if (!memcg) {
spin_unlock(&zswap_shrink_lock);
- if (++failures == MAX_RECLAIM_RETRIES)
+
+ /* tree walk completed but no progress */
+ if (!progress && ++failures == MAX_RECLAIM_RETRIES)
break;
+ progress = 0;
goto resched;
}
@@ -1493,10 +1509,15 @@ static void shrink_worker(struct work_struct *w)
/* drop the extra reference */
mem_cgroup_put(memcg);
- if (ret == -EINVAL)
- break;
+ /* not a writeback candidate memcg */
+ if (ret == -EINVAL || ret == -ENOENT)
+ continue;
+
if (ret && ++failures == MAX_RECLAIM_RETRIES)
break;
+
+ ++progress;
+ /* reschedule as we performed some IO */
resched:
cond_resched();
} while (zswap_total_pages() > thr);
--
2.43.0
This patch implements proactive shrinking of the zswap pool before the
max pool size limit is reached. It also changes zswap to accept new
pages while the shrinker is running.
To prevent zswap from rejecting new pages and incurring latency when
zswap is full, this patch queues the global shrinker at a pool usage
threshold between accept_thr_percent and 100%, instead of at the max
pool size. The pool size will be kept between 90% and 91% of the max for
the default accept_thr_percent=90. Since the global shrinker continues
to shrink until accept_thr_percent, we no longer need to maintain the
hysteresis variable tracking the pool limit overage in zswap_store().
Before this patch, zswap rejected pages while the shrinker was running,
without incrementing the zswap_pool_limit_hit counter. This could be one
reason why zswap wrote through new pages before writing back old pages.
With this patch, zswap accepts new pages while shrinking, and increments
the counter if and only if it rejects pages because the max pool size
has been reached.
Now, reclaims smaller than the proactive shrinking amount finish
instantly and trigger background shrinking. Admins can check if new
pages are buffered by zswap by monitoring the pool_limit_hit counter.
The name of the sysfs tunable accept_threshold_percent is unchanged, as
it is still the stop condition of the shrinker.
The respective documentation is updated to describe the new behavior.
Signed-off-by: Takero Funaki <[email protected]>
---
Documentation/admin-guide/mm/zswap.rst | 17 ++++----
mm/zswap.c | 54 ++++++++++++++++----------
2 files changed, 42 insertions(+), 29 deletions(-)
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..a1d8f167a27a 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -111,18 +111,17 @@ checked if it is a same-value filled page before compressing it. If true, the
compressed length of the page is set to zero and the pattern or same-filled
value is stored.
-To prevent zswap from shrinking pool when zswap is full and there's a high
-pressure on swap (this will result in flipping pages in and out zswap pool
-without any real benefit but with a performance drop for the system), a
-special parameter has been introduced to implement a sort of hysteresis to
-refuse taking pages into zswap pool until it has sufficient space if the limit
-has been hit. To set the threshold at which zswap would start accepting pages
-again after it became full, use the sysfs ``accept_threshold_percent``
-attribute, e. g.::
+To prevent zswap from rejecting new pages and incurring latency when zswap is
+full, zswap initiates a worker called the global shrinker that proactively evicts
+some pages from the pool to swap devices before the pool reaches its limit.
+The global shrinker continues to evict pages until there is sufficient space to
+accept new pages. To control how many pages should remain in the pool, use the
+sysfs ``accept_threshold_percent`` attribute as a percentage of the max pool
+size, e. g.::
echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
-Setting this parameter to 100 will disable the hysteresis.
+Setting this parameter to 100 will disable the proactive shrinking.
Some users cannot tolerate the swapping that comes with zswap store failures
and zswap writebacks. Swapping can be disabled entirely (without disabling
diff --git a/mm/zswap.c b/mm/zswap.c
index 1a90f434f247..e957bfdeaf70 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -71,8 +71,6 @@ static u64 zswap_reject_kmemcache_fail;
/* Shrinker work queue */
static struct workqueue_struct *shrink_wq;
-/* Pool limit was hit, we need to calm down */
-static bool zswap_pool_reached_full;
/*********************************
* tunables
@@ -118,7 +116,10 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
static unsigned int zswap_max_pool_percent = 20;
module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
-/* The threshold for accepting new pages after the max_pool_percent was hit */
+/*
+ * The percentage of pool size that the global shrinker keeps in memory.
+ * It does not protect old pages from the dynamic shrinker.
+ */
static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
uint, 0644);
@@ -539,6 +540,20 @@ static unsigned long zswap_accept_thr_pages(void)
return zswap_max_pages() * zswap_accept_thr_percent / 100;
}
+/*
+ * Returns threshold to start proactive global shrinking.
+ */
+static inline unsigned long zswap_shrink_start_pages(void)
+{
+ /*
+ * Shrinker will evict pages to the accept threshold.
+ * We add 1% to not schedule shrinker too frequently
+ * for small swapout.
+ */
+ return zswap_max_pages() *
+ min(100, zswap_accept_thr_percent + 1) / 100;
+}
+
unsigned long zswap_total_pages(void)
{
struct zswap_pool *pool;
@@ -556,21 +571,6 @@ unsigned long zswap_total_pages(void)
return total;
}
-static bool zswap_check_limits(void)
-{
- unsigned long cur_pages = zswap_total_pages();
- unsigned long max_pages = zswap_max_pages();
-
- if (cur_pages >= max_pages) {
- zswap_pool_limit_hit++;
- zswap_pool_reached_full = true;
- } else if (zswap_pool_reached_full &&
- cur_pages <= zswap_accept_thr_pages()) {
- zswap_pool_reached_full = false;
- }
- return zswap_pool_reached_full;
-}
-
/*********************************
* param callbacks
**********************************/
@@ -1577,6 +1577,8 @@ bool zswap_store(struct folio *folio)
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
unsigned long value;
+ unsigned long cur_pages;
+ bool need_global_shrink = false;
VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1599,8 +1601,17 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}
- if (zswap_check_limits())
+ cur_pages = zswap_total_pages();
+
+ if (cur_pages >= zswap_max_pages()) {
+ zswap_pool_limit_hit++;
+ need_global_shrink = true;
goto reject;
+ }
+
+ /* schedule shrink for incoming pages */
+ if (cur_pages >= zswap_shrink_start_pages())
+ queue_work(shrink_wq, &zswap_shrink_work);
/* allocate entry */
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
@@ -1643,6 +1654,9 @@ bool zswap_store(struct folio *folio)
WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
zswap_reject_alloc_fail++;
+
+ /* reduce entry in array */
+ need_global_shrink = true;
goto store_failed;
}
@@ -1692,7 +1706,7 @@ bool zswap_store(struct folio *folio)
zswap_entry_cache_free(entry);
reject:
obj_cgroup_put(objcg);
- if (zswap_pool_reached_full)
+ if (need_global_shrink)
queue_work(shrink_wq, &zswap_shrink_work);
check_old:
/*
--
2.43.0
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root. Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> This patch also modified handing of the lock for offlined memcg cleaner
> to adapt the change in the iteration, and avoid negligibly rare skipping
> of a memcg from shrink iteration.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <[email protected]>
> ---
> mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 68 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 80c634acb8d5..d720a42069b6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio)
> }
> }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
> void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> {
> /* lock out zswap shrinker walking memcg tree */
> spin_lock(&zswap_shrink_lock);
> - if (zswap_next_shrink == memcg)
> - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +
> + if (READ_ONCE(zswap_next_shrink) == memcg) {
> + /* put back reference and advance the cursor */
> + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> + WRITE_ONCE(zswap_next_shrink, memcg);
> + }
> +
I am really finding it difficult to understand what the diff is trying
to do. We are holding a lock that protects zswap_next_shrink. We
always access it with the lock held. Why do we need all of this?
Adding READ_ONCE() and WRITE_ONCE() where they are not needed is just
confusing imo.
> spin_unlock(&zswap_shrink_lock);
> }
>
> @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
> static void shrink_worker(struct work_struct *w)
> {
> - struct mem_cgroup *memcg;
> + struct mem_cgroup *memcg = NULL;
> + struct mem_cgroup *next_memcg;
> int ret, failures = 0;
> unsigned long thr;
>
> /* Reclaim down to the accept threshold */
> thr = zswap_accept_thr_pages();
>
> - /* global reclaim will select cgroup in a round-robin fashion. */
> + /* global reclaim will select cgroup in a round-robin fashion.
> + *
> + * We save iteration cursor memcg into zswap_next_shrink,
> + * which can be modified by the offline memcg cleaner
> + * zswap_memcg_offline_cleanup().
> + *
> + * Since the offline cleaner is called only once, we cannot abandone
> + * offline memcg reference in zswap_next_shrink.
> + * We can rely on the cleaner only if we get online memcg under lock.
> + * If we get offline memcg, we cannot determine the cleaner will be
> + * called later. We must put it before returning from this function.
> + */
> do {
> +iternext:
> spin_lock(&zswap_shrink_lock);
> - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> - memcg = zswap_next_shrink;
> + next_memcg = READ_ONCE(zswap_next_shrink);
> +
> + if (memcg != next_memcg) {
> + /*
> + * Ours was released by offlining.
> + * Use the saved memcg reference.
> + */
> + memcg = next_memcg;
'memcg' will always be NULL on the first iteration, so we will always
start by shrinking 'zswap_next_shrink' for a second time before moving
the iterator.
> + } else {
> + /* advance cursor */
> + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> + WRITE_ONCE(zswap_next_shrink, memcg);
Again, I don't see what this is achieving. The first iteration will
always set 'memcg' to 'zswap_next_shrink', and then we will always
move the iterator forward. The only difference I see is that we shrink
'zswap_next_shrink' twice in a row now (last 'memcg' in prev call, and
first 'memcg' in this call).
> + }
>
> /*
> - * We need to retry if we have gone through a full round trip, or if we
> - * got an offline memcg (or else we risk undoing the effect of the
> - * zswap memcg offlining cleanup callback). This is not catastrophic
> - * per se, but it will keep the now offlined memcg hostage for a while.
> - *
> * Note that if we got an online memcg, we will keep the extra
> * reference in case the original reference obtained by mem_cgroup_iter
> * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> }
>
> if (!mem_cgroup_tryget_online(memcg)) {
> - /* drop the reference from mem_cgroup_iter() */
> - mem_cgroup_iter_break(NULL, memcg);
> - zswap_next_shrink = NULL;
> + /*
> + * It is an offline memcg which we cannot shrink
> + * until its pages are reparented.
> + *
> + * Since we cannot determine if the offline cleaner has
> + * been already called or not, the offline memcg must be
> + * put back unconditonally. We cannot abort the loop while
> + * zswap_next_shrink has a reference of this offline memcg.
> + */
You actually deleted the code that puts the ref to the offline memcg
above.
Why don't you just replace mem_cgroup_iter_break(NULL, memcg) with
mem_cgroup_iter(NULL, memcg, NULL) here? I don't understand what the
patch is trying to do to be honest. This patch is a lot more confusing
than it should be.
Also, I would like Nhat to weigh in here. Perhaps the decision to
reset the iterator instead of advancing it in this case was made for a
reason that we should honor. Maybe cgroups are usually offlined
together so we will keep running into offline cgroups here if we
continue? I am not sure.
> spin_unlock(&zswap_shrink_lock);
> -
> - if (++failures == MAX_RECLAIM_RETRIES)
> - break;
> -
> - goto resched;
> + goto iternext;
> }
> + /*
> + * We got an extra memcg reference before unlocking.
> + * The cleaner cannot free it using zswap_next_shrink.
> + *
> + * Our memcg can be offlined after we get online memcg here.
> + * In this case, the cleaner is waiting the lock just behind us.
> + */
> spin_unlock(&zswap_shrink_lock);
>
> ret = shrink_memcg(memcg);
> @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w)
> resched:
> cond_resched();
> } while (zswap_total_pages() > thr);
> +
> + /*
> + * We can still hold the original memcg reference.
> + * The reference is stored in zswap_next_shrink, and then reused
> + * by the next shrink_worker().
> + */
> }
>
> /*********************************
> --
> 2.43.0
>
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
>
> This patch fixes zswap global shrinker that did not shrink zpool as
> expected.
>
> The issue it addresses is that `shrink_worker()` did not distinguish
> between unexpected errors and expected error codes that should be
> skipped, such as when there is no stored page in a memcg. This led to
> the shrinking process being aborted on the expected error codes.
>
> The shrinker should ignore these cases and skip to the next memcg.
> However, skipping all memcgs presents another problem. To address this,
> this patch tracks progress while walking the memcg tree and checks for
> progress once the tree walk is completed.
>
> To handle the empty memcg case, the helper function `shrink_memcg()` is
> modified to check if the memcg is empty and then return -ENOENT.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <[email protected]>
> ---
> mm/zswap.c | 31 ++++++++++++++++++++++++++-----
> 1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index d720a42069b6..1a90f434f247 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1393,7 +1393,7 @@ static struct shrinker *zswap_alloc_shrinker(void)
>
> static int shrink_memcg(struct mem_cgroup *memcg)
> {
> - int nid, shrunk = 0;
> + int nid, shrunk = 0, stored = 0;
>
> if (!mem_cgroup_zswap_writeback_enabled(memcg))
> return -EINVAL;
> @@ -1408,9 +1408,16 @@ static int shrink_memcg(struct mem_cgroup *memcg)
> for_each_node_state(nid, N_NORMAL_MEMORY) {
> unsigned long nr_to_walk = 1;
>
> + if (!list_lru_count_one(&zswap_list_lru, nid, memcg))
> + continue;
> + ++stored;
> shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> &shrink_memcg_cb, NULL, &nr_to_walk);
> }
> +
> + if (!stored)
> + return -ENOENT;
> +
Can't we just check nr_to_walk here and return -ENOENT if it remains as 1?
Something like:
if (nr_to_walk)
return -ENOENT;
if (!shrunk)
return -EAGAIN;
return 0;
> return shrunk ? 0 : -EAGAIN;
> }
>
> @@ -1418,12 +1425,18 @@ static void shrink_worker(struct work_struct *w)
> {
> struct mem_cgroup *memcg = NULL;
> struct mem_cgroup *next_memcg;
> - int ret, failures = 0;
> + int ret, failures = 0, progress;
> unsigned long thr;
>
> /* Reclaim down to the accept threshold */
> thr = zswap_accept_thr_pages();
>
> + /*
> + * We might start from the last memcg.
> + * That is not a failure.
> + */
> + progress = 1;
> +
> /* global reclaim will select cgroup in a round-robin fashion.
> *
> * We save iteration cursor memcg into zswap_next_shrink,
> @@ -1461,9 +1474,12 @@ static void shrink_worker(struct work_struct *w)
> */
> if (!memcg) {
> spin_unlock(&zswap_shrink_lock);
> - if (++failures == MAX_RECLAIM_RETRIES)
> +
> + /* tree walk completed but no progress */
> + if (!progress && ++failures == MAX_RECLAIM_RETRIES)
> break;
It seems like we may keep iterating the entire hierarchy a lot of
times as long as we are making any type of progress. This doesn't seem
right.
>
> + progress = 0;
> goto resched;
> }
>
> @@ -1493,10 +1509,15 @@ static void shrink_worker(struct work_struct *w)
> /* drop the extra reference */
> mem_cgroup_put(memcg);
>
> - if (ret == -EINVAL)
> - break;
> + /* not a writeback candidate memcg */
> + if (ret == -EINVAL || ret == -ENOENT)
> + continue;
> +
We should probably return -ENOENT for memcg with writeback disabled as well.
> if (ret && ++failures == MAX_RECLAIM_RETRIES)
> break;
> +
> + ++progress;
> + /* reschedule as we performed some IO */
> resched:
> cond_resched();
> } while (zswap_total_pages() > thr);
> --
> 2.43.0
>
On 2024/06/11 4:16, Yosry Ahmed wrote:
>
> I am really finding it difficult to understand what the diff is trying
> to do. We are holding a lock that protects zswap_next_shrink. We
> always access it with the lock held. Why do we need all of this?
>
> Adding READ_ONCE() and WRITE_ONCE() where they are not needed is just
> confusing imo.
I initially thought that reading new values from external variables
inside a loop required protection from compiler optimization. I will
remove the access macros in v2.
> 'memcg' will always be NULL on the first iteration, so we will always
> start by shrinking 'zswap_next_shrink' for a second time before moving
> the iterator.
>
>> + } else {
>> + /* advance cursor */
>> + memcg = mem_cgroup_iter(NULL, memcg, NULL);
>> + WRITE_ONCE(zswap_next_shrink, memcg);
> Again, I don't see what this is achieving. The first iteration will
> always set 'memcg' to 'zswap_next_shrink', and then we will always
> move the iterator forward. The only difference I see is that we shrink
> 'zswap_next_shrink' twice in a row now (last 'memcg' in prev call, and
> first 'memcg' in this call).
The reason for checking `memcg != next_memcg` was to ensure that we do
not skip a memcg that the cleaner might have set. For example, say we
get memcg A and save it. If the cleaner then advances the cursor from A
to B, and we advance it again from B to C and shrink C, B is skipped.
We therefore have to check that zswap_next_shrink still holds A before
advancing the cursor.
If this approach is overly complicated and ignoring B is acceptable,
the beginning of the loop can be simplified to:
do {
+iternext:
	spin_lock(&zswap_shrink_lock);
	zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
	memcg = zswap_next_shrink;
>> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
>> }
>>
>> if (!mem_cgroup_tryget_online(memcg)) {
>> - /* drop the reference from mem_cgroup_iter() */
>> - mem_cgroup_iter_break(NULL, memcg);
>> - zswap_next_shrink = NULL;
>> + /*
>> + * It is an offline memcg which we cannot shrink
>> + * until its pages are reparented.
>> + *
>> + * Since we cannot determine if the offline cleaner has
>> + * been already called or not, the offline memcg must be
>> + * put back unconditonally. We cannot abort the loop while
>> + * zswap_next_shrink has a reference of this offline memcg.
>> + */
> You actually deleted the code that actually puts the ref to the
> offline memcg above.
>
> Why don't you just replace mem_cgroup_iter_break(NULL, memcg) with
> mem_cgroup_iter(NULL, memcg, NULL) here? I don't understand what the
> patch is trying to do to be honest. This patch is a lot more confusing
> than it should be.
>> spin_unlock(&zswap_shrink_lock);
>> -
>> - if (++failures == MAX_RECLAIM_RETRIES)
>> - break;
>> -
>> - goto resched;
>> + goto iternext;
>> }
Removing the `break` on max failures from the if-offline branch is
required so that we do not leave a reference to the next memcg behind.
If we just replace the mem_cgroup_iter_break() with `memcg =
zswap_next_shrink = mem_cgroup_iter(NULL, memcg, NULL);` and break the
loop on failure, the next memcg will be left in zswap_next_shrink. If
zswap_next_shrink is also offline, that reference will be held
indefinitely.
When we get an offline memcg, we cannot determine whether the cleaner
has already been called or will be called later. We have to put back
the offline memcg reference before returning from the worker function.
This potential memcg leak is the reason why I think we cannot break
the loop here.
In this patch, the `goto iternext` ensures the offline memcg is
released in the next iteration (or by the cleaner waiting for our
unlock).
>
> Also, I would like Nhat to weigh in here. Perhaps the decision to
> reset the iterator instead of advancing it in this case was made for a
> reason that we should honor. Maybe cgroups are usually offlined
> together so we will keep running into offline cgroups here if we
> continue? I am not sure.
From the comment I removed:
>> - * We need to retry if we have gone through a full round trip, or if we
>> - * got an offline memcg (or else we risk undoing the effect of the
>> - * zswap memcg offlining cleanup callback). This is not catastrophic
>> - * per se, but it will keep the now offlined memcg hostage for a while.
I think this referred to the potential memcg leak, which is now resolved
by this patch's handling of the offline memcg case.
Thanks a lot for your comments.
On 2024/06/11 5:27, Yosry Ahmed wrote:
>> unsigned long nr_to_walk = 1;
>>
>> + if (!list_lru_count_one(&zswap_list_lru, nid, memcg))
>> + continue;
>> + ++stored;
>> shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
>> &shrink_memcg_cb, NULL, &nr_to_walk);
>> }
>> +
>> + if (!stored)
>> + return -ENOENT;
>> +
>
> Can't we just check nr_to_walk here and return -ENOENT if it remains as 1?
>
> Something like:
>
> if (nr_to_walk)
> return -ENOENT;
> if (!shrunk)
> return -EAGAIN;
> return 0;
>
ah, the counting step can be removed. I will change it in v2.
>> return shrunk ? 0 : -EAGAIN;
>> }
>>
>> @@ -1418,12 +1425,18 @@ static void shrink_worker(struct work_struct *w)
>> {
>> struct mem_cgroup *memcg = NULL;
>> struct mem_cgroup *next_memcg;
>> - int ret, failures = 0;
>> + int ret, failures = 0, progress;
>> unsigned long thr;
>>
>> /* Reclaim down to the accept threshold */
>> thr = zswap_accept_thr_pages();
>>
>> + /*
>> + * We might start from the last memcg.
>> + * That is not a failure.
>> + */
>> + progress = 1;
>> +
>> /* global reclaim will select cgroup in a round-robin fashion.
>> *
>> * We save iteration cursor memcg into zswap_next_shrink,
>> @@ -1461,9 +1474,12 @@ static void shrink_worker(struct work_struct *w)
>> */
>> if (!memcg) {
>> spin_unlock(&zswap_shrink_lock);
>> - if (++failures == MAX_RECLAIM_RETRIES)
>> +
>> + /* tree walk completed but no progress */
>> + if (!progress && ++failures == MAX_RECLAIM_RETRIES)
>> break;
>
> It seems like we may keep iterating the entire hierarchy a lot of
> times as long as we are making any type of progress. This doesn't seem
> right.
>
Since shrink_worker() evicts only one page per tree walk when there is
only one memcg using zswap, I believe this is the intended behavior.
Even if we choose to break the loop more aggressively, it would only
postpone the problem because pool_limit_hit will trigger the worker
again.
I agree the existing approach is inefficient. It might be better to
change the one-page-per-walk round-robin strategy.
>>
>> + progress = 0;
>> goto resched;
>> }
>>
>> @@ -1493,10 +1509,15 @@ static void shrink_worker(struct work_struct *w)
>> /* drop the extra reference */
>> mem_cgroup_put(memcg);
>>
>> - if (ret == -EINVAL)
>> - break;
>> + /* not a writeback candidate memcg */
>> + if (ret == -EINVAL || ret == -ENOENT)
>> + continue;
>> +
>
> We should probably return -ENOENT for memcg with writeback disabled as well.
>
>> if (ret && ++failures == MAX_RECLAIM_RETRIES)
>> break;
>> +
>> + ++progress;
>> + /* reschedule as we performed some IO */
>> resched:
>> cond_resched();
>> } while (zswap_total_pages() > thr);
>> --
>> 2.43.0
>>
On Tue, Jun 11, 2024 at 8:21 AM Takero Funaki <[email protected]> wrote:
>
>
> Since shrink_worker evicts only one page per tree walk when there is
> only one memcg using zswap, I believe this is the intended behavior.
I don't think this is the intended behavior :) It's a holdover from
the old zswap reclaiming behaviors.
1. In the past, we used to shrink one object per shrink worker call.
This is crazy.
2. We then move the LRU from the allocator level to zswap level, and
shrink one object at a time until the pool can accept new pages (i.e
under the acceptance threshold).
3. When we separate the LRU to per-(memcg, node), we keep the
shrink-one-at-a-time part, but do it round-robin style on each of the
(memcg, node) combination.
It's time to optimize this. 4th time's the charm!
> Even if we choose to break the loop more aggressively, it would only
> be postponing the problem because pool_limit_hit will trigger the
> worker again.
>
> I agree the existing approach is inefficient. It might be better to
> change the 1 page in a round-robin strategy.
We can play with a bigger batch.
1. Most straightforward idea is to just use a bigger constant (32? 64? 128?)
2. We can try to shrink until accept for each memcg, hoping that the
round robin selection maintains fairness in the long run - but this
can be a bad idea in the short run for the memcg selected. At the very
least, this should try to respect the protected area for each lruvec.
This might still come into conflict with the zswap shrinker though
(since the protection is best-effort).
3. Proportional reclaim - a variant of what we're doing in
get_scan_count() for page reclaim?
scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);
protection is derived from memory.min or memory.low of the cgroup, and
cgroup_size is the memory usage of the cgroup. lruvec_size maybe we
can substitute with the number of (reclaimable/unprotected?) zswap
objects in the (node, memcg) lru?
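A rough sketch of what option 3 could look like for the zswap LRU (a
sketch only; zswap_cgroup_size() and zswap_cgroup_protection() below are
hypothetical placeholders for however we would derive the cgroup usage
and its memory.min/low protection, not existing helpers):

```c
/*
 * Hypothetical: proportional scan target for one (memcg, nid) zswap LRU,
 * following the get_scan_count() formula quoted above.
 */
static unsigned long zswap_nr_to_scan(struct mem_cgroup *memcg, int nid)
{
	unsigned long lru_size = list_lru_count_one(&zswap_list_lru, nid, memcg);
	unsigned long cgroup_size = zswap_cgroup_size(memcg);		/* placeholder */
	unsigned long protection = zswap_cgroup_protection(memcg);	/* placeholder */

	return lru_size - lru_size * protection / (cgroup_size + 1);
}
```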
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
>
> This series addresses two issues and introduces a minor improvement in
> zswap global shrinker:
By the way, what is your current setup?
This global shrinker loop should only be run when the global pool
limit is hit. That *never* happens to us in production, even with the
zswap shrinker disabled.
The default pool limit is 20% of memory, which is quite a lot,
especially if anonymous memory is well-compressed and/or has a lot of
zero pages (which do not count towards the limit).
>
> 1. Fix the memcg iteration logic that breaks iteration on offline memcgs.
> 2. Fix the error path that aborts on expected error codes.
> 3. Add proactive shrinking at 91% full, for 90% accept threshold.
>
> These patches need to be applied in this order to avoid potential loops
> caused by the first issue. Patch 3 can be applied independently, but the
> two issues must be resolved to ensure the shrinker can evict pages.
>
> Previously, the zswap pool could be filled with old pages that the
> shrinker failed to evict, leading to zswap rejecting new pages. With
> this series applied, the shrinker will continue to evict pages until the
> pool reaches the accept_thr_percent threshold proactively, as
> documented, and maintain the pool to keep recent pages.
>
> As a side effect of changes in the hysteresis logic, zswap will no
> longer reject pages under the max pool limit.
>
> With this series, reclaims smaller than the proative shrinking amount
> finish instantly and trigger background shrinking. Admins can check if
> new pages are buffered by zswap by monitoring the pool_limit_hit
> counter.
>
> Changes since v0:
> mm: zswap: fix global shrinker memcg iteration
> - Drop and reacquire spinlock before skipping a memcg.
> - Add some comment to clarify the locking mechanism.
> mm: zswap: proactive shrinking before pool size limit is hit
> - Remove unneeded check before scheduling work.
> - Change shrink start threshold to accept_thr_percent + 1%.
>
> Now it starts shrinking at accept_thr_percent + 1%. Previously, the
> threshold was at the midpoint of 100% to accept_threshold.
>
> If a workload needs 10% space to buffer the average reclaim amount, with
> the previous patch, it required setting the accept_thr_percent to 80%.
> For 50%, it became 0%, which is not acceptable and unclear for admins.
> We can use the accept percent as the shrink threshold directly but that
> sounds shrinker is called too frequently around the accept threshold. I
> added 1% as a minimum gap to the shrink threshold.
>
> ----
>
> Takero Funaki (3):
> mm: zswap: fix global shrinker memcg iteration
> mm: zswap: fix global shrinker error handling logic
> mm: zswap: proactive shrinking before pool size limit is hit
>
> Documentation/admin-guide/mm/zswap.rst | 17 ++-
> mm/zswap.c | 172 ++++++++++++++++++-------
> 2 files changed, 136 insertions(+), 53 deletions(-)
>
> --
> 2.43.0
>
On Mon, Jun 10, 2024 at 1:28 PM Yosry Ahmed <[email protected]> wrote:
>
> On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
> >
> > This patch fixes zswap global shrinker that did not shrink zpool as
> > expected.
> >
> > The issue it addresses is that `shrink_worker()` did not distinguish
> > between unexpected errors and expected error codes that should be
> > skipped, such as when there is no stored page in a memcg. This led to
> > the shrinking process being aborted on the expected error codes.
> >
> > The shrinker should ignore these cases and skip to the next memcg.
> > However, skipping all memcgs presents another problem. To address this,
> > this patch tracks progress while walking the memcg tree and checks for
> > progress once the tree walk is completed.
> >
> > To handle the empty memcg case, the helper function `shrink_memcg()` is
> > modified to check if the memcg is empty and then return -ENOENT.
> >
> > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > Signed-off-by: Takero Funaki <[email protected]>
> > ---
> > mm/zswap.c | 31 ++++++++++++++++++++++++++-----
> > 1 file changed, 26 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index d720a42069b6..1a90f434f247 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1393,7 +1393,7 @@ static struct shrinker *zswap_alloc_shrinker(void)
> >
> > static int shrink_memcg(struct mem_cgroup *memcg)
> > {
> > - int nid, shrunk = 0;
> > + int nid, shrunk = 0, stored = 0;
> >
> > if (!mem_cgroup_zswap_writeback_enabled(memcg))
> > return -EINVAL;
> > @@ -1408,9 +1408,16 @@ static int shrink_memcg(struct mem_cgroup *memcg)
> > for_each_node_state(nid, N_NORMAL_MEMORY) {
> > unsigned long nr_to_walk = 1;
> >
> > + if (!list_lru_count_one(&zswap_list_lru, nid, memcg))
> > + continue;
> > + ++stored;
> > shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> > &shrink_memcg_cb, NULL, &nr_to_walk);
> > }
> > +
> > + if (!stored)
> > + return -ENOENT;
> > +
>
> Can't we just check nr_to_walk here and return -ENOENT if it remains as 1?
>
> Something like:
>
> if (nr_to_walk)
> return -ENOENT;
> if (!shrunk)
> return -EAGAIN;
> return 0;
>
> > return shrunk ? 0 : -EAGAIN;
> > }
> >
> > @@ -1418,12 +1425,18 @@ static void shrink_worker(struct work_struct *w)
> > {
> > struct mem_cgroup *memcg = NULL;
> > struct mem_cgroup *next_memcg;
> > - int ret, failures = 0;
> > + int ret, failures = 0, progress;
> > unsigned long thr;
> >
> > /* Reclaim down to the accept threshold */
> > thr = zswap_accept_thr_pages();
> >
> > + /*
> > + * We might start from the last memcg.
> > + * That is not a failure.
> > + */
> > + progress = 1;
> > +
> > /* global reclaim will select cgroup in a round-robin fashion.
> > *
> > * We save iteration cursor memcg into zswap_next_shrink,
> > @@ -1461,9 +1474,12 @@ static void shrink_worker(struct work_struct *w)
> > */
> > if (!memcg) {
> > spin_unlock(&zswap_shrink_lock);
> > - if (++failures == MAX_RECLAIM_RETRIES)
> > +
> > + /* tree walk completed but no progress */
> > + if (!progress && ++failures == MAX_RECLAIM_RETRIES)
> > break;
>
> It seems like we may keep iterating the entire hierarchy a lot of
> times as long as we are making any type of progress. This doesn't seem
> right.
>
Ah actually, reading your comment more closely, yeah, the
shrink-until-accept is the intended-ish behavior (from Domenico's
older patches to rework zswap reclaim). The one-page-per-attempt is
also "intended" behavior, but not any of our intentions - just
something that remains from the past implementation as we rework this
codebase one change at a time :)
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root. Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> This patch also modified handing of the lock for offlined memcg cleaner
> to adapt the change in the iteration, and avoid negligibly rare skipping
> of a memcg from shrink iteration.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <[email protected]>
> ---
> mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 68 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 80c634acb8d5..d720a42069b6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio)
> }
> }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
> void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> {
> /* lock out zswap shrinker walking memcg tree */
> spin_lock(&zswap_shrink_lock);
> - if (zswap_next_shrink == memcg)
> - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +
> + if (READ_ONCE(zswap_next_shrink) == memcg) {
> + /* put back reference and advance the cursor */
> + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> + WRITE_ONCE(zswap_next_shrink, memcg);
> + }
> +
> spin_unlock(&zswap_shrink_lock);
> }
As I have noted in v0, I think this is unnecessary and makes it more confusing.
>
> @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
> static void shrink_worker(struct work_struct *w)
> {
> - struct mem_cgroup *memcg;
> + struct mem_cgroup *memcg = NULL;
> + struct mem_cgroup *next_memcg;
> int ret, failures = 0;
> unsigned long thr;
>
> /* Reclaim down to the accept threshold */
> thr = zswap_accept_thr_pages();
>
> - /* global reclaim will select cgroup in a round-robin fashion. */
> + /* global reclaim will select cgroup in a round-robin fashion.
> + *
> + * We save iteration cursor memcg into zswap_next_shrink,
> + * which can be modified by the offline memcg cleaner
> + * zswap_memcg_offline_cleanup().
> + *
> + * Since the offline cleaner is called only once, we cannot abandone
> + * offline memcg reference in zswap_next_shrink.
> + * We can rely on the cleaner only if we get online memcg under lock.
> + * If we get offline memcg, we cannot determine the cleaner will be
> + * called later. We must put it before returning from this function.
> + */
> do {
> +iternext:
> spin_lock(&zswap_shrink_lock);
> - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> - memcg = zswap_next_shrink;
> + next_memcg = READ_ONCE(zswap_next_shrink);
> +
> + if (memcg != next_memcg) {
> + /*
> + * Ours was released by offlining.
> + * Use the saved memcg reference.
> + */
> + memcg = next_memcg;
> + } else {
> + /* advance cursor */
> + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> + WRITE_ONCE(zswap_next_shrink, memcg);
> + }
I suppose I'm fine with not advancing the memcg when it is already
advanced by the memcg offlining callback.
>
> /*
> - * We need to retry if we have gone through a full round trip, or if we
> - * got an offline memcg (or else we risk undoing the effect of the
> - * zswap memcg offlining cleanup callback). This is not catastrophic
> - * per se, but it will keep the now offlined memcg hostage for a while.
> - *
> * Note that if we got an online memcg, we will keep the extra
> * reference in case the original reference obtained by mem_cgroup_iter
> * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> }
>
> if (!mem_cgroup_tryget_online(memcg)) {
> - /* drop the reference from mem_cgroup_iter() */
> - mem_cgroup_iter_break(NULL, memcg);
> - zswap_next_shrink = NULL;
> + /*
> + * It is an offline memcg which we cannot shrink
> + * until its pages are reparented.
> + *
> + * Since we cannot determine if the offline cleaner has
> + * been already called or not, the offline memcg must be
> + * put back unconditonally. We cannot abort the loop while
> + * zswap_next_shrink has a reference of this offline memcg.
> + */
> spin_unlock(&zswap_shrink_lock);
> -
> - if (++failures == MAX_RECLAIM_RETRIES)
> - break;
> -
> - goto resched;
> + goto iternext;
Hmmm yeah, in the past I set it to NULL to make sure we're not
replacing zswap_next_shrink with an offlined memcg after the zswap
offlining callback for that memcg has been completed.
I suppose we can just call mem_cgroup_iter(...) on that offlined
cgroup, but I'm not 100% sure what happens when we call this function
on a cgroup that is currently being offlined, and has gone past the
zswap offline callback stage. So I was just playing it safe and
restart from the top of the tree :)
I think this implementation has that behavior right? We see that the
memcg is offlined, so we drop the lock and go to the beginning of the
loop. We reacquire the lock, and might see that zswap_next_shrink ==
memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
Note that zswap_shrink_lock only serializes this memcg selection loop
with memcg offlining that happens after it - there's no guarantee what
the behavior is for memcg offlining before it (well, other than the one
reference that we manage to acquire thanks to mem_cgroup_iter(...), so
that memcg has not been freed, but I'm not sure what we can guarantee
regarding its place in the memcg hierarchy tree?).
Johannes, do you have any idea what we can expect here? Let me also cc Shakeel.
> }
> + /*
> + * We got an extra memcg reference before unlocking.
> + * The cleaner cannot free it using zswap_next_shrink.
> + *
> + * Our memcg can be offlined after we get online memcg here.
> + * In this case, the cleaner is waiting the lock just behind us.
> + */
> spin_unlock(&zswap_shrink_lock);
>
> ret = shrink_memcg(memcg);
> @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w)
> resched:
> cond_resched();
> } while (zswap_total_pages() > thr);
> +
> + /*
> + * We can still hold the original memcg reference.
> + * The reference is stored in zswap_next_shrink, and then reused
> + * by the next shrink_worker().
> + */
> }
>
> /*********************************
> --
> 2.43.0
>
On Tue, Jun 11, 2024 at 11:26:12AM GMT, Nhat Pham wrote:
[...]
>
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after that zswap
> offlining callback for that memcg has been completed..
>
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restart from the top of the tree :)
>
> I think this implementation has that behavior right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
>
> Note that zswap_shrink_lock only orders serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what's the behavior is for memcg offlining before it (well other than
> one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but not sure
> what we can guarantee regarding its place in the memcg hierarchy
> tree?).
>
> Johannes, do you have any idea what we can expect here? Let me also cc Shakeel.
>
>
>
mem_cgroup_iter() does a pre-order traversal, so you can see a mixture of
online and offlined memcgs during a traversal.
On Wed, Jun 12, 2024 at 3:26 Nhat Pham <[email protected]> wrote:
>
> As I have noted in v0, I think this is unnecessary and makes it more confusing.
>
Does spin_lock() ensure that compiler optimizations do not remove
memory access to an external variable? I think we need to use
READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
For example,
https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
isn't this a common use case of READ_ONCE?
```c
bool shared_flag = false;
spinlock_t flag_lock;

void somefunc(void) {
	for (;;) {
		spin_lock(&flag_lock);
		/* check external updates */
		if (READ_ONCE(shared_flag))
			break;
		/* do something */
		spin_unlock(&flag_lock);
	}
	spin_unlock(&flag_lock);
}
```
Without READ_ONCE, the check could be hoisted out of the loop by
compiler optimization.
In shrink_worker(), zswap_next_shrink plays the role of shared_flag: it
can be updated by concurrent cleaner threads, so it must be re-read
every time we reacquire the lock. Am I badly misunderstanding something?
> > do {
> > +iternext:
> > spin_lock(&zswap_shrink_lock);
> > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > - memcg = zswap_next_shrink;
> > + next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > + if (memcg != next_memcg) {
> > + /*
> > + * Ours was released by offlining.
> > + * Use the saved memcg reference.
> > + */
> > + memcg = next_memcg;
> > + } else {
> > + /* advance cursor */
> > + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > + WRITE_ONCE(zswap_next_shrink, memcg);
> > + }
>
> I suppose I'm fine with not advancing the memcg when it is already
> advanced by the memcg offlining callback.
>
Regarding where to restart the shrinking, as Yosry pointed out, my
version starts from the last memcg (i.e. retrying the failed memcg, or
evicting from it once more).
I now realize that skipping the memcg after an offlined memcg is
negligibly rare. I am reverting it to restart from the memcg after
zswap_next_shrink.
Which one would be better?
> >
> > /*
> > - * We need to retry if we have gone through a full round trip, or if we
> > - * got an offline memcg (or else we risk undoing the effect of the
> > - * zswap memcg offlining cleanup callback). This is not catastrophic
> > - * per se, but it will keep the now offlined memcg hostage for a while.
> > - *
> > * Note that if we got an online memcg, we will keep the extra
> > * reference in case the original reference obtained by mem_cgroup_iter
> > * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> > }
> >
> > if (!mem_cgroup_tryget_online(memcg)) {
> > - /* drop the reference from mem_cgroup_iter() */
> > - mem_cgroup_iter_break(NULL, memcg);
> > - zswap_next_shrink = NULL;
> > + /*
> > + * It is an offline memcg which we cannot shrink
> > + * until its pages are reparented.
> > + *
> > + * Since we cannot determine if the offline cleaner has
> > + * been already called or not, the offline memcg must be
> > + * put back unconditonally. We cannot abort the loop while
> > + * zswap_next_shrink has a reference of this offline memcg.
> > + */
> > spin_unlock(&zswap_shrink_lock);
> > -
> > - if (++failures == MAX_RECLAIM_RETRIES)
> > - break;
> > -
> > - goto resched;
> > + goto iternext;
>
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after that zswap
> offlining callback for that memcg has been completed..
>
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restart from the top of the tree :)
>
> I think this implementation has that behavior right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
>
> Note that zswap_shrink_lock only orders serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what's the behavior is for memcg offlining before it (well other than
> one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but not sure
> what we can guarantee regarding its place in the memcg hierarchy
> tree?).
The locking mechanism in shrink_worker() does not rely on what the next
memcg is; the sorting stability of mem_cgroup_iter() does not matter
here.
The expectation for the iterator is that it will walk through all live
memcgs. I believe mem_cgroup_iter() uses parent-to-leaf ordering of
cgroups and ensures that all live cgroups are walked at least once,
regardless of their onlineness.
https://elixir.bootlin.com/linux/v6.10-rc2/source/mm/memcontrol.c#L1368
Regarding the reference leak, I overlooked a scenario where a leak might
occur in the existing cleaner, although it should be rare.
When the cleaner is called on the memcg in zswap_next_shrink, the next
memcg returned by mem_cgroup_iter() can be an offline, already-cleaned
memcg, resulting in a reference leak of that next memcg from the
cleaner. We should implement the same online check in the cleaner, like
this:
```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
	struct mem_cgroup *next;

	/* lock out zswap shrinker walking memcg tree */
	spin_lock(&zswap_shrink_lock);

	if (zswap_next_shrink == memcg) {
		next = zswap_next_shrink;
		do {
			next = mem_cgroup_iter(NULL, next, NULL);
			WRITE_ONCE(zswap_next_shrink, next);
			spin_unlock(&zswap_shrink_lock);
			/* zswap_next_shrink might be updated here */
			spin_lock(&zswap_shrink_lock);
			next = READ_ONCE(zswap_next_shrink);
			if (!next)
				break;
		} while (!mem_cgroup_online(next));
		/*
		 * We verified the next memcg is online under lock.
		 * Even if the next memcg is being offlined here, another
		 * cleaner for the next memcg is waiting for our unlock just
		 * behind us. We can leave the next memcg reference.
		 */
	}
	spin_unlock(&zswap_shrink_lock);
}
```
As in shrink_worker(), we must check that the next memcg is online
under the lock before leaving the reference in zswap_next_shrink.
Otherwise, zswap_next_shrink might hold a reference to an offlined,
already-cleaned memcg.
Or, if you are concerned about temporarily storing an unchecked or
offlined memcg in zswap_next_shrink, it is safe because:
1. If there is no other cleaner running for zswap_next_shrink, the ref
saved in zswap_next_shrink ensures the liveness of the memcg when the
lock is reacquired.
2. Another cleaner thread may put back and replace zswap_next_shrink
with its next memcg. We will check whether the new zswap_next_shrink
is online under the reacquired lock.
3. Even if the verified-online memcg is being offlined concurrently,
another cleaner thread must wait for our unlock. We can leave the
online memcg and rely on its respective cleaner.
On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <[email protected]> wrote:
>
> On Wed, Jun 12, 2024 at 3:26 Nhat Pham <[email protected]> wrote:
>
> >
> > As I have noted in v0, I think this is unnecessary and makes it more confusing.
> >
>
> Does spin_lock() ensure that compiler optimizations do not remove
> memory access to an external variable? I think we need to use
> READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
> For example,
> https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
In this example, it seems like mmu_interval_set_seq() updates
interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think
this is why READ_ONCE() is required in that particular case.
>
> isn't this a common use case of READ_ONCE?
> ```c
> bool shared_flag = false;
> spinlock_t flag_lock;
>
> void somefunc(void) {
> for (;;) {
> spin_lock(&flag_lock);
> /* check external updates */
> if (READ_ONCE(shared_flag))
> break;
> /* do something */
> spin_unlock(&flag_lock);
> }
> spin_unlock(&flag_lock);
> }
> ```
> Without READ_ONCE, the check can be extracted from the loop by optimization.
According to Documentation/memory-barriers.txt, lock acquiring
functions are implicit memory barriers. Otherwise, the compiler would
be able to pull any memory access outside of the lock critical section
and locking wouldn't be reliable.
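For example, in a pattern like the one you posted, plain accesses are
enough as long as they stay inside the critical section, because the
lock and unlock calls already act as compiler barriers (a minimal
sketch, not zswap code):
```c
#include <linux/spinlock.h>

static bool shared_flag;
static DEFINE_SPINLOCK(flag_lock);

void somefunc(void)
{
        for (;;) {
                spin_lock(&flag_lock);
                /*
                 * A plain read is fine here: spin_lock()/spin_unlock() are
                 * implicit compiler barriers, so this load cannot be hoisted
                 * out of the critical section and cached across iterations.
                 */
                if (shared_flag)
                        break;
                /* do something */
                spin_unlock(&flag_lock);
        }
        spin_unlock(&flag_lock);
}
```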
On Thu, Jun 13, 2024 at 3:28 Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <[email protected]> wrote:
> >
> > On Wed, Jun 12, 2024 at 3:26 Nhat Pham <[email protected]> wrote:
> >
> > >
> > > As I have noted in v0, I think this is unnecessary and makes it more confusing.
> > >
> >
> > Does spin_lock() ensure that compiler optimizations do not remove
> > memory access to an external variable? I think we need to use
> > READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
> > For example,
> > https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
>
> In this example, it seems like mmu_interval_set_seq() updates
> interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think
> this is why READ_ONCE() is required in that particular case.
>
> >
> > isn't this a common use case of READ_ONCE?
> > ```c
> > bool shared_flag = false;
> > spinlock_t flag_lock;
> >
> > void somefunc(void) {
> > for (;;) {
> > spin_lock(&flag_lock);
> > /* check external updates */
> > if (READ_ONCE(shared_flag))
> > break;
> > /* do something */
> > spin_unlock(&flag_lock);
> > }
> > spin_unlock(&flag_lock);
> > }
> > ```
> > Without READ_ONCE, the check can be extracted from the loop by optimization.
>
> According to Documentation/memory-barriers.txt, lock acquiring
> functions are implicit memory barriers. Otherwise, the compiler would
> be able to pull any memory access outside of the lock critical section
> and locking wouldn't be reliable.
Ah, I understand now. The implicit barrier is sufficient as long as
all memory accesses occur within the lock. It's a fundamental rule of
locking; facepalm.
I misread some module code, like the one in the link, where a lockless
write can reach into a critical section. My example was in the
opposite direction, so it was simply wrong. Thank you so much for
clarifying my misunderstanding.
Having re-checked the patch, I believe the locking mechanism itself is
not affected by my misunderstanding of READ_ONCE.
The corrected version of the cleaner should be:
```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
        /* lock out zswap shrinker walking memcg tree */
        spin_lock(&zswap_shrink_lock);
        if (zswap_next_shrink == memcg) {
                do {
                        zswap_next_shrink = mem_cgroup_iter(NULL,
                                        zswap_next_shrink, NULL);
                        spin_unlock(&zswap_shrink_lock);
                        spin_lock(&zswap_shrink_lock);
                        if (!zswap_next_shrink)
                                break;
                } while (!mem_cgroup_online(zswap_next_shrink));
        }
        spin_unlock(&zswap_shrink_lock);
}
```
Should we have a separate patch to fix the leak scenario?
On Wed, Jun 12, 2024 at 7:13 PM Takero Funaki <[email protected]> wrote:
>
> On Thu, Jun 13, 2024 at 3:28 Yosry Ahmed <[email protected]> wrote:
> >
> > On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <[email protected]> wrote:
> > >
> > > On Wed, Jun 12, 2024 at 3:26 Nhat Pham <[email protected]> wrote:
> > >
> > > >
> > > > As I have noted in v0, I think this is unnecessary and makes it more confusing.
> > > >
> > >
> > > Does spin_lock() ensure that compiler optimizations do not remove
> > > memory access to an external variable? I think we need to use
> > > READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
> > > For example,
> > > https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
> >
> > In this example, it seems like mmu_interval_set_seq() updates
> > interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think
> > this is why READ_ONCE() is required in that particular case.
> >
> > >
> > > isn't this a common use case of READ_ONCE?
> > > ```c
> > > bool shared_flag = false;
> > > spinlock_t flag_lock;
> > >
> > > void somefunc(void) {
> > > for (;;) {
> > > spin_lock(&flag_lock);
> > > /* check external updates */
> > > if (READ_ONCE(shared_flag))
> > > break;
> > > /* do something */
> > > spin_unlock(&flag_lock);
> > > }
> > > spin_unlock(&flag_lock);
> > > }
> > > ```
> > > Without READ_ONCE, the check can be extracted from the loop by optimization.
> >
> > According to Documentation/memory-barriers.txt, lock acquiring
> > functions are implicit memory barriers. Otherwise, the compiler would
> > be able to pull any memory access outside of the lock critical section
> > and locking wouldn't be reliable.
>
> Ah, I understand now. The implicit barrier is sufficient as long as
> all memory access occurs within the lock. It’s a fundamental rule of
> locking—facepalm.
>
> I misread a module code, like in the link, where a lockless write
> could invade a critical section. My example was in the opposite
> direction, just wrong. Thank you so much for clarifying my
> misunderstanding.
>
> For now checking the patch, I suppose the locking mechanism itself is
> not affected by my misunderstanding of READ_ONCE.
>
> The corrected version of the cleaner should be:
> ```c
> void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> {
> /* lock out zswap shrinker walking memcg tree */
> spin_lock(&zswap_shrink_lock);
> if (zswap_next_shrink == memcg) {
> do {
> zswap_next_shrink = mem_cgroup_iter(NULL,
> zswap_next_shrink, NULL);
> spin_unlock(&zswap_shrink_lock);
> spin_lock(&zswap_shrink_lock);
> if (!zswap_next_shrink)
> break;
> } while (!mem_cgroup_online(zswap_next_shrink));
> }
> spin_unlock(&zswap_shrink_lock);
> }
> ```
Is the idea here to avoid moving the iterator to another offline memcg
that zswap_memcg_offline_cleanup() was already called for, to avoid
holding a ref on that memcg until the next run of zswap shrinking?
If yes, I think it's probably worth doing. But why do we need to
release and reacquire the lock in the loop above?
On Thu, Jun 13, 2024 at 11:18 Yosry Ahmed <[email protected]> wrote:
> > The corrected version of the cleaner should be:
> > ```c
> > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > {
> > /* lock out zswap shrinker walking memcg tree */
> > spin_lock(&zswap_shrink_lock);
> > if (zswap_next_shrink == memcg) {
> > do {
> > zswap_next_shrink = mem_cgroup_iter(NULL,
> > zswap_next_shrink, NULL);
> > spin_unlock(&zswap_shrink_lock);
> > spin_lock(&zswap_shrink_lock);
> > if (!zswap_next_shrink)
> > break;
> > } while (!mem_cgroup_online(zswap_next_shrink));
> > }
> > spin_unlock(&zswap_shrink_lock);
> > }
> > ```
>
> Is the idea here to avoid moving the iterator to another offline memcg
> that zswap_memcg_offline_cleanup() was already called for, to avoid
> holding a ref on that memcg until the next run of zswap shrinking?
>
> If yes, I think it's probably worth doing. But why do we need to
> release and reacquire the lock in the loop above?
Yes, the existing cleaner might leave the offline, already-cleaned memcg.
Releasing and reacquiring the lock is to avoid looping inside the
critical section. In shrink_worker() of the v0 patch, the loop
restarted on an offline memcg without releasing the lock. Nhat pointed
out that we should drop the lock after every mem_cgroup_iter() call,
so v1 was changed to reacquire the lock once per iteration, like the
cleaner code above.
On Wed, Jun 12, 2024 at 7:36 PM Takero Funaki <[email protected]> wrote:
>
> On Thu, Jun 13, 2024 at 11:18 Yosry Ahmed <[email protected]> wrote:
>
> > > The corrected version of the cleaner should be:
> > > ```c
> > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > > {
> > > /* lock out zswap shrinker walking memcg tree */
> > > spin_lock(&zswap_shrink_lock);
> > > if (zswap_next_shrink == memcg) {
> > > do {
> > > zswap_next_shrink = mem_cgroup_iter(NULL,
> > > zswap_next_shrink, NULL);
> > > spin_unlock(&zswap_shrink_lock);
> > > spin_lock(&zswap_shrink_lock);
> > > if (!zswap_next_shrink)
> > > break;
> > > } while (!mem_cgroup_online(zswap_next_shrink));
> > > }
> > > spin_unlock(&zswap_shrink_lock);
> > > }
> > > ```
> >
> > Is the idea here to avoid moving the iterator to another offline memcg
> > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > holding a ref on that memcg until the next run of zswap shrinking?
> >
> > If yes, I think it's probably worth doing. But why do we need to
> > release and reacquire the lock in the loop above?
>
> Yes, the existing cleaner might leave the offline, already-cleaned memcg.
>
> The reacquiring lock is to not loop inside the critical section.
> In shrink_worker of v0 patch, the loop was restarted on offline memcg
> without releasing the lock. Nhat pointed out that we should drop the
> lock after every mem_cgroup_iter() call. v1 was changed to reacquire
> once per iteration like the cleaner code above.
I am not sure how often we'll run into a situation where we'll be
holding the lock for too long tbh. It should be unlikely to keep
encountering offline memcgs for a long time.
Nhat, do you think this could cause a problem in practice?
On Wed, Jun 12, 2024 at 7:58 PM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Jun 12, 2024 at 7:36 PM Takero Funaki <[email protected]> wrote:
> >
> > On Thu, Jun 13, 2024 at 11:18 Yosry Ahmed <[email protected]> wrote:
> >
> > > > The corrected version of the cleaner should be:
> > > > ```c
> > > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > > > {
> > > > /* lock out zswap shrinker walking memcg tree */
> > > > spin_lock(&zswap_shrink_lock);
> > > > if (zswap_next_shrink == memcg) {
> > > > do {
> > > > zswap_next_shrink = mem_cgroup_iter(NULL,
> > > > zswap_next_shrink, NULL);
> > > > spin_unlock(&zswap_shrink_lock);
> > > > spin_lock(&zswap_shrink_lock);
> > > > if (!zswap_next_shrink)
> > > > break;
> > > > } while (!mem_cgroup_online(zswap_next_shrink));
> > > > }
> > > > spin_unlock(&zswap_shrink_lock);
> > > > }
> > > > ```
> > >
> > > Is the idea here to avoid moving the iterator to another offline memcg
> > > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > > holding a ref on that memcg until the next run of zswap shrinking?
> > >
> > > If yes, I think it's probably worth doing. But why do we need to
> > > release and reacquire the lock in the loop above?
> >
> > Yes, the existing cleaner might leave the offline, already-cleaned memcg.
> >
> > The reacquiring lock is to not loop inside the critical section.
> > In shrink_worker of v0 patch, the loop was restarted on offline memcg
> > without releasing the lock. Nhat pointed out that we should drop the
> > lock after every mem_cgroup_iter() call. v1 was changed to reacquire
> > once per iteration like the cleaner code above.
>
> I am not sure how often we'll run into a situation where we'll be
> holding the lock for too long tbh. It should be unlikely to keep
> encountering offline memcgs for a long time.
>
> Nhat, do you think this could cause a problem in practice?
I don't remember prescribing anything to be honest :) I think I was
just asking why can't we just drop the lock, then "continue;". This is
mostly for simplicity's sake.
https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/
But I think as Takero pointed out, it would still skip over the memcg
that was (concurrently) updated to zswap_next_shrink by the memcg
offline callback.
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
>
> This patch implements proactive shrinking of zswap pool before the max
> pool size limit is reached. This also changes zswap to accept new pages
> while the shrinker is running.
>
> To prevent zswap from rejecting new pages and incurring latency when
> zswap is full, this patch queues the global shrinker by a pool usage
> threshold between 100% and accept_thr_percent, instead of the max pool
> size. The pool size will be controlled between 90% to 91% for the
> default accept_thr_percent=90. Since the current global shrinker
> continues to shrink until accept_thr_percent, we do not need to maintain
> the hysteresis variable tracking the pool limit overage in
> zswap_store().
>
> Before this patch, zswap rejected pages while the shrinker is running
> without incrementing zswap_pool_limit_hit counter. It could be a reason
> why zswap writethrough new pages before writeback old pages. With this
> patch, zswap accepts new pages while shrinking, and zswap increments
> the counter when and only when zswap rejects pages by the max pool size.
>
> Now, reclaims smaller than the proactive shrinking amount finish
> instantly and trigger background shrinking. Admins can check if new
> pages are buffered by zswap by monitoring the pool_limit_hit counter.
>
> The name of sysfs tunable accept_thr_percent is unchanged as it is still
> the stop condition of the shrinker.
> The respective documentation is updated to describe the new behavior.
>
> Signed-off-by: Takero Funaki <[email protected]>
Taking a step back, could you benchmark and include relevant
userspace-visible metrics to show:
a) This is a problem happening in realistic-ish workloads.
b) The solution shows improvements over the status quo.
before we justify any extra complexity? This goes for pretty much the
whole series (i.e why are we fixing this), but for this patch
specifically, since we are optimizing things.
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
>
> This series addresses two issues and introduces a minor improvement in
> zswap global shrinker:
>
> 1. Fix the memcg iteration logic that breaks iteration on offline memcgs.
> 2. Fix the error path that aborts on expected error codes.
> 3. Add proactive shrinking at 91% full, for 90% accept threshold.
>
Taking a step back from the correctness conversation, could you
include in the changelog of the patches and cover letter a realistic
scenario, along with user space-visible metrics that show (ideally all
4, but at least some of the following):
1. A user problem (that affects performance, or usability, etc.) is happening.
2. The root cause is what we are trying to fix (for e.g in patch 1, we
are skipping over memcgs unnecessarily in the global shrinker loop).
3. The fix alleviates the root cause in b)
4. The userspace-visible problem goes away or is less serious.
I have already hinted at this in a previous response, but the global
shrinker is rarely triggered in production. There are lots of factors
that would prevent this from triggering:
1. Max zswap pool size is 20% of memory by default, which is a lot.
2. Swapfile size also limits the amount of data storable in the zswap
pool.
3. Other cgroup constraints (memory.max, memory.zswap.max,
memory.swap.max) also limit a cgroup's zswap usage.
I do agree that patch 1 at least is fixing a problem, and probably
patch 2 too, but please justify why we are investing in the extra
complexity to fix this problem in the first place.
On Thu, Jun 13, 2024 at 9:08 AM Nhat Pham <[email protected]> wrote:
>
> 5. At the end of the inner loop, the selected memcg is known to not
> have its cleaner started yet (since it is online with zswap_pools_lock
> held). Our selection will undo the cleaner and hold the memcg hostage
> forever.
/s/will undo/will not undo
My apologies - English is hard :)
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <[email protected]> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root. Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> This patch also modified handing of the lock for offlined memcg cleaner
> to adapt the change in the iteration, and avoid negligibly rare skipping
> of a memcg from shrink iteration.
>
Honestly, I think this version is even more complicated than v0 :)
Hmmm how about this:
do {
        spin_lock(&zswap_pools_lock);
        do {
                /* the offline cleaner has not been called on memcg yet */
                if (memcg == zswap_next_shrink) {
                        zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
                }
                memcg = zswap_next_shrink;
                if (!memcg && ++failure > MAX_FAILURE) {
                        spin_unlock(&zswap_pools_lock);
                        return;
                }
        } while (!zswap_next_shrink || !mem_cgroup_tryget_online(zswap_next_shrink));
        spin_unlock(&zswap_pools_lock);

        /* proceed with reclaim */
} while (...)
This should achieve all the goals right?
1. No restarting from the top, unless we have traversed the entire hierarchy.
2. No skipping over zswap_next_shrink updated by the memcg offline cleaner.
3. No leaving offlined zswap_next_shrink hanging (and blocking memcg
killing indefinitely).
4. No long running loop unnecessarily. If you want to be extra safe,
we can put a max number of retries on the offline memcg case too (and
restart from the top).
5. At the end of the inner loop, you are guaranteed to hold at least
one reference to memcg (and perhaps 2, if zswap_next_shrink is not
updated by the cleaner).
5. At the end of the inner loop, the selected memcg is known to not
have its cleaner started yet (since it is online with zswap_pools_lock
held). Our selection will undo the cleaner and hold the memcg hostage
forever.
Is there anything that I'm missing? We are not dropping the lock for
the entirety of the inner loop, but honestly I'm fine with this trade
off, for the sake of simplicity :)
If you want it to be even more readable, feel free to refactor the
inner loop into a helper function (but it's probably redundant since
it's only used once).
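For illustration only, such a helper might look roughly like this (a
sketch; zswap_pick_next_memcg() is a hypothetical name, and it reuses
the lock/variable names from the sketch above):
```c
/*
 * Sketch of the suggested refactor: advance the cursor under the lock
 * until an online memcg is pinned, or give up (returning NULL) after
 * reaching the end of the tree more than MAX_FAILURE times.
 */
static struct mem_cgroup *zswap_pick_next_memcg(struct mem_cgroup *memcg,
                                                int *failure)
{
        spin_lock(&zswap_pools_lock);
        do {
                /* only advance if the offline cleaner has not already done so */
                if (memcg == zswap_next_shrink)
                        zswap_next_shrink = mem_cgroup_iter(NULL,
                                        zswap_next_shrink, NULL);

                memcg = zswap_next_shrink;
                if (!memcg && ++(*failure) > MAX_FAILURE) {
                        spin_unlock(&zswap_pools_lock);
                        return NULL;
                }
        } while (!memcg || !mem_cgroup_tryget_online(memcg));
        spin_unlock(&zswap_pools_lock);

        return memcg;
}
```
shrink_worker() would then call this once per outer iteration and run
reclaim on the returned memcg until it gets NULL back.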
On Thu, Jun 13, 2024 at 08:04:39AM GMT, Nhat Pham wrote:
[...]
> > > >
> > > > Is the idea here to avoid moving the iterator to another offline memcg
> > > > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > > > holding a ref on that memcg until the next run of zswap shrinking?
> > > >
> > > > If yes, I think it's probably worth doing. But why do we need to
> > > > release and reacquire the lock in the loop above?
> > >
> > > Yes, the existing cleaner might leave the offline, already-cleaned memcg.
> > >
> > > The reacquiring lock is to not loop inside the critical section.
> > > In shrink_worker of v0 patch, the loop was restarted on offline memcg
> > > without releasing the lock. Nhat pointed out that we should drop the
> > > lock after every mem_cgroup_iter() call. v1 was changed to reacquire
> > > once per iteration like the cleaner code above.
> >
> > I am not sure how often we'll run into a situation where we'll be
> > holding the lock for too long tbh. It should be unlikely to keep
> > encountering offline memcgs for a long time.
> >
> > Nhat, do you think this could cause a problem in practice?
>
> I don't remember prescribing anything to be honest :) I think I was
> just asking why can't we just drop the lock, then "continue;". This is
> mostly for simplicity's sake.
>
> https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/
>
> But I think as Takero pointed out, it would still skip over the memcg
> that was (concurrently) updated to zswap_next_shrink by the memcg
> offline callback.
What's the issue with continuing to traverse until an online memcg is
found? Something like the following:
spin_lock(&zswap_shrink_lock);
do {
        zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));

if (!zswap_next_shrink)
        zswap_next_shrink = mem_cgroup_iter(NULL, NULL, NULL);
....
Is the concern that there can be a lot of offlined memcgs, which may
cause need_resched warnings?
On Fri, Jun 14, 2024 at 0:22 Nhat Pham <[email protected]> wrote:
>
> Taking a step back from the correctness conversation, could you
> include in the changelog of the patches and cover letter a realistic
> scenario, along with user space-visible metrics that show (ideally all
> 4, but at least some of the following):
>
> 1. A user problem (that affects performance, or usability, etc.) is happening.
>
> 2. The root cause is what we are trying to fix (for e.g in patch 1, we
> are skipping over memcgs unnecessarily in the global shrinker loop).
>
> 3. The fix alleviates the root cause in b)
>
> 4. The userspace-visible problem goes away or is less serious.
>
Thank you for your suggestions.
As a quick response before submitting v2:
1.
The visible issue is that pageout/in operations from active processes
are slow when zswap is near its max pool size. This is particularly
significant on small memory systems, where total swap usage exceeds
what zswap can store. This means that old pages occupy most of the
zswap pool space, and recent pages use swap disk directly.
2.
This issue is caused by zswap keeping the pool size near 100%. Since
the shrinker fails to shrink the pool to accept_thr_percent and zswap
rejects incoming pages, rejection occurs more frequently than it
should. The rejected pages are directly written to disk while zswap
protects old pages from eviction, leading to slow pageout/in
performance for recent pages on the swap disk.
3.
If the pool size were shrunk proactively, rejection by pool limit hits
would be less likely. New incoming pages could be accepted as the pool
gains some space in advance, while older pages are written back in the
background. zswap would then be filled with recent pages, as expected
in the LRU logic.
Patches 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
Patch 3 makes zswap_store() trigger the shrinker before the max pool
size is reached (a sketch of the trigger follows after this list).
With this series, zswap prepares some space in advance to reduce the
probability of the problematic pool_limit_hit situation, thus reducing
slow reclaims and page priority inversion against the LRU order.
4.
Once proactive shrinking reduces the pool size, pageouts complete
instantly as long as the space prepared by shrinking can absorb the
directly reclaimed pages. If an admin sees a large pool_limit_hit
count, lowering accept_threshold_percent will improve active process
performance.
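To make point 3 above concrete, the trigger I have in mind is roughly
the following (a sketch, not the exact patch; zswap_should_start_shrink()
is a hypothetical name, and zswap_total_pages()/zswap_max_pages() stand
in for however the current and maximum pool sizes are obtained):
```c
/*
 * Sketch only: queue the global shrinker once the pool grows past
 * accept_thr_percent + 1% of the max pool size, instead of waiting
 * until the 100% limit is hit.
 */
static bool zswap_should_start_shrink(void)
{
        unsigned long max_pages = zswap_max_pages();
        unsigned long start = max_pages * (zswap_accept_thr_percent + 1) / 100;

        return zswap_total_pages() > start;
}
```
zswap_store() would queue the global shrink work when this returns
true, and pool_limit_hit would be counted only when the hard
max_pool_percent limit is actually exceeded.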
> I have already hinted in a previous response, but global shrinker is
> rarely triggered in production. There are lots of factors that would
> prevent this from triggering:
>
> 1. Max zswap pool size 20% of memory by default, which is a lot.
>
> 2. Swapfile size also limits the size of the amount of data storable
> in the zswap pool.
>
> 3. Other cgroup constraints (memory.max, memory.zswap.max,
> memory.swap.max) also limit a cgroup's zswap usage.
>
> I do agree that patch 1 at least is fixing a problem, and probably
> patch 2 too but please justify why we are investing in the extra
> complexity to fix this problem in the first place.
Regarding the production workload, I believe we are facing different
situations.
My intended workload is low-activity services distributed on small
systems like t2.nano, with 0.5G to 1G of RAM. There is a significant
number of pool_limit_hits, and the zswap pool usage sometimes stays
near 100%, filled by background service processes.
When I evaluated zswap and zramswap, zswap performed well. I suppose
this was because of its LRU. Once old pages occupied zramswap, there
was no way to move pages from zramswap to the swap disk. zswap could
adapt to this situation by writing back old pages in LRU order. We
could benefit from compressed swap storing recent pages and also
utilize overcommit backed by a large swap on disk.
I think if a system has never seen a pool limit hit, there is no need
to manage the costly LRU. In such cases, I will try zramswap.
I am developing these patches by testing with an artificial workload
that allocates a large amount of memory to fill the zswap pool
(emulating background services) and then invokes another allocation to
trigger direct reclaim (emulating a short-lived active process).
On real devices, I see far fewer pool_limit_hits with a lower
accept_thr_percent=50 and a higher max_pool_percent=25.
Note that before patch 3, zswap_store() did not count some of the
rejections as pool_limit_hit (the counter was underestimated), so we
cannot compare the counter values directly.
I will try to create a more reproducible benchmark, but it might not
be representative of your workload.
On Fri, Jun 14, 2024 at 1:49 Shakeel Butt <[email protected]> wrote:
>
> On Thu, Jun 13, 2024 at 08:04:39AM GMT, Nhat Pham wrote:
> [...]
> > > > >
> > > > > Is the idea here to avoid moving the iterator to another offline memcg
> > > > > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > > > > holding a ref on that memcg until the next run of zswap shrinking?
> > > > >
> > > > > If yes, I think it's probably worth doing. But why do we need to
> > > > > release and reacquire the lock in the loop above?
> > > >
> > > > Yes, the existing cleaner might leave the offline, already-cleaned memcg.
> > > >
> > > > The reacquiring lock is to not loop inside the critical section.
> > > > In shrink_worker of v0 patch, the loop was restarted on offline memcg
> > > > without releasing the lock. Nhat pointed out that we should drop the
> > > > lock after every mem_cgroup_iter() call. v1 was changed to reacquire
> > > > once per iteration like the cleaner code above.
> > >
> > > I am not sure how often we'll run into a situation where we'll be
> > > holding the lock for too long tbh. It should be unlikely to keep
> > > encountering offline memcgs for a long time.
> > >
> > > Nhat, do you think this could cause a problem in practice?
> >
> > I don't remember prescribing anything to be honest :) I think I was
> > just asking why can't we just drop the lock, then "continue;". This is
> > mostly for simplicity's sake.
> >
> > https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/
I apologize for misinterpreting your comments. I will remove the
release/reacquire pair.
> >
> > But I think as Takero pointed out, it would still skip over the memcg
> > that was (concurrently) updated to zswap_next_shrink by the memcg
> > offline callback.
>
> What's the issue with keep traversing until an online memcg is found?
> Something like the following:
>
>
> spin_lock(&zswap_shrink_lock);
> do {
> zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> } while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
>
> if (!zswap_next_shrink)
> zswap_next_shrink = mem_cgroup_iter(NULL, NULL, NULL);
> ....
>
> Is the concern that there can a lot of offlined memcgs which may cause
> need resched warnings?
To avoid using the goto-based loop, here's the new version, including
Shakeel's suggestion:
```c
do {
        spin_lock(&zswap_shrink_lock);
        /*
         * Advance the cursor to start shrinking from the next memcg
         * after zswap_next_shrink. One memcg might be skipped from
         * shrinking if the cleaner also advanced the cursor, but it
         * will happen at most once per offlining memcg.
         */
        do {
                zswap_next_shrink = mem_cgroup_iter(NULL,
                                zswap_next_shrink, NULL);
                memcg = zswap_next_shrink;
        } while (memcg && !mem_cgroup_tryget_online(memcg));

        if (!memcg) {
                spin_unlock(&zswap_shrink_lock);
```
We can add or remove `spin_unlock();spin_lock();` just after
mem_cgroup_iter(), if needed.
I believe the behavior is identical to v1 except for the starting
point of iteration.
Regarding Nhat's comment 2 ("No skipping over zswap_next_shrink
updated by the memcg offline cleaner"):
While this was true for v1, I have moved to accepting this skipping,
as it is negligibly rare. As Yosry commented, v1 retried the last
memcg from the last shrink_worker() run.
There are several options for where shrink_worker() should start:
1. Starting from the next memcg after zswap_next_shrink: it might skip
one memcg, but this is quite rare. This is the current behavior before
the patch.
2. Starting from zswap_next_shrink: it might shrink one more page from
the memcg, in addition to the one shrunk by the last shrink_worker()
run. This should also be rare, but probably more frequent than option
1. This is the v0 patch behavior.
3. Addressing both: save the last memcg as well. The worker checks
whether it has been modified by the cleaner and advances only if it
hasn't. Like this:
```c
do {
        if (zswap_last_shrink == zswap_next_shrink) {
                zswap_next_shrink = mem_cgroup_iter(NULL,
                                zswap_next_shrink, NULL);
        }
        memcg = zswap_next_shrink;
} while (memcg && !mem_cgroup_tryget_online(memcg));
zswap_last_shrink = memcg;
```
Which one would be better? Or is there any other idea?
On Thu, Jun 13, 2024 at 9:09 PM Takero Funaki <[email protected]> wrote:
>
> On Fri, Jun 14, 2024 at 0:22 Nhat Pham <[email protected]> wrote:
> >
> > Taking a step back from the correctness conversation, could you
> > include in the changelog of the patches and cover letter a realistic
> > scenario, along with user space-visible metrics that show (ideally all
> > 4, but at least some of the following):
> >
> > 1. A user problem (that affects performance, or usability, etc.) is happening.
> >
> > 2. The root cause is what we are trying to fix (for e.g in patch 1, we
> > are skipping over memcgs unnecessarily in the global shrinker loop).
> >
> > 3. The fix alleviates the root cause in b)
> >
> > 4. The userspace-visible problem goes away or is less serious.
> >
>
> Thank you for your suggestions.
> For quick response before submitting v2,
>
> 1.
> The visible issue is that pageout/in operations from active processes
> are slow when zswap is near its max pool size. This is particularly
> significant on small memory systems, where total swap usage exceeds
> what zswap can store. This means that old pages occupy most of the
> zswap pool space, and recent pages use swap disk directly.
Makes sense. You could probably check pswpin stats etc. to verify
this, or maybe some sort of tracing to measure performance impact?
>
> 2.
> This issue is caused by zswap keeping the pool size near 100%. Since
> the shrinker fails to shrink the pool to accept_thr_percent and zswap
> rejects incoming pages, rejection occurs more frequently than it
> should. The rejected pages are directly written to disk while zswap
> protects old pages from eviction, leading to slow pageout/in
> performance for recent pages on the swap disk.
Yeah this makes sense. The hysteresis heuristic is broken.
>
> 3.
> If the pool size were shrunk proactively, rejection by pool limit hits
> would be less likely. New incoming pages could be accepted as the pool
> gains some space in advance, while older pages are written back in the
> background. zswap would then be filled with recent pages, as expected
> in the LRU logic.
>
> Patch 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> Patch 3 makes zswap_store trigger the shrinker before reaching the max
> pool size. With this series, zswap will prepare some space to reduce
> the probability of problematic pool_limit_hit situation, thus reducing
> slow reclaim and the page priority inversion against LRU.
Makes sense, but do include numbers to back up your claims if you have them!
>
> 4.
> Once proactive shrinking reduces the pool size, pageouts complete
> instantly as long as the space prepared by shrinking can store the
> direct reclaim. If an admin sees a large pool_limit_hit, lowering
> accept_threshold_percent will improve active process performance.
>
This makes sense from a glance. Thank you for writing all of this, and
please include actual numbers from a benchmark to show these in your
v2, if you have them :)
> Regarding the production workload, I believe we are facing different situations.
>
> My intended workload is low-activity services distributed on small
> system like t2.nano, with 0.5G to 1G of RAM. There are a significant
> number of pool_limit_hits and the zswap pool usage sometimes stays
> near 100% filled by background service processes.
>
> When I evaluated zswap and zramswap, zswap performed well. I suppose
> this was because of its LRU. Once old pages occupied zramswap, there
> was no way to move pages from zramswap to the swap disk. zswap could
> adapt to this situation by writing back old pages in LRU order. We
> could benefit from compressed swap storing recent pages and also
> utilize overcommit backed by a large swap on disk.
>
> I think if a system has never seen a pool limit hit, there is no need
> to manage the costly LRU. In such cases, I will try zramswap.
>
> I am developing these patches by testing with artificially allocating
> large memory to fill the zswap pool (emulating background services),
> and then invoking another allocation to trigger direct reclaim
> (emulating short-live active process).
> On real devices, seeing much less pool_limit_hit by lower
> accept_thr_percent=50 and higher max_pool_percent=25.
> Note that before patch 3, zswap_store() did not count some of the
> rejection as pool_limit_hit properly (underestimated). we cannot
> compare the counter directly.
Sounds good! Did you see performance improvement from lowering
pool_limit_hit etc.?
>
> I will try to create a more reproducible realistic benchmark, but it
> might not be realistic for your workload.
Your approach is not interfering with our workload - we don't
encounter the global shrinker in production ever. That does not mean I
will not consider your code though - this is everyone's kernel, not
just Meta's kernel :)
What I was concerned with was if there were any real workloads at all
that benefit from it - as long as there are (which there seem to be
based on your description), and the approach does not hurt other
cases, I'm happy to review it for merging :)
And as I have stated before, I personally believe the global shrinker
needs to change. I just do not have the means to verify that any
ideas/improvement actually work in a realistic scenario (since our
production system does not encounter global shrinker). Thank you for
picking up the work!
> My intended workload is low-activity services distributed on small
> system like t2.nano, with 0.5G to 1G of RAM. There are a significant
> number of pool_limit_hits and the zswap pool usage sometimes stays
> near 100% filled by background service processes.
>
BTW, I'm curious. Have you experimented with increasing the pool size?
That 20% number is plenty for our use cases, but maybe yours need a
different cap?
Also, have you experimented with the dynamic zswap shrinker? :) I'm
actually curious how it works out in the small machine regime, with
whatever workload you are running.
But yeah, I'd imagine either way zswap would be better than zram for
this particular scenario, as long as you have a swapfile, some
proportion of your anonymous memory cold, and/or can tolerate a
certain amount of latency.
On Thu, Jun 13, 2024 at 9:09 PM Takero Funaki <[email protected]> wrote:
>
> On Fri, Jun 14, 2024 at 0:22 Nhat Pham <[email protected]> wrote:
> >
> > Taking a step back from the correctness conversation, could you
> > include in the changelog of the patches and cover letter a realistic
> > scenario, along with user space-visible metrics that show (ideally all
> > 4, but at least some of the following):
> >
> > 1. A user problem (that affects performance, or usability, etc.) is happening.
> >
> > 2. The root cause is what we are trying to fix (for e.g in patch 1, we
> > are skipping over memcgs unnecessarily in the global shrinker loop).
> >
> > 3. The fix alleviates the root cause in b)
> >
> > 4. The userspace-visible problem goes away or is less serious.
> >
>
> Thank you for your suggestions.
> For quick response before submitting v2,
Thanks for all the info, this should be in the cover letter or commit
messages in some shape or form.
>
> 1.
> The visible issue is that pageout/in operations from active processes
> are slow when zswap is near its max pool size. This is particularly
> significant on small memory systems, where total swap usage exceeds
> what zswap can store. This means that old pages occupy most of the
> zswap pool space, and recent pages use swap disk directly.
This should be a transient state though, right? Once the shrinker
kicks in, it should write back the old pages and make space for the
hot ones. Which takes us to our next point.
>
> 2.
> This issue is caused by zswap keeping the pool size near 100%. Since
> the shrinker fails to shrink the pool to accept_thr_percent and zswap
> rejects incoming pages, rejection occurs more frequently than it
> should. The rejected pages are directly written to disk while zswap
> protects old pages from eviction, leading to slow pageout/in
> performance for recent pages on the swap disk.
Why is the shrinker failing? IIUC the first two patches fix two
cases where the shrinker stumbles upon offline memcgs, or memcgs with
no zswapped pages. Are these cases common enough in your use case that
every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
putting the zswap usage below accept_thr_percent?
This would be surprising given that we should be restarting the
shrinker with every swapout attempt until we can accept pages again.
I guess one could construct a malicious case where there are some
sticky offline memcgs, and all the memcgs that actually have zswap
pages come after them in the iteration order.
Could you shed more light on this? What does the setup look like?
How many memcgs are there, how many of them use zswap, and how many
offline memcgs are you observing?
I am not saying we shouldn't fix these problems anyway, I am just
trying to understand how we got into this situation to begin with.
>
> 3.
> If the pool size were shrunk proactively, rejection by pool limit hits
> would be less likely. New incoming pages could be accepted as the pool
> gains some space in advance, while older pages are written back in the
> background. zswap would then be filled with recent pages, as expected
> in the LRU logic.
I suspect if patches 1 and 2 fix your problem, the shrinker invoked
from reclaim should be doing this sort of "proactive shrinking".
I agree that the current hysteresis around accept_thr_percent is not
good enough, but I am surprised you are hitting the pool limit if the
shrinker is being run during reclaim.
>
> Patch 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> Patch 3 makes zswap_store trigger the shrinker before reaching the max
> pool size. With this series, zswap will prepare some space to reduce
> the probability of problematic pool_limit_hit situation, thus reducing
> slow reclaim and the page priority inversion against LRU.
>
> 4.
> Once proactive shrinking reduces the pool size, pageouts complete
> instantly as long as the space prepared by shrinking can store the
> direct reclaim. If an admin sees a large pool_limit_hit, lowering
> accept_threshold_percent will improve active process performance.
I agree that proactive shrinking is preferable to waiting until we hit
the pool limit and then stopping the intake of pages until the
acceptance threshold is reached.
I am just trying to understand whether such a proactive shrinking
mechanism will be needed if the reclaim-driven zswap shrinker is being
used, and how the two would work together.