Since:
"42c06a0e8ebe mm: kill frontswap"
we no longer have a counter to tracks the number of zswap store
failures. This makes it hard to investigate and monitor for zswap
issues.
This patch adds a global and a per-cgroup zswap store failure counter,
as well as a dedicated debugfs counter for compression algorithm failure
(which can happen for e.g when random data are passed to zswap).
Signed-off-by: Nhat Pham <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/memcontrol.c | 1 +
mm/vmstat.c | 1 +
mm/zswap.c | 18 ++++++++++++++----
4 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..7b2b117b193d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_ZSWAP
ZSWPIN,
ZSWPOUT,
+ ZSWPOUT_FAIL,
#endif
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 61c0c46c2d62..0e247e72a379 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -593,6 +593,7 @@ static const unsigned int memcg_vm_event_stat[] = {
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
ZSWPIN,
ZSWPOUT,
+ ZSWPOUT_FAIL,
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 359460deb377..85cc79449355 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_ZSWAP
"zswpin",
"zswpout",
+ "zswpout_fail",
#endif
#ifdef CONFIG_X86
"direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index 37d2b1cb2ecb..38e6620f8b58 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -61,6 +61,8 @@ static u64 zswap_pool_limit_hit;
static u64 zswap_written_back_pages;
/* Store failed due to a reclaim failure after pool limit was reached */
static u64 zswap_reject_reclaim_fail;
+/* Store failed due to compression algorithm failure */
+static u64 zswap_reject_compress_fail;
/* Compressed page was too big for the allocator to (optimally) store */
static u64 zswap_reject_compress_poor;
/* Store failed because underlying allocator could not get memory */
@@ -1213,10 +1215,10 @@ bool zswap_store(struct folio *folio)
/* Large folios aren't supported */
if (folio_test_large(folio))
- return false;
+ goto out_reject;
if (!zswap_enabled || !tree)
- return false;
+ goto out_reject;
/*
* If this is a duplicate, it must be removed before attempting to store
@@ -1309,8 +1311,10 @@ bool zswap_store(struct folio *folio)
ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
dlen = acomp_ctx->req->dlen;
- if (ret)
+ if (ret) {
+ zswap_reject_compress_fail++;
goto put_dstmem;
+ }
/* store */
zpool = zswap_find_zpool(entry);
@@ -1377,8 +1381,12 @@ bool zswap_store(struct folio *folio)
freepage:
zswap_entry_cache_free(entry);
reject:
- if (objcg)
+ if (objcg) {
+ count_objcg_event(objcg, ZSWPOUT_FAIL);
obj_cgroup_put(objcg);
+ }
+out_reject:
+ count_vm_event(ZSWPOUT_FAIL);
return false;
shrink:
@@ -1550,6 +1558,8 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, &zswap_reject_alloc_fail);
debugfs_create_u64("reject_kmemcache_fail", 0444,
zswap_debugfs_root, &zswap_reject_kmemcache_fail);
+ debugfs_create_u64("reject_compress_fail", 0444,
+ zswap_debugfs_root, &zswap_reject_compress_fail);
debugfs_create_u64("reject_compress_poor", 0444,
zswap_debugfs_root, &zswap_reject_compress_poor);
debugfs_create_u64("written_back_pages", 0444,
--
2.34.1
On Mon, Oct 23, 2023 at 05:07:02PM -0700, Nhat Pham wrote:
> Since:
>
> "42c06a0e8ebe mm: kill frontswap"
>
> we no longer have a counter to tracks the number of zswap store
> failures. This makes it hard to investigate and monitor for zswap
> issues.
>
> This patch adds a global and a per-cgroup zswap store failure counter,
> as well as a dedicated debugfs counter for compression algorithm failure
> (which can happen for e.g when random data are passed to zswap).
>
> Signed-off-by: Nhat Pham <[email protected]>
I agree this is an issue.
> ---
> include/linux/vm_event_item.h | 1 +
> mm/memcontrol.c | 1 +
> mm/vmstat.c | 1 +
> mm/zswap.c | 18 ++++++++++++++----
> 4 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 8abfa1240040..7b2b117b193d 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #ifdef CONFIG_ZSWAP
> ZSWPIN,
> ZSWPOUT,
> + ZSWPOUT_FAIL,
Would the writeback stat be sufficient to determine this?
Hear me out. We already have pswpout that shows when we're hitting
disk swap. Right now we can't tell if this is because of a rejection
or because of writeback. With a writeback counter we could.
And I think we want the writeback counter anyway going forward in
order to monitor and understand the dynamic shrinker's performance.
Either way we go, one of the metrics needs to be derived from the
other(s). But I think subtle and not so subtle shrinker issues are
more concerning than outright configuration problems where zswap
doesn't work at all. The latter is easier to catch before or during
early deployment with simple functionality tests.
Plus, rejections should be rare. They are now, and they should become
even more rare or cease to exist going forward. Because every time
they happen at scale, they represent problematic LRU inversions. We
have patched, have pending patches, or discussed changes to reduce
every single one of them:
/* Store failed due to a reclaim failure after pool limit was reached */
static u64 zswap_reject_reclaim_fail;
With the shrinker this becomes less relevant. There was also the
proposal to disable the bypass to swap and just keep the page.
/* Compressed page was too big for the allocator to (optimally) store */
static u64 zswap_reject_compress_poor;
You were working on eradicating this (with zsmalloc at least).
/* Store failed because underlying allocator could not get memory */
static u64 zswap_reject_alloc_fail;
/* Store failed because the entry metadata could not be allocated (rare) */
static u64 zswap_reject_kmemcache_fail;
These shouldn't happen at all due to PF_MEMALLOC.
IOW, the fail counter is expected to stay zero in healthy,
well-configured systems. Rather than an MM event that needs counting,
this strikes me as something that could be a WARN down the line...
I agree with adding the debugfs counter though.
On Tue, Oct 24, 2023 at 9:09 AM Johannes Weiner <[email protected]> wrote:
>
> On Mon, Oct 23, 2023 at 05:07:02PM -0700, Nhat Pham wrote:
> > Since:
> >
> > "42c06a0e8ebe mm: kill frontswap"
> >
> > we no longer have a counter to tracks the number of zswap store
> > failures. This makes it hard to investigate and monitor for zswap
> > issues.
> >
> > This patch adds a global and a per-cgroup zswap store failure counter,
> > as well as a dedicated debugfs counter for compression algorithm failure
> > (which can happen for e.g when random data are passed to zswap).
> >
> > Signed-off-by: Nhat Pham <[email protected]>
>
> I agree this is an issue.
>
> > ---
> > include/linux/vm_event_item.h | 1 +
> > mm/memcontrol.c | 1 +
> > mm/vmstat.c | 1 +
> > mm/zswap.c | 18 ++++++++++++++----
> > 4 files changed, 17 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 8abfa1240040..7b2b117b193d 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > #ifdef CONFIG_ZSWAP
> > ZSWPIN,
> > ZSWPOUT,
> > + ZSWPOUT_FAIL,
>
> Would the writeback stat be sufficient to determine this?
>
> Hear me out. We already have pswpout that shows when we're hitting
> disk swap. Right now we can't tell if this is because of a rejection
> or because of writeback. With a writeback counter we could.
Oh I see! It's a bit of an extra step, but I supposed (pswpout - writeback)
could give us the number of zswap store failures.
>
> And I think we want the writeback counter anyway going forward in
> order to monitor and understand the dynamic shrinker's performance.
Domenico and I were talking about this, and we both agree the writeback
counter is absolutely necessary - if anything, to make sure that the
shrinker is not a) completely not working or b) going overboard.
So it is coming as part of the shrinker regardless of this.
I just didn't realize that it also solves this issue we're having too!
>
> Either way we go, one of the metrics needs to be derived from the
> other(s). But I think subtle and not so subtle shrinker issues are
> more concerning than outright configuration problems where zswap
> doesn't work at all. The latter is easier to catch before or during
> early deployment with simple functionality tests.
>
> Plus, rejections should be rare. They are now, and they should become
> even more rare or cease to exist going forward. Because every time
> they happen at scale, they represent problematic LRU inversions. We
> have patched, have pending patches, or discussed changes to reduce
> every single one of them:
>
> /* Store failed due to a reclaim failure after pool limit was reached */
> static u64 zswap_reject_reclaim_fail;
>
> With the shrinker this becomes less relevant. There was also the
> proposal to disable the bypass to swap and just keep the page.
The shrinker and that proposal sound like good ideas ;)
>
> /* Compressed page was too big for the allocator to (optimally) store */
> static u64 zswap_reject_compress_poor;
>
> You were working on eradicating this (with zsmalloc at least).
>
> /* Store failed because underlying allocator could not get memory */
> static u64 zswap_reject_alloc_fail;
> /* Store failed because the entry metadata could not be allocated (rare) */
> static u64 zswap_reject_kmemcache_fail;
>
> These shouldn't happen at all due to PF_MEMALLOC.
>
> IOW, the fail counter is expected to stay zero in healthy,
> well-configured systems. Rather than an MM event that needs counting,
> this strikes me as something that could be a WARN down the line...
>
Yup, I agree that it should (mostly) be at 0. It being non-zero (especially
at a higher ratio w.r.t total number of zswap store counts) is an indication
of something wrong - either a bug, misconfiguration, or a very
ill-compressible workload (or again a bug with the compression algorithm).
A WARN might be good too, but if it's just an ill-compressible workload
that might be too many WARNS :)
But we can always just monitor pswpout - writeback (both globally,
and on a cgroup-basis, I assume?).
> I agree with adding the debugfs counter though.
Then I'll send a new patch that focuses on the debugfs counter
(for the compression failure).
Thanks for the feedback, Johannes.