Before storing a page, zswap first checks if the number of stored pages
exceeds the limit specified by memory.zswap.max, for each cgroup in the
hierarchy. If this limit is reached or exceeded, then zswap shrinking is
triggered and short-circuits the store attempt.

However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
will allow future store attempts from processes in this cgroup to
succeed. Furthermore, this creates a pathological behavior in a system
where some cgroups have memory.zswap.max = 0 and some do not: the
processes in the former cgroups, under memory pressure, will evict pages
stored by the latter continually, until the need for swap ceases or the
pool becomes empty.

As a result of this, we observe a disproportionate amount of zswap
writeback and a perpetually small zswap pool in our experiments, even
though the pool limit is never hit.

This patch fixes the issue by rejecting the zswap store attempt without
shrinking the pool when memory.zswap.max is 0.

Fixes: f4840ccfca25 ("zswap: memcg accounting")
Signed-off-by: Nhat Pham <[email protected]>
---
 include/linux/memcontrol.h | 6 +++---
 mm/memcontrol.c            | 8 ++++----
 mm/zswap.c                 | 9 +++++++--
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 222d7370134c..507bed3a28b0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 #endif /* CONFIG_MEMCG_KMEM */

 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
-bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
+int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
 void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
 void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
 #else
-static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
+static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 {
-	return true;
+	return 0;
 }
 static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
 					   size_t size)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..09aad0e6f2ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
  * spending cycles on compression when there is already no room left
  * or zswap is disabled altogether somewhere in the hierarchy.
  */
-bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
+int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 {
 	struct mem_cgroup *memcg, *original_memcg;
-	bool ret = true;
+	int ret = 0;

 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return true;
@@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 		if (max == PAGE_COUNTER_MAX)
 			continue;
 		if (max == 0) {
-			ret = false;
+			ret = -ENODEV;
 			break;
 		}

@@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
-		ret = false;
+		ret = -ENOMEM;
 		break;
 	}
 	mem_cgroup_put(original_memcg);
diff --git a/mm/zswap.c b/mm/zswap.c
index 59da2a415fbb..7b13dc865438 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	}

 	objcg = get_obj_cgroup_from_page(page);
-	if (objcg && !obj_cgroup_may_zswap(objcg))
-		goto shrink;
+	if (objcg) {
+		ret = obj_cgroup_may_zswap(objcg);
+		if (ret == -ENODEV)
+			goto reject;
+		if (ret == -ENOMEM)
+			goto shrink;
+	}

 	/* reclaim space if needed */
 	if (zswap_is_full()) {
--
2.34.1
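
For readers following the thread, the control flow of the patch above can be
sketched in userspace roughly as follows. This is only an illustrative model,
not kernel code: struct fake_memcg, may_zswap(), store_decision() and
ZSWAP_MAX_UNLIMITED are hypothetical stand-ins (the last one for
PAGE_COUNTER_MAX), and the three actions stand in for falling through,
"goto shrink" and "goto reject" in zswap_frontswap_store().

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define ZSWAP_MAX_UNLIMITED ULONG_MAX	/* stand-in for PAGE_COUNTER_MAX */

/* Hypothetical, simplified cgroup: limit and usage counted in pages. */
struct fake_memcg {
	unsigned long zswap_max;
	unsigned long zswap_pages;
	struct fake_memcg *parent;	/* NULL marks the root */
};

enum store_action { STORE_PROCEED, STORE_SHRINK, STORE_REJECT };

/* Walk from the charging cgroup up to, but excluding, the root. */
static int may_zswap(const struct fake_memcg *memcg)
{
	for (; memcg && memcg->parent; memcg = memcg->parent) {
		if (memcg->zswap_max == ZSWAP_MAX_UNLIMITED)
			continue;
		if (memcg->zswap_max == 0)
			return -ENODEV;	/* zswap disabled: writeback cannot help */
		if (memcg->zswap_pages >= memcg->zswap_max)
			return -ENOMEM;	/* limit hit: shrinking might help */
	}
	return 0;
}

/* Map the check result onto what the store path does next. */
static enum store_action store_decision(const struct fake_memcg *memcg)
{
	switch (may_zswap(memcg)) {
	case -ENODEV:
		return STORE_REJECT;
	case -ENOMEM:
		return STORE_SHRINK;
	default:
		return STORE_PROCEED;
	}
}
```

In this model a cgroup whose ancestor has a limit of 0 is rejected as well,
since the walk stops at the first ancestor that disables zswap.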
On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..7b13dc865438 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> }
>
> objcg = get_obj_cgroup_from_page(page);
> - if (objcg && !obj_cgroup_may_zswap(objcg))
> - goto shrink;
> + if (objcg) {
> + ret = obj_cgroup_may_zswap(objcg);
> + if (ret == -ENODEV)
> + goto reject;
> + if (ret == -ENOMEM)
> + goto shrink;
> + }
I wonder if we should just make this:

	if (objcg && !obj_cgroup_may_zswap(objcg))
		goto reject;

Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
zswap pool will only help if we happen to write back a page from the
same memcg that hit its limit. Keep in mind that we will only
write back one page every time we observe that the limit is hit (even
with Domenico's patch, because zswap_can_accept() should be true).

On a system with a handful of memcgs, it seems likely that we
wrongfully write back pages from other memcgs because of this,
achieving nothing for this memcg while hurting others. OTOH, without
invoking writeback when the limit is hit, the memcg will just not be
able to use zswap until some pages are faulted back in or invalidated.

I am not sure which is better, just thinking out loud.

Seems like this can be solved by having per-memcg LRUs, or at least
providing an argument to the shrinker of which memcg to reclaim from.
This would only be possible when the LRU is moved to zswap.
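
To make the concern above concrete, here is a small userspace sketch of a
single LRU shared by all memcgs. It is a toy model under stated assumptions,
not zswap's data structure: struct global_lru, lru_add(), lru_count() and
lru_shrink_one() are hypothetical names, and memcgs are reduced to integer
ids. The point is only that shrinking from the old end picks a victim
without looking at who hit the limit.

```c
#include <stddef.h>

#define LRU_CAP 16

/* Hypothetical global LRU: owner ids of stored pages, oldest first. */
struct global_lru {
	int owner[LRU_CAP];
	size_t len;
};

/* Append a page owned by memcg `id` at the young end; -1 if full. */
static int lru_add(struct global_lru *lru, int id)
{
	if (lru->len == LRU_CAP)
		return -1;
	lru->owner[lru->len++] = id;
	return 0;
}

/* Count pages currently stored on behalf of memcg `id`. */
static size_t lru_count(const struct global_lru *lru, int id)
{
	size_t i, n = 0;

	for (i = 0; i < lru->len; i++)
		if (lru->owner[i] == id)
			n++;
	return n;
}

/*
 * Write back one page from the old end, regardless of owner.
 * Returns the evicted page's owner, or -1 if the LRU is empty.
 */
static int lru_shrink_one(struct global_lru *lru)
{
	size_t i;
	int victim;

	if (lru->len == 0)
		return -1;
	victim = lru->owner[0];
	for (i = 1; i < lru->len; i++)
		lru->owner[i - 1] = lru->owner[i];
	lru->len--;
	return victim;
}
```

If memcg B stored its pages before memcg A did, a shrink triggered by A
hitting its limit returns B as the victim, and A's own page count is
unchanged afterwards.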
>
> /* reclaim space if needed */
> if (zswap_is_full()) {
> --
> 2.34.1
>
>
On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 59da2a415fbb..7b13dc865438 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > }
> >
> > objcg = get_obj_cgroup_from_page(page);
> > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > - goto shrink;
> > + if (objcg) {
> > + ret = obj_cgroup_may_zswap(objcg);
> > + if (ret == -ENODEV)
> > + goto reject;
> > + if (ret == -ENOMEM)
> > + goto shrink;
> > + }
>
> I wonder if we should just make this:
>
> if (objcg && !obj_cgroup_may_zswap(objcg))
> goto reject;
>
> Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> zswap pool will only help if we happen to writeback a page from the
> same memcg that hit its limit. Keep in mind that we will only
> writeback one page every time we observe that the limit is hit (even
> with Domenico's patch, because zswap_can_accept() should be true).
>
> On a system with a handful of memcgs,
> it seems likely that we wrongfully writeback pages from other memcgs
> because of this. Achieving nothing for this memcg, while hurting
> others. OTOH, without invoking writeback when the limit is hit, the
> memcg will just not be able to use zswap until some pages are
> faulted back in or invalidated.
>
> I am not sure which is better, just thinking out loud.
You're absolutely right.

Currently the choice is writing back either everybody or nobody,
meaning between writeback and cgroup containment. They're both so poor
that I can't say I strongly prefer one over the other.

However, I have a lame argument in favor of this patch:

The last few fixes from Nhat and Domenico around writeback show that
few people, if anybody, are actually using writeback. So it might not
actually matter that much in practice which way we go with this patch.
Per-memcg LRUs will be necessary for it to work right.

However, what Nhat is proposing is how we want the behavior down the
line. So between two equally poor choices, I figure we might as well
go with the one that doesn't require another code change later on.

Doesn't that fill you with radiant enthusiasm?
> Seems like this can be solved by having per-memcg LRUs, or at least
> providing an argument to the shrinker of which memcg to reclaim from.
> This would only be possible when the LRU is moved to zswap.
+1
On Tue, May 30, 2023 at 9:53 AM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 59da2a415fbb..7b13dc865438 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > }
> >
> > objcg = get_obj_cgroup_from_page(page);
> > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > - goto shrink;
> > + if (objcg) {
> > + ret = obj_cgroup_may_zswap(objcg);
> > + if (ret == -ENODEV)
> > + goto reject;
> > + if (ret == -ENOMEM)
> > + goto shrink;
> > + }
>
> I wonder if we should just make this:
>
> if (objcg && !obj_cgroup_may_zswap(objcg))
> goto reject;
>
> Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> zswap pool will only help if we happen to writeback a page from the
> same memcg that hit its limit. Keep in mind that we will only
> writeback one page every time we observe that the limit is hit (even
> with Domenico's patch, because zswap_can_accept() should be true).
>
> On a system with a handful of memcgs,
> it seems likely that we wrongfully writeback pages from other memcgs
> because of this. Achieving nothing for this memcg, while hurting
> others. OTOH, without invoking writeback when the limit is hit, the
> memcg will just not be able to use zswap until some pages are
> faulted back in or invalidated.
>
> I am not sure which is better, just thinking out loud.
>
> Seems like this can be solved by having per-memcg LRUs, or at least
> providing an argument to the shrinker of which memcg to reclaim from.
> This would only be possible when the LRU is moved to zswap.
I totally agree! This seems like the logical next step in zswap's evolution.
I actually proposed this fix with this future development in mind - with
a per-memcg LRU, we can trigger memcg-specific shrinking in
place of this indiscriminate writeback. It seems less drastic a change
(compared to removing shrinking here now, then reintroducing it later).

Thanks for the feedback, Yosry!
>
>
> >
> > /* reclaim space if needed */
> > if (zswap_is_full()) {
> > --
> > 2.34.1
> >
> >
On Tue, May 30, 2023 at 11:00 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 59da2a415fbb..7b13dc865438 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > }
> > >
> > > objcg = get_obj_cgroup_from_page(page);
> > > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > > - goto shrink;
> > > + if (objcg) {
> > > + ret = obj_cgroup_may_zswap(objcg);
> > > + if (ret == -ENODEV)
> > > + goto reject;
> > > + if (ret == -ENOMEM)
> > > + goto shrink;
> > > + }
> >
> > I wonder if we should just make this:
> >
> > if (objcg && !obj_cgroup_may_zswap(objcg))
> > goto reject;
> >
> > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > zswap pool will only help if we happen to writeback a page from the
> > same memcg that hit its limit. Keep in mind that we will only
> > writeback one page every time we observe that the limit is hit (even
> > with Domenico's patch, because zswap_can_accept() should be true).
> >
> > On a system with a handful of memcgs,
> > it seems likely that we wrongfully writeback pages from other memcgs
> > because of this. Achieving nothing for this memcg, while hurting
> > others. OTOH, without invoking writeback when the limit is hit, the
> > memcg will just not be able to use zswap until some pages are
> > faulted back in or invalidated.
> >
> > I am not sure which is better, just thinking out loud.
>
> You're absolutely right.
>
> Currently the choice is writing back either everybody or nobody,
> meaning between writeback and cgroup containment. They're both so poor
> that I can't say I strongly prefer one over the other.
>
> However, I have a lame argument in favor of this patch:
>
> The last few fixes from Nhat and Domenico around writeback show that
> few people, if anybody, are actually using writeback. So it might not
> actually matter that much in practice which way we go with this patch.
> Per-memcg LRUs will be necessary for it to work right.
>
> However, what Nhat is proposing is how we want the behavior down the
> line. So between two equally poor choices, I figure we might as well
> go with the one that doesn't require another code change later on.
>
> Doesn't that fill you with radiant enthusiasm?
If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
be in one of two situations:

(a) memory.zswap.max has always been 0, so the LRU for this memcg is
empty, so we don't really need the special case for memory.zswap.max
== 0.

(b) memory.zswap.max was reduced to 0 at some point, and some pages
are already in zswap. In this case, I don't think shrinking the memcg
is such a bad idea; we would be lazily enforcing the limit.

In that sense I am not sure that this change won't require another
code change. It feels like special casing memory.zswap.max == 0 is
only needed now due to the lack of per-memcg LRUs.
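
The two situations can be sketched with a toy per-memcg model. Everything
here is an assumption made for illustration: struct memcg_lru and
memcg_try_store() are hypothetical, the LRU is reduced to a page count, and
the store is refused whenever the limit is hit, mirroring how the shrink
path short-circuits the store attempt today.

```c
#include <stdbool.h>

/* Hypothetical memcg with its own LRU, reduced here to a page count. */
struct memcg_lru {
	unsigned long zswap_max;	/* limit in pages */
	unsigned long nr_pages;		/* pages on this memcg's LRU */
	unsigned long written_back;	/* pages written back so far */
};

/*
 * Attempt a store. If the memcg is at or over its limit, write back one
 * page from its own LRU (if there is one) and refuse the store.
 * Returns true if the page was accepted.
 */
static bool memcg_try_store(struct memcg_lru *m)
{
	if (m->nr_pages >= m->zswap_max) {
		if (m->nr_pages > 0) {
			m->nr_pages--;
			m->written_back++;
		}
		return false;
	}
	m->nr_pages++;
	return true;
}
```

With zswap_max == 0 and an empty LRU (case a), every store is refused and
nothing is written back, so no special case is needed. With zswap_max == 0
and pages left over from before the limit was lowered (case b), each refused
store writes back one of the memcg's own pages until its LRU drains, which
is the lazy enforcement described above.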
>
> > Seems like this can be solved by having per-memcg LRUs, or at least
> > providing an argument to the shrinker of which memcg to reclaim from.
> > This would only be possible when the LRU is moved to zswap.
>
> +1
On Tue, May 30, 2023 at 11:27 AM Nhat Pham <[email protected]> wrote:
>
> On Tue, May 30, 2023 at 9:53 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 59da2a415fbb..7b13dc865438 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > }
> > >
> > > objcg = get_obj_cgroup_from_page(page);
> > > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > > - goto shrink;
> > > + if (objcg) {
> > > + ret = obj_cgroup_may_zswap(objcg);
> > > + if (ret == -ENODEV)
> > > + goto reject;
> > > + if (ret == -ENOMEM)
> > > + goto shrink;
> > > + }
> >
> > I wonder if we should just make this:
> >
> > if (objcg && !obj_cgroup_may_zswap(objcg))
> > goto reject;
> >
> > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > zswap pool will only help if we happen to writeback a page from the
> > same memcg that hit its limit. Keep in mind that we will only
> > writeback one page every time we observe that the limit is hit (even
> > with Domenico's patch, because zswap_can_accept() should be true).
> >
> > On a system with a handful of memcgs,
> > it seems likely that we wrongfully writeback pages from other memcgs
> > because of this. Achieving nothing for this memcg, while hurting
> > others. OTOH, without invoking writeback when the limit is hit, the
> > memcg will just not be able to use zswap until some pages are
> > faulted back in or invalidated.
> >
> > I am not sure which is better, just thinking out loud.
> >
> > Seems like this can be solved by having per-memcg LRUs, or at least
> > providing an argument to the shrinker of which memcg to reclaim from.
> > This would only be possible when the LRU is moved to zswap.
>
> I totally agree! This seems like the logical next step in zswap's evolution.
> I actually proposed this fix with this future development in mind - with
> a per-memcg LRU, we can trigger memcg-specific shrinking in
> place of this indiscriminate writeback. It seems less drastic a change
> (compared to removing shrinking here now, then reintroducing it later).

As I stated in my reply to Johannes, I am just not sure that we will
need to special case memory.zswap.max == 0 when we have proper
writeback. WDYT?

>
> Thanks for the feedback, Yosry!
>
> >
> >
> > >
> > > /* reclaim space if needed */
> > > if (zswap_is_full()) {
> > > --
> > > 2.34.1
> > >
> > >
On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote:
> On Tue, May 30, 2023 at 11:00 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > > >
> > > > Before storing a page, zswap first checks if the number of stored pages
> > > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > > triggered and short-circuits the store attempt.
> > > >
> > > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > > will allow future store attempts from processes in this cgroup to
> > > > succeed. Furthermore, this creates a pathological behavior in a system
> > > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > > processes in the former cgroups, under memory pressure, will evict pages
> > > > stored by the latter continually, until the need for swap ceases or the
> > > > pool becomes empty.
> > > >
> > > > As a result of this, we observe a disproportionate amount of zswap
> > > > writeback and a perpetually small zswap pool in our experiments, even
> > > > though the pool limit is never hit.
> > > >
> > > > This patch fixes the issue by rejecting zswap store attempt without
> > > > shrinking the pool when memory.zswap.max is 0.
> > > >
> > > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > > Signed-off-by: Nhat Pham <[email protected]>
> > > > ---
> > > > include/linux/memcontrol.h | 6 +++---
> > > > mm/memcontrol.c | 8 ++++----
> > > > mm/zswap.c | 9 +++++++--
> > > > 3 files changed, 14 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 222d7370134c..507bed3a28b0 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > > > #endif /* CONFIG_MEMCG_KMEM */
> > > >
> > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > #else
> > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > {
> > > > - return true;
> > > > + return 0;
> > > > }
> > > > static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > > > size_t size)
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 4b27e245a055..09aad0e6f2ea 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > > > * spending cycles on compression when there is already no room left
> > > > * or zswap is disabled altogether somewhere in the hierarchy.
> > > > */
> > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > {
> > > > struct mem_cgroup *memcg, *original_memcg;
> > > > - bool ret = true;
> > > > + int ret = 0;
> > > >
> > > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > return true;
> > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > if (max == PAGE_COUNTER_MAX)
> > > > continue;
> > > > if (max == 0) {
> > > > - ret = false;
> > > > + ret = -ENODEV;
> > > > break;
> > > > }
> > > >
> > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > > > if (pages < max)
> > > > continue;
> > > > - ret = false;
> > > > + ret = -ENOMEM;
> > > > break;
> > > > }
> > > > mem_cgroup_put(original_memcg);
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 59da2a415fbb..7b13dc865438 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > }
> > > >
> > > > objcg = get_obj_cgroup_from_page(page);
> > > > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > - goto shrink;
> > > > + if (objcg) {
> > > > + ret = obj_cgroup_may_zswap(objcg);
> > > > + if (ret == -ENODEV)
> > > > + goto reject;
> > > > + if (ret == -ENOMEM)
> > > > + goto shrink;
> > > > + }
> > >
> > > I wonder if we should just make this:
> > >
> > > if (objcg && !obj_cgroup_may_zswap(objcg))
> > > goto reject;
> > >
> > > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > > zswap pool will only help if we happen to writeback a page from the
> > > same memcg that hit its limit. Keep in mind that we will only
> > > writeback one page every time we observe that the limit is hit (even
> > > with Domenico's patch, because zswap_can_accept() should be true).
> > >
> > > On a system with a handful of memcgs,
> > > it seems likely that we wrongfully writeback pages from other memcgs
> > > because of this. Achieving nothing for this memcg, while hurting
> > > others. OTOH, without invoking writeback when the limit is hit, the
> > > memcg will just not be able to use zswap until some pages are
> > > faulted back in or invalidated.
> > >
> > > I am not sure which is better, just thinking out loud.
> >
> > You're absolutely right.
> >
> > Currently the choice is writing back either everybody or nobody,
> > meaning between writeback and cgroup containment. They're both so poor
> > that I can't say I strongly prefer one over the other.
> >
> > However, I have a lame argument in favor of this patch:
> >
> > The last few fixes from Nhat and Domenico around writeback show that
> > few people, if anybody, are actually using writeback. So it might not
> > actually matter that much in practice which way we go with this patch.
> > Per-memcg LRUs will be necessary for it to work right.
> >
> > However, what Nhat is proposing is how we want the behavior down the
> > line. So between two equally poor choices, I figure we might as well
> > go with the one that doesn't require another code change later on.
> >
> > Doesn't that fill you with radiant enthusiasm?
>
> If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
> be in one of two situations:
>
> (a) memory.zswap.max has always been 0, so the LRU for this memcg is
> empty, so we don't really need the special case for memory.zswap.max
> == 0.
>
> (b) memory.zswap.max was reduced to 0 at some point, and some pages
> are already in zswap. In this case, I don't think shrinking the memcg
> is such a bad idea, we would be lazily enforcing the limit.
>
> In that sense I am not sure that this change won't require another
> code change. It feels like special casing memory.zswap.max == 0 is
> only needed now due to the lack of per-memcg LRUs.

Good point. And I agree down the line we should just always send the
shrinker off optimistically on the cgroup's lru list.

So I take back my lame argument. But that then still leaves us with
the situation that both choices are equal here, right?

If so, my vote would be to go with the patch as-is.
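
The per-cgroup shrinking idea can be sketched as a small userspace model.
Everything below is hypothetical (struct toy_memcg_lru and
toy_shrink_memcg() are made-up names, not kernel code); it only
illustrates why cases (a) and (b) would need no special handling once
writeback is confined to the offending cgroup's own list.

```c
#include <stddef.h>

/* Hypothetical model: each cgroup owns its own list of zswap entries. */
struct toy_memcg_lru {
	unsigned long zswap_max;	/* limit in pages */
	unsigned long nr_entries;	/* entries on this cgroup's own LRU */
};

/*
 * Write back entries from this cgroup only, until it is back under its
 * limit or its list is empty. Returns the number of entries written back.
 * Other cgroups are never touched, by construction.
 */
static unsigned long toy_shrink_memcg(struct toy_memcg_lru *cg)
{
	unsigned long written = 0;

	while (cg->nr_entries > 0 && cg->nr_entries >= cg->zswap_max) {
		cg->nr_entries--;	/* stands in for one writeback */
		written++;
	}
	return written;
}
```

In this model a cgroup whose limit has always been 0 has an empty list,
so the shrinker is a no-op (case a). A cgroup whose limit was lowered to
0 later gets drained lazily (case b).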
On Tue, May 30, 2023 at 12:13 PM Johannes Weiner <[email protected]> wrote:
>
> On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote:
> > On Tue, May 30, 2023 at 11:00 AM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > > > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > > > >
> > > > > Before storing a page, zswap first checks if the number of stored pages
> > > > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > > > triggered and short-circuits the store attempt.
> > > > >
> > > > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > > > will allow future store attempts from processes in this cgroup to
> > > > > succeed. Furthermore, this creates a pathological behavior in a system
> > > > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > > > processes in the former cgroups, under memory pressure, will evict pages
> > > > > stored by the latter continually, until the need for swap ceases or the
> > > > > pool becomes empty.
> > > > >
> > > > > As a result of this, we observe a disproportionate amount of zswap
> > > > > writeback and a perpetually small zswap pool in our experiments, even
> > > > > though the pool limit is never hit.
> > > > >
> > > > > This patch fixes the issue by rejecting zswap store attempt without
> > > > > shrinking the pool when memory.zswap.max is 0.
> > > > >
> > > > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > > > Signed-off-by: Nhat Pham <[email protected]>
> > > > > ---
> > > > > include/linux/memcontrol.h | 6 +++---
> > > > > mm/memcontrol.c | 8 ++++----
> > > > > mm/zswap.c | 9 +++++++--
> > > > > 3 files changed, 14 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > index 222d7370134c..507bed3a28b0 100644
> > > > > --- a/include/linux/memcontrol.h
> > > > > +++ b/include/linux/memcontrol.h
> > > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > > > > #endif /* CONFIG_MEMCG_KMEM */
> > > > >
> > > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > #else
> > > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > {
> > > > > - return true;
> > > > > + return 0;
> > > > > }
> > > > > static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > > > > size_t size)
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index 4b27e245a055..09aad0e6f2ea 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > > > > * spending cycles on compression when there is already no room left
> > > > > * or zswap is disabled altogether somewhere in the hierarchy.
> > > > > */
> > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > {
> > > > > struct mem_cgroup *memcg, *original_memcg;
> > > > > - bool ret = true;
> > > > > + int ret = 0;
> > > > >
> > > > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > > return true;
> > > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > if (max == PAGE_COUNTER_MAX)
> > > > > continue;
> > > > > if (max == 0) {
> > > > > - ret = false;
> > > > > + ret = -ENODEV;
> > > > > break;
> > > > > }
> > > > >
> > > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > > > > if (pages < max)
> > > > > continue;
> > > > > - ret = false;
> > > > > + ret = -ENOMEM;
> > > > > break;
> > > > > }
> > > > > mem_cgroup_put(original_memcg);
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index 59da2a415fbb..7b13dc865438 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > > }
> > > > >
> > > > > objcg = get_obj_cgroup_from_page(page);
> > > > > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > - goto shrink;
> > > > > + if (objcg) {
> > > > > + ret = obj_cgroup_may_zswap(objcg);
> > > > > + if (ret == -ENODEV)
> > > > > + goto reject;
> > > > > + if (ret == -ENOMEM)
> > > > > + goto shrink;
> > > > > + }
> > > >
> > > > I wonder if we should just make this:
> > > >
> > > > if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > goto reject;
> > > >
> > > > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > > > zswap pool will only help if we happen to writeback a page from the
> > > > same memcg that hit its limit. Keep in mind that we will only
> > > > writeback one page every time we observe that the limit is hit (even
> > > > with Domenico's patch, because zswap_can_accept() should be true).
> > > >
> > > > On a system with a handful of memcgs,
> > > > it seems likely that we wrongfully writeback pages from other memcgs
> > > > because of this. Achieving nothing for this memcg, while hurting
> > > > others. OTOH, without invoking writeback when the limit is hit, the
> > > > memcg will just not be able to use zswap until some pages are
> > > > faulted back in or invalidated.
> > > >
> > > > I am not sure which is better, just thinking out loud.
> > >
> > > You're absolutely right.
> > >
> > > Currently the choice is writing back either everybody or nobody,
> > > meaning between writeback and cgroup containment. They're both so poor
> > > that I can't say I strongly prefer one over the other.
> > >
> > > However, I have a lame argument in favor of this patch:
> > >
> > > The last few fixes from Nhat and Domenico around writeback show that
> > > few people, if anybody, are actually using writeback. So it might not
> > > actually matter that much in practice which way we go with this patch.
> > > Per-memcg LRUs will be necessary for it to work right.
> > >
> > > However, what Nhat is proposing is how we want the behavior down the
> > > line. So between two equally poor choices, I figure we might as well
> > > go with the one that doesn't require another code change later on.
> > >
> > > Doesn't that fill you with radiant enthusiasm?
> >
> > If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
> > be in one of two situations:
> >
> > (a) memory.zswap.max has always been 0, so the LRU for this memcg is
> > empty, so we don't really need the special case for memory.zswap.max
> > == 0.
> >
> > (b) memory.zswap.max was reduced to 0 at some point, and some pages
> > are already in zswap. In this case, I don't think shrinking the memcg
> > is such a bad idea, we would be lazily enforcing the limit.
> >
> > In that sense I am not sure that this change won't require another
> > code change. It feels like special casing memory.zswap.max == 0 is
> > only needed now due to the lack of per-memcg LRUs.
>
> Good point. And I agree down the line we should just always send the
> shrinker off optimistically on the cgroup's lru list.
>
> So I take back my lame argument. But that then still leaves us with
> the situation that both choices are equal here, right?
>
> If so, my vote would be to go with the patch as-is.

I *think* it's better to punish the memcg that exceeded its limit by
not allowing it to use zswap until its usage goes down, rather than
punish random memcgs on the machine because one memcg hit its limit.
It also seems to me that on a system with a handful of memcgs, it is
statistically more likely for zswap shrinking to writeback a page from
the wrong memcg.

The code would also be simpler if obj_cgroup_may_zswap() just returns
a boolean and we do not shrink at all if it returns false. If it no
longer returns a boolean we should at least rename it.

Did you try just not shrinking at all if the memcg limit is hit in
your experiments?

I don't feel strongly, but my preference would be to just not shrink
at all if obj_cgroup_may_zswap() returns false.
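
A minimal userspace sketch of the boolean variant, assuming a toy
hierarchy rather than the real memcg structures (struct toy_memcg,
TOY_MAX_UNLIMITED and toy_may_zswap() are hypothetical names). Unlike
the kernel loop, this model also checks the root and keeps a plain
per-level page count:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "no limit set" (PAGE_COUNTER_MAX in the patch). */
#define TOY_MAX_UNLIMITED ULONG_MAX

/* Hypothetical userspace stand-in for a memcg; not the kernel structure. */
struct toy_memcg {
	struct toy_memcg *parent;	/* NULL at the root */
	unsigned long zswap_max;	/* limit in pages */
	unsigned long zswap_pages;	/* pages currently charged to zswap */
};

/*
 * Walk from the cgroup up to the root. Any level that is at or over its
 * limit vetoes the store. A limit of 0 is covered by the same comparison,
 * so the caller needs only one outcome: reject without shrinking.
 */
static bool toy_may_zswap(const struct toy_memcg *cg)
{
	for (; cg; cg = cg->parent) {
		if (cg->zswap_max == TOY_MAX_UNLIMITED)
			continue;
		if (cg->zswap_pages >= cg->zswap_max)
			return false;
	}
	return true;
}
```

With a single false result there is no need for distinct error codes,
and the function name still matches what it returns.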
On Tue, May 30, 2023 at 01:19:12PM -0700, Yosry Ahmed wrote:
> On Tue, May 30, 2023 at 12:13 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote:
> > > On Tue, May 30, 2023 at 11:00 AM Johannes Weiner <[email protected]> wrote:
> > > >
> > > > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > > > > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > > > > >
> > > > > > Before storing a page, zswap first checks if the number of stored pages
> > > > > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > > > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > > > > triggered and short-circuits the store attempt.
> > > > > >
> > > > > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > > > > will allow future store attempts from processes in this cgroup to
> > > > > > succeed. Furthermore, this creates a pathological behavior in a system
> > > > > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > > > > processes in the former cgroups, under memory pressure, will evict pages
> > > > > > stored by the latter continually, until the need for swap ceases or the
> > > > > > pool becomes empty.
> > > > > >
> > > > > > As a result of this, we observe a disproportionate amount of zswap
> > > > > > writeback and a perpetually small zswap pool in our experiments, even
> > > > > > though the pool limit is never hit.
> > > > > >
> > > > > > This patch fixes the issue by rejecting zswap store attempt without
> > > > > > shrinking the pool when memory.zswap.max is 0.
> > > > > >
> > > > > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > > > > Signed-off-by: Nhat Pham <[email protected]>
> > > > > > ---
> > > > > > include/linux/memcontrol.h | 6 +++---
> > > > > > mm/memcontrol.c | 8 ++++----
> > > > > > mm/zswap.c | 9 +++++++--
> > > > > > 3 files changed, 14 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > index 222d7370134c..507bed3a28b0 100644
> > > > > > --- a/include/linux/memcontrol.h
> > > > > > +++ b/include/linux/memcontrol.h
> > > > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > > > > > #endif /* CONFIG_MEMCG_KMEM */
> > > > > >
> > > > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > #else
> > > > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > {
> > > > > > - return true;
> > > > > > + return 0;
> > > > > > }
> > > > > > static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > > > > > size_t size)
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 4b27e245a055..09aad0e6f2ea 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > > > > > * spending cycles on compression when there is already no room left
> > > > > > * or zswap is disabled altogether somewhere in the hierarchy.
> > > > > > */
> > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > {
> > > > > > struct mem_cgroup *memcg, *original_memcg;
> > > > > > - bool ret = true;
> > > > > > + int ret = 0;
> > > > > >
> > > > > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > > > return true;
> > > > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > if (max == PAGE_COUNTER_MAX)
> > > > > > continue;
> > > > > > if (max == 0) {
> > > > > > - ret = false;
> > > > > > + ret = -ENODEV;
> > > > > > break;
> > > > > > }
> > > > > >
> > > > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > > > > > if (pages < max)
> > > > > > continue;
> > > > > > - ret = false;
> > > > > > + ret = -ENOMEM;
> > > > > > break;
> > > > > > }
> > > > > > mem_cgroup_put(original_memcg);
> > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > index 59da2a415fbb..7b13dc865438 100644
> > > > > > --- a/mm/zswap.c
> > > > > > +++ b/mm/zswap.c
> > > > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > > > }
> > > > > >
> > > > > > objcg = get_obj_cgroup_from_page(page);
> > > > > > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > > - goto shrink;
> > > > > > + if (objcg) {
> > > > > > + ret = obj_cgroup_may_zswap(objcg);
> > > > > > + if (ret == -ENODEV)
> > > > > > + goto reject;
> > > > > > + if (ret == -ENOMEM)
> > > > > > + goto shrink;
> > > > > > + }
> > > > >
> > > > > I wonder if we should just make this:
> > > > >
> > > > > if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > goto reject;
> > > > >
> > > > > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > > > > zswap pool will only help if we happen to writeback a page from the
> > > > > same memcg that hit its limit. Keep in mind that we will only
> > > > > writeback one page every time we observe that the limit is hit (even
> > > > > with Domenico's patch, because zswap_can_accept() should be true).
> > > > >
> > > > > On a system with a handful of memcgs,
> > > > > it seems likely that we wrongfully writeback pages from other memcgs
> > > > > because of this. Achieving nothing for this memcg, while hurting
> > > > > others. OTOH, without invoking writeback when the limit is hit, the
> > > > > memcg will just not be able to use zswap until some pages are
> > > > > faulted back in or invalidated.
> > > > >
> > > > > I am not sure which is better, just thinking out loud.
> > > >
> > > > You're absolutely right.
> > > >
> > > > Currently the choice is writing back either everybody or nobody,
> > > > meaning between writeback and cgroup containment. They're both so poor
> > > > that I can't say I strongly prefer one over the other.
> > > >
> > > > However, I have a lame argument in favor of this patch:
> > > >
> > > > The last few fixes from Nhat and Domenico around writeback show that
> > > > few people, if anybody, are actually using writeback. So it might not
> > > > actually matter that much in practice which way we go with this patch.
> > > > Per-memcg LRUs will be necessary for it to work right.
> > > >
> > > > However, what Nhat is proposing is how we want the behavior down the
> > > > line. So between two equally poor choices, I figure we might as well
> > > > go with the one that doesn't require another code change later on.
> > > >
> > > > Doesn't that fill you with radiant enthusiasm?
> > >
> > > If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
> > > be in one of two situations:
> > >
> > > (a) memory.zswap.max has always been 0, so the LRU for this memcg is
> > > empty, so we don't really need the special case for memory.zswap.max
> > > == 0.
> > >
> > > (b) memory.zswap.max was reduced to 0 at some point, and some pages
> > > are already in zswap. In this case, I don't think shrinking the memcg
> > > is such a bad idea, we would be lazily enforcing the limit.
> > >
> > > In that sense I am not sure that this change won't require another
> > > code change. It feels like special casing memory.zswap.max == 0 is
> > > only needed now due to the lack of per-memcg LRUs.
> >
> > Good point. And I agree down the line we should just always send the
> > shrinker off optimistically on the cgroup's lru list.
> >
> > So I take back my lame argument. But that then still leaves us with
> > the situation that both choices are equal here, right?
> >
> > If so, my vote would be to go with the patch as-is.
>
> I *think* it's better to punish the memcg that exceeded its limit by
> not allowing it to use zswap until its usage goes down, rather than
> punish random memcgs on the machine because one memcg hit its limit.
> It also seems to me that on a system with a handful of memcgs, it is
> statistically more likely for zswap shrinking to writeback a page from
> the wrong memcg.

Right, but in either case a hybrid zswap + swap setup with cgroup
isolation is broken anyway. Without it being usable, I'm assuming
there are no users - maybe that's optimistic of me ;)

However, if you think it's better to just be conservative about taking
action in general, that's fine by me as well.

> The code would also be simpler if obj_cgroup_may_zswap() just returns
> a boolean and we do not shrink at all if it returns false. If it no
> longer returns a boolean we should at least rename it.
>
> Did you try just not shrinking at all if the memcg limit is hit in
> your experiments?
>
> I don't feel strongly, but my preference would be to just not shrink
> at all if obj_cgroup_may_zswap() returns false.

Sounds reasonable to me. Basically just replace the goto shrink with
goto reject for now. Maybe a comment that says "XXX: Writeback/reclaim
does not work with cgroups yet. Needs a cgroup-aware entry LRU first,
or we'd push out entries system-wide based on local cgroup limits."

Nhat, does that sound good to you?
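
To make the effect of that one-line change concrete, here is a toy
userspace model of the store path. It is only a sketch under stated
assumptions: struct toy_pool and toy_store() are hypothetical names,
and shrinking is modeled as one synchronous writeback followed by a
failed store, which is a simplification of the real shrink path.

```c
#include <stdbool.h>

/* Hypothetical model of the shared pool; not the kernel's zswap pool. */
struct toy_pool {
	unsigned long pages;		/* entries currently stored */
	unsigned long max_pages;	/* global pool limit */
	unsigned long writebacks;	/* entries pushed out to swap */
};

/* Returns true if the page was stored, false if the store was rejected. */
static bool toy_store(struct toy_pool *pool, bool cgroup_allows)
{
	/*
	 * XXX: Writeback/reclaim does not work with cgroups yet. Needs a
	 * cgroup-aware entry LRU first, or we'd push out entries
	 * system-wide based on local cgroup limits.
	 */
	if (!cgroup_allows)
		return false;		/* "goto reject": pool untouched */

	if (pool->pages >= pool->max_pages) {
		/* Global limit hit: "goto shrink", then the store fails. */
		if (pool->pages > 0) {
			pool->pages--;
			pool->writebacks++;
		}
		return false;
	}

	pool->pages++;
	return true;
}
```

In this model, any number of failed stores from a cgroup that is over
its limit leaves pages and writebacks unchanged. Only the global pool
limit can trigger writeback.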
On Tue, May 30, 2023 at 1:59 PM Johannes Weiner <[email protected]> wrote:
>
> On Tue, May 30, 2023 at 01:19:12PM -0700, Yosry Ahmed wrote:
> > On Tue, May 30, 2023 at 12:13 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote:
> > > > On Tue, May 30, 2023 at 11:00 AM Johannes Weiner <[email protected]> wrote:
> > > > >
> > > > > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > > > > > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <[email protected]> wrote:
> > > > > > >
> > > > > > > Before storing a page, zswap first checks if the number of stored pages
> > > > > > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > > > > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > > > > > triggered and short-circuits the store attempt.
> > > > > > >
> > > > > > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > > > > > will allow future store attempts from processes in this cgroup to
> > > > > > > succeed. Furthermore, this creates a pathological behavior in a system
> > > > > > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > > > > > processes in the former cgroups, under memory pressure, will evict pages
> > > > > > > stored by the latter continually, until the need for swap ceases or the
> > > > > > > pool becomes empty.
> > > > > > >
> > > > > > > As a result of this, we observe a disproportionate amount of zswap
> > > > > > > writeback and a perpetually small zswap pool in our experiments, even
> > > > > > > though the pool limit is never hit.
> > > > > > >
> > > > > > > This patch fixes the issue by rejecting zswap store attempt without
> > > > > > > shrinking the pool when memory.zswap.max is 0.
> > > > > > >
> > > > > > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > > > > > Signed-off-by: Nhat Pham <[email protected]>
> > > > > > > ---
> > > > > > > include/linux/memcontrol.h | 6 +++---
> > > > > > > mm/memcontrol.c | 8 ++++----
> > > > > > > mm/zswap.c | 9 +++++++--
> > > > > > > 3 files changed, 14 insertions(+), 9 deletions(-)
> > > > > > >
> > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > > index 222d7370134c..507bed3a28b0 100644
> > > > > > > --- a/include/linux/memcontrol.h
> > > > > > > +++ b/include/linux/memcontrol.h
> > > > > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > > > > > > #endif /* CONFIG_MEMCG_KMEM */
> > > > > > >
> > > > > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > > #else
> > > > > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > {
> > > > > > > - return true;
> > > > > > > + return 0;
> > > > > > > }
> > > > > > > static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > > > > > > size_t size)
> > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > > index 4b27e245a055..09aad0e6f2ea 100644
> > > > > > > --- a/mm/memcontrol.c
> > > > > > > +++ b/mm/memcontrol.c
> > > > > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > > > > > > * spending cycles on compression when there is already no room left
> > > > > > > * or zswap is disabled altogether somewhere in the hierarchy.
> > > > > > > */
> > > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > {
> > > > > > > struct mem_cgroup *memcg, *original_memcg;
> > > > > > > - bool ret = true;
> > > > > > > + int ret = 0;
> > > > > > >
> > > > > > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > > > > return true;
> > > > > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > if (max == PAGE_COUNTER_MAX)
> > > > > > > continue;
> > > > > > > if (max == 0) {
> > > > > > > - ret = false;
> > > > > > > + ret = -ENODEV;
> > > > > > > break;
> > > > > > > }
> > > > > > >
> > > > > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > > > > > > if (pages < max)
> > > > > > > continue;
> > > > > > > - ret = false;
> > > > > > > + ret = -ENOMEM;
> > > > > > > break;
> > > > > > > }
> > > > > > > mem_cgroup_put(original_memcg);
> > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > > index 59da2a415fbb..7b13dc865438 100644
> > > > > > > --- a/mm/zswap.c
> > > > > > > +++ b/mm/zswap.c
> > > > > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > > > > }
> > > > > > >
> > > > > > > objcg = get_obj_cgroup_from_page(page);
> > > > > > > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > > > - goto shrink;
> > > > > > > + if (objcg) {
> > > > > > > + ret = obj_cgroup_may_zswap(objcg);
> > > > > > > + if (ret == -ENODEV)
> > > > > > > + goto reject;
> > > > > > > + if (ret == -ENOMEM)
> > > > > > > + goto shrink;
> > > > > > > + }
> > > > > >
> > > > > > I wonder if we should just make this:
> > > > > >
> > > > > > if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > > goto reject;
> > > > > >
> > > > > > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > > > > > zswap pool will only help if we happen to writeback a page from the
> > > > > > same memcg that hit its limit. Keep in mind that we will only
> > > > > > writeback one page every time we observe that the limit is hit (even
> > > > > > with Domenico's patch, because zswap_can_accept() should be true).
> > > > > >
> > > > > > On a system with a handful of memcgs,
> > > > > > it seems likely that we wrongfully writeback pages from other memcgs
> > > > > > because of this. Achieving nothing for this memcg, while hurting
> > > > > > others. OTOH, without invoking writeback when the limit is hit, the
> > > > > > memcg will just not be able to use zswap until some pages are
> > > > > > faulted back in or invalidated.
> > > > > >
> > > > > > I am not sure which is better, just thinking out loud.
> > > > >
> > > > > You're absolutely right.
> > > > >
> > > > > Currently the choice is writing back either everybody or nobody,
> > > > > meaning between writeback and cgroup containment. They're both so poor
> > > > > that I can't say I strongly prefer one over the other.
> > > > >
> > > > > However, I have a lame argument in favor of this patch:
> > > > >
> > > > > The last few fixes from Nhat and Domenico around writeback show that
> > > > > few people, if anybody, are actually using writeback. So it might not
> > > > > actually matter that much in practice which way we go with this patch.
> > > > > Per-memcg LRUs will be necessary for it to work right.
> > > > >
> > > > > However, what Nhat is proposing is how we want the behavior down the
> > > > > line. So between two equally poor choices, I figure we might as well
> > > > > go with the one that doesn't require another code change later on.
> > > > >
> > > > > Doesn't that fill you with radiant enthusiasm?
> > > >
> > > > If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
> > > > be in one of two situations:
> > > >
> > > > (a) memory.zswap.max has always been 0, so the LRU for this memcg is
> > > > empty, so we don't really need the special case for memory.zswap.max
> > > > == 0.
> > > >
> > > > (b) memory.zswap.max was reduced to 0 at some point, and some pages
> > > > are already in zswap. In this case, I don't think shrinking the memcg
> > > > is such a bad idea, we would be lazily enforcing the limit.
> > > >
> > > > In that sense I am not sure that this change won't require another
> > > > code change. It feels like special casing memory.zswap.max == 0 is
> > > > only needed now due to the lack of per-memcg LRUs.
> > >
> > > Good point. And I agree down the line we should just always send the
> > > shrinker off optimistically on the cgroup's lru list.
> > >
> > > So I take back my lame argument. But that then still leaves us with
> > > the situation that both choices are equal here, right?
> > >
> > > If so, my vote would be to go with the patch as-is.
> >
> > I *think* it's better to punish the memcg that exceeded its limit by
> > not allowing it to use zswap until its usage goes down, rather than
> > punish random memcgs on the machine because one memcg hit its limit.
> > It also seems to me that on a system with a handful of memcgs, it is
> > statistically more likely for zswap shrinking to writeback a page from
> > the wrong memcg.
>
> Right, but in either case a hybrid zswap + swap setup with cgroup
> isolation is broken anyway. Without it being usable, I'm assuming
> there are no users - maybe that's optimistic of me ;)
>
> However, if you think it's better to just be conservative about taking
> action in general, that's fine by me as well.
Exactly, I just prefer erring on the conservative side.
>
> > The code would also be simpler if obj_cgroup_may_zswap() just returns
> > a boolean and we do not shrink at all if it returns false. If it no
> > longer returns a boolean we should at least rename it.
> >
> > Did you try just not shrinking at all if the memcg limit is hit in
> > your experiments?
> >
> > I don't feel strongly, but my preference would be to just not shrink
> > at all if obj_cgroup_may_zswap() returns false.
>
> Sounds reasonable to me. Basically just replace the goto shrink with
> goto reject for now. Maybe a comment that says "XXX: Writeback/reclaim
> does not work with cgroups yet. Needs a cgroup-aware entry LRU first,
> or we'd push out entries system-wide based on local cgroup limits."
Yeah, exactly -- if Nhat agrees of course.
>
> Nhat, does that sound good to you?
On Tue, May 30, 2023 at 2:05 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, May 30, 2023 at 1:59 PM Johannes Weiner <[email protected]> wrote:
> > Sounds reasonable to me. Basically just replace the goto shrink with
> > goto reject for now. Maybe a comment that says "XXX: Writeback/reclaim
> > does not work with cgroups yet. Needs a cgroup-aware entry LRU first,
> > or we'd push out entries system-wide based on local cgroup limits."
>
> Yeah, exactly -- if Nhat agrees of course.
Sounds good to me! I don't have a strong opinion on this either. I was
just trying to make a minimal behavioral change to fix this (i.e., keep the
shrinking behavior where possible, but definitely reject where it does
not make sense to shrink).
But this works too, and is actually a smaller change code-wise. We
can revisit this piece of code when the per-memcg LRU comes in.
I'll send a new version with the proposed change (and documentation)
shortly. Thanks for the review and suggestion, everyone!
>
> >
> > Nhat, does that sound good to you?
Before storing a page, zswap first checks if the number of stored pages
exceeds the limit specified by memory.zswap.max, for each cgroup in the
hierarchy. If this limit is reached or exceeded, then zswap shrinking is
triggered and short-circuits the store attempt.
However, since the zswap's LRU is not memcg-aware, this can create the
following pathological behavior: the cgroup whose zswap limit is
reached will evict pages from other cgroups continually, without
lowering its own zswap usage. This means the shrinking will continue
until the need for swap ceases or the pool becomes empty.
As a result of this, we observe a disproportionate amount of zswap
writeback and a perpetually small zswap pool in our experiments, even
though the pool limit is never hit.
This patch fixes the issue by rejecting the zswap store attempt without
shrinking the pool when obj_cgroup_may_zswap() returns false.
Fixes: f4840ccfca25 ("zswap: memcg accounting")
Signed-off-by: Nhat Pham <[email protected]>
---
mm/zswap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 59da2a415fbb..cff93643a6ab 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1174,9 +1174,14 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
goto reject;
}
+ /*
+ * XXX: zswap reclaim does not work with cgroups yet. Without a
+ * cgroup-aware entry LRU, we will push out entries system-wide based on
+ * local cgroup limits.
+ */
objcg = get_obj_cgroup_from_page(page);
if (objcg && !obj_cgroup_may_zswap(objcg))
- goto shrink;
+ goto reject;
/* reclaim space if needed */
if (zswap_is_full()) {
--
2.34.1
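The per-hierarchy check that the changelog describes can be sketched in
userspace C. This is a simplified model under stated assumptions, not the
kernel implementation: struct cg, CG_MAX_UNLIMITED and cg_may_zswap() are
hypothetical stand-ins for struct mem_cgroup, PAGE_COUNTER_MAX and
obj_cgroup_may_zswap(). Locking and reference counting are omitted,
zswap_pages is assumed to already include descendants' usage, and the model
also checks the root, which the kernel may treat differently.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PAGE_COUNTER_MAX: no limit configured. */
#define CG_MAX_UNLIMITED ULONG_MAX

/* Hypothetical, simplified model of a memory cgroup. */
struct cg {
	struct cg *parent;          /* NULL for the root */
	unsigned long zswap_max;    /* models memory.zswap.max, in pages */
	unsigned long zswap_pages;  /* pages currently charged to zswap */
};

/*
 * Walk from the given cgroup up to the root. The store is allowed only
 * if no level has zswap disabled (max == 0) or is at or over its limit.
 */
static bool cg_may_zswap(const struct cg *cg)
{
	assert(cg != NULL);
	for (; cg; cg = cg->parent) {
		if (cg->zswap_max == CG_MAX_UNLIMITED)
			continue;       /* no limit at this level */
		if (cg->zswap_max == 0)
			return false;   /* zswap disabled at this level */
		if (cg->zswap_pages >= cg->zswap_max)
			return false;   /* limit reached or exceeded */
	}
	return true;
}
```

Note that a child with room under its own limit is still refused when any
ancestor has a limit of 0. With this version of the patch, a false result
leads straight to reject rather than to shrink.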
On Tue, May 30, 2023 at 3:24 PM Nhat Pham <[email protected]> wrote:
>
> Before storing a page, zswap first checks if the number of stored pages
> exceeds the limit specified by memory.zswap.max, for each cgroup in the
> hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> triggered and short-circuits the store attempt.
>
> However, since the zswap's LRU is not memcg-aware, this can create the
> following pathological behavior: the cgroup whose zswap limit is
> reached will evict pages from other cgroups continually, without
> lowering its own zswap usage. This means the shrinking will continue
> until the need for swap ceases or the pool becomes empty.
This pathological behavior will only happen if the zswap limit is 0.
Otherwise, we will see a different pathological behavior where we
unnecessarily evict X pages from other cgroups before we drive the
memcg back below its limit.
Perhaps we should clarify this?
>
> As a result of this, we observe a disproportionate amount of zswap
> writeback and a perpetually small zswap pool in our experiments, even
> though the pool limit is never hit.
I am guessing this is also related to the case where the limit is 0.
It would be useful to clarify this.
>
> This patch fixes the issue by rejecting zswap store attempt without
> shrinking the pool when obj_cgroup_may_zswap() returns false.
>
> Fixes: f4840ccfca25 ("zswap: memcg accounting")
> Signed-off-by: Nhat Pham <[email protected]>
> ---
> mm/zswap.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..cff93643a6ab 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1174,9 +1174,14 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> goto reject;
> }
>
> + /*
> + * XXX: zswap reclaim does not work with cgroups yet. Without a
> + * cgroup-aware entry LRU, we will push out entries system-wide based on
> + * local cgroup limits.
> + */
> objcg = get_obj_cgroup_from_page(page);
> if (objcg && !obj_cgroup_may_zswap(objcg))
> - goto shrink;
> + goto reject;
>
> /* reclaim space if needed */
> if (zswap_is_full()) {
> --
> 2.34.1
>
With commit log nits above:
Reviewed-by: Yosry Ahmed <[email protected]>
On Tue, 30 May 2023 15:24:40 -0700 Nhat Pham <[email protected]> wrote:
> Before storing a page, zswap first checks if the number of stored pages
> exceeds the limit specified by memory.zswap.max, for each cgroup in the
> hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> triggered and short-circuits the store attempt.
>
> However, since the zswap's LRU is not memcg-aware, this can create the
> following pathological behavior: the cgroup whose zswap limit is
> reached will evict pages from other cgroups continually, without
> lowering its own zswap usage. This means the shrinking will continue
> until the need for swap ceases or the pool becomes empty.
>
> As a result of this, we observe a disproportionate amount of zswap
> writeback and a perpetually small zswap pool in our experiments, even
> though the pool limit is never hit.
That sounds unpleasant. Do you think the patch should be backported
into earlier (-stable) kernels?
> This patch fixes the issue by rejecting zswap store attempt without
> shrinking the pool when obj_cgroup_may_zswap() returns false.
Before storing a page, zswap first checks if the number of stored pages
exceeds the limit specified by memory.zswap.max, for each cgroup in the
hierarchy. If this limit is reached or exceeded, then zswap shrinking is
triggered and short-circuits the store attempt.
However, since the zswap's LRU is not memcg-aware, this can create the
following pathological behavior: the cgroup whose zswap limit is 0 will
evict pages from other cgroups continually, without lowering its own
zswap usage. This means the shrinking will continue until the need for
swap ceases or the pool becomes empty.
As a result of this, we observe a disproportionate amount of zswap
writeback and a perpetually small zswap pool in our experiments, even
though the pool limit is never hit.
More generally, a cgroup might unnecessarily evict pages from other
cgroups before we drive the memcg back below its limit.
This patch fixes the issue by rejecting the zswap store attempt without
shrinking the pool when obj_cgroup_may_zswap() returns false.
Fixes: f4840ccfca25 ("zswap: memcg accounting")
Reviewed-by: Yosry Ahmed <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
mm/zswap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 59da2a415fbb..cff93643a6ab 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1174,9 +1174,14 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
goto reject;
}
+ /*
+ * XXX: zswap reclaim does not work with cgroups yet. Without a
+ * cgroup-aware entry LRU, we will push out entries system-wide based on
+ * local cgroup limits.
+ */
objcg = get_obj_cgroup_from_page(page);
if (objcg && !obj_cgroup_may_zswap(objcg))
- goto shrink;
+ goto reject;
/* reclaim space if needed */
if (zswap_is_full()) {
--
2.34.1
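The pathological case in the changelog can be reproduced with a toy model.
The sketch below is an illustration only: struct toy_pool, toy_shrink() and
toy_store() are hypothetical names, the pool is a plain array standing in
for the global LRU, and each shrink writes back exactly one entry. It
compares the old behavior (shrink when a cgroup is at its limit) with the
patched behavior (reject without shrinking).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_CAP 64    /* arbitrary size for the toy pool */
#define NR_CGROUPS 2

/* Toy zswap pool: one global LRU of entries, each tagged with its owner. */
struct toy_pool {
	int owner[POOL_CAP];              /* owner[0] is the oldest entry */
	size_t len;
	unsigned long limit[NR_CGROUPS];  /* per-cgroup limit, in entries */
	unsigned long usage[NR_CGROUPS];
	unsigned long writebacks;
};

/* Evict the oldest entry, whoever owns it (the LRU is not cgroup-aware). */
static void toy_shrink(struct toy_pool *p)
{
	if (p->len == 0)
		return;
	p->usage[p->owner[0]]--;
	for (size_t i = 1; i < p->len; i++)
		p->owner[i - 1] = p->owner[i];
	p->len--;
	p->writebacks++;
}

/*
 * Attempt a store on behalf of cgroup cg. When the cgroup is at its
 * limit, either shrink the global LRU (old behavior) or just reject
 * (patched behavior). Either way, this attempt fails.
 */
static bool toy_store(struct toy_pool *p, int cg, bool shrink_on_limit)
{
	assert(cg >= 0 && cg < NR_CGROUPS);
	if (p->usage[cg] >= p->limit[cg]) {
		if (shrink_on_limit)
			toy_shrink(p);
		return false;
	}
	if (p->len == POOL_CAP)
		return false;
	p->owner[p->len++] = cg;
	p->usage[cg]++;
	return true;
}
```

In this model, if cgroup 0 has a limit of 0 and cgroup 1 has stored some
entries, every failed store by cgroup 0 with shrink_on_limit set evicts one
of cgroup 1's entries until the pool is empty. With shrink_on_limit clear,
cgroup 1's entries are left alone.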
On Tue, May 30, 2023 at 3:30 PM Andrew Morton <[email protected]> wrote:
>
> On Tue, 30 May 2023 15:24:40 -0700 Nhat Pham <[email protected]> wrote:
>
> > Before storing a page, zswap first checks if the number of stored pages
> > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > triggered and short-circuits the store attempt.
> >
> > However, since the zswap's LRU is not memcg-aware, this can create the
> > following pathological behavior: the cgroup whose zswap limit is
> > reached will evict pages from other cgroups continually, without
> > lowering its own zswap usage. This means the shrinking will continue
> > until the need for swap ceases or the pool becomes empty.
> >
> > As a result of this, we observe a disproportionate amount of zswap
> > writeback and a perpetually small zswap pool in our experiments, even
> > though the pool limit is never hit.
>
> That sounds unpleasant. Do you think the patch should be backported
> into earlier (-stable) kernels?
I think it should be, for any kernel version after f4840ccfca25.
>
> > This patch fixes the issue by rejecting zswap store attempt without
> > shrinking the pool when obj_cgroup_may_zswap() returns false.
>
On Wed, Jun 7, 2023 at 12:09 PM Andrew Morton <[email protected]> wrote:
>
> It's unclear (to me) whether we should proceed with this. Thoughts, please?
This version looks good to me. I added my Reviewed-by on some version,
but I guess it got lost with all the versions and fixlets :)
> Signed-off-by: Nhat Pham <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
It's unclear (to me) whether we should proceed with this. Thoughts, please?
Here's what I presently have in mm-hotfixes-unstable:
From: Nhat Pham <[email protected]>
Subject: zswap: do not shrink if cgroup may not zswap
Date: Tue, 30 May 2023 15:24:40 -0700
Before storing a page, zswap first checks if the number of stored pages
exceeds the limit specified by memory.zswap.max, for each cgroup in the
hierarchy. If this limit is reached or exceeded, then zswap shrinking is
triggered and short-circuits the store attempt.
However, since the zswap's LRU is not memcg-aware, this can create the
following pathological behavior: the cgroup whose zswap limit is 0 will
evict pages from other cgroups continually, without lowering its own zswap
usage. This means the shrinking will continue until the need for swap
ceases or the pool becomes empty.
As a result of this, we observe a disproportionate amount of zswap
writeback and a perpetually small zswap pool in our experiments, even
though the pool limit is never hit.
More generally, a cgroup might unnecessarily evict pages from other
cgroups before we drive the memcg back below its limit.
This patch fixes the issue by rejecting the zswap store attempt without
shrinking the pool when obj_cgroup_may_zswap() returns false.
[[email protected]: fix return of unintialized value]
[[email protected]: s/ENOSPC/ENOMEM/]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f4840ccfca25 ("zswap: memcg accounting")
Signed-off-by: Nhat Pham <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/zswap.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
--- a/mm/zswap.c~zswap-do-not-shrink-if-cgroup-may-not-zswap
+++ a/mm/zswap.c
@@ -1174,9 +1174,16 @@ static int zswap_frontswap_store(unsigne
goto reject;
}
+ /*
+ * XXX: zswap reclaim does not work with cgroups yet. Without a
+ * cgroup-aware entry LRU, we will push out entries system-wide based on
+ * local cgroup limits.
+ */
objcg = get_obj_cgroup_from_page(page);
- if (objcg && !obj_cgroup_may_zswap(objcg))
- goto shrink;
+ if (objcg && !obj_cgroup_may_zswap(objcg)) {
+ ret = -ENOMEM;
+ goto reject;
+ }
/* reclaim space if needed */
if (zswap_is_full()) {
_
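The hunk above assigns ret before jumping to reject. A minimal sketch of
why that matters, assuming the reject label simply returns ret; the
function toy_store_path() and its parameters are hypothetical and do not
mirror the real zswap_frontswap_store() signature:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical store path with a single exit label for failures. */
static int toy_store_path(bool enabled, bool cgroup_ok)
{
	int ret;  /* deliberately not initialized at declaration */

	if (!enabled) {
		ret = -ENODEV;
		goto reject;
	}
	if (!cgroup_ok) {
		/* Without this assignment, reject would return an
		 * indeterminate value. */
		ret = -ENOMEM;
		goto reject;
	}
	return 0;

reject:
	return ret;
}
```

Every path that reaches the label has to set ret first. The old "goto
shrink" path did not need to, so switching the target to reject without
the assignment is what the follow-up fixlet addressed.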
On Wed, Jun 7, 2023 at 12:09 PM Andrew Morton <[email protected]> wrote:
>
> It's unclear (to me) whether we should proceed with this. Thoughts, please?
Apologies for the lack of clarity. Yep, this is the final version
I had in mind too.
On Wed, Jun 07, 2023 at 12:09:39PM -0700, Andrew Morton wrote:
> It's unclear (to me) whether we should proceed with this. Thoughts, please?
> Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>