2024-04-05 05:35:28

by Yosry Ahmed

Subject: [PATCH v2 0/5] zswap same-filled and limit checking cleanups

Miscellaneous cleanups for limit checking and same-filled handling in
the store path. This series was broken out of the "zswap: store
zero-filled pages more efficiently" series [1]. It contains the cleanups
and drops the main functional changes.

[1]https://lore.kernel.org/lkml/[email protected]/

v1 -> v2:
- Dropped the patch to skip limit checking for same-filled pages.
- Added a patch to calculate limits in pages only when they change, as
suggested by Johannes.

Yosry Ahmed (5):
mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full
mm: zswap: calculate limits only when updated
mm: zswap: refactor limit checking from zswap_store()
mm: zswap: move more same-filled pages checks outside of zswap_store()
mm: zswap: remove same_filled module params

mm/zswap.c | 169 ++++++++++++++++++++++++++++++++---------------------
1 file changed, 101 insertions(+), 68 deletions(-)

--
2.44.0.478.gd926399ef9-goog



2024-04-05 05:35:51

by Yosry Ahmed

Subject: [PATCH v2 1/5] mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full

The cleanup code in zswap_store() is not pretty, particularly the
'shrink' label at the bottom that ends up jumping between cleanup
labels.

Instead of having a dedicated label to shrink the pool, just use
zswap_pool_reached_full directly to figure out if the pool needs
shrinking. zswap_pool_reached_full should be true if and only if the
pool needs shrinking.

The only caveat is that the value of zswap_pool_reached_full may be
changed by concurrent zswap_store() calls between checking the limit and
testing zswap_pool_reached_full in the cleanup code. This is fine
because:
- If zswap_pool_reached_full was true during limit checking but became
false by the time the cleanup code runs, then someone else already took
care of shrinking the pool and there is no need to queue the worker.
That would be a good change.
- If zswap_pool_reached_full was false during limit checking but became
true by the time the cleanup code runs, then someone else hit the limit
in the meantime. In this case, both threads will try to queue the
worker, but it never gets queued more than once anyway. Also, calling
queue_work() multiple times when the limit is hit could already happen
today, so this isn't a significant change in any way.

Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/zswap.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index c4979c76d58e3..1cf3ab4b22e64 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1429,12 +1429,12 @@ bool zswap_store(struct folio *folio)
if (cur_pages >= max_pages) {
zswap_pool_limit_hit++;
zswap_pool_reached_full = true;
- goto shrink;
+ goto reject;
}

if (zswap_pool_reached_full) {
if (cur_pages > zswap_accept_thr_pages())
- goto shrink;
+ goto reject;
else
zswap_pool_reached_full = false;
}
@@ -1540,6 +1540,8 @@ bool zswap_store(struct folio *folio)
zswap_entry_cache_free(entry);
reject:
obj_cgroup_put(objcg);
+ if (zswap_pool_reached_full)
+ queue_work(shrink_wq, &zswap_shrink_work);
check_old:
/*
* If the zswap store fails or zswap is disabled, we must invalidate the
@@ -1550,10 +1552,6 @@ bool zswap_store(struct folio *folio)
if (entry)
zswap_entry_free(entry);
return false;
-
-shrink:
- queue_work(shrink_wq, &zswap_shrink_work);
- goto reject;
}

bool zswap_load(struct folio *folio)
--
2.44.0.478.gd926399ef9-goog


2024-04-05 05:35:53

by Yosry Ahmed

Subject: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

Currently, we calculate the zswap global limit, and potentially the
acceptance threshold, in pages in the zswap store path. This is
unnecessary because the values rarely change.

Instead, precalculate them when the module parameters are updated, which
should be rare. Since we are adding custom handlers for setting the
percentages now, add proper validation that they are <= 100.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 70 insertions(+), 16 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 1cf3ab4b22e64..832e3f56232f0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -116,12 +116,29 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);

/* The maximum percentage of memory that the compressed pool can occupy */
static unsigned int zswap_max_pool_percent = 20;
-module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
+static unsigned long zswap_max_pages;
+
+static int zswap_max_pool_param_set(const char *,
+ const struct kernel_param *);
+static const struct kernel_param_ops zswap_max_pool_param_ops = {
+ .set = zswap_max_pool_param_set,
+ .get = param_get_uint,
+};
+module_param_cb(max_pool_percent, &zswap_max_pool_param_ops,
+ &zswap_max_pool_percent, 0644);

/* The threshold for accepting new pages after the max_pool_percent was hit */
static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
-module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
- uint, 0644);
+unsigned long zswap_accept_thr_pages;
+
+static int zswap_accept_thr_param_set(const char *,
+ const struct kernel_param *);
+static const struct kernel_param_ops zswap_accept_thr_param_ops = {
+ .set = zswap_accept_thr_param_set,
+ .get = param_get_uint,
+};
+module_param_cb(accept_threshold_percent, &zswap_accept_thr_param_ops,
+ &zswap_accept_thr_percent, 0644);

/*
* Enable/disable handling same-value filled pages (enabled by default).
@@ -490,14 +507,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
return NULL;
}

-static unsigned long zswap_max_pages(void)
+static void zswap_update_max_pages(void)
{
- return totalram_pages() * zswap_max_pool_percent / 100;
+ WRITE_ONCE(zswap_max_pages,
+ totalram_pages() * zswap_max_pool_percent / 100);
}

-static unsigned long zswap_accept_thr_pages(void)
+static void zswap_update_accept_thr_pages(void)
{
- return zswap_max_pages() * zswap_accept_thr_percent / 100;
+ WRITE_ONCE(zswap_accept_thr_pages,
+ READ_ONCE(zswap_max_pages) * zswap_accept_thr_percent / 100);
}

unsigned long zswap_total_pages(void)
@@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val,
return ret;
}

+static int __zswap_percent_param_set(const char *val,
+ const struct kernel_param *kp)
+{
+ unsigned int n;
+ int ret;
+
+ ret = kstrtouint(val, 10, &n);
+ if (ret || n > 100)
+ return -EINVAL;
+
+ return param_set_uint(val, kp);
+}
+
+static int zswap_max_pool_param_set(const char *val,
+ const struct kernel_param *kp)
+{
+ int err = __zswap_percent_param_set(val, kp);
+
+ if (!err) {
+ zswap_update_max_pages();
+ zswap_update_accept_thr_pages();
+ }
+
+ return err;
+}
+
+static int zswap_accept_thr_param_set(const char *val,
+ const struct kernel_param *kp)
+{
+ int err = __zswap_percent_param_set(val, kp);
+
+ if (!err)
+ zswap_update_accept_thr_pages();
+
+ return err;
+}
+
/*********************************
* lru functions
**********************************/
@@ -1305,10 +1361,6 @@ static void shrink_worker(struct work_struct *w)
{
struct mem_cgroup *memcg;
int ret, failures = 0;
- unsigned long thr;
-
- /* Reclaim down to the accept threshold */
- thr = zswap_accept_thr_pages();

/* global reclaim will select cgroup in a round-robin fashion. */
do {
@@ -1358,7 +1410,8 @@ static void shrink_worker(struct work_struct *w)
break;
resched:
cond_resched();
- } while (zswap_total_pages() > thr);
+ /* Reclaim down to the accept threshold */
+ } while (zswap_total_pages() > READ_ONCE(zswap_accept_thr_pages));
}

static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
@@ -1424,16 +1477,14 @@ bool zswap_store(struct folio *folio)

/* Check global limits */
cur_pages = zswap_total_pages();
- max_pages = zswap_max_pages();
-
- if (cur_pages >= max_pages) {
+ if (cur_pages >= READ_ONCE(zswap_max_pages)) {
zswap_pool_limit_hit++;
zswap_pool_reached_full = true;
goto reject;
}

if (zswap_pool_reached_full) {
- if (cur_pages > zswap_accept_thr_pages())
+ if (cur_pages > READ_ONCE(zswap_accept_thr_pages))
goto reject;
else
zswap_pool_reached_full = false;
@@ -1734,6 +1785,9 @@ static int zswap_setup(void)
zswap_enabled = false;
}

+ zswap_update_max_pages();
+ zswap_update_accept_thr_pages();
+
if (zswap_debugfs_init())
pr_warn("debugfs initialization failed\n");
zswap_init_state = ZSWAP_INIT_SUCCEED;
--
2.44.0.478.gd926399ef9-goog


2024-04-05 05:36:04

by Yosry Ahmed

Subject: [PATCH v2 3/5] mm: zswap: refactor limit checking from zswap_store()

Refactor limit and acceptance threshold checking outside of
zswap_store(). This code will be moved around in a following patch, so
it is cleaner to move a single function call than an open-coded block.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 832e3f56232f0..ab3cd43cdfc9d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1444,6 +1444,20 @@ static void zswap_fill_page(void *ptr, unsigned long value)
memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
}

+static bool zswap_check_full(void)
+{
+ unsigned long cur_pages = zswap_total_pages();
+
+ if (cur_pages >= READ_ONCE(zswap_max_pages)) {
+ zswap_pool_limit_hit++;
+ zswap_pool_reached_full = true;
+ } else if (zswap_pool_reached_full &&
+ cur_pages <= READ_ONCE(zswap_accept_thr_pages)) {
+ zswap_pool_reached_full = false;
+ }
+ return zswap_pool_reached_full;
+}
+
bool zswap_store(struct folio *folio)
{
swp_entry_t swp = folio->swap;
@@ -1452,7 +1466,6 @@ bool zswap_store(struct folio *folio)
struct zswap_entry *entry, *old;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
- unsigned long max_pages, cur_pages;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1475,20 +1488,8 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}

- /* Check global limits */
- cur_pages = zswap_total_pages();
- if (cur_pages >= READ_ONCE(zswap_max_pages)) {
- zswap_pool_limit_hit++;
- zswap_pool_reached_full = true;
+ if (zswap_check_full())
goto reject;
- }
-
- if (zswap_pool_reached_full) {
- if (cur_pages > READ_ONCE(zswap_accept_thr_pages))
- goto reject;
- else
- zswap_pool_reached_full = false;
- }

/* allocate entry */
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
--
2.44.0.478.gd926399ef9-goog


2024-04-05 05:36:26

by Yosry Ahmed

Subject: [PATCH v2 4/5] mm: zswap: move more same-filled pages checks outside of zswap_store()

Currently, zswap_store() checks zswap_same_filled_pages_enabled, kmaps
the folio, then calls zswap_is_page_same_filled() to check the folio
contents. Move this logic into zswap_is_page_same_filled() as well (and
rename it to use 'folio' while we are at it).

This makes zswap_store() cleaner and keeps future changes to that logic
contained within the helper. Also rename the insert_entry label to
store_entry to match xa_store().

No functional change intended.

Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/zswap.c | 39 +++++++++++++++++++--------------------
1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index ab3cd43cdfc9d..13869d18c13bd 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1414,26 +1414,32 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > READ_ONCE(zswap_accept_thr_pages));
}

-static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
+static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
{
unsigned long *page;
unsigned long val;
unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
+ bool ret = false;

- page = (unsigned long *)ptr;
+ if (!zswap_same_filled_pages_enabled)
+ return false;
+
+ page = kmap_local_folio(folio, 0);
val = page[0];

if (val != page[last_pos])
- return 0;
+ goto out;

for (pos = 1; pos < last_pos; pos++) {
if (val != page[pos])
- return 0;
+ goto out;
}

*value = val;
-
- return 1;
+ ret = true;
+out:
+ kunmap_local(page);
+ return ret;
}

static void zswap_fill_page(void *ptr, unsigned long value)
@@ -1466,6 +1472,7 @@ bool zswap_store(struct folio *folio)
struct zswap_entry *entry, *old;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
+ unsigned long value;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1498,19 +1505,11 @@ bool zswap_store(struct folio *folio)
goto reject;
}

- if (zswap_same_filled_pages_enabled) {
- unsigned long value;
- u8 *src;
-
- src = kmap_local_folio(folio, 0);
- if (zswap_is_page_same_filled(src, &value)) {
- kunmap_local(src);
- entry->length = 0;
- entry->value = value;
- atomic_inc(&zswap_same_filled_pages);
- goto insert_entry;
- }
- kunmap_local(src);
+ if (zswap_is_folio_same_filled(folio, &value)) {
+ entry->length = 0;
+ entry->value = value;
+ atomic_inc(&zswap_same_filled_pages);
+ goto store_entry;
}

if (!zswap_non_same_filled_pages_enabled)
@@ -1533,7 +1532,7 @@ bool zswap_store(struct folio *folio)
if (!zswap_compress(folio, entry))
goto put_pool;

-insert_entry:
+store_entry:
entry->swpentry = swp;
entry->objcg = objcg;

--
2.44.0.478.gd926399ef9-goog


2024-04-05 05:38:54

by Yosry Ahmed

Subject: [PATCH v2 5/5] mm: zswap: remove same_filled module params

These knobs offer more fine-grained control to userspace than needed and
directly expose/influence kernel implementation; remove them.

For disabling same_filled handling, there is no logical reason to refuse
storing same-filled pages more efficiently and opt for compression
instead. The cost of scanning pages for patterns may be an argument, but
the page contents will be read into the CPU cache during compression
anyway. Also, removing the same_filled handling code does not
significantly move the needle in terms of performance [1].

For disabling non_same_filled handling, the knob was added when the
compressed pages in zswap were not being properly charged to memcgs, as
workloads could escape the accounting with compression [2]. This is no
longer the case after commit f4840ccfca25 ("zswap: memcg accounting"),
and using zswap without compression does not make much sense.

[1]https://lore.kernel.org/lkml/CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com/
[2]https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/zswap.c | 19 -------------------
1 file changed, 19 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 13869d18c13bd..b738435215218 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -140,19 +140,6 @@ static const struct kernel_param_ops zswap_accept_thr_param_ops = {
module_param_cb(accept_threshold_percent, &zswap_accept_thr_param_ops,
&zswap_accept_thr_percent, 0644);

-/*
- * Enable/disable handling same-value filled pages (enabled by default).
- * If disabled every page is considered non-same-value filled.
- */
-static bool zswap_same_filled_pages_enabled = true;
-module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
- bool, 0644);
-
-/* Enable/disable handling non-same-value filled pages (enabled by default) */
-static bool zswap_non_same_filled_pages_enabled = true;
-module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
- bool, 0644);
-
/* Number of zpools in zswap_pool (empirically determined for scalability) */
#define ZSWAP_NR_ZPOOLS 32

@@ -1421,9 +1408,6 @@ static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value
unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
bool ret = false;

- if (!zswap_same_filled_pages_enabled)
- return false;
-
page = kmap_local_folio(folio, 0);
val = page[0];

@@ -1512,9 +1496,6 @@ bool zswap_store(struct folio *folio)
goto store_entry;
}

- if (!zswap_non_same_filled_pages_enabled)
- goto freepage;
-
/* if entry is successfully added, it keeps the reference */
entry->pool = zswap_pool_current_get();
if (!entry->pool)
--
2.44.0.478.gd926399ef9-goog


2024-04-05 15:33:15

by Johannes Weiner

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote:
> Currently, we calculate the zswap global limit, and potentially the
> acceptance threshold in the zswap, in pages in the zswap store path.
> This is unnecessary because the values rarely change.
>
> Instead, precalculate the them when the module parameters are updated,
> which should be rare. Since we are adding custom handlers for setting
> the percentages now, add proper validation that they are <= 100.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Yosry Ahmed <[email protected]>

Nice! Getting that stuff out of the hotpath!

Two comments below:

> @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val,
> return ret;
> }
>
> +static int __zswap_percent_param_set(const char *val,
> + const struct kernel_param *kp)
> +{
> + unsigned int n;
> + int ret;
> +
> + ret = kstrtouint(val, 10, &n);
> + if (ret || n > 100)
> + return -EINVAL;
> +
> + return param_set_uint(val, kp);
> +}
> +
> +static int zswap_max_pool_param_set(const char *val,
> + const struct kernel_param *kp)
> +{
> + int err = __zswap_percent_param_set(val, kp);
> +
> + if (!err) {
> + zswap_update_max_pages();
> + zswap_update_accept_thr_pages();
> + }
> +
> + return err;
> +}
> +
> +static int zswap_accept_thr_param_set(const char *val,
> + const struct kernel_param *kp)
> +{
> + int err = __zswap_percent_param_set(val, kp);
> +
> + if (!err)
> + zswap_update_accept_thr_pages();
> +
> + return err;
> +}

I think you can keep this simple and just always update both if
anything changes for whatever reason. It's an extremely rare event
after all. That should cut it from 3 functions to 1.

Note that totalram_pages can also change during memory onlining and
offlining. For that you need a memory notifier that also calls that
refresh function. It's simple enough, though, check out the code
around register_memory_notifier() in drivers/xen/balloon.c.
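
Roughly what I have in mind, as a sketch only (the zswap helper and
notifier names below are made up; register_memory_notifier() and the
MEM_ONLINE/MEM_OFFLINE actions are the existing hotplug API):

/* Hypothetical: recompute both cached limits in one place. */
static void zswap_update_limits(void)
{
	unsigned long max_pages;

	max_pages = totalram_pages() * zswap_max_pool_percent / 100;
	WRITE_ONCE(zswap_max_pages, max_pages);
	WRITE_ONCE(zswap_accept_thr_pages,
		   max_pages * zswap_accept_thr_percent / 100);
}

/* Refresh the cached limits whenever memory goes online or offline. */
static int zswap_memory_callback(struct notifier_block *nb,
				 unsigned long action, void *arg)
{
	switch (action) {
	case MEM_ONLINE:
	case MEM_OFFLINE:
		zswap_update_limits();
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block zswap_memory_nb = {
	.notifier_call = zswap_memory_callback,
};

/* In zswap_setup(): register_memory_notifier(&zswap_memory_nb); */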

2024-04-05 18:44:15

by Yosry Ahmed

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On Fri, Apr 5, 2024 at 8:26 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote:
> > Currently, we calculate the zswap global limit, and potentially the
> > acceptance threshold in the zswap, in pages in the zswap store path.
> > This is unnecessary because the values rarely change.
> >
> > Instead, precalculate the them when the module parameters are updated,
> > which should be rare. Since we are adding custom handlers for setting
> > the percentages now, add proper validation that they are <= 100.
> >
> > Suggested-by: Johannes Weiner <[email protected]>
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> Nice! Getting that stuff out of the hotpath!
>
> Two comments below:
>
> > @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val,
> > return ret;
> > }
> >
> > +static int __zswap_percent_param_set(const char *val,
> > + const struct kernel_param *kp)
> > +{
> > + unsigned int n;
> > + int ret;
> > +
> > + ret = kstrtouint(val, 10, &n);
> > + if (ret || n > 100)
> > + return -EINVAL;
> > +
> > + return param_set_uint(val, kp);
> > +}
> > +
> > +static int zswap_max_pool_param_set(const char *val,
> > + const struct kernel_param *kp)
> > +{
> > + int err = __zswap_percent_param_set(val, kp);
> > +
> > + if (!err) {
> > + zswap_update_max_pages();
> > + zswap_update_accept_thr_pages();
> > + }
> > +
> > + return err;
> > +}
> > +
> > +static int zswap_accept_thr_param_set(const char *val,
> > + const struct kernel_param *kp)
> > +{
> > + int err = __zswap_percent_param_set(val, kp);
> > +
> > + if (!err)
> > + zswap_update_accept_thr_pages();
> > +
> > + return err;
> > +}
>
> I think you can keep this simple and just always update both if
> anything changes for whatever reason. It's an extremely rare event
> after all. That should cut it from 3 functions to 1.

Yeah we can also combine both update functions in this case.

>
> Note that totalram_pages can also change during memory onlining and
> offlining. For that you need a memory notifier that also calls that
> refresh function. It's simple enough, though, check out the code
> around register_memory_notifier() in drivers/xen/balloon.c.

Good point, I completely missed this. It seems like totalram_pages can
actually change from contexts other than memory hotplug as well,
specifically through adjust_managed_page_count(), and mostly through
ballooning drivers. Do we trigger the notifiers in this case? I can't
find such logic.

It seems like in this case the actual amount of memory does not
change, but the drivers take it away from the system. It makes some
sense to me that the zswap limits do not change in this case,
especially that userspace just sets those limits as a percentage of
total memory. I wouldn't expect userspace to take ballooning into
account here.

However, it would be a behavioral change from today where we always
rely on totalram_pages(). Also, if userspace happens to change the
limit when a driver is holding a big chunk of memory away from
totalram_pages, then the limit would be constantly underestimated.

I do not have enough familiarity with memory ballooning to know for
sure if this is okay. How much memory can memory ballooning add/remove
from totalram_pages(), and is it usually transient or does it stick
around for a while?

Also CCing David here.

2024-04-05 19:57:54

by Nhat Pham

Subject: Re: [PATCH v2 3/5] mm: zswap: refactor limit checking from zswap_store()

On Thu, Apr 4, 2024 at 10:35 PM Yosry Ahmed <[email protected]> wrote:
>
> Refactor limit and acceptance threshold checking outside of
> zswap_store(). This code will be moved around in a following patch, so
> it would be cleaner to move a function call around.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

LGTM.
Reviewed-by: Nhat Pham <[email protected]>

2024-04-05 20:00:13

by Nhat Pham

Subject: Re: [PATCH v2 5/5] mm: zswap: remove same_filled module params

On Thu, Apr 4, 2024 at 10:35 PM Yosry Ahmed <[email protected]> wrote:
>
> These knobs offer more fine-grained control to userspace than needed and
> directly expose/influence kernel implementation; remove them.
>
> For disabling same_filled handling, there is no logical reason to refuse
> storing same-filled pages more efficiently and opt for compression.
> Scanning pages for patterns may be an argument, but the page contents
> will be read into the CPU cache anyway during compression. Also,
> removing the same_filled handling code does not move the needle
> significantly in terms of performance anyway [1].
>
> For disabling non_same_filled handling, it was added when the compressed
> pages in zswap were not being properly charged to memcgs, as workloads
> could escape the accounting with compression [2]. This is no longer the
> case after commit f4840ccfca25 ("zswap: memcg accounting"), and using
> zswap without compression does not make much sense.
>
> [1]https://lore.kernel.org/lkml/CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com/
> [2]https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

I see less code to maintain, I like :)
The code LGTM, and we already discussed the justifications in the
original series, so:

Reviewed-by: Nhat Pham <[email protected]>

2024-04-05 20:24:15

by kernel test robot

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

Hi Yosry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20240405]
[cannot apply to linus/master v6.9-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-zswap-always-shrink-in-zswap_store-if-zswap_pool_reached_full/20240405-133623
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240405053510.1948982-3-yosryahmed%40google.com
patch subject: [PATCH v2 2/5] mm: zswap: calculate limits only when updated
config: i386-randconfig-061-20240405 (https://download.01.org/0day-ci/archive/20240406/[email protected]/config)
compiler: gcc-10 (Ubuntu 10.5.0-1ubuntu1) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240406/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> mm/zswap.c:132:15: sparse: sparse: symbol 'zswap_accept_thr_pages' was not declared. Should it be static?
mm/zswap.c:1188:23: sparse: sparse: context imbalance in 'shrink_memcg_cb' - unexpected unlock

vim +/zswap_accept_thr_pages +132 mm/zswap.c

120
121 static int zswap_max_pool_param_set(const char *,
122 const struct kernel_param *);
123 static const struct kernel_param_ops zswap_max_pool_param_ops = {
124 .set = zswap_max_pool_param_set,
125 .get = param_get_uint,
126 };
127 module_param_cb(max_pool_percent, &zswap_max_pool_param_ops,
128 &zswap_max_pool_percent, 0644);
129
130 /* The threshold for accepting new pages after the max_pool_percent was hit */
131 static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
> 132 unsigned long zswap_accept_thr_pages;
133

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-04-05 20:30:30

by Yosry Ahmed

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On Fri, Apr 5, 2024 at 1:24 PM kernel test robot <[email protected]> wrote:
>
> Hi Yosry,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
> [also build test WARNING on next-20240405]
> [cannot apply to linus/master v6.9-rc2]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-zswap-always-shrink-in-zswap_store-if-zswap_pool_reached_full/20240405-133623
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20240405053510.1948982-3-yosryahmed%40google.com
> patch subject: [PATCH v2 2/5] mm: zswap: calculate limits only when updated
> config: i386-randconfig-061-20240405 (https://download.01.org/0day-ci/archive/20240406/[email protected]/config)
> compiler: gcc-10 (Ubuntu 10.5.0-1ubuntu1) 10.5.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240406/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> sparse warnings: (new ones prefixed by >>)
> >> mm/zswap.c:132:15: sparse: sparse: symbol 'zswap_accept_thr_pages' was not declared. Should it be static?

Yes, I already have it static for v3, which will be sent after the
discussion on this patch settles.

2024-04-08 07:58:13

by David Hildenbrand

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On 05.04.24 20:43, Yosry Ahmed wrote:
> On Fri, Apr 5, 2024 at 8:26 AM Johannes Weiner <[email protected]> wrote:
>>
>> On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote:
>>> Currently, we calculate the zswap global limit, and potentially the
>>> acceptance threshold in the zswap, in pages in the zswap store path.
>>> This is unnecessary because the values rarely change.
>>>
>>> Instead, precalculate the them when the module parameters are updated,
>>> which should be rare. Since we are adding custom handlers for setting
>>> the percentages now, add proper validation that they are <= 100.
>>>
>>> Suggested-by: Johannes Weiner <[email protected]>
>>> Signed-off-by: Yosry Ahmed <[email protected]>
>>
>> Nice! Getting that stuff out of the hotpath!
>>
>> Two comments below:
>>
>>> @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val,
>>> return ret;
>>> }
>>>
>>> +static int __zswap_percent_param_set(const char *val,
>>> + const struct kernel_param *kp)
>>> +{
>>> + unsigned int n;
>>> + int ret;
>>> +
>>> + ret = kstrtouint(val, 10, &n);
>>> + if (ret || n > 100)
>>> + return -EINVAL;
>>> +
>>> + return param_set_uint(val, kp);
>>> +}
>>> +
>>> +static int zswap_max_pool_param_set(const char *val,
>>> + const struct kernel_param *kp)
>>> +{
>>> + int err = __zswap_percent_param_set(val, kp);
>>> +
>>> + if (!err) {
>>> + zswap_update_max_pages();
>>> + zswap_update_accept_thr_pages();
>>> + }
>>> +
>>> + return err;
>>> +}
>>> +
>>> +static int zswap_accept_thr_param_set(const char *val,
>>> + const struct kernel_param *kp)
>>> +{
>>> + int err = __zswap_percent_param_set(val, kp);
>>> +
>>> + if (!err)
>>> + zswap_update_accept_thr_pages();
>>> +
>>> + return err;
>>> +}
>>
>> I think you can keep this simple and just always update both if
>> anything changes for whatever reason. It's an extremely rare event
>> after all. That should cut it from 3 functions to 1.
>
> Yeah we can also combine both update functions in this case.
>
>>
>> Note that totalram_pages can also change during memory onlining and
>> offlining. For that you need a memory notifier that also calls that
>> refresh function. It's simple enough, though, check out the code
>> around register_memory_notifier() in drivers/xen/balloon.c.
>
> Good point, I completely missed this. It seems like totalram_pages can
> actually change from contexts other than memory hotplug as well,
> specifically through adjust_managed_page_count(), and mostly through
> ballooning drivers. Do we trigger the notifiers in this case? I can't
> find such logic.

Things like virtio-balloon never online/offline memory and would never
call it.

Things like virtio-mem sometimes will online/offline memory and would
sometimes call it (but not always). Things like the Hyper-V balloon and
XEN balloon never offline memory, and would only call it when onlining
memory.

>
> It seems like in this case the actual amount of memory does not
> change, but the drivers take it away from the system. It makes some
> sense to me that the zswap limits do not change in this case,
> especially that userspace just sets those limits as a percentage of
> total memory. I wouldn't expect userspace to take ballooning into
> account here.
>

For virtio-mem, it does change ("actual amount of memory"). For
virtio-balloon, it's tricky. When using virtio-balloon for VM resizing,
it would similarly change. When using it for pure memory overcommit, it
depends on whatever the policy in the hypervisor is ... might be that
under memory pressure that memory is simply given back to the VM.

> However, it would be a behavioral change from today where we always
> rely on totalram_pages(). Also, if userspace happens to change the
> limit when a driver is holding a big chunk of memory away from
> totalram_pages, then the limit would be constantly underestimated.
>
> I do not have enough familiarity with memory ballooning to know for
> sure if this is okay. How much memory can memory ballooning add/remove
> from totalram_pages(), and is it usually transient or does it stick
> around for a while?
>
> Also CCing David here.

It can be a lot. Especially with the Hyper-V balloon (but also on
environments where other forms of memory hotunplug are not supported),
memory ballooning can be used to unplug memory. So that memory can be
gone for good and it can end up being quite a lot of memory.

The clean thing to do would be to have a way for other subsystems to get
notified on any totalram_pages() changes, so they can adjust accordingly.

--
Cheers,

David / dhildenb


2024-04-08 08:07:57

by Yosry Ahmed

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

[..]
> >> Note that totalram_pages can also change during memory onlining and
> >> offlining. For that you need a memory notifier that also calls that
> >> refresh function. It's simple enough, though, check out the code
> >> around register_memory_notifier() in drivers/xen/balloon.c.
> >
> > Good point, I completely missed this. It seems like totalram_pages can
> > actually change from contexts other than memory hotplug as well,
> > specifically through adjust_managed_page_count(), and mostly through
> > ballooning drivers. Do we trigger the notifiers in this case? I can't
> > find such logic.
>
> Things like virtio-balloon never online/offline memory and would never
> call it.

I see calls to adjust_managed_page_count() from
drivers/virtio/virtio_balloon.c, but I don't understand enough to know
what they are doing.

>
> Things like virtio-mem sometimes will online/offline memory and would
> sometimes call it (but not always). Things like the Hyper-V balloon and
> XEN balloon never offline memory, and would only call it when onlining
> memory.

Thanks for the details.

>
> >
> > It seems like in this case the actual amount of memory does not
> > change, but the drivers take it away from the system. It makes some
> > sense to me that the zswap limits do not change in this case,
> > especially that userspace just sets those limits as a percentage of
> > total memory. I wouldn't expect userspace to take ballooning into
> > account here.
> >
>
> For virtio-mem, it does change ("actual amount of memory"). For
> virtio-balloon, it's tricky. When using virtio-balloon for VM resizing,
> it would similarly change. When using it for pure memory overcommit, it
> depends on whatever the policy in the hypervisor is ... might be that
> under memory pressure that memory is simply given back to the VM.

That's good to know, it seems like we need to take these into account,
and not just because the users may happen to change zswap limits while
they are onlining/offlining memory.

>
> > However, it would be a behavioral change from today where we always
> > rely on totalram_pages(). Also, if userspace happens to change the
> > limit when a driver is holding a big chunk of memory away from
> > totalram_pages, then the limit would be constantly underestimated.
> >
> > I do not have enough familiarity with memory ballooning to know for
> > sure if this is okay. How much memory can memory ballooning add/remove
> > from totalram_pages(), and is it usually transient or does it stick
> > around for a while?
> >
> > Also CCing David here.
>
> It can be a lot. Especially with the Hyper-V balloon (but also on
> environments where other forms of memory hotunplug are not supported),
> memory ballooning can be used to unplug memory. So that memory can be
> gone for good and it can end up being quite a lot of memory.
>
> The clean thing to do would be to have a way for other subsystems to get
> notified on any totalram_pages() changes, so they can adjust accordingly.

Yeah I agree. I imagined that register_memory_notifier() would be the
way to do that. Apparently it is only effective for memory hotplug.
From your description, it sounds like the ballooning drivers may have
a very similar effect to memory hotplug, but I don't see
memory_notify() being called in these paths.

Do we need a separate notifier chain for totalram_pages() updates?
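
Something like the below is what I am imagining, purely for
illustration (none of these symbols exist today; it just reuses the
generic blocking notifier infrastructure):

/* Hypothetical: a dedicated chain for totalram_pages() updates. */
static BLOCKING_NOTIFIER_HEAD(totalram_chain);

int register_totalram_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&totalram_chain, nb);
}

/* Anything adjusting totalram (hotplug, balloon drivers) would call: */
void totalram_pages_changed(long delta_pages)
{
	blocking_notifier_call_chain(&totalram_chain, 0, &delta_pages);
}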

2024-04-09 19:37:58

by David Hildenbrand

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On 08.04.24 10:07, Yosry Ahmed wrote:
> [..]
>>>> Note that totalram_pages can also change during memory onlining and
>>>> offlining. For that you need a memory notifier that also calls that
>>>> refresh function. It's simple enough, though, check out the code
>>>> around register_memory_notifier() in drivers/xen/balloon.c.
>>>
>>> Good point, I completely missed this. It seems like totalram_pages can
>>> actually change from contexts other than memory hotplug as well,
>>> specifically through adjust_managed_page_count(), and mostly through
>>> ballooning drivers. Do we trigger the notifiers in this case? I can't
>>> find such logic.
>>
>> Things like virtio-balloon never online/offline memory and would never
>> call it.
>
> I see calls to adjust_managed_page_count() from
> drivers/virtio/virtio_balloon.c, but I don't understand enough to know
> what they are doing.

Essentially fake removing/adding pages. :)

>
>>
>> Things like virtio-mem sometimes will online/offline memory and would
>> sometimes call it (but not always). Things like the Hyper-V balloon and
>> XEN balloon never offline memory, and would only call it when onlining
>> memory.
>
> Thanks for the details.
>
>>
>>>
>>> It seems like in this case the actual amount of memory does not
>>> change, but the drivers take it away from the system. It makes some
>>> sense to me that the zswap limits do not change in this case,
>>> especially that userspace just sets those limits as a percentage of
>>> total memory. I wouldn't expect userspace to take ballooning into
>>> account here.
>>>
>>
>> For virtio-mem, it does change ("actual amount of memory"). For
>> virtio-balloon, it's tricky. When using virtio-balloon for VM resizing,
>> it would similarly change. When using it for pure memory overcommit, it
>> depends on whatever the policy in the hypervisor is ... might be that
>> under memory pressure that memory is simply given back to the VM.
>
> That's good to know, it seems like we need to take these into account,
> and not just because the users may happen to change zswap limits while
> they are onlining/offlining memory.

Yes. Likely other parts of the system would want to know when available
memory changes (especially, if it's quite significant).

>
>>
>>> However, it would be a behavioral change from today where we always
>>> rely on totalram_pages(). Also, if userspace happens to change the
>>> limit when a driver is holding a big chunk of memory away from
>>> totalram_pages, then the limit would be constantly underestimated.
>>>
>>> I do not have enough familiarity with memory ballooning to know for
>>> sure if this is okay. How much memory can memory ballooning add/remove
>>> from totalram_pages(), and is it usually transient or does it stick
>>> around for a while?
>>>
>>> Also CCing David here.
>>
>> It can be a lot. Especially with the Hyper-V balloon (but also on
>> environments where other forms of memory hotunplug are not supported),
>> memory ballooning can be used to unplug memory. So that memory can be
>> gone for good and it can end up being quite a lot of memory.
>>
>> The clean thing to do would be to have a way for other subsystems to get
>> notified on any totalram_pages() changes, so they can adjust accordingly.
>
> Yeah I agree. I imagined that register_memory_notifier() would be the
> way to do that. Apparently it is only effective for memory hotplug.

Yes. Right now it's always called under the memory hotplug lock. We
could reuse it for this fake hotplug/unplug as well, but we'd have to
clarify how exactly we expect this to interact with the existing
notifications+notifiers.

> From your description, it sounds like the ballooning drivers may have
> a very similar effect to memory hotplug, but I don't see
> memory_notify() being called in these paths.

Right, the existing notifications (MEM_ONLINE etc.) are called when
memory blocks change their state. That's not what the fake
hotplug/unplug does, that operates on much smaller ranges.

>
> Do we need a separate notifier chain for totalram_pages() updates?

Good question. I actually might have the requirement to notify some arch
code (s390x) from virtio-mem when fake adding/removing memory, and
already wondered how to best wire that up.

Maybe we can squeeze that into the existing notifier chain, but needs a
bit of thought.

--
Cheers,

David / dhildenb


2024-04-10 00:53:46

by Yosry Ahmed

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

[..]
> > Do we need a separate notifier chain for totalram_pages() updates?
>
> Good question. I actually might have the requirement to notify some arch
> code (s390x) from virtio-mem when fake adding/removing memory, and
> already wondered how to best wire that up.
>
> Maybe we can squeeze that into the existing notifier chain, but needs a
> bit of thought.

Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE,
MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE,
MEM_OFFLINE, etc).

New actions mean minimal impact to existing notifiers, but it may make
more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions
that mean "memory increased" and "memory decreased".

I suppose we can add new actions and then separately (and probably
incrementally) audit existing notifiers to check if they want to
handle the new actions as well.

Another consideration is that apparently some ballooning drivers also
register notifiers, so we need to make sure there is no possibility of
deadlock/recursion.

2024-04-10 02:31:22

by Chengming Zhou

Subject: Re: [PATCH v2 5/5] mm: zswap: remove same_filled module params

On 2024/4/5 13:35, Yosry Ahmed wrote:
> These knobs offer more fine-grained control to userspace than needed and
> directly expose/influence kernel implementation; remove them.
>
> For disabling same_filled handling, there is no logical reason to refuse
> storing same-filled pages more efficiently and opt for compression.
> Scanning pages for patterns may be an argument, but the page contents
> will be read into the CPU cache anyway during compression. Also,
> removing the same_filled handling code does not move the needle
> significantly in terms of performance anyway [1].
>
> For disabling non_same_filled handling, it was added when the compressed
> pages in zswap were not being properly charged to memcgs, as workloads
> could escape the accounting with compression [2]. This is no longer the
> case after commit f4840ccfca25 ("zswap: memcg accounting"), and using
> zswap without compression does not make much sense.
>
> [1]https://lore.kernel.org/lkml/CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com/
> [2]https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

Nice cleanup!

Reviewed-by: Chengming Zhou <[email protected]>

Thanks.

> ---
> mm/zswap.c | 19 -------------------
> 1 file changed, 19 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 13869d18c13bd..b738435215218 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -140,19 +140,6 @@ static const struct kernel_param_ops zswap_accept_thr_param_ops = {
> module_param_cb(accept_threshold_percent, &zswap_accept_thr_param_ops,
> &zswap_accept_thr_percent, 0644);
>
> -/*
> - * Enable/disable handling same-value filled pages (enabled by default).
> - * If disabled every page is considered non-same-value filled.
> - */
> -static bool zswap_same_filled_pages_enabled = true;
> -module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
> - bool, 0644);
> -
> -/* Enable/disable handling non-same-value filled pages (enabled by default) */
> -static bool zswap_non_same_filled_pages_enabled = true;
> -module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
> - bool, 0644);
> -
> /* Number of zpools in zswap_pool (empirically determined for scalability) */
> #define ZSWAP_NR_ZPOOLS 32
>
> @@ -1421,9 +1408,6 @@ static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value
> unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
> bool ret = false;
>
> - if (!zswap_same_filled_pages_enabled)
> - return false;
> -
> page = kmap_local_folio(folio, 0);
> val = page[0];
>
> @@ -1512,9 +1496,6 @@ bool zswap_store(struct folio *folio)
> goto store_entry;
> }
>
> - if (!zswap_non_same_filled_pages_enabled)
> - goto freepage;
> -
> /* if entry is successfully added, it keeps the reference */
> entry->pool = zswap_pool_current_get();
> if (!entry->pool)

2024-04-10 02:36:49

by Chengming Zhou

Subject: Re: [PATCH v2 3/5] mm: zswap: refactor limit checking from zswap_store()

On 2024/4/5 13:35, Yosry Ahmed wrote:
> Refactor limit and acceptance threshold checking outside of
> zswap_store(). This code will be moved around in a following patch, so
> it would be cleaner to move a function call around.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

LGTM.

Reviewed-by: Chengming Zhou <[email protected]>

Thanks.

> ---
> mm/zswap.c | 29 +++++++++++++++--------------
> 1 file changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 832e3f56232f0..ab3cd43cdfc9d 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1444,6 +1444,20 @@ static void zswap_fill_page(void *ptr, unsigned long value)
> memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
> }
>
> +static bool zswap_check_full(void)
> +{
> + unsigned long cur_pages = zswap_total_pages();
> +
> + if (cur_pages >= READ_ONCE(zswap_max_pages)) {
> + zswap_pool_limit_hit++;
> + zswap_pool_reached_full = true;
> + } else if (zswap_pool_reached_full &&
> + cur_pages <= READ_ONCE(zswap_accept_thr_pages)) {
> + zswap_pool_reached_full = false;
> + }
> + return zswap_pool_reached_full;
> +}
> +
> bool zswap_store(struct folio *folio)
> {
> swp_entry_t swp = folio->swap;
> @@ -1452,7 +1466,6 @@ bool zswap_store(struct folio *folio)
> struct zswap_entry *entry, *old;
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> - unsigned long max_pages, cur_pages;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1475,20 +1488,8 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - /* Check global limits */
> - cur_pages = zswap_total_pages();
> - if (cur_pages >= READ_ONCE(zswap_max_pages)) {
> - zswap_pool_limit_hit++;
> - zswap_pool_reached_full = true;
> + if (zswap_check_full())
> goto reject;
> - }
> -
> - if (zswap_pool_reached_full) {
> - if (cur_pages > READ_ONCE(zswap_accept_thr_pages))
> - goto reject;
> - else
> - zswap_pool_reached_full = false;
> - }
>
> /* allocate entry */
> entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));

2024-04-12 19:49:00

by David Hildenbrand

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On 10.04.24 02:52, Yosry Ahmed wrote:
> [..]
>>> Do we need a separate notifier chain for totalram_pages() updates?
>>
>> Good question. I actually might have the requirement to notify some arch
>> code (s390x) from virtio-mem when fake adding/removing memory, and
>> already wondered how to best wire that up.
>>
>> Maybe we can squeeze that into the existing notifier chain, but needs a
>> bit of thought.
>

Sorry for the late reply, I had to think about this a bit.

> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE,
> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE,
> MEM_OFFLINE, etc).

At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE
that prepare the whole range belonging to the Linux memory block
(/sys/devices/system/memory/memory...) to go online, and then have
something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or
ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable
(!PageOffline, handed to the buddy) or unusable (PageOffline, removed
from the buddy).

There are some details to be figured out, but it could work.

And as virtio-mem currently operates in pageblock granularity (e.g., 2
MiB), but frequently handles multiple contiguous pageblocks within a
Linux memory block, it's not that bad.
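
As a purely hypothetical sketch of the producer side (neither
MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE nor the helper below exist; it reuses
struct memory_notify to describe the affected range):

/* Hypothetical new actions next to the existing MEM_ONLINE/MEM_OFFLINE. */
#define MEM_SOFT_ONLINE		(1 << 8)
#define MEM_SOFT_OFFLINE	(1 << 9)

/* virtio-mem would notify when pageblocks become usable again: */
static void notify_pages_soft_online(unsigned long start_pfn,
				     unsigned long nr_pages)
{
	struct memory_notify arg = {
		.start_pfn = start_pfn,
		.nr_pages = nr_pages,
	};

	memory_notify(MEM_SOFT_ONLINE, &arg);
}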


But the issue I see with ballooning is that we operate here often on
page granularity. While we could optimize some cases, we might get quite
some overhead from all the notifications. Alternatively, we could send a
list of pages, but it won't win a beauty contest.

I think the main issue is that, for my purpose (virtio-mem on s390x), I
need to notify about the exact memory ranges (so I can reinitialize
stuff in s390x code when memory gets effectively re-enabled). For other
cases (total pages changing), we don't need the memory ranges, but only
the "summary" -- or a notification afterwards that the total pages were
just changed quite a bit.

>
> New actions mean minimal impact to existing notifiers, but it may make
> more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions
> that mean "memory increased" and "memory decreased".

Likely, we should keep their semantics unchanged. Things like KASAN want
to allocate metadata memory for the whole range, not on some smallish
pieces. It really means "This Linux memory block goes online/offline,
please prepare for that.". And again, memory ballooning with small pages
is a bit problematic.

>
> I suppose we can add new actions and then separately (and probably
> incrementally) audit existing notifiers to check if they want to
> handle the new actions as well.
>
> Another consideration is that apparently some ballooning drivers also
> register notifiers, so we need to make sure there is no possibility of
> deadlock/recursion.

Right.

--
Cheers,

David / dhildenb


2024-04-13 01:06:27

by Yosry Ahmed

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <[email protected]> wrote:
>
> On 10.04.24 02:52, Yosry Ahmed wrote:
> > [..]
> >>> Do we need a separate notifier chain for totalram_pages() updates?
> >>
> >> Good question. I actually might have the requirement to notify some arch
> >> code (s390x) from virtio-mem when fake adding/removing memory, and
> >> already wondered how to best wire that up.
> >>
> >> Maybe we can squeeze that into the existing notifier chain, but needs a
> >> bit of thought.
> >
>
> Sorry for the late reply, I had to think about this a bit.
>
> > Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE,
> > MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE,
> > MEM_OFFLINE, etc).
>
> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE
> that prepare the whole range belonging to the Linux memory block
> (/sys/devices/system/memory/memory...) to go online, and then have
> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or
> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable
> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed
> from the buddy).
>
> There are some details to be figured out, but it could work.
>
> And as virtio-mem currently operates in pageblock granularity (e.g., 2
> MiB), but frequently handles multiple contiguous pageblocks within a
> Linux memory block, it's not that bad.
>
>
> But the issue I see with ballooning is that we operate here often on
> page granularity. While we could optimize some cases, we might get quite
> some overhead from all the notifications. Alternatively, we could send a
> list of pages, but it won't win a beauty contest.
>
> I think the main issue is that, for my purpose (virtio-mem on s390x), I
> need to notify about the exact memory ranges (so I can reinitialize
> stuff in s390x code when memory gets effectively re-enabled). For other
> cases (total pages changing), we don't need the memory ranges, but only
> the "summary" -- or a notification afterwards that the total pages were
> just changed quite a bit.


Thanks for shedding some light on this. Although I am not familiar
with ballooning, sending notifications on page granularity updates
sounds terrible. It seems like this is not as straightforward as I had
anticipated.

I was going to take a stab at this, but given that the motivation is a
minor optimization on the zswap side, I will probably just give up :)

I will drop this optimization from the series for now, and I can
revisit it if/when notifications for totalram_pages() updates are
implemented at some point. Please CC me if you do so for the s390x use
case :)

>
>
> >
> > New actions mean minimal impact to existing notifiers, but it may make
> > more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions
> > that mean "memory increased" and "memory decreased".
>
> Likely, we should keep their semantics unchanged. Things like KASAN want
> to allocate metadata memory for the whole range, not on some smallish
> pieces. It really means "This Linux memory block goes online/offline,
> please prepare for that.". And again, memory ballooning with small pages
> is a bit problematic.
>
> >
> > I suppose we can add new actions and then separately (and probably
> > incrementally) audit existing notifiers to check if they want to
> > handle the new actions as well.
> >
> > Another consideration is that apparently some ballooning drivers also
> > register notifiers, so we need to make sure there is no possibility of
> > deadlock/recursion.
>
> Right.
>
> --
> Cheers,
>
> David / dhildenb
>

2024-04-15 15:11:04

by David Hildenbrand

Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On 13.04.24 03:05, Yosry Ahmed wrote:
> On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 10.04.24 02:52, Yosry Ahmed wrote:
>>> [..]
>>>>> Do we need a separate notifier chain for totalram_pages() updates?
>>>>
>>>> Good question. I actually might have the requirement to notify some arch
>>>> code (s390x) from virtio-mem when fake adding/removing memory, and
>>>> already wondered how to best wire that up.
>>>>
>>>> Maybe we can squeeze that into the existing notifier chain, but needs a
>>>> bit of thought.
>>>
>>
>> Sorry for the late reply, I had to think about this a bit.
>>
>>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE,
>>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE,
>>> MEM_OFFLINE, etc).
>>
>> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE
>> that prepare the whole range belonging to the Linux memory block
>> (/sys/devices/system/memory/memory...) to go online, and then have
>> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or
>> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable
>> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed
>> from the buddy).
>>
>> There are some details to be figured out, but it could work.
>>
>> And as virtio-mem currently operates in pageblock granularity (e.g., 2
>> MiB), but frequently handles multiple contiguous pageblocks within a
>> Linux memory block, it's not that bad.
>>
>>
>> But the issue I see with ballooning is that we operate here often on
>> page granularity. While we could optimize some cases, we might get quite
>> some overhead from all the notifications. Alternatively, we could send a
>> list of pages, but it won't win a beauty contest.
>>
>> I think the main issue is that, for my purpose (virtio-mem on s390x), I
>> need to notify about the exact memory ranges (so I can reinitialize
>> stuff in s390x code when memory gets effectively re-enabled). For other
>> cases (total pages changing), we don't need the memory ranges, but only
>> the "summary" -- or a notification afterwards that the total pages were
>> just changed quite a bit.
>
>
> Thanks for shedding some light on this. Although I am not familiar
> with ballooning, sending notifications on page granularity updates
> sounds terrible. It seems like this is not as straightforward as I had
> anticipated.
>
> I was going to take a stab at this, but given that the motivation is a
> minor optimization on the zswap side, I will probably just give up :)

Oh no, so I have to do the work! ;)

>
> For now, I will drop this optimization from the series, and I can
> revisit it if/when notifications for totalram_pages() are implemented
> at some point. Please CC me if you do so for the s390x use case :)

I primarily care about virtio-mem resizing VM memory and adjusting
totalram_pages(); memory ballooning is rather a hack for that use
case ... so we're in agreement :)

Likely we'd want two notification mechanisms, but no matter how I look
at it, it's all a bit ugly.

I'll look into the virtio-mem case soonish and will let you know once I
have something running.
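
To sketch what I mean (purely illustrative: nothing below exists today
except memory_notify() and struct memory_notify, and the action names and
values are placeholders for the ENABLE_PAGES/DISABLE_PAGES idea above):

#include <linux/memory.h>

/*
 * Hypothetical new actions; the values are placeholders and would have
 * to be chosen so they don't collide with the existing MEM_* actions.
 */
#define MEM_PAGES_ENABLED       (1 << 10)
#define MEM_PAGES_DISABLED      (1 << 11)

/*
 * Would be called by virtio-mem (or a balloon driver) when a sub-range
 * of an already-online Linux memory block becomes usable or unusable,
 * so that listeners (e.g. s390x arch code) can reinitialize state for
 * exactly this range.
 */
static void notify_pages_enabled(unsigned long start_pfn,
                                 unsigned long nr_pages)
{
        struct memory_notify arg = {
                .start_pfn = start_pfn,
                .nr_pages = nr_pages,
        };

        memory_notify(MEM_PAGES_ENABLED, &arg);
}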

--
Cheers,

David / dhildenb


2024-04-15 18:32:30

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <[email protected]> wrote:
>
> On 13.04.24 03:05, Yosry Ahmed wrote:
> > On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 10.04.24 02:52, Yosry Ahmed wrote:
> >>> [..]
> >>>>> Do we need a separate notifier chain for totalram_pages() updates?
> >>>>
> >>>> Good question. I actually might have the requirement to notify some arch
> >>>> code (s390x) from virtio-mem when fake adding/removing memory, and
> >>>> already wondered how to best wire that up.
> >>>>
> >>>> Maybe we can squeeze that into the existing notifier chain, but needs a
> >>>> bit of thought.
> >>>
> >>
> >> Sorry for the late reply, I had to think about this a bit.
> >>
> >>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE,
> >>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE,
> >>> MEM_OFFLINE, etc).
> >>
> >> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE
> >> that prepare the whole range belonging to the Linux memory block
> >> (/sys/devices/system/memory/memory...) to go online, and then have
> >> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or
> >> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable
> >> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed
> >> from the buddy).
> >>
> >> There are some details to be figured out, but it could work.
> >>
> >> And as virtio-mem currently operates in pageblock granularity (e.g., 2
> >> MiB), but frequently handles multiple contiguous pageblocks within a
> >> Linux memory block, it's not that bad.
> >>
> >>
> >> But the issue I see with ballooning is that we operate here often on
> >> page granularity. While we could optimize some cases, we might get quite
> >> some overhead from all the notifications. Alternatively, we could send a
> >> list of pages, but it won't win a beauty contest.
> >>
> >> I think the main issue is that, for my purpose (virtio-mem on s390x), I
> >> need to notify about the exact memory ranges (so I can reinitialize
> >> stuff in s390x code when memory gets effectively re-enabled). For other
> >> cases (total pages changing), we don't need the memory ranges, but only
> >> the "summary" -- or a notification afterwards that the total pages were
> >> just changed quite a bit.
> >
> >
> > Thanks for shedding some light on this. Although I am not familiar
> > with ballooning, sending notifications on page granularity updates
> > sounds terrible. It seems like this is not as straightforward as I had
> > anticipated.
> >
> > I was going to take a stab at this, but given that the motivation is a
> > minor optimization on the zswap side, I will probably just give up :)
>
> Oh no, so I have to do the work! ;)
>
> >
> > For now, I will drop this optimization from the series, and I can
> > revisit it if/when notifications for totalram_pages() are implemented
> > at some point. Please CC me if you do so for the s390x use case :)
>
> I primarily care about virtio-mem resizing VM memory and adjusting
> totalram_pages(); memory ballooning is rather a hack for that use
> case ... so we're in agreement :)
>
> Likely we'd want two notification mechanisms, but no matter how I look
> at it, it's all a bit ugly.

I am assuming you mean one with exact memory ranges for your s390x use
case, and one high-level mechanism for totalram_pages() updates -- or
did I miss the point?

I am interested to see how page granularity updates would be handled
in this case. Perhaps they are only relevant for the high-level
mechanism? In that case, I suppose we can batch updates and notify
once when a threshold is crossed or something.
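
Something like the following rough sketch is what I have in mind; the
chain name, threshold, and hook point are all made up, only the atomic
and notifier-chain APIs are real, and in practice the notification would
probably have to be deferred (e.g. to a workqueue) since callers
adjusting totalram_pages() may not be in a blocking context:

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/notifier.h>

/* Hypothetical chain that zswap (and others) could subscribe to. */
static BLOCKING_NOTIFIER_HEAD(totalram_notify_chain);
static atomic_long_t totalram_pending_delta;

/* Arbitrary threshold: 32768 pages == 128 MiB with 4K pages. */
#define TOTALRAM_NOTIFY_THRESHOLD       32768L

/* Would be called from e.g. adjust_managed_page_count(). */
static void totalram_pages_changed(long delta)
{
        long pending = atomic_long_add_return(delta, &totalram_pending_delta);

        if (abs(pending) < TOTALRAM_NOTIFY_THRESHOLD)
                return;

        /*
         * Reset the accumulator and notify listeners once. Racing callers
         * may occasionally notify twice, which is harmless since the
         * notification is only a hint to re-read totalram_pages().
         */
        atomic_long_sub(pending, &totalram_pending_delta);
        blocking_notifier_call_chain(&totalram_notify_chain, 0, NULL);
}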

>
> I'll look into the virtio-mem case soonish and will let you know once I
> have something running.

Thanks!

2024-04-15 19:15:39

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: zswap: calculate limits only when updated

On 15.04.24 20:30, Yosry Ahmed wrote:
> On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 13.04.24 03:05, Yosry Ahmed wrote:
>>> On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 10.04.24 02:52, Yosry Ahmed wrote:
>>>>> [..]
>>>>>>> Do we need a separate notifier chain for totalram_pages() updates?
>>>>>>
>>>>>> Good question. I actually might have the requirement to notify some arch
>>>>>> code (s390x) from virtio-mem when fake adding/removing memory, and
>>>>>> already wondered how to best wire that up.
>>>>>>
>>>>>> Maybe we can squeeze that into the existing notifier chain, but needs a
>>>>>> bit of thought.
>>>>>
>>>>
>>>> Sorry for the late reply, I had to think about this a bit.
>>>>
>>>>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE,
>>>>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE,
>>>>> MEM_OFFLINE, etc).
>>>>
>>>> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE
>>>> that prepare the whole range belonging to the Linux memory block
>>>> (/sys/devices/system/memory/memory...) to go online, and then have
>>>> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or
>>>> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable
>>>> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed
>>>> from the buddy).
>>>>
>>>> There are some details to be figured out, but it could work.
>>>>
>>>> And as virtio-mem currently operates in pageblock granularity (e.g., 2
>>>> MiB), but frequently handles multiple contiguous pageblocks within a
>>>> Linux memory block, it's not that bad.
>>>>
>>>>
>>>> But the issue I see with ballooning is that we operate here often on
>>>> page granularity. While we could optimize some cases, we might get quite
>>>> some overhead from all the notifications. Alternatively, we could send a
>>>> list of pages, but it won't win a beauty contest.
>>>>
>>>> I think the main issue is that, for my purpose (virtio-mem on s390x), I
>>>> need to notify about the exact memory ranges (so I can reinitialize
>>>> stuff in s390x code when memory gets effectively re-enabled). For other
>>>> cases (total pages changing), we don't need the memory ranges, but only
>>>> the "summary" -- or a notification afterwards that the total pages were
>>>> just changed quite a bit.
>>>
>>>
>>> Thanks for shedding some light on this. Although I am not familiar
>>> with ballooning, sending notifications on page granularity updates
>>> sounds terrible. It seems like this is not as straightforward as I had
>>> anticipated.
>>>
>>> I was going to take a stab at this, but given that the motivation is a
>>> minor optimization on the zswap side, I will probably just give up :)
>>
>> Oh no, so I have to do the work! ;)
>>
>>>
>>> For now, I will drop this optimization from the series, and I can
>>> revisit it if/when notifications for totalram_pages() are implemented
>>> at some point. Please CC me if you do so for the s390x use case :)
>>
>> I primarily care about virtio-mem resizing VM memory and adjusting
>> totalram_pages(); memory ballooning is rather a hack for that use
>> case ... so we're in agreement :)
>>
>> Likely we'd want two notification mechanisms, but no matter how I look
>> at it, it's all a bit ugly.
>
> I am assuming you mean one with exact memory ranges for your s390x use
> case, and one high-level mechanism for totalram_pages() updates -- or
> did I miss the point?

No, that's it.

>
> I am interested to see how page granularity updates would be handled
> in this case. Perhaps they are only relevant for the high-level
> mechanism? In that case, I suppose we can batch updates and notify
> once when a threshold is crossed or something.

Yes, we'd batch updates.

--
Cheers,

David / dhildenb