2024-03-25 23:50:28

by Yosry Ahmed

Subject: [RFC PATCH 0/9] zswap: store zero-filled pages more efficiently

This patch series drops the support for non-zero same-filled pages from
zswap and makes storing zero-filled pages faster and more efficient.
Non-zero same-filled pages should not be nearly as common as zero-filled
pages, so dropping support for them in favor of improving the more
common zero-filled pages makes sense. It also allows for a lot of code
cleanups.

Patch 1 is a small cleanup to the zswap_store() cleanup path that also
implies a small behavioral change.

Patches 2-4 are groundwork refactoring.

Patch 5 removes the userspace tunable to enable same-filled pages
handling. It arguably makes no sense anyway.

Patch 6 drops the support for non-zero same-filled pages, and patch 7
allows storing zero-filled pages more efficiently. Together, these two
patches yield around a 1.4% improvement in system time on kernbench. The
kernel build test only produces around 1.5% zero-filled pages. Some real
workloads are expected to have higher ratios of zero-filled pages and
hence more improvement. They also save a little bit of memory. Exact
numbers are in the commit logs.

Patch 8 drops the limit checks before handling zero-filled pages and
patch 9 does a followup cleanup of zswap_store() made possible by this
series.

This series is tagged as an RFC because it makes some potentially
controversial decisions :)

The series is based on a slightly outdated mm-unstable.

Yosry Ahmed (9):
mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full
mm: zswap: refactor storing to the tree out of zswap_store()
mm: zswap: refactor limit checking from zswap_store()
mm: zswap: move more same-filled pages checks outside of zswap_store()
mm: zswap: remove zswap_same_filled_pages_enabled
mm: zswap: drop support for non-zero same-filled pages handling
mm: zswap: store zero-filled pages without a zswap_entry
mm: zswap: do not check the global limit for zero-filled pages
mm: zswap: use zswap_entry_free() for partially initialized entries

mm/zswap.c | 297 +++++++++++++++++++++++++----------------------------
1 file changed, 139 insertions(+), 158 deletions(-)

--
2.44.0.396.g6e790dbe36-goog



2024-03-25 23:50:36

by Yosry Ahmed

Subject: [RFC PATCH 1/9] mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full

The cleanup code in zswap_store() is not pretty, particularly the
'shrink' label at the bottom that ends up jumping between cleanup
labels.

Instead of having a dedicated label to shrink the pool, just use
zswap_pool_reached_full directly to figure out if the pool needs
shrinking. zswap_pool_reached_full should be true if and only if the
pool needs shrinking.

The only caveat is that the value of zswap_pool_reached_full may be
changed by concurrent zswap_store() calls between checking the limit and
testing zswap_pool_reached_full in the cleanup code. This is fine
because:
- If zswap_pool_reached_full was true during limit checking then became
false during the cleanup code, then someone else already took care of
shrinking the pool and there is no need to queue the worker. That
would be a good change.
- If zswap_pool_reached_full was false during limit checking then became
true during the cleanup code, then someone else hit the limit
meanwhile. In this case, both threads will try to queue the worker,
but it never gets queued more than once anyway. Also, calling
queue_work() multiple times when the limit is hit could already happen
today, so this isn't a significant change in any way.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index c4979c76d58e3..1cf3ab4b22e64 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1429,12 +1429,12 @@ bool zswap_store(struct folio *folio)
if (cur_pages >= max_pages) {
zswap_pool_limit_hit++;
zswap_pool_reached_full = true;
- goto shrink;
+ goto reject;
}

if (zswap_pool_reached_full) {
if (cur_pages > zswap_accept_thr_pages())
- goto shrink;
+ goto reject;
else
zswap_pool_reached_full = false;
}
@@ -1540,6 +1540,8 @@ bool zswap_store(struct folio *folio)
zswap_entry_cache_free(entry);
reject:
obj_cgroup_put(objcg);
+ if (zswap_pool_reached_full)
+ queue_work(shrink_wq, &zswap_shrink_work);
check_old:
/*
* If the zswap store fails or zswap is disabled, we must invalidate the
@@ -1550,10 +1552,6 @@ bool zswap_store(struct folio *folio)
if (entry)
zswap_entry_free(entry);
return false;
-
-shrink:
- queue_work(shrink_wq, &zswap_shrink_work);
- goto reject;
}

bool zswap_load(struct folio *folio)
--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:50:49

by Yosry Ahmed

Subject: [RFC PATCH 2/9] mm: zswap: refactor storing to the tree out of zswap_store()

Refactor the code that attempts storing to the xarray, handling errors,
and freeing stale entries into a helper. This will be reused in a
following patch to free other types of tree elements as well.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 42 ++++++++++++++++++++++++++----------------
1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 1cf3ab4b22e64..ff1975afb7e3d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -827,6 +827,30 @@ static void zswap_entry_free(struct zswap_entry *entry)
atomic_dec(&zswap_stored_pages);
}

+/*********************************
+* zswap tree functions
+**********************************/
+static int zswap_tree_store(struct xarray *tree, pgoff_t offset, void *new)
+{
+ void *old;
+ int err;
+
+ old = xa_store(tree, offset, new, GFP_KERNEL);
+ err = xa_is_err(old);
+ if (err) {
+ WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+ zswap_reject_alloc_fail++;
+ } else if (old) {
+ /*
+ * We may have had an existing entry that became stale when
+ * the folio was redirtied and now the new version is being
+ * swapped out. Get rid of the old.
+ */
+ zswap_entry_free(old);
+ }
+ return err;
+}
+
/*********************************
* compressed storage functions
**********************************/
@@ -1396,10 +1420,10 @@ bool zswap_store(struct folio *folio)
swp_entry_t swp = folio->swap;
pgoff_t offset = swp_offset(swp);
struct xarray *tree = swap_zswap_tree(swp);
- struct zswap_entry *entry, *old;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
unsigned long max_pages, cur_pages;
+ struct zswap_entry *entry;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1485,22 +1509,8 @@ bool zswap_store(struct folio *folio)
entry->swpentry = swp;
entry->objcg = objcg;

- old = xa_store(tree, offset, entry, GFP_KERNEL);
- if (xa_is_err(old)) {
- int err = xa_err(old);
-
- WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
- zswap_reject_alloc_fail++;
+ if (zswap_tree_store(tree, offset, entry))
goto store_failed;
- }
-
- /*
- * We may have had an existing entry that became stale when
- * the folio was redirtied and now the new version is being
- * swapped out. Get rid of the old.
- */
- if (old)
- zswap_entry_free(old);

if (objcg) {
obj_cgroup_charge_zswap(objcg, entry->length);
--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:50:54

by Yosry Ahmed

Subject: [RFC PATCH 3/9] mm: zswap: refactor limit checking from zswap_store()

Refactor limit and acceptance threshold checking outside of
zswap_store(). This code will be moved around in a following patch, so
it would be cleaner to move a function call around.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index ff1975afb7e3d..6b890c8590ef7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1415,6 +1415,21 @@ static void zswap_fill_page(void *ptr, unsigned long value)
memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
}

+static bool zswap_check_limit(void)
+{
+ unsigned long cur_pages = zswap_total_pages();
+ unsigned long thr = zswap_accept_thr_pages();
+ unsigned long max_pages = zswap_max_pages();
+
+ if (cur_pages >= max_pages) {
+ zswap_pool_limit_hit++;
+ zswap_pool_reached_full = true;
+ } else if (zswap_pool_reached_full && cur_pages <= thr) {
+ zswap_pool_reached_full = false;
+ }
+ return !zswap_pool_reached_full;
+}
+
bool zswap_store(struct folio *folio)
{
swp_entry_t swp = folio->swap;
@@ -1422,7 +1437,6 @@ bool zswap_store(struct folio *folio)
struct xarray *tree = swap_zswap_tree(swp);
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
- unsigned long max_pages, cur_pages;
struct zswap_entry *entry;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
@@ -1446,22 +1460,8 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}

- /* Check global limits */
- cur_pages = zswap_total_pages();
- max_pages = zswap_max_pages();
-
- if (cur_pages >= max_pages) {
- zswap_pool_limit_hit++;
- zswap_pool_reached_full = true;
+ if (!zswap_check_limit())
goto reject;
- }
-
- if (zswap_pool_reached_full) {
- if (cur_pages > zswap_accept_thr_pages())
- goto reject;
- else
- zswap_pool_reached_full = false;
- }

/* allocate entry */
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:51:07

by Yosry Ahmed

Subject: [RFC PATCH 4/9] mm: zswap: move more same-filled pages checks outside of zswap_store()

Currently, zswap_store() checks zswap_same_filled_pages_enabled, kmaps
the folio, then calls zswap_is_page_same_filled() to check the folio
contents. Move this logic into zswap_is_page_same_filled() as well (and
rename it to use 'folio' while we are at it).

This makes zswap_store() cleaner, and keeps future changes to that
logic contained within the helper.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 45 ++++++++++++++++++++++++---------------------
1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 6b890c8590ef7..498a6c5839bef 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1385,26 +1385,36 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > thr);
}

-static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
+static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
{
unsigned long *page;
unsigned long val;
unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
+ bool ret;

- page = (unsigned long *)ptr;
+ if (!zswap_same_filled_pages_enabled)
+ return false;
+
+ page = kmap_local_folio(folio, 0);
val = page[0];

- if (val != page[last_pos])
- return 0;
+ if (val != page[last_pos]) {
+ ret = false;
+ goto out;
+ }

for (pos = 1; pos < last_pos; pos++) {
- if (val != page[pos])
- return 0;
+ if (val != page[pos]) {
+ ret = false;
+ goto out;
+ }
}

*value = val;
-
- return 1;
+ ret = true;
+out:
+ kunmap_local(page);
+ return ret;
}

static void zswap_fill_page(void *ptr, unsigned long value)
@@ -1438,6 +1448,7 @@ bool zswap_store(struct folio *folio)
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
struct zswap_entry *entry;
+ unsigned long value;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1470,19 +1481,11 @@ bool zswap_store(struct folio *folio)
goto reject;
}

- if (zswap_same_filled_pages_enabled) {
- unsigned long value;
- u8 *src;
-
- src = kmap_local_folio(folio, 0);
- if (zswap_is_page_same_filled(src, &value)) {
- kunmap_local(src);
- entry->length = 0;
- entry->value = value;
- atomic_inc(&zswap_same_filled_pages);
- goto insert_entry;
- }
- kunmap_local(src);
+ if (zswap_is_folio_same_filled(folio, &value)) {
+ entry->length = 0;
+ entry->value = value;
+ atomic_inc(&zswap_same_filled_pages);
+ goto insert_entry;
}

if (!zswap_non_same_filled_pages_enabled)
--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:51:18

by Yosry Ahmed

Subject: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

There is no logical reason to refuse storing same-filled pages more
efficiently and opt for compression. Remove the userspace knob.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 498a6c5839bef..0fc27ae950c74 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -123,14 +123,6 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
uint, 0644);

-/*
- * Enable/disable handling same-value filled pages (enabled by default).
- * If disabled every page is considered non-same-value filled.
- */
-static bool zswap_same_filled_pages_enabled = true;
-module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
- bool, 0644);
-
/* Enable/disable handling non-same-value filled pages (enabled by default) */
static bool zswap_non_same_filled_pages_enabled = true;
module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
@@ -1392,9 +1384,6 @@ static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value
unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
bool ret;

- if (!zswap_same_filled_pages_enabled)
- return false;
-
page = kmap_local_folio(folio, 0);
val = page[0];

--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:51:32

by Yosry Ahmed

Subject: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

The current same-filled pages handling supports pages filled with any
repeated word-sized pattern. However, in practice, most of these should
be zero pages anyway. Other patterns should not be nearly as common.

Drop the support for non-zero same-filled pages, but keep the names of
knobs exposed to userspace as "same_filled", which isn't entirely
inaccurate.

This yields some nice code simplification and enables a following patch
that eliminates the need to allocate struct zswap_entry for those pages
completely.

There is also a very small performance improvement observed over 50 runs
of a kernel build test (kernbench), comparing the mean build time on a
Skylake machine when building the kernel in a cgroup v1 container with a
3G limit:

              base      patched    % diff
real        70.167       69.915   -0.359%
user      2953.068     2956.147   +0.104%
sys       2612.811     2594.718   -0.692%

This probably comes from more optimized operations like memchr_inv() and
clear_highpage(). Note that the percentage of zero-filled pages during
this test was only around 1.5% on average, and was not affected by this
patch. Practical workloads could have a larger proportion of such pages
(e.g. Johannes observed around 10% [1]), so the performance improvement
should be larger.

[1]https://lore.kernel.org/linux-mm/[email protected]/

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 76 ++++++++++++++----------------------------------------
1 file changed, 20 insertions(+), 56 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 0fc27ae950c74..413d9242cf500 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -44,8 +44,8 @@
**********************************/
/* The number of compressed pages currently stored in zswap */
atomic_t zswap_stored_pages = ATOMIC_INIT(0);
-/* The number of same-value filled pages currently stored in zswap */
-static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);
+/* The number of zero-filled pages currently stored in zswap */
+static atomic_t zswap_zero_filled_pages = ATOMIC_INIT(0);

/*
* The statistics below are not protected from concurrent access for
@@ -123,9 +123,9 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
uint, 0644);

-/* Enable/disable handling non-same-value filled pages (enabled by default) */
-static bool zswap_non_same_filled_pages_enabled = true;
-module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
+/* Enable/disable handling non-zero-filled pages (enabled by default) */
+static bool zswap_non_zero_filled_pages_enabled = true;
+module_param_named(non_same_filled_pages_enabled, zswap_non_zero_filled_pages_enabled,
bool, 0644);

/* Number of zpools in zswap_pool (empirically determined for scalability) */
@@ -187,11 +187,10 @@ static struct shrinker *zswap_shrinker;
*
* swpentry - associated swap entry, the offset indexes into the red-black tree
* length - the length in bytes of the compressed page data. Needed during
- * decompression. For a same value filled page length is 0, and both
+ * decompression. For a zero-filled page length is 0, and both
* pool and lru are invalid and must be ignored.
* pool - the zswap_pool the entry's data is in
* handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
* objcg - the obj_cgroup that the compressed memory is charged to
* lru - handle to the pool's lru used to evict pages.
*/
@@ -199,10 +198,7 @@ struct zswap_entry {
swp_entry_t swpentry;
unsigned int length;
struct zswap_pool *pool;
- union {
- unsigned long handle;
- unsigned long value;
- };
+ unsigned long handle;
struct obj_cgroup *objcg;
struct list_head lru;
};
@@ -805,7 +801,7 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
static void zswap_entry_free(struct zswap_entry *entry)
{
if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
+ atomic_dec(&zswap_zero_filled_pages);
else {
zswap_lru_del(&zswap_list_lru, entry);
zpool_free(zswap_find_zpool(entry), entry->handle);
@@ -1377,43 +1373,17 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > thr);
}

-static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
+static bool zswap_is_folio_zero_filled(struct folio *folio)
{
- unsigned long *page;
- unsigned long val;
- unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
+ unsigned long *kaddr;
bool ret;

- page = kmap_local_folio(folio, 0);
- val = page[0];
-
- if (val != page[last_pos]) {
- ret = false;
- goto out;
- }
-
- for (pos = 1; pos < last_pos; pos++) {
- if (val != page[pos]) {
- ret = false;
- goto out;
- }
- }
-
- *value = val;
- ret = true;
-out:
- kunmap_local(page);
+ kaddr = kmap_local_folio(folio, 0);
+ ret = !memchr_inv(kaddr, 0, PAGE_SIZE);
+ kunmap_local(kaddr);
return ret;
}

-static void zswap_fill_page(void *ptr, unsigned long value)
-{
- unsigned long *page;
-
- page = (unsigned long *)ptr;
- memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
-}
-
static bool zswap_check_limit(void)
{
unsigned long cur_pages = zswap_total_pages();
@@ -1437,7 +1407,6 @@ bool zswap_store(struct folio *folio)
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
struct zswap_entry *entry;
- unsigned long value;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1470,14 +1439,13 @@ bool zswap_store(struct folio *folio)
goto reject;
}

- if (zswap_is_folio_same_filled(folio, &value)) {
+ if (zswap_is_folio_zero_filled(folio)) {
entry->length = 0;
- entry->value = value;
- atomic_inc(&zswap_same_filled_pages);
+ atomic_inc(&zswap_zero_filled_pages);
goto insert_entry;
}

- if (!zswap_non_same_filled_pages_enabled)
+ if (!zswap_non_zero_filled_pages_enabled)
goto freepage;

/* if entry is successfully added, it keeps the reference */
@@ -1532,7 +1500,7 @@ bool zswap_store(struct folio *folio)

store_failed:
if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
+ atomic_dec(&zswap_zero_filled_pages);
else {
zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
@@ -1563,7 +1531,6 @@ bool zswap_load(struct folio *folio)
struct page *page = &folio->page;
struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
- u8 *dst;

VM_WARN_ON_ONCE(!folio_test_locked(folio));

@@ -1573,11 +1540,8 @@ bool zswap_load(struct folio *folio)

if (entry->length)
zswap_decompress(entry, page);
- else {
- dst = kmap_local_page(page);
- zswap_fill_page(dst, entry->value);
- kunmap_local(dst);
- }
+ else
+ clear_highpage(page);

count_vm_event(ZSWPIN);
if (entry->objcg)
@@ -1679,7 +1643,7 @@ static int zswap_debugfs_init(void)
debugfs_create_atomic_t("stored_pages", 0444,
zswap_debugfs_root, &zswap_stored_pages);
debugfs_create_atomic_t("same_filled_pages", 0444,
- zswap_debugfs_root, &zswap_same_filled_pages);
+ zswap_debugfs_root, &zswap_zero_filled_pages);

return 0;
}
--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:51:38

by Yosry Ahmed

Subject: [RFC PATCH 7/9] mm: zswap: store zero-filled pages without a zswap_entry

After the rbtree to xarray conversion, and dropping zswap_entry.refcount
and zswap_entry.value, the only members of zswap_entry utilized by
zero-filled pages are zswap_entry.length (always 0) and
zswap_entry.objcg. Store the objcg pointer directly in the xarray as a
tagged pointer and completely avoid allocating a zswap_entry for
zero-filled pages.

This simplifies the code as we no longer need to special-case
zero-length entries. We are also able to further separate the zero-filled
pages handling logic and completely isolate it within store/load helpers.
Tagged xarray pointers are handled in these two helpers, as well as in
the newly introduced helper for freeing tree elements,
zswap_tree_free_element().
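
Condensed, the tagging scheme works roughly as in the sketch below. The
two helper names in the sketch are made up purely for illustration; the
patch itself introduces zswap_store_zero_filled(), zswap_load_zero_filled()
and zswap_tree_free_element(), which additionally update the stats and
handle errors:

#define ZSWAP_ZERO_FILLED_TAG	1UL

/* A zero-filled page is represented in the tree by its objcg pointer,
 * tagged so that it can be told apart from a regular zswap_entry pointer.
 */
static void *zswap_zero_filled_elem(struct obj_cgroup *objcg)
{
	return xa_tag_pointer(objcg, ZSWAP_ZERO_FILLED_TAG);
}

/* Regular zswap_entry pointers are stored untagged, so the pointer tag
 * is enough to distinguish the two element types on load/free.
 */
static bool zswap_elem_is_zero_filled(void *elem)
{
	return xa_pointer_tag(elem) == ZSWAP_ZERO_FILLED_TAG;
}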

There is also a small performance improvement observed over 50 runs of a
kernel build test (kernbench), comparing the mean build time on a Skylake
machine when building the kernel in a cgroup v1 container with a 3G
limit. This is on top of the improvement from dropping support for
non-zero same-filled pages:
non-zero same-filled pages:

              base      patched    % diff
real        69.915       69.757   -0.229%
user      2956.147     2955.244   -0.031%
sys       2594.718     2575.747   -0.731%

This probably comes from avoiding the zswap_entry allocation and
cleanup/freeing for zero-filled pages. Note that the percentage of
zero-filled pages during this test was only around 1.5% on average.
Practical workloads could have a larger proportion of such pages (e.g.
Johannes observed around 10% [1]), so the performance improvement should
be larger.

This change also saves a small amount of memory due to fewer allocated
zswap_entry structs. In the kernel build test above, we save around 2M of
slab usage when we swap out 3G to zswap.

[1]https://lore.kernel.org/linux-mm/[email protected]/

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 137 ++++++++++++++++++++++++++++++-----------------------
1 file changed, 78 insertions(+), 59 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 413d9242cf500..efc323bab2f22 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -183,12 +183,11 @@ static struct shrinker *zswap_shrinker;
* struct zswap_entry
*
* This structure contains the metadata for tracking a single compressed
- * page within zswap.
+ * page within zswap, it does not track zero-filled pages.
*
* swpentry - associated swap entry, the offset indexes into the red-black tree
* length - the length in bytes of the compressed page data. Needed during
- * decompression. For a zero-filled page length is 0, and both
- * pool and lru are invalid and must be ignored.
+ * decompression.
* pool - the zswap_pool the entry's data is in
* handle - zpool allocation handle that stores the compressed page data
* objcg - the obj_cgroup that the compressed memory is charged to
@@ -794,30 +793,35 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
return entry->pool->zpools[hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS))];
}

-/*
- * Carries out the common pattern of freeing and entry's zpool allocation,
- * freeing the entry itself, and decrementing the number of stored pages.
- */
static void zswap_entry_free(struct zswap_entry *entry)
{
- if (!entry->length)
- atomic_dec(&zswap_zero_filled_pages);
- else {
- zswap_lru_del(&zswap_list_lru, entry);
- zpool_free(zswap_find_zpool(entry), entry->handle);
- zswap_pool_put(entry->pool);
- }
+ zswap_lru_del(&zswap_list_lru, entry);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
+ zswap_pool_put(entry->pool);
if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
obj_cgroup_put(entry->objcg);
}
zswap_entry_cache_free(entry);
- atomic_dec(&zswap_stored_pages);
}

/*********************************
* zswap tree functions
**********************************/
+static void zswap_tree_free_element(void *elem)
+{
+ if (!elem)
+ return;
+
+ if (xa_pointer_tag(elem)) {
+ obj_cgroup_put(xa_untag_pointer(elem));
+ atomic_dec(&zswap_zero_filled_pages);
+ } else {
+ zswap_entry_free((struct zswap_entry *)elem);
+ }
+ atomic_dec(&zswap_stored_pages);
+}
+
static int zswap_tree_store(struct xarray *tree, pgoff_t offset, void *new)
{
void *old;
@@ -834,7 +838,7 @@ static int zswap_tree_store(struct xarray *tree, pgoff_t offset, void *new)
* the folio was redirtied and now the new version is being
* swapped out. Get rid of the old.
*/
- zswap_entry_free(old);
+ zswap_tree_free_element(old);
}
return err;
}
@@ -1089,7 +1093,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
if (entry->objcg)
count_objcg_event(entry->objcg, ZSWPWB);

- zswap_entry_free(entry);
+ zswap_tree_free_element(entry);

/* folio is up to date */
folio_mark_uptodate(folio);
@@ -1373,6 +1377,33 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > thr);
}

+/*********************************
+* zero-filled functions
+**********************************/
+#define ZSWAP_ZERO_FILLED_TAG 1UL
+
+static int zswap_store_zero_filled(struct xarray *tree, pgoff_t offset,
+ struct obj_cgroup *objcg)
+{
+ int err = zswap_tree_store(tree, offset,
+ xa_tag_pointer(objcg, ZSWAP_ZERO_FILLED_TAG));
+
+ if (!err)
+ atomic_inc(&zswap_zero_filled_pages);
+ return err;
+}
+
+static bool zswap_load_zero_filled(void *elem, struct page *page,
+ struct obj_cgroup **objcg)
+{
+ if (!xa_pointer_tag(elem))
+ return false;
+
+ clear_highpage(page);
+ *objcg = xa_untag_pointer(elem);
+ return true;
+}
+
static bool zswap_is_folio_zero_filled(struct folio *folio)
{
unsigned long *kaddr;
@@ -1432,22 +1463,21 @@ bool zswap_store(struct folio *folio)
if (!zswap_check_limit())
goto reject;

- /* allocate entry */
+ if (zswap_is_folio_zero_filled(folio)) {
+ if (zswap_store_zero_filled(tree, offset, objcg))
+ goto reject;
+ goto stored;
+ }
+
+ if (!zswap_non_zero_filled_pages_enabled)
+ goto reject;
+
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
if (!entry) {
zswap_reject_kmemcache_fail++;
goto reject;
}

- if (zswap_is_folio_zero_filled(folio)) {
- entry->length = 0;
- atomic_inc(&zswap_zero_filled_pages);
- goto insert_entry;
- }
-
- if (!zswap_non_zero_filled_pages_enabled)
- goto freepage;
-
/* if entry is successfully added, it keeps the reference */
entry->pool = zswap_pool_current_get();
if (!entry->pool)
@@ -1465,17 +1495,14 @@ bool zswap_store(struct folio *folio)
if (!zswap_compress(folio, entry))
goto put_pool;

-insert_entry:
entry->swpentry = swp;
entry->objcg = objcg;

if (zswap_tree_store(tree, offset, entry))
goto store_failed;

- if (objcg) {
+ if (objcg)
obj_cgroup_charge_zswap(objcg, entry->length);
- count_objcg_event(objcg, ZSWPOUT);
- }

/*
* We finish initializing the entry while it's already in xarray.
@@ -1487,25 +1514,21 @@ bool zswap_store(struct folio *folio)
* The publishing order matters to prevent writeback from seeing
* an incoherent entry.
*/
- if (entry->length) {
- INIT_LIST_HEAD(&entry->lru);
- zswap_lru_add(&zswap_list_lru, entry);
- }
+ INIT_LIST_HEAD(&entry->lru);
+ zswap_lru_add(&zswap_list_lru, entry);

- /* update stats */
+stored:
+ if (objcg)
+ count_objcg_event(objcg, ZSWPOUT);
atomic_inc(&zswap_stored_pages);
count_vm_event(ZSWPOUT);

return true;

store_failed:
- if (!entry->length)
- atomic_dec(&zswap_zero_filled_pages);
- else {
- zpool_free(zswap_find_zpool(entry), entry->handle);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
- zswap_pool_put(entry->pool);
- }
+ zswap_pool_put(entry->pool);
freepage:
zswap_entry_cache_free(entry);
reject:
@@ -1518,9 +1541,7 @@ bool zswap_store(struct folio *folio)
* possibly stale entry which was previously stored at this offset.
* Otherwise, writeback could overwrite the new data in the swapfile.
*/
- entry = xa_erase(tree, offset);
- if (entry)
- zswap_entry_free(entry);
+ zswap_tree_free_element(xa_erase(tree, offset));
return false;
}

@@ -1531,26 +1552,27 @@ bool zswap_load(struct folio *folio)
struct page *page = &folio->page;
struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
+ struct obj_cgroup *objcg;
+ void *elem;

VM_WARN_ON_ONCE(!folio_test_locked(folio));

- entry = xa_erase(tree, offset);
- if (!entry)
+ elem = xa_erase(tree, offset);
+ if (!elem)
return false;

- if (entry->length)
+ if (!zswap_load_zero_filled(elem, page, &objcg)) {
+ entry = elem;
+ objcg = entry->objcg;
zswap_decompress(entry, page);
- else
- clear_highpage(page);
+ }

count_vm_event(ZSWPIN);
- if (entry->objcg)
- count_objcg_event(entry->objcg, ZSWPIN);
-
- zswap_entry_free(entry);
+ if (objcg)
+ count_objcg_event(objcg, ZSWPIN);

+ zswap_tree_free_element(elem);
folio_mark_dirty(folio);
-
return true;
}

@@ -1558,11 +1580,8 @@ void zswap_invalidate(swp_entry_t swp)
{
pgoff_t offset = swp_offset(swp);
struct xarray *tree = swap_zswap_tree(swp);
- struct zswap_entry *entry;

- entry = xa_erase(tree, offset);
- if (entry)
- zswap_entry_free(entry);
+ zswap_tree_free_element(xa_erase(tree, offset));
}

int zswap_swapon(int type, unsigned long nr_pages)
--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:51:48

by Yosry Ahmed

Subject: [RFC PATCH 8/9] mm: zswap: do not check the global limit for zero-filled pages

When storing zero-filled pages, there is no point in checking the global
zswap limit. These pages do not consume any memory that contributes
toward the limit. Move the limit checking after zero-filled pages are
handled.

This avoids having zero-filled pages skip zswap and go to disk swap if
the limit is hit. It also avoids queueing the shrink worker, which may
end up being unnecessary if the zswap usage goes down on its own before
another store is attempted.

Ignoring the memcg limits as well for zero-filled pages is more
controversial. Those limits are more a matter of per-workload policy.
Some workloads disable zswap completely by setting memory.zswap.max = 0,
and those workloads could start observing some zswap activity even after
disabling zswap. Although harmless, this could cause confusion to
userspace. Remain conservative and keep respecting those limits.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index efc323bab2f22..9357328d940af 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1460,9 +1460,6 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}

- if (!zswap_check_limit())
- goto reject;
-
if (zswap_is_folio_zero_filled(folio)) {
if (zswap_store_zero_filled(tree, offset, objcg))
goto reject;
@@ -1472,6 +1469,9 @@ bool zswap_store(struct folio *folio)
if (!zswap_non_zero_filled_pages_enabled)
goto reject;

+ if (!zswap_check_limit())
+ goto reject;
+
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
if (!entry) {
zswap_reject_kmemcache_fail++;
--
2.44.0.396.g6e790dbe36-goog


2024-03-25 23:52:10

by Yosry Ahmed

Subject: [RFC PATCH 9/9] mm: zswap: use zswap_entry_free() for partially initialized entries

zswap_entry_free() performs four types of cleanups before freeing a
zswap_entry:
- Deletes the entry from the LRU.
- Frees compressed memory.
- Puts the pool reference.
- Uncharges the compressed memory and puts the objcg.

zswap_entry_free() always expects a fully initialized entry. Allow
zswap_entry_free() to handle partially initialized entries by making it
possible to identify what cleanups are needed as follows:
- Allocate entries with __GFP_ZERO and initialize zswap_entry.lru when
the entry is allocated. Pointers are NULL and the length is zero upon
initialization.
- Use zswap_entry.length to identify if there is compressed memory to
free. This is possible now that zero-filled pages are handled
separately, so a length of zero means we did not successfully compress
the page.
- Only initialize entry->objcg after the memory is charged in
zswap_store().

With this in place, use zswap_entry_free() in the failure path of
zswap_store() to cleanup partially initialized entries. This simplifies
the cleanup code in zswap_store(). While we are at it, rename the
remaining cleanup labels to more meaningful names.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/zswap.c | 62 ++++++++++++++++++++++++++----------------------------
1 file changed, 30 insertions(+), 32 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 9357328d940af..c50f9df230ca3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -774,12 +774,13 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
**********************************/
static struct kmem_cache *zswap_entry_cache;

-static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
+static struct zswap_entry *zswap_entry_cache_alloc(int nid)
{
struct zswap_entry *entry;
- entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
- if (!entry)
- return NULL;
+ entry = kmem_cache_alloc_node(zswap_entry_cache,
+ GFP_KERNEL | __GFP_ZERO, nid);
+ if (entry)
+ INIT_LIST_HEAD(&entry->lru);
return entry;
}

@@ -795,9 +796,12 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)

static void zswap_entry_free(struct zswap_entry *entry)
{
- zswap_lru_del(&zswap_list_lru, entry);
- zpool_free(zswap_find_zpool(entry), entry->handle);
- zswap_pool_put(entry->pool);
+ if (!list_empty(&entry->lru))
+ zswap_lru_del(&zswap_list_lru, entry);
+ if (entry->length)
+ zpool_free(zswap_find_zpool(entry), entry->handle);
+ if (entry->pool)
+ zswap_pool_put(entry->pool);
if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
obj_cgroup_put(entry->objcg);
@@ -1447,7 +1451,7 @@ bool zswap_store(struct folio *folio)
return false;

if (!zswap_enabled)
- goto check_old;
+ goto erase_old;

/* Check cgroup limits */
objcg = get_obj_cgroup_from_folio(folio);
@@ -1455,54 +1459,52 @@ bool zswap_store(struct folio *folio)
memcg = get_mem_cgroup_from_objcg(objcg);
if (shrink_memcg(memcg)) {
mem_cgroup_put(memcg);
- goto reject;
+ goto put_objcg;
}
mem_cgroup_put(memcg);
}

if (zswap_is_folio_zero_filled(folio)) {
if (zswap_store_zero_filled(tree, offset, objcg))
- goto reject;
+ goto put_objcg;
goto stored;
}

if (!zswap_non_zero_filled_pages_enabled)
- goto reject;
+ goto put_objcg;

if (!zswap_check_limit())
- goto reject;
+ goto put_objcg;

- entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
+ entry = zswap_entry_cache_alloc(folio_nid(folio));
if (!entry) {
zswap_reject_kmemcache_fail++;
- goto reject;
+ goto put_objcg;
}

- /* if entry is successfully added, it keeps the reference */
entry->pool = zswap_pool_current_get();
if (!entry->pool)
- goto freepage;
+ goto free_entry;

if (objcg) {
memcg = get_mem_cgroup_from_objcg(objcg);
if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
mem_cgroup_put(memcg);
- goto put_pool;
+ goto free_entry;
}
mem_cgroup_put(memcg);
}

if (!zswap_compress(folio, entry))
- goto put_pool;
-
- entry->swpentry = swp;
- entry->objcg = objcg;
+ goto free_entry;

if (zswap_tree_store(tree, offset, entry))
- goto store_failed;
+ goto free_entry;

- if (objcg)
+ if (objcg) {
obj_cgroup_charge_zswap(objcg, entry->length);
+ entry->objcg = objcg;
+ }

/*
* We finish initializing the entry while it's already in xarray.
@@ -1514,7 +1516,7 @@ bool zswap_store(struct folio *folio)
* The publishing order matters to prevent writeback from seeing
* an incoherent entry.
*/
- INIT_LIST_HEAD(&entry->lru);
+ entry->swpentry = swp;
zswap_lru_add(&zswap_list_lru, entry);

stored:
@@ -1525,17 +1527,13 @@ bool zswap_store(struct folio *folio)

return true;

-store_failed:
- zpool_free(zswap_find_zpool(entry), entry->handle);
-put_pool:
- zswap_pool_put(entry->pool);
-freepage:
- zswap_entry_cache_free(entry);
-reject:
+free_entry:
+ zswap_entry_free(entry);
+put_objcg:
obj_cgroup_put(objcg);
if (zswap_pool_reached_full)
queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+erase_old:
/*
* If the zswap store fails or zswap is disabled, we must invalidate the
* possibly stale entry which was previously stored at this offset.
--
2.44.0.396.g6e790dbe36-goog


2024-03-26 21:57:28

by Nhat Pham

Subject: Re: [RFC PATCH 4/9] mm: zswap: move more same-filled pages checks outside of zswap_store()

On Mon, Mar 25, 2024 at 4:50 PM Yosry Ahmed <[email protected]> wrote:
>
> Currently, zswap_store() checks zswap_same_filled_pages_enabled, kmaps
> the folio, then calls zswap_is_page_same_filled() to check the folio
> contents. Move this logic into zswap_is_page_same_filled() as well (and
> rename it to use 'folio' while we are at it).
>
> This makes zswap_store() cleaner, and keeps future changes to that
> logic contained within the helper.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

Reviewed-by: Nhat Pham <[email protected]>

2024-03-26 22:03:00

by Nhat Pham

Subject: Re: [RFC PATCH 1/9] mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full

On Mon, Mar 25, 2024 at 4:50 PM Yosry Ahmed <[email protected]> wrote:
>
> The cleanup code in zswap_store() is not pretty, particularly the
> 'shrink' label at the bottom that ends up jumping between cleanup
> labels.
>
> Instead of having a dedicated label to shrink the pool, just use
> zswap_pool_reached_full directly to figure out if the pool needs
> shrinking. zswap_pool_reached_full should be true if and only if the
> pool needs shrinking.
>
> The only caveat is that the value of zswap_pool_reached_full may be
> changed by concurrent zswap_store() calls between checking the limit and
> testing zswap_pool_reached_full in the cleanup code. This is fine
> because:
> - If zswap_pool_reached_full was true during limit checking then became
> false during the cleanup code, then someone else already took care of
> shrinking the pool and there is no need to queue the worker. That
> would be a good change.

Yup.

> - If zswap_pool_reached_full was false during limit checking then became
> true during the cleanup code, then someone else hit the limit
> meanwhile. In this case, both threads will try to queue the worker,
> but it never gets queued more than once anyway. Also, calling
> queue_work() multiple times when the limit is hit could already happen
> today, so this isn't a significant change in any way.

Agree.

>
> Signed-off-by: Yosry Ahmed <[email protected]>

This change by itself seems fine to me.
Reviewed-by: Nhat Pham <[email protected]>

2024-03-26 22:07:31

by Nhat Pham

Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Mon, Mar 25, 2024 at 4:50 PM Yosry Ahmed <[email protected]> wrote:
>
> There is no logical reason to refuse storing same-filled pages more
> efficiently and opt for compression. Remove the userspace knob.
>

Agree. Absolutely no idea why we have this knob in the first place -
if you have cycles to compress, you probably have some extra cycles to
check same-filled page state? And that is the only cost I can think of
- it wins on probably all other aspects: memory usage,
"decompression", no need to write back these pages etc.

Any actual zswap user who has data or counter-use case should chime
in, but FWIW, my vote is:
Reviewed-by: Nhat Pham <[email protected]>

2024-03-27 02:21:34

by Chengming Zhou

Subject: Re: [RFC PATCH 1/9] mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full

On 2024/3/26 07:50, Yosry Ahmed wrote:
> The cleanup code in zswap_store() is not pretty, particularly the
> 'shrink' label at the bottom that ends up jumping between cleanup
> labels.
>
> Instead of having a dedicated label to shrink the pool, just use
> zswap_pool_reached_full directly to figure out if the pool needs
> shrinking. zswap_pool_reached_full should be true if and only if the
> pool needs shrinking.
>
> The only caveat is that the value of zswap_pool_reached_full may be
> changed by concurrent zswap_store() calls between checking the limit and
> testing zswap_pool_reached_full in the cleanup code. This is fine
> because:
> - If zswap_pool_reached_full was true during limit checking then became
> false during the cleanup code, then someone else already took care of
> shrinking the pool and there is no need to queue the worker. That
> would be a good change.
> - If zswap_pool_reached_full was false during limit checking then became
> true during the cleanup code, then someone else hit the limit
> meanwhile. In this case, both threads will try to queue the worker,
> but it never gets queued more than once anyway. Also, calling
> queue_work() multiple times when the limit is hit could already happen
> today, so this isn't a significant change in any way.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

Looks good to me.

Reviewed-by: Chengming Zhou <[email protected]>

Thanks.

> ---
> mm/zswap.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index c4979c76d58e3..1cf3ab4b22e64 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1429,12 +1429,12 @@ bool zswap_store(struct folio *folio)
> if (cur_pages >= max_pages) {
> zswap_pool_limit_hit++;
> zswap_pool_reached_full = true;
> - goto shrink;
> + goto reject;
> }
>
> if (zswap_pool_reached_full) {
> if (cur_pages > zswap_accept_thr_pages())
> - goto shrink;
> + goto reject;
> else
> zswap_pool_reached_full = false;
> }
> @@ -1540,6 +1540,8 @@ bool zswap_store(struct folio *folio)
> zswap_entry_cache_free(entry);
> reject:
> obj_cgroup_put(objcg);
> + if (zswap_pool_reached_full)
> + queue_work(shrink_wq, &zswap_shrink_work);
> check_old:
> /*
> * If the zswap store fails or zswap is disabled, we must invalidate the
> @@ -1550,10 +1552,6 @@ bool zswap_store(struct folio *folio)
> if (entry)
> zswap_entry_free(entry);
> return false;
> -
> -shrink:
> - queue_work(shrink_wq, &zswap_shrink_work);
> - goto reject;
> }
>
> bool zswap_load(struct folio *folio)

2024-03-27 02:25:51

by Chengming Zhou

Subject: Re: [RFC PATCH 2/9] mm: zswap: refactor storing to the tree out of zswap_store()

On 2024/3/26 07:50, Yosry Ahmed wrote:
> Refactor the code that attempts storing to the xarray, handling errors,
> and freeing stale entries into a helper. This will be reused in a
> following patch to free other types of tree elements as well.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> mm/zswap.c | 42 ++++++++++++++++++++++++++----------------
> 1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 1cf3ab4b22e64..ff1975afb7e3d 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -827,6 +827,30 @@ static void zswap_entry_free(struct zswap_entry *entry)
> atomic_dec(&zswap_stored_pages);
> }
>
> +/*********************************
> +* zswap tree functions
> +**********************************/
> +static int zswap_tree_store(struct xarray *tree, pgoff_t offset, void *new)
> +{
> + void *old;
> + int err;
> +
> + old = xa_store(tree, offset, new, GFP_KERNEL);
> + err = xa_is_err(old);

It seems xa_err() should be used to get the errno; xa_is_err() just returns a bool.

> + if (err) {
> + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> + zswap_reject_alloc_fail++;
> + } else if (old) {
> + /*
> + * We may have had an existing entry that became stale when
> + * the folio was redirtied and now the new version is being
> + * swapped out. Get rid of the old.
> + */
> + zswap_entry_free(old);
> + }
> + return err;
> +}
> +
> /*********************************
> * compressed storage functions
> **********************************/
> @@ -1396,10 +1420,10 @@ bool zswap_store(struct folio *folio)
> swp_entry_t swp = folio->swap;
> pgoff_t offset = swp_offset(swp);
> struct xarray *tree = swap_zswap_tree(swp);
> - struct zswap_entry *entry, *old;
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> unsigned long max_pages, cur_pages;
> + struct zswap_entry *entry;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1485,22 +1509,8 @@ bool zswap_store(struct folio *folio)
> entry->swpentry = swp;
> entry->objcg = objcg;
>
> - old = xa_store(tree, offset, entry, GFP_KERNEL);
> - if (xa_is_err(old)) {
> - int err = xa_err(old);
> -
> - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> - zswap_reject_alloc_fail++;
> + if (zswap_tree_store(tree, offset, entry))
> goto store_failed;
> - }
> -
> - /*
> - * We may have had an existing entry that became stale when
> - * the folio was redirtied and now the new version is being
> - * swapped out. Get rid of the old.
> - */
> - if (old)
> - zswap_entry_free(old);
>
> if (objcg) {
> obj_cgroup_charge_zswap(objcg, entry->length);
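
For illustration, this is roughly what zswap_tree_store() would look like
with the xa_err() point above applied, i.e. with the errno taken via
xa_err() instead of the bool from xa_is_err(); everything else is
unchanged from the hunk being reviewed:

static int zswap_tree_store(struct xarray *tree, pgoff_t offset, void *new)
{
	void *old;
	int err;

	old = xa_store(tree, offset, new, GFP_KERNEL);
	err = xa_err(old);	/* 0 on success, negative errno on failure */
	if (err) {
		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
	} else if (old) {
		/* free a stale entry that the new element replaced */
		zswap_entry_free(old);
	}
	return err;
}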

2024-03-27 02:39:24

by Chengming Zhou

Subject: Re: [RFC PATCH 4/9] mm: zswap: move more same-filled pages checks outside of zswap_store()

On 2024/3/26 07:50, Yosry Ahmed wrote:
> Currently, zswap_store() checks zswap_same_filled_pages_enabled, kmaps
> the folio, then calls zswap_is_page_same_filled() to check the folio
> contents. Move this logic into zswap_is_page_same_filled() as well (and
> rename it to use 'folio' while we are at it).
>
> This makes zswap_store() cleaner, and keeps future changes to that
> logic contained within the helper.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

LGTM with one comment below:

Reviewed-by: Chengming Zhou <[email protected]>

> ---
> mm/zswap.c | 45 ++++++++++++++++++++++++---------------------
> 1 file changed, 24 insertions(+), 21 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6b890c8590ef7..498a6c5839bef 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1385,26 +1385,36 @@ static void shrink_worker(struct work_struct *w)
> } while (zswap_total_pages() > thr);
> }
>
> -static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
> +static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
> {
> unsigned long *page;
> unsigned long val;
> unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
> + bool ret;
>
> - page = (unsigned long *)ptr;
> + if (!zswap_same_filled_pages_enabled)
> + return false;
> +
> + page = kmap_local_folio(folio, 0);
> val = page[0];
>
> - if (val != page[last_pos])
> - return 0;
> + if (val != page[last_pos]) {
> + ret = false;
> + goto out;
> + }
>
> for (pos = 1; pos < last_pos; pos++) {
> - if (val != page[pos])
> - return 0;
> + if (val != page[pos]) {
> + ret = false;

nit: ret can be initialized to false, so

> + goto out;
> + }
> }
>
> *value = val;
> -
> - return 1;
> + ret = true;

only need to set to true here.

Thanks.

> +out:
> + kunmap_local(page);
> + return ret;
> }
>
> static void zswap_fill_page(void *ptr, unsigned long value)
> @@ -1438,6 +1448,7 @@ bool zswap_store(struct folio *folio)
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> struct zswap_entry *entry;
> + unsigned long value;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1470,19 +1481,11 @@ bool zswap_store(struct folio *folio)
> goto reject;
> }
>
> - if (zswap_same_filled_pages_enabled) {
> - unsigned long value;
> - u8 *src;
> -
> - src = kmap_local_folio(folio, 0);
> - if (zswap_is_page_same_filled(src, &value)) {
> - kunmap_local(src);
> - entry->length = 0;
> - entry->value = value;
> - atomic_inc(&zswap_same_filled_pages);
> - goto insert_entry;
> - }
> - kunmap_local(src);
> + if (zswap_is_folio_same_filled(folio, &value)) {
> + entry->length = 0;
> + entry->value = value;
> + atomic_inc(&zswap_same_filled_pages);
> + goto insert_entry;
> }
>
> if (!zswap_non_same_filled_pages_enabled)
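
For illustration, a sketch of zswap_is_folio_same_filled() with the nit
above applied (ret starts out false and is only set to true on the
success path); it is otherwise unchanged from the version in the patch
quoted above:

static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
{
	unsigned long *page;
	unsigned long val;
	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
	bool ret = false;

	if (!zswap_same_filled_pages_enabled)
		return false;

	page = kmap_local_folio(folio, 0);
	val = page[0];

	if (val != page[last_pos])
		goto out;

	for (pos = 1; pos < last_pos; pos++) {
		if (val != page[pos])
			goto out;
	}

	*value = val;
	ret = true;
out:
	kunmap_local(page);
	return ret;
}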

2024-03-27 02:42:22

by Chengming Zhou

Subject: Re: [RFC PATCH 3/9] mm: zswap: refactor limit checking from zswap_store()

On 2024/3/26 07:50, Yosry Ahmed wrote:
> Refactor limit and acceptance threshold checking outside of
> zswap_store(). This code will be moved around in a following patch, so
> it would be cleaner to move a function call around.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> mm/zswap.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ff1975afb7e3d..6b890c8590ef7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1415,6 +1415,21 @@ static void zswap_fill_page(void *ptr, unsigned long value)
> memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
> }
>
> +static bool zswap_check_limit(void)
> +{
> + unsigned long cur_pages = zswap_total_pages();
> + unsigned long thr = zswap_accept_thr_pages();
> + unsigned long max_pages = zswap_max_pages();
> +
> + if (cur_pages >= max_pages) {
> + zswap_pool_limit_hit++;
> + zswap_pool_reached_full = true;
> + } else if (zswap_pool_reached_full && cur_pages <= thr) {
> + zswap_pool_reached_full = false;
> + }
> + return !zswap_pool_reached_full;

nit: Then we use "!zswap_check_limit()" below, double negation looks complex,
should we change to zswap_should_reject() or something?

> +}
> +
> bool zswap_store(struct folio *folio)
> {
> swp_entry_t swp = folio->swap;
> @@ -1422,7 +1437,6 @@ bool zswap_store(struct folio *folio)
> struct xarray *tree = swap_zswap_tree(swp);
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> - unsigned long max_pages, cur_pages;
> struct zswap_entry *entry;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> @@ -1446,22 +1460,8 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - /* Check global limits */
> - cur_pages = zswap_total_pages();
> - max_pages = zswap_max_pages();
> -
> - if (cur_pages >= max_pages) {
> - zswap_pool_limit_hit++;
> - zswap_pool_reached_full = true;
> + if (!zswap_check_limit())
> goto reject;
> - }
> -
> - if (zswap_pool_reached_full) {
> - if (cur_pages > zswap_accept_thr_pages())
> - goto reject;
> - else
> - zswap_pool_reached_full = false;
> - }
>
> /* allocate entry */
> entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
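
For illustration, one possible shape of the zswap_should_reject() rename
suggested above (the name is only a suggestion): the body is the same as
zswap_check_limit() with the return value inverted, and the caller loses
the double negation:

static bool zswap_should_reject(void)
{
	unsigned long cur_pages = zswap_total_pages();
	unsigned long thr = zswap_accept_thr_pages();
	unsigned long max_pages = zswap_max_pages();

	if (cur_pages >= max_pages) {
		zswap_pool_limit_hit++;
		zswap_pool_reached_full = true;
	} else if (zswap_pool_reached_full && cur_pages <= thr) {
		zswap_pool_reached_full = false;
	}
	return zswap_pool_reached_full;
}

	/* in zswap_store(): */
	if (zswap_should_reject())
		goto reject;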

2024-03-27 02:44:33

by Chengming Zhou

Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On 2024/3/26 07:50, Yosry Ahmed wrote:
> There is no logical reason to refuse storing same-filled pages more
> efficiently and opt for compression. Remove the userspace knob.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

LGTM, should we also remove zswap_non_same_filled_pages_enabled?
Not sure if it has real usage...

Reviewed-by: Chengming Zhou <[email protected]>

> ---
> mm/zswap.c | 11 -----------
> 1 file changed, 11 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 498a6c5839bef..0fc27ae950c74 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -123,14 +123,6 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
> module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
> uint, 0644);
>
> -/*
> - * Enable/disable handling same-value filled pages (enabled by default).
> - * If disabled every page is considered non-same-value filled.
> - */
> -static bool zswap_same_filled_pages_enabled = true;
> -module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
> - bool, 0644);
> -
> /* Enable/disable handling non-same-value filled pages (enabled by default) */
> static bool zswap_non_same_filled_pages_enabled = true;
> module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
> @@ -1392,9 +1384,6 @@ static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value
> unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
> bool ret;
>
> - if (!zswap_same_filled_pages_enabled)
> - return false;
> -
> page = kmap_local_folio(folio, 0);
> val = page[0];
>

2024-03-27 11:25:36

by Chengming Zhou

Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On 2024/3/26 07:50, Yosry Ahmed wrote:
> The current same-filled pages handling supports pages filled with any
> repeated word-sized pattern. However, in practice, most of these should
> be zero pages anyway. Other patterns should not be nearly as common.
>
> Drop the support for non-zero same-filled pages, but keep the names of
> knobs exposed to userspace as "same_filled", which isn't entirely
> inaccurate.
>
> This yields some nice code simplification and enables a following patch
> that eliminates the need to allocate struct zswap_entry for those pages
> completely.
>
> There is also a very small performance improvement observed over 50 runs
> of kernel build test (kernbench) comparing the mean build time on a
> skylake machine when building the kernel in a cgroup v1 container with a
> 3G limit:
>
> base patched % diff
> real 70.167 69.915 -0.359%
> user 2953.068 2956.147 +0.104%
> sys 2612.811 2594.718 -0.692%
>
> This probably comes from more optimized operations like memchr_inv() and
> clear_highpage(). Note that the percentage of zero-filled pages during
> this test was only around 1.5% on average, and was not affected by this
> patch. Practical workloads could have a larger proportion of such pages
> (e.g. Johannes observed around 10% [1]), so the performance improvement
> should be larger.
>
> [1]https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>

The code looks good!

Reviewed-by: Chengming Zhou <[email protected]>

Thanks.

> ---
> mm/zswap.c | 76 ++++++++++++++----------------------------------------
> 1 file changed, 20 insertions(+), 56 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0fc27ae950c74..413d9242cf500 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -44,8 +44,8 @@
> **********************************/
> /* The number of compressed pages currently stored in zswap */
> atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> -/* The number of same-value filled pages currently stored in zswap */
> -static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);
> +/* The number of zero-filled pages currently stored in zswap */
> +static atomic_t zswap_zero_filled_pages = ATOMIC_INIT(0);
>
> /*
> * The statistics below are not protected from concurrent access for
> @@ -123,9 +123,9 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
> module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
> uint, 0644);
>
> -/* Enable/disable handling non-same-value filled pages (enabled by default) */
> -static bool zswap_non_same_filled_pages_enabled = true;
> -module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
> +/* Enable/disable handling non-zero-filled pages (enabled by default) */
> +static bool zswap_non_zero_filled_pages_enabled = true;
> +module_param_named(non_same_filled_pages_enabled, zswap_non_zero_filled_pages_enabled,
> bool, 0644);
>
> /* Number of zpools in zswap_pool (empirically determined for scalability) */
> @@ -187,11 +187,10 @@ static struct shrinker *zswap_shrinker;
> *
> * swpentry - associated swap entry, the offset indexes into the red-black tree
> * length - the length in bytes of the compressed page data. Needed during
> - * decompression. For a same value filled page length is 0, and both
> + * decompression. For a zero-filled page length is 0, and both
> * pool and lru are invalid and must be ignored.
> * pool - the zswap_pool the entry's data is in
> * handle - zpool allocation handle that stores the compressed page data
> - * value - value of the same-value filled pages which have same content
> * objcg - the obj_cgroup that the compressed memory is charged to
> * lru - handle to the pool's lru used to evict pages.
> */
> @@ -199,10 +198,7 @@ struct zswap_entry {
> swp_entry_t swpentry;
> unsigned int length;
> struct zswap_pool *pool;
> - union {
> - unsigned long handle;
> - unsigned long value;
> - };
> + unsigned long handle;
> struct obj_cgroup *objcg;
> struct list_head lru;
> };
> @@ -805,7 +801,7 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
> static void zswap_entry_free(struct zswap_entry *entry)
> {
> if (!entry->length)
> - atomic_dec(&zswap_same_filled_pages);
> + atomic_dec(&zswap_zero_filled_pages);
> else {
> zswap_lru_del(&zswap_list_lru, entry);
> zpool_free(zswap_find_zpool(entry), entry->handle);
> @@ -1377,43 +1373,17 @@ static void shrink_worker(struct work_struct *w)
> } while (zswap_total_pages() > thr);
> }
>
> -static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
> +static bool zswap_is_folio_zero_filled(struct folio *folio)
> {
> - unsigned long *page;
> - unsigned long val;
> - unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
> + unsigned long *kaddr;
> bool ret;
>
> - page = kmap_local_folio(folio, 0);
> - val = page[0];
> -
> - if (val != page[last_pos]) {
> - ret = false;
> - goto out;
> - }
> -
> - for (pos = 1; pos < last_pos; pos++) {
> - if (val != page[pos]) {
> - ret = false;
> - goto out;
> - }
> - }
> -
> - *value = val;
> - ret = true;
> -out:
> - kunmap_local(page);
> + kaddr = kmap_local_folio(folio, 0);
> + ret = !memchr_inv(kaddr, 0, PAGE_SIZE);
> + kunmap_local(kaddr);
> return ret;
> }
>
> -static void zswap_fill_page(void *ptr, unsigned long value)
> -{
> - unsigned long *page;
> -
> - page = (unsigned long *)ptr;
> - memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
> -}
> -
> static bool zswap_check_limit(void)
> {
> unsigned long cur_pages = zswap_total_pages();
> @@ -1437,7 +1407,6 @@ bool zswap_store(struct folio *folio)
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> struct zswap_entry *entry;
> - unsigned long value;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1470,14 +1439,13 @@ bool zswap_store(struct folio *folio)
> goto reject;
> }
>
> - if (zswap_is_folio_same_filled(folio, &value)) {
> + if (zswap_is_folio_zero_filled(folio)) {
> entry->length = 0;
> - entry->value = value;
> - atomic_inc(&zswap_same_filled_pages);
> + atomic_inc(&zswap_zero_filled_pages);
> goto insert_entry;
> }
>
> - if (!zswap_non_same_filled_pages_enabled)
> + if (!zswap_non_zero_filled_pages_enabled)
> goto freepage;
>
> /* if entry is successfully added, it keeps the reference */
> @@ -1532,7 +1500,7 @@ bool zswap_store(struct folio *folio)
>
> store_failed:
> if (!entry->length)
> - atomic_dec(&zswap_same_filled_pages);
> + atomic_dec(&zswap_zero_filled_pages);
> else {
> zpool_free(zswap_find_zpool(entry), entry->handle);
> put_pool:
> @@ -1563,7 +1531,6 @@ bool zswap_load(struct folio *folio)
> struct page *page = &folio->page;
> struct xarray *tree = swap_zswap_tree(swp);
> struct zswap_entry *entry;
> - u8 *dst;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> @@ -1573,11 +1540,8 @@ bool zswap_load(struct folio *folio)
>
> if (entry->length)
> zswap_decompress(entry, page);
> - else {
> - dst = kmap_local_page(page);
> - zswap_fill_page(dst, entry->value);
> - kunmap_local(dst);
> - }
> + else
> + clear_highpage(page);
>
> count_vm_event(ZSWPIN);
> if (entry->objcg)
> @@ -1679,7 +1643,7 @@ static int zswap_debugfs_init(void)
> debugfs_create_atomic_t("stored_pages", 0444,
> zswap_debugfs_root, &zswap_stored_pages);
> debugfs_create_atomic_t("same_filled_pages", 0444,
> - zswap_debugfs_root, &zswap_same_filled_pages);
> + zswap_debugfs_root, &zswap_zero_filled_pages);
>
> return 0;
> }

2024-03-27 17:23:33

by Nhat Pham

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Mon, Mar 25, 2024 at 4:50 PM Yosry Ahmed <[email protected]> wrote:
>
> The current same-filled pages handling supports pages filled with any
> repeated word-sized pattern. However, in practice, most of these should
> be zero pages anyway. Other patterns should not be nearly as common.

It'd be nice if we can verify this somehow. Maybe hooking bpftrace,
trace_printk, etc. here?

That aside, my intuition is that this is correct too. It's much less
likely to see a non-zero filled page.

>
> Drop the support for non-zero same-filled pages, but keep the names of
> knobs exposed to userspace as "same_filled", which isn't entirely
> inaccurate.
>
> This yields some nice code simplification and enables a following patch
> that eliminates the need to allocate struct zswap_entry for those pages
> completely.
>
> There is also a very small performance improvement observed over 50 runs
> of kernel build test (kernbench) comparing the mean build time on a
> skylake machine when building the kernel in a cgroup v1 container with a
> 3G limit:
>
> base patched % diff
> real 70.167 69.915 -0.359%
> user 2953.068 2956.147 +0.104%
> sys 2612.811 2594.718 -0.692%
>
> This probably comes from more optimized operations like memchr_inv() and
> clear_highpage(). Note that the percentage of zero-filled pages during

TIL clear_highpage() is a thing :)


> this test was only around 1.5% on average, and was not affected by this
> patch. Practical workloads could have a larger proportion of such pages
> (e.g. Johannes observed around 10% [1]), so the performance improvement
> should be larger.
>
> [1]https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> mm/zswap.c | 76 ++++++++++++++----------------------------------------
> 1 file changed, 20 insertions(+), 56 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0fc27ae950c74..413d9242cf500 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -44,8 +44,8 @@
> **********************************/
> /* The number of compressed pages currently stored in zswap */
> atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> -/* The number of same-value filled pages currently stored in zswap */
> -static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);
> +/* The number of zero-filled pages currently stored in zswap */
> +static atomic_t zswap_zero_filled_pages = ATOMIC_INIT(0);
>
> /*
> * The statistics below are not protected from concurrent access for
> @@ -123,9 +123,9 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
> module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
> uint, 0644);
>
> -/* Enable/disable handling non-same-value filled pages (enabled by default) */
> -static bool zswap_non_same_filled_pages_enabled = true;
> -module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
> +/* Enable/disable handling non-zero-filled pages (enabled by default) */
> +static bool zswap_non_zero_filled_pages_enabled = true;
> +module_param_named(non_same_filled_pages_enabled, zswap_non_zero_filled_pages_enabled,
> bool, 0644);
>
> /* Number of zpools in zswap_pool (empirically determined for scalability) */
> @@ -187,11 +187,10 @@ static struct shrinker *zswap_shrinker;
> *
> * swpentry - associated swap entry, the offset indexes into the red-black tree
> * length - the length in bytes of the compressed page data. Needed during
> - * decompression. For a same value filled page length is 0, and both
> + * decompression. For a zero-filled page length is 0, and both
> * pool and lru are invalid and must be ignored.
> * pool - the zswap_pool the entry's data is in
> * handle - zpool allocation handle that stores the compressed page data
> - * value - value of the same-value filled pages which have same content
> * objcg - the obj_cgroup that the compressed memory is charged to
> * lru - handle to the pool's lru used to evict pages.
> */
> @@ -199,10 +198,7 @@ struct zswap_entry {
> swp_entry_t swpentry;
> unsigned int length;
> struct zswap_pool *pool;
> - union {
> - unsigned long handle;
> - unsigned long value;
> - };
> + unsigned long handle;
> struct obj_cgroup *objcg;
> struct list_head lru;
> };
> @@ -805,7 +801,7 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
> static void zswap_entry_free(struct zswap_entry *entry)
> {
> if (!entry->length)
> - atomic_dec(&zswap_same_filled_pages);
> + atomic_dec(&zswap_zero_filled_pages);
> else {
> zswap_lru_del(&zswap_list_lru, entry);
> zpool_free(zswap_find_zpool(entry), entry->handle);
> @@ -1377,43 +1373,17 @@ static void shrink_worker(struct work_struct *w)
> } while (zswap_total_pages() > thr);
> }
>
> -static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
> +static bool zswap_is_folio_zero_filled(struct folio *folio)
> {
> - unsigned long *page;
> - unsigned long val;
> - unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
> + unsigned long *kaddr;
> bool ret;
>
> - page = kmap_local_folio(folio, 0);
> - val = page[0];
> -
> - if (val != page[last_pos]) {
> - ret = false;
> - goto out;
> - }
> -
> - for (pos = 1; pos < last_pos; pos++) {
> - if (val != page[pos]) {
> - ret = false;
> - goto out;
> - }
> - }
> -
> - *value = val;
> - ret = true;
> -out:
> - kunmap_local(page);
> + kaddr = kmap_local_folio(folio, 0);
> + ret = !memchr_inv(kaddr, 0, PAGE_SIZE);
> + kunmap_local(kaddr);
> return ret;
> }
>
> -static void zswap_fill_page(void *ptr, unsigned long value)
> -{
> - unsigned long *page;
> -
> - page = (unsigned long *)ptr;
> - memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
> -}
> -
> static bool zswap_check_limit(void)
> {
> unsigned long cur_pages = zswap_total_pages();
> @@ -1437,7 +1407,6 @@ bool zswap_store(struct folio *folio)
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> struct zswap_entry *entry;
> - unsigned long value;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1470,14 +1439,13 @@ bool zswap_store(struct folio *folio)
> goto reject;
> }
>
> - if (zswap_is_folio_same_filled(folio, &value)) {
> + if (zswap_is_folio_zero_filled(folio)) {
> entry->length = 0;
> - entry->value = value;
> - atomic_inc(&zswap_same_filled_pages);
> + atomic_inc(&zswap_zero_filled_pages);
> goto insert_entry;
> }
>
> - if (!zswap_non_same_filled_pages_enabled)
> + if (!zswap_non_zero_filled_pages_enabled)
> goto freepage;
>
> /* if entry is successfully added, it keeps the reference */
> @@ -1532,7 +1500,7 @@ bool zswap_store(struct folio *folio)
>
> store_failed:
> if (!entry->length)
> - atomic_dec(&zswap_same_filled_pages);
> + atomic_dec(&zswap_zero_filled_pages);
> else {
> zpool_free(zswap_find_zpool(entry), entry->handle);
> put_pool:
> @@ -1563,7 +1531,6 @@ bool zswap_load(struct folio *folio)
> struct page *page = &folio->page;
> struct xarray *tree = swap_zswap_tree(swp);
> struct zswap_entry *entry;
> - u8 *dst;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> @@ -1573,11 +1540,8 @@ bool zswap_load(struct folio *folio)
>
> if (entry->length)
> zswap_decompress(entry, page);
> - else {
> - dst = kmap_local_page(page);
> - zswap_fill_page(dst, entry->value);
> - kunmap_local(dst);
> - }
> + else
> + clear_highpage(page);
>
> count_vm_event(ZSWPIN);
> if (entry->objcg)
> @@ -1679,7 +1643,7 @@ static int zswap_debugfs_init(void)
> debugfs_create_atomic_t("stored_pages", 0444,
> zswap_debugfs_root, &zswap_stored_pages);
> debugfs_create_atomic_t("same_filled_pages", 0444,
> - zswap_debugfs_root, &zswap_same_filled_pages);
> + zswap_debugfs_root, &zswap_zero_filled_pages);
>
> return 0;
> }
> --
> 2.44.0.396.g6e790dbe36-goog
>

The code itself LGTM, FWIW:

Reviewed-by: Nhat Pham <[email protected]>

2024-03-27 22:30:01

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 2/9] mm: zswap: refactor storing to the tree out of zswap_store()

On Tue, Mar 26, 2024 at 7:25 PM Chengming Zhou <[email protected]> wrote:
>
> On 2024/3/26 07:50, Yosry Ahmed wrote:
> > Refactor the code that attempts storing to the xarray, handling errors,
> > and freeing stale entries into a helper. This will be reused in a
> > following patch to free other types of tree elements as well.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> > mm/zswap.c | 42 ++++++++++++++++++++++++++----------------
> > 1 file changed, 26 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 1cf3ab4b22e64..ff1975afb7e3d 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -827,6 +827,30 @@ static void zswap_entry_free(struct zswap_entry *entry)
> > atomic_dec(&zswap_stored_pages);
> > }
> >
> > +/*********************************
> > +* zswap tree functions
> > +**********************************/
> > +static int zswap_tree_store(struct xarray *tree, pgoff_t offset, void *new)
> > +{
> > + void *old;
> > + int err;
> > +
> > + old = xa_store(tree, offset, new, GFP_KERNEL);
> > + err = xa_is_err(old);
>
> Seems we should use xa_err() to return the errno; xa_is_err() just returns a bool.

Good catch. It happens to work out because returning 1 would have the
same effect as returning the errno. Will fix it in the next version.
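
For reference, a minimal sketch of the fix being discussed (the rest of the
helper is elided and the error handling details are assumptions, not the
posted code):

static int zswap_tree_store(struct xarray *tree, pgoff_t offset, void *new)
{
	void *old;
	int err;

	old = xa_store(tree, offset, new, GFP_KERNEL);
	/* xa_err() extracts the errno; xa_is_err() only returns a bool */
	err = xa_err(old);
	if (err)
		return err;

	/* ... free any stale 'old' element as in the original patch ... */
	return 0;
}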

Thanks!

2024-03-27 22:31:28

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 3/9] mm: zswap: refactor limit checking from zswap_store()

On Tue, Mar 26, 2024 at 7:42 PM Chengming Zhou <[email protected]> wrote:
>
> On 2024/3/26 07:50, Yosry Ahmed wrote:
> > Refactor limit and acceptance threshold checking outside of
> > zswap_store(). This code will be moved around in a following patch, so
> > it would be cleaner to move a function call around.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> > mm/zswap.c | 32 ++++++++++++++++----------------
> > 1 file changed, 16 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index ff1975afb7e3d..6b890c8590ef7 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1415,6 +1415,21 @@ static void zswap_fill_page(void *ptr, unsigned long value)
> > memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
> > }
> >
> > +static bool zswap_check_limit(void)
> > +{
> > + unsigned long cur_pages = zswap_total_pages();
> > + unsigned long thr = zswap_accept_thr_pages();
> > + unsigned long max_pages = zswap_max_pages();
> > +
> > + if (cur_pages >= max_pages) {
> > + zswap_pool_limit_hit++;
> > + zswap_pool_reached_full = true;
> > + } else if (zswap_pool_reached_full && cur_pages <= thr) {
> > + zswap_pool_reached_full = false;
> > + }
> > + return !zswap_pool_reached_full;
>
> nit: Then we use "!zswap_check_limit()" below; the double negation looks complex.
> Should we change it to zswap_should_reject() or something?

Good point. Will rename it in the next version.
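
A minimal sketch of the rename, i.e. the quoted helper with its return value
inverted (illustrative only, not the actual next version):

static bool zswap_should_reject(void)
{
	unsigned long cur_pages = zswap_total_pages();
	unsigned long thr = zswap_accept_thr_pages();
	unsigned long max_pages = zswap_max_pages();

	if (cur_pages >= max_pages) {
		zswap_pool_limit_hit++;
		zswap_pool_reached_full = true;
	} else if (zswap_pool_reached_full && cur_pages <= thr) {
		zswap_pool_reached_full = false;
	}
	return zswap_pool_reached_full;
}

Callers then read without the double negation:

	if (zswap_should_reject())
		goto reject;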

Thanks!

2024-03-27 22:33:05

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] mm: zswap: move more same-filled pages checks outside of zswap_store()

On Tue, Mar 26, 2024 at 7:39 PM Chengming Zhou <[email protected]> wrote:
>
> On 2024/3/26 07:50, Yosry Ahmed wrote:
> > Currently, zswap_store() checks zswap_same_filled_pages_enabled, kmaps
> > the folio, then calls zswap_is_page_same_filled() to check the folio
> > contents. Move this logic into zswap_is_page_same_filled() as well (and
> > rename it to use 'folio' while we are at it).
> >
> > This makes zswap_store() cleaner, and makes following changes to that
> > logic contained within the helper.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> LGTM with one comment below:
>
> Reviewed-by: Chengming Zhou <[email protected]>
>
> > ---
> > mm/zswap.c | 45 ++++++++++++++++++++++++---------------------
> > 1 file changed, 24 insertions(+), 21 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 6b890c8590ef7..498a6c5839bef 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1385,26 +1385,36 @@ static void shrink_worker(struct work_struct *w)
> > } while (zswap_total_pages() > thr);
> > }
> >
> > -static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
> > +static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
> > {
> > unsigned long *page;
> > unsigned long val;
> > unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
> > + bool ret;
> >
> > - page = (unsigned long *)ptr;
> > + if (!zswap_same_filled_pages_enabled)
> > + return false;
> > +
> > + page = kmap_local_folio(folio, 0);
> > val = page[0];
> >
> > - if (val != page[last_pos])
> > - return 0;
> > + if (val != page[last_pos]) {
> > + ret = false;
> > + goto out;
> > + }
> >
> > for (pos = 1; pos < last_pos; pos++) {
> > - if (val != page[pos])
> > - return 0;
> > + if (val != page[pos]) {
> > + ret = false;
>
> nit: ret can be initialized to false, so
>
> > + goto out;
> > + }
> > }
> >
> > *value = val;
> > -
> > - return 1;
> > + ret = true;
>
> only need to set to true here.

I didn't bother improving the code here because patch 6 will replace
it anyway, but I might as well do it in the next version.

Thanks!

2024-03-27 22:35:02

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Tue, Mar 26, 2024 at 7:44 PM Chengming Zhou <[email protected]> wrote:
>
> On 2024/3/26 07:50, Yosry Ahmed wrote:
> > There is no logical reason to refuse storing same-filled pages more
> > efficiently and opt for compression. Remove the userspace knob.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> LGTM, should we also remove zswap_non_same_filled_pages_enabled?
> Not sure if it has real usage...

I am not aware of usages, but in theory you can use it if you
exclusively use zswap to optimize swapping zero-filled pages. Not sure
if anyone actually does that though. We can remove it if everyone else
agrees.

>
> Reviewed-by: Chengming Zhou <[email protected]>

Thanks!

2024-03-27 22:39:27

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Wed, Mar 27, 2024 at 9:41 AM Nhat Pham <[email protected]> wrote:
>
> On Mon, Mar 25, 2024 at 4:50 PM Yosry Ahmed <[email protected]> wrote:
> >
> > The current same-filled pages handling supports pages filled with any
> > repeated word-sized pattern. However, in practice, most of these should
> > be zero pages anyway. Other patterns should not be nearly as common.
>
> It'd be nice if we can verify this somehow. Maybe hooking bpftrace,
> trace_printk, etc. here?

I am trying to do that. Unfortunately collecting this data from our
fleet is not easy, so it will take some time to figure out. If the
data happens to be easy-ish to collect from your fleet that would be
awesome :)

>
> That aside, my intuition is that this is correct too. It's much less
> likely to see a non-zero filled page.
>
> >
> > Drop the support for non-zero same-filled pages, but keep the names of
> > knobs exposed to userspace as "same_filled", which isn't entirely
> > inaccurate.
> >
> > This yields some nice code simplification and enables a following patch
> > that eliminates the need to allocate struct zswap_entry for those pages
> > completely.
> >
> > There is also a very small performance improvement observed over 50 runs
> > of kernel build test (kernbench) comparing the mean build time on a
> > skylake machine when building the kernel in a cgroup v1 container with a
> > 3G limit:
> >
> > base patched % diff
> > real 70.167 69.915 -0.359%
> > user 2953.068 2956.147 +0.104%
> > sys 2612.811 2594.718 -0.692%
> >
> > This probably comes from more optimized operations like memchr_inv() and
> > clear_highpage(). Note that the percentage of zero-filled pages during
>
> TIL clear_highpage() is a thing :)
>
>
[..]
>
> The code itself LGTM, FWIW:
>
> Reviewed-by: Nhat Pham <[email protected]>

Thanks!

2024-03-28 08:14:27

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] mm: zswap: store zero-filled pages without a zswap_entry

On 2024/3/26 07:50, Yosry Ahmed wrote:
> After the rbtree to xarray conversion, and dropping zswap_entry.refcount
> and zswap_entry.value, the only members of zswap_entry utilized by
> zero-filled pages are zswap_entry.length (always 0) and
> zswap_entry.objcg. Store the objcg pointer directly in the xarray as a
> tagged pointer and avoid allocating a zswap_entry completely for
> zero-filled pages.
>
> This simplifies the code as we no longer need to special case
> zero-length cases. We are also able to further separate the zero-filled
> pages handling logic and completely isolate them within store/load
> helpers. Tagged xarray pointers are handled in these two helpers, as
> well as in the newly introduced helper for freeing tree elements,
> zswap_tree_free_element().
>
> There is also a small performance improvement observed over 50 runs of
> kernel build test (kernbench) comparing the mean build time on a skylake
> machine when building the kernel in a cgroup v1 container with a 3G
> limit. This is on top of the improvement from dropping support for
> non-zero same-filled pages:
>
> base patched % diff
> real 69.915 69.757 -0.229%
> user 2956.147 2955.244 -0.031%
> sys 2594.718 2575.747 -0.731%
>
> This probably comes from avoiding the zswap_entry allocation and
> cleanup/freeing for zero-filled pages. Note that the percentage of
> zero-filled pages during this test was only around 1.5% on average.
> Practical workloads could have a larger proportion of such pages (e.g.
> Johannes observed around 10% [1]), so the performance improvement should
> be larger.
>
> This change also saves a small amount of memory due to fewer allocated
> zswap_entry's. In the kernel build test above, we save around 2M of
> slab usage when we swap out 3G to zswap.
>
> [1]https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>

The code looks good, just one comment below.

Reviewed-by: Chengming Zhou <[email protected]>

> ---
> mm/zswap.c | 137 ++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 78 insertions(+), 59 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 413d9242cf500..efc323bab2f22 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -183,12 +183,11 @@ static struct shrinker *zswap_shrinker;
> * struct zswap_entry
> *
[..]
>
> @@ -1531,26 +1552,27 @@ bool zswap_load(struct folio *folio)
> struct page *page = &folio->page;
> struct xarray *tree = swap_zswap_tree(swp);
> struct zswap_entry *entry;
> + struct obj_cgroup *objcg;
> + void *elem;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> - entry = xa_erase(tree, offset);
> - if (!entry)
> + elem = xa_erase(tree, offset);
> + if (!elem)
> return false;
>
> - if (entry->length)
> + if (!zswap_load_zero_filled(elem, page, &objcg)) {
> + entry = elem;

nit: the entry variable seems unnecessary now.

> + objcg = entry->objcg;
> zswap_decompress(entry, page);
> - else
> - clear_highpage(page);
> + }
>
> count_vm_event(ZSWPIN);
> - if (entry->objcg)
> - count_objcg_event(entry->objcg, ZSWPIN);
> -
> - zswap_entry_free(entry);
> + if (objcg)
> + count_objcg_event(objcg, ZSWPIN);
>
> + zswap_tree_free_element(elem);
> folio_mark_dirty(folio);
> -
> return true;
> }
[..]

2024-03-28 08:16:20

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH 8/9] mm: zswap: do not check the global limit for zero-filled pages

On 2024/3/26 07:50, Yosry Ahmed wrote:
> When storing zero-filled pages, there is no point in checking the global
> zswap limit. These pages do not consume any memory that contributes
> toward the limit. Move the limit checking after zero-filled pages are
> handled.
>
> This avoids having zero-filled pages skip zswap and go to disk swap if
> the limit is hit. It also avoids queueing the shrink worker, which may
> end up being unnecessary if the zswap usage goes down on its own before
> another store is attempted.
>
> Ignoring the memcg limits as well for zero-filled pages is more
> controversial. Those limits are more a matter of per-workload policy.
> Some workloads disable zswap completely by setting memory.zswap.max = 0,
> and those workloads could start observing some zswap activity even after
> disabling zswap. Although harmless, this could cause confusion to
> userspace. Remain conservative and keep respecting those limits.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

Yeah, it looks reasonable to keep the memcg limits check.

Reviewed-by: Chengming Zhou <[email protected]>

Thanks.

> ---
> mm/zswap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index efc323bab2f22..9357328d940af 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1460,9 +1460,6 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - if (!zswap_check_limit())
> - goto reject;
> -
> if (zswap_is_folio_zero_filled(folio)) {
> if (zswap_store_zero_filled(tree, offset, objcg))
> goto reject;
> @@ -1472,6 +1469,9 @@ bool zswap_store(struct folio *folio)
> if (!zswap_non_zero_filled_pages_enabled)
> goto reject;
>
> + if (!zswap_check_limit())
> + goto reject;
> +
> entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> if (!entry) {
> zswap_reject_kmemcache_fail++;

2024-03-28 08:31:54

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] mm: zswap: use zswap_entry_free() for partially initialized entries

On 2024/3/26 07:50, Yosry Ahmed wrote:
> zswap_entry_free() performs four types of cleanups before freeing a
> zswap_entry:
> - Deletes the entry from the LRU.
> - Frees compressed memory.
> - Puts the pool reference.
> - Uncharges the compressed memory and puts the objcg.
>
> zswap_entry_free() always expects a fully initialized entry. Allow
> zswap_entry_free() to handle partially initialized entries by making it
> possible to identify what cleanups are needed as follows:
> - Allocate entries with __GFP_ZERO and initialize zswap_entry.lru when
> the entry is allocated. Pointers are NULL and length is zero upon
> initialization.
> - Use zswap_entry.length to identify if there is compressed memory to
> free. This is possible now that zero-filled pages are handled
> separately, so a length of zero means we did not successfully compress
> the page.
> - Only initialize entry->objcg after the memory is charged in
> zswap_store().
>
> With this in place, use zswap_entry_free() in the failure path of
> zswap_store() to cleanup partially initialized entries. This simplifies
> the cleanup code in zswap_store(). While we are at it, rename the
> remaining cleanup labels to more meaningful names.

Not sure if it's worthwhile to clean up the fail path while the normal path
gets a little verbose.

>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> mm/zswap.c | 62 ++++++++++++++++++++++++++----------------------------
> 1 file changed, 30 insertions(+), 32 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 9357328d940af..c50f9df230ca3 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -774,12 +774,13 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> **********************************/
> static struct kmem_cache *zswap_entry_cache;
>
> -static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> +static struct zswap_entry *zswap_entry_cache_alloc(int nid)
> {
> struct zswap_entry *entry;
> - entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> - if (!entry)
> - return NULL;
> + entry = kmem_cache_alloc_node(zswap_entry_cache,
> + GFP_KERNEL | __GFP_ZERO, nid);
> + if (entry)
> + INIT_LIST_HEAD(&entry->lru);
> return entry;
> }
>
> @@ -795,9 +796,12 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
>
> static void zswap_entry_free(struct zswap_entry *entry)
> {
> - zswap_lru_del(&zswap_list_lru, entry);
> - zpool_free(zswap_find_zpool(entry), entry->handle);
> - zswap_pool_put(entry->pool);
> + if (!list_empty(&entry->lru))
> + zswap_lru_del(&zswap_list_lru, entry);
> + if (entry->length)
> + zpool_free(zswap_find_zpool(entry), entry->handle);
> + if (entry->pool)
> + zswap_pool_put(entry->pool);
> if (entry->objcg) {
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> obj_cgroup_put(entry->objcg);
> @@ -1447,7 +1451,7 @@ bool zswap_store(struct folio *folio)
> return false;
>
> if (!zswap_enabled)
> - goto check_old;
> + goto erase_old;
>
> /* Check cgroup limits */
> objcg = get_obj_cgroup_from_folio(folio);
> @@ -1455,54 +1459,52 @@ bool zswap_store(struct folio *folio)
> memcg = get_mem_cgroup_from_objcg(objcg);
> if (shrink_memcg(memcg)) {
> mem_cgroup_put(memcg);
> - goto reject;
> + goto put_objcg;
> }
> mem_cgroup_put(memcg);
> }
>
> if (zswap_is_folio_zero_filled(folio)) {
> if (zswap_store_zero_filled(tree, offset, objcg))
> - goto reject;
> + goto put_objcg;
> goto stored;
> }
>
> if (!zswap_non_zero_filled_pages_enabled)
> - goto reject;
> + goto put_objcg;
>
> if (!zswap_check_limit())
> - goto reject;
> + goto put_objcg;
>
> - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> + entry = zswap_entry_cache_alloc(folio_nid(folio));
> if (!entry) {
> zswap_reject_kmemcache_fail++;
> - goto reject;
> + goto put_objcg;
> }
>
> - /* if entry is successfully added, it keeps the reference */
> entry->pool = zswap_pool_current_get();
> if (!entry->pool)
> - goto freepage;
> + goto free_entry;
>
> if (objcg) {
> memcg = get_mem_cgroup_from_objcg(objcg);
> if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> mem_cgroup_put(memcg);
> - goto put_pool;
> + goto free_entry;
> }
> mem_cgroup_put(memcg);
> }
>
> if (!zswap_compress(folio, entry))
> - goto put_pool;
> -
> - entry->swpentry = swp;
> - entry->objcg = objcg;
> + goto free_entry;
>
> if (zswap_tree_store(tree, offset, entry))
> - goto store_failed;
> + goto free_entry;
>
> - if (objcg)
> + if (objcg) {
> obj_cgroup_charge_zswap(objcg, entry->length);
> + entry->objcg = objcg;
> + }
>
> /*
> * We finish initializing the entry while it's already in xarray.
> @@ -1514,7 +1516,7 @@ bool zswap_store(struct folio *folio)
> * The publishing order matters to prevent writeback from seeing
> * an incoherent entry.
> */
> - INIT_LIST_HEAD(&entry->lru);
> + entry->swpentry = swp;
> zswap_lru_add(&zswap_list_lru, entry);
>
> stored:
> @@ -1525,17 +1527,13 @@ bool zswap_store(struct folio *folio)
>
> return true;
>
> -store_failed:
> - zpool_free(zswap_find_zpool(entry), entry->handle);
> -put_pool:
> - zswap_pool_put(entry->pool);
> -freepage:
> - zswap_entry_cache_free(entry);
> -reject:
> +free_entry:
> + zswap_entry_free(entry);
> +put_objcg:
> obj_cgroup_put(objcg);
> if (zswap_pool_reached_full)
> queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +erase_old:
> /*
> * If the zswap store fails or zswap is disabled, we must invalidate the
> * possibly stale entry which was previously stored at this offset.

2024-03-28 18:46:16

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] mm: zswap: store zero-filled pages without a zswap_entry

On Thu, Mar 28, 2024 at 1:12 AM Chengming Zhou <[email protected]> wrote:
>
> On 2024/3/26 07:50, Yosry Ahmed wrote:
> > After the rbtree to xarray conversion, and dropping zswap_entry.refcount
> > and zswap_entry.value, the only members of zswap_entry utilized by
> > zero-filled pages are zswap_entry.length (always 0) and
> > zswap_entry.objcg. Store the objcg pointer directly in the xarray as a
> > tagged pointer and avoid allocating a zswap_entry completely for
> > zero-filled pages.
> >
> > This simplifies the code as we no longer need to special case
> > zero-length cases. We are also able to further separate the zero-filled
> > pages handling logic and completely isolate them within store/load
> > helpers. Tagged xarray pointers are handled in these two helpers, as
> > well as in the newly introduced helper for freeing tree elements,
> > zswap_tree_free_element().
> >
> > There is also a small performance improvement observed over 50 runs of
> > kernel build test (kernbench) comparing the mean build time on a skylake
> > machine when building the kernel in a cgroup v1 container with a 3G
> > limit. This is on top of the improvement from dropping support for
> > non-zero same-filled pages:
> >
> > base patched % diff
> > real 69.915 69.757 -0.229%
> > user 2956.147 2955.244 -0.031%
> > sys 2594.718 2575.747 -0.731%
> >
> > This probably comes from avoiding the zswap_entry allocation and
> > cleanup/freeing for zero-filled pages. Note that the percentage of
> > zero-filled pages during this test was only around 1.5% on average.
> > Practical workloads could have a larger proportion of such pages (e.g.
> > Johannes observed around 10% [1]), so the performance improvement should
> > be larger.
> >
> > This change also saves a small amount of memory due to fewer allocated
> > zswap_entry's. In the kernel build test above, we save around 2M of
> > slab usage when we swap out 3G to zswap.
> >
> > [1]https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> The code looks good, just one comment below.
>
> Reviewed-by: Chengming Zhou <[email protected]>

Thanks!

>
> > ---
> > mm/zswap.c | 137 ++++++++++++++++++++++++++++++-----------------------
> > 1 file changed, 78 insertions(+), 59 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 413d9242cf500..efc323bab2f22 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -183,12 +183,11 @@ static struct shrinker *zswap_shrinker;
> > * struct zswap_entry
> > *
> [..]
> >
> > @@ -1531,26 +1552,27 @@ bool zswap_load(struct folio *folio)
> > struct page *page = &folio->page;
> > struct xarray *tree = swap_zswap_tree(swp);
> > struct zswap_entry *entry;
> > + struct obj_cgroup *objcg;
> > + void *elem;
> >
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >
> > - entry = xa_erase(tree, offset);
> > - if (!entry)
> > + elem = xa_erase(tree, offset);
> > + if (!elem)
> > return false;
> >
> > - if (entry->length)
> > + if (!zswap_load_zero_filled(elem, page, &objcg)) {
> > + entry = elem;
>
> nit: the entry variable seems unnecessary now.

I left it here on purpose to avoid casting elem in the next two lines;
it is just more aesthetic.

>
> > + objcg = entry->objcg;
> > zswap_decompress(entry, page);
> > - else
> > - clear_highpage(page);
> > + }
> >
> > count_vm_event(ZSWPIN);
> > - if (entry->objcg)
> > - count_objcg_event(entry->objcg, ZSWPIN);
> > -
> > - zswap_entry_free(entry);
> > + if (objcg)
> > + count_objcg_event(objcg, ZSWPIN);
> >
> > + zswap_tree_free_element(elem);
> > folio_mark_dirty(folio);
> > -
> > return true;
> > }
> [..]

2024-03-28 18:49:53

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] mm: zswap: use zswap_entry_free() for partially initialized entries

On Thu, Mar 28, 2024 at 1:31 AM Chengming Zhou <[email protected]> wrote:
>
> On 2024/3/26 07:50, Yosry Ahmed wrote:
> > zswap_entry_free() performs four types of cleanups before freeing a
> > zswap_entry:
> > - Deletes the entry from the LRU.
> > - Frees compressed memory.
> > - Puts the pool reference.
> > - Uncharges the compressed memory and puts the objcg.
> >
> > zswap_entry_free() always expects a fully initialized entry. Allow
> > zswap_entry_free() to handle partially initialized entries by making it
> > possible to identify what cleanups are needed as follows:
> > - Allocate entries with __GFP_ZERO and initialize zswap_entry.lru when
> > the entry is allocated. Pointers are NULL and length is zero upon
> > initialization.
> > - Use zswap_entry.length to identify if there is compressed memory to
> > free. This is possible now that zero-filled pages are handled
> > separately, so a length of zero means we did not successfully compress
> > the page.
> > - Only initialize entry->objcg after the memory is charged in
> > zswap_store().
> >
> > With this in place, use zswap_entry_free() in the failure path of
> > zswap_store() to cleanup partially initialized entries. This simplifies
> > the cleanup code in zswap_store(). While we are at it, rename the
> > remaining cleanup labels to more meaningful names.
>
> Not sure if it's worthwhile to clean up the fail path while the normal path
> gets a little verbose.

I was on the fence about this, so I thought I would just send it and
see what others think.

On one hand it makes the initialization more robust because the
zswap_entry is always in a clean identifiable state, but on the other
hand it adds churn to the normal path, as you noticed. Also, after
removing the handling of zero-length entries from the failure path, it
isn't that bad without this patch anyway.

So if no one else thinks this is useful, I will drop the patch in the
next version.

Thanks!

>
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> > mm/zswap.c | 62 ++++++++++++++++++++++++++----------------------------
> > 1 file changed, 30 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 9357328d940af..c50f9df230ca3 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -774,12 +774,13 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > **********************************/
> > static struct kmem_cache *zswap_entry_cache;
> >
> > -static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> > +static struct zswap_entry *zswap_entry_cache_alloc(int nid)
> > {
> > struct zswap_entry *entry;
> > - entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> > - if (!entry)
> > - return NULL;
> > + entry = kmem_cache_alloc_node(zswap_entry_cache,
> > + GFP_KERNEL | __GFP_ZERO, nid);
> > + if (entry)
> > + INIT_LIST_HEAD(&entry->lru);
> > return entry;
> > }
> >
> > @@ -795,9 +796,12 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
> >
> > static void zswap_entry_free(struct zswap_entry *entry)
> > {
> > - zswap_lru_del(&zswap_list_lru, entry);
> > - zpool_free(zswap_find_zpool(entry), entry->handle);
> > - zswap_pool_put(entry->pool);
> > + if (!list_empty(&entry->lru))
> > + zswap_lru_del(&zswap_list_lru, entry);
> > + if (entry->length)
> > + zpool_free(zswap_find_zpool(entry), entry->handle);
> > + if (entry->pool)
> > + zswap_pool_put(entry->pool);
> > if (entry->objcg) {
> > obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> > obj_cgroup_put(entry->objcg);
> > @@ -1447,7 +1451,7 @@ bool zswap_store(struct folio *folio)
> > return false;
> >
> > if (!zswap_enabled)
> > - goto check_old;
> > + goto erase_old;
> >
> > /* Check cgroup limits */
> > objcg = get_obj_cgroup_from_folio(folio);
> > @@ -1455,54 +1459,52 @@ bool zswap_store(struct folio *folio)
> > memcg = get_mem_cgroup_from_objcg(objcg);
> > if (shrink_memcg(memcg)) {
> > mem_cgroup_put(memcg);
> > - goto reject;
> > + goto put_objcg;
> > }
> > mem_cgroup_put(memcg);
> > }
> >
> > if (zswap_is_folio_zero_filled(folio)) {
> > if (zswap_store_zero_filled(tree, offset, objcg))
> > - goto reject;
> > + goto put_objcg;
> > goto stored;
> > }
> >
> > if (!zswap_non_zero_filled_pages_enabled)
> > - goto reject;
> > + goto put_objcg;
> >
> > if (!zswap_check_limit())
> > - goto reject;
> > + goto put_objcg;
> >
> > - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > + entry = zswap_entry_cache_alloc(folio_nid(folio));
> > if (!entry) {
> > zswap_reject_kmemcache_fail++;
> > - goto reject;
> > + goto put_objcg;
> > }
> >
> > - /* if entry is successfully added, it keeps the reference */
> > entry->pool = zswap_pool_current_get();
> > if (!entry->pool)
> > - goto freepage;
> > + goto free_entry;
> >
> > if (objcg) {
> > memcg = get_mem_cgroup_from_objcg(objcg);
> > if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > mem_cgroup_put(memcg);
> > - goto put_pool;
> > + goto free_entry;
> > }
> > mem_cgroup_put(memcg);
> > }
> >
> > if (!zswap_compress(folio, entry))
> > - goto put_pool;
> > -
> > - entry->swpentry = swp;
> > - entry->objcg = objcg;
> > + goto free_entry;
> >
> > if (zswap_tree_store(tree, offset, entry))
> > - goto store_failed;
> > + goto free_entry;
> >
> > - if (objcg)
> > + if (objcg) {
> > obj_cgroup_charge_zswap(objcg, entry->length);
> > + entry->objcg = objcg;
> > + }
> >
> > /*
> > * We finish initializing the entry while it's already in xarray.
> > @@ -1514,7 +1516,7 @@ bool zswap_store(struct folio *folio)
> > * The publishing order matters to prevent writeback from seeing
> > * an incoherent entry.
> > */
> > - INIT_LIST_HEAD(&entry->lru);
> > + entry->swpentry = swp;
> > zswap_lru_add(&zswap_list_lru, entry);
> >
> > stored:
> > @@ -1525,17 +1527,13 @@ bool zswap_store(struct folio *folio)
> >
> > return true;
> >
> > -store_failed:
> > - zpool_free(zswap_find_zpool(entry), entry->handle);
> > -put_pool:
> > - zswap_pool_put(entry->pool);
> > -freepage:
> > - zswap_entry_cache_free(entry);
> > -reject:
> > +free_entry:
> > + zswap_entry_free(entry);
> > +put_objcg:
> > obj_cgroup_put(objcg);
> > if (zswap_pool_reached_full)
> > queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +erase_old:
> > /*
> > * If the zswap store fails or zswap is disabled, we must invalidate the
> > * possibly stale entry which was previously stored at this offset.

2024-03-28 19:09:11

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full

On Mon, Mar 25, 2024 at 11:50:09PM +0000, Yosry Ahmed wrote:
> The cleanup code in zswap_store() is not pretty, particularly the
> 'shrink' label at the bottom that ends up jumping between cleanup
> labels.
>
> Instead of having a dedicated label to shrink the pool, just use
> zswap_pool_reached_full directly to figure out if the pool needs
> shrinking. zswap_pool_reached_full should be true if and only if the
> pool needs shrinking.
>
> The only caveat is that the value of zswap_pool_reached_full may be
> changed by concurrent zswap_store() calls between checking the limit and
> testing zswap_pool_reached_full in the cleanup code. This is fine
> because:
> - If zswap_pool_reached_full was true during limit checking then became
> false during the cleanup code, then someone else already took care of
> shrinking the pool and there is no need to queue the worker. That
> would be a good change.
> - If zswap_pool_reached_full was false during limit checking then became
> true during the cleanup code, then someone else hit the limit
> meanwhile. In this case, both threads will try to queue the worker,
> but it never gets queued more than once anyway. Also, calling
> queue_work() multiple times when the limit is hit could already happen
> today, so this isn't a significant change in any way.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2024-03-28 19:11:21

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
> There is no logical reason to refuse storing same-filled pages more
> efficiently and opt for compression. Remove the userspace knob.
>
> Signed-off-by: Yosry Ahmed <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

I also think the non_same_filled_pages_enabled option should go
away. Both of these tunables are pretty bizarre.

2024-03-28 19:32:05

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> The current same-filled pages handling supports pages filled with any
> repeated word-sized pattern. However, in practice, most of these should
> be zero pages anyway. Other patterns should not be nearly as common.
>
> Drop the support for non-zero same-filled pages, but keep the names of
> knobs exposed to userspace as "same_filled", which isn't entirely
> inaccurate.
>
> This yields some nice code simplification and enables a following patch
> that eliminates the need to allocate struct zswap_entry for those pages
> completely.
>
> There is also a very small performance improvement observed over 50 runs
> of kernel build test (kernbench) comparing the mean build time on a
> skylake machine when building the kernel in a cgroup v1 container with a
> 3G limit:
>
> base patched % diff
> real 70.167 69.915 -0.359%
> user 2953.068 2956.147 +0.104%
> sys 2612.811 2594.718 -0.692%
>
> This probably comes from more optimized operations like memchr_inv() and
> clear_highpage(). Note that the percentage of zero-filled pages during
> this test was only around 1.5% on average, and was not affected by this
> patch. Practical workloads could have a larger proportion of such pages
> (e.g. Johannes observed around 10% [1]), so the performance improvement
> should be larger.
>
> [1]https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>

This is an interesting direction to pursue, but I actually think it
doesn't go far enough. Either way, I think it needs more data.

1) How frequent are non-zero-same-filled pages? Difficult to
generalize, but if you could gather some from your fleet, that
would be useful. If you can devise a portable strategy, I'd also be
more than happy to gather this on ours (although I think you have
more widespread zswap use, whereas we have more disk swap.)

2) The fact that we're doing any of this pattern analysis in zswap at
all strikes me as a bit misguided. Being efficient about repetitive
patterns is squarely in the domain of a compression algorithm. Do
we not trust e.g. zstd to handle this properly?

I'm guessing this goes back to inefficient packing from something
like zbud, which would waste half a page on one repeating byte.

But zsmalloc can do 32 byte objects. It's also a batching slab
allocator, where storing a series of small, same-sized objects is
quite fast.

Add to that the additional branches, the additional kmap, the extra
scanning of every single page for patterns - all in the fast path
of zswap, when we already know that the vast majority of incoming
pages will need to be properly compressed anyway.

Maybe it's time to get rid of the special handling entirely?

2024-03-28 19:39:18

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] mm: zswap: store zero-filled pages without a zswap_entry

On Mon, Mar 25, 2024 at 11:50:15PM +0000, Yosry Ahmed wrote:
> After the rbtree to xarray conversion, and dropping zswap_entry.refcount
> and zswap_entry.value, the only members of zswap_entry utilized by
> zero-filled pages are zswap_entry.length (always 0) and
> zswap_entry.objcg. Store the objcg pointer directly in the xarray as a
> tagged pointer and avoid allocating a zswap_entry completely for
> zero-filled pages.
>
> This simplifies the code as we no longer need to special case
> zero-length cases. We are also able to further separate the zero-filled
> pages handling logic and completely isolate them within store/load
> helpers. Tagged xarray pointers are handled in these two helpers, as
> well as in the newly introduced helper for freeing tree elements,
> zswap_tree_free_element().
>
> There is also a small performance improvement observed over 50 runs of
> kernel build test (kernbench) comparing the mean build time on a skylake
> machine when building the kernel in a cgroup v1 container with a 3G
> limit. This is on top of the improvement from dropping support for
> non-zero same-filled pages:
>
> base patched % diff
> real 69.915 69.757 -0.229%
> user 2956.147 2955.244 -0.031%
> sys 2594.718 2575.747 -0.731%
>
> This probably comes from avoiding the zswap_entry allocation and
> cleanup/freeing for zero-filled pages. Note that the percentage of
> zero-filled pages during this test was only around 1.5% on average.
> Practical workloads could have a larger proportion of such pages (e.g.
> Johannes observed around 10% [1]), so the performance improvement should
> be larger.
>
> This change also saves a small amount of memory due to fewer allocated
> zswap_entry's. In the kernel build test above, we save around 2M of
> slab usage when we swap out 3G to zswap.
>
> [1]https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> mm/zswap.c | 137 ++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 78 insertions(+), 59 deletions(-)

Tbh, I think this makes the code worse overall. Instead of increasing
type safety, it actually reduces it, and places that previously dealt
with a struct zswap_entry now deal with a void *.

If we go ahead with this series, I think it makes more sense to either

a) do the explicit subtyping of struct zswap_entry I had proposed, or

b) at least keep the special handling of the xarray entry as local
as possible, so that instead of a proliferating API that operates
on void *, you have explicit filtering only where the tree is
accessed, and then work on struct zswap_entry as much as possible.

2024-03-28 20:07:02

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Thu, Mar 28, 2024 at 12:11 PM Johannes Weiner <[email protected]> wrote:
>
> On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
> > There is no logical reason to refuse storing same-filled pages more
> > efficiently and opt for compression. Remove the userspace knob.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>
>
> I also think the non_same_filled_pages_enabled option should go
> away. Both of these tunables are pretty bizarre.

Happy to remove both in the next version :)

Thanks!

2024-03-28 20:26:02

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
>
> On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > The current same-filled pages handling supports pages filled with any
> > repeated word-sized pattern. However, in practice, most of these should
> > be zero pages anyway. Other patterns should not be nearly as common.
> >
> > Drop the support for non-zero same-filled pages, but keep the names of
> > knobs exposed to userspace as "same_filled", which isn't entirely
> > inaccurate.
> >
> > This yields some nice code simplification and enables a following patch
> > that eliminates the need to allocate struct zswap_entry for those pages
> > completely.
> >
> > There is also a very small performance improvement observed over 50 runs
> > of kernel build test (kernbench) comparing the mean build time on a
> > skylake machine when building the kernel in a cgroup v1 container with a
> > 3G limit:
> >
> > base patched % diff
> > real 70.167 69.915 -0.359%
> > user 2953.068 2956.147 +0.104%
> > sys 2612.811 2594.718 -0.692%
> >
> > This probably comes from more optimized operations like memchr_inv() and
> > clear_highpage(). Note that the percentage of zero-filled pages during
> > this test was only around 1.5% on average, and was not affected by this
> > patch. Practical workloads could have a larger proportion of such pages
> > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > should be larger.
> >
> > [1]https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> This is an interesting direction to pursue, but I actually think it
> doesn't go far enough. Either way, I think it needs more data.
>
> 1) How frequent are non-zero-same-filled pages? Difficult to
> generalize, but if you could gather some from your fleet, that
> would be useful. If you can devise a portable strategy, I'd also be
> more than happy to gather this on ours (although I think you have
> more widespread zswap use, whereas we have more disk swap.)

I am trying to collect the data, but there are.. hurdles. It would
take some time, so I was hoping the data could be collected elsewhere
if possible.

The idea I had was to hook a BPF program to the entry of
zswap_fill_page() and create a histogram of the "value" argument. We
would get more coverage by hooking it to the return of
zswap_is_page_same_filled() and only updating the histogram if the
return value is true, as it includes pages in zswap that haven't been
swapped in.

However, with zswap_is_page_same_filled() the BPF program will run in
all zswap stores, whereas for zswap_fill_page() it will only run when
needed. Not sure if this makes a practical difference tbh.
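
For the zswap_fill_page() variant, a one-liner along these lines should be
enough (a sketch, assuming zswap_fill_page() is not inlined and can be
kprobed on the target kernel; arg1 is the 'value' argument):

	bpftrace -e 'kprobe:zswap_fill_page { @patterns[arg1] = count(); }'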

>
> 2) The fact that we're doing any of this pattern analysis in zswap at
> all strikes me as a bit misguided. Being efficient about repetitive
> patterns is squarely in the domain of a compression algorithm. Do
> we not trust e.g. zstd to handle this properly?

I thought about this briefly, but I didn't follow through. I could try
to collect some data by swapping out different patterns and observing
how different compression algorithms react. That would be interesting
for sure.
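
A quick userspace approximation could be a starting point, e.g. this sketch
compressing a 4K page filled with a repeating word through plain libzstd
(not kernel code; the pattern and compression level are arbitrary):

/* Build with: gcc zstd_pattern_test.c -o zstd_pattern_test -lzstd */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

#define PAGE_SIZE 4096

int main(void)
{
	unsigned long pattern = 0xdeadbeefUL;	/* arbitrary non-zero word */
	unsigned char page[PAGE_SIZE];
	size_t bound = ZSTD_compressBound(PAGE_SIZE);
	unsigned char *dst = malloc(bound);
	size_t csize, i;

	if (!dst)
		return 1;

	/* Fill the page with a repeating word-sized pattern, like the pages
	 * zswap_is_page_same_filled() currently detects. */
	for (i = 0; i < PAGE_SIZE; i += sizeof(pattern))
		memcpy(page + i, &pattern, sizeof(pattern));

	csize = ZSTD_compress(dst, bound, page, PAGE_SIZE, 3);
	if (ZSTD_isError(csize)) {
		fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(csize));
		free(dst);
		return 1;
	}
	printf("4K same-filled page compressed to %zu bytes\n", csize);
	free(dst);
	return 0;
}

This would only show how well the pattern compresses, not the cycles spent,
but it should be enough to tell whether the compressed object stays tiny.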

>
> I'm guessing this goes back to inefficient packing from something
> like zbud, which would waste half a page on one repeating byte.
>
> But zsmalloc can do 32 byte objects. It's also a batching slab
> allocator, where storing a series of small, same-sized objects is
> quite fast.
>
> Add to that the additional branches, the additional kmap, the extra
> scanning of every single page for patterns - all in the fast path
> of zswap, when we already know that the vast majority of incoming
> pages will need to be properly compressed anyway.
>
> Maybe it's time to get rid of the special handling entirely?

We would still be wasting some memory (~96 bytes between zswap_entry
and zsmalloc object), and wasting cycles allocating them. This could
be made up for by cycles saved by removing the handling. We will be
saving some branches for sure. I am not worried about kmap as I think
it's a noop in most cases.

I am interested to see how much we could save by removing scanning for
patterns. We may not save much if we abort after reading a few words
in most cases, but I guess we could also be scanning a considerable
amount before aborting. On the other hand, we would be reading the
page contents into cache anyway for compression, so maybe it doesn't
really matter?

I will try to collect some data about this. I will start by trying to
find out how the compression algorithms handle same-filled pages. If
they can compress it efficiently, then I will try to get more data on
the tradeoff from removing the handling.

Thanks for the insights.

2024-03-28 20:30:23

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] mm: zswap: store zero-filled pages without a zswap_entry

On Thu, Mar 28, 2024 at 12:38 PM Johannes Weiner <[email protected]> wrote:
>
> On Mon, Mar 25, 2024 at 11:50:15PM +0000, Yosry Ahmed wrote:
> > After the rbtree to xarray conversion, and dropping zswap_entry.refcount
> > and zswap_entry.value, the only members of zswap_entry utilized by
> > zero-filled pages are zswap_entry.length (always 0) and
> > zswap_entry.objcg. Store the objcg pointer directly in the xarray as a
> > tagged pointer and avoid allocating a zswap_entry completely for
> > zero-filled pages.
> >
> > This simplifies the code as we no longer need to special case
> > zero-length cases. We are also able to further separate the zero-filled
> > pages handling logic and completely isolate them within store/load
> > helpers. Tagged xarray pointers are handled in these two helpers, as
> > well as in the newly introduced helper for freeing tree elements,
> > zswap_tree_free_element().
> >
> > There is also a small performance improvement observed over 50 runs of
> > kernel build test (kernbench) comparing the mean build time on a skylake
> > machine when building the kernel in a cgroup v1 container with a 3G
> > limit. This is on top of the improvement from dropping support for
> > non-zero same-filled pages:
> >
> > base patched % diff
> > real 69.915 69.757 -0.229%
> > user 2956.147 2955.244 -0.031%
> > sys 2594.718 2575.747 -0.731%
> >
> > This probably comes from avoiding the zswap_entry allocation and
> > cleanup/freeing for zero-filled pages. Note that the percentage of
> > zero-filled pages during this test was only around 1.5% on average.
> > Practical workloads could have a larger proportion of such pages (e.g.
> > Johannes observed around 10% [1]), so the performance improvement should
> > be larger.
> >
> > This change also saves a small amount of memory due to fewer allocated
> > zswap_entry's. In the kernel build test above, we save around 2M of
> > slab usage when we swap out 3G to zswap.
> >
> > [1]https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> > mm/zswap.c | 137 ++++++++++++++++++++++++++++++-----------------------
> > 1 file changed, 78 insertions(+), 59 deletions(-)
>
> Tbh, I think this makes the code worse overall. Instead of increasing
> type safety, it actually reduces it, and places that previously dealt
> with a struct zswap_entry now deal with a void *.
>
> If we go ahead with this series, I think it makes more sense to either
>
> a) do the explicit subtyping of struct zswap_entry I had proposed, or

I suspect we won't get the small performance improvements (and memory
saving) with this approach. Neither are too significant, but it'd be
nice if we could keep them.

>
> b) at least keep the special handling of the xarray entry as local
> as possible, so that instead of a proliferating API that operates
> on void *, you have explicit filtering only where the tree is
> accessed, and then work on struct zswap_entry as much as possible.

I was trying to go for option (b) by isolating filtering and casting
to the correct type in a few functions (zswap_tree_free_element(),
zswap_store_zero_filled(), and zswap_load_zero_filled()). If we
open-code the filtering, it will be repeated in a few places.
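
Roughly, the shape I have in mind is the sketch below (illustrative
only -- the tag value and helper bodies here are made up, not the
actual patch code):

/* Sketch only: tag value and helper bodies are illustrative. */
#define ZSWAP_ZERO_FILLED_TAG	1UL

static void zswap_store_zero_filled(struct xarray *tree, pgoff_t offset,
                                    struct obj_cgroup *objcg)
{
        /* Stash the objcg pointer directly in the tree, tagged as zero-filled. */
        void *elem = xa_tag_pointer(objcg, ZSWAP_ZERO_FILLED_TAG);

        xa_store(tree, offset, elem, GFP_KERNEL);
}

static bool zswap_load_zero_filled(void *elem, struct page *page,
                                   struct obj_cgroup **objcg)
{
        if (xa_pointer_tag(elem) != ZSWAP_ZERO_FILLED_TAG)
                return false;

        /* Nothing was stored; just clear the destination page. */
        clear_highpage(page);
        *objcg = xa_untag_pointer(elem);
        return true;
}

static void zswap_tree_free_element(void *elem)
{
        if (xa_pointer_tag(elem) == ZSWAP_ZERO_FILLED_TAG) {
                struct obj_cgroup *objcg = xa_untag_pointer(elem);

                if (objcg)
                        obj_cgroup_put(objcg);
        } else {
                zswap_entry_free(elem);
        }
}

The tag filtering happens only in these helpers; everything outside of
them keeps working with a struct zswap_entry.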

What did you have in mind?

2024-03-28 21:07:26

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 01:23:42PM -0700, Yosry Ahmed wrote:
> On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > The current same-filled pages handling supports pages filled with any
> > > repeated word-sized pattern. However, in practice, most of these should
> > > be zero pages anyway. Other patterns should not be nearly as common.
> > >
> > > Drop the support for non-zero same-filled pages, but keep the names of
> > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > inaccurate.
> > >
> > > This yields some nice code simplification and enables a following patch
> > > that eliminates the need to allocate struct zswap_entry for those pages
> > > completely.
> > >
> > > There is also a very small performance improvement observed over 50 runs
> > > of kernel build test (kernbench) comparing the mean build time on a
> > > skylake machine when building the kernel in a cgroup v1 container with a
> > > 3G limit:
> > >
> > > base patched % diff
> > > real 70.167 69.915 -0.359%
> > > user 2953.068 2956.147 +0.104%
> > > sys 2612.811 2594.718 -0.692%
> > >
> > > This probably comes from more optimized operations like memchr_inv() and
> > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > this test was only around 1.5% on average, and was not affected by this
> > > patch. Practical workloads could have a larger proportion of such pages
> > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > should be larger.
> > >
> > > [1]https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > Signed-off-by: Yosry Ahmed <[email protected]>
> >
> > This is an interesting direction to pursue, but I actually think it
> > doesn't go far enough. Either way, I think it needs more data.
> >
> > 1) How frequent are non-zero-same-filled pages? Difficult to
> > generalize, but if you could gather some from your fleet, that
> > would be useful. If you can devise a portable strategy, I'd also be
> > more than happy to gather this on ours (although I think you have
> > more widespread zswap use, whereas we have more disk swap.)
>
> I am trying to collect the data, but there are.. hurdles. It would
> take some time, so I was hoping the data could be collected elsewhere
> if possible.
>
> The idea I had was to hook a BPF program to the entry of
> zswap_fill_page() and create a histogram of the "value" argument. We
> would get more coverage by hooking it to the return of
> zswap_is_page_same_filled() and only updating the histogram if the
> return value is true, as it includes pages in zswap that haven't been
> swapped in.
>
> However, with zswap_is_page_same_filled() the BPF program will run in
> all zswap stores, whereas for zswap_fill_page() it will only run when
> needed. Not sure if this makes a practical difference tbh.
>
> >
> > 2) The fact that we're doing any of this pattern analysis in zswap at
> > all strikes me as a bit misguided. Being efficient about repetitive
> > patterns is squarely in the domain of a compression algorithm. Do
> > we not trust e.g. zstd to handle this properly?
>
> I thought about this briefly, but I didn't follow through. I could try
> to collect some data by swapping out different patterns and observing
> how different compression algorithms react. That would be interesting
> for sure.
>
> >
> > I'm guessing this goes back to inefficient packing from something
> > like zbud, which would waste half a page on one repeating byte.
> >
> > But zsmalloc can do 32 byte objects. It's also a batching slab
> > allocator, where storing a series of small, same-sized objects is
> > quite fast.
> >
> > Add to that the additional branches, the additional kmap, the extra
> > scanning of every single page for patterns - all in the fast path
> > of zswap, when we already know that the vast majority of incoming
> > pages will need to be properly compressed anyway.
> >
> > Maybe it's time to get rid of the special handling entirely?
>
> We would still be wasting some memory (~96 bytes between zswap_entry
> and zsmalloc object), and wasting cycles allocating them. This could
> be made up for by cycles saved by removing the handling. We will be
> saving some branches for sure. I am not worried about kmap as I think
> it's a noop in most cases.

Yes, true.

> I am interested to see how much we could save by removing scanning for
> patterns. We may not save much if we abort after reading a few words
> in most cases, but I guess we could also be scanning a considerable
> amount before aborting. On the other hand, we would be reading the
> page contents into cache anyway for compression, so maybe it doesn't
> really matter?
>
> I will try to collect some data about this. I will start by trying to
> find out how the compression algorithms handle same-filled pages. If
> they can compress it efficiently, then I will try to get more data on
> the tradeoff from removing the handling.
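
(For concreteness, the probe described above could be roughly the
following libbpf-style sketch -- illustrative and untested, and it
assumes zswap_fill_page() is not inlined away; the histogram would
then be dumped from userspace, e.g. with "bpftool map dump".)

/* Sketch: count how often each same-filled pattern value is seen. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 4096);
        __type(key, unsigned long);     /* the "value" argument */
        __type(value, __u64);           /* number of times seen */
} fill_values SEC(".maps");

SEC("kprobe/zswap_fill_page")
int BPF_KPROBE(count_fill_value, void *ptr, unsigned long value)
{
        __u64 init = 1, *cnt;

        cnt = bpf_map_lookup_elem(&fill_values, &value);
        if (cnt)
                __sync_fetch_and_add(cnt, 1);
        else
                bpf_map_update_elem(&fill_values, &value, &init, BPF_ANY);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";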

I do wonder if this could be overthinking it, too.

Double checking the numbers on our fleet, an additional 96 bytes for
each same-filled entry would result in a

1) p50 waste of 0.008% of total memory, and a

2) p99 waste of 0.06% of total memory.

And this is without us having even thought about trying to make
zsmalloc more efficient for this particular usecase - which might be
the better point of attack, if we think it's actually worth it.

So my take is that unless removing it would be outright horrible from
a %sys POV (which seems pretty unlikely), IMO it would be fine to just
delete it entirely with a "not worth the maintenance cost" argument.

If you turn the argument around, and somebody would submit the code as
it is today, with the numbers being what they are above, I'm not sure
we would even accept it!

E.g.

mm/zswap.c | 114 ++++++-----------------------------------------------------
1 file changed, 10 insertions(+), 104 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 741957f36f38..fc447f8a5c24 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -44,8 +44,6 @@
**********************************/
/* The number of compressed pages currently stored in zswap */
atomic_t zswap_stored_pages = ATOMIC_INIT(0);
-/* The number of same-value filled pages currently stored in zswap */
-static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);

/*
* The statistics below are not protected from concurrent access for
@@ -123,19 +121,6 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
uint, 0644);

-/*
- * Enable/disable handling same-value filled pages (enabled by default).
- * If disabled every page is considered non-same-value filled.
- */
-static bool zswap_same_filled_pages_enabled = true;
-module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
- bool, 0644);
-
-/* Enable/disable handling non-same-value filled pages (enabled by default) */
-static bool zswap_non_same_filled_pages_enabled = true;
-module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
- bool, 0644);
-
/* Number of zpools in zswap_pool (empirically determined for scalability) */
#define ZSWAP_NR_ZPOOLS 32

@@ -194,12 +179,9 @@ static struct shrinker *zswap_shrinker;
* page within zswap.
*
* swpentry - associated swap entry, the offset indexes into the red-black tree
- * length - the length in bytes of the compressed page data. Needed during
- * decompression. For a same value filled page length is 0, and both
- * pool and lru are invalid and must be ignored.
+ * length - the length in bytes of the compressed page data
* pool - the zswap_pool the entry's data is in
* handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
* objcg - the obj_cgroup that the compressed memory is charged to
* lru - handle to the pool's lru used to evict pages.
*/
@@ -207,10 +189,7 @@ struct zswap_entry {
swp_entry_t swpentry;
unsigned int length;
struct zswap_pool *pool;
- union {
- unsigned long handle;
- unsigned long value;
- };
+ unsigned long handle;
struct obj_cgroup *objcg;
struct list_head lru;
};
@@ -812,13 +791,9 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
*/
static void zswap_entry_free(struct zswap_entry *entry)
{
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zswap_lru_del(&zswap_list_lru, entry);
- zpool_free(zswap_find_zpool(entry), entry->handle);
- zswap_pool_put(entry->pool);
- }
+ zswap_lru_del(&zswap_list_lru, entry);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
+ zswap_pool_put(entry->pool);
if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
obj_cgroup_put(entry->objcg);
@@ -1253,11 +1228,6 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
* This ensures that the better zswap compresses memory, the fewer
* pages we will evict to swap (as it will otherwise incur IO for
* relatively small memory saving).
- *
- * The memory saving factor calculated here takes same-filled pages into
- * account, but those are not freeable since they almost occupy no
- * space. Hence, we may scale nr_freeable down a little bit more than we
- * should if we have a lot of same-filled pages.
*/
return mult_frac(nr_freeable, nr_backing, nr_stored);
}
@@ -1361,36 +1331,6 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > thr);
}

-static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
-{
- unsigned long *page;
- unsigned long val;
- unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
-
- page = (unsigned long *)ptr;
- val = page[0];
-
- if (val != page[last_pos])
- return 0;
-
- for (pos = 1; pos < last_pos; pos++) {
- if (val != page[pos])
- return 0;
- }
-
- *value = val;
-
- return 1;
-}
-
-static void zswap_fill_page(void *ptr, unsigned long value)
-{
- unsigned long *page;
-
- page = (unsigned long *)ptr;
- memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
-}
-
bool zswap_store(struct folio *folio)
{
swp_entry_t swp = folio->swap;
@@ -1446,24 +1386,6 @@ bool zswap_store(struct folio *folio)
goto reject;
}

- if (zswap_same_filled_pages_enabled) {
- unsigned long value;
- u8 *src;
-
- src = kmap_local_folio(folio, 0);
- if (zswap_is_page_same_filled(src, &value)) {
- kunmap_local(src);
- entry->length = 0;
- entry->value = value;
- atomic_inc(&zswap_same_filled_pages);
- goto insert_entry;
- }
- kunmap_local(src);
- }
-
- if (!zswap_non_same_filled_pages_enabled)
- goto freepage;
-
/* if entry is successfully added, it keeps the reference */
entry->pool = zswap_pool_current_get();
if (!entry->pool)
@@ -1481,7 +1403,6 @@ bool zswap_store(struct folio *folio)
if (!zswap_compress(folio, entry))
goto put_pool;

-insert_entry:
entry->swpentry = swp;
entry->objcg = objcg;

@@ -1517,10 +1438,8 @@ bool zswap_store(struct folio *folio)
* The publishing order matters to prevent writeback from seeing
* an incoherent entry.
*/
- if (entry->length) {
- INIT_LIST_HEAD(&entry->lru);
- zswap_lru_add(&zswap_list_lru, entry);
- }
+ INIT_LIST_HEAD(&entry->lru);
+ zswap_lru_add(&zswap_list_lru, entry);

/* update stats */
atomic_inc(&zswap_stored_pages);
@@ -1529,13 +1448,9 @@ bool zswap_store(struct folio *folio)
return true;

store_failed:
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zpool_free(zswap_find_zpool(entry), entry->handle);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
- zswap_pool_put(entry->pool);
- }
+ zswap_pool_put(entry->pool);
freepage:
zswap_entry_cache_free(entry);
reject:
@@ -1564,7 +1479,6 @@ bool zswap_load(struct folio *folio)
bool swapcache = folio_test_swapcache(folio);
struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
- u8 *dst;

VM_WARN_ON_ONCE(!folio_test_locked(folio));

@@ -1588,13 +1502,7 @@ bool zswap_load(struct folio *folio)
if (!entry)
return false;

- if (entry->length)
- zswap_decompress(entry, page);
- else {
- dst = kmap_local_page(page);
- zswap_fill_page(dst, entry->value);
- kunmap_local(dst);
- }
+ zswap_decompress(entry, page);

count_vm_event(ZSWPIN);
if (entry->objcg)
@@ -1696,8 +1604,6 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, NULL, &total_size_fops);
debugfs_create_atomic_t("stored_pages", 0444,
zswap_debugfs_root, &zswap_stored_pages);
- debugfs_create_atomic_t("same_filled_pages", 0444,
- zswap_debugfs_root, &zswap_same_filled_pages);

return 0;
}

2024-03-28 23:19:36

by Nhat Pham

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 2:07 PM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 01:23:42PM -0700, Yosry Ahmed wrote:
> > On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > > The current same-filled pages handling supports pages filled with any
> > > > repeated word-sized pattern. However, in practice, most of these should
> > > > be zero pages anyway. Other patterns should not be nearly as common.
> > > >
> > > > Drop the support for non-zero same-filled pages, but keep the names of
> > > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > > inaccurate.
> > > >
> > > > This yields some nice code simplification and enables a following patch
> > > > that eliminates the need to allocate struct zswap_entry for those pages
> > > > completely.
> > > >
> > > > There is also a very small performance improvement observed over 50 runs
> > > > of kernel build test (kernbench) comparing the mean build time on a
> > > > skylake machine when building the kernel in a cgroup v1 container with a
> > > > 3G limit:
> > > >
> > > > base patched % diff
> > > > real 70.167 69.915 -0.359%
> > > > user 2953.068 2956.147 +0.104%
> > > > sys 2612.811 2594.718 -0.692%
> > > >
> > > > This probably comes from more optimized operations like memchr_inv() and
> > > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > > this test was only around 1.5% on average, and was not affected by this
> > > > patch. Practical workloads could have a larger proportion of such pages
> > > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > > should be larger.
> > > >
> > > > [1]https://lore.kernel.org/linux-mm/20240320210716.GH294822@cmpxchgorg/
> > > >
> > > > Signed-off-by: Yosry Ahmed <[email protected]>
> > >
> > > This is an interesting direction to pursue, but I actually think it
> > > doesn't go far enough. Either way, I think it needs more data.
> > >
> > > 1) How frequent are non-zero-same-filled pages? Difficult to
> > > generalize, but if you could gather some from your fleet, that
> > > would be useful. If you can devise a portable strategy, I'd also be
> > > more than happy to gather this on ours (although I think you have
> > > more widespread zswap use, whereas we have more disk swap.)
> >
> > I am trying to collect the data, but there are.. hurdles. It would
> > take some time, so I was hoping the data could be collected elsewhere
> > if possible.
> >
> > The idea I had was to hook a BPF program to the entry of
> > zswap_fill_page() and create a histogram of the "value" argument. We
> > would get more coverage by hooking it to the return of
> > zswap_is_page_same_filled() and only updating the histogram if the
> > return value is true, as it includes pages in zswap that haven't been
> > swapped in.
> >
> > However, with zswap_is_page_same_filled() the BPF program will run in
> > all zswap stores, whereas for zswap_fill_page() it will only run when
> > needed. Not sure if this makes a practical difference tbh.
> >
> > >
> > > 2) The fact that we're doing any of this pattern analysis in zswap at
> > > all strikes me as a bit misguided. Being efficient about repetitive
> > > patterns is squarely in the domain of a compression algorithm. Do
> > > we not trust e.g. zstd to handle this properly?
> >
> > I thought about this briefly, but I didn't follow through. I could try
> > to collect some data by swapping out different patterns and observing
> > how different compression algorithms react. That would be interesting
> > for sure.
> >
> > >
> > > I'm guessing this goes back to inefficient packing from something
> > > like zbud, which would waste half a page on one repeating byte.
> > >
> > > But zsmalloc can do 32 byte objects. It's also a batching slab
> > > allocator, where storing a series of small, same-sized objects is
> > > quite fast.
> > >
> > > Add to that the additional branches, the additional kmap, the extra
> > > scanning of every single page for patterns - all in the fast path
> > > of zswap, when we already know that the vast majority of incoming
> > > pages will need to be properly compressed anyway.
> > >
> > > Maybe it's time to get rid of the special handling entirely?
> >
> > We would still be wasting some memory (~96 bytes between zswap_entry
> > and zsmalloc object), and wasting cycles allocating them. This could
> > be made up for by cycles saved by removing the handling. We will be
> > saving some branches for sure. I am not worried about kmap as I think
> > it's a noop in most cases.
>
> Yes, true.
>
> > I am interested to see how much we could save by removing scanning for
> > patterns. We may not save much if we abort after reading a few words
> > in most cases, but I guess we could also be scanning a considerable
> > amount before aborting. On the other hand, we would be reading the
> > page contents into cache anyway for compression, so maybe it doesn't
> > really matter?
> >
> > I will try to collect some data about this. I will start by trying to
> > find out how the compression algorithms handle same-filled pages. If
> > they can compress it efficiently, then I will try to get more data on
> > the tradeoff from removing the handling.
>
> I do wonder if this could be overthinking it, too.
>
> Double checking the numbers on our fleet, an additional 96 bytes for
> each same-filled entry would result in a
>
> 1) p50 waste of 0.008% of total memory, and a
>
> 2) p99 waste of 0.06% of total memory.
>
> And this is without us having even thought about trying to make
> zsmalloc more efficient for this particular usecase - which might be
> the better point of attack, if we think it's actually worth it.
>
> So my take is that unless removing it would be outright horrible from
> a %sys POV (which seems pretty unlikely), IMO it would be fine to just
> delete it entirely with a "not worth the maintenance cost" argument.
>
> If you turn the argument around, and somebody would submit the code as
> it is today, with the numbers being what they are above, I'm not sure
> we would even accept it!

The context guy is here :)

Not arguing for one way or another, but I did find the original patch
that introduced same filled page handling:

https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f

https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u

The number looks impressive, and there is some detail about the
experiment setup, but I can't seem to find which allocator +
compressor were used.

Which, as Johannes has pointed out, matters a lot. A good compressor
(which should work on arguably the most trivial data pattern there is)
+ a backend allocator that is capable of handling small objects well
could make this case really efficient, without resorting to special
handling at the zswap level.

2024-03-28 23:34:18

by Nhat Pham

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 1:24 PM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > The current same-filled pages handling supports pages filled with any
> > > repeated word-sized pattern. However, in practice, most of these should
> > > be zero pages anyway. Other patterns should not be nearly as common.
> > >
> > > Drop the support for non-zero same-filled pages, but keep the names of
> > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > inaccurate.
> > >
> > > This yields some nice code simplification and enables a following patch
> > > that eliminates the need to allocate struct zswap_entry for those pages
> > > completely.
> > >
> > > There is also a very small performance improvement observed over 50 runs
> > > of kernel build test (kernbench) comparing the mean build time on a
> > > skylake machine when building the kernel in a cgroup v1 container with a
> > > 3G limit:
> > >
> > > base patched % diff
> > > real 70.167 69.915 -0.359%
> > > user 2953.068 2956.147 +0.104%
> > > sys 2612.811 2594.718 -0.692%
> > >
> > > This probably comes from more optimized operations like memchr_inv() and
> > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > this test was only around 1.5% on average, and was not affected by this
> > > patch. Practical workloads could have a larger proportion of such pages
> > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > should be larger.
> > >
> > > [1]https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > Signed-off-by: Yosry Ahmed <[email protected]>
> >
> > This is an interesting direction to pursue, but I actually think it
> > doesn't go far enough. Either way, I think it needs more data.
> >
> > 1) How frequent are non-zero-same-filled pages? Difficult to
> > generalize, but if you could gather some from your fleet, that
> > would be useful. If you can devise a portable strategy, I'd also be
> > more than happy to gather this on ours (although I think you have
> > more widespread zswap use, whereas we have more disk swap.)
>
> I am trying to collect the data, but there are.. hurdles. It would
> take some time, so I was hoping the data could be collected elsewhere
> if possible.
>
> The idea I had was to hook a BPF program to the entry of
> zswap_fill_page() and create a histogram of the "value" argument. We
> would get more coverage by hooking it to the return of
> zswap_is_page_same_filled() and only updating the histogram if the
> return value is true, as it includes pages in zswap that haven't been
> swapped in.
>
> However, with zswap_is_page_same_filled() the BPF program will run in
> all zswap stores, whereas for zswap_fill_page() it will only run when
> needed. Not sure if this makes a practical difference tbh.
>
> >
> > 2) The fact that we're doing any of this pattern analysis in zswap at
> > all strikes me as a bit misguided. Being efficient about repetitive
> > patterns is squarely in the domain of a compression algorithm. Do
> > we not trust e.g. zstd to handle this properly?
>
> I thought about this briefly, but I didn't follow through. I could try
> to collect some data by swapping out different patterns and observing
> how different compression algorithms react. That would be interesting
> for sure.
>
> >
> > I'm guessing this goes back to inefficient packing from something
> > like zbud, which would waste half a page on one repeating byte.
> >
> > But zsmalloc can do 32 byte objects. It's also a batching slab
> > allocator, where storing a series of small, same-sized objects is
> > quite fast.
> >
> > Add to that the additional branches, the additional kmap, the extra
> > scanning of every single page for patterns - all in the fast path
> > of zswap, when we already know that the vast majority of incoming
> > pages will need to be properly compressed anyway.
> >
> > Maybe it's time to get rid of the special handling entirely?
>
> We would still be wasting some memory (~96 bytes between zswap_entry
> and zsmalloc object), and wasting cycles allocating them. This could
> be made up for by cycles saved by removing the handling. We will be
> saving some branches for sure. I am not worried about kmap as I think
> it's a noop in most cases.

A secondary effect of the current same-filled page handling is that
we're not considering these pages for reclaim, which could potentially
be beneficial: we're not saving much memory (essentially just the
zswap entry and the associated cost of storing it) by writing these
pages back - IOW, the cost / benefit ratio for reclaiming these pages
is quite atrocious.

Again, all of this is just handwaving without numbers. It'd be nice if
we can have more concrete data for this conversation :P

2024-03-29 02:09:50

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 4:19 PM Nhat Pham <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 2:07 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Thu, Mar 28, 2024 at 01:23:42PM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
> > > >
> > > > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > > > The current same-filled pages handling supports pages filled with any
> > > > > repeated word-sized pattern. However, in practice, most of these should
> > > > > be zero pages anyway. Other patterns should not be nearly as common.
> > > > >
> > > > > Drop the support for non-zero same-filled pages, but keep the names of
> > > > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > > > inaccurate.
> > > > >
> > > > > This yields some nice code simplification and enables a following patch
> > > > > that eliminates the need to allocate struct zswap_entry for those pages
> > > > > completely.
> > > > >
> > > > > There is also a very small performance improvement observed over 50 runs
> > > > > of kernel build test (kernbench) comparing the mean build time on a
> > > > > skylake machine when building the kernel in a cgroup v1 container with a
> > > > > 3G limit:
> > > > >
> > > > > base patched % diff
> > > > > real 70.167 69.915 -0.359%
> > > > > user 2953.068 2956.147 +0.104%
> > > > > sys 2612.811 2594.718 -0.692%
> > > > >
> > > > > This probably comes from more optimized operations like memchr_inv() and
> > > > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > > > this test was only around 1.5% on average, and was not affected by this
> > > > > patch. Practical workloads could have a larger proportion of such pages
> > > > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > > > should be larger.
> > > > >
> > > > > [1]https://lore.kernel.org/linux-mm/[email protected]/
> > > > >
> > > > > Signed-off-by: Yosry Ahmed <[email protected]>
> > > >
> > > > This is an interesting direction to pursue, but I actually think it
> > > > doesn't go far enough. Either way, I think it needs more data.
> > > >
> > > > 1) How frequent are non-zero-same-filled pages? Difficult to
> > > > generalize, but if you could gather some from your fleet, that
> > > > would be useful. If you can devise a portable strategy, I'd also be
> > > > more than happy to gather this on ours (although I think you have
> > > > more widespread zswap use, whereas we have more disk swap.)
> > >
> > > I am trying to collect the data, but there are.. hurdles. It would
> > > take some time, so I was hoping the data could be collected elsewhere
> > > if possible.
> > >
> > > The idea I had was to hook a BPF program to the entry of
> > > zswap_fill_page() and create a histogram of the "value" argument. We
> > > would get more coverage by hooking it to the return of
> > > zswap_is_page_same_filled() and only updating the histogram if the
> > > return value is true, as it includes pages in zswap that haven't been
> > > swapped in.
> > >
> > > However, with zswap_is_page_same_filled() the BPF program will run in
> > > all zswap stores, whereas for zswap_fill_page() it will only run when
> > > needed. Not sure if this makes a practical difference tbh.
> > >
> > > >
> > > > 2) The fact that we're doing any of this pattern analysis in zswap at
> > > > all strikes me as a bit misguided. Being efficient about repetitive
> > > > patterns is squarely in the domain of a compression algorithm. Do
> > > > we not trust e.g. zstd to handle this properly?
> > >
> > > I thought about this briefly, but I didn't follow through. I could try
> > > to collect some data by swapping out different patterns and observing
> > > how different compression algorithms react. That would be interesting
> > > for sure.
> > >
> > > >
> > > > I'm guessing this goes back to inefficient packing from something
> > > > like zbud, which would waste half a page on one repeating byte.
> > > >
> > > > But zsmalloc can do 32 byte objects. It's also a batching slab
> > > > allocator, where storing a series of small, same-sized objects is
> > > > quite fast.
> > > >
> > > > Add to that the additional branches, the additional kmap, the extra
> > > > scanning of every single page for patterns - all in the fast path
> > > > of zswap, when we already know that the vast majority of incoming
> > > > pages will need to be properly compressed anyway.
> > > >
> > > > Maybe it's time to get rid of the special handling entirely?
> > >
> > > We would still be wasting some memory (~96 bytes between zswap_entry
> > > and zsmalloc object), and wasting cycles allocating them. This could
> > > be made up for by cycles saved by removing the handling. We will be
> > > saving some branches for sure. I am not worried about kmap as I think
> > > it's a noop in most cases.
> >
> > Yes, true.
> >
> > > I am interested to see how much we could save by removing scanning for
> > > patterns. We may not save much if we abort after reading a few words
> > > in most cases, but I guess we could also be scanning a considerable
> > > amount before aborting. On the other hand, we would be reading the
> > > page contents into cache anyway for compression, so maybe it doesn't
> > > really matter?
> > >
> > > I will try to collect some data about this. I will start by trying to
> > > find out how the compression algorithms handle same-filled pages. If
> > > they can compress it efficiently, then I will try to get more data on
> > > the tradeoff from removing the handling.
> >
> > I do wonder if this could be overthinking it, too.
> >
> > Double checking the numbers on our fleet, an additional 96 bytes for
> > each same-filled entry would result in a
> >
> > 1) p50 waste of 0.008% of total memory, and a
> >
> > 2) p99 waste of 0.06% of total memory.

Right. Assuming the compressors do not surprise us and store
same-filled pages in an absurd way, it's not worth it in terms of
memory savings.

> >
> > And this is without us having even thought about trying to make
> > zsmalloc more efficient for this particular usecase - which might be
> > the better point of attack, if we think it's actually worth it.
> >
> > So my take is that unless removing it would be outright horrible from
> > a %sys POV (which seems pretty unlikely), IMO it would be fine to just
> > delete it entirely with a "not worth the maintenance cost" argument.
> >
> > If you turn the argument around, and somebody would submit the code as
> > it is today, with the numbers being what they are above, I'm not sure
> > we would even accept it!
>
> The context guy is here :)
>
> Not arguing for one way or another, but I did find the original patch
> that introduced same filled page handling:
>
> https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f
>
> https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u

Thanks for digging this up. I don't know why I didn't start there :)

Following in your footsteps, and given that zram has the same feature,
I found the patch that added support for non-zero same-filled pages in
zram:
https://lore.kernel.org/all/[email protected]/#t

Both of them confirm that most same-filled pages are zero pages, but
they show a more significant portion of same-filled pages being
non-zero (17% to 40%). I suspect this will be less in data centers
compared to consumer apps.

The zswap patch also reports significant performance improvements from
the same-filled handling, but this is with 17-22% same-filled pages.
Johannes mentioned around 10% in your data centers, so the performance
improvement would be less. In the kernel build tests I ran with only
around 1.5% same-filled pages I observed 1.4% improvements just by
optimizing them (only zero-filled, skipping allocations).

So I think removing the same-filled pages handling completely may be
too aggressive, because it doesn't only affect the memory efficiency,
but also cycles spent when handling those pages. Just avoiding going
through the allocator and compressor has to account for something :)

>
> The number looks impressive, and there is some detail about the
> experiment setup, but I can't seem to find which allocator +
> compressor were used.

I would assume it was zbud because it was the default then, but as I
mentioned zram had similar patches and it uses zsmalloc. I suspect
zsmalloc would have changed since then tho, and compressors as well.
With zram, there is also the point that metadata is allocated
statically AFAICT, so there is very little cost to supporting all
same-filled pages if you already support zero-filled pages.

>
> Which, as Johannes has pointed out, matters a lot. A good compressor
> (which should work on arguably the most trivial data pattern there is)
> + a backend allocator that is capable of handling small objects well
> could make this case really efficient, without resorting to special
> handling at the zswap level.

All in all, I think there are three aspects here:
(1) Cycles saved by avoiding going through the compressor and
allocator, and, with this series, by also avoiding the zswap_entry
metadata allocation.

(2) Memory saved by storing same-filled pages in zswap instead of
compressing them. This will ultimately depend on the compressor and
allocator as has been established.

(3) Memory saved on metadata in zswap by avoiding the zswap_entry
allocation. This is probably too small to matter.

I think (1) may be the major factor here, and (2) could be a factor if
the compressors surprise us (and would always be a factor for zbud and
z3fold due to bin packing).

Focusing on just (1), supporting zero-filled pages only may be a good
tradeoff. We get slightly less complicated and potentially faster
handling in zswap without losing a lot. Did I capture the discussion
so far correctly?

2024-03-29 02:15:44

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Thu, Mar 28, 2024 at 1:06 PM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 12:11 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
> > > There is no logical reason to refuse storing same-filled pages more
> > > efficiently and opt for compression. Remove the userspace knob.
> > >
> > > Signed-off-by: Yosry Ahmed <[email protected]>
> >
> > Acked-by: Johannes Weiner <[email protected]>
> >
> > I also think the non_same_filled_pages_enabled option should go
> > away. Both of these tunables are pretty bizarre.
>
> Happy to remove both in the next version :)

I thought non_same_filled_pages_enabled was introduced with the
initial support for same-filled pages, but it was introduced
separately (and much more recently):
https://lore.kernel.org/all/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com/

I am CCing Maciej to hear more about the use case for this.

>
> Thanks!

2024-03-29 15:18:27

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On 29.03.2024 03:14, Yosry Ahmed wrote:
> On Thu, Mar 28, 2024 at 1:06 PM Yosry Ahmed <[email protected]> wrote:
>>
>> On Thu, Mar 28, 2024 at 12:11 PM Johannes Weiner <[email protected]> wrote:
>>>
>>> On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
>>>> There is no logical reason to refuse storing same-filled pages more
>>>> efficiently and opt for compression. Remove the userspace knob.
>>>>
>>>> Signed-off-by: Yosry Ahmed <[email protected]>
>>>
>>> Acked-by: Johannes Weiner <[email protected]>
>>>
>>> I also think the non_same_filled_pages_enabled option should go
>>> away. Both of these tunables are pretty bizarre.
>>
>> Happy to remove both in the next version :)
>
> I thought non_same_filled_pages_enabled was introduced with the
> initial support for same-filled pages, but it was introduced
> separately (and much more recently):
> https://lore.kernel.org/all/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com/
>
> I am CCing Maciej to hear more about the use case for this.

Thanks for CCing me.

I introduced "non_same_filled_pages_enabled" a few years ago to
enable using zswap in a lightweight mode where it is only used for
its ability to store same-filled pages effectively.

As far as I remember, there were some interactions between full
zswap and the cgroup memory controller - like, it made it easier
for an aggressive workload to exceed its cgroup memory.high limits.

On the other hand, "same_filled_pages_enabled" sounds like some kind
of a debugging option since I don't see any real reason to disable it
either.

Thanks,
Maciej


2024-03-29 17:38:17

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 09:27:17PM -0700, Yosry Ahmed wrote:
> On Thu, Mar 28, 2024 at 7:05 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Thu, Mar 28, 2024 at 4:19 PM Nhat Pham <[email protected]> wrote:
> > >
> > > On Thu, Mar 28, 2024 at 2:07 PM Johannes Weiner <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 28, 2024 at 01:23:42PM -0700, Yosry Ahmed wrote:
> > > > > On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > > > > > The current same-filled pages handling supports pages filled with any
> > > > > > > repeated word-sized pattern. However, in practice, most of these should
> > > > > > > be zero pages anyway. Other patterns should not be nearly as common.
> > > > > > >
> > > > > > > Drop the support for non-zero same-filled pages, but keep the names of
> > > > > > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > > > > > inaccurate.
> > > > > > >
> > > > > > > This yields some nice code simplification and enables a following patch
> > > > > > > that eliminates the need to allocate struct zswap_entry for those pages
> > > > > > > completely.
> > > > > > >
> > > > > > > There is also a very small performance improvement observed over 50 runs
> > > > > > > of kernel build test (kernbench) comparing the mean build time on a
> > > > > > > skylake machine when building the kernel in a cgroup v1 container with a
> > > > > > > 3G limit:
> > > > > > >
> > > > > > > base patched % diff
> > > > > > > real 70.167 69.915 -0.359%
> > > > > > > user 2953.068 2956.147 +0.104%
> > > > > > > sys 2612.811 2594.718 -0.692%
> > > > > > >
> > > > > > > This probably comes from more optimized operations like memchr_inv() and
> > > > > > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > > > > > this test was only around 1.5% on average, and was not affected by this
> > > > > > > patch. Practical workloads could have a larger proportion of such pages
> > > > > > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > > > > > should be larger.
> > > > > > >
> > > > > > > [1]https://lore.kernel.org/linux-mm/[email protected]/
> > > > > > >
> > > > > > > Signed-off-by: Yosry Ahmed <[email protected]>
> > > > > >
> > > > > > This is an interesting direction to pursue, but I actually think it
> > > > > > doesn't go far enough. Either way, I think it needs more data.
> > > > > >
> > > > > > 1) How frequent are non-zero-same-filled pages? Difficult to
> > > > > > generalize, but if you could gather some from your fleet, that
> > > > > > would be useful. If you can devise a portable strategy, I'd also be
> > > > > > more than happy to gather this on ours (although I think you have
> > > > > > more widespread zswap use, whereas we have more disk swap.)
> > > > >
> > > > > I am trying to collect the data, but there are.. hurdles. It would
> > > > > take some time, so I was hoping the data could be collected elsewhere
> > > > > if possible.
> > > > >
> > > > > The idea I had was to hook a BPF program to the entry of
> > > > > zswap_fill_page() and create a histogram of the "value" argument. We
> > > > > would get more coverage by hooking it to the return of
> > > > > zswap_is_page_same_filled() and only updating the histogram if the
> > > > > return value is true, as it includes pages in zswap that haven't been
> > > > > swapped in.
> > > > >
> > > > > However, with zswap_is_page_same_filled() the BPF program will run in
> > > > > all zswap stores, whereas for zswap_fill_page() it will only run when
> > > > > needed. Not sure if this makes a practical difference tbh.
> > > > >
> > > > > >
> > > > > > 2) The fact that we're doing any of this pattern analysis in zswap at
> > > > > > all strikes me as a bit misguided. Being efficient about repetitive
> > > > > > patterns is squarely in the domain of a compression algorithm. Do
> > > > > > we not trust e.g. zstd to handle this properly?
> > > > >
> > > > > I thought about this briefly, but I didn't follow through. I could try
> > > > > to collect some data by swapping out different patterns and observing
> > > > > how different compression algorithms react. That would be interesting
> > > > > for sure.
> > > > >
> > > > > >
> > > > > > I'm guessing this goes back to inefficient packing from something
> > > > > > like zbud, which would waste half a page on one repeating byte.
> > > > > >
> > > > > > But zsmalloc can do 32 byte objects. It's also a batching slab
> > > > > > allocator, where storing a series of small, same-sized objects is
> > > > > > quite fast.
> > > > > >
> > > > > > Add to that the additional branches, the additional kmap, the extra
> > > > > > scanning of every single page for patterns - all in the fast path
> > > > > > of zswap, when we already know that the vast majority of incoming
> > > > > > pages will need to be properly compressed anyway.
> > > > > >
> > > > > > Maybe it's time to get rid of the special handling entirely?
> > > > >
> > > > > We would still be wasting some memory (~96 bytes between zswap_entry
> > > > > and zsmalloc object), and wasting cycles allocating them. This could
> > > > > be made up for by cycles saved by removing the handling. We will be
> > > > > saving some branches for sure. I am not worried about kmap as I think
> > > > > it's a noop in most cases.
> > > >
> > > > Yes, true.
> > > >
> > > > > I am interested to see how much we could save by removing scanning for
> > > > > patterns. We may not save much if we abort after reading a few words
> > > > > in most cases, but I guess we could also be scanning a considerable
> > > > > amount before aborting. On the other hand, we would be reading the
> > > > > page contents into cache anyway for compression, so maybe it doesn't
> > > > > really matter?
> > > > >
> > > > > I will try to collect some data about this. I will start by trying to
> > > > > find out how the compression algorithms handle same-filled pages. If
> > > > > they can compress it efficiently, then I will try to get more data on
> > > > > the tradeoff from removing the handling.
> > > >
> > > > I do wonder if this could be overthinking it, too.
> > > >
> > > > Double checking the numbers on our fleet, an additional 96 bytes for
> > > > each same-filled entry would result in a
> > > >
> > > > 1) p50 waste of 0.008% of total memory, and a
> > > >
> > > > 2) p99 waste of 0.06% of total memory.
> >
> > Right. Assuming the compressors do not surprise us and store
> > same-filled pages in an absurd way, it's not worth it in terms of
> > memory savings.
> >
> > > >
> > > > And this is without us having even thought about trying to make
> > > > zsmalloc more efficient for this particular usecase - which might be
> > > > the better point of attack, if we think it's actually worth it.
> > > >
> > > > So my take is that unless removing it would be outright horrible from
> > > > a %sys POV (which seems pretty unlikely), IMO it would be fine to just
> > > > delete it entirely with a "not worth the maintenance cost" argument.
> > > >
> > > > If you turn the argument around, and somebody would submit the code as
> > > > it is today, with the numbers being what they are above, I'm not sure
> > > > we would even accept it!
> > >
> > > The context guy is here :)
> > >
> > > Not arguing for one way or another, but I did find the original patch
> > > that introduced same filled page handling:
> > >
> > > https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f
> > >
> > > https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u
> >
> > Thanks for digging this up. I don't know why I didn't start there :)
> >
> > Following in your footsteps, and given that zram has the same feature,
> > I found the patch that added support for non-zero same-filled pages in
> > zram:
> > https://lore.kernel.org/all/[email protected]/#t
> >
> > Both of them confirm that most same-filled pages are zero pages, but
> > they show a more significant portion of same-filled pages being
> > non-zero (17% to 40%). I suspect this will be less in data centers
> > compared to consumer apps.
> >
> > The zswap patch also reports significant performance improvements from
> > the same-filled handling, but this is with 17-22% same-filled pages.
> > Johannes mentioned around 10% in your data centers, so the performance
> > improvement would be less. In the kernel build tests I ran with only
> > around 1.5% same-filled pages I observed 1.4% improvements just by
> > optimizing them (only zero-filled, skipping allocations).
> >
> > So I think removing the same-filled pages handling completely may be
> > too aggressive, because it doesn't only affect the memory efficiency,
> > but also cycles spent when handling those pages. Just avoiding going
> > through the allocator and compressor has to account for something :)
>
> Here is another data point. I tried removing the same-filled handling
> code completely with the diff Johannes sent upthread. I saw 1.3%
> improvement in the kernel build test, very similar to the improvement
> from this patch series. _However_, the kernel build test only produces
> ~1.5% zero-filled pages in my runs. More realistic workloads have
> significantly higher percentages as demonstrated upthread.
>
> In other words, the kernel build test (at least in my runs) seems to
> be the worst case scenario for same-filled/zero-filled pages. Since
> the improvement from removing same-filled handling is quite small in
> this case, I suspect there will be no improvement, but possibly a
> regression, on real workloads.
>
> As the zero-filled pages ratio increases:
> - The performance with this series will improve.
> - The performance with removing same-filled handling completely will
> become worse.

Sorry, this thread is still really lacking practical perspective.

As do the numbers that initially justified the patch. Sure, the stores
of same-filled pages are faster. What's the cost of prechecking 90% of
the other pages that need compression?

Also, this is the swap path we're talking about. There is vmscan, swap
slot allocations, page table walks, TLB flushes, zswap tree inserts;
then a page fault and everything in reverse.

I perf'd zswapping out data that is 10% same-filled and 90% data that
always needs compression. It does nothing but madvise(MADV_PAGEOUT),
and the zswap_store() stack is already only ~60% of the cycles.
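
(The harness is essentially of this shape -- a rough sketch with an
arbitrary buffer size and fill pattern, not the exact test:)

/* Sketch: ~10% zero-filled pages, ~90% data that needs real
 * compression, pushed out with MADV_PAGEOUT. */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

int main(void)
{
        size_t page = sysconf(_SC_PAGESIZE);
        size_t npages = 256 * 1024;     /* 1G with 4k pages */
        char *buf = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        for (size_t i = 0; i < npages; i++) {
                char *p = buf + i * page;

                if (i % 10 == 0) {
                        memset(p, 0, page);             /* zero-filled */
                } else {
                        for (size_t j = 0; j < page; j += sizeof(long))
                                *(long *)(p + j) = random();
                }
        }

        /* Reclaim the whole range, pushing it through zswap_store(). */
        madvise(buf, npages * page, MADV_PAGEOUT);
        return 0;
}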

Using zsmalloc + zstd, this is the diff between vanilla and my patch:

# Baseline Delta Abs Shared Object Symbol
# ........ ......... .................... .....................................................
#
4.34% -3.02% [kernel.kallsyms] [k] zswap_store
11.07% +1.41% [kernel.kallsyms] [k] ZSTD_compressBlock_doubleFast
15.55% +0.91% [kernel.kallsyms] [k] FSE_buildCTable_wksp

As expected, we have to compress a bit more; on the other hand we're
removing the content scan for same-filled for 90% of the pages that
don't benefit from it. They almost amortize each other. Let's round it
up and the remaining difference is ~1%.

It's difficult to make the case that this matters to any real
workloads with actual think time in between paging.

But let's say you do make the case that zero-filled pages are worth
optimizing for. Why is this in zswap? Why not do it in vmscan with a
generic zero-swp_entry_t, and avoid the swap backend altogether? No
swap slot allocation, no zswap tree, no *IO on disk swap*.

However you slice it, I fail to see how this has a place in
zswap. It's trying to optimize the slow path of a slow path, at the
wrong layer of the reclaim stack.


2024-03-29 17:45:12

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Fri, Mar 29, 2024 at 03:02:10PM +0100, Maciej S. Szmigiero wrote:
> On 29.03.2024 03:14, Yosry Ahmed wrote:
> > On Thu, Mar 28, 2024 at 1:06 PM Yosry Ahmed <[email protected]> wrote:
> >>
> >> On Thu, Mar 28, 2024 at 12:11 PM Johannes Weiner <[email protected]> wrote:
> >>>
> >>> On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
> >>>> There is no logical reason to refuse storing same-filled pages more
> >>>> efficiently and opt for compression. Remove the userspace knob.
> >>>>
> >>>> Signed-off-by: Yosry Ahmed <[email protected]>
> >>>
> >>> Acked-by: Johannes Weiner <[email protected]>
> >>>
> >>> I also think the non_same_filled_pages_enabled option should go
> >>> away. Both of these tunables are pretty bizarre.
> >>
> >> Happy to remove both in the next version :)
> >
> > I thought non_same_filled_pages_enabled was introduced with the
> > initial support for same-filled pages, but it was introduced
> > separately (and much more recently):
> > https://lore.kernel.org/all/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com/
> >
> > I am CCing Maciej to hear more about the use case for this.
>
> Thanks for CCing me.
>
> I introduced "non_same_filled_pages_enabled" a few years ago to
> enable using zswap in a lightweight mode where it is only used for
> its ability to store same-filled pages effectively.

But all the pages it rejects go to disk swap instead, which is much
slower than compression...

> As far as I remember, there were some interactions between full
> zswap and the cgroup memory controller - like, it made it easier
> for an aggressive workload to exceed its cgroup memory.high limits.

Ok, that makes sense! A container fairness measure, rather than a
performance optimization.

Fair enough, but that's moot then with cgroup accounting of the
backing memory, f4840ccfca25 ("zswap: memcg accounting").

Thanks for providing context.

2024-03-29 18:24:01

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Fri, Mar 29, 2024 at 10:45 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Mar 29, 2024 at 03:02:10PM +0100, Maciej S. Szmigiero wrote:
> > On 29.03.2024 03:14, Yosry Ahmed wrote:
> > > On Thu, Mar 28, 2024 at 1:06 PM Yosry Ahmed <[email protected]> wrote:
> > >>
> > >> On Thu, Mar 28, 2024 at 12:11 PM Johannes Weiner <[email protected]> wrote:
> > >>>
> > >>> On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
> > >>>> There is no logical reason to refuse storing same-filled pages more
> > >>>> efficiently and opt for compression. Remove the userspace knob.
> > >>>>
> > >>>> Signed-off-by: Yosry Ahmed <[email protected]>
> > >>>
> > >>> Acked-by: Johannes Weiner <[email protected]>
> > >>>
> > >>> I also think the non_same_filled_pages_enabled option should go
> > >>> away. Both of these tunables are pretty bizarre.
> > >>
> > >> Happy to remove both in the next version :)
> > >
> > > I thought non_same_filled_pages_enabled was introduced with the
> > > initial support for same-filled pages, but it was introduced
> > > separately (and much more recently):
> > > https://lore.kernel.org/all/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com/
> > >
> > > I am CCing Maciej to hear more about the use case for this.
> >
> > Thanks for CCing me.
> >
> > I introduced "non_same_filled_pages_enabled" a few years ago to
> > enable using zswap in a lightweight mode where it is only used for
> > its ability to store same-filled pages effectively.
>
> But all the pages it rejects go to disk swap instead, which is much
> slower than compression...
>
> > As far as I remember, there were some interactions between full
> > zswap and the cgroup memory controller - like, it made it easier
> > for an aggressive workload to exceed its cgroup memory.high limits.
>
> Ok, that makes sense! A container fairness measure, rather than a
> performance optimization.
>
> Fair enough, but that's moot then with cgroup accounting of the
> backing memory, f4840ccfca25 ("zswap: memcg accounting").

Right, this should no longer be needed with the zswap charging.

Maciej, is this still being used on kernels with f4840ccfca25 (5.19+)?
Any objections to removing it now?

2024-03-29 18:57:38

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

[..]
> > > > The context guy is here :)
> > > >
> > > > Not arguing for one way or another, but I did find the original patch
> > > > that introduced same filled page handling:
> > > >
> > > > https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f
> > > >
> > > > https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u
> > >
> > > Thanks for digging this up. I don't know why I didn't start there :)
> > >
> > > Following in your footsteps, and given that zram has the same feature,
> > > I found the patch that added support for non-zero same-filled pages in
> > > zram:
> > > https://lore.kernel.org/all/[email protected]/#t
> > >
> > > Both of them confirm that most same-filled pages are zero pages, but
> > > they show a more significant portion of same-filled pages being
> > > non-zero (17% to 40%). I suspect this will be less in data centers
> > > compared to consumer apps.
> > >
> > > The zswap patch also reports significant performance improvements from
> > > the same-filled handling, but this is with 17-22% same-filled pages.
> > > Johannes mentioned around 10% in your data centers, so the performance
> > > improvement would be less. In the kernel build tests I ran with only
> > > around 1.5% same-filled pages I observed 1.4% improvements just by
> > > optimizing them (only zero-filled, skipping allocations).
> > >
> > > So I think removing the same-filled pages handling completely may be
> > > too aggressive, because it doesn't only affect the memory efficiency,
> > > but also cycles spent when handling those pages. Just avoiding going
> > > through the allocator and compressor has to account for something :)
> >
> > Here is another data point. I tried removing the same-filled handling
> > code completely with the diff Johannes sent upthread. I saw 1.3%
> > improvement in the kernel build test, very similar to the improvement
> > from this patch series. _However_, the kernel build test only produces
> > ~1.5% zero-filled pages in my runs. More realistic workloads have
> > significantly higher percentages as demonstrated upthread.
> >
> > In other words, the kernel build test (at least in my runs) seems to
> > be the worst case scenario for same-filled/zero-filled pages. Since
> > the improvement from removing same-filled handling is quite small in
> > this case, I suspect there will be no improvement, but possibly a
> > regression, on real workloads.
> >
> > As the zero-filled pages ratio increases:
> > - The performance with this series will improve.
> > - The performance with removing same-filled handling completely will
> > become worse.
>
> Sorry, this thread is still really lacking practical perspective.
>
> As do the numbers that initially justified the patch. Sure, the stores
> of same-filled pages are faster. What's the cost of prechecking 90% of
> the other pages that need compression?

Well, that's exactly what I was trying to find out :)

The kernel build test has over 98% pages that are not same-filled in
my runs, so we are paying the cost of prechecking 98% of pages for
same-filled contents needlessly. However, removing same-filled
handling only resulted in a 1.4% performance improvement. We should
expect even less for workloads that have 90% non-same-filled pages,
right?

It seems like the cost of prechecking is not that bad at all,
potentially because the page contents are read into cache anyway. Did
I miss something?
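
For reference, the precheck in question is just a linear read of the
page. A zero-only version looks essentially like this (a sketch, the
helper name is mine and not the exact code in the series):

	static bool page_is_zero_filled(struct page *page)
	{
		unsigned long *data = kmap_local_page(page);
		bool ret;

		/* memchr_inv() returns NULL iff the whole range is zero */
		ret = !memchr_inv(data, 0, PAGE_SIZE);
		kunmap_local(data);
		return ret;
	}

memchr_inv() bails out at the first non-zero byte, so for most pages
that are not zero-filled the scan should terminate early anyway.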

>
> Also, this is the swap path we're talking about. There is vmscan, swap
> slot allocations, page table walks, TLB flushes, zswap tree inserts;
> then a page fault and everything in reverse.
>
> I perf'd zswapping out data that is 10% same-filled and 90% data that
> always needs compression. It does nothing but madvise(MADV_PAGEOUT),
> and the zswap_store() stack is already only ~60% of the cycles.
>
> Using zsmalloc + zstd, this is the diff between vanilla and my patch:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... .................... .....................................................
> #
> 4.34% -3.02% [kernel.kallsyms] [k] zswap_store
> 11.07% +1.41% [kernel.kallsyms] [k] ZSTD_compressBlock_doubleFast
> 15.55% +0.91% [kernel.kallsyms] [k] FSE_buildCTable_wksp
>
> As expected, we have to compress a bit more; on the other hand we're
> removing the content scan for same-filled for 90% of the pages that
> don't benefit from it. They almost amortize each other. Let's round it
> up and the remaining difference is ~1%.

Thanks for the data, this is very useful.

There is also the load path. The original patch that introduced
same-filled pages reports more gains on the load side, which also
happens to be more latency-sensitive. Of course, the data could be
outdated -- but it's a different type of workload than what will be
running in a data center fleet AFAICT.

Is there also no noticeable difference on the load side in your data?

Also, how much increase did you observe in the size of compressed data
with your patch? Just curious about how zstd ended up handling those
pages.

>
> It's difficult to make the case that this matters to any real
> workloads with actual think time in between paging.

If the difference in performance is 1%, I surely agree.

The patch reported 19-32% improvement in store time for same-filled
pages depending on the workload and platform. For 10% same-filled
pages, this should roughly correspond to 2-3% overall improvement,
which is not too far from the 1% you observed.

The patch reported much larger improvement for load times (which
matters more), 49-85% for same-filled pages. If this corresponds to
5-8% overall improvement in zswap load time for a workload with 10%
same-filled pages, that would be very significant. I don't expect to
see such improvements tbh, but we should check.
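
(Back of the envelope, assuming same-filled loads cost about as much as
regular decompressed loads before the optimization, so that 10% of the
pages account for roughly 10% of the load time:

	0.10 * 0.49 ~= 4.9% of total zswap load time saved
	0.10 * 0.85 ~= 8.5% of total zswap load time saved

which is where the 5-8% above comes from.)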

Let's CC Srividya here for visibility.

>
> But let's say you do make the case that zero-filled pages are worth
> optimizing for.

I am not saying they are for sure, but removing the same-filled
checking didn't seem to improve performance much in my testing, so the
cost seems to be mostly in maintenance. This makes it seem to me that
it *could* be a good tradeoff to only handle zero-filled pages. We can
make them slightly faster and we can trim the complexity -- as shown
by this patch.

> Why is this in zswap? Why not do it in vmscan with a
> generic zero-swp_entry_t, and avoid the swap backend altogether? No
> swap slot allocation, no zswap tree, no *IO on disk swap*.

That part I definitely agree with, especially that the logic is
duplicated in zram.

>
> However you slice it, I fail to see how this has a place in
> zswap. It's trying to optimize the slow path of a slow path, at the
> wrong layer of the reclaim stack.

Agreed on the store path; we still need some clarity on the load
path. But yeah, all in all, zswap is not the correct place for such
optimizations; it just happens to be where the handling currently lives.

2024-03-29 21:18:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Fri, Mar 29, 2024 at 11:56:46AM -0700, Yosry Ahmed wrote:
> > I perf'd zswapping out data that is 10% same-filled and 90% data that
> > always needs compression. It does nothing but madvise(MADV_PAGEOUT),
> > and the zswap_store() stack is already only ~60% of the cycles.
> >
> > Using zsmalloc + zstd, this is the diff between vanilla and my patch:
> >
> > # Baseline Delta Abs Shared Object Symbol
> > # ........ ......... .................... .....................................................
> > #
> > 4.34% -3.02% [kernel.kallsyms] [k] zswap_store
> > 11.07% +1.41% [kernel.kallsyms] [k] ZSTD_compressBlock_doubleFast
> > 15.55% +0.91% [kernel.kallsyms] [k] FSE_buildCTable_wksp
> >
> > As expected, we have to compress a bit more; on the other hand we're
> > removing the content scan for same-filled for 90% of the pages that
> > don't benefit from it. They almost amortize each other. Let's round it
> > up and the remaining difference is ~1%.
>
> Thanks for the data, this is very useful.
>
> There is also the load path. The original patch that introduced
> same-filled pages reports more gains on the load side, which also
> happens to be more latency-sensitive. Of course, the data could be
> outdated -- but it's a different type of workload than what will be
> running in a data center fleet AFAICT.
>
> Is there also no noticeable difference on the load side in your data?

9.40% +2.51% [kernel.kallsyms] [k] ZSTD_safecopy
0.76% -0.48% [kernel.kallsyms] [k] zswap_load

About 2% net.

Checking other compression algorithms, lzo eats 27.58%, and lz4
13.82%. The variance between these alone makes our "try to be smarter
than an actual compression algorithm" code look even sillier.

> Also, how much increase did you observe in the size of compressed data
> with your patch? Just curious about how zstd ended up handling those
> pages.

Checking zsmalloc stats, it did pack the same-filled ones down into 32
byte objects. So 0.7% of their original size. That's negligible, even
for workloads that have an unusually high share of them.

> > It's difficult to make the case that this matters to any real
> > workloads with actual think time in between paging.
>
> If the difference in performance is 1%, I surely agree.
>
> The patch reported 19-32% improvement in store time for same-filled
> pages depending on the workload and platform. For 10% same-filled
> pages, this should roughly correspond to 2-3% overall improvement,
> which is not too far from the 1% you observed.

Right.

> The patch reported much larger improvement for load times (which
> matters more), 49-85% for same-filled pages. If this corresponds to
> 5-8% overall improvement in zswap load time for a workload with 10%
> same-filled pages, that would be very significant. I don't expect to
> see such improvements tbh, but we should check.

No, I'm seeing around 2% net.

> > But let's say you do make the case that zero-filled pages are worth
> > optimizing for.
>
> I am not saying they are for sure, but removing the same-filled
> checking didn't seem to improve performance much in my testing, so the
> cost seems to be mostly in maintenance. This makes it seem to me that
> it *could* be a good tradeoff to only handle zero-filled pages. We can
> make them slightly faster and we can trim the complexity -- as shown
> by this patch.

In the original numbers, there was a certain percentage of non-zero
same-filled pages. You still first have to find that number for real
production loads to determine what the tradeoff actually is.

And I disagree that the code is less complex. There are fewer lines to
the memchr_inv() than to the open-coded word-scan, but that scan is
dead simple, self-contained and out of the way.
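
From memory, that scan is roughly the following (not verbatim, but
close enough to make the point):

	static bool zswap_is_page_same_filled(void *ptr, unsigned long *value)
	{
		unsigned long *data = ptr;
		unsigned int pos, last = PAGE_SIZE / sizeof(*data) - 1;

		/* cheap early-out: first and last word must match */
		if (data[0] != data[last])
			return false;

		for (pos = 1; pos < last; pos++) {
			if (data[pos] != data[0])
				return false;
		}

		*value = data[0];
		return true;
	}

One loop over the page, no allocations, no interaction with the rest
of zswap.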

So call that a wash in terms of maintenance burden, but for a
reduction in functionality and a regression risk (however small).

The next patch to store them as special xarray entries is also a wash
at best. It trades the entry having an implicit subtype for no type at
all. zswap_load_zero_filled() looks like it's the fast path, and
decompression the fallback; the entry destructor is now called
"zswap_tree_free_element" and takes a void pointer. It also re-adds
most of the lines deleted by the zero-only patch.

I think this is actually a bit worse than the status quo.

But my point is, this all seems like a lot of opportunity cost in
terms of engineering time, review bandwidth, regression risk, and
readability, hackability, reliability, predictability of the code -
for gains that are still not shown to actually matter in practice.

https://lore.kernel.org/all/[email protected]/

This needs numbers to show not just that your patches are fine, but
that any version of this optimization actually matters for real
workloads. Without that, my vote would be NAK.

2024-04-01 10:38:42

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On 29.03.2024 19:22, Yosry Ahmed wrote:
> On Fri, Mar 29, 2024 at 10:45 AM Johannes Weiner <[email protected]> wrote:
>>
>> On Fri, Mar 29, 2024 at 03:02:10PM +0100, Maciej S. Szmigiero wrote:
>>> On 29.03.2024 03:14, Yosry Ahmed wrote:
>>>> On Thu, Mar 28, 2024 at 1:06 PM Yosry Ahmed <[email protected]> wrote:
>>>>>
>>>>> On Thu, Mar 28, 2024 at 12:11 PM Johannes Weiner <[email protected]> wrote:
>>>>>>
>>>>>> On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
>>>>>>> There is no logical reason to refuse storing same-filled pages more
>>>>>>> efficiently and opt for compression. Remove the userspace knob.
>>>>>>>
>>>>>>> Signed-off-by: Yosry Ahmed <[email protected]>
>>>>>>
>>>>>> Acked-by: Johannes Weiner <[email protected]>
>>>>>>
>>>>>> I also think the non_same_filled_pages_enabled option should go
>>>>>> away. Both of these tunables are pretty bizarre.
>>>>>
>>>>> Happy to remove both in the next version :)
>>>>
>>>> I thought non_same_filled_pages_enabled was introduced with the
>>>> initial support for same-filled pages, but it was introduced
>>>> separately (and much more recently):
>>>> https://lore.kernel.org/all/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com/
>>>>
>>>> I am CCing Maciej to hear more about the use case for this.
>>>
>>> Thanks for CCing me.
>>>
>>> I introduced "non_same_filled_pages_enabled" a few years ago to
>>> enable using zswap in a lightweight mode where it is only used for
>>> its ability to store same-filled pages effectively.
>>
>> But all the pages it rejects go to disk swap instead, which is much
>> slower than compression...
>>
>>> As far as I remember, there were some interactions between full
>>> zswap and the cgroup memory controller - like, it made it easier
>>> for an aggressive workload to exceed its cgroup memory.high limits.
>>
>> Ok, that makes sense! A container fairness measure, rather than a
>> performance optimization.
>>
>> Fair enough, but that's moot then with cgroup accounting of the
>> backing memory, f4840ccfca25 ("zswap: memcg accounting").
>
> Right, this should no longer be needed with the zswap charging.
>
> Maciej, is this still being used on kernels with f4840ccfca25 (5.19+)?
> Any objections to removing it now?

I don't object to its removal as long as stable kernel trees aren't
affected.

Thanks,
Maciej


2024-04-01 18:30:11

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled

On Mon, Apr 1, 2024 at 3:38 AM Maciej S. Szmigiero
<[email protected]> wrote:
>
> On 29.03.2024 19:22, Yosry Ahmed wrote:
> > On Fri, Mar 29, 2024 at 10:45 AM Johannes Weiner <[email protected]> wrote:
> >>
> >> On Fri, Mar 29, 2024 at 03:02:10PM +0100, Maciej S. Szmigiero wrote:
> >>> On 29.03.2024 03:14, Yosry Ahmed wrote:
> >>>> On Thu, Mar 28, 2024 at 1:06 PM Yosry Ahmed <[email protected]> wrote:
> >>>>>
> >>>>> On Thu, Mar 28, 2024 at 12:11 PM Johannes Weiner <[email protected]> wrote:
> >>>>>>
> >>>>>> On Mon, Mar 25, 2024 at 11:50:13PM +0000, Yosry Ahmed wrote:
> >>>>>>> There is no logical reason to refuse storing same-filled pages more
> >>>>>>> efficiently and opt for compression. Remove the userspace knob.
> >>>>>>>
> >>>>>>> Signed-off-by: Yosry Ahmed <[email protected]>
> >>>>>>
> >>>>>> Acked-by: Johannes Weiner <[email protected]>
> >>>>>>
> >>>>>> I also think the non_same_filled_pages_enabled option should go
> >>>>>> away. Both of these tunables are pretty bizarre.
> >>>>>
> >>>>> Happy to remove both in the next version :)
> >>>>
> >>>> I thought non_same_filled_pages_enabled was introduced with the
> >>>> initial support for same-filled pages, but it was introduced
> >>>> separately (and much more recently):
> >>>> https://lore.kernel.org/all/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com/
> >>>>
> >>>> I am CCing Maciej to hear more about the use case for this.
> >>>
> >>> Thanks for CCing me.
> >>>
> >>> I introduced "non_same_filled_pages_enabled" a few years ago to
> >>> enable using zswap in a lightweight mode where it is only used for
> >>> its ability to store same-filled pages effectively.
> >>
> >> But all the pages it rejects go to disk swap instead, which is much
> >> slower than compression...
> >>
> >>> As far as I remember, there were some interactions between full
> >>> zswap and the cgroup memory controller - like, it made it easier
> >>> for an aggressive workload to exceed its cgroup memory.high limits.
> >>
> >> Ok, that makes sense! A container fairness measure, rather than a
> >> performance optimization.
> >>
> >> Fair enough, but that's moot then with cgroup accounting of the
> >> backing memory, f4840ccfca25 ("zswap: memcg accounting").
> >
> > Right, this should no longer be needed with the zswap charging.
> >
> > Maciej, is this still being used on kernels with f4840ccfca25 (5.19+)?
> > Any objections to removing it now?
>
> I don't object to its removal as long as stable kernel trees aren't
> affected.

Yeah this isn't something that would be backported to stable kernels.
Thanks for confirming.

2024-03-29 22:29:57

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Fri, Mar 29, 2024 at 2:17 PM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Mar 29, 2024 at 11:56:46AM -0700, Yosry Ahmed wrote:
> > > I perf'd zswapping out data that is 10% same-filled and 90% data that
> > > always needs compression. It does nothing but madvise(MADV_PAGEOUT),
> > > and the zswap_store() stack is already only ~60% of the cycles.
> > >
> > > Using zsmalloc + zstd, this is the diff between vanilla and my patch:
> > >
> > > # Baseline Delta Abs Shared Object Symbol
> > > # ........ ......... .................... ....................................................
> > > #
> > > 4.34% -3.02% [kernel.kallsyms] [k] zswap_store
> > > 11.07% +1.41% [kernel.kallsyms] [k] ZSTD_compressBlock_doubleFast
> > > 15.55% +0.91% [kernel.kallsyms] [k] FSE_buildCTable_wksp
> > >
> > > As expected, we have to compress a bit more; on the other hand we're
> > > removing the content scan for same-filled for 90% of the pages that
> > > don't benefit from it. They almost amortize each other. Let's round it
> > > up and the remaining difference is ~1%.
> >
> > Thanks for the data, this is very useful.
> >
> > There is also the load path. The original patch that introduced
> > same-filled pages reports more gains on the load side, which also
> > happens to be more latency-sensitive. Of course, the data could be
> > outdated -- but it's a different type of workload than what will be
> > running in a data center fleet AFAICT.
> >
> > Is there also no noticeable difference on the load side in your data?
>
> 9.40% +2.51% [kernel.kallsyms] [k] ZSTD_safecopy
> 0.76% -0.48% [kernel.kallsyms] [k] zswap_load
>
> About 2% net.
>
> Checking other compression algorithms, lzo eats 27.58%, and lz4
> 13.82%. The variance between these alone makes our "try to be smarter
> than an actual compression algorithm" code look even sillier.
>
> > Also, how much increase did you observe in the size of compressed data
> > with your patch? Just curious about how zstd ended up handling those
> > pages.
>
> Checking zsmalloc stats, it did pack the same-filled ones down into 32
> byte objects. So 0.7% of their original size. That's negligible, even
> for workloads that have an unusually high share of them.

Glad to see that this was handled appropriately.

>
> > > It's difficult to make the case that this matters to any real
> > > workloads with actual think time in between paging.
> >
> > If the difference in performance is 1%, I surely agree.
> >
> > The patch reported 19-32% improvement in store time for same-filled
> > pages depending on the workload and platform. For 10% same-filled
> > pages, this should roughly correspond to 2-3% overall improvement,
> > which is not too far from the 1% you observed.
>
> Right.
>
> > The patch reported much larger improvement for load times (which
> > matters more), 49-85% for same-filled pages. If this corresponds to
> > 5-8% overall improvement in zswap load time for a workload with 10%
> > same-filled pages, that would be very significant. I don't expect to
> > see such improvements tbh, but we should check.
>
> No, I'm seeing around 2% net.

You mentioned that other compressors eat more cycles in this case, so
perhaps the data in the original patch comes from one of those
compressors.

>
> > > But let's say you do make the case that zero-filled pages are worth
> > > optimizing for.
> >
> > I am not saying they are for sure, but removing the same-filled
> > checking didn't seem to improve performance much in my testing, so the
> > cost seems to be mostly in maintenance. This makes it seem to me that
> > it *could* be a good tradeoff to only handle zero-filled pages. We can
> > make them slightly faster and we can trim the complexity -- as shown
> > by this patch.
>
> In the original numbers, there was a certain percentage of non-zero
> same-filled pages. You still first have to find that number for real
> production loads to determine what the tradeoff actually is.
>
> And I disagree that the code is less complex. There are fewer lines to
> the memchr_inv() than to the open-coded word-scan, but that scan is
> dead simple, self-contained and out of the way.
>
> So call that a wash in terms of maintenance burden, but for a
> reduction in functionality and a regression risk (however small).
>
> The next patch to store them as special xarray entries is also a wash
> at best. It trades the entry having an implicit subtype for no type at
> all. zswap_load_zero_filled() looks like it's the fast path, and
> decompression the fallback; the entry destructor is now called
> "zswap_tree_free_element" and takes a void pointer. It also re-adds
> most of the lines deleted by the zero-only patch.
>
> I think this is actually a bit worse than the status quo.
>
> But my point is, this all seems like a lot of opportunity cost in
> terms of engineering time, review bandwidth, regression risk, and
> readability, hackability, reliability, predictability of the code -
> for gains that are still not shown to actually matter in practice.

Yeah in terms of code cleanliness it did not turn out as I thought it
would. Actually, even using different subtypes would require either
similarly abstract (yet less clear) helpers or completely separate
paths with code duplication -- neither of which is ideal. So perhaps
it's better to leave it alone (and perhaps clean it up slightly) or
remove it completely.

I wanted to see what others thought, and I was aware it's
controversial (hence RFC) :)

Anyway, I will send a separate series with only cleanups and removing
knobs. We can discuss completely removing same-filled handling
separately. I suspect a 2% regression in the load path (and potentially
larger with other compressors) may not be okay.

If handling for zero-filled pages is added directly in reclaim as you
suggested though, then the justification for handling other patterns
in zswap becomes much less :) Handling it in reclaim also means we
avoid IO for zero pages completely, which would be even more
beneficial than just avoiding compression/decompression in zswap.
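
Purely as a strawman of the shape this could take (every name below is
made up to illustrate the idea, none of it is an existing API):

	/* reclaim side, before touching the swap backend at all */
	if (folio_is_zero_filled(folio)) {
		/* hypothetical: install a reserved "zero" swap entry in the PTEs */
		install_zero_swap_entries(folio);
		/* no swap slot, no zswap entry, no disk I/O */
		folio_discard_after_unmap(folio);
	}

	/* fault side */
	if (is_zero_swap_entry(entry)) {
		/* hypothetical: hand back a freshly zeroed page, no swap-in */
		return fault_in_zero_page(vmf);
	}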

>
> https://lore.kernel.org/all/[email protected]/
>
> This needs numbers to show not just that your patches are fine, but
> that any version of this optimization actually matters for real
> workloads. Without that, my vote would be NAK.

2024-03-29 04:28:10

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 7:05 PM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 4:19 PM Nhat Pham <[email protected]> wrote:
> >
> > On Thu, Mar 28, 2024 at 2:07 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Thu, Mar 28, 2024 at 01:23:42PM -0700, Yosry Ahmed wrote:
> > > > On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
> > > > >
> > > > > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > > > > The current same-filled pages handling supports pages filled with any
> > > > > > repeated word-sized pattern. However, in practice, most of these should
> > > > > > be zero pages anyway. Other patterns should not be nearly as common.
> > > > > >
> > > > > > Drop the support for non-zero same-filled pages, but keep the names of
> > > > > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > > > > inaccurate.
> > > > > >
> > > > > > This yields some nice code simplification and enables a following patch
> > > > > > that eliminates the need to allocate struct zswap_entry for those pages
> > > > > > completely.
> > > > > >
> > > > > > There is also a very small performance improvement observed over 50 runs
> > > > > > of kernel build test (kernbench) comparing the mean build time on a
> > > > > > skylake machine when building the kernel in a cgroup v1 container with a
> > > > > > 3G limit:
> > > > > >
> > > > > > base patched % diff
> > > > > > real 70.167 69.915 -0.359%
> > > > > > user 2953.068 2956.147 +0.104%
> > > > > > sys 2612.811 2594.718 -0.692%
> > > > > >
> > > > > > This probably comes from more optimized operations like memchr_inv() and
> > > > > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > > > > this test was only around 1.5% on average, and was not affected by this
> > > > > > patch. Practical workloads could have a larger proportion of such pages
> > > > > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > > > > should be larger.
> > > > > >
> > > > > > [1]https://lore.kernel.org/linux-mm/[email protected]/
> > > > > >
> > > > > > Signed-off-by: Yosry Ahmed <[email protected]>
> > > > >
> > > > > This is an interesting direction to pursue, but I actually think it
> > > > > doesn't go far enough. Either way, I think it needs more data.
> > > > >
> > > > > 1) How frequent are non-zero-same-filled pages? Difficult to
> > > > > generalize, but if you could gather some from your fleet, that
> > > > > would be useful. If you can devise a portable strategy, I'd also be
> > > > > more than happy to gather this on ours (although I think you have
> > > > > more widespread zswap use, whereas we have more disk swap.)
> > > >
> > > > I am trying to collect the data, but there are.. hurdles. It would
> > > > take some time, so I was hoping the data could be collected elsewhere
> > > > if possible.
> > > >
> > > > The idea I had was to hook a BPF program to the entry of
> > > > zswap_fill_page() and create a histogram of the "value" argument. We
> > > > would get more coverage by hooking it to the return of
> > > > zswap_is_page_same_filled() and only updating the histogram if the
> > > > return value is true, as it includes pages in zswap that haven't been
> > > > swapped in.
> > > >
> > > > However, with zswap_is_page_same_filled() the BPF program will run in
> > > > all zswap stores, whereas for zswap_fill_page() it will only run when
> > > > needed. Not sure if this makes a practical difference tbh.
> > > >
> > > > >
> > > > > 2) The fact that we're doing any of this pattern analysis in zswap at
> > > > > all strikes me as a bit misguided. Being efficient about repetitive
> > > > > patterns is squarely in the domain of a compression algorithm. Do
> > > > > we not trust e.g. zstd to handle this properly?
> > > >
> > > > I thought about this briefly, but I didn't follow through. I could try
> > > > to collect some data by swapping out different patterns and observing
> > > > how different compression algorithms react. That would be interesting
> > > > for sure.
> > > >
> > > > >
> > > > > I'm guessing this goes back to inefficient packing from something
> > > > > like zbud, which would waste half a page on one repeating byte.
> > > > >
> > > > > But zsmalloc can do 32 byte objects. It's also a batching slab
> > > > > allocator, where storing a series of small, same-sized objects is
> > > > > quite fast.
> > > > >
> > > > > Add to that the additional branches, the additional kmap, the extra
> > > > > scanning of every single page for patterns - all in the fast path
> > > > > of zswap, when we already know that the vast majority of incoming
> > > > > pages will need to be properly compressed anyway.
> > > > >
> > > > > Maybe it's time to get rid of the special handling entirely?
> > > >
> > > > We would still be wasting some memory (~96 bytes between zswap_entry
> > > > and zsmalloc object), and wasting cycles allocating them. This could
> > > > be made up for by cycles saved by removing the handling. We will be
> > > > saving some branches for sure. I am not worried about kmap as I think
> > > > it's a noop in most cases.
> > >
> > > Yes, true.
> > >
> > > > I am interested to see how much we could save by removing scanning for
> > > > patterns. We may not save much if we abort after reading a few words
> > > > in most cases, but I guess we could also be scanning a considerable
> > > > amount before aborting. On the other hand, we would be reading the
> > > > page contents into cache anyway for compression, so maybe it doesn't
> > > > really matter?
> > > >
> > > > I will try to collect some data about this. I will start by trying to
> > > > find out how the compression algorithms handle same-filled pages. If
> > > > they can compress it efficiently, then I will try to get more data on
> > > > the tradeoff from removing the handling.
> > >
> > > I do wonder if this could be overthinking it, too.
> > >
> > > Double checking the numbers on our fleet, a 96 additional bytes for
> > > each same-filled entry would result in a
> > >
> > > 1) p50 waste of 0.008% of total memory, and a
> > >
> > > 2) p99 waste of 0.06% of total memory.
>
> Right. Assuming the compressors do not surprise us and store
> same-filled pages in an absurd way, it's not worth it in terms of
> memory savings.
>
> > >
> > > And this is without us having even thought about trying to make
> > > zsmalloc more efficient for this particular usecase - which might be
> > > the better point of attack, if we think it's actually worth it.
> > >
> > > So my take is that unless removing it would be outright horrible from
> > > a %sys POV (which seems pretty unlikely), IMO it would be fine to just
> > > delete it entirely with a "not worth the maintenance cost" argument.
> > >
> > > If you turn the argument around, and somebody submitted the code as
> > > it is today, with the numbers being what they are above, I'm not sure
> > > we would even accept it!
> >
> > The context guy is here :)
> >
> > Not arguing for one way or another, but I did find the original patch
> > that introduced same filled page handling:
> >
> > https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f
> >
> > https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u
>
> Thanks for digging this up. I don't know why I didn't start there :)
>
> Following in your footsteps, and given that zram has the same feature,
> I found the patch that added support for non-zero same-filled pages in
> zram:
> https://lore.kernel.org/all/[email protected]/#t
>
> Both of them confirm that most same-filled pages are zero pages, but
> they show a more significant portion of same-filled pages being
> non-zero (17% to 40%). I suspect this will be less in data centers
> compared to consumer apps.
>
> The zswap patch also reports significant performance improvements from
> the same-filled handling, but this is with 17-22% same-filled pages.
> Johannes mentioned around 10% in your data centers, so the performance
> improvement would be less. In the kernel build tests I ran with only
> around 1.5% same-filled pages I observed 1.4% improvements just by
> optimizing them (only zero-filled, skipping allocations).
>
> So I think removing the same-filled pages handling completely may be
> too aggressive, because it doesn't only affect the memory efficiency,
> but also cycles spent when handling those pages. Just avoiding going
> through the allocator and compressor has to account for something :)

Here is another data point. I tried removing the same-filled handling
code completely with the diff Johannes sent upthread. I saw 1.3%
improvement in the kernel build test, very similar to the improvement
from this patch series. _However_, the kernel build test only produces
~1.5% zero-filled pages in my runs. More realistic workloads have
significantly higher percentages as demonstrated upthread.

In other words, the kernel build test (at least in my runs) seems to
be the worst case scenario for same-filled/zero-filled pages. Since
the improvement from removing same-filled handling is quite small in
this case, I suspect there will be no improvement, but possibly a
regression, on real workloads.

As the zero-filled pages ratio increases:
- The performance with this series will improve.
- The performance with removing same-filled handling completely will
become worse.

2024-03-29 02:10:30

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 4:34 PM Nhat Pham <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 1:24 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > > The current same-filled pages handling supports pages filled with any
> > > > repeated word-sized pattern. However, in practice, most of these should
> > > > be zero pages anyway. Other patterns should not be nearly as common.
> > > >
> > > > Drop the support for non-zero same-filled pages, but keep the names of
> > > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > > inaccurate.
> > > >
> > > > This yields some nice code simplification and enables a following patch
> > > > that eliminates the need to allocate struct zswap_entry for those pages
> > > > completely.
> > > >
> > > > There is also a very small performance improvement observed over 50 runs
> > > > of kernel build test (kernbench) comparing the mean build time on a
> > > > skylake machine when building the kernel in a cgroup v1 container with a
> > > > 3G limit:
> > > >
> > > > base patched % diff
> > > > real 70.167 69.915 -0.359%
> > > > user 2953.068 2956.147 +0.104%
> > > > sys 2612.811 2594.718 -0.692%
> > > >
> > > > This probably comes from more optimized operations like memchr_inv() and
> > > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > > this test was only around 1.5% on average, and was not affected by this
> > > > patch. Practical workloads could have a larger proportion of such pages
> > > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > > should be larger.
> > > >
> > > > [1]https://lore.kernel.org/linux-mm/20240320210716.GH294822@cmpxchg.org/
> > > >
> > > > Signed-off-by: Yosry Ahmed <[email protected]>
> > >
> > > This is an interesting direction to pursue, but I actually think it
> > > doesn't go far enough. Either way, I think it needs more data.
> > >
> > > 1) How frequent are non-zero-same-filled pages? Difficult to
> > > generalize, but if you could gather some from your fleet, that
> > > would be useful. If you can devise a portable strategy, I'd also be
> > > more than happy to gather this on ours (although I think you have
> > > more widespread zswap use, whereas we have more disk swap.)
> >
> > I am trying to collect the data, but there are.. hurdles. It would
> > take some time, so I was hoping the data could be collected elsewhere
> > if possible.
> >
> > The idea I had was to hook a BPF program to the entry of
> > zswap_fill_page() and create a histogram of the "value" argument. We
> > would get more coverage by hooking it to the return of
> > zswap_is_page_same_filled() and only updating the histogram if the
> > return value is true, as it includes pages in zswap that haven't been
> > swapped in.
> >
> > However, with zswap_is_page_same_filled() the BPF program will run in
> > all zswap stores, whereas for zswap_fill_page() it will only run when
> > needed. Not sure if this makes a practical difference tbh.
> >
> > >
> > > 2) The fact that we're doing any of this pattern analysis in zswap at
> > > all strikes me as a bit misguided. Being efficient about repetitive
> > > patterns is squarely in the domain of a compression algorithm. Do
> > > we not trust e.g. zstd to handle this properly?
> >
> > I thought about this briefly, but I didn't follow through. I could try
> > to collect some data by swapping out different patterns and observing
> > how different compression algorithms react. That would be interesting
> > for sure.
> >
> > >
> > > I'm guessing this goes back to inefficient packing from something
> > > like zbud, which would waste half a page on one repeating byte.
> > >
> > > But zsmalloc can do 32 byte objects. It's also a batching slab
> > > allocator, where storing a series of small, same-sized objects is
> > > quite fast.
> > >
> > > Add to that the additional branches, the additional kmap, the extra
> > > scanning of every single page for patterns - all in the fast path
> > > of zswap, when we already know that the vast majority of incoming
> > > pages will need to be properly compressed anyway.
> > >
> > > Maybe it's time to get rid of the special handling entirely?
> >
> > We would still be wasting some memory (~96 bytes between zswap_entry
> > and zsmalloc object), and wasting cycles allocating them. This could
> > be made up for by cycles saved by removing the handling. We will be
> > saving some branches for sure. I am not worried about kmap as I think
> > it's a noop in most cases.
>
> A secondary effect of the current same-filled page handling is that
> we're not considering them for reclaim. Which could potentially be
> beneficial, because we're not saving much memory (essentially just the
> zswap entry and associated cost of storing them) by writing these
> pages back - IOW, the cost / benefit ratio for reclaiming these pages
> is quite atrocious.

Yes, but I think this applies even without same-filled pages. Johannes
mentioned that zsmalloc could store compressed pages down to 32 bytes
in size. If these are common, it would be absurd to write them out too. We
already have some kind of heuristic in the shrinker to slow down
writeback if the compression ratio is high. Perhaps it's worth
skipping writeback completely for some pages on the LRU, even if this
means we violate the LRU ordering. We already do this for same-filled
pages, so it may make sense to generalize it.
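
Very roughly, I imagine something like this in the shrinker's
writeback callback (the threshold is made up, just to illustrate):

	/* entry->length is the compressed size in bytes */
	if (entry->length < ZSWAP_WRITEBACK_MIN_SIZE)	/* hypothetical knob */
		return LRU_ROTATE;	/* freeing a few bytes is not worth a page of I/O */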

>
> Again, all of this is just handwaving without numbers. It'd be nice if
> we could have more concrete data for this conversation :P