2024-06-04 11:00:12

by Usama Arif

Subject: [PATCH v2 0/2] mm: clear pte for folios that are zero filled

As shown in the patch series that introduced the zswap same-filled
optimization [1], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet.
Using VM counters in swap_writepage (not included in this patch series),
it was found that less than 1% of the same-filled pages being swapped
out are non-zero.

For a conventional swap setup (without zswap), rather than reading/writing
these pages to flash, which results in increased I/O and flash wear, the
pte can be cleared for those addresses at unmap time while shrinking the
folio list. When this later causes a page fault, do_pte_missing will take
care of the page.
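
As a userspace analogy for the fault-side behaviour (not part of this
series): madvise(MADV_DONTNEED) on a private anonymous mapping also leaves
the ptes cleared, and the next read fault is resolved by do_pte_missing
with a zero-filled page. A minimal sketch:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 0, page);		 /* fault the page in, fill with zeroes */
	madvise(p, page, MADV_DONTNEED); /* drop the pte */

	/* The next access faults; do_pte_missing maps a zero page again. */
	printf("byte after refault: %d\n", p[0]);

	munmap(p, page);
	return 0;
}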

When using zswap, this also means that a zswap_entry does not need to be
allocated for zero-filled pages, resulting in memory savings.

A similar attempt was made earlier in [2], where zswap would track only
zero-filled pages instead of same-filled ones.
This patch series enables the zero-filled page optimization by default
(hence it can be used even if zswap is disabled) and removes the
same-filled code from zswap (as only 1% of the same-filled pages are
non-zero), simplifying the code.

This patch series is based on mm-unstable.

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[2] https://lore.kernel.org/lkml/[email protected]/

---
v1 -> v2:
- Instead of using a bitmap in swap, clear the pte for zero-filled pages
and let do_pte_missing handle the page at fault time. (Yosry and Matthew)
- Check the end of the page first when checking whether a folio is
zero-filled, as it could lead to better performance. (Yosry)

Usama Arif (2):
mm: clear pte for folios that are zero filled
mm: remove code to handle same filled pages

include/linux/rmap.h | 1 +
mm/rmap.c | 163 ++++++++++++++++++++++---------------------
mm/vmscan.c | 89 ++++++++++++++++-------
mm/zswap.c | 86 +++--------------------
4 files changed, 158 insertions(+), 181 deletions(-)

--
2.43.0



2024-06-04 11:08:14

by Usama Arif

Subject: [PATCH v2 2/2] mm: remove code to handle same filled pages

With an earlier commit handling zero-filled pages directly in vmscan,
and with only 1% of the same-filled pages being non-zero, zswap no
longer needs to handle same-filled pages and can just work on compressed
pages.

Signed-off-by: Usama Arif <[email protected]>
---
mm/zswap.c | 86 +++++-------------------------------------------------
1 file changed, 8 insertions(+), 78 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9b..ca8df9c99abf 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -44,8 +44,6 @@
**********************************/
/* The number of compressed pages currently stored in zswap */
atomic_t zswap_stored_pages = ATOMIC_INIT(0);
-/* The number of same-value filled pages currently stored in zswap */
-static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);

/*
* The statistics below are not protected from concurrent access for
@@ -182,11 +180,9 @@ static struct shrinker *zswap_shrinker;
*
* swpentry - associated swap entry, the offset indexes into the red-black tree
* length - the length in bytes of the compressed page data. Needed during
- * decompression. For a same value filled page length is 0, and both
- * pool and lru are invalid and must be ignored.
+ * decompression.
* pool - the zswap_pool the entry's data is in
* handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
* objcg - the obj_cgroup that the compressed memory is charged to
* lru - handle to the pool's lru used to evict pages.
*/
@@ -194,10 +190,7 @@ struct zswap_entry {
swp_entry_t swpentry;
unsigned int length;
struct zswap_pool *pool;
- union {
- unsigned long handle;
- unsigned long value;
- };
+ unsigned long handle;
struct obj_cgroup *objcg;
struct list_head lru;
};
@@ -814,13 +807,9 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
*/
static void zswap_entry_free(struct zswap_entry *entry)
{
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zswap_lru_del(&zswap_list_lru, entry);
- zpool_free(zswap_find_zpool(entry), entry->handle);
- zswap_pool_put(entry->pool);
- }
+ zswap_lru_del(&zswap_list_lru, entry);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
+ zswap_pool_put(entry->pool);
if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
obj_cgroup_put(entry->objcg);
@@ -1262,11 +1251,6 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
* This ensures that the better zswap compresses memory, the fewer
* pages we will evict to swap (as it will otherwise incur IO for
* relatively small memory saving).
- *
- * The memory saving factor calculated here takes same-filled pages into
- * account, but those are not freeable since they almost occupy no
- * space. Hence, we may scale nr_freeable down a little bit more than we
- * should if we have a lot of same-filled pages.
*/
return mult_frac(nr_freeable, nr_backing, nr_stored);
}
@@ -1370,42 +1354,6 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > thr);
}

-/*********************************
-* same-filled functions
-**********************************/
-static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
-{
- unsigned long *data;
- unsigned long val;
- unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
- bool ret = false;
-
- data = kmap_local_folio(folio, 0);
- val = data[0];
-
- if (val != data[last_pos])
- goto out;
-
- for (pos = 1; pos < last_pos; pos++) {
- if (val != data[pos])
- goto out;
- }
-
- *value = val;
- ret = true;
-out:
- kunmap_local(data);
- return ret;
-}
-
-static void zswap_fill_folio(struct folio *folio, unsigned long value)
-{
- unsigned long *data = kmap_local_folio(folio, 0);
-
- memset_l(data, value, PAGE_SIZE / sizeof(unsigned long));
- kunmap_local(data);
-}
-
/*********************************
* main API
**********************************/
@@ -1417,7 +1365,6 @@ bool zswap_store(struct folio *folio)
struct zswap_entry *entry, *old;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
- unsigned long value;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1450,13 +1397,6 @@ bool zswap_store(struct folio *folio)
goto reject;
}

- if (zswap_is_folio_same_filled(folio, &value)) {
- entry->length = 0;
- entry->value = value;
- atomic_inc(&zswap_same_filled_pages);
- goto store_entry;
- }
-
/* if entry is successfully added, it keeps the reference */
entry->pool = zswap_pool_current_get();
if (!entry->pool)
@@ -1474,7 +1414,6 @@ bool zswap_store(struct folio *folio)
if (!zswap_compress(folio, entry))
goto put_pool;

-store_entry:
entry->swpentry = swp;
entry->objcg = objcg;

@@ -1522,13 +1461,9 @@ bool zswap_store(struct folio *folio)
return true;

store_failed:
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zpool_free(zswap_find_zpool(entry), entry->handle);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
- zswap_pool_put(entry->pool);
- }
+ zswap_pool_put(entry->pool);
freepage:
zswap_entry_cache_free(entry);
reject:
@@ -1577,10 +1512,7 @@ bool zswap_load(struct folio *folio)
if (!entry)
return false;

- if (entry->length)
- zswap_decompress(entry, folio);
- else
- zswap_fill_folio(folio, entry->value);
+ zswap_decompress(entry, folio);

count_vm_event(ZSWPIN);
if (entry->objcg)
@@ -1682,8 +1614,6 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, NULL, &total_size_fops);
debugfs_create_atomic_t("stored_pages", 0444,
zswap_debugfs_root, &zswap_stored_pages);
- debugfs_create_atomic_t("same_filled_pages", 0444,
- zswap_debugfs_root, &zswap_same_filled_pages);

return 0;
}
--
2.43.0


2024-06-04 11:08:19

by Usama Arif

Subject: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, the pte can be cleared for those
addresses at unmap time while shrinking folio list. When this
causes a page fault, do_pte_missing will take care of this page.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/

Signed-off-by: Usama Arif <[email protected]>
---
include/linux/rmap.h | 1 +
mm/rmap.c | 163 ++++++++++++++++++++++---------------------
mm/vmscan.c | 89 ++++++++++++++++-------
3 files changed, 150 insertions(+), 103 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bb53e5920b88..b36db1e886e4 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -100,6 +100,7 @@ enum ttu_flags {
* do a final flush if necessary */
TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
* caller holds it */
+ TTU_ZERO_FOLIO = 0x100,/* zero folio */
};

#ifdef CONFIG_MMU
diff --git a/mm/rmap.c b/mm/rmap.c
index 52357d79917c..d98f70876327 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
dec_mm_counter(mm, mm_counter(folio));
} else if (folio_test_anon(folio)) {
- swp_entry_t entry = page_swap_entry(subpage);
- pte_t swp_pte;
- /*
- * Store the swap location in the pte.
- * See handle_pte_fault() ...
- */
- if (unlikely(folio_test_swapbacked(folio) !=
- folio_test_swapcache(folio))) {
+ if (flags & TTU_ZERO_FOLIO) {
+ pte_clear(mm, address, pvmw.pte);
+ dec_mm_counter(mm, MM_ANONPAGES);
+ } else {
+ swp_entry_t entry = page_swap_entry(subpage);
+ pte_t swp_pte;
/*
- * unmap_huge_pmd_locked() will unmark a
- * PMD-mapped folio as lazyfree if the folio or
- * its PMD was redirtied.
+ * Store the swap location in the pte.
+ * See handle_pte_fault() ...
*/
- if (!pmd_mapped)
- WARN_ON_ONCE(1);
- goto walk_done_err;
- }
+ if (unlikely(folio_test_swapbacked(folio) !=
+ folio_test_swapcache(folio))) {
+ /*
+ * unmap_huge_pmd_locked() will unmark a
+ * PMD-mapped folio as lazyfree if the folio or
+ * its PMD was redirtied.
+ */
+ if (!pmd_mapped)
+ WARN_ON_ONCE(1);
+ goto walk_done_err;
+ }

- /* MADV_FREE page check */
- if (!folio_test_swapbacked(folio)) {
- int ref_count, map_count;
+ /* MADV_FREE page check */
+ if (!folio_test_swapbacked(folio)) {
+ int ref_count, map_count;

- /*
- * Synchronize with gup_pte_range():
- * - clear PTE; barrier; read refcount
- * - inc refcount; barrier; read PTE
- */
- smp_mb();
+ /*
+ * Synchronize with gup_pte_range():
+ * - clear PTE; barrier; read refcount
+ * - inc refcount; barrier; read PTE
+ */
+ smp_mb();

- ref_count = folio_ref_count(folio);
- map_count = folio_mapcount(folio);
+ ref_count = folio_ref_count(folio);
+ map_count = folio_mapcount(folio);

- /*
- * Order reads for page refcount and dirty flag
- * (see comments in __remove_mapping()).
- */
- smp_rmb();
+ /*
+ * Order reads for page refcount and dirty flag
+ * (see comments in __remove_mapping()).
+ */
+ smp_rmb();

- /*
- * The only page refs must be one from isolation
- * plus the rmap(s) (dropped by discard:).
- */
- if (ref_count == 1 + map_count &&
- !folio_test_dirty(folio)) {
- dec_mm_counter(mm, MM_ANONPAGES);
- goto discard;
- }
+ /*
+ * The only page refs must be one from isolation
+ * plus the rmap(s) (dropped by discard:).
+ */
+ if (ref_count == 1 + map_count &&
+ !folio_test_dirty(folio)) {
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ }

- /*
- * If the folio was redirtied, it cannot be
- * discarded. Remap the page to page table.
- */
- set_pte_at(mm, address, pvmw.pte, pteval);
- folio_set_swapbacked(folio);
- goto walk_done_err;
- }
+ /*
+ * If the folio was redirtied, it cannot be
+ * discarded. Remap the page to page table.
+ */
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ folio_set_swapbacked(folio);
+ goto walk_done_err;
+ }

- if (swap_duplicate(entry) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
- goto walk_done_err;
- }
- if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- goto walk_done_err;
- }
+ if (swap_duplicate(entry) < 0) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ goto walk_done_err;
+ }
+ if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ swap_free(entry);
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ goto walk_done_err;
+ }

- /* See folio_try_share_anon_rmap(): clear PTE first. */
- if (anon_exclusive &&
- folio_try_share_anon_rmap_pte(folio, subpage)) {
- swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- goto walk_done_err;
- }
- if (list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- if (list_empty(&mm->mmlist))
- list_add(&mm->mmlist, &init_mm.mmlist);
- spin_unlock(&mmlist_lock);
+ /* See folio_try_share_anon_rmap(): clear PTE first. */
+ if (anon_exclusive &&
+ folio_try_share_anon_rmap_pte(folio, subpage)) {
+ swap_free(entry);
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ goto walk_done_err;
+ }
+ if (list_empty(&mm->mmlist)) {
+ spin_lock(&mmlist_lock);
+ if (list_empty(&mm->mmlist))
+ list_add(&mm->mmlist, &init_mm.mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+ dec_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter(mm, MM_SWAPENTS);
+ swp_pte = swp_entry_to_pte(entry);
+ if (anon_exclusive)
+ swp_pte = pte_swp_mkexclusive(swp_pte);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
}
- dec_mm_counter(mm, MM_ANONPAGES);
- inc_mm_counter(mm, MM_SWAPENTS);
- swp_pte = swp_entry_to_pte(entry);
- if (anon_exclusive)
- swp_pte = pte_swp_mkexclusive(swp_pte);
- if (pte_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- set_pte_at(mm, address, pvmw.pte, swp_pte);
} else {
/*
* This is a locked file-backed folio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b9170f767353..d54f44b556f0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1026,6 +1026,38 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
return !data_race(folio_swap_flags(folio) & SWP_FS_OPS);
}

+static bool is_folio_page_zero_filled(struct folio *folio, int i)
+{
+ unsigned long *data;
+ unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
+ bool ret = false;
+
+ data = kmap_local_folio(folio, i * PAGE_SIZE);
+
+ if (data[last_pos])
+ goto out;
+
+ for (pos = 0; pos < last_pos; pos++) {
+ if (data[pos])
+ goto out;
+ }
+ ret = true;
+out:
+ kunmap_local(data);
+ return ret;
+}
+
+static bool is_folio_zero_filled(struct folio *folio)
+{
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ if (!is_folio_page_zero_filled(folio, i))
+ return false;
+ }
+ return true;
+}
+
/*
* shrink_folio_list() returns the number of reclaimed pages
*/
@@ -1053,6 +1085,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum folio_references references = FOLIOREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
+ bool folio_zero_filled = false;

cond_resched();

@@ -1270,6 +1303,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
nr_pages = 1;
}

+ folio_zero_filled = is_folio_zero_filled(folio);
/*
* The folio is mapped into the page tables of one or more
* processes. Try to unmap it here.
@@ -1295,6 +1329,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
flags |= TTU_SYNC;

+ if (folio_zero_filled)
+ flags |= TTU_ZERO_FOLIO;
+
try_to_unmap(folio, flags);
if (folio_mapped(folio)) {
stat->nr_unmap_fail += nr_pages;
@@ -1358,32 +1395,36 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
- switch (pageout(folio, mapping, &plug)) {
- case PAGE_KEEP:
- goto keep_locked;
- case PAGE_ACTIVATE:
- goto activate_locked;
- case PAGE_SUCCESS:
- stat->nr_pageout += nr_pages;
+ if (folio_zero_filled) {
+ folio_clear_dirty(folio);
+ } else {
+ switch (pageout(folio, mapping, &plug)) {
+ case PAGE_KEEP:
+ goto keep_locked;
+ case PAGE_ACTIVATE:
+ goto activate_locked;
+ case PAGE_SUCCESS:
+ stat->nr_pageout += nr_pages;

- if (folio_test_writeback(folio))
- goto keep;
- if (folio_test_dirty(folio))
- goto keep;
+ if (folio_test_writeback(folio))
+ goto keep;
+ if (folio_test_dirty(folio))
+ goto keep;

- /*
- * A synchronous write - probably a ramdisk. Go
- * ahead and try to reclaim the folio.
- */
- if (!folio_trylock(folio))
- goto keep;
- if (folio_test_dirty(folio) ||
- folio_test_writeback(folio))
- goto keep_locked;
- mapping = folio_mapping(folio);
- fallthrough;
- case PAGE_CLEAN:
- ; /* try to free the folio below */
+ /*
+ * A synchronous write - probably a ramdisk. Go
+ * ahead and try to reclaim the folio.
+ */
+ if (!folio_trylock(folio))
+ goto keep;
+ if (folio_test_dirty(folio) ||
+ folio_test_writeback(folio))
+ goto keep_locked;
+ mapping = folio_mapping(folio);
+ fallthrough;
+ case PAGE_CLEAN:
+ ; /* try to free the folio below */
+ }
}
}

--
2.43.0


2024-06-04 12:18:37

by Matthew Wilcox

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

On Tue, Jun 04, 2024 at 11:58:24AM +0100, Usama Arif wrote:
> +++ b/mm/rmap.c
> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> dec_mm_counter(mm, mm_counter(folio));
> } else if (folio_test_anon(folio)) {
> - swp_entry_t entry = page_swap_entry(subpage);
> - pte_t swp_pte;
> - /*
> - * Store the swap location in the pte.
> - * See handle_pte_fault() ...
> - */
> - if (unlikely(folio_test_swapbacked(folio) !=
> - folio_test_swapcache(folio))) {
> + if (flags & TTU_ZERO_FOLIO) {
> + pte_clear(mm, address, pvmw.pte);
> + dec_mm_counter(mm, MM_ANONPAGES);
> + } else {

This is very hard to review. Is what you've done the same as:

if (flags & TTU_ZERO_FOLIO) {
pte_clear(mm, address, pvmw.pte);
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}

? I genuinely can't tell.


2024-06-04 12:32:39

by David Hildenbrand

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

On 04.06.24 12:58, Usama Arif wrote:
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, the pte can be cleared for those
> addresses at unmap time while shrinking folio list. When this
> causes a page fault, do_pte_missing will take care of this page.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
>
> [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <[email protected]>
> ---
> include/linux/rmap.h | 1 +
> mm/rmap.c | 163 ++++++++++++++++++++++---------------------
> mm/vmscan.c | 89 ++++++++++++++++-------
> 3 files changed, 150 insertions(+), 103 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bb53e5920b88..b36db1e886e4 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -100,6 +100,7 @@ enum ttu_flags {
> * do a final flush if necessary */
> TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
> * caller holds it */
> + TTU_ZERO_FOLIO = 0x100,/* zero folio */
> };
>
> #ifdef CONFIG_MMU
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 52357d79917c..d98f70876327 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> dec_mm_counter(mm, mm_counter(folio));
> } else if (folio_test_anon(folio)) {
> - swp_entry_t entry = page_swap_entry(subpage);
> - pte_t swp_pte;
> - /*
> - * Store the swap location in the pte.
> - * See handle_pte_fault() ...
> - */
> - if (unlikely(folio_test_swapbacked(folio) !=
> - folio_test_swapcache(folio))) {
> + if (flags & TTU_ZERO_FOLIO) {
> + pte_clear(mm, address, pvmw.pte);
> + dec_mm_counter(mm, MM_ANONPAGES);

Is there an easy way to reduce the code churn and highlight the added code?

Like

} else if (folio_test_anon(folio) && (flags & TTU_ZERO_FOLIO)) {

} else if (folio_test_anon(folio)) {



Also, two concerns that I want to spell out:

(a) what stops the page from getting modified in the meantime? The CPU
can write it until the TLB was flushed.

(b) do you properly handle if the page is pinned (or just got pinned)
and we must not discard it?

--
Cheers,

David / dhildenb


2024-06-04 12:44:07

by David Hildenbrand

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

On 04.06.24 14:30, David Hildenbrand wrote:
> On 04.06.24 12:58, Usama Arif wrote:
>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>> Rather than reading/writing these pages to flash resulting
>> in increased I/O and flash wear, the pte can be cleared for those
>> addresses at unmap time while shrinking folio list. When this
>> causes a page fault, do_pte_missing will take care of this page.
>> With this patch, NVMe writes in Meta server fleet decreased
>> by almost 10% with conventional swap setup (zswap disabled).
>>
>> [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>
>> Signed-off-by: Usama Arif <[email protected]>
>> ---
>> include/linux/rmap.h | 1 +
>> mm/rmap.c | 163 ++++++++++++++++++++++---------------------
>> mm/vmscan.c | 89 ++++++++++++++++-------
>> 3 files changed, 150 insertions(+), 103 deletions(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index bb53e5920b88..b36db1e886e4 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -100,6 +100,7 @@ enum ttu_flags {
>> * do a final flush if necessary */
>> TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
>> * caller holds it */
>> + TTU_ZERO_FOLIO = 0x100,/* zero folio */
>> };
>>
>> #ifdef CONFIG_MMU
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 52357d79917c..d98f70876327 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> */
>> dec_mm_counter(mm, mm_counter(folio));
>> } else if (folio_test_anon(folio)) {
>> - swp_entry_t entry = page_swap_entry(subpage);
>> - pte_t swp_pte;
>> - /*
>> - * Store the swap location in the pte.
>> - * See handle_pte_fault() ...
>> - */
>> - if (unlikely(folio_test_swapbacked(folio) !=
>> - folio_test_swapcache(folio))) {
>> + if (flags & TTU_ZERO_FOLIO) {
>> + pte_clear(mm, address, pvmw.pte);
>> + dec_mm_counter(mm, MM_ANONPAGES);
>
> Is there an easy way to reduce the code churn and highlight the added code?
>
> Like
>
> } else if (folio_test_anon(folio) && (flags & TTU_ZERO_FOLIO)) {
>
> } else if (folio_test_anon(folio)) {
>
>
>
> Also, two concerns that I want to spell out:
>
> (a) what stops the page from getting modified in the meantime? The CPU
> can write it until the TLB was flushed.
>
> (b) do you properly handle if the page is pinned (or just got pinned)
> and we must not discard it?

Oh, and I forgot: are you handling userfaultfd as expected? IIRC there
are some really nasty side-effects with userfaultfd even when
userfaultfd is currently not registered for a VMA [1].

[1]
https://lore.kernel.org/linux-mm/[email protected]/

What should work is replacing all-zero anonymous pages by the shared
zeropage iff the anonymous page is not pinned and we synchronize against
GUP fast. Well, and we handle possible concurrent writes accordingly.

KSM does essentially that when told to de-duplicate the shared zeropage,
and I was thinking a while ago if we would want a zeropage-only KSM
version that doesn't need stable trees and all that, but only
deduplicates zero-filled pages into the shared zeropage in a safe way.
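
For reference, KSM's zeropage de-duplication is opt-in. A rough sketch of
how it can be exercised (CONFIG_KSM required, ksmd running; the helper
below is just illustrative):

/*
 * Sysfs knobs (as root):
 *   echo 1 > /sys/kernel/mm/ksm/use_zero_pages
 *   echo 1 > /sys/kernel/mm/ksm/run
 */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Mark an anonymous range as mergeable so that ksmd may replace its
 * zero-filled pages with the shared zeropage.
 */
static int make_mergeable(void *addr, size_t len)
{
	return madvise(addr, len, MADV_MERGEABLE);
}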

--
Cheers,

David / dhildenb


2024-06-04 13:11:19

by Usama Arif

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled


On 04/06/2024 13:18, Matthew Wilcox wrote:
> On Tue, Jun 04, 2024 at 11:58:24AM +0100, Usama Arif wrote:
>> +++ b/mm/rmap.c
>> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> */
>> dec_mm_counter(mm, mm_counter(folio));
>> } else if (folio_test_anon(folio)) {
>> - swp_entry_t entry = page_swap_entry(subpage);
>> - pte_t swp_pte;
>> - /*
>> - * Store the swap location in the pte.
>> - * See handle_pte_fault() ...
>> - */
>> - if (unlikely(folio_test_swapbacked(folio) !=
>> - folio_test_swapcache(folio))) {
>> + if (flags & TTU_ZERO_FOLIO) {
>> + pte_clear(mm, address, pvmw.pte);
>> + dec_mm_counter(mm, MM_ANONPAGES);
>> + } else {
> This is very hard to review. Is what you've done the same as:
>
> if (flags & TTU_ZERO_FOLIO) {
> pte_clear(mm, address, pvmw.pte);
> dec_mm_counter(mm, MM_ANONPAGES);
> goto discard;
> }
>
> ? I genuinely can't tell.
>
Yes, that's what I am doing. I will switch to the above in the next revision. Thanks!

2024-06-05 08:55:37

by Shakeel Butt

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

On Tue, Jun 04, 2024 at 11:58:24AM GMT, Usama Arif wrote:
[...]
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> + unsigned long *data;
> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> + bool ret = false;
> +
> + data = kmap_local_folio(folio, i * PAGE_SIZE);
> +
> + if (data[last_pos])
> + goto out;
> +

Use memchr_inv() instead of the following.

> + for (pos = 0; pos < last_pos; pos++) {
> + if (data[pos])
> + goto out;
> + }
> + ret = true;
> +out:
> + kunmap_local(data);
> + return ret;
> +}
> +
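
i.e. something roughly like this (untested sketch):

static bool is_folio_page_zero_filled(struct folio *folio, int i)
{
	void *data = kmap_local_folio(folio, i * PAGE_SIZE);
	/* memchr_inv() returns NULL if the whole page is zero. */
	bool ret = !memchr_inv(data, 0, PAGE_SIZE);

	kunmap_local(data);
	return ret;
}
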
[...]
> +
> /*
> * shrink_folio_list() returns the number of reclaimed pages
> */
> @@ -1053,6 +1085,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> enum folio_references references = FOLIOREF_RECLAIM;
> bool dirty, writeback;
> unsigned int nr_pages;
> + bool folio_zero_filled = false;
>
> cond_resched();
>
> @@ -1270,6 +1303,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> nr_pages = 1;
> }
>
> + folio_zero_filled = is_folio_zero_filled(folio);

You need to check for zeroes after the unmap below, otherwise you may
lose data. So you need to do two rmap walks. Most probably the first one
would be the standard one (inserting swap entries in the ptes), but the
second one would be different, where the swap entries would be replaced by
the zeropage. Also, at the end you need to make sure to release all the
swap resources associated with the given page/folio.

> /*
> * The folio is mapped into the page tables of one or more
> * processes. Try to unmap it here.
> @@ -1295,6 +1329,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
> flags |= TTU_SYNC;
>
> + if (folio_zero_filled)
> + flags |= TTU_ZERO_FOLIO;
> +
> try_to_unmap(folio, flags);
> if (folio_mapped(folio)) {
> stat->nr_unmap_fail += nr_pages;

2024-06-07 10:24:31

by Usama Arif

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled


On 04/06/2024 13:43, David Hildenbrand wrote:
> On 04.06.24 14:30, David Hildenbrand wrote:
>> On 04.06.24 12:58, Usama Arif wrote:
>>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>>> Rather than reading/writing these pages to flash resulting
>>> in increased I/O and flash wear, the pte can be cleared for those
>>> addresses at unmap time while shrinking folio list. When this
>>> causes a page fault, do_pte_missing will take care of this page.
>>> With this patch, NVMe writes in Meta server fleet decreased
>>> by almost 10% with conventional swap setup (zswap disabled).
>>>
>>> [1]
>>> https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>>
>>> Signed-off-by: Usama Arif <[email protected]>
>>> ---
>>>    include/linux/rmap.h |   1 +
>>>    mm/rmap.c            | 163
>>> ++++++++++++++++++++++---------------------
>>>    mm/vmscan.c          |  89 ++++++++++++++++-------
>>>    3 files changed, 150 insertions(+), 103 deletions(-)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index bb53e5920b88..b36db1e886e4 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -100,6 +100,7 @@ enum ttu_flags {
>>>                         * do a final flush if necessary */
>>>        TTU_RMAP_LOCKED        = 0x80,    /* do not grab rmap lock:
>>>                         * caller holds it */
>>> +    TTU_ZERO_FOLIO        = 0x100,/* zero folio */
>>>    };
>>>       #ifdef CONFIG_MMU
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 52357d79917c..d98f70876327 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio
>>> *folio, struct vm_area_struct *vma,
>>>                 */
>>>                dec_mm_counter(mm, mm_counter(folio));
>>>            } else if (folio_test_anon(folio)) {
>>> -            swp_entry_t entry = page_swap_entry(subpage);
>>> -            pte_t swp_pte;
>>> -            /*
>>> -             * Store the swap location in the pte.
>>> -             * See handle_pte_fault() ...
>>> -             */
>>> -            if (unlikely(folio_test_swapbacked(folio) !=
>>> -                    folio_test_swapcache(folio))) {
>>> +            if (flags & TTU_ZERO_FOLIO) {
>>> +                pte_clear(mm, address, pvmw.pte);
>>> +                dec_mm_counter(mm, MM_ANONPAGES);
>>
>> Is there an easy way to reduce the code churn and highlight the added
>> code?
>>
>> Like
>>
>> } else if (folio_test_anon(folio) && (flags & TTU_ZERO_FOLIO)) {
>>
>> } else if (folio_test_anon(folio)) {
>>
>>
>>
>> Also, two concerns that I want to spell out:
>>
>> (a) what stops the page from getting modified in the meantime? The CPU
>>       can write it until the TLB was flushed.
>>
Thanks for pointing this out, David and Shakeel. This is a big issue in
this v2, and as Shakeel pointed out in [1], we need to do a second rmap
walk. Looking at how KSM deals with this in try_to_merge_one_page, which
calls write_protect_page for each vma (i.e. basically an rmap walk), this
would be much more CPU-expensive and complicated compared to v1 [2],
where the swap subsystem can handle all the complexities. I will go back
to my v1 solution for the next revision, as it is much simpler and the
memory usage is very low (0.003%), as pointed out by Johannes [3]; that
overhead would likely go away with the memory savings of not having a
zswap_entry for zero-filled pages, and the solution is a lot simpler
than what a valid v2 approach would look like.


[1]
https://lore.kernel.org/all/nes73bwc5p6yhwt5tw3upxcqrn5kenn6lvqb6exrf4yppmz6jx@ywhuevpkxlvh/

[2]
https://lore.kernel.org/all/[email protected]/

[3] https://lore.kernel.org/all/[email protected]/

>> (b) do you properly handle if the page is pinned (or just got pinned)
>>       and we must not discard it?
>
> Oh, and I forgot: are you handling userfaultfd as expected? IIRC there
> are some really nasty side-effects with userfaultfd even when
> userfaultfd is currently not registered for a VMA [1].
>
> [1]
> https://lore.kernel.org/linux-mm/[email protected]/
>
> What should work is replacing all-zero anonymous pages by the shared
> zeropage iff the anonymous page is not pinned and we synchronize
> against GUP fast. Well, and we handle possible concurrent writes
> accordingly.
>
> KSM does essentially that when told to de-duplicate the shared
> zeropage, and I was thinking a while ago if we would want a
> zeropage-only KSM version that doesn't need stable trees and all that,
> but only deduplicates zero-filled pages into the shared zeropage in a
> safe way.
>
Thanks for the pointer to KSM code.



2024-06-07 10:40:51

by Usama Arif

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled


On 05/06/2024 09:55, Shakeel Butt wrote:
> On Tue, Jun 04, 2024 at 11:58:24AM GMT, Usama Arif wrote:
> [...]
>>
>> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
>> +{
>> + unsigned long *data;
>> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
>> + bool ret = false;
>> +
>> + data = kmap_local_folio(folio, i * PAGE_SIZE);
>> +
>> + if (data[last_pos])
>> + goto out;
>> +
> Use memchr_inv() instead of the following.

I had done some benchmarking before sending v1, and this version is 35%
faster than using memchr_inv(). That is likely because it does word-sized
(unsigned long) comparisons, while memchr_inv() does byte comparisons
using check_bytes8 [1]. I will stick with the current version for my next
revision. I have added the kernel module I used for benchmarking below:

[308797.975269] Time taken for orig: 2850 ms
[308801.911439] Time taken for memchr_inv: 3936 ms

[1] https://elixir.bootlin.com/linux/v6.9.3/source/lib/string.c#L800


#include <linux/time.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/string.h>

#define ITERATIONS 10000000
static int is_page_zero_filled(void *ptr, unsigned long *value)
{
    unsigned long *page;
    unsigned long val;
    unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;

    page = (unsigned long *)ptr;
    val = page[0];

    if (page[last_pos] != 0)
        return 0;

    for (pos = 1; pos < last_pos; pos++) {
        if (page[pos] != 0)
            return 0;
    }

    *value = val;

    return 1;
}

static int is_page_zero_filled_memchr_inv(void *ptr, unsigned long *value)
{
    unsigned long *page;
    unsigned long val;
    unsigned long *ret;
    page = (unsigned long *)ptr;

    val = page[0];
    *value = val;

    ret = memchr_inv(ptr, 0, PAGE_SIZE);

    return ret == NULL ? 1: 0;
}

static int __init zsmalloc_test_init(void)
{
    unsigned long *src;
    unsigned long value;
    ktime_t start_time, end_time;
    volatile int res = 0;
    unsigned long milliseconds;

    src = kmalloc(PAGE_SIZE, GFP_KERNEL);
    if (!src)
        return -ENOMEM;

    for (unsigned int pos = 0; pos <= PAGE_SIZE / sizeof(*src) - 1; pos++) {
        src[pos] = 0x0;
    }

    start_time = ktime_get();
    for (int i = 0; i < ITERATIONS; i++)
        res = is_page_zero_filled(src, &value);
    end_time = ktime_get();
    milliseconds = ktime_ms_delta(end_time, start_time);
    // printk(KERN_INFO "Result: %d, Value: %lu\n", res, value);
    printk(KERN_INFO "Time taken for orig: %lu ms\n", milliseconds);

    start_time = ktime_get();
    for (int i = 0; i < ITERATIONS; i++)
        res = is_page_zero_filled_memchr_inv(src, &value);
    end_time = ktime_get();
    milliseconds = ktime_ms_delta(end_time, start_time);
    // printk(KERN_INFO "Result: %d, Value: %lu\n", res, value);
    printk(KERN_INFO "Time taken for memchr_inv: %lu ms\n", milliseconds);

    kfree(src);
    // Return an error so the module is not left loaded and the test can be re-run
    return -1;
}

module_init(zsmalloc_test_init);
MODULE_LICENSE("GPL");


2024-06-07 11:16:30

by David Hildenbrand

Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

On 07.06.24 12:24, Usama Arif wrote:
>
> On 04/06/2024 13:43, David Hildenbrand wrote:
>> On 04.06.24 14:30, David Hildenbrand wrote:
>>> On 04.06.24 12:58, Usama Arif wrote:
>>>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>>>> Rather than reading/writing these pages to flash resulting
>>>> in increased I/O and flash wear, the pte can be cleared for those
>>>> addresses at unmap time while shrinking folio list. When this
>>>> causes a page fault, do_pte_missing will take care of this page.
>>>> With this patch, NVMe writes in Meta server fleet decreased
>>>> by almost 10% with conventional swap setup (zswap disabled).
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>>>
>>>> Signed-off-by: Usama Arif <[email protected]>
>>>> ---
>>>>    include/linux/rmap.h |   1 +
>>>>    mm/rmap.c            | 163
>>>> ++++++++++++++++++++++---------------------
>>>>    mm/vmscan.c          |  89 ++++++++++++++++-------
>>>>    3 files changed, 150 insertions(+), 103 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index bb53e5920b88..b36db1e886e4 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -100,6 +100,7 @@ enum ttu_flags {
>>>>                         * do a final flush if necessary */
>>>>        TTU_RMAP_LOCKED        = 0x80,    /* do not grab rmap lock:
>>>>                         * caller holds it */
>>>> +    TTU_ZERO_FOLIO        = 0x100,/* zero folio */
>>>>    };
>>>>       #ifdef CONFIG_MMU
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 52357d79917c..d98f70876327 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio
>>>> *folio, struct vm_area_struct *vma,
>>>>                 */
>>>>                dec_mm_counter(mm, mm_counter(folio));
>>>>            } else if (folio_test_anon(folio)) {
>>>> -            swp_entry_t entry = page_swap_entry(subpage);
>>>> -            pte_t swp_pte;
>>>> -            /*
>>>> -             * Store the swap location in the pte.
>>>> -             * See handle_pte_fault() ...
>>>> -             */
>>>> -            if (unlikely(folio_test_swapbacked(folio) !=
>>>> -                    folio_test_swapcache(folio))) {
>>>> +            if (flags & TTU_ZERO_FOLIO) {
>>>> +                pte_clear(mm, address, pvmw.pte);
>>>> +                dec_mm_counter(mm, MM_ANONPAGES);
>>>
>>> Is there an easy way to reduce the code churn and highlight the added
>>> code?
>>>
>>> Like
>>>
>>> } else if (folio_test_anon(folio) && (flags & TTU_ZERO_FOLIO)) {
>>>
>>> } else if (folio_test_anon(folio)) {
>>>
>>>
>>>
>>> Also, two concerns that I want to spell out:
>>>
>>> (a) what stops the page from getting modified in the meantime? The CPU
>>>       can write it until the TLB was flushed.
>>>
> Thanks for pointing this out David and Shakeel. This is a big issue in
> this v2, and as Shakeel pointed out in [1] we need to do a second rmap
> walk. Looking at how ksm deals with this in try_to_merge_one_page which
> calls write_protect_page for each vma (i.e. basically an rmap walk),
> this would be much more CPU expensive and complicated compared to v1
> [2], where the swap subsystem can handle all complexities. I will go
> back to my v1 solution for the next revision as its much more simpler
> and the memory usage is very low (0.003%) as pointed out by Johannes [3]
> which would likely go away with the memory savings of not having a
> zswap_entry for zero filled pages, and the solution being a lot simpler
> than what a valid v2 approach would look like.

Agreed.

--
Cheers,

David / dhildenb