2024-06-10 12:18:40

by Usama Arif

[permalink] [raw]
Subject: [PATCH v3 0/2] mm: store zero pages to be swapped out in a bitmap

Going back to the v1 implementation of the patchseries. The main reason
is that a correct version of the v2 implementation requires another rmap
walk in shrink_folio_list to change the ptes from swap entries to zero pages
(i.e. more CPU used) [1], is more complex to implement compared to v1,
and is harder to verify for correctness compared to v1, where everything
is handled by swap.

---
As shown in the patchseries that introduced the zswap same-filled
optimization [2], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet.
By using VM counters in swap_writepage (not included in this
patchseries) it was found that less than 1% of the same-filled
pages to be swapped out are non-zero pages.

For conventional swap setup (without zswap), rather than reading/writing
these pages to flash resulting in increased I/O and flash wear, a bitmap
can be used to mark these pages as zero at write time, and the pages can
be filled at read time if the bit corresponding to the page is set.

When using zswap with swap, this also means that a zswap_entry does not
need to be allocated for zero-filled pages, resulting in memory savings
which would offset the memory used for the bitmap.
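
As a rough illustration of that offset (assuming 4 KiB pages; these numbers
are not a measurement from this series): the bitmap costs 1 bit per swap
slot, i.e. 32 KiB per 1 GiB of swap (262144 slots), while each zero-filled
page that no longer needs a zswap_entry saves several tens of bytes of
metadata, so on the order of a few hundred zero-filled pages already offset
the bitmap cost for that 1 GiB.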

A similar attempt was made earlier in [3] where zswap would only track
zero-filled pages instead of same-filled.
This patchseries adds the zero-filled page optimization to swap
(hence it can be used even if zswap is disabled) and removes the
same-filled code from zswap (as only 1% of the same-filled pages are
non-zero), simplifying the code.

This patchseries is based on mm-unstable.


[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[3] https://lore.kernel.org/lkml/[email protected]/

---
v2->v3:
- Going back to the v1 version of the implementation (David and Shakeel)
- convert non-atomic bitmap_set/clear to atomic set/clear_bit (Johannes)
- use clear_highpage instead of folio_page_zero_fill (Yosry)

v1 -> v2:
- instead of using a bitmap in swap, clear pte for zero pages and let
do_pte_missing handle this page at page fault. (Yosry and Matthew)
- Check end of page first when checking if folio is zero filled as
it could lead to better performance. (Yosry)

Usama Arif (2):
mm: store zero pages to be swapped out in a bitmap
mm: remove code to handle same filled pages

include/linux/swap.h | 1 +
mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
mm/swapfile.c | 21 +++++++++-
mm/zswap.c | 86 ++++-------------------------------------
4 files changed, 119 insertions(+), 81 deletions(-)

--
2.43.0



2024-06-10 12:18:50

by Usama Arif

[permalink] [raw]
Subject: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, a bitmap can be used to mark these
pages as zero at write time, and the pages can be filled at
read time if the bit corresponding to the page is set.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).

[1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/

Signed-off-by: Usama Arif <[email protected]>
---
include/linux/swap.h | 1 +
mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
mm/swapfile.c | 21 +++++++++-
3 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a11c75e897ec..e88563978441 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -299,6 +299,7 @@ struct swap_info_struct {
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
+ unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct swap_cluster_list free_clusters; /* free clusters list */
unsigned int lowest_bit; /* index of first free in swap_map */
diff --git a/mm/page_io.c b/mm/page_io.c
index a360857cf75d..2cac1e11fb85 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
goto out;
}

+static bool is_folio_page_zero_filled(struct folio *folio, int i)
+{
+ unsigned long *data;
+ unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
+ bool ret = false;
+
+ data = kmap_local_folio(folio, i * PAGE_SIZE);
+ if (data[last_pos])
+ goto out;
+ for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
+ if (data[pos])
+ goto out;
+ }
+ ret = true;
+out:
+ kunmap_local(data);
+ return ret;
+}
+
+static bool is_folio_zero_filled(struct folio *folio)
+{
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ if (!is_folio_page_zero_filled(folio, i))
+ return false;
+ }
+ return true;
+}
+
+static void folio_zero_fill(struct folio *folio)
+{
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++)
+ clear_highpage(folio_page(folio, i));
+}
+
+static void swap_zeromap_folio_set(struct folio *folio)
+{
+ struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ swp_entry_t entry;
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ entry = page_swap_entry(folio_page(folio, i));
+ set_bit(swp_offset(entry), sis->zeromap);
+ }
+}
+
+static void swap_zeromap_folio_clear(struct folio *folio)
+{
+ struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ swp_entry_t entry;
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ entry = page_swap_entry(folio_page(folio, i));
+ clear_bit(swp_offset(entry), sis->zeromap);
+ }
+}
+
+static bool swap_zeromap_folio_test(struct folio *folio)
+{
+ struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ swp_entry_t entry;
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ entry = page_swap_entry(folio_page(folio, i));
+ if (!test_bit(swp_offset(entry), sis->zeromap))
+ return false;
+ }
+ return true;
+}
+
/*
* We may have stale swap cache pages in memory: notice
* them here and get rid of the unnecessary final write.
@@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
folio_unlock(folio);
return ret;
}
+
+ if (is_folio_zero_filled(folio)) {
+ swap_zeromap_folio_set(folio);
+ folio_start_writeback(folio);
+ folio_unlock(folio);
+ folio_end_writeback(folio);
+ return 0;
+ }
+ swap_zeromap_folio_clear(folio);
if (zswap_store(folio)) {
folio_start_writeback(folio);
folio_unlock(folio);
@@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
psi_memstall_enter(&pflags);
}
delayacct_swapin_start();
-
- if (zswap_load(folio)) {
+ if (swap_zeromap_folio_test(folio)) {
+ folio_zero_fill(folio);
+ folio_mark_uptodate(folio);
+ folio_unlock(folio);
+ } else if (zswap_load(folio)) {
folio_mark_uptodate(folio);
folio_unlock(folio);
} else if (data_race(sis->flags & SWP_FS_OPS)) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f1e559e216bd..90451174fe34 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
unsigned int idx)
{
+ unsigned int i;
+
/*
* If scan_swap_map_slots() can't find a free cluster, it will check
* si->swap_map directly. To make sure the discarding cluster isn't
@@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
*/
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
+ /*
+ * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
+ * call on other slots, hence use atomic clear_bit for zeromap instead of the
+ * non-atomic bitmap_clear.
+ */
+ for (i = 0; i < SWAPFILE_CLUSTER; i++)
+ clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);

cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);

@@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
static void swap_do_scheduled_discard(struct swap_info_struct *si)
{
struct swap_cluster_info *info, *ci;
- unsigned int idx;
+ unsigned int idx, i;

info = si->cluster_info;

@@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
__free_cluster(si, idx);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
0, SWAPFILE_CLUSTER);
+ for (i = 0; i < SWAPFILE_CLUSTER; i++)
+ clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
unlock_cluster(ci);
}
}
@@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
count = p->swap_map[offset];
VM_BUG_ON(count != SWAP_HAS_CACHE);
p->swap_map[offset] = 0;
+ clear_bit(offset, p->zeromap);
dec_cluster_info_page(p, p->cluster_info, offset);
unlock_cluster(ci);

@@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
free_percpu(p->cluster_next_cpu);
p->cluster_next_cpu = NULL;
vfree(swap_map);
+ bitmap_free(p->zeromap);
kvfree(cluster_info);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
@@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}

+ p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
+ if (!p->zeromap) {
+ error = -ENOMEM;
+ goto bad_swap_unlock_inode;
+ }
+
if (p->bdev && bdev_stable_writes(p->bdev))
p->flags |= SWP_STABLE_WRITES;

--
2.43.0


2024-06-10 12:31:36

by Usama Arif

[permalink] [raw]
Subject: [PATCH v3 2/2] mm: remove code to handle same filled pages

With an earlier commit to handle zero-filled pages in swap directly,
and with only 1% of the same-filled pages being non-zero, zswap no
longer needs to handle same-filled pages and can just work on compressed
pages.

Signed-off-by: Usama Arif <[email protected]>
---
mm/zswap.c | 86 +++++-------------------------------------------------
1 file changed, 8 insertions(+), 78 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9b..ca8df9c99abf 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -44,8 +44,6 @@
**********************************/
/* The number of compressed pages currently stored in zswap */
atomic_t zswap_stored_pages = ATOMIC_INIT(0);
-/* The number of same-value filled pages currently stored in zswap */
-static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);

/*
* The statistics below are not protected from concurrent access for
@@ -182,11 +180,9 @@ static struct shrinker *zswap_shrinker;
*
* swpentry - associated swap entry, the offset indexes into the red-black tree
* length - the length in bytes of the compressed page data. Needed during
- * decompression. For a same value filled page length is 0, and both
- * pool and lru are invalid and must be ignored.
+ * decompression.
* pool - the zswap_pool the entry's data is in
* handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
* objcg - the obj_cgroup that the compressed memory is charged to
* lru - handle to the pool's lru used to evict pages.
*/
@@ -194,10 +190,7 @@ struct zswap_entry {
swp_entry_t swpentry;
unsigned int length;
struct zswap_pool *pool;
- union {
- unsigned long handle;
- unsigned long value;
- };
+ unsigned long handle;
struct obj_cgroup *objcg;
struct list_head lru;
};
@@ -814,13 +807,9 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
*/
static void zswap_entry_free(struct zswap_entry *entry)
{
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zswap_lru_del(&zswap_list_lru, entry);
- zpool_free(zswap_find_zpool(entry), entry->handle);
- zswap_pool_put(entry->pool);
- }
+ zswap_lru_del(&zswap_list_lru, entry);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
+ zswap_pool_put(entry->pool);
if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
obj_cgroup_put(entry->objcg);
@@ -1262,11 +1251,6 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
* This ensures that the better zswap compresses memory, the fewer
* pages we will evict to swap (as it will otherwise incur IO for
* relatively small memory saving).
- *
- * The memory saving factor calculated here takes same-filled pages into
- * account, but those are not freeable since they almost occupy no
- * space. Hence, we may scale nr_freeable down a little bit more than we
- * should if we have a lot of same-filled pages.
*/
return mult_frac(nr_freeable, nr_backing, nr_stored);
}
@@ -1370,42 +1354,6 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > thr);
}

-/*********************************
-* same-filled functions
-**********************************/
-static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
-{
- unsigned long *data;
- unsigned long val;
- unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
- bool ret = false;
-
- data = kmap_local_folio(folio, 0);
- val = data[0];
-
- if (val != data[last_pos])
- goto out;
-
- for (pos = 1; pos < last_pos; pos++) {
- if (val != data[pos])
- goto out;
- }
-
- *value = val;
- ret = true;
-out:
- kunmap_local(data);
- return ret;
-}
-
-static void zswap_fill_folio(struct folio *folio, unsigned long value)
-{
- unsigned long *data = kmap_local_folio(folio, 0);
-
- memset_l(data, value, PAGE_SIZE / sizeof(unsigned long));
- kunmap_local(data);
-}
-
/*********************************
* main API
**********************************/
@@ -1417,7 +1365,6 @@ bool zswap_store(struct folio *folio)
struct zswap_entry *entry, *old;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
- unsigned long value;

VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1450,13 +1397,6 @@ bool zswap_store(struct folio *folio)
goto reject;
}

- if (zswap_is_folio_same_filled(folio, &value)) {
- entry->length = 0;
- entry->value = value;
- atomic_inc(&zswap_same_filled_pages);
- goto store_entry;
- }
-
/* if entry is successfully added, it keeps the reference */
entry->pool = zswap_pool_current_get();
if (!entry->pool)
@@ -1474,7 +1414,6 @@ bool zswap_store(struct folio *folio)
if (!zswap_compress(folio, entry))
goto put_pool;

-store_entry:
entry->swpentry = swp;
entry->objcg = objcg;

@@ -1522,13 +1461,9 @@ bool zswap_store(struct folio *folio)
return true;

store_failed:
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zpool_free(zswap_find_zpool(entry), entry->handle);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
- zswap_pool_put(entry->pool);
- }
+ zswap_pool_put(entry->pool);
freepage:
zswap_entry_cache_free(entry);
reject:
@@ -1577,10 +1512,7 @@ bool zswap_load(struct folio *folio)
if (!entry)
return false;

- if (entry->length)
- zswap_decompress(entry, folio);
- else
- zswap_fill_folio(folio, entry->value);
+ zswap_decompress(entry, folio);

count_vm_event(ZSWPIN);
if (entry->objcg)
@@ -1682,8 +1614,6 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, NULL, &total_size_fops);
debugfs_create_atomic_t("stored_pages", 0444,
zswap_debugfs_root, &zswap_stored_pages);
- debugfs_create_atomic_t("same_filled_pages", 0444,
- zswap_debugfs_root, &zswap_same_filled_pages);

return 0;
}
--
2.43.0


2024-06-10 13:09:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Mon, Jun 10, 2024 at 01:15:59PM +0100, Usama Arif wrote:
> + if (is_folio_zero_filled(folio)) {
> + swap_zeromap_folio_set(folio);
> + folio_start_writeback(folio);
> + folio_unlock(folio);
> + folio_end_writeback(folio);

What's the point? As far as I can see, the only thing this is going to
do is spend a lot of time messing with various counters only to end up
with incrementing NR_WRITTEN, which is wrong because you didn't actually
write it.


2024-06-10 14:02:46

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 10/06/2024 14:07, Matthew Wilcox wrote:
> On Mon, Jun 10, 2024 at 01:15:59PM +0100, Usama Arif wrote:
>> + if (is_folio_zero_filled(folio)) {
>> + swap_zeromap_folio_set(folio);
>> + folio_start_writeback(folio);
>> + folio_unlock(folio);
>> + folio_end_writeback(folio);
> What's the point? As far as I can see, the only thing this is going to
> do is spend a lot of time messing with various counters only to end up
> with incrementing NR_WRITTEN, which is wrong because you didn't actually
> write it.
>

I am guessing what you are suggesting is just do this?

    if (is_folio_zero_filled(folio)) {
        swap_zeromap_folio_set(folio);
        folio_unlock(folio);
        return 0;
    }

This is what I did initially while developing this, but when I started
looking into why zswap_store does folio_start_writeback, folio_unlock,
folio_end_writeback, I found:

https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L336

"If no I/O is submitted, the filesystem must run end_page_writeback()
against the page before returning from writepage."

and

https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L349

"Note, failure to run either redirty_page_for_writepage() or the
combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty
in the
radix tree.  This incoherency can lead to all sorts of hard-to-debug
problems
in the filesystem like having dirty inodes at umount and losing written
data.
"

If we have zswap enabled, the zero-filled pages (in fact any page that is
compressed) will be saved in a zswap_entry and NR_WRITTEN will be wrongly
incremented. So the behaviour of NR_WRITTEN when encountering zero pages
with zswap enabled does not change with this patch (even if it is wrong).

This patch just extracts the optimization out from zswap [1] to swap, so
that it always runs.

In order to fix NR_WRITTEN accounting for zswap, this patch series and
any other case where no I/O is submitted but end_page_writeback is
called before returning from writepage, maybe we could add an argument to
__folio_end_writeback like below? There are a lot of calls to
folio_end_writeback, and the behaviour of zero pages with regards to
NR_WRITTEN doesn't change with or without this patchseries with zswap
enabled, so maybe we could keep this independent of this series? I would
be happy to submit this as a separate patch if it makes sense.


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81b2e4128d26..415037f511c2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -3042,7 +3042,7 @@ static void wb_inode_writeback_end(struct
bdi_writeback *wb)
        spin_unlock_irqrestore(&wb->work_lock, flags);
 }

-bool __folio_end_writeback(struct folio *folio)
+bool __folio_end_writeback(struct folio *folio, bool nr_written_increment)
 {
        long nr = folio_nr_pages(folio);
        struct address_space *mapping = folio_mapping(folio);
@@ -3078,7 +3078,8 @@ bool __folio_end_writeback(struct folio *folio)

        lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr);
        zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
-       node_stat_mod_folio(folio, NR_WRITTEN, nr);
+       if (nr_written_increment)
+               node_stat_mod_folio(folio, NR_WRITTEN, nr);
        folio_memcg_unlock(folio);

        return ret;


[1]
https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/






2024-06-10 14:15:10

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 10/06/2024 15:06, Matthew Wilcox wrote:
> On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
>> I am guessing what you are suggesting is just do this?
>>
>>     if (is_folio_zero_filled(folio)) {
>>         swap_zeromap_folio_set(folio);
>>         folio_unlock(folio);
>>         return 0;
>>     }
> Right.

Thanks! Will change to this in the next revision.

>> If we have zswap enabled, the zero filled pages (infact any page that is
>> compressed), will be saved in zswap_entry and NR_WRITTEN will be wrongly
>> incremented. So the behaviour for NR_WRITTEN does not change in this patch
>> when encountering zero pages with zswap enabled (even if its wrong).
> We should fiz zswap too.
>
Will send the below diff as a separate patch for zswap:

diff --git a/mm/page_io.c b/mm/page_io.c
index 2cac1e11fb85..82796b9f08c7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -281,9 +281,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 }
 swap_zeromap_folio_clear(folio);
 if (zswap_store(folio)) {
- folio_start_writeback(folio);
 folio_unlock(folio);
- folio_end_writeback(folio);
 return 0;
 }


2024-06-10 14:36:05

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 10/06/2024 15:14, Usama Arif wrote:
>
> On 10/06/2024 15:06, Matthew Wilcox wrote:
>> On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
>>> I am guessing what you are suggesting is just do this?
>>>
>>>      if (is_folio_zero_filled(folio)) {
>>>          swap_zeromap_folio_set(folio);
>>>          folio_unlock(folio);
>>>          return 0;
>>>      }
>> Right.
>
> Thanks! Will change to this in the next revision.
>
>>> If we have zswap enabled, the zero filled pages (infact any page
>>> that is
>>> compressed), will be saved in zswap_entry and NR_WRITTEN will be
>>> wrongly
>>> incremented. So the behaviour for NR_WRITTEN does not change in this
>>> patch
>>> when encountering zero pages with zswap enabled (even if its wrong).
>> We should fiz zswap too.
>>
> Will send the below diff as a separate patch for zswap:
>
> diff --git a/mm/page_io.c b/mm/page_io.c index
> 2cac1e11fb85..82796b9f08c7 100644 --- a/mm/page_io.c +++
> b/mm/page_io.c @@ -281,9 +281,7 @@ int swap_writepage(struct page
> *page, struct writeback_control *wbc) }
> swap_zeromap_folio_clear(folio); if (zswap_store(folio)) { -
> folio_start_writeback(folio); folio_unlock(folio); -
> folio_end_writeback(folio); return 0; }
>

My mail client seems to have messed up the diff, but I have sent the patch
here (https://lore.kernel.org/all/[email protected]/).
Tested with the test_zswap kselftest.


2024-06-10 15:19:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
> I am guessing what you are suggesting is just do this?
>
>     if (is_folio_zero_filled(folio)) {
>         swap_zeromap_folio_set(folio);
>         folio_unlock(folio);
>         return 0;
>     }

Right.

> This is what I did initially while developing this, but when I started
> looking into why zswap_store does folio_start_writeback, folio_unlock,
> folio_end_writeback I found:
>
> https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L336
>
> "If no I/O is submitted, the filesystem must run end_page_writeback()
> against the page before returning from writepage."

But that's advice to filesystem authors. File pages don't come this
way; we only put anonymous pages into swap (except tmpfs).

> If we have zswap enabled, the zero filled pages (infact any page that is
> compressed), will be saved in zswap_entry and NR_WRITTEN will be wrongly
> incremented. So the behaviour for NR_WRITTEN does not change in this patch
> when encountering zero pages with zswap enabled (even if its wrong).

We should fix zswap too.

> In order to fix NR_WRITTEN accounting for zswap, this patch series and any
> other cases where no I/O is submitted but end_page_writeback is called
> before returning to writepage, maybe we could add an argument to
> __folio_end_writeback like below? There are a lot of calls to
> folio_end_writeback and the behaviour of zeropage with regards to NR_WRITTEN
> doesnt change with or without this patchseries with zswap enabled, so maybe
> we could keep this independent of this series? I would be happy to submit
> this as separate patch if it makes sense.

It makes no sense at all.

2024-06-10 17:58:19

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
>
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
>
> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <[email protected]>
> ---
> include/linux/swap.h | 1 +
> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
> mm/swapfile.c | 21 +++++++++-
> 3 files changed, 111 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
> signed char type; /* strange name for an index */
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> struct swap_cluster_list free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..2cac1e11fb85 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> goto out;
> }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> + unsigned long *data;
> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> + bool ret = false;
> +
> + data = kmap_local_folio(folio, i * PAGE_SIZE);
> + if (data[last_pos])
> + goto out;
> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
> + if (data[pos])
> + goto out;
> + }
> + ret = true;
> +out:
> + kunmap_local(data);
> + return ret;
> +}
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + if (!is_folio_page_zero_filled(folio, i))
> + return false;
> + }
> + return true;
> +}

Is there any benefit to iterating on the folio in pages (i.e. have
both is_folio_zero_filled() and is_folio_page_zero_filled())? Why
don't we just kmap the entire folio and check it all at once?

> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++)
> + clear_highpage(folio_page(folio, i));
> +}
> +
> +static void swap_zeromap_folio_set(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + set_bit(swp_offset(entry), sis->zeromap);
> + }
> +}
> +
> +static void swap_zeromap_folio_clear(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + clear_bit(swp_offset(entry), sis->zeromap);
> + }
> +}
> +
> +static bool swap_zeromap_folio_test(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + if (!test_bit(swp_offset(entry), sis->zeromap))
> + return false;
> + }
> + return true;
> +}
> +
> /*
> * We may have stale swap cache pages in memory: notice
> * them here and get rid of the unnecessary final write.
> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> folio_unlock(folio);
> return ret;
> }
> +
> + if (is_folio_zero_filled(folio)) {
> + swap_zeromap_folio_set(folio);
> + folio_start_writeback(folio);
> + folio_unlock(folio);
> + folio_end_writeback(folio);
> + return 0;
> + }
> + swap_zeromap_folio_clear(folio);
> if (zswap_store(folio)) {
> folio_start_writeback(folio);
> folio_unlock(folio);
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
> psi_memstall_enter(&pflags);
> }
> delayacct_swapin_start();
> -
> - if (zswap_load(folio)) {
> + if (swap_zeromap_folio_test(folio)) {
> + folio_zero_fill(folio);
> + folio_mark_uptodate(folio);
> + folio_unlock(folio);

We don't currently support swapping in large folios, but it is a work
in progress, and this will break once we have it.
swap_zeromap_folio_test() will return false even if parts of the folio
are in fact zero-filled. Then, we will go read those from disk swap,
essentially corrupting data.

The same problem can happen for zswap, if a large folio being swapped
is only partially in zswap. In both cases, it's really easy to miss
the problem if we're testing with zswap disabled, with incompressible
data, or with non-zero data. Silent data corruption is not very
debuggable.

I proposed adding handling for this case in zswap here:
https://lore.kernel.org/lkml/[email protected]/

The discussions there hadn't settled yet, but depending on how it pans
out I suspect we will want something similar for the zeromap case as
well.

> + } else if (zswap_load(folio)) {
> folio_mark_uptodate(folio);
> folio_unlock(folio);
> } else if (data_race(sis->flags & SWP_FS_OPS)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f1e559e216bd..90451174fe34 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> unsigned int idx)
> {
> + unsigned int i;
> +
> /*
> * If scan_swap_map_slots() can't find a free cluster, it will check
> * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> */
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> + /*
> + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
> + * call on other slots, hence use atomic clear_bit for zeromap instead of the
> + * non-atomic bitmap_clear.
> + */
> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>
> cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>
> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> static void swap_do_scheduled_discard(struct swap_info_struct *si)
> {
> struct swap_cluster_info *info, *ci;
> - unsigned int idx;
> + unsigned int idx, i;
>
> info = si->cluster_info;
>
> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> __free_cluster(si, idx);
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> 0, SWAPFILE_CLUSTER);
> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
> unlock_cluster(ci);
> }
> }
> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> count = p->swap_map[offset];
> VM_BUG_ON(count != SWAP_HAS_CACHE);
> p->swap_map[offset] = 0;
> + clear_bit(offset, p->zeromap);
> dec_cluster_info_page(p, p->cluster_info, offset);
> unlock_cluster(ci);
>
> @@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> free_percpu(p->cluster_next_cpu);
> p->cluster_next_cpu = NULL;
> vfree(swap_map);
> + bitmap_free(p->zeromap);
> kvfree(cluster_info);
> /* Destroy swap account information */
> swap_cgroup_swapoff(p->type);
> @@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> goto bad_swap_unlock_inode;
> }
>
> + p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> + if (!p->zeromap) {
> + error = -ENOMEM;
> + goto bad_swap_unlock_inode;
> + }
> +
> if (p->bdev && bdev_stable_writes(p->bdev))
> p->flags |= SWP_STABLE_WRITES;
>
> --
> 2.43.0
>

2024-06-10 18:40:17

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 10/06/2024 18:57, Yosry Ahmed wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>> Rather than reading/writing these pages to flash resulting
>> in increased I/O and flash wear, a bitmap can be used to mark these
>> pages as zero at write time, and the pages can be filled at
>> read time if the bit corresponding to the page is set.
>> With this patch, NVMe writes in Meta server fleet decreased
>> by almost 10% with conventional swap setup (zswap disabled).
>>
>> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>
>> Signed-off-by: Usama Arif <[email protected]>
>> ---
>> include/linux/swap.h | 1 +
>> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
>> mm/swapfile.c | 21 +++++++++-
>> 3 files changed, 111 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a11c75e897ec..e88563978441 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -299,6 +299,7 @@ struct swap_info_struct {
>> signed char type; /* strange name for an index */
>> unsigned int max; /* extent of the swap_map */
>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
>> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> struct swap_cluster_list free_clusters; /* free clusters list */
>> unsigned int lowest_bit; /* index of first free in swap_map */
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index a360857cf75d..2cac1e11fb85 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>> goto out;
>> }
>>
>> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
>> +{
>> + unsigned long *data;
>> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
>> + bool ret = false;
>> +
>> + data = kmap_local_folio(folio, i * PAGE_SIZE);
>> + if (data[last_pos])
>> + goto out;
>> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
>> + if (data[pos])
>> + goto out;
>> + }
>> + ret = true;
>> +out:
>> + kunmap_local(data);
>> + return ret;
>> +}
>> +
>> +static bool is_folio_zero_filled(struct folio *folio)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + if (!is_folio_page_zero_filled(folio, i))
>> + return false;
>> + }
>> + return true;
>> +}
> Is there any benefit to iterating on the folio in pages (i.e. have
> both is_folio_zero_filled() and is_folio_page_zero_filled())? Why
> don't we just kmap the entire folio and check it all at once?

Is there an API to kmap an entire folio?

I could just do data = page_address(&folio->page) in the above and iterate
through folio_nr_pages(folio) * PAGE_SIZE, and do it all in one
function, but this just looks much nicer and more readable?
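
Just to illustrate the single-pass alternative I am describing, a rough
sketch (it assumes the folio is not in highmem, since page_address() cannot
be used for highmem pages, which is part of why the per-page
kmap_local_folio() version is used; the helper name is made up):

static bool is_folio_zero_filled_flat(struct folio *folio)
{
        /* sketch only: assumes the whole folio is directly addressable */
        unsigned long *data = page_address(&folio->page);
        unsigned long i, words = folio_nr_pages(folio) * PAGE_SIZE / sizeof(*data);

        for (i = 0; i < words; i++) {
                if (data[i])
                        return false;
        }
        return true;
}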

In other places in the kernel, the folio iteration is through pages as well:

https://elixir.bootlin.com/linux/latest/source/arch/csky/abiv2/cacheflush.c#L29

https://elixir.bootlin.com/linux/latest/source/arch/mips/mm/cache.c#L162

and others as well.


>> +
>> +static void folio_zero_fill(struct folio *folio)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++)
>> + clear_highpage(folio_page(folio, i));
>> +}
>> +
>> +static void swap_zeromap_folio_set(struct folio *folio)
>> +{
>> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> + swp_entry_t entry;
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + entry = page_swap_entry(folio_page(folio, i));
>> + set_bit(swp_offset(entry), sis->zeromap);
>> + }
>> +}
>> +
>> +static void swap_zeromap_folio_clear(struct folio *folio)
>> +{
>> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> + swp_entry_t entry;
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + entry = page_swap_entry(folio_page(folio, i));
>> + clear_bit(swp_offset(entry), sis->zeromap);
>> + }
>> +}
>> +
>> +static bool swap_zeromap_folio_test(struct folio *folio)
>> +{
>> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> + swp_entry_t entry;
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + entry = page_swap_entry(folio_page(folio, i));
>> + if (!test_bit(swp_offset(entry), sis->zeromap))
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> /*
>> * We may have stale swap cache pages in memory: notice
>> * them here and get rid of the unnecessary final write.
>> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>> folio_unlock(folio);
>> return ret;
>> }
>> +
>> + if (is_folio_zero_filled(folio)) {
>> + swap_zeromap_folio_set(folio);
>> + folio_start_writeback(folio);
>> + folio_unlock(folio);
>> + folio_end_writeback(folio);
>> + return 0;
>> + }
>> + swap_zeromap_folio_clear(folio);
>> if (zswap_store(folio)) {
>> folio_start_writeback(folio);
>> folio_unlock(folio);
>> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>> psi_memstall_enter(&pflags);
>> }
>> delayacct_swapin_start();
>> -
>> - if (zswap_load(folio)) {
>> + if (swap_zeromap_folio_test(folio)) {
>> + folio_zero_fill(folio);
>> + folio_mark_uptodate(folio);
>> + folio_unlock(folio);
> We don't currently support swapping in large folios, but it is a work
> in progress, and this will break once we have it.
> swap_zeromap_folio_test() will return false even if parts of the folio
> are in fact zero-filled. Then, we will go read those from disk swap,
> essentially corrupting data.

So yes, with this patch I tested swapping out a large zero folio, but the
swap-in was page by page. My assumption was that swap_read_folio
(when support is added) would only be passed a large folio that was earlier
swapped out as a large folio. So if a zero-filled large folio was
swapped out, the zeromap will be set for all the pages in that folio,
and at large folio swap-in (when it is supported), it will see that all
the bits in the zeromap for that folio are set, and will just
folio_zero_fill.

If even a single page in the large folio has non-zero data then the zeromap
will not store it and it will go to either zswap or disk, and at read time,
as all the bits in the zeromap are not set, it will still go to either zswap
or disk, so I think this works?

Is my assumption wrong that large folios can be swapped in only if
they were swapped out as large? I think this code works in that case.

>
> The same problem can happen for zswap, if a large folio being swapped
> is only partially in zswap. In both cases, it's really easy to miss
> the problem if we're testing with zswap disabled, with incompressible
> data, or with non-zero data. Silent data corruption is not very
> debuggable.
>
> I proposed adding handling for this case in zswap here:
> https://lore.kernel.org/lkml/[email protected]/
>
> The discussions there hadn't settled yet, but depending on how it pans
> out I suspect we will want something similar for the zeromap case as
> well.
>
>> + } else if (zswap_load(folio)) {
>> folio_mark_uptodate(folio);
>> folio_unlock(folio);
>> } else if (data_race(sis->flags & SWP_FS_OPS)) {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f1e559e216bd..90451174fe34 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> unsigned int idx)
>> {
>> + unsigned int i;
>> +
>> /*
>> * If scan_swap_map_slots() can't find a free cluster, it will check
>> * si->swap_map directly. To make sure the discarding cluster isn't
>> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> */
>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> + /*
>> + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
>> + * call on other slots, hence use atomic clear_bit for zeromap instead of the
>> + * non-atomic bitmap_clear.
>> + */
>> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>
>> cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>>
>> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>> static void swap_do_scheduled_discard(struct swap_info_struct *si)
>> {
>> struct swap_cluster_info *info, *ci;
>> - unsigned int idx;
>> + unsigned int idx, i;
>>
>> info = si->cluster_info;
>>
>> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>> __free_cluster(si, idx);
>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> 0, SWAPFILE_CLUSTER);
>> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>> unlock_cluster(ci);
>> }
>> }
>> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>> count = p->swap_map[offset];
>> VM_BUG_ON(count != SWAP_HAS_CACHE);
>> p->swap_map[offset] = 0;
>> + clear_bit(offset, p->zeromap);
>> dec_cluster_info_page(p, p->cluster_info, offset);
>> unlock_cluster(ci);
>>
>> @@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> free_percpu(p->cluster_next_cpu);
>> p->cluster_next_cpu = NULL;
>> vfree(swap_map);
>> + bitmap_free(p->zeromap);
>> kvfree(cluster_info);
>> /* Destroy swap account information */
>> swap_cgroup_swapoff(p->type);
>> @@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> goto bad_swap_unlock_inode;
>> }
>>
>> + p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>> + if (!p->zeromap) {
>> + error = -ENOMEM;
>> + goto bad_swap_unlock_inode;
>> + }
>> +
>> if (p->bdev && bdev_stable_writes(p->bdev))
>> p->flags |= SWP_STABLE_WRITES;
>>
>> --
>> 2.43.0
>>

2024-06-10 18:48:13

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Mon, Jun 10, 2024 at 11:36 AM Usama Arif <[email protected]> wrote:
>
>
> On 10/06/2024 18:57, Yosry Ahmed wrote:
> > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
> >> Approximately 10-20% of pages to be swapped out are zero pages [1].
> >> Rather than reading/writing these pages to flash resulting
> >> in increased I/O and flash wear, a bitmap can be used to mark these
> >> pages as zero at write time, and the pages can be filled at
> >> read time if the bit corresponding to the page is set.
> >> With this patch, NVMe writes in Meta server fleet decreased
> >> by almost 10% with conventional swap setup (zswap disabled).
> >>
> >> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> >>
> >> Signed-off-by: Usama Arif <[email protected]>
> >> ---
> >> include/linux/swap.h | 1 +
> >> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
> >> mm/swapfile.c | 21 +++++++++-
> >> 3 files changed, 111 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> index a11c75e897ec..e88563978441 100644
> >> --- a/include/linux/swap.h
> >> +++ b/include/linux/swap.h
> >> @@ -299,6 +299,7 @@ struct swap_info_struct {
> >> signed char type; /* strange name for an index */
> >> unsigned int max; /* extent of the swap_map */
> >> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> >> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> >> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >> struct swap_cluster_list free_clusters; /* free clusters list */
> >> unsigned int lowest_bit; /* index of first free in swap_map */
> >> diff --git a/mm/page_io.c b/mm/page_io.c
> >> index a360857cf75d..2cac1e11fb85 100644
> >> --- a/mm/page_io.c
> >> +++ b/mm/page_io.c
> >> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> >> goto out;
> >> }
> >>
> >> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> >> +{
> >> + unsigned long *data;
> >> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> >> + bool ret = false;
> >> +
> >> + data = kmap_local_folio(folio, i * PAGE_SIZE);
> >> + if (data[last_pos])
> >> + goto out;
> >> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
> >> + if (data[pos])
> >> + goto out;
> >> + }
> >> + ret = true;
> >> +out:
> >> + kunmap_local(data);
> >> + return ret;
> >> +}
> >> +
> >> +static bool is_folio_zero_filled(struct folio *folio)
> >> +{
> >> + unsigned int i;
> >> +
> >> + for (i = 0; i < folio_nr_pages(folio); i++) {
> >> + if (!is_folio_page_zero_filled(folio, i))
> >> + return false;
> >> + }
> >> + return true;
> >> +}
> > Is there any benefit to iterating on the folio in pages (i.e. have
> > both is_folio_zero_filled() and is_folio_page_zero_filled())? Why
> > don't we just kmap the entire folio and check it all at once?
>
> Is there an API to kmap an entire folio?

I thought kmap_local_folio() takes in a 'size' parameter for some
reason, my bad.

>
> I could just do data = page_address(&folio->page) in above and iterate
> through folio_nr_pages(folio) * PAGE_SIZE, and do it all in one
> function, but this just looks much nicer and much more readable?

Yeah if we need to map each page individually, the current code is better.

>
> In other places in the kernel, the folio iteration is through pages as well:
>
> https://elixir.bootlin.com/linux/latest/source/arch/csky/abiv2/cacheflush.c#L29
>
> https://elixir.bootlin.com/linux/latest/source/arch/mips/mm/cache.c#L162
>
> and others as well.
[..]
> >> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
> >> psi_memstall_enter(&pflags);
> >> }
> >> delayacct_swapin_start();
> >> -
> >> - if (zswap_load(folio)) {
> >> + if (swap_zeromap_folio_test(folio)) {
> >> + folio_zero_fill(folio);
> >> + folio_mark_uptodate(folio);
> >> + folio_unlock(folio);
> > We don't currently support swapping in large folios, but it is a work
> > in progress, and this will break once we have it.
> > swap_zeromap_folio_test() will return false even if parts of the folio
> > are in fact zero-filled. Then, we will go read those from disk swap,
> > essentially corrupting data.
>
> So yes, with this patch I tested swap out of large zero folio, but when
> swapping in it was page by page. My assumption was that swap_read_folio
> (when support is added) would only pass a large folio that was earlier
> swapped out as a large folio. So if a zero filled large folio was
> swapped out, the zeromap will be set for all the pages in that folio,
> and at large folio swap in (when it is supported), it will see that all
> the bits in the zeromap for that folio are set, and will just
> folio_zero_fill.
>
> If even a single page in large folio has non-zero data then zeromap will
> not store it and it will go to either zswap or disk, and at read time as
> all the bits in zeromap are not set, it will still goto either zswap or
> disk, so I think this works?
>
> Is my assumption wrong that only large folios can be swapped in only if
> they were swapped out as large? I think this code works in that case.

I think the assumption is incorrect. I think we would just check if
contiguous PTEs have contiguous swap entries and swapin the folio as a
large folio in this case. It is likely that the swap entries are
contiguous because it was swapped out as a large folio, but I don't
think it's guaranteed.

For example, here is a patch that implements large swapin support for
the synchronous swapin case, and I think it just checks that the PTEs
have contiguous swap entries:
https://lore.kernel.org/linux-mm/[email protected]/

This makes a lot of sense because otherwise you'd have to keep track
of how the folios were composed at the time of swapout, to be able to
swap the same folios back in.
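
Just to sketch the kind of check I mean (this is not Barry's actual code,
the helper name is made up, and a real version would also need to handle
migration/non-swap entries, device boundaries, etc.):

static bool swap_entries_contiguous(pte_t *ptep, int nr)
{
        swp_entry_t first, entry;
        pte_t pte;
        int i;

        pte = ptep_get(ptep);
        if (!is_swap_pte(pte))
                return false;
        first = pte_to_swp_entry(pte);

        for (i = 1; i < nr; i++) {
                pte = ptep_get(ptep + i);
                if (!is_swap_pte(pte))
                        return false;
                entry = pte_to_swp_entry(pte);
                /* same swap device and consecutive offsets? */
                if (swp_type(entry) != swp_type(first) ||
                    swp_offset(entry) != swp_offset(first) + i)
                        return false;
        }
        return true;
}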

2024-06-11 11:54:54

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

@@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>>> psi_memstall_enter(&pflags);
>>>> }
>>>> delayacct_swapin_start();
>>>> -
>>>> - if (zswap_load(folio)) {
>>>> + if (swap_zeromap_folio_test(folio)) {
>>>> + folio_zero_fill(folio);
>>>> + folio_mark_uptodate(folio);
>>>> + folio_unlock(folio);
>>> We don't currently support swapping in large folios, but it is a work
>>> in progress, and this will break once we have it.
>>> swap_zeromap_folio_test() will return false even if parts of the folio
>>> are in fact zero-filled. Then, we will go read those from disk swap,
>>> essentially corrupting data.
>> So yes, with this patch I tested swap out of large zero folio, but when
>> swapping in it was page by page. My assumption was that swap_read_folio
>> (when support is added) would only pass a large folio that was earlier
>> swapped out as a large folio. So if a zero filled large folio was
>> swapped out, the zeromap will be set for all the pages in that folio,
>> and at large folio swap in (when it is supported), it will see that all
>> the bits in the zeromap for that folio are set, and will just
>> folio_zero_fill.
>>
>> If even a single page in large folio has non-zero data then zeromap will
>> not store it and it will go to either zswap or disk, and at read time as
>> all the bits in zeromap are not set, it will still goto either zswap or
>> disk, so I think this works?
>>
>> Is my assumption wrong that only large folios can be swapped in only if
>> they were swapped out as large? I think this code works in that case.
> I think the assumption is incorrect. I think we would just check if
> contiguous PTEs have contiguous swap entries and swapin the folio as a
> large folio in this case. It is likely that the swap entries are
> contiguous because it was swapped out as a large folio, but I don't
> think it's guaranteed.

Yes, makes sense. Thanks for explaining this.

>
> For example, here is a patch that implements large swapin support for
> the synchronous swapin case, and I think it just checks that the PTEs
> have contiguous swap entries:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> This makes a lot of sense because otherwise you'd have to keep track
> of how the folios were composed at the time of swapout, to be able to
> swap the same folios back in.

I think the solution to large folio swap-in for this optimization and
zswap is not in swap_read_folio in this patch series or any call further
down the stack, as it's too late by the time you reach here, but in
Barry's patch. This needs to happen much earlier, when deciding the size
of the folio at folio creation in alloc_swap_folio, after Barry checks

    if (is_zswap_enabled())
        goto fallback;

once the order is decided, we would need to check whether the bits in the
zeromap starting from swap_offset(entry) and ending at
swap_offset(entry) + 2^order are set. If no bits are set, or all bits
are set, then we allocate a large folio. Otherwise, we go to fallback.
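
Purely as a sketch of that check (hypothetical helper, not part of this
series or Barry's patch; locking/pinning of the swap entries is assumed to
be handled by the caller):

/*
 * Return true if the 1 << order swap slots starting at entry are either
 * all marked zero-filled or all unmarked in the zeromap, i.e. a large
 * folio of this order would not mix zeromap and non-zeromap slots.
 */
static bool swap_zeromap_entries_same(struct swap_info_struct *si,
                                      swp_entry_t entry, int order)
{
        unsigned long offset = swp_offset(entry);
        unsigned long nr = 1UL << order;
        unsigned long i, nr_set = 0;

        for (i = 0; i < nr; i++) {
                if (test_bit(offset + i, si->zeromap))
                        nr_set++;
        }
        return nr_set == 0 || nr_set == nr;
}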

I think it's better to handle this in Barry's patch. I feel this series is
close to its final state, i.e. the only diff I have for the next
revision is below, to remove start/end_writeback for the zero-filled case. I
will comment on Barry's patch once I send out the next revision of this.


diff --git a/mm/page_io.c b/mm/page_io.c
index 2cac1e11fb85..08a3772ddcf7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -274,9 +274,7 @@ int swap_writepage(struct page *page, struct
writeback_control *wbc)

        if (is_folio_zero_filled(folio)) {
                swap_zeromap_folio_set(folio);
-               folio_start_writeback(folio);
                folio_unlock(folio);
-               folio_end_writeback(folio);
                return 0;
        }
        swap_zeromap_folio_clear(folio);



2024-06-11 15:44:45

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Tue, Jun 11, 2024 at 4:49 AM Usama Arif <[email protected]> wrote:
>
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool
> synchronous,
> >>>> psi_memstall_enter(&pflags);
> >>>> }
> >>>> delayacct_swapin_start();
> >>>> -
> >>>> - if (zswap_load(folio)) {
> >>>> + if (swap_zeromap_folio_test(folio)) {
> >>>> + folio_zero_fill(folio);
> >>>> + folio_mark_uptodate(folio);
> >>>> + folio_unlock(folio);
> >>> We don't currently support swapping in large folios, but it is a work
> >>> in progress, and this will break once we have it.
> >>> swap_zeromap_folio_test() will return false even if parts of the folio
> >>> are in fact zero-filled. Then, we will go read those from disk swap,
> >>> essentially corrupting data.
> >> So yes, with this patch I tested swap out of large zero folio, but when
> >> swapping in it was page by page. My assumption was that swap_read_folio
> >> (when support is added) would only pass a large folio that was earlier
> >> swapped out as a large folio. So if a zero filled large folio was
> >> swapped out, the zeromap will be set for all the pages in that folio,
> >> and at large folio swap in (when it is supported), it will see that all
> >> the bits in the zeromap for that folio are set, and will just
> >> folio_zero_fill.
> >>
> >> If even a single page in large folio has non-zero data then zeromap will
> >> not store it and it will go to either zswap or disk, and at read time as
> >> all the bits in zeromap are not set, it will still goto either zswap or
> >> disk, so I think this works?
> >>
> >> Is my assumption wrong that only large folios can be swapped in only if
> >> they were swapped out as large? I think this code works in that case.
> > I think the assumption is incorrect. I think we would just check if
> > contiguous PTEs have contiguous swap entries and swapin the folio as a
> > large folio in this case. It is likely that the swap entries are
> > contiguous because it was swapped out as a large folio, but I don't
> > think it's guaranteed.
>
> Yes, makes sense. Thanks for explaining this.
>
> >
> > For example, here is a patch that implements large swapin support for
> > the synchronous swapin case, and I think it just checks that the PTEs
> > have contiguous swap entries:
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > This makes a lot of sense because otherwise you'd have to keep track
> > of how the folios were composed at the time of swapout, to be able to
> > swap the same folios back in.
>
> I think the solution to large folio swap-in for this optimization and
> zswap is not in swap_read_folio in this patch-series or any call further
> down the stack, as its too late by the time you reach here, but in
> Barrys' patch. This needs to happen much earlier when deciding the size
> of the folio at folio creation in alloc_swap_folio, after Barry checks
>
> if (is_zswap_enabled())
> goto fallback;
>
> once the order is decided, we would need to check the indexes in the
> zeromap array starting from swap_offset(entry) and ending at
> swap_offset(entry) + 2^order are set. If no bits are set, or all bits
> are set, then we allocate large folio. Otherwise, we goto fallback.
>
> I think its better to handle this in Barrys patch. I feel this series is
> close to its final state, i.e. the only diff I have for the next
> revision is below to remove start/end_writeback for zer_filled case. I
> will comment on Barrys patch once the I send out the next revision of this.

Sorry I did not make myself clearer. I did not mean that you should
handle the large folio swapin here. This needs to be handled at a
higher level because as you mentioned, a large folio may be partially
in the zeromap, zswap, swapcache, disk, etc.

What I meant is that we should probably have a debug check to make
sure this doesn't go unhandled. For zswap, I am trying to add a
warning and fail the swapin operation if a large folio slips through
to zswap. We can do something similar here if folks agree this is the
right way in the interim:
https://lore.kernel.org/lkml/[email protected]/.

Maybe I am too paranoid, but I think it's easy to mess up these things
when working on large folio swapin imo.

2024-06-11 16:53:32

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 11/06/2024 16:42, Yosry Ahmed wrote:
> On Tue, Jun 11, 2024 at 4:49 AM Usama Arif <[email protected]> wrote:
>> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool
>> synchronous,
>>>>>> psi_memstall_enter(&pflags);
>>>>>> }
>>>>>> delayacct_swapin_start();
>>>>>> -
>>>>>> - if (zswap_load(folio)) {
>>>>>> + if (swap_zeromap_folio_test(folio)) {
>>>>>> + folio_zero_fill(folio);
>>>>>> + folio_mark_uptodate(folio);
>>>>>> + folio_unlock(folio);
>>>>> We don't currently support swapping in large folios, but it is a work
>>>>> in progress, and this will break once we have it.
>>>>> swap_zeromap_folio_test() will return false even if parts of the folio
>>>>> are in fact zero-filled. Then, we will go read those from disk swap,
>>>>> essentially corrupting data.
>>>> So yes, with this patch I tested swap out of large zero folio, but when
>>>> swapping in it was page by page. My assumption was that swap_read_folio
>>>> (when support is added) would only pass a large folio that was earlier
>>>> swapped out as a large folio. So if a zero filled large folio was
>>>> swapped out, the zeromap will be set for all the pages in that folio,
>>>> and at large folio swap in (when it is supported), it will see that all
>>>> the bits in the zeromap for that folio are set, and will just
>>>> folio_zero_fill.
>>>>
>>>> If even a single page in large folio has non-zero data then zeromap will
>>>> not store it and it will go to either zswap or disk, and at read time as
>>>> all the bits in zeromap are not set, it will still goto either zswap or
>>>> disk, so I think this works?
>>>>
>>>> Is my assumption wrong that only large folios can be swapped in only if
>>>> they were swapped out as large? I think this code works in that case.
>>> I think the assumption is incorrect. I think we would just check if
>>> contiguous PTEs have contiguous swap entries and swapin the folio as a
>>> large folio in this case. It is likely that the swap entries are
>>> contiguous because it was swapped out as a large folio, but I don't
>>> think it's guaranteed.
>> Yes, makes sense. Thanks for explaining this.
>>
>>> For example, here is a patch that implements large swapin support for
>>> the synchronous swapin case, and I think it just checks that the PTEs
>>> have contiguous swap entries:
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> This makes a lot of sense because otherwise you'd have to keep track
>>> of how the folios were composed at the time of swapout, to be able to
>>> swap the same folios back in.
>> I think the solution to large folio swap-in for this optimization and
>> zswap is not in swap_read_folio in this patch-series or any call further
>> down the stack, as its too late by the time you reach here, but in
>> Barrys' patch. This needs to happen much earlier when deciding the size
>> of the folio at folio creation in alloc_swap_folio, after Barry checks
>>
>> if (is_zswap_enabled())
>> goto fallback;
>>
>> once the order is decided, we would need to check the indexes in the
>> zeromap array starting from swap_offset(entry) and ending at
>> swap_offset(entry) + 2^order are set. If no bits are set, or all bits
>> are set, then we allocate large folio. Otherwise, we goto fallback.
>>
>> I think its better to handle this in Barrys patch. I feel this series is
>> close to its final state, i.e. the only diff I have for the next
>> revision is below to remove start/end_writeback for zer_filled case. I
>> will comment on Barrys patch once the I send out the next revision of this.
> Sorry I did not make myself clearer. I did not mean that you should
> handle the large folio swapin here. This needs to be handled at a
> higher level because as you mentioned, a large folio may be partially
> in the zeromap, zswap, swapcache, disk, etc.
>
> What I meant is that we should probably have a debug check to make
> sure this doesn't go unhandled. For zswap, I am trying to add a
> warning and fail the swapin operation if a large folio slips through
> to zswap. We can do something similar here if folks agree this is the
> right way in the interim:
> https://lore.kernel.org/lkml/[email protected]/.
>
> Maybe I am too paranoid, but I think it's easy to mess up these things
> when working on large folio swapin imo.

So there is a difference between zswap and this optimization. With this
optimization, if the zeromap bits are set for all pages of the folio, then we
should do large folio swapin. There still needs to be a change in Barry's
patch in alloc_swap_folio, but apart from that, does the below diff over
v3 make it better? I will send a v4 with this if it sounds good.


diff --git a/mm/page_io.c b/mm/page_io.c
index 6400be6e4291..bf01364748a9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio *folio)
        }
 }

-static bool swap_zeromap_folio_test(struct folio *folio)
+/*
+ * Return the index of the first subpage which is not zero-filled
+ * according to swap_info_struct->zeromap.
+ * If all pages are zero-filled according to zeromap, it will return
+ * folio_nr_pages(folio).
+ */
+static long swap_zeromap_folio_test(struct folio *folio)
 {
        struct swap_info_struct *sis = swp_swap_info(folio->swap);
        swp_entry_t entry;
-       unsigned int i;
+       long i;

        for (i = 0; i < folio_nr_pages(folio); i++) {
                entry = page_swap_entry(folio_page(folio, i));
                if (!test_bit(swp_offset(entry), sis->zeromap))
-                       return false;
+                       return i;
        }
-       return true;
+       return i;
 }

 /*
@@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,
 {
        struct swap_info_struct *sis = swp_swap_info(folio->swap);
        bool workingset = folio_test_workingset(folio);
+       long first_non_zero_page_idx;
        unsigned long pflags;
        bool in_thrashing;

@@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool synchronous,
                psi_memstall_enter(&pflags);
        }
        delayacct_swapin_start();
-       if (swap_zeromap_folio_test(folio)) {
+       first_non_zero_page_idx = swap_zeromap_folio_test(folio);
+       if (first_non_zero_page_idx == folio_nr_pages(folio)) {
                folio_zero_fill(folio);
                folio_mark_uptodate(folio);
                folio_unlock(folio);
+       } else if (first_non_zero_page_idx != 0) {
+               /*
+                * The case where only *some* of the subpages being swapped in
+                * were recorded in sis->zeromap, while the rest are in
+                * zswap/disk, is currently not handled. WARN in this case and
+                * return without marking the folio uptodate so that an IO
+                * error is emitted (e.g. do_swap_page() will sigbus).
+                */
+               WARN_ON_ONCE(1);
        } else if (zswap_load(folio)) {
                folio_mark_uptodate(folio);
                folio_unlock(folio);
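
(For illustration only; this is not part of the diff above.) A minimal
sketch of the alloc_swap_folio() check described earlier: only allocate a
large folio when the zeromap bits for the whole run of swap entries agree
(all set or all clear). The helper name and its call site are assumptions,
not code from this series or from Barry's patch.

static bool swap_zeromap_entries_same(swp_entry_t entry, int nr)
{
	struct swap_info_struct *sis = swp_swap_info(entry);
	unsigned long offset = swp_offset(entry);
	bool first = test_bit(offset, sis->zeromap);
	int i;

	/* all nr entries must agree with the first bit */
	for (i = 1; i < nr; i++) {
		if (test_bit(offset + i, sis->zeromap) != first)
			return false;
	}
	return true;
}

	/* hypothetical use in alloc_swap_folio(), once the order is decided */
	if (!swap_zeromap_entries_same(entry, 1 << order))
		goto fallback;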



2024-06-11 18:05:27

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

[..]
> >> I think its better to handle this in Barrys patch. I feel this series is
> >> close to its final state, i.e. the only diff I have for the next
> >> revision is below to remove start/end_writeback for zer_filled case. I
> >> will comment on Barrys patch once the I send out the next revision of this.
> > Sorry I did not make myself clearer. I did not mean that you should
> > handle the large folio swapin here. This needs to be handled at a
> > higher level because as you mentioned, a large folio may be partially
> > in the zeromap, zswap, swapcache, disk, etc.
> >
> > What I meant is that we should probably have a debug check to make
> > sure this doesn't go unhandled. For zswap, I am trying to add a
> > warning and fail the swapin operation if a large folio slips through
> > to zswap. We can do something similar here if folks agree this is the
> > right way in the interim:
> > https://lore.kernel.org/lkml/[email protected]/.
> >
> > Maybe I am too paranoid, but I think it's easy to mess up these things
> > when working on large folio swapin imo.
>
> So there is a difference between zswap and this optimization. In this
> optimization, if the zeromap is set for all the folio bits, then we
> should do large folio swapin. There still needs to be a change in Barrys
> patch in alloc_swap_folio, but apart from that does the below diff over
> v3 make it better? I will send a v4 with this if it sounds good.
>
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 6400be6e4291..bf01364748a9 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio
> *folio)
> }
> }
>
> -static bool swap_zeromap_folio_test(struct folio *folio)
> +/*
> + * Return the index of the first subpage which is not zero-filled
> + * according to swap_info_struct->zeromap.
> + * If all pages are zero-filled according to zeromap, it will return
> + * folio_nr_pages(folio).
> + */
> +static long swap_zeromap_folio_test(struct folio *folio)
> {
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> swp_entry_t entry;
> - unsigned int i;
> + long i;

Why long?

>
> for (i = 0; i < folio_nr_pages(folio); i++) {
> entry = page_swap_entry(folio_page(folio, i));
> if (!test_bit(swp_offset(entry), sis->zeromap))
> - return false;
> + return i;
> }
> - return true;
> + return i;
> }
>
> /*
> @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool
> synchronous,
> {
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> bool workingset = folio_test_workingset(folio);
> + long first_non_zero_page_idx;
> unsigned long pflags;
> bool in_thrashing;
>
> @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool
> synchronous,
> psi_memstall_enter(&pflags);
> }
> delayacct_swapin_start();
> - if (swap_zeromap_folio_test(folio)) {
> + first_non_zero_page_idx = swap_zeromap_folio_test(folio);
> + if (first_non_zero_page_idx == folio_nr_pages(folio)) {
> folio_zero_fill(folio);
> folio_mark_uptodate(folio);
> folio_unlock(folio);
> + } else if (first_non_zero_page_idx != 0) {
> + /*
> + * The case for when only *some* of subpages being
> swapped-in were recorded
> + * in sis->zeromap, while the rest are in zswap/disk is
> currently not handled.
> + * WARN in this case and return without marking the
> folio uptodate so that
> + * an IO error is emitted (e.g. do_swap_page() will sigbus).
> + */
> + WARN_ON_ONCE(1);
> } else if (zswap_load(folio)) {
> folio_mark_uptodate(folio);
> folio_unlock(folio);
>
>

This is too much noise for swap_read_folio(). How about adding
swap_read_folio_zeromap() that takes care of this and decides whether
or not to call folio_mark_uptodate()?

-static bool swap_zeromap_folio_test(struct folio *folio)
+/*
+ * Return the index of the first subpage which is not zero-filled according to
+ * swap_info_struct->zeromap. If all pages are zero-filled according to
+ * zeromap, it will return folio_nr_pages(folio).
+ */
+static unsigned int swap_zeromap_folio_test(struct folio *folio)
{
struct swap_info_struct *sis = swp_swap_info(folio->swap);
swp_entry_t entry;
@@ -243,9 +248,9 @@ static bool swap_zeromap_folio_test(struct folio *folio)
for (i = 0; i < folio_nr_pages(folio); i++) {
entry = page_swap_entry(folio_page(folio, i));
if (!test_bit(swp_offset(entry), sis->zeromap))
- return false;
+ return i;
}
- return true;
+ return i;
}

/*
@@ -511,6 +516,25 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
mempool_free(sio, sio_pool);
}

+static bool swap_read_folio_zeromap(struct folio *folio)
+{
+ unsigned int idx = swap_zeromap_folio_test(folio);
+
+ if (idx == 0)
+ return false;
+
+ /*
+ * Swapping in a large folio that is partially in the zeromap is not
+ * currently handled. Return true without marking the folio uptodate so
+ * that an IO error is emitted (e.g. do_swap_page() will sigbus).
+ */
+ if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
+ return true;
+
+ folio_zero_fill(folio);
+ folio_mark_uptodate(folio);
+ return true;
+}
+
static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
struct swap_info_struct *sis = swp_swap_info(folio->swap);
@@ -600,9 +624,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,
psi_memstall_enter(&pflags);
}
delayacct_swapin_start();
- if (swap_zeromap_folio_test(folio)) {
- folio_zero_fill(folio);
- folio_mark_uptodate(folio);
+ if (swap_read_folio_zeromap(folio)) {
folio_unlock(folio);
} else if (zswap_load(folio)) {
folio_mark_uptodate(folio);

2024-06-11 18:39:37

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
>
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
>
> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <[email protected]>
> ---
> include/linux/swap.h | 1 +
> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
> mm/swapfile.c | 21 +++++++++-
> 3 files changed, 111 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
> signed char type; /* strange name for an index */
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> struct swap_cluster_list free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..2cac1e11fb85 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> goto out;
> }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> + unsigned long *data;
> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> + bool ret = false;
> +
> + data = kmap_local_folio(folio, i * PAGE_SIZE);
> + if (data[last_pos])
> + goto out;
> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
> + if (data[pos])
> + goto out;
> + }
> + ret = true;
> +out:
> + kunmap_local(data);
> + return ret;
> +}
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + if (!is_folio_page_zero_filled(folio, i))
> + return false;
> + }
> + return true;
> +}
> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++)
> + clear_highpage(folio_page(folio, i));
> +}
> +
> +static void swap_zeromap_folio_set(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + set_bit(swp_offset(entry), sis->zeromap);
> + }
> +}
> +
> +static void swap_zeromap_folio_clear(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + clear_bit(swp_offset(entry), sis->zeromap);
> + }
> +}
> +
> +static bool swap_zeromap_folio_test(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + if (!test_bit(swp_offset(entry), sis->zeromap))
> + return false;
> + }
> + return true;
> +}
> +
> /*
> * We may have stale swap cache pages in memory: notice
> * them here and get rid of the unnecessary final write.
> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> folio_unlock(folio);
> return ret;
> }
> +
> + if (is_folio_zero_filled(folio)) {
> + swap_zeromap_folio_set(folio);
> + folio_start_writeback(folio);
> + folio_unlock(folio);
> + folio_end_writeback(folio);
> + return 0;
> + }
> + swap_zeromap_folio_clear(folio);
> if (zswap_store(folio)) {
> folio_start_writeback(folio);
> folio_unlock(folio);
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
> psi_memstall_enter(&pflags);
> }
> delayacct_swapin_start();
> -
> - if (zswap_load(folio)) {
> + if (swap_zeromap_folio_test(folio)) {
> + folio_zero_fill(folio);
> + folio_mark_uptodate(folio);
> + folio_unlock(folio);
> + } else if (zswap_load(folio)) {
> folio_mark_uptodate(folio);
> folio_unlock(folio);
> } else if (data_race(sis->flags & SWP_FS_OPS)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f1e559e216bd..90451174fe34 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> unsigned int idx)
> {
> + unsigned int i;
> +
> /*
> * If scan_swap_map_slots() can't find a free cluster, it will check
> * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> */
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> + /*
> + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
> + * call on other slots, hence use atomic clear_bit for zeromap instead of the
> + * non-atomic bitmap_clear.
> + */
> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>
> cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>
> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> static void swap_do_scheduled_discard(struct swap_info_struct *si)
> {
> struct swap_cluster_info *info, *ci;
> - unsigned int idx;
> + unsigned int idx, i;
>
> info = si->cluster_info;
>
> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> __free_cluster(si, idx);
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> 0, SWAPFILE_CLUSTER);
> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
> unlock_cluster(ci);
> }
> }
> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> count = p->swap_map[offset];
> VM_BUG_ON(count != SWAP_HAS_CACHE);
> p->swap_map[offset] = 0;
> + clear_bit(offset, p->zeromap);

Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
we can have a problem, where:

1. The swap entry has its zeromap bit set, and is freed to the swap
slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
reclaimed from the swap cache, and all the processes referring to it
are terminated, which decrements the swap count to 0 (swap_free() ->
__swap_entry_free() -> free_swap_slot())

2. The swap slot is then re-used in swap space allocation
(add_to_swap()) - its zeromap bit is never cleared.

3. swap_writepage() writes that non-zero page to swap

4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
the entry is set, so populates a zero page for it.

zswap in the past had to invalidate these leftover entries quite
carefully. Chengming then moved the invalidation point to
free_swap_slot(), massively simplifying the logic.

I wonder if we need to do the same here?

2024-06-11 18:44:49

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 11/06/2024 18:51, Yosry Ahmed wrote:
> [..]
>>>> I think its better to handle this in Barrys patch. I feel this series is
>>>> close to its final state, i.e. the only diff I have for the next
>>>> revision is below to remove start/end_writeback for zer_filled case. I
>>>> will comment on Barrys patch once the I send out the next revision of this.
>>> Sorry I did not make myself clearer. I did not mean that you should
>>> handle the large folio swapin here. This needs to be handled at a
>>> higher level because as you mentioned, a large folio may be partially
>>> in the zeromap, zswap, swapcache, disk, etc.
>>>
>>> What I meant is that we should probably have a debug check to make
>>> sure this doesn't go unhandled. For zswap, I am trying to add a
>>> warning and fail the swapin operation if a large folio slips through
>>> to zswap. We can do something similar here if folks agree this is the
>>> right way in the interim:
>>> https://lore.kernel.org/lkml/[email protected]/.
>>>
>>> Maybe I am too paranoid, but I think it's easy to mess up these things
>>> when working on large folio swapin imo.
>> So there is a difference between zswap and this optimization. In this
>> optimization, if the zeromap is set for all the folio bits, then we
>> should do large folio swapin. There still needs to be a change in Barrys
>> patch in alloc_swap_folio, but apart from that does the below diff over
>> v3 make it better? I will send a v4 with this if it sounds good.
>>
>>
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index 6400be6e4291..bf01364748a9 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio
>> *folio)
>> }
>> }
>>
>> -static bool swap_zeromap_folio_test(struct folio *folio)
>> +/*
>> + * Return the index of the first subpage which is not zero-filled
>> + * according to swap_info_struct->zeromap.
>> + * If all pages are zero-filled according to zeromap, it will return
>> + * folio_nr_pages(folio).
>> + */
>> +static long swap_zeromap_folio_test(struct folio *folio)
>> {
>> struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> swp_entry_t entry;
>> - unsigned int i;
>> + long i;
> Why long?


folio_nr_pages returns long, but I just checked that
folio->_folio_nr_pages is unsigned int, so that will probably just be
cast to long :). I will switch to unsigned int as it's not really
going to need long even with CONFIG_64BIT.

>> for (i = 0; i < folio_nr_pages(folio); i++) {
>> entry = page_swap_entry(folio_page(folio, i));
>> if (!test_bit(swp_offset(entry), sis->zeromap))
>> - return false;
>> + return i;
>> }
>> - return true;
>> + return i;
>> }
>>
>> /*
>> @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool
>> synchronous,
>> {
>> struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> bool workingset = folio_test_workingset(folio);
>> + long first_non_zero_page_idx;
>> unsigned long pflags;
>> bool in_thrashing;
>>
>> @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool
>> synchronous,
>> psi_memstall_enter(&pflags);
>> }
>> delayacct_swapin_start();
>> - if (swap_zeromap_folio_test(folio)) {
>> + first_non_zero_page_idx = swap_zeromap_folio_test(folio);
>> + if (first_non_zero_page_idx == folio_nr_pages(folio)) {
>> folio_zero_fill(folio);
>> folio_mark_uptodate(folio);
>> folio_unlock(folio);
>> + } else if (first_non_zero_page_idx != 0) {
>> + /*
>> + * The case for when only *some* of subpages being
>> swapped-in were recorded
>> + * in sis->zeromap, while the rest are in zswap/disk is
>> currently not handled.
>> + * WARN in this case and return without marking the
>> folio uptodate so that
>> + * an IO error is emitted (e.g. do_swap_page() will sigbus).
>> + */
>> + WARN_ON_ONCE(1);
>> } else if (zswap_load(folio)) {
>> folio_mark_uptodate(folio);
>> folio_unlock(folio);
>>
>>
> This is too much noise for swap_read_folio(). How about adding
> swap_read_folio_zeromap() that takes care of this and decides whether
> or not to call folio_mark_uptodate()?

Sounds good, will do as below. Thanks!

>
> -static bool swap_zeromap_folio_test(struct folio *folio)
> +/*
> + * Return the index of the first subpage which is not zero-filled according to
> + * swap_info_struct->zeromap. If all pages are zero-filled according to
> + * zeromap, it will return folio_nr_pages(folio).
> + */
> +static unsigned int swap_zeromap_folio_test(struct folio *folio)
> {
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> swp_entry_t entry;
> @@ -243,9 +248,9 @@ static bool swap_zeromap_folio_test(struct folio *folio)
> for (i = 0; i < folio_nr_pages(folio); i++) {
> entry = page_swap_entry(folio_page(folio, i));
> if (!test_bit(swp_offset(entry), sis->zeromap))
> - return false;
> + return i;
> }
> - return true;
> + return i;
> }
>
> /*
> @@ -511,6 +516,25 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
> mempool_free(sio, sio_pool);
> }
>
> +static bool swap_read_folio_zeromap(struct folio *folio)
> +{
> + unsigned int idx = swap_zeromap_folio_test(folio);
> +
> + if (idx == 0)
> + return false;
> +
> + /*
> + * Swapping in a large folio that is partially in the zeromap is not
> + * currently handled. Return true without marking the folio uptodate so
> + * that an IO error is emitted (e.g. do_swap_page() will sigbus).
> + */
> + if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
> + return true;
> +
> + folio_zero_fill(folio);
> + folio_mark_uptodate(folio);
> + return true;
> +}
> +
> static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> {
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> @@ -600,9 +624,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,
> psi_memstall_enter(&pflags);
> }
> delayacct_swapin_start();
> - if (swap_zeromap_folio_test(folio)) {
> - folio_zero_fill(folio);
> - folio_mark_uptodate(folio);
> + if (swap_read_folio_zeromap(folio)) {
> folio_unlock(folio);
> } else if (zswap_load(folio)) {
> folio_mark_uptodate(folio);

2024-06-11 18:47:46

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

[..]
> > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > count = p->swap_map[offset];
> > VM_BUG_ON(count != SWAP_HAS_CACHE);
> > p->swap_map[offset] = 0;
> > + clear_bit(offset, p->zeromap);
>
> Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
> we can have a problem, where:
>
> 1. The swap entry has its zeromap bit set, and is freed to the swap
> slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
> reclaimed from the swap cache, and all the processes referring to it
> are terminated, which decrements the swap count to 0 (swap_free() ->
> __swap_entry_free() -> free_swap_slots())
>
> 2. The swap slot is then re-used in swap space allocation
> (add_to_swap()) - its zeromap bit is never cleared.

I do not think this can happen before swap_entry_free() is called.
Note that when a swap entry is freed to the swap slot cache in
free_swap_slot(), it is added to cache->slots_ret, not cache->slots.
The former are swap entries cached to be later freed using
swap_entry_free().

>
> 3. swap_writepage() writes that non-zero page to swap
>
> 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
> the entry is set, so populates a zero page for it.
>
> zswap in the past has to carefully invalidate these leftover entries
> quite carefully. Chengming then move the invalidation point to
> free_swap_slot(), massively simplifying the logic.

I think the main benefit of moving the invalidation point was avoiding
leaving the compressed page in zswap until swap_entry_free() is
called, which will happen only when the swap slot caches are drained.

>
> I wonder if we need to do the same here?
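
For reference, a simplified sketch of the free_swap_slot() behavior
described above (abridged; locking corner cases and cache deactivation
checks are omitted, so this is not the verbatim mm/swap_slots.c source):
freed entries are stashed in the per-cpu cache's slots_ret[] array, which
is only drained via swapcache_free_entries() and hence swap_entry_free(),
while allocation is served from the separate slots[] array.

void free_swap_slot(swp_entry_t entry)
{
	struct swap_slots_cache *cache = raw_cpu_ptr(&swp_slots);

	if (use_swap_slot_cache && cache->slots_ret) {
		spin_lock_irq(&cache->free_lock);
		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
			/* drained via swapcache_free_entries() -> swap_entry_free() */
			swapcache_free_entries(cache->slots_ret, cache->n_ret);
			cache->n_ret = 0;
		}
		cache->slots_ret[cache->n_ret++] = entry;
		spin_unlock_irq(&cache->free_lock);
	} else {
		swapcache_free_entries(&entry, 1);
	}
}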

2024-06-11 18:50:38

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 11/06/2024 19:39, Nhat Pham wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>> Rather than reading/writing these pages to flash resulting
>> in increased I/O and flash wear, a bitmap can be used to mark these
>> pages as zero at write time, and the pages can be filled at
>> read time if the bit corresponding to the page is set.
>> With this patch, NVMe writes in Meta server fleet decreased
>> by almost 10% with conventional swap setup (zswap disabled).
>>
>> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>
>> Signed-off-by: Usama Arif <[email protected]>
>> ---
>> include/linux/swap.h | 1 +
>> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
>> mm/swapfile.c | 21 +++++++++-
>> 3 files changed, 111 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a11c75e897ec..e88563978441 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -299,6 +299,7 @@ struct swap_info_struct {
>> signed char type; /* strange name for an index */
>> unsigned int max; /* extent of the swap_map */
>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
>> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> struct swap_cluster_list free_clusters; /* free clusters list */
>> unsigned int lowest_bit; /* index of first free in swap_map */
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index a360857cf75d..2cac1e11fb85 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>> goto out;
>> }
>>
>> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
>> +{
>> + unsigned long *data;
>> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
>> + bool ret = false;
>> +
>> + data = kmap_local_folio(folio, i * PAGE_SIZE);
>> + if (data[last_pos])
>> + goto out;
>> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
>> + if (data[pos])
>> + goto out;
>> + }
>> + ret = true;
>> +out:
>> + kunmap_local(data);
>> + return ret;
>> +}
>> +
>> +static bool is_folio_zero_filled(struct folio *folio)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + if (!is_folio_page_zero_filled(folio, i))
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static void folio_zero_fill(struct folio *folio)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++)
>> + clear_highpage(folio_page(folio, i));
>> +}
>> +
>> +static void swap_zeromap_folio_set(struct folio *folio)
>> +{
>> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> + swp_entry_t entry;
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + entry = page_swap_entry(folio_page(folio, i));
>> + set_bit(swp_offset(entry), sis->zeromap);
>> + }
>> +}
>> +
>> +static void swap_zeromap_folio_clear(struct folio *folio)
>> +{
>> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> + swp_entry_t entry;
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + entry = page_swap_entry(folio_page(folio, i));
>> + clear_bit(swp_offset(entry), sis->zeromap);
>> + }
>> +}
>> +
>> +static bool swap_zeromap_folio_test(struct folio *folio)
>> +{
>> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> + swp_entry_t entry;
>> + unsigned int i;
>> +
>> + for (i = 0; i < folio_nr_pages(folio); i++) {
>> + entry = page_swap_entry(folio_page(folio, i));
>> + if (!test_bit(swp_offset(entry), sis->zeromap))
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> /*
>> * We may have stale swap cache pages in memory: notice
>> * them here and get rid of the unnecessary final write.
>> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>> folio_unlock(folio);
>> return ret;
>> }
>> +
>> + if (is_folio_zero_filled(folio)) {
>> + swap_zeromap_folio_set(folio);
>> + folio_start_writeback(folio);
>> + folio_unlock(folio);
>> + folio_end_writeback(folio);
>> + return 0;
>> + }
>> + swap_zeromap_folio_clear(folio);
>> if (zswap_store(folio)) {
>> folio_start_writeback(folio);
>> folio_unlock(folio);
>> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>> psi_memstall_enter(&pflags);
>> }
>> delayacct_swapin_start();
>> -
>> - if (zswap_load(folio)) {
>> + if (swap_zeromap_folio_test(folio)) {
>> + folio_zero_fill(folio);
>> + folio_mark_uptodate(folio);
>> + folio_unlock(folio);
>> + } else if (zswap_load(folio)) {
>> folio_mark_uptodate(folio);
>> folio_unlock(folio);
>> } else if (data_race(sis->flags & SWP_FS_OPS)) {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f1e559e216bd..90451174fe34 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> unsigned int idx)
>> {
>> + unsigned int i;
>> +
>> /*
>> * If scan_swap_map_slots() can't find a free cluster, it will check
>> * si->swap_map directly. To make sure the discarding cluster isn't
>> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> */
>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> + /*
>> + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
>> + * call on other slots, hence use atomic clear_bit for zeromap instead of the
>> + * non-atomic bitmap_clear.
>> + */
>> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>
>> cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>>
>> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>> static void swap_do_scheduled_discard(struct swap_info_struct *si)
>> {
>> struct swap_cluster_info *info, *ci;
>> - unsigned int idx;
>> + unsigned int idx, i;
>>
>> info = si->cluster_info;
>>
>> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>> __free_cluster(si, idx);
>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> 0, SWAPFILE_CLUSTER);
>> + for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>> unlock_cluster(ci);
>> }
>> }
>> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>> count = p->swap_map[offset];
>> VM_BUG_ON(count != SWAP_HAS_CACHE);
>> p->swap_map[offset] = 0;
>> + clear_bit(offset, p->zeromap);
> Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
> we can have a problem, where:
>
> 1. The swap entry has its zeromap bit set, and is freed to the swap
> slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
> reclaimed from the swap cache, and all the processes referring to it
> are terminated, which decrements the swap count to 0 (swap_free() ->
> __swap_entry_free() -> free_swap_slots())
>
> 2. The swap slot is then re-used in swap space allocation
> (add_to_swap()) - its zeromap bit is never cleared.
>
> 3. swap_writepage() writes that non-zero page to swap

In swap_writepage, with this patch you have:

    if (is_folio_zero_filled(folio)) {
        swap_zeromap_folio_set(folio);
        folio_unlock(folio);
        return 0;
    }
    swap_zeromap_folio_clear(folio);

i.e. if the folio is not zero-filled, swap_zeromap_folio_clear() will be
called and the bit is cleared, so I think it would take care of this
scenario? swap_read_folio() will see the bit cleared in step 4.

> 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
> the entry is set, so populates a zero page for it.
>
> zswap in the past has to carefully invalidate these leftover entries
> quite carefully. Chengming then move the invalidation point to
> free_swap_slot(), massively simplifying the logic.
>
> I wonder if we need to do the same here?

2024-06-11 19:04:00

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Tue, Jun 11, 2024 at 11:47 AM Yosry Ahmed <[email protected]> wrote:
>
> [..]
> > > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > > count = p->swap_map[offset];
> > > VM_BUG_ON(count != SWAP_HAS_CACHE);
> > > p->swap_map[offset] = 0;
> > > + clear_bit(offset, p->zeromap);
> >
> > Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
> > we can have a problem, where:
> >
> > 1. The swap entry has its zeromap bit set, and is freed to the swap
> > slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
> > reclaimed from the swap cache, and all the processes referring to it
> > are terminated, which decrements the swap count to 0 (swap_free() ->
> > __swap_entry_free() -> free_swap_slots())
> >
> > 2. The swap slot is then re-used in swap space allocation
> > (add_to_swap()) - its zeromap bit is never cleared.
>
> I do not think this can happen before swap_entry_free() is called.
> Note that when a swap entry is freed to the swap slot cache in
> free_swap_slot(), it is added to cache->slots_ret, not cache->slots.
> The former are swap entries cached to be later freed using
> swap_entry_free().

Ahhh I see. Good point. Then yeah this should be safe from this POV.

>
> >
> > 3. swap_writepage() writes that non-zero page to swap
> >
> > 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
> > the entry is set, so populates a zero page for it.
> >
> > zswap in the past has to carefully invalidate these leftover entries
> > quite carefully. Chengming then move the invalidation point to
> > free_swap_slot(), massively simplifying the logic.
>
> I think the main benefit of moving the invalidation point was avoiding
> leaving the compressed page in zswap until swap_entry_free() is
> called, which will happen only when the swap slot caches are drained.
>

This is true. In this case yeah there's probably not much difference
between clearing the bit here vs in swap_entry_free().

> >
> > I wonder if we need to do the same here?

2024-06-11 19:34:01

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

On Tue, Jun 11, 2024 at 11:50 AM Usama Arif <[email protected]> wrote:
> In swap_writepage, with this patch you have:
>
> if (is_folio_zero_filled(folio)) {
> swap_zeromap_folio_set(folio);
> folio_unlock(folio);
> return 0;
> }
> swap_zeromap_folio_clear(folio);
>

I was concerned with the swap slot being freed and reused, without
ever being read :) But looks like it has to be properly reset before
being reused, so all is well on that front.

What about the put_swap_folio() -> swap_free_cluster() case - do we
need to handle zeromap bit clearing here too? Looks like it's clearing
the swap_map (i.e returning it directly to the swapfile, allowing
those slots to be reused) here, and I notice that you clear the
zeromap bitmap wherever the swap_map is cleared as well :)

I jumped around the code a bit - in free_cluster() (called by
swap_free_cluster()), there's this chunk:

if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
(SWP_WRITEOK | SWP_PAGE_DISCARD)) {
swap_cluster_schedule_discard(si, idx);
return;
}

swap_cluster_schedule_discard() does clear_bit() on the zeromap on the
entire cluster. We also clear_bit() in the work function
swap_do_scheduled_discard() (is this redundant?).

But what if this check is false, i.e the swap device does not have the
SWP_PAGE_DISCARD flag set? Are we not clearing the bits in the zeromap
here?

2024-06-12 10:49:26

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap


On 11/06/2024 20:33, Nhat Pham wrote:
> On Tue, Jun 11, 2024 at 11:50 AM Usama Arif <[email protected]> wrote:
>> In swap_writepage, with this patch you have:
>>
>> if (is_folio_zero_filled(folio)) {
>> swap_zeromap_folio_set(folio);
>> folio_unlock(folio);
>> return 0;
>> }
>> swap_zeromap_folio_clear(folio);
>>
> I was concerned with the swap slot being freed and reused, without
> ever being read :) But looks like it has to be properly reset before
> being reused, so all is well on that front.
>
> What about the put_swap_folio() -> swap_free_cluster() case - do we
> need to handle zeromap bit clearing here too? Looks like it's clearing
> the swap_map (i.e returning it directly to the swapfile, allowing
> those slots to be reused) here, and I notice that you clear the
> zeromap bitmap wherever the swap_map is cleared as well :)
>
> I jumped around the code a bit - in free_cluster() (called by
> swap_free_cluster()), there's this chunk:
>
> if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> swap_cluster_schedule_discard(si, idx);
> return;
> }
>
> swap_cluster_schedule_discard() does clear_bit() on the zeromap on the
> entire cluster. We also clear_bit() in the work function
> swap_do_scheduled_discard() (is this redundant?).
>
> But what if this check is false, i.e the swap device does not have the
> SWP_PAGE_DISCARD flag set? Are we not clearing the bits in the zeromap
> here?

Yes, we should clear the zeromap bits in swap_free_cluster as well; will do in
the next revision.
Thanks!
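
For illustration (not the actual next revision): one way to cover the
put_swap_folio() -> swap_free_cluster() case is a small helper that clears
the zeromap bits for a whole cluster, called from swap_free_cluster() in
addition to the cluster-discard paths the posted patch already handles. The
helper name and placement are assumptions; atomic clear_bit() is used for
the same reason given in the swap_cluster_schedule_discard() hunk.

static void swap_cluster_zeromap_clear(struct swap_info_struct *si,
				       unsigned long idx)
{
	unsigned long offset = idx * SWAPFILE_CLUSTER;
	unsigned int i;

	for (i = 0; i < SWAPFILE_CLUSTER; i++)
		clear_bit(offset + i, si->zeromap);
}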



2024-06-13 21:23:46

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] mm: store zero pages to be swapped out in a bitmap

On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
>
> Going back to the v1 implementation of the patchseries. The main reason
> is that a correct version of v2 implementation requires another rmap
> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> work (i.e. more CPU used) [1], is more complex to implement compared to v1
> and is harder to verify correctness compared to v1, where everything is
> handled by swap.
>
> ---
> As shown in the patchseries that introduced the zswap same-filled
> optimization [2], 10-20% of the pages stored in zswap are same-filled.
> This is also observed across Meta's server fleet.
> By using VM counters in swap_writepage (not included in this
> patchseries) it was found that less than 1% of the same-filled
> pages to be swapped out are non-zero pages.
>
> For conventional swap setup (without zswap), rather than reading/writing
> these pages to flash resulting in increased I/O and flash wear, a bitmap
> can be used to mark these pages as zero at write time, and the pages can
> be filled at read time if the bit corresponding to the page is set.
>
> When using zswap with swap, this also means that a zswap_entry does not
> need to be allocated for zero filled pages resulting in memory savings
> which would offset the memory used for the bitmap.
>
> A similar attempt was made earlier in [3] where zswap would only track
> zero-filled pages instead of same-filled.
> This patchseries adds zero-filled pages optimization to swap
> (hence it can be used even if zswap is disabled) and removes the
> same-filled code from zswap (as only 1% of the same-filled pages are
> non-zero), simplifying code.
>
> This patchseries is based on mm-unstable.

Aside from saving swap/zswap space and simplifying the zswap code
(thanks for that!), did you observe any performance benefits from not
having to go into zswap code for zero-filled pages?

In [3], I observed ~1.5% improvement in kernbench just by optimizing
zswap's handling of zero-filled pages, and that benchmark only
produced around 1.5% zero-filled pages. I imagine that by avoiding the
zswap code entirely, and with workloads that have 10-20% zero-filled
pages, the performance improvement should be more pronounced.

When zswap is not being used and all swap activity translates to IO, I
imagine the benefits will be much more significant.

I am curious if you have any numbers with or without zswap :)

2024-06-13 21:51:19

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] mm: store zero pages to be swapped out in a bitmap

On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
>
> Going back to the v1 implementation of the patchseries. The main reason
> is that a correct version of v2 implementation requires another rmap
> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> work (i.e. more CPU used) [1], is more complex to implement compared to v1
> and is harder to verify correctness compared to v1, where everything is
> handled by swap.
>
> ---
> As shown in the patchseries that introduced the zswap same-filled
> optimization [2], 10-20% of the pages stored in zswap are same-filled.
> This is also observed across Meta's server fleet.
> By using VM counters in swap_writepage (not included in this
> patchseries) it was found that less than 1% of the same-filled
> pages to be swapped out are non-zero pages.
>
> For conventional swap setup (without zswap), rather than reading/writing
> these pages to flash resulting in increased I/O and flash wear, a bitmap
> can be used to mark these pages as zero at write time, and the pages can
> be filled at read time if the bit corresponding to the page is set.
>
> When using zswap with swap, this also means that a zswap_entry does not
> need to be allocated for zero filled pages resulting in memory savings
> which would offset the memory used for the bitmap.
>
> A similar attempt was made earlier in [3] where zswap would only track
> zero-filled pages instead of same-filled.
> This patchseries adds zero-filled pages optimization to swap
> (hence it can be used even if zswap is disabled) and removes the
> same-filled code from zswap (as only 1% of the same-filled pages are
> non-zero), simplifying code.

There is also code to handle same-filled pages in zram; should we
remove this as well? It is worth noting that the handling in zram was
initially for zero-filled pages only, but it was extended to cover
same-filled pages as well by commit 8e19d540d107 ("zram: extend zero
pages to same element pages"). Apparently in a test on Android, about
2.5% of the swapped out pages were non-zero same-filled pages.

However, the leap from handling zero-filled pages to handling all
same-filled pages in zram wasn't a stretch. But now that zero-filled
pages handling in zram is redundant with this series, I wonder if it's
still worth keeping the same-filled pages handling.

Adding Minchan and Sergey here.

>
> This patchseries is based on mm-unstable.
>
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> [3] https://lore.kernel.org/lkml/[email protected]/
>
> ---
> v2->v3:
> - Going back to the v1 version of the implementation (David and Shakeel)
> - convert unatomic bitmap_set/clear to atomic set/clear_bit (Johannes)
> - use clear_highpage instead of folio_page_zero_fill (Yosry)
>
> v1 -> v2:
> - instead of using a bitmap in swap, clear pte for zero pages and let
> do_pte_missing handle this page at page fault. (Yosry and Matthew)
> - Check end of page first when checking if folio is zero filled as
> it could lead to better performance. (Yosry)
>
> Usama Arif (2):
> mm: store zero pages to be swapped out in a bitmap
> mm: remove code to handle same filled pages
>
> include/linux/swap.h | 1 +
> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++-
> mm/swapfile.c | 21 +++++++++-
> mm/zswap.c | 86 ++++-------------------------------------
> 4 files changed, 119 insertions(+), 81 deletions(-)
>
> --
> 2.43.0
>

2024-06-13 22:41:27

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] mm: store zero pages to be swapped out in a bitmap

On Thu, Jun 13, 2024 at 02:50:31PM GMT, Yosry Ahmed wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
> >
> > Going back to the v1 implementation of the patchseries. The main reason
> > is that a correct version of v2 implementation requires another rmap
> > walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> > work (i.e. more CPU used) [1], is more complex to implement compared to v1
> > and is harder to verify correctness compared to v1, where everything is
> > handled by swap.
> >
> > ---
> > As shown in the patchseries that introduced the zswap same-filled
> > optimization [2], 10-20% of the pages stored in zswap are same-filled.
> > This is also observed across Meta's server fleet.
> > By using VM counters in swap_writepage (not included in this
> > patchseries) it was found that less than 1% of the same-filled
> > pages to be swapped out are non-zero pages.
> >
> > For conventional swap setup (without zswap), rather than reading/writing
> > these pages to flash resulting in increased I/O and flash wear, a bitmap
> > can be used to mark these pages as zero at write time, and the pages can
> > be filled at read time if the bit corresponding to the page is set.
> >
> > When using zswap with swap, this also means that a zswap_entry does not
> > need to be allocated for zero filled pages resulting in memory savings
> > which would offset the memory used for the bitmap.
> >
> > A similar attempt was made earlier in [3] where zswap would only track
> > zero-filled pages instead of same-filled.
> > This patchseries adds zero-filled pages optimization to swap
> > (hence it can be used even if zswap is disabled) and removes the
> > same-filled code from zswap (as only 1% of the same-filled pages are
> > non-zero), simplifying code.
>
> There is also code to handle same-filled pages in zram, should we
> remove this as well? It is worth noting that the handling in zram was
> initially for zero-filled pages only, but it was extended to cover
> same-filled pages as well by commit 8e19d540d107 ("zram: extend zero
> pages to same element pages"). Apparently in a test on Android, about
> 2.5% of the swapped out pages were non-zero same-filled pages.
>
> However, the leap from handling zero-filled pages to handling all
> same-filled pages in zram wasn't a stretch. But now that zero-filled
> pages handling in zram is redundant with this series, I wonder if it's
> still worth keeping the same-filled pages handling.

Please correct me if I am wrong, but zram same-filled page handling is
not just limited to the swap-on-zram use case; any user of zram as a
block device can benefit from it. Also, zram might not see any
simplification similar to zswap from this patch series. I would say the
motivation behind the zswap changes seems quite different from possible
zram changes. I would recommend evaluating these cases independently.

2024-06-13 23:00:44

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] mm: store zero pages to be swapped out in a bitmap

On Thu, Jun 13, 2024 at 3:41 PM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Jun 13, 2024 at 02:50:31PM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
> > >
> > > Going back to the v1 implementation of the patchseries. The main reason
> > > is that a correct version of v2 implementation requires another rmap
> > > walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> > > work (i.e. more CPU used) [1], is more complex to implement compared to v1
> > > and is harder to verify correctness compared to v1, where everything is
> > > handled by swap.
> > >
> > > ---
> > > As shown in the patchseries that introduced the zswap same-filled
> > > optimization [2], 10-20% of the pages stored in zswap are same-filled.
> > > This is also observed across Meta's server fleet.
> > > By using VM counters in swap_writepage (not included in this
> > > patchseries) it was found that less than 1% of the same-filled
> > > pages to be swapped out are non-zero pages.
> > >
> > > For conventional swap setup (without zswap), rather than reading/writing
> > > these pages to flash resulting in increased I/O and flash wear, a bitmap
> > > can be used to mark these pages as zero at write time, and the pages can
> > > be filled at read time if the bit corresponding to the page is set.
> > >
> > > When using zswap with swap, this also means that a zswap_entry does not
> > > need to be allocated for zero filled pages resulting in memory savings
> > > which would offset the memory used for the bitmap.
> > >
> > > A similar attempt was made earlier in [3] where zswap would only track
> > > zero-filled pages instead of same-filled.
> > > This patchseries adds zero-filled pages optimization to swap
> > > (hence it can be used even if zswap is disabled) and removes the
> > > same-filled code from zswap (as only 1% of the same-filled pages are
> > > non-zero), simplifying code.
> >
> > There is also code to handle same-filled pages in zram, should we
> > remove this as well? It is worth noting that the handling in zram was
> > initially for zero-filled pages only, but it was extended to cover
> > same-filled pages as well by commit 8e19d540d107 ("zram: extend zero
> > pages to same element pages"). Apparently in a test on Android, about
> > 2.5% of the swapped out pages were non-zero same-filled pages.
> >
> > However, the leap from handling zero-filled pages to handling all
> > same-filled pages in zram wasn't a stretch. But now that zero-filled
> > pages handling in zram is redundant with this series, I wonder if it's
> > still worth keeping the same-filled pages handling.
>
> Please correct me if I am wrong but zram same-filled page handling is
> not just limited to swap-on-zram use-case and any zram as block device
> user can benefit from it. Also zram might not see any simplification
> similar to zswap in this patch series. I would say motivation behind
> zswap changes seems quite different from possible zram changes. I would
> recommed to evaluate these cases independently.

Uh yes. I keep forgetting that zram is used for use cases other than
swap. Please dismiss my comments then (unless it's uncommon to have
zero-filled / same-filled pages in those other use cases).

2024-06-14 09:22:11

by Usama Arif

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] mm: store zero pages to be swapped out in a bitmap


On 13/06/2024 22:21, Yosry Ahmed wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
>> Going back to the v1 implementation of the patchseries. The main reason
>> is that a correct version of v2 implementation requires another rmap
>> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
>> work (i.e. more CPU used) [1], is more complex to implement compared to v1
>> and is harder to verify correctness compared to v1, where everything is
>> handled by swap.
>>
>> ---
>> As shown in the patchseries that introduced the zswap same-filled
>> optimization [2], 10-20% of the pages stored in zswap are same-filled.
>> This is also observed across Meta's server fleet.
>> By using VM counters in swap_writepage (not included in this
>> patchseries) it was found that less than 1% of the same-filled
>> pages to be swapped out are non-zero pages.
>>
>> For conventional swap setup (without zswap), rather than reading/writing
>> these pages to flash resulting in increased I/O and flash wear, a bitmap
>> can be used to mark these pages as zero at write time, and the pages can
>> be filled at read time if the bit corresponding to the page is set.
>>
>> When using zswap with swap, this also means that a zswap_entry does not
>> need to be allocated for zero filled pages resulting in memory savings
>> which would offset the memory used for the bitmap.
>>
>> A similar attempt was made earlier in [3] where zswap would only track
>> zero-filled pages instead of same-filled.
>> This patchseries adds zero-filled pages optimization to swap
>> (hence it can be used even if zswap is disabled) and removes the
>> same-filled code from zswap (as only 1% of the same-filled pages are
>> non-zero), simplifying code.
>>
>> This patchseries is based on mm-unstable.
> Aside from saving swap/zswap space and simplifying the zswap code
> (thanks for that!), did you observe any performance benefits from not
> having to go into the zswap code for zero-filled pages?
>
> In [3], I observed a ~1.5% improvement in kernbench just by optimizing
> zswap's handling of zero-filled pages, and that benchmark only
> produced around 1.5% zero-filled pages. I imagine that by avoiding the
> zswap code entirely, and for workloads that have 10-20% zero-filled
> pages, the performance improvement should be more pronounced.
>
> When zswap is not being used and all swap activity translates to IO, I
> imagine the benefits will be much more significant.
>
> I am curious if you have any numbers with or without zswap :)

Apart from tracking zero-filled pages (using inaccurate counters not in
this series), which followed the same pattern as zswap_same_filled_pages,
the NVMe writes went down by around 5-10% during stable points in the
production experiment. The performance improved by 2-3% at some points,
but this compares 2 sets of machines running production workloads
(which can vary between machine sets), so I would take those numbers
cautiously, which is why I didn't include them in the cover letter.
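
For anyone skimming the thread, the NVMe reduction comes from a
write/read path change along these lines. This is an illustrative
userspace-style sketch with hypothetical helper names and sizes, not the
actual page_io.c/swapfile.c changes:

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE     4096UL
#define NR_SWAP_SLOTS 262144UL		/* hypothetical: 1G of swap */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* One bit per swap slot; the kernel side uses atomic set_bit()/clear_bit(). */
static unsigned long zeromap[NR_SWAP_SLOTS / BITS_PER_LONG];

static bool page_is_zero_filled(const void *page)
{
	const unsigned long *p = page;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(*p); i++)
		if (p[i])
			return false;
	return true;
}

/* Hypothetical stand-ins for submitting a bio to the swap device. */
static void block_write(unsigned long slot, const void *page) { (void)slot; (void)page; }
static void block_read(unsigned long slot, void *page) { (void)slot; (void)page; }

static void swap_out(unsigned long slot, const void *page)
{
	if (page_is_zero_filled(page)) {
		/* Mark the slot as zero and skip the device write entirely. */
		zeromap[slot / BITS_PER_LONG] |= 1UL << (slot % BITS_PER_LONG);
		return;
	}
	zeromap[slot / BITS_PER_LONG] &= ~(1UL << (slot % BITS_PER_LONG));
	block_write(slot, page);
}

static void swap_in(unsigned long slot, void *page)
{
	if (zeromap[slot / BITS_PER_LONG] & (1UL << (slot % BITS_PER_LONG))) {
		/* The real code uses clear_highpage(); no device read happens. */
		memset(page, 0, PAGE_SIZE);
		return;
	}
	block_read(slot, page);
}

The bitmap costs one bit per swap slot (roughly 32K of memory per 1G of
swap with 4K pages), which is the overhead the cover letter argues is
offset by not allocating zswap_entry structures for zero-filled pages.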


2024-06-14 09:29:36

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] mm: store zero pages to be swapped out in a bitmap

On Fri, Jun 14, 2024 at 2:22 AM Usama Arif <[email protected]> wrote:
>
>
> On 13/06/2024 22:21, Yosry Ahmed wrote:
> > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <[email protected]> wrote:
> >> Going back to the v1 implementation of the patchseries. The main reason
> >> is that a correct version of v2 implementation requires another rmap
> >> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> >> work (i.e. more CPU used) [1], is more complex to implement compared to v1
> >> and is harder to verify correctness compared to v1, where everything is
> >> handled by swap.
> >>
> >> ---
> >> As shown in the patchseries that introduced the zswap same-filled
> >> optimization [2], 10-20% of the pages stored in zswap are same-filled.
> >> This is also observed across Meta's server fleet.
> >> By using VM counters in swap_writepage (not included in this
> >> patchseries) it was found that less than 1% of the same-filled
> >> pages to be swapped out are non-zero pages.
> >>
> >> For conventional swap setup (without zswap), rather than reading/writing
> >> these pages to flash resulting in increased I/O and flash wear, a bitmap
> >> can be used to mark these pages as zero at write time, and the pages can
> >> be filled at read time if the bit corresponding to the page is set.
> >>
> >> When using zswap with swap, this also means that a zswap_entry does not
> >> need to be allocated for zero filled pages resulting in memory savings
> >> which would offset the memory used for the bitmap.
> >>
> >> A similar attempt was made earlier in [3] where zswap would only track
> >> zero-filled pages instead of same-filled.
> >> This patchseries adds zero-filled pages optimization to swap
> >> (hence it can be used even if zswap is disabled) and removes the
> >> same-filled code from zswap (as only 1% of the same-filled pages are
> >> non-zero), simplifying code.
> >>
> >> This patchseries is based on mm-unstable.
> > Aside from saving swap/zswap space and simplifying the zswap code
> > (thanks for that!), did you observe any performance benefits from not
> > having to go into the zswap code for zero-filled pages?
> >
> > In [3], I observed a ~1.5% improvement in kernbench just by optimizing
> > zswap's handling of zero-filled pages, and that benchmark only
> > produced around 1.5% zero-filled pages. I imagine that by avoiding the
> > zswap code entirely, and for workloads that have 10-20% zero-filled
> > pages, the performance improvement should be more pronounced.
> >
> > When zswap is not being used and all swap activity translates to IO, I
> > imagine the benefits will be much more significant.
> >
> > I am curious if you have any numbers with or without zswap :)
>
> Apart from tracking zero-filled pages (using inaccurate counters not in
> this series), which followed the same pattern as zswap_same_filled_pages,
> the NVMe writes went down by around 5-10% during stable points in the
> production experiment. The performance improved by 2-3% at some points,
> but this compares 2 sets of machines running production workloads
> (which can vary between machine sets), so I would take those numbers
> cautiously, which is why I didn't include them in the cover letter.
>

Yeah, this makes sense, thanks. It would have been great if we had
comparable numbers with and without this series, but this shouldn't be
a big deal; the advantage of the series should be self-explanatory.
It's just a shame you don't get to brag about it :)