2024-03-26 19:28:41

by Kairui Song

Subject: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

From: Kairui Song <[email protected]>

A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
bypass swapin):
https://lore.kernel.org/linux-mm/[email protected]/

Because we have to spin on the swap map on race, and the swap map is too
small to hold more usable info, an ugly schedule_timeout_uninterruptible(1)
was added. This is not the first hackish workaround added for the cache
bypass swapin path, and it likely won't be the last. I did many experiments
locally to see whether the swap cache bypass path can be dropped while
keeping performance comparable, and it seems doable.
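
For reference, the workaround in question looks roughly like the following
(a simplified excerpt of the current swapin_direct(); patch 4/10 below
removes it). If another swapin already holds SWAP_HAS_CACHE for the entry,
we cannot tell a racing swapin apart from entry reuse, so we back off and
retry the fault later:

    if (swapcache_prepare(entry)) {
            /* Relax a bit to prevent rapid repeated page faults */
            schedule_timeout_uninterruptible(1);
            return ERR_PTR(-EBUSY);
    }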

This series does the following things:
1. Remove swap cache bypass completely.
2. Apply multiple optimizations after that; these optimizations are
either not doable at all or very difficult without dropping the cache
bypass swapin path.
3. Use swap cache as a synchronization layer, also unify some code
with page cache (filemap).

As a result, we have:
1. Comparable performance; some tests are even faster.
2. Multi-index support for swap cache.
3. Many hackish workarounds removed, including the long-tail latency
issue above.

Sending this as an RFC to collect discussion, suggestions, or rejection
early. This probably needs to be split into multiple series, but the
performance is not good until the last patch, so splitting it up front
may make this approach look less convincing. There are also some TODO
items and further optimization opportunities if we are OK with this
approach.

This is based on another series of mine, which reuses filemap code for the swap cache:
[PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
https://lore.kernel.org/linux-mm/[email protected]/

Patch 1/10 introduces a helper on the filemap side to be used later.
Patches 2/10 and 3/10 are cleanups and preparation for removing the swap
cache bypass swapin path.
Patch 4/10 removes the swap cache bypass swapin path, and performance
drops heavily (-28%).
Patch 5/10 applies the first optimization after the removal: since all
folios go through the swap cache now, there is no need for explicit
shadow clearing any more.
Patch 6/10 applies another optimization after the shadow clearing
routines are cleaned up. The swap cache is now very similar to the page
cache, so we can simply reuse page cache code and gain multi-index
support. Shadow memory usage drops a lot.
Patch 7/10 just renames __read_swap_cache_async; it will be refactored
into a key part of this series, and the current name is very confusing
to me.
Patch 8/10 makes the swap cache a synchronization layer and introduces
two helpers for adding folios to the swap cache; the caller either
succeeds or gets back an existing folio to wait on (see the rough
sketch below).
Patch 9/10 applies another optimization. With the above two helpers,
swap cache lookups can be optimized to avoid false lookups, which helps
improve performance.
Patch 10/10 applies a major optimization for SWP_SYNCHRONOUS_IO devices;
after this commit, performance for simple swapin/swapout is basically
the same as before.
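
My rough understanding of the caller-side pattern patch 8/10 aims for,
sketched with a made-up helper name (swap_cache_add_or_get() here is
hypothetical, only the behavior matters; folio_put(), folio_wait_locked()
and swap_read_folio() are the existing kernel helpers):

    /*
     * Hypothetical sketch, not the actual helpers from patch 8/10:
     * try to install our newly allocated folio for this swap entry;
     * if someone else beat us to it, we get their folio back and
     * wait for it instead of doing the read ourselves.
     */
    folio = swap_cache_add_or_get(new_folio, entry, gfp_mask);
    if (folio != new_folio) {
            /* Lost the race: drop our copy, wait for the winner's I/O. */
            folio_put(new_folio);
            folio_wait_locked(folio);
    } else {
            /* We own the swap cache slot: do the read ourselves. */
            swap_read_folio(folio, true, NULL);
    }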

Test 1, sequential swapin/out of 30G of zero pages on ZRAM:

               Before (us)    After (us)
Swapout:         33619409      33886008
Swapin:          32393771      32465441  (- 0.2%)
Swapout (THP):    7817909       6899938  (+11.8%)
Swapin (THP):    32452387      33193479  (- 2.2%)

And after swapping out 30G with THP, the radix node usage dropped by a
lot:

Before: radix_tree_node 73728K
After: radix_tree_node 7056K (-94%)

Test 2:
Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
--mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
--threads=48 --time=300 --report-interval=10 run

Before: transactions: 4849.25 per sec
After: transactions: 4849.40 per sec

Test 3:
Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 100 > /sys/module/zswap/parameters/max_pool_percent
echo 1 > /sys/module/zswap/parameters/enabled
echo y > /sys/module/zswap/parameters/shrinker_enabled

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
--mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
--threads=48 --time=600 --report-interval=10 run

Before: transactions: 1662.90 per sec
After: transactions: 1726.52 per sec

Test 4:
Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo 100 > /sys/module/zswap/parameters/max_pool_percent
echo 1 > /sys/module/zswap/parameters/enabled
echo y > /sys/module/zswap/parameters/shrinker_enabled

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
--mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
--threads=48 --time=600 --report-interval=10 run

Before: transactions: 2860.90 per sec.
After: transactions: 2802.55 per sec.

Test 5:
Memtier / memcached (16G brd SWAP, 8G memcg, THP never):

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &

memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
--ratio 1:0 --pipeline 8 -d 1000

Before: 106730.31 Ops/sec
After: 106360.11 Ops/sec

Test 6:
Memtier / memcached (16G brd SWAP, 8G memcg, THP always):

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &

memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
--ratio 1:0 --pipeline 8 -d 1000

Before: 83193.11 Ops/sec
After: 82504.89 Ops/sec

These tests were run under heavy memory stress, and the performance
seems basically the same as before, very slightly better or worse in
certain cases. The benefits of multi-index are mostly erased by
fragmentation, and workingset node usage is slightly lower.

Some TODO items (possibly for follow-up series) if we are OK with this approach:

- I see a slight performance regression for THP tests and could
identify a clear hotspot with perf; my guess is that contention on the
xa_lock is the issue (we have one xa_lock for every 64M of swap cache
space), as THP handling needs to hold the lock longer than usual.
Splitting the xa_lock to be more fine-grained seems like a good
solution. We have SWAP_ADDRESS_SPACE_SHIFT = 14, which is not an
optimal value: considering XA_CHUNK_SHIFT is 6, we get three layers of
XArray just for 2 extra bits. 12 should be better, always making full
use of a whole XA chunk and needing at most two layers (see the sketch
after this list). But the duplicated address_space structs also waste
more memory and cachelines. I see an observable performance drop (~3%)
after changing SWAP_ADDRESS_SPACE_SHIFT to 12. It might be a good idea
to decouple the swap cache xarray from address_space (there are too
many users of the swap cache, so the change shouldn't be too dirty).

- Actually, after patch 4/10 the performance is much better for tests
limited by a memory cgroup, until 10/10 applies the direct swap cache
freeing logic for SWP_SYNCHRONOUS_IO swapin. This is because if the swap
device is not near full, swapin doesn't clear the swap cache, so
repeated swapout doesn't need to re-allocate a swap entry, making things
faster. This may indicate that lazy freeing of the swap cache could
benefit certain workloads and may be worth looking into later.

- Currently SWP_SYNCHRONOUS_IO swapin bypasses readahead and force-drops
the swap cache after swapin is done, which can be cleaned up and
optimized further after this series. The device type would then only
determine the readahead logic, and the swap cache drop check could be
based purely on the swap count.

- The recent mTHP swapin/swapout series should have no fundamental
conflict with this.
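
To illustrate the XArray depth math in the first TODO item above, here
is a tiny stand-alone sketch (plain userspace C; the constant name only
mirrors the kernel's XA_CHUNK_SHIFT, which is 6):

    #include <stdio.h>

    #define XA_CHUNK_SHIFT 6    /* each XArray level resolves 6 index bits */

    /* Number of XArray levels needed to cover 'shift' bits of index space. */
    static int xa_levels(int shift)
    {
            return (shift + XA_CHUNK_SHIFT - 1) / XA_CHUNK_SHIFT;
    }

    int main(void)
    {
            /* SWAP_ADDRESS_SPACE_SHIFT = 14: ceil(14 / 6) = 3 levels */
            printf("shift 14 -> %d levels\n", xa_levels(14));
            /* SWAP_ADDRESS_SPACE_SHIFT = 12: ceil(12 / 6) = 2 levels */
            printf("shift 12 -> %d levels\n", xa_levels(12));
            return 0;
    }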

Kairui Song (10):
mm/filemap: split filemap storing logic into a standalone helper
mm/swap: move no readahead swapin code to a stand-alone helper
mm/swap: convert swapin_readahead to return a folio
mm/swap: remove cache bypass swapin
mm/swap: clean shadow only in unmap path
mm/swap: switch to use multi index entries
mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
mm/swap: use swap cache as a synchronization layer
mm/swap: delay the swap cache look up for swapin
mm/swap: optimize synchronous swapin

include/linux/swapops.h | 5 +-
mm/filemap.c | 161 +++++++++-----
mm/huge_memory.c | 78 +++----
mm/internal.h | 2 +
mm/memory.c | 133 ++++-------
mm/shmem.c | 44 ++--
mm/swap.h | 71 ++++--
mm/swap_state.c | 478 +++++++++++++++++++++-------------------
mm/swapfile.c | 64 +++---
mm/vmscan.c | 8 +-
mm/workingset.c | 2 +-
mm/zswap.c | 4 +-
12 files changed, 540 insertions(+), 510 deletions(-)

--
2.43.0



2024-03-26 19:29:56

by Kairui Song

Subject: [RFC PATCH 04/10] mm/swap: remove cache bypass swapin

From: Kairui Song <[email protected]>

We used to have the cache bypass swapin path for better performance,
but by removing it, more optimizations can be applied, giving an even
better overall performance and less hackish code.

These optimizations are not easily doable, or not doable at all,
without this removal.

This patch simply removes the bypass path, and performance drops
heavily for simple swapin. Things won't get this bad for real
workloads, but the drop is still observable. Following commits will
fix this and achieve better performance.

Swapout/in of 30G of zero pages from ZRAM (this mostly measures the
overhead of the swap path itself, because zero pages are not compressed
but simply recorded in ZRAM, and performance drops more as the SWAP
device gets full):

Test result of sequential swapin/out:

               Before (us)    After (us)
Swapout:         33619409      33624641
Swapin:          32393771      41614858  (-28.4%)
Swapout (THP):    7817909       7795530
Swapin (THP):    32452387      41708471  (-28.4%)

Signed-off-by: Kairui Song <[email protected]>
---
mm/memory.c | 18 ++++-------------
mm/swap.h | 10 +++++-----
mm/swap_state.c | 53 ++++++++++---------------------------------------
mm/swapfile.c | 13 ------------
4 files changed, 19 insertions(+), 75 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dfdb620a9123..357d239ee2f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct page *page;
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
- bool need_clear_cache = false;
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
@@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
- /* skip swapcache and readahead */
folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
- if (PTR_ERR(folio) == -EBUSY)
- goto out;
- need_clear_cache = true;
} else {
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
- swapcache = folio;
}

if (!folio) {
@@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto unlock;
}

+ swapcache = folio;
page = folio_file_page(folio, swp_offset(entry));

/* Had to read the page from swap area: Major fault */
@@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->orig_pte = pte;

/* ksm created a completely new copy */
- if (unlikely(folio != swapcache && swapcache)) {
+ if (unlikely(folio != swapcache)) {
folio_add_new_anon_rmap(folio, vma, vmf->address);
folio_add_lru_vma(folio, vma);
} else {
@@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);

folio_unlock(folio);
- if (folio != swapcache && swapcache) {
+ if (folio != swapcache) {
/*
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
@@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
- /* Clear the swap cache pin for direct swapin after PTL unlock */
- if (need_clear_cache)
- swapcache_clear(si, entry);
if (si)
put_swap_device(si);
return ret;
@@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(folio);
out_release:
folio_put(folio);
- if (folio != swapcache && swapcache) {
+ if (folio != swapcache) {
folio_unlock(swapcache);
folio_put(swapcache);
}
- if (need_clear_cache)
- swapcache_clear(si, entry);
if (si)
put_swap_device(si);
return ret;
diff --git a/mm/swap.h b/mm/swap.h
index aee134907a70..ac9573b03432 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
struct folio *swap_cache_get_folio(swp_entry_t entry,
struct vm_area_struct *vma, unsigned long addr);
struct folio *filemap_get_incore_folio(struct address_space *mapping,
@@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
{
return NULL;
}
-
-static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
+static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+ struct vm_fault *vmf)
{
- return 0;
+ return NULL;
}

-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
{
+ return 0;
}

static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2a9c6bdff5ea..49ef6250f676 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
}

/**
- * swapin_direct - swap in folios skipping swap cache and readahead
+ * swapin_direct - swap in folios skipping readahead
* @entry: swap entry of this memory
* @gfp_mask: memory allocation flags
* @vmf: fault information
*
- * Returns the struct folio for entry and addr after the swap entry is read
- * in.
+ * Returns the folio for entry after it is read in.
*/
struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
+ struct mempolicy *mpol;
struct folio *folio;
- void *shadow = NULL;
-
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread may
- * finish swapin first, free the entry, and swapout
- * reusing the same entry. It's undetectable as
- * pte_same() returns true due to entry reuse.
- */
- if (swapcache_prepare(entry)) {
- /* Relax a bit to prevent rapid repeated page faults */
- schedule_timeout_uninterruptible(1);
- return ERR_PTR(-EBUSY);
- }
-
- /* skip swapcache */
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
- if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
- if (mem_cgroup_swapin_charge_folio(folio,
- vma->vm_mm, GFP_KERNEL,
- entry)) {
- folio_unlock(folio);
- folio_put(folio);
- return NULL;
- }
- mem_cgroup_swapin_uncharge_swap(entry);
-
- shadow = get_shadow_from_swap_cache(entry);
- if (shadow)
- workingset_refault(folio, shadow);
+ bool page_allocated;
+ pgoff_t ilx;

- folio_add_lru(folio);
+ mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
+ folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ &page_allocated, false);
+ mpol_cond_put(mpol);

- /* To provide entry to swap_read_folio() */
- folio->swap = entry;
+ if (page_allocated)
swap_read_folio(folio, true, NULL);
- folio->private = NULL;
- }

return folio;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4dd894395a0f..ae8d3aa05df7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3389,19 +3389,6 @@ int swapcache_prepare(swp_entry_t entry)
return __swap_duplicate(entry, SWAP_HAS_CACHE);
}

-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
-{
- struct swap_cluster_info *ci;
- unsigned long offset = swp_offset(entry);
- unsigned char usage;
-
- ci = lock_cluster_or_swap_info(si, offset);
- usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
- unlock_cluster_or_swap_info(si, ci);
- if (!usage)
- free_swap_slot(entry);
-}
-
struct swap_info_struct *swp_swap_info(swp_entry_t entry)
{
return swap_type_to_swap_info(swp_type(entry));
--
2.43.0


2024-03-27 02:54:33

by Huang, Ying

Subject: Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

Hi, Kairui,

Kairui Song <[email protected]> writes:

> From: Kairui Song <[email protected]>
>
> A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
> bypass swapin):
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Because we have to spin on the swap map on race, and swap map is too small
> to contain more usable info, an ugly schedule_timeout_uninterruptible(1)
> is added. It's not the first time a hackish workaround was added for cache
> bypass swapin and not the last time. I did many experiments locally to
> see if the swap cache bypass path can be dropped while keeping the
> performance still comparable. And it seems doable.
>

In general, I think that it's a good idea to unify cache bypass swapin
and normal swapin. But I haven't dived into the implementation yet.

> This series does the following things:
> 1. Remove swap cache bypass completely.
> 2. Apply multiple optimizations after that, these optimizations are
> either undoable or very difficult to do without dropping the cache
> bypass swapin path.
> 3. Use swap cache as a synchronization layer, also unify some code
> with page cache (filemap).
>
> As a result, we have:
> 1. A comparable performance, some tests are even faster.
> 2. Multi-index support for swap cache.
> 3. Removed many hackish workarounds including above long tailing
> issue is gone.
>
> Sending this as RFC to collect some discussion, suggestion, or rejection
> early, this seems need to be split into multiple series, but the
> performance is not good until the last patch so I think start by
> seperating them may make this approach not very convincing. And there
> are still some (maybe further) TODO items and optimization space
> if we are OK with this approach.
>
> This is based on my another series, for reusing filemap code for swapcache:
> [PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Patch 1/10, introduce a helper from filemap side to be used later.
> Patch 2/10, 3/10 are clean up and prepare for removing the swap cache
> bypass swapin path.
> Patch 4/10, removed the swap cache bypass swapin path, and the
> performance drop heavily (-28%).
> Patch 5/10, apply the first optimization after the removal, since all
> folios goes through swap cache now, there is no need to explicit shadow
> clearing any more.
> Patch 6/10, apply another optimization after clean up shadow clearing
> routines. Now swapcache is very alike page cache, so just reuse page
> cache code and we will have multi-index support. Shadow memory usage
> dropped a lot.
> Patch 7/10, just rename __read_swap_cache_async, it will be refactored
> and a key part of this series, and the naming is very confusing to me.
> Patch 8/10, make swap cache as a synchronization layer, introduce two
> helpers for adding folios to swap cache, caller will either succeed or
> get a folio to wait on.
> Patch 9/10, apply another optimization. With above two helpers, looking
> up of swapcache can be optimized and avoid false looking up, which
> helped improve the performance.
> Patch 10/10, apply a major optimization for SWP_SYNCHRONOUS_IO devices,
> after this commit, performance for simple swapin/swapout is basically
> same as before.
>
> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>
> Before (us) After (us)
> Swapout: 33619409 33886008
> Swapin: 32393771 32465441 (- 0.2%)
> Swapout (THP): 7817909 6899938 (+11.8%)
> Swapin (THP) : 32452387 33193479 (- 2.2%)

If my understanding is correct, we don't have swapin (THP) support
yet. Right?

> And after swapping out 30G with THP, the radix node usage dropped by a
> lot:
>
> Before: radix_tree_node 73728K
> After: radix_tree_node 7056K (-94%)

Good!

> Test 2:
> Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> --threads=48 --time=300 --report-interval=10 run
>
> Before: transactions: 4849.25 per sec
> After: transactions: 4849.40 per sec
>
> Test 3:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo 100 > /sys/module/zswap/parameters/max_pool_percent
> echo 1 > /sys/module/zswap/parameters/enabled
> echo y > /sys/module/zswap/parameters/shrinker_enabled
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 1662.90 per sec
> After: transactions: 1726.52 per sec

3.8% improvement. Good!

> Test 4:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
> echo always > /sys/kernel/mm/transparent_hugepage/enabled
> echo 100 > /sys/module/zswap/parameters/max_pool_percent
> echo 1 > /sys/module/zswap/parameters/enabled
> echo y > /sys/module/zswap/parameters/shrinker_enabled
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 2860.90 per sec.
> After: transactions: 2802.55 per sec.
>
> Test 5:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP never):
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
> memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 106730.31 Ops/sec
> After: 106360.11 Ops/sec
>
> Test 5:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP always):
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
> memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 83193.11 Ops/sec
> After: 82504.89 Ops/sec
>
> These tests are tested under heavy memory stress, and the performance
> seems basically same as before,very slightly better/worse for certain
> cases, the benefits of multi-index are basically erased by
> fragmentation and workingset nodes usage is slightly lower.
>
> Some (maybe further) TODO items if we are OK with this approach:
>
> - I see a slight performance regression for THP tests,
> could identify a clear hotspot with perf, my guess is the
> content on the xa_lock is an issue (we have a xa_lock for
> every 64M swap cache space), THP handling needs to take the lock
> longer than usual. splitting the xa_lock to be more
> fine-grained seems a good solution. We have
> SWAP_ADDRESS_SPACE_SHIFT = 14 which is not an optimal value.
> Considering XA_CHUNK_SHIFT is 6, we will have three layer of Xarray
> just for 2 extra bits. 12 should be better to always make use of
> the whole XA chunk and having two layers at most. But duplicated
> address_space struct also wastes more memory and cacheline.
> I see an observable performance drop (~3%) after change
> SWAP_ADDRESS_SPACE_SHIFT to 12. Might be a good idea to
> decouple swap cache xarray from address_space (there are
> too many user for swapcache, shouldn't come too dirty).
>
> - Actually after patch Patch 4/10, the performance is much better for
> tests limited with memory cgroup, until 10/10 applied the direct swap
> cache freeing logic for SWP_SYNCHRONOUS_IO swapin. Because if the swap
> device is not near full, swapin doesn't clear up the swapcache, so
> repeated swapout doesn't need to re-alloc a swap entry, make things
> faster. This may indicate that lazy freeing of swap cache could benifit
> certain workloads and may worth looking into later.
>
> - Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
> swap cache after swapin is done, which can be cleaned up and optimized
> further after this patch. Device type will only determine the
> readahead logic, and swap cache drop check can be based purely on swap
> count.
>
> - Recent mTHP swapin/swapout series should have no fundamental
> conflict with this.
>
> Kairui Song (10):
> mm/filemap: split filemap storing logic into a standalone helper
> mm/swap: move no readahead swapin code to a stand-alone helper
> mm/swap: convert swapin_readahead to return a folio
> mm/swap: remove cache bypass swapin
> mm/swap: clean shadow only in unmap path
> mm/swap: switch to use multi index entries
> mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
> mm/swap: use swap cache as a synchronization layer
> mm/swap: delay the swap cache look up for swapin
> mm/swap: optimize synchronous swapin
>
> include/linux/swapops.h | 5 +-
> mm/filemap.c | 161 +++++++++-----
> mm/huge_memory.c | 78 +++----
> mm/internal.h | 2 +
> mm/memory.c | 133 ++++-------
> mm/shmem.c | 44 ++--
> mm/swap.h | 71 ++++--
> mm/swap_state.c | 478 +++++++++++++++++++++-------------------
> mm/swapfile.c | 64 +++---
> mm/vmscan.c | 8 +-
> mm/workingset.c | 2 +-
> mm/zswap.c | 4 +-
> 12 files changed, 540 insertions(+), 510 deletions(-)

--
Best Regards,
Huang, Ying

2024-03-27 03:01:33

by Kairui Song

Subject: Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

On Wed, Mar 27, 2024 at 10:54 AM Huang, Ying <[email protected]> wrote:
>
> Hi, Kairui,
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
> > bypass swapin):
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Because we have to spin on the swap map on race, and swap map is too small
> > to contain more usable info, an ugly schedule_timeout_uninterruptible(1)
> > is added. It's not the first time a hackish workaround was added for cache
> > bypass swapin and not the last time. I did many experiments locally to
> > see if the swap cache bypass path can be dropped while keeping the
> > performance still comparable. And it seems doable.
> >
>
> In general, I think that it's a good idea to unify cache bypass swapin
> and normal swapin. But I haven't dive into the implementation yet.

Thanks!

This series might be a bit too large; I can try to split it for easier
reviewing later, if we are OK with this idea.

>
> > This series does the following things:
> > 1. Remove swap cache bypass completely.
> > 2. Apply multiple optimizations after that, these optimizations are
> > either undoable or very difficult to do without dropping the cache
> > bypass swapin path.
> > 3. Use swap cache as a synchronization layer, also unify some code
> > with page cache (filemap).
> >
> > As a result, we have:
> > 1. A comparable performance, some tests are even faster.
> > 2. Multi-index support for swap cache.
> > 3. Removed many hackish workarounds including above long tailing
> > issue is gone.
> >
> > Sending this as RFC to collect some discussion, suggestion, or rejection
> > early, this seems need to be split into multiple series, but the
> > performance is not good until the last patch so I think start by
> > seperating them may make this approach not very convincing. And there
> > are still some (maybe further) TODO items and optimization space
> > if we are OK with this approach.
> >
> > This is based on my another series, for reusing filemap code for swapcache:
> > [PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Patch 1/10, introduce a helper from filemap side to be used later.
> > Patch 2/10, 3/10 are clean up and prepare for removing the swap cache
> > bypass swapin path.
> > Patch 4/10, removed the swap cache bypass swapin path, and the
> > performance drop heavily (-28%).
> > Patch 5/10, apply the first optimization after the removal, since all
> > folios goes through swap cache now, there is no need to explicit shadow
> > clearing any more.
> > Patch 6/10, apply another optimization after clean up shadow clearing
> > routines. Now swapcache is very alike page cache, so just reuse page
> > cache code and we will have multi-index support. Shadow memory usage
> > dropped a lot.
> > Patch 7/10, just rename __read_swap_cache_async, it will be refactored
> > and a key part of this series, and the naming is very confusing to me.
> > Patch 8/10, make swap cache as a synchronization layer, introduce two
> > helpers for adding folios to swap cache, caller will either succeed or
> > get a folio to wait on.
> > Patch 9/10, apply another optimization. With above two helpers, looking
> > up of swapcache can be optimized and avoid false looking up, which
> > helped improve the performance.
> > Patch 10/10, apply a major optimization for SWP_SYNCHRONOUS_IO devices,
> > after this commit, performance for simple swapin/swapout is basically
> > same as before.
> >
> > Test 1, sequential swapin/out of 30G zero page on ZRAM:
> >
> > Before (us) After (us)
> > Swapout: 33619409 33886008
> > Swapin: 32393771 32465441 (- 0.2%)
> > Swapout (THP): 7817909 6899938 (+11.8%)
> > Swapin (THP) : 32452387 33193479 (- 2.2%)
>
> If my understanding were correct, we don't have swapin (THP) support,
> yet. Right?

Yes, this series doesn't change how swapin/swapout works with THP in
general, but THP swapout now leaves shadows of large order, so they
need to be split upon swapin. That slows down later swapins a little
bit, but I think it's worth it.

If we can do THP swapin in the future, this split on swapin can be
avoided, making the performance even better.

>
> > And after swapping out 30G with THP, the radix node usage dropped by a
> > lot:
> >
> > Before: radix_tree_node 73728K
> > After: radix_tree_node 7056K (-94%)
>
> Good!
>
> > Test 2:
> > Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
> > sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> > --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> > --threads=48 --time=300 --report-interval=10 run
> >
> > Before: transactions: 4849.25 per sec
> > After: transactions: 4849.40 per sec
> >
> > Test 3:
> > Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> > echo 100 > /sys/module/zswap/parameters/max_pool_percent
> > echo 1 > /sys/module/zswap/parameters/enabled
> > echo y > /sys/module/zswap/parameters/shrinker_enabled
> >
> > sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> > --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> > --threads=48 --time=600 --report-interval=10 run
> >
> > Before: transactions: 1662.90 per sec
> > After: transactions: 1726.52 per sec
>
> 3.8% improvement. Good!
>
> > Test 4:
> > Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
> > echo always > /sys/kernel/mm/transparent_hugepage/enabled
> > echo 100 > /sys/module/zswap/parameters/max_pool_percent
> > echo 1 > /sys/module/zswap/parameters/enabled
> > echo y > /sys/module/zswap/parameters/shrinker_enabled
> >
> > sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> > --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> > --threads=48 --time=600 --report-interval=10 run
> >
> > Before: transactions: 2860.90 per sec.
> > After: transactions: 2802.55 per sec.
> >
> > Test 5:
> > Memtier / memcached (16G brd SWAP, 8G memcg, THP never):
> >
> > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
> >
> > memtier_benchmark -S /tmp/memcached.socket \
> > -P memcache_binary -n allkeys --key-minimum=1 \
> > --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> > --ratio 1:0 --pipeline 8 -d 1000
> >
> > Before: 106730.31 Ops/sec
> > After: 106360.11 Ops/sec
> >
> > Test 5:
> > Memtier / memcached (16G brd SWAP, 8G memcg, THP always):
> >
> > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
> >
> > memtier_benchmark -S /tmp/memcached.socket \
> > -P memcache_binary -n allkeys --key-minimum=1 \
> > --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> > --ratio 1:0 --pipeline 8 -d 1000
> >
> > Before: 83193.11 Ops/sec
> > After: 82504.89 Ops/sec
> >
> > These tests are tested under heavy memory stress, and the performance
> > seems basically same as before,very slightly better/worse for certain
> > cases, the benefits of multi-index are basically erased by
> > fragmentation and workingset nodes usage is slightly lower.
> >
> > Some (maybe further) TODO items if we are OK with this approach:
> >
> > - I see a slight performance regression for THP tests,
> > could identify a clear hotspot with perf, my guess is the
> > content on the xa_lock is an issue (we have a xa_lock for
> > every 64M swap cache space), THP handling needs to take the lock
> > longer than usual. splitting the xa_lock to be more
> > fine-grained seems a good solution. We have
> > SWAP_ADDRESS_SPACE_SHIFT = 14 which is not an optimal value.
> > Considering XA_CHUNK_SHIFT is 6, we will have three layer of Xarray
> > just for 2 extra bits. 12 should be better to always make use of
> > the whole XA chunk and having two layers at most. But duplicated
> > address_space struct also wastes more memory and cacheline.
> > I see an observable performance drop (~3%) after change
> > SWAP_ADDRESS_SPACE_SHIFT to 12. Might be a good idea to
> > decouple swap cache xarray from address_space (there are
> > too many user for swapcache, shouldn't come too dirty).
> >
> > - Actually after patch Patch 4/10, the performance is much better for
> > tests limited with memory cgroup, until 10/10 applied the direct swap
> > cache freeing logic for SWP_SYNCHRONOUS_IO swapin. Because if the swap
> > device is not near full, swapin doesn't clear up the swapcache, so
> > repeated swapout doesn't need to re-alloc a swap entry, make things
> > faster. This may indicate that lazy freeing of swap cache could benifit
> > certain workloads and may worth looking into later.
> >
> > - Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
> > swap cache after swapin is done, which can be cleaned up and optimized
> > further after this patch. Device type will only determine the
> > readahead logic, and swap cache drop check can be based purely on swap
> > count.
> >
> > - Recent mTHP swapin/swapout series should have no fundamental
> > conflict with this.
> >
> > Kairui Song (10):
> > mm/filemap: split filemap storing logic into a standalone helper
> > mm/swap: move no readahead swapin code to a stand-alone helper
> > mm/swap: convert swapin_readahead to return a folio
> > mm/swap: remove cache bypass swapin
> > mm/swap: clean shadow only in unmap path
> > mm/swap: switch to use multi index entries
> > mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
> > mm/swap: use swap cache as a synchronization layer
> > mm/swap: delay the swap cache look up for swapin
> > mm/swap: optimize synchronous swapin
> >
> > include/linux/swapops.h | 5 +-
> > mm/filemap.c | 161 +++++++++-----
> > mm/huge_memory.c | 78 +++----
> > mm/internal.h | 2 +
> > mm/memory.c | 133 ++++-------
> > mm/shmem.c | 44 ++--
> > mm/swap.h | 71 ++++--
> > mm/swap_state.c | 478 +++++++++++++++++++++-------------------
> > mm/swapfile.c | 64 +++---
> > mm/vmscan.c | 8 +-
> > mm/workingset.c | 2 +-
> > mm/zswap.c | 4 +-
> > 12 files changed, 540 insertions(+), 510 deletions(-)
>
> --
> Best Regards,
> Huang, Ying

2024-03-27 06:33:21

by Huang, Ying

Subject: Re: [RFC PATCH 04/10] mm/swap: remove cache bypass swapin

Kairui Song <[email protected]> writes:

> From: Kairui Song <[email protected]>
>
> We used to have the cache bypass swapin path for better performance,
> but by removing it, more optimization can be applied and have
> an even better overall performance and less hackish.
>
> And these optimizations are not easily doable or not doable at all
> without this.
>
> This patch simply removes it, and the performance will drop heavily
> for simple swapin, things won't get this worse for real workloads
> but still observable. Following commits will fix this and archive a
> better performance.
>
> Swapout/in 30G zero pages from ZRAM (This mostly measures overhead
> of swap path itself, because zero pages are not compressed but simply
> recorded in ZRAM, and performance drops more as SWAP device is getting
> full):
>
> Test result of sequential swapin/out:
>
> Before (us) After (us)
> Swapout: 33619409 33624641
> Swapin: 32393771 41614858 (-28.4%)
> Swapout (THP): 7817909 7795530
> Swapin (THP) : 32452387 41708471 (-28.4%)
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/memory.c | 18 ++++-------------
> mm/swap.h | 10 +++++-----
> mm/swap_state.c | 53 ++++++++++---------------------------------------
> mm/swapfile.c | 13 ------------
> 4 files changed, 19 insertions(+), 75 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index dfdb620a9123..357d239ee2f6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> struct page *page;
> struct swap_info_struct *si = NULL;
> rmap_t rmap_flags = RMAP_NONE;
> - bool need_clear_cache = false;
> bool exclusive = false;
> swp_entry_t entry;
> pte_t pte;
> @@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> - /* skip swapcache and readahead */
> folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> - if (PTR_ERR(folio) == -EBUSY)
> - goto out;
> - need_clear_cache = true;
> } else {
> folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
> - swapcache = folio;
> }
>
> if (!folio) {
> @@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto unlock;
> }
>
> + swapcache = folio;
> page = folio_file_page(folio, swp_offset(entry));
>
> /* Had to read the page from swap area: Major fault */
> @@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->orig_pte = pte;
>
> /* ksm created a completely new copy */
> - if (unlikely(folio != swapcache && swapcache)) {
> + if (unlikely(folio != swapcache)) {
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else {
> @@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>
> folio_unlock(folio);
> - if (folio != swapcache && swapcache) {
> + if (folio != swapcache) {
> /*
> * Hold the lock to avoid the swap entry to be reused
> * until we take the PT lock for the pte_same() check
> @@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
> - /* Clear the swap cache pin for direct swapin after PTL unlock */
> - if (need_clear_cache)
> - swapcache_clear(si, entry);
> if (si)
> put_swap_device(si);
> return ret;
> @@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_unlock(folio);
> out_release:
> folio_put(folio);
> - if (folio != swapcache && swapcache) {
> + if (folio != swapcache) {
> folio_unlock(swapcache);
> folio_put(swapcache);
> }
> - if (need_clear_cache)
> - swapcache_clear(si, entry);
> if (si)
> put_swap_device(si);
> return ret;
> diff --git a/mm/swap.h b/mm/swap.h
> index aee134907a70..ac9573b03432 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
> void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> struct folio *swap_cache_get_folio(swp_entry_t entry,
> struct vm_area_struct *vma, unsigned long addr);
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> @@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> {
> return NULL;
> }
> -
> -static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> +static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> + struct vm_fault *vmf);
> {
> - return 0;
> + return NULL;
> }
>
> -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> +static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> {
> + return 0;
> }
>
> static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 2a9c6bdff5ea..49ef6250f676 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> }
>
> /**
> - * swapin_direct - swap in folios skipping swap cache and readahead
> + * swapin_direct - swap in folios skipping readahead
> * @entry: swap entry of this memory
> * @gfp_mask: memory allocation flags
> * @vmf: fault information
> *
> - * Returns the struct folio for entry and addr after the swap entry is read
> - * in.
> + * Returns the folio for entry after it is read in.
> */
> struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_fault *vmf)
> {
> - struct vm_area_struct *vma = vmf->vma;
> + struct mempolicy *mpol;
> struct folio *folio;
> - void *shadow = NULL;
> -
> - /*
> - * Prevent parallel swapin from proceeding with
> - * the cache flag. Otherwise, another thread may
> - * finish swapin first, free the entry, and swapout
> - * reusing the same entry. It's undetectable as
> - * pte_same() returns true due to entry reuse.
> - */
> - if (swapcache_prepare(entry)) {
> - /* Relax a bit to prevent rapid repeated page faults */
> - schedule_timeout_uninterruptible(1);
> - return ERR_PTR(-EBUSY);
> - }
> -
> - /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> - if (folio) {
> - __folio_set_locked(folio);
> - __folio_set_swapbacked(folio);
> -
> - if (mem_cgroup_swapin_charge_folio(folio,
> - vma->vm_mm, GFP_KERNEL,
> - entry)) {
> - folio_unlock(folio);
> - folio_put(folio);
> - return NULL;
> - }
> - mem_cgroup_swapin_uncharge_swap(entry);
> -
> - shadow = get_shadow_from_swap_cache(entry);
> - if (shadow)
> - workingset_refault(folio, shadow);
> + bool page_allocated;
> + pgoff_t ilx;
>
> - folio_add_lru(folio);
> + mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> + folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> + &page_allocated, false);
> + mpol_cond_put(mpol);
>
> - /* To provide entry to swap_read_folio() */
> - folio->swap = entry;
> + if (page_allocated)
> swap_read_folio(folio, true, NULL);
> - folio->private = NULL;
> - }
>
> return folio;
> }

This looks similar to read_swap_cache_async(). Can we merge them?

And we should avoid readahead in swapin_readahead() or
swap_vma_readahead() for SWP_SYNCHRONOUS_IO anyway. So it appears that
we can change swapin_readahead() and use it directly?

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4dd894395a0f..ae8d3aa05df7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3389,19 +3389,6 @@ int swapcache_prepare(swp_entry_t entry)
> return __swap_duplicate(entry, SWAP_HAS_CACHE);
> }
>
> -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> -{
> - struct swap_cluster_info *ci;
> - unsigned long offset = swp_offset(entry);
> - unsigned char usage;
> -
> - ci = lock_cluster_or_swap_info(si, offset);
> - usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
> - unlock_cluster_or_swap_info(si, ci);
> - if (!usage)
> - free_swap_slot(entry);
> -}
> -
> struct swap_info_struct *swp_swap_info(swp_entry_t entry)
> {
> return swap_type_to_swap_info(swp_type(entry));

--
Best Regards,
Huang, Ying

2024-03-27 06:56:20

by Kairui Song

Subject: Re: [RFC PATCH 04/10] mm/swap: remove cache bypass swapin

On Wed, Mar 27, 2024 at 2:32 PM Huang, Ying <[email protected]> wrote:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > We used to have the cache bypass swapin path for better performance,
> > but by removing it, more optimization can be applied and have
> > an even better overall performance and less hackish.
> >
> > And these optimizations are not easily doable or not doable at all
> > without this.
> >
> > This patch simply removes it, and the performance will drop heavily
> > for simple swapin, things won't get this worse for real workloads
> > but still observable. Following commits will fix this and archive a
> > better performance.
> >
> > Swapout/in 30G zero pages from ZRAM (This mostly measures overhead
> > of swap path itself, because zero pages are not compressed but simply
> > recorded in ZRAM, and performance drops more as SWAP device is getting
> > full):
> >
> > Test result of sequential swapin/out:
> >
> > Before (us) After (us)
> > Swapout: 33619409 33624641
> > Swapin: 32393771 41614858 (-28.4%)
> > Swapout (THP): 7817909 7795530
> > Swapin (THP) : 32452387 41708471 (-28.4%)
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/memory.c | 18 ++++-------------
> > mm/swap.h | 10 +++++-----
> > mm/swap_state.c | 53 ++++++++++---------------------------------------
> > mm/swapfile.c | 13 ------------
> > 4 files changed, 19 insertions(+), 75 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index dfdb620a9123..357d239ee2f6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > struct page *page;
> > struct swap_info_struct *si = NULL;
> > rmap_t rmap_flags = RMAP_NONE;
> > - bool need_clear_cache = false;
> > bool exclusive = false;
> > swp_entry_t entry;
> > pte_t pte;
> > @@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > - /* skip swapcache and readahead */
> > folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> > - if (PTR_ERR(folio) == -EBUSY)
> > - goto out;
> > - need_clear_cache = true;
> > } else {
> > folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
> > - swapcache = folio;
> > }
> >
> > if (!folio) {
> > @@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > goto unlock;
> > }
> >
> > + swapcache = folio;
> > page = folio_file_page(folio, swp_offset(entry));
> >
> > /* Had to read the page from swap area: Major fault */
> > @@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > vmf->orig_pte = pte;
> >
> > /* ksm created a completely new copy */
> > - if (unlikely(folio != swapcache && swapcache)) {
> > + if (unlikely(folio != swapcache)) {
> > folio_add_new_anon_rmap(folio, vma, vmf->address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > @@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >
> > folio_unlock(folio);
> > - if (folio != swapcache && swapcache) {
> > + if (folio != swapcache) {
> > /*
> > * Hold the lock to avoid the swap entry to be reused
> > * until we take the PT lock for the pte_same() check
> > @@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > out:
> > - /* Clear the swap cache pin for direct swapin after PTL unlock */
> > - if (need_clear_cache)
> > - swapcache_clear(si, entry);
> > if (si)
> > put_swap_device(si);
> > return ret;
> > @@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio_unlock(folio);
> > out_release:
> > folio_put(folio);
> > - if (folio != swapcache && swapcache) {
> > + if (folio != swapcache) {
> > folio_unlock(swapcache);
> > folio_put(swapcache);
> > }
> > - if (need_clear_cache)
> > - swapcache_clear(si, entry);
> > if (si)
> > put_swap_device(si);
> > return ret;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index aee134907a70..ac9573b03432 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
> > void delete_from_swap_cache(struct folio *folio);
> > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > unsigned long end);
> > -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> > struct folio *swap_cache_get_folio(swp_entry_t entry,
> > struct vm_area_struct *vma, unsigned long addr);
> > struct folio *filemap_get_incore_folio(struct address_space *mapping,
> > @@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > {
> > return NULL;
> > }
> > -
> > -static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> > +static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> > + struct vm_fault *vmf);
> > {
> > - return 0;
> > + return NULL;
> > }
> >
> > -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> > +static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> > {
> > + return 0;
> > }
> >
> > static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 2a9c6bdff5ea..49ef6250f676 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> > }
> >
> > /**
> > - * swapin_direct - swap in folios skipping swap cache and readahead
> > + * swapin_direct - swap in folios skipping readahead
> > * @entry: swap entry of this memory
> > * @gfp_mask: memory allocation flags
> > * @vmf: fault information
> > *
> > - * Returns the struct folio for entry and addr after the swap entry is read
> > - * in.
> > + * Returns the folio for entry after it is read in.
> > */
> > struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> > struct vm_fault *vmf)
> > {
> > - struct vm_area_struct *vma = vmf->vma;
> > + struct mempolicy *mpol;
> > struct folio *folio;
> > - void *shadow = NULL;
> > -
> > - /*
> > - * Prevent parallel swapin from proceeding with
> > - * the cache flag. Otherwise, another thread may
> > - * finish swapin first, free the entry, and swapout
> > - * reusing the same entry. It's undetectable as
> > - * pte_same() returns true due to entry reuse.
> > - */
> > - if (swapcache_prepare(entry)) {
> > - /* Relax a bit to prevent rapid repeated page faults */
> > - schedule_timeout_uninterruptible(1);
> > - return ERR_PTR(-EBUSY);
> > - }
> > -
> > - /* skip swapcache */
> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > - vma, vmf->address, false);
> > - if (folio) {
> > - __folio_set_locked(folio);
> > - __folio_set_swapbacked(folio);
> > -
> > - if (mem_cgroup_swapin_charge_folio(folio,
> > - vma->vm_mm, GFP_KERNEL,
> > - entry)) {
> > - folio_unlock(folio);
> > - folio_put(folio);
> > - return NULL;
> > - }
> > - mem_cgroup_swapin_uncharge_swap(entry);
> > -
> > - shadow = get_shadow_from_swap_cache(entry);
> > - if (shadow)
> > - workingset_refault(folio, shadow);
> > + bool page_allocated;
> > + pgoff_t ilx;
> >
> > - folio_add_lru(folio);
> > + mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > + folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> > + &page_allocated, false);
> > + mpol_cond_put(mpol);
> >
> > - /* To provide entry to swap_read_folio() */
> > - folio->swap = entry;
> > + if (page_allocated)
> > swap_read_folio(folio, true, NULL);
> > - folio->private = NULL;
> > - }
> >
> > return folio;
> > }
>
> This looks similar as read_swap_cache_async(). Can we merge them?

Yes, that's doable. But I may have to split it out again for later
optimizations.

>
> And, we should avoid to readahead in swapin_readahead() or
> swap_vma_readahead() for SWP_SYNCHRONOUS_IO anyway. So, it appears that
> we can change and use swapin_readahead() directly?

Good point. The SWP_SYNCHRONOUS_IO check can be extended further after
this series, but readahead optimization could be another series (like
the previous one which tried to unify readahead for shmem/anon), so I
thought it's better to keep it untouched for now.

2024-03-27 07:31:27

by Huang, Ying

Subject: Re: [RFC PATCH 04/10] mm/swap: remove cache bypass swapin

Kairui Song <[email protected]> writes:

> On Wed, Mar 27, 2024 at 2:32 PM Huang, Ying <[email protected]> wrote:
>>
>> Kairui Song <[email protected]> writes:
>>
>> > From: Kairui Song <[email protected]>
>> >
>> > We used to have the cache bypass swapin path for better performance,
>> > but by removing it, more optimization can be applied and have
>> > an even better overall performance and less hackish.
>> >
>> > And these optimizations are not easily doable or not doable at all
>> > without this.
>> >
>> > This patch simply removes it, and the performance will drop heavily
>> > for simple swapin, things won't get this worse for real workloads
>> > but still observable. Following commits will fix this and archive a
>> > better performance.
>> >
>> > Swapout/in 30G zero pages from ZRAM (This mostly measures overhead
>> > of swap path itself, because zero pages are not compressed but simply
>> > recorded in ZRAM, and performance drops more as SWAP device is getting
>> > full):
>> >
>> > Test result of sequential swapin/out:
>> >
>> > Before (us) After (us)
>> > Swapout: 33619409 33624641
>> > Swapin: 32393771 41614858 (-28.4%)
>> > Swapout (THP): 7817909 7795530
>> > Swapin (THP) : 32452387 41708471 (-28.4%)
>> >
>> > Signed-off-by: Kairui Song <[email protected]>
>> > ---
>> > mm/memory.c | 18 ++++-------------
>> > mm/swap.h | 10 +++++-----
>> > mm/swap_state.c | 53 ++++++++++---------------------------------------
>> > mm/swapfile.c | 13 ------------
>> > 4 files changed, 19 insertions(+), 75 deletions(-)
>> >
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index dfdb620a9123..357d239ee2f6 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > struct page *page;
>> > struct swap_info_struct *si = NULL;
>> > rmap_t rmap_flags = RMAP_NONE;
>> > - bool need_clear_cache = false;
>> > bool exclusive = false;
>> > swp_entry_t entry;
>> > pte_t pte;
>> > @@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > if (!folio) {
>> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>> > __swap_count(entry) == 1) {
>> > - /* skip swapcache and readahead */
>> > folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
>> > - if (PTR_ERR(folio) == -EBUSY)
>> > - goto out;
>> > - need_clear_cache = true;
>> > } else {
>> > folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
>> > - swapcache = folio;
>> > }
>> >
>> > if (!folio) {
>> > @@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > goto unlock;
>> > }
>> >
>> > + swapcache = folio;
>> > page = folio_file_page(folio, swp_offset(entry));
>> >
>> > /* Had to read the page from swap area: Major fault */
>> > @@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > vmf->orig_pte = pte;
>> >
>> > /* ksm created a completely new copy */
>> > - if (unlikely(folio != swapcache && swapcache)) {
>> > + if (unlikely(folio != swapcache)) {
>> > folio_add_new_anon_rmap(folio, vma, vmf->address);
>> > folio_add_lru_vma(folio, vma);
>> > } else {
>> > @@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>> >
>> > folio_unlock(folio);
>> > - if (folio != swapcache && swapcache) {
>> > + if (folio != swapcache) {
>> > /*
>> > * Hold the lock to avoid the swap entry to be reused
>> > * until we take the PT lock for the pte_same() check
>> > @@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > if (vmf->pte)
>> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>> > out:
>> > - /* Clear the swap cache pin for direct swapin after PTL unlock */
>> > - if (need_clear_cache)
>> > - swapcache_clear(si, entry);
>> > if (si)
>> > put_swap_device(si);
>> > return ret;
>> > @@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > folio_unlock(folio);
>> > out_release:
>> > folio_put(folio);
>> > - if (folio != swapcache && swapcache) {
>> > + if (folio != swapcache) {
>> > folio_unlock(swapcache);
>> > folio_put(swapcache);
>> > }
>> > - if (need_clear_cache)
>> > - swapcache_clear(si, entry);
>> > if (si)
>> > put_swap_device(si);
>> > return ret;
>> > diff --git a/mm/swap.h b/mm/swap.h
>> > index aee134907a70..ac9573b03432 100644
>> > --- a/mm/swap.h
>> > +++ b/mm/swap.h
>> > @@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
>> > void delete_from_swap_cache(struct folio *folio);
>> > void clear_shadow_from_swap_cache(int type, unsigned long begin,
>> > unsigned long end);
>> > -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
>> > struct folio *swap_cache_get_folio(swp_entry_t entry,
>> > struct vm_area_struct *vma, unsigned long addr);
>> > struct folio *filemap_get_incore_folio(struct address_space *mapping,
>> > @@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>> > {
>> > return NULL;
>> > }
>> > -
>> > -static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>> > +static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
>> > + struct vm_fault *vmf)
>> > {
>> > - return 0;
>> > + return NULL;
>> > }
>> >
>> > -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
>> > +static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>> > {
>> > + return 0;
>> > }
>> >
>> > static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index 2a9c6bdff5ea..49ef6250f676 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>> > }
>> >
>> > /**
>> > - * swapin_direct - swap in folios skipping swap cache and readahead
>> > + * swapin_direct - swap in folios skipping readahead
>> > * @entry: swap entry of this memory
>> > * @gfp_mask: memory allocation flags
>> > * @vmf: fault information
>> > *
>> > - * Returns the struct folio for entry and addr after the swap entry is read
>> > - * in.
>> > + * Returns the folio for entry after it is read in.
>> > */
>> > struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>> > struct vm_fault *vmf)
>> > {
>> > - struct vm_area_struct *vma = vmf->vma;
>> > + struct mempolicy *mpol;
>> > struct folio *folio;
>> > - void *shadow = NULL;
>> > -
>> > - /*
>> > - * Prevent parallel swapin from proceeding with
>> > - * the cache flag. Otherwise, another thread may
>> > - * finish swapin first, free the entry, and swapout
>> > - * reusing the same entry. It's undetectable as
>> > - * pte_same() returns true due to entry reuse.
>> > - */
>> > - if (swapcache_prepare(entry)) {
>> > - /* Relax a bit to prevent rapid repeated page faults */
>> > - schedule_timeout_uninterruptible(1);
>> > - return ERR_PTR(-EBUSY);
>> > - }
>> > -
>> > - /* skip swapcache */
>> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> > - vma, vmf->address, false);
>> > - if (folio) {
>> > - __folio_set_locked(folio);
>> > - __folio_set_swapbacked(folio);
>> > -
>> > - if (mem_cgroup_swapin_charge_folio(folio,
>> > - vma->vm_mm, GFP_KERNEL,
>> > - entry)) {
>> > - folio_unlock(folio);
>> > - folio_put(folio);
>> > - return NULL;
>> > - }
>> > - mem_cgroup_swapin_uncharge_swap(entry);
>> > -
>> > - shadow = get_shadow_from_swap_cache(entry);
>> > - if (shadow)
>> > - workingset_refault(folio, shadow);
>> > + bool page_allocated;
>> > + pgoff_t ilx;
>> >
>> > - folio_add_lru(folio);
>> > + mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> > + folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>> > + &page_allocated, false);
>> > + mpol_cond_put(mpol);
>> >
>> > - /* To provide entry to swap_read_folio() */
>> > - folio->swap = entry;
>> > + if (page_allocated)
>> > swap_read_folio(folio, true, NULL);
>> > - folio->private = NULL;
>> > - }
>> >
>> > return folio;
>> > }
>>
>> This looks similar to read_swap_cache_async(). Can we merge them?
>
> Yes, that's doable, but I may have to split it out again for later
> optimizations.
>
>>
>> And, we should avoid readahead in swapin_readahead() or
>> swap_vma_readahead() for SWP_SYNCHRONOUS_IO anyway. So, it appears that
>> we can adjust and use swapin_readahead() directly?
>
> Good point, the SWP_SYNCHRONOUS_IO check can be extended further after
> this series, but the readahead optimization could be another series (like
> the previous one which tried to unify readahead for shmem/anon), so I
> thought it's better to keep it untouched for now.

Just want to check whether we can reduce the special processing for
SWP_SYNCHRONOUS_IO as much as possible.
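
Just as a rough sketch of what I have in mind (illustration only, based
on the swapin_direct() body in the diff above; the name swapin_entry()
and the "synchronous" parameter are made up here and are not part of
this series):

static struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
				  struct vm_fault *vmf, bool synchronous)
{
	struct mempolicy *mpol;
	struct folio *folio;
	bool page_allocated;
	pgoff_t ilx;

	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
					&page_allocated, false);
	mpol_cond_put(mpol);

	/* Only the thread that allocated the folio issues the read. */
	if (page_allocated)
		swap_read_folio(folio, synchronous, NULL);

	return folio;
}

The SWP_SYNCHRONOUS_IO path would pass synchronous == true and everyone
else false, which looks like the main difference left between
swapin_direct() and read_swap_cache_async() after this patch.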

--
Best Regards,
Huang, Ying

2024-03-27 08:34:54

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

[...]

>>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>>>
>>> Before (us) After (us)
>>> Swapout: 33619409 33886008
>>> Swapin: 32393771 32465441 (- 0.2%)
>>> Swapout (THP): 7817909 6899938 (+11.8%)
>>> Swapin (THP) : 32452387 33193479 (- 2.2%)
>>
>> If my understanding were correct, we don't have swapin (THP) support,
>> yet. Right?
>
> Yes, this series doesn't change how swapin/swapout works with THP in
> general, but now THP swapout will leave shadows with large order, which
> need to be split upon swapin. That will slow down later swapins a
> little, but I think it's worth it.
>
> If we can do THP swapin in the future, this split on swapin can be
> avoided, making the performance even better.

I'm confused by this (clearly my understanding of how this works is incorrect).
Perhaps you can help me understand:

When you talk about "shadows" I assume you are referring to the swap cache? It
was my understanding that swapping out a THP would always leave the large folio
in the swap cache, so this is nothing new?

And on swap-in, if the target page is in the swap cache, even if part of a large
folio, why does it need to be split? I assumed the single page would just be
mapped? (and if all the other pages subsequently fault, then you end up with a
fully mapped large folio back in the process)?

Perhaps I'm misunderstanding what "shadows" are?

2024-03-27 08:34:58

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

Ryan Roberts <[email protected]> writes:

> [...]
>
>>>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>>>>
>>>> Before (us) After (us)
>>>> Swapout: 33619409 33886008
>>>> Swapin: 32393771 32465441 (- 0.2%)
>>>> Swapout (THP): 7817909 6899938 (+11.8%)
>>>> Swapin (THP) : 32452387 33193479 (- 2.2%)
>>>
>>> If my understanding were correct, we don't have swapin (THP) support,
>>> yet. Right?
>>
>> Yes, this series doesn't change how swapin/swapout works with THP in
>> general, but now THP swapout will leave shadows with large order, which
>> need to be split upon swapin. That will slow down later swapins a
>> little, but I think it's worth it.
>>
>> If we can do THP swapin in the future, this split on swapin can be
>> avoided, making the performance even better.
>
> I'm confused by this (clearly my understanding of how this works is incorrect).
> Perhaps you can help me understand:
>
> When you talk about "shadows" I assume you are referring to the swap cache? It
> was my understanding that swapping out a THP would always leave the large folio
> in the swap cache, so this is nothing new?
>
> And on swap-in, if the target page is in the swap cache, even if part of a large
> folio, why does it need to be split? I assumed the single page would just be
> mapped? (and if all the other pages subsequently fault, then you end up with a
> fully mapped large folio back in the process)?
>
> Perhaps I'm misunderstanding what "shadows" are?

Perhaps. Shadows are used to support workingset protection/detection on
the anonymous LRU list, as in the following patchset (merged).

https://lore.kernel.org/all/[email protected]/T/#m962395eb5968c74b0c4c8e41d4b0dcdd3f28b2e6

--
Best Regards,
Huang, Ying

2024-03-27 09:39:48

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

On 27/03/2024 08:32, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> [...]
>>
>>>>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>>>>>
>>>>> Before (us) After (us)
>>>>> Swapout: 33619409 33886008
>>>>> Swapin: 32393771 32465441 (- 0.2%)
>>>>> Swapout (THP): 7817909 6899938 (+11.8%)
>>>>> Swapin (THP) : 32452387 33193479 (- 2.2%)
>>>>
>>>> If my understanding were correct, we don't have swapin (THP) support,
>>>> yet. Right?
>>>
>>> Yes, this series doesn't change how swapin/swapout works with THP in
>>> general, but now THP swapout will leave shadows with large order, which
>>> need to be split upon swapin. That will slow down later swapins a
>>> little, but I think it's worth it.
>>>
>>> If we can do THP swapin in the future, this split on swapin can be
>>> avoided, making the performance even better.
>>
>> I'm confused by this (clearly my understanding of how this works is incorrect).
>> Perhaps you can help me understand:
>>
>> When you talk about "shadows" I assume you are referring to the swap cache? It
>> was my understanding that swapping out a THP would always leave the large folio
>> in the swap cache, so this is nothing new?
>>
>> And on swap-in, if the target page is in the swap cache, even if part of a large
>> folio, why does it need to be split? I assumed the single page would just be
>> mapped? (and if all the other pages subsequently fault, then you end up with a
>> fully mapped large folio back in the process)?
>>
>> Perhaps I'm misunderstanding what "shadows" are?
>
> Perhaps. Shadows are used to support workingset protection/detection on
> the anonymous LRU list, as in the following patchset (merged).
>
> https://lore.kernel.org/all/[email protected]/T/#m962395eb5968c74b0c4c8e41d4b0dcdd3f28b2e6

Thanks! Although after reading the cover letter I still don't really understand
the need for splitting. The LRU applies to whole folios.


>
> --
> Best Regards,
> Huang, Ying


2024-03-27 11:05:28

by Kairui Song

[permalink] [raw]
Subject: Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

On Wed, Mar 27, 2024 at 4:27 PM Ryan Roberts <[email protected]> wrote:
>
> [...]
>
> >>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
> >>>
> >>> Before (us) After (us)
> >>> Swapout: 33619409 33886008
> >>> Swapin: 32393771 32465441 (- 0.2%)
> >>> Swapout (THP): 7817909 6899938 (+11.8%)
> >>> Swapin (THP) : 32452387 33193479 (- 2.2%)
> >>
> >> If my understanding were correct, we don't have swapin (THP) support,
> >> yet. Right?
> >
> > Yes, this series doesn't change how swapin/swapout works with THP in
> > general, but now THP swapout will leave shadows with large order, which
> > need to be split upon swapin. That will slow down later swapins a
> > little, but I think it's worth it.
> >
> > If we can do THP swapin in the future, this split on swapin can be
> > avoided, making the performance even better.
>
> I'm confused by this (clearly my understanding of how this works is incorrect).
> Perhaps you can help me understand:
>
> When you talk about "shadows" I assume you are referring to the swap cache? It
> was my understanding that swapping out a THP would always leave the large folio
> in the swap cache, so this is nothing new?
>
> And on swap-in, if the target page is in the swap cache, even if part of a large
> folio, why does it need to be split? I assumed the single page would just be
> mapped? (and if all the other pages subsequently fault, then you end up with a
> fully mapped large folio back in the process)?
>
> Perhaps I'm misunderstanding what "shadows" are?

Hi Ryan

My bad, I haven't made this clear.

Ying has posted the link to the commit that added "shadow" support
for anon pages; it has become a very important part of LRU activation
/ workingset tracking. Basically, when folios are removed from the
cache xarray (e.g. after swap writeback is done), instead of releasing
the xarray slot, an unsigned long / void * is stored in it, recording
some info that will be used when a refault happens to decide how to
handle the folio on the LRU / workingset side.
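
To make it more concrete, the refault side roughly looks like this
(just an illustration I'm writing here, not code from this series;
swapcache_check_refault() is a made-up name, while xa_load(),
xa_is_value() and workingset_refault() are the existing helpers):

/* A swap cache slot holds either a folio pointer or a shadow value. */
static void swapcache_check_refault(struct address_space *mapping,
				    pgoff_t idx, struct folio *new_folio)
{
	void *entry = xa_load(&mapping->i_pages, idx);

	if (xa_is_value(entry)) {
		/*
		 * The old folio was reclaimed and the slot kept a shadow.
		 * Feed it to the workingset code so it can decide
		 * activation / protection for the newly swapped-in folio.
		 */
		workingset_refault(new_folio, entry);
	}
}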

And about large folios in swapcache: if you look at the current version
of add_to_swap_cache in mainline (it adds a folio of any order into
swap cache), it calls xas_create_range(&xas), which fills all xarray
slots in the entire range covered by the folio. But xarray supports
multi-index storing, making use of the nature of the radix tree to
save a lot of slots. E.g. for a 2M THP page, previously 8 + 512 slots
(8 extra xa nodes) are needed to store it; after this series it only
needs 8 slots by using a multi-index store (not sure if I did the
math right).
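
Roughly, the difference looks like this (illustration only, simplified
from add_to_swap_cache() and __filemap_add_folio(); it assumes the
xa_state was initialized with XA_STATE_ORDER for the folio and skips
locking and error handling):

/* Old way: one store per subpage, 512 leaf slots for a 2M THP. */
static void store_folio_per_index(struct xa_state *xas, struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	xas_create_range(xas);		/* allocate nodes for the whole range */
	for (i = 0; i < nr; i++) {
		xas_store(xas, folio);
		xas_next(xas);
	}
}

/* New way: a single multi-index store covering the whole folio. */
static void store_folio_multi_index(struct xa_state *xas, struct folio *folio)
{
	xas_set_order(xas, xas->xa_index, folio_order(folio));
	xas_store(xas, folio);		/* one entry spans all subpage indices */
}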

Same for shadows: when a folio is being deleted, __delete_from_swap_cache
currently walks the xarray with xas_next, updating all 8 + 512 slots
one by one; after this series only 8 stores are needed (ignoring
fragmentation).

And upon swapin, I was talking about swapping in 1 sub-page of a THP
folio after the folio is gone, leaving a few multi-index shadow slots.
The multi-index slots need to be split (a multi-index slot has to be
updated as a whole or split first; __filemap_add_folio handles such a
split). I optimized and reused the routine from __filemap_add_folio in
this series, so without too much work it works well for swapcache.
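
The split handling that gets reused is roughly the following pattern
(a simplified sketch of what __filemap_add_folio() does, not verbatim
kernel code; it also assumes xas_split_alloc() was already called to
pre-allocate nodes, as __filemap_add_folio() does before taking the
lock):

static void *store_folio_over_shadow(struct xa_state *xas, struct folio *folio)
{
	void *old = xas_load(xas);

	if (xa_is_value(old)) {
		unsigned int order = xa_get_order(xas->xa, xas->xa_index);

		/*
		 * The shadow is a multi-index entry covering a larger range
		 * than the folio being added (e.g. an order-9 shadow left by
		 * a swapped-out THP vs. an order-0 swapin), so split it down
		 * to the folio's order before storing the new folio.
		 */
		if (order > folio_order(folio)) {
			xas_split(xas, old, order);
			xas_reset(xas);
		}
	}

	xas_store(xas, folio);
	return old;	/* the shadow then gets passed to workingset_refault() */
}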