2024-05-02 08:47:26

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 00/12] mm/swap: clean up and optimize swap cache index

From: Kairui Song <[email protected]>

This is based on latest mm-unstable. Patch 1/12 is not needed if
f2fs converted .readahead to use folio, I included it for easier test
and review.

Currently we use one swap_address_space for every 64M chunk to reduce lock
contention, this is like having a set of smaller files inside a
swap device. But when doing swap cache look up or insert, we are
still using the offset of the whole large swap device. This is OK for
correctness, as the offset (key) is unique.

But Xarray is specially optimized for small indexes, it creates the
redix tree levels lazily to be just enough to fit the largest key
stored in one Xarray. So we are wasting tree nodes unnecessarily.

For 64M chunk it should only take at most 3 level to contain everything.
But if we are using the offset from the whole swap device, the offset (key)
value will be way beyond 64M, and so will the tree level.

Optimize this by reduce the swap cache search space into 64M scope.

Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
is enabled, as swap out path can never skip swap cache):

Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps

After (+1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps

Similar result with MySQL and sysbench using swap:
Before:
94055.61 qps

After (+0.8% faster):
94834.91 qps

There is alse a very slight drop of radix tree node slab usage:
Before: 303952K
After: 302224K

For this series:

There are multiple places that expect mixed type of pages (page cache or
swap cache), eg. migration, huge memory split; There are four helpers
for that:

- page_index
- page_file_offset
- folio_index
- folio_file_pos

To keep the code clean and compatible, this series first cleaned up
usage of them.

page_file_offset and folio_file_pos are historical helpes that can
be simply dropped after clean up. And page_index can be all converted to
folio_index or folio->index.

Then introduce two new helpers swap_cache_index and swap_dev_pos
for swap. Replace swp_offset with swap_cache_index when used to
retrieve folio from swap cache, and use swap_dev_pos when needed
to retrieve the device position of a swap entry. This way,
swap_cache_index can return the optimized value with no compatibility
issue.

The result is better performance and reduced LOC.

Idealy, in the future, we may want to reduce SWAP_ADDRESS_SPACE_SHIFT
from 14 to 12: Default Xarray chunk offset is 6, so we have 3 level
trees instead of 2 level trees just for 2 extra bits. But swap cache
is based on address_space struct, with 4 times more metadata sparsely
distributed in memory it waste more cacheline, the performance gain
from this series is almost canceled according to my test. So first,
just have a cleaner seperation of offsets and smaller search space.

Patch 1/12 - 11/12: Clean up usage of above helpers.
Patch 12/12: Apply the optmization.

V3: https://lore.kernel.org/all/[email protected]/
Update from V3:
- Help remove a redundant loop in nilfs2 [Matthew Wilcox]
- Update commit message, use the term swap device instead of swap file
to avoid confusion [Huang, Ying]
- Add more details in commit message about folio_file_pos usage in NFS.
- Fix a shadow leak in clear_shadow_from_swap_cache.

V2: https://lore.kernel.org/linux-mm/[email protected]/
Update from V2:
- Clean up usage of page_file_offset and folio_file_pos [Matthew Wilcox]
https://lore.kernel.org/linux-mm/[email protected]/
- Use folio in nilfs_bmap_data_get_key [Ryusuke Konishi]

V1: https://lore.kernel.org/all/[email protected]/
Update from V1:
- Convert more users to use folio directly when possible [Matthew Wilcox]
- Rename swap_file_pos to swap_dev_pos [Huang, Ying]
- Update comments and commit message.
- Adjust headers and add dummy function to fix build error.

This series is part of effort to reduce swap cache overhead, and ultimately
remove SWP_SYNCHRONOUS_IO and unify swap cache usage as proposed before:
https://lore.kernel.org/lkml/[email protected]/

Kairui Song (12):
f2fs: drop usage of page_index
nilfs2: drop usage of page_index
ceph: drop usage of page_index
NFS: remove nfs_page_lengthg and usage of page_index
cifs: drop usage of page_file_offset
afs: drop usage of folio_file_pos
netfs: drop usage of folio_file_pos
nfs: drop usage of folio_file_pos
mm/swap: get the swap device offset directly
mm: remove page_file_offset and folio_file_pos
mm: drop page_index and convert folio_index to use folio
mm/swap: reduce swap cache search space

fs/afs/dir.c | 6 +++---
fs/afs/dir_edit.c | 4 ++--
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 2 +-
fs/f2fs/data.c | 2 +-
fs/netfs/buffered_read.c | 4 ++--
fs/netfs/buffered_write.c | 2 +-
fs/nfs/file.c | 2 +-
fs/nfs/internal.h | 19 -------------------
fs/nfs/nfstrace.h | 4 ++--
fs/nfs/write.c | 6 +++---
fs/nilfs2/bmap.c | 10 ++--------
fs/smb/client/file.c | 2 +-
include/linux/mm.h | 13 -------------
include/linux/pagemap.h | 25 ++++---------------------
mm/huge_memory.c | 2 +-
mm/memcontrol.c | 2 +-
mm/mincore.c | 2 +-
mm/page_io.c | 6 +++---
mm/shmem.c | 2 +-
mm/swap.h | 24 ++++++++++++++++++++++++
mm/swap_state.c | 19 ++++++++++---------
mm/swapfile.c | 11 +++++------
23 files changed, 70 insertions(+), 101 deletions(-)

--
2.44.0



2024-05-02 08:48:29

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 01/12] f2fs: drop usage of page_index

From: Kairui Song <[email protected]>

page_index is needed for mixed usage of page cache and swap cache,
for pure page cache usage, the caller can just use page->index instead.

It can't be a swap cache page here, so just drop it.

[ This commit will not be needed once f2fs converted
f2fs_mpage_readpages() to use folio]

Signed-off-by: Kairui Song <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: [email protected]
---
fs/f2fs/data.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 961e6ff77c72..c0e1459702e6 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2057,7 +2057,7 @@ static int f2fs_read_single_page(struct inode *inode, struct page *page,
sector_t block_nr;
int ret = 0;

- block_in_file = (sector_t)page_index(page);
+ block_in_file = (sector_t)page->index;
last_block = block_in_file + nr_pages;
last_block_in_file = bytes_to_blks(inode,
f2fs_readpage_limit(inode) + blocksize - 1);
--
2.44.0


2024-05-02 08:48:48

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 02/12] nilfs2: drop usage of page_index

From: Kairui Song <[email protected]>

page_index is only for mixed usage of page cache and swap cache, for
pure page cache usage, the caller can just use page->index instead.

It can't be a swap cache page here (being part of buffer head),
so just drop it. And while we are at it, optimize the code by retrieving
the offset of the buffer head within the folio directly using bh_offset,
and get rid of the loop and usage of page helpers.

Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kairui Song <[email protected]>
Cc: Ryusuke Konishi <[email protected]>
Cc: [email protected]
---
fs/nilfs2/bmap.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
index 383f0afa2cea..cd14ea25968c 100644
--- a/fs/nilfs2/bmap.c
+++ b/fs/nilfs2/bmap.c
@@ -450,15 +450,9 @@ int nilfs_bmap_test_and_clear_dirty(struct nilfs_bmap *bmap)
__u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap,
const struct buffer_head *bh)
{
- struct buffer_head *pbh;
- __u64 key;
+ loff_t pos = folio_pos(bh->b_folio) + bh_offset(bh);

- key = page_index(bh->b_page) << (PAGE_SHIFT -
- bmap->b_inode->i_blkbits);
- for (pbh = page_buffers(bh->b_page); pbh != bh; pbh = pbh->b_this_page)
- key++;
-
- return key;
+ return pos >> bmap->b_inode->i_blkbits;
}

__u64 nilfs_bmap_find_target_seq(const struct nilfs_bmap *bmap, __u64 key)
--
2.44.0


2024-05-02 08:48:48

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 05/12] cifs: drop usage of page_file_offset

From: Kairui Song <[email protected]>

page_file_offset is only needed for mixed usage of page cache and
swap cache, for pure page cache usage, the caller can just use
page_offset instead.

It can't be a swap cache page here, so just drop it and convert it to
use folio.

Signed-off-by: Kairui Song <[email protected]>
Cc: Steve French <[email protected]>
Cc: Namjae Jeon <[email protected]>
Cc: Paulo Alcantara <[email protected]>
Cc: Shyam Prasad N <[email protected]>
Cc: Bharath SM <[email protected]>
---
fs/smb/client/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 9be37d0fe724..388343b0fceb 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -4828,7 +4828,7 @@ static int cifs_readpage_worker(struct file *file, struct page *page,
static int cifs_read_folio(struct file *file, struct folio *folio)
{
struct page *page = &folio->page;
- loff_t offset = page_file_offset(page);
+ loff_t offset = folio_pos(folio);
int rc = -EACCES;
unsigned int xid;

--
2.44.0


2024-05-02 08:49:05

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 06/12] afs: drop usage of folio_file_pos

From: Kairui Song <[email protected]>

folio_file_pos is only needed for mixed usage of page cache and
swap cache, for pure page cache usage, the caller can just use
folio_pos instead.

It can't be a swap cache page here. Swap mapping may only call into
fs through swap_rw and that is not supported for afs. So just drop
it and use folio_pos instead.

Signed-off-by: Kairui Song <[email protected]>
Cc: David Howells <[email protected]>
Cc: Marc Dionne <[email protected]>
Cc: [email protected]
---
fs/afs/dir.c | 6 +++---
fs/afs/dir_edit.c | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 67afe68972d5..f8622ed72e08 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -533,14 +533,14 @@ static int afs_dir_iterate(struct inode *dir, struct dir_context *ctx,
break;
}

- offset = round_down(ctx->pos, sizeof(*dblock)) - folio_file_pos(folio);
+ offset = round_down(ctx->pos, sizeof(*dblock)) - folio_pos(folio);
size = min_t(loff_t, folio_size(folio),
- req->actual_len - folio_file_pos(folio));
+ req->actual_len - folio_pos(folio));

do {
dblock = kmap_local_folio(folio, offset);
ret = afs_dir_iterate_block(dvnode, ctx, dblock,
- folio_file_pos(folio) + offset);
+ folio_pos(folio) + offset);
kunmap_local(dblock);
if (ret != 1)
goto out;
diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c
index e2fa577b66fe..a71bff10496b 100644
--- a/fs/afs/dir_edit.c
+++ b/fs/afs/dir_edit.c
@@ -256,7 +256,7 @@ void afs_edit_dir_add(struct afs_vnode *vnode,
folio = folio0;
}

- block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_file_pos(folio));
+ block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_pos(folio));

/* Abandon the edit if we got a callback break. */
if (!test_bit(AFS_VNODE_DIR_VALID, &vnode->flags))
@@ -417,7 +417,7 @@ void afs_edit_dir_remove(struct afs_vnode *vnode,
folio = folio0;
}

- block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_file_pos(folio));
+ block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_pos(folio));

/* Abandon the edit if we got a callback break. */
if (!test_bit(AFS_VNODE_DIR_VALID, &vnode->flags))
--
2.44.0


2024-05-02 08:49:18

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 07/12] netfs: drop usage of folio_file_pos

From: Kairui Song <[email protected]>

folio_file_pos is only needed for mixed usage of page cache and
swap cache, for pure page cache usage, the caller can just use
folio_pos instead.

It can't be a swap cache page here. Swap mapping may only call into
fs through swap_rw and that is not supported for netfs. So just drop
it and use folio_pos instead.

Signed-off-by: Kairui Song <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: [email protected]
---
fs/netfs/buffered_read.c | 4 ++--
fs/netfs/buffered_write.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 3298c29b5548..d3687d81229f 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -255,7 +255,7 @@ int netfs_read_folio(struct file *file, struct folio *folio)
_enter("%lx", folio->index);

rreq = netfs_alloc_request(mapping, file,
- folio_file_pos(folio), folio_size(folio),
+ folio_pos(folio), folio_size(folio),
NETFS_READPAGE);
if (IS_ERR(rreq)) {
ret = PTR_ERR(rreq);
@@ -454,7 +454,7 @@ int netfs_write_begin(struct netfs_inode *ctx,
}

rreq = netfs_alloc_request(mapping, file,
- folio_file_pos(folio), folio_size(folio),
+ folio_pos(folio), folio_size(folio),
NETFS_READ_FOR_WRITE);
if (IS_ERR(rreq)) {
ret = PTR_ERR(rreq);
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 9a0d32e4b422..859a22a740c3 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -63,7 +63,7 @@ static enum netfs_how_to_modify netfs_how_to_modify(struct netfs_inode *ctx,
bool maybe_trouble)
{
struct netfs_folio *finfo = netfs_folio_info(folio);
- loff_t pos = folio_file_pos(folio);
+ loff_t pos = folio_pos(folio);

_enter("");

--
2.44.0


2024-05-02 08:50:32

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 08/12] nfs: drop usage of folio_file_pos

From: Kairui Song <[email protected]>

folio_file_pos is only needed for mixed usage of page cache and
swap cache, for pure page cache usage, the caller can just use
folio_pos instead.

After commit e1209d3a7a67 ("mm: introduce ->swap_rw and use it for
reads from SWP_FS_OPS swap-space"), swap cache should never be exposed
to nfs.

So remove the usage of folio_file_pos in following NFS functions / helpers:

- nfs_vm_page_mkwrite

It's only used by nfs_file_vm_ops.page_mkwrite

- trace event helper: nfs_folio_event
- trace event helper: nfs_folio_event_done

These two are used through DEFINE_NFS_FOLIO_EVENT and
DEFINE_NFS_FOLIO_EVENT_DONE, which defined following events:

- trace_nfs_aop_readpage{_done}: only called by nfs_read_folio
- trace_nfs_writeback_folio: only called by nfs_wb_folio
- trace_nfs_invalidate_folio: only called by nfs_invalidate_folio
- trace_nfs_launder_folio_done: only called by nfs_launder_folio

None of them could possibly be used on swap cache folio,
nfs_read_folio only called by:
.write_begin -> nfs_read_folio
.read_folio

nfs_wb_folio only called by nfs mapping:
.release_folio -> nfs_wb_folio
.launder_folio -> nfs_wb_folio
.write_begin -> nfs_read_folio -> nfs_wb_folio
.read_folio -> nfs_wb_folio
.write_end -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio
.page_mkwrite -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio
.write_begin -> nfs_flush_incompatible -> nfs_wb_folio
.page_mkwrite -> nfs_vm_page_mkwrite -> nfs_flush_incompatible -> nfs_wb_folio

nfs_invalidate_folio is only called by .invalidate_folio.
nfs_launder_folio is only called by .launder_folio

- nfs_grow_file
- nfs_update_folio

nfs_grow_file is only called by nfs_update_folio, and all
possible callers of them are:

.write_end -> nfs_update_folio
.page_mkwrite -> nfs_update_folio

- nfs_wb_folio_cancel

.invalidate_folio -> nfs_wb_folio_cancel

Also, seeing from the swap side, swap_rw is now the only interface calling
into fs, the offset info is always in iocb.ki_pos now.

So we can remove all these folio_file_pos call safely.

Signed-off-by: Kairui Song <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: [email protected]
---
fs/nfs/file.c | 2 +-
fs/nfs/nfstrace.h | 4 ++--
fs/nfs/write.c | 6 +++---
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 407c6e15afe2..02741c32e114 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -588,7 +588,7 @@ static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf)

dfprintk(PAGECACHE, "NFS: vm_page_mkwrite(%pD2(%lu), offset %lld)\n",
filp, filp->f_mapping->host->i_ino,
- (long long)folio_file_pos(folio));
+ (long long)folio_pos(folio));

sb_start_pagefault(inode->i_sb);

diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index afedb449b54f..d249741452e1 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -960,7 +960,7 @@ DECLARE_EVENT_CLASS(nfs_folio_event,
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
__entry->version = inode_peek_iversion_raw(inode);
- __entry->offset = folio_file_pos(folio);
+ __entry->offset = folio_pos(folio);
__entry->count = nfs_folio_length(folio);
),

@@ -1008,7 +1008,7 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
__entry->version = inode_peek_iversion_raw(inode);
- __entry->offset = folio_file_pos(folio);
+ __entry->offset = folio_pos(folio);
__entry->count = nfs_folio_length(folio);
__entry->ret = ret;
),
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5de85d725fb9..fc782d889449 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -281,7 +281,7 @@ static void nfs_grow_file(struct folio *folio, unsigned int offset,
end_index = ((i_size - 1) >> folio_shift(folio)) << folio_order(folio);
if (i_size > 0 && folio_index(folio) < end_index)
goto out;
- end = folio_file_pos(folio) + (loff_t)offset + (loff_t)count;
+ end = folio_pos(folio) + (loff_t)offset + (loff_t)count;
if (i_size >= end)
goto out;
trace_nfs_size_grow(inode, end);
@@ -1362,7 +1362,7 @@ int nfs_update_folio(struct file *file, struct folio *folio,
nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);

dprintk("NFS: nfs_update_folio(%pD2 %d@%lld)\n", file, count,
- (long long)(folio_file_pos(folio) + offset));
+ (long long)(folio_pos(folio) + offset));

if (!count)
goto out;
@@ -2073,7 +2073,7 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
*/
int nfs_wb_folio(struct inode *inode, struct folio *folio)
{
- loff_t range_start = folio_file_pos(folio);
+ loff_t range_start = folio_pos(folio);
loff_t range_end = range_start + (loff_t)folio_size(folio) - 1;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
--
2.44.0


2024-05-02 08:50:52

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 09/12] mm/swap: get the swap device offset directly

From: Kairui Song <[email protected]>

folio_file_pos and page_file_offset are for mixed usage of swap cache
and page cache, it can't be page cache here, so introduce a new helper
to get the swap offset in swap device directly.

Need to include swapops.h in mm/swap.h to ensure swp_offset is always
defined before use.

Signed-off-by: Kairui Song <[email protected]>
---
mm/page_io.c | 6 +++---
mm/swap.h | 9 +++++++++
2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 46c603dddf04..a360857cf75d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -280,7 +280,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
* be temporary.
*/
pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
- ret, page_file_offset(page));
+ ret, swap_dev_pos(page_swap_entry(page)));
for (p = 0; p < sio->pages; p++) {
page = sio->bvec[p].bv_page;
set_page_dirty(page);
@@ -299,7 +299,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
struct swap_iocb *sio = NULL;
struct swap_info_struct *sis = swp_swap_info(folio->swap);
struct file *swap_file = sis->swap_file;
- loff_t pos = folio_file_pos(folio);
+ loff_t pos = swap_dev_pos(folio->swap);

count_swpout_vm_event(folio);
folio_start_writeback(folio);
@@ -430,7 +430,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
struct swap_info_struct *sis = swp_swap_info(folio->swap);
struct swap_iocb *sio = NULL;
- loff_t pos = folio_file_pos(folio);
+ loff_t pos = swap_dev_pos(folio->swap);

if (plug)
sio = *plug;
diff --git a/mm/swap.h b/mm/swap.h
index fc2f6ade7f80..82023ab93205 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -5,6 +5,7 @@
struct mempolicy;

#ifdef CONFIG_SWAP
+#include <linux/swapops.h> /* for swp_offset */
#include <linux/blk_types.h> /* for bio_end_io_t */

/* linux/mm/page_io.c */
@@ -31,6 +32,14 @@ extern struct address_space *swapper_spaces[];
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])

+/*
+ * Return the swap device position of the swap entry.
+ */
+static inline loff_t swap_dev_pos(swp_entry_t entry)
+{
+ return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
+}
+
void show_swap_cache_info(void);
bool add_to_swap(struct folio *folio);
void *get_shadow_from_swap_cache(swp_entry_t entry);
--
2.44.0


2024-05-02 08:51:07

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 10/12] mm: remove page_file_offset and folio_file_pos

From: Kairui Song <[email protected]>

These two helpers were useful for mixed usage of swap cache and page
cache, which help retrieve the corresponding file or swap device offset
of a page or folio.

They were introduced in commit f981c5950fa8 ("mm: methods for teaching
filesystems about PG_swapcache pages") and used in commit d56b4ddf7781
("nfs: teach the NFS client how to treat PG_swapcache pages"), suppose
to be used with direct_IO for swap over fs.

But after commit e1209d3a7a67 ("mm: introduce ->swap_rw and use it
for reads from SWP_FS_OPS swap-space"), swap with direct_IO is no more,
and swap cache mapping is never exposed to fs.

Now we have dropped all users of page_file_offset and folio_file_pos,
so they can be deleted.

Signed-off-by: Kairui Song <[email protected]>
---
include/linux/pagemap.h | 17 -----------------
1 file changed, 17 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 850d32057939..a324582ea702 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -918,11 +918,6 @@ static inline loff_t page_offset(struct page *page)
return ((loff_t)page->index) << PAGE_SHIFT;
}

-static inline loff_t page_file_offset(struct page *page)
-{
- return ((loff_t)page_index(page)) << PAGE_SHIFT;
-}
-
/**
* folio_pos - Returns the byte position of this folio in its file.
* @folio: The folio.
@@ -932,18 +927,6 @@ static inline loff_t folio_pos(struct folio *folio)
return page_offset(&folio->page);
}

-/**
- * folio_file_pos - Returns the byte position of this folio in its file.
- * @folio: The folio.
- *
- * This differs from folio_pos() for folios which belong to a swap file.
- * NFS is the only filesystem today which needs to use folio_file_pos().
- */
-static inline loff_t folio_file_pos(struct folio *folio)
-{
- return page_file_offset(&folio->page);
-}
-
/*
* Get the offset in PAGE_SIZE (even for hugetlb folios).
*/
--
2.44.0


2024-05-02 08:51:07

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 03/12] ceph: drop usage of page_index

From: Kairui Song <[email protected]>

page_index is needed for mixed usage of page cache and swap cache,
for pure page cache usage, the caller can just use page->index instead.

It can't be a swap cache page here, so just drop it.

Signed-off-by: Kairui Song <[email protected]>
Cc: Xiubo Li <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: [email protected]
---
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0e9f56eaba1e..570a9d634cc5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -141,7 +141,7 @@ __dcache_find_get_entry(struct dentry *parent, u64 idx,
if (ptr_pos >= i_size_read(dir))
return NULL;

- if (!cache_ctl->page || ptr_pgoff != page_index(cache_ctl->page)) {
+ if (!cache_ctl->page || ptr_pgoff != cache_ctl->page->index) {
ceph_readdir_cache_release(cache_ctl);
cache_ctl->page = find_lock_page(&dir->i_data, ptr_pgoff);
if (!cache_ctl->page) {
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 7b2e77517f23..1f92d3faaa6b 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1861,7 +1861,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn,
unsigned idx = ctl->index % nsize;
pgoff_t pgoff = ctl->index / nsize;

- if (!ctl->page || pgoff != page_index(ctl->page)) {
+ if (!ctl->page || pgoff != ctl->page->index) {
ceph_readdir_cache_release(ctl);
if (idx == 0)
ctl->page = grab_cache_page(&dir->i_data, pgoff);
--
2.44.0


2024-05-02 08:59:13

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 11/12] mm: drop page_index and convert folio_index to use folio

From: Kairui Song <[email protected]>

There are two helpers for retrieving the index within address space
for mixed usage of swap cache and page cache:

- page_index
- folio_index (wrapper of page_index)

This commit drops page_index, as we have eliminated all users, and
converts folio_index to use folio internally.

Signed-off-by: Kairui Song <[email protected]>
---
include/linux/mm.h | 13 -------------
include/linux/pagemap.h | 8 ++++----
mm/swapfile.c | 7 +++----
3 files changed, 7 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..e2718cac0fda 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2290,19 +2290,6 @@ static inline void *folio_address(const struct folio *folio)
return page_address(&folio->page);
}

-extern pgoff_t __page_file_index(struct page *page);
-
-/*
- * Return the pagecache index of the passed page. Regular pagecache pages
- * use ->index whereas swapcache pages use swp_offset(->private)
- */
-static inline pgoff_t page_index(struct page *page)
-{
- if (unlikely(PageSwapCache(page)))
- return __page_file_index(page);
- return page->index;
-}
-
/*
* Return true only if the page has been allocated with
* ALLOC_NO_WATERMARKS and the low watermark was not
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a324582ea702..0cfa5810cde3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -778,7 +778,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
mapping_gfp_mask(mapping));
}

-#define swapcache_index(folio) __page_file_index(&(folio)->page)
+extern pgoff_t __folio_swap_cache_index(struct folio *folio);

/**
* folio_index - File index of a folio.
@@ -793,9 +793,9 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
*/
static inline pgoff_t folio_index(struct folio *folio)
{
- if (unlikely(folio_test_swapcache(folio)))
- return swapcache_index(folio);
- return folio->index;
+ if (unlikely(folio_test_swapcache(folio)))
+ return __folio_swap_cache_index(folio);
+ return folio->index;
}

/**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6ca215fb92f..0b0ae6e8c764 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3474,12 +3474,11 @@ struct address_space *swapcache_mapping(struct folio *folio)
}
EXPORT_SYMBOL_GPL(swapcache_mapping);

-pgoff_t __page_file_index(struct page *page)
+pgoff_t __folio_swap_cache_index(struct folio *folio)
{
- swp_entry_t swap = page_swap_entry(page);
- return swp_offset(swap);
+ return swp_offset(folio->swap);
}
-EXPORT_SYMBOL_GPL(__page_file_index);
+EXPORT_SYMBOL_GPL(__folio_swap_cache_index);

/*
* add_swap_count_continuation - called when a swap count is duplicated
--
2.44.0


2024-05-02 08:59:38

by Kairui Song

[permalink] [raw]
Subject: [PATCH v4 12/12] mm/swap: reduce swap cache search space

From: Kairui Song <[email protected]>

Currently we use one swap_address_space for every 64M chunk to reduce lock
contention, this is like having a set of smaller swap files inside one
swap device. But when doing swap cache look up or insert, we are
still using the offset of the whole large swap device. This is OK for
correctness, as the offset (key) is unique.

But Xarray is specially optimized for small indexes, it creates the
radix tree levels lazily to be just enough to fit the largest key
stored in one Xarray. So we are wasting tree nodes unnecessarily.

For 64M chunk it should only take at most 3 levels to contain everything.
But if we are using the offset from the whole swap device, the offset (key)
value will be way beyond 64M, and so will the tree level.

Optimize this by using a new helper swap_cache_index to get a swap
entry's unique offset in its own 64M swap_address_space.

I see a ~1% performance gain in benchmark and actual workload with
high memory pressure.

Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
is enabled, as swap out path can never skip swap cache):

Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps

After (1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps

Similar result with MySQL and sysbench using swap:
Before:
94055.61 qps

After (0.8% faster):
94834.91 qps

Radix tree slab usage is also very slightly lower.

Signed-off-by: Kairui Song <[email protected]>
---
mm/huge_memory.c | 2 +-
mm/memcontrol.c | 2 +-
mm/mincore.c | 2 +-
mm/shmem.c | 2 +-
mm/swap.h | 15 +++++++++++++++
mm/swap_state.c | 19 ++++++++++---------
mm/swapfile.c | 6 +++---
7 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d35d526ed48f..45829cc049d2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2918,7 +2918,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
split_page_memcg(head, order, new_order);

if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
- offset = swp_offset(folio->swap);
+ offset = swap_cache_index(folio->swap);
swap_cache = swap_address_space(folio->swap);
xa_lock(&swap_cache->i_pages);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d11536ef59ef..81b005c459cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6165,7 +6165,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
* Because swap_cache_get_folio() updates some statistics counter,
* we call find_get_page() with swapper_space directly.
*/
- page = find_get_page(swap_address_space(ent), swp_offset(ent));
+ page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
entry->val = ent.val;

return page;
diff --git a/mm/mincore.c b/mm/mincore.c
index dad3622cc963..e31cf1bde614 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -139,7 +139,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
} else {
#ifdef CONFIG_SWAP
*vec = mincore_page(swap_address_space(entry),
- swp_offset(entry));
+ swap_cache_index(entry));
#else
WARN_ON(1);
*vec = 1;
diff --git a/mm/shmem.c b/mm/shmem.c
index fa2a0ed97507..326315c12feb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1756,7 +1756,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,

old = *foliop;
entry = old->swap;
- swap_index = swp_offset(entry);
+ swap_index = swap_cache_index(entry);
swap_mapping = swap_address_space(entry);

/*
diff --git a/mm/swap.h b/mm/swap.h
index 82023ab93205..93e3e1b58a7f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -27,6 +27,7 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
/* One swap address space for each 64M swap space */
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
+#define SWAP_ADDRESS_SPACE_MASK (BIT(SWAP_ADDRESS_SPACE_SHIFT) - 1)
extern struct address_space *swapper_spaces[];
#define swap_address_space(entry) \
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
@@ -40,6 +41,15 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
}

+/*
+ * Return the swap cache index of the swap entry.
+ */
+static inline pgoff_t swap_cache_index(swp_entry_t entry)
+{
+ BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
+ return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
+}
+
void show_swap_cache_info(void);
bool add_to_swap(struct folio *folio);
void *get_shadow_from_swap_cache(swp_entry_t entry);
@@ -86,6 +96,11 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
return NULL;
}

+static inline pgoff_t swap_cache_index(swp_entry_t entry)
+{
+ return 0;
+}
+
static inline void show_swap_cache_info(void)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 642c30d8376c..09415d4c7843 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -72,7 +72,7 @@ void show_swap_cache_info(void)
void *get_shadow_from_swap_cache(swp_entry_t entry)
{
struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
void *shadow;

shadow = xa_load(&address_space->i_pages, idx);
@@ -89,7 +89,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
unsigned long i, nr = folio_nr_pages(folio);
void *old;
@@ -144,7 +144,7 @@ void __delete_from_swap_cache(struct folio *folio,
struct address_space *address_space = swap_address_space(entry);
int i;
long nr = folio_nr_pages(folio);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
XA_STATE(xas, &address_space->i_pages, idx);

xas_set_update(&xas, workingset_update_node);
@@ -248,18 +248,19 @@ void delete_from_swap_cache(struct folio *folio)
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end)
{
- unsigned long curr = begin;
+ unsigned long curr = begin, offset;
void *old;

for (;;) {
+ offset = curr & SWAP_ADDRESS_SPACE_MASK;
swp_entry_t entry = swp_entry(type, curr);
struct address_space *address_space = swap_address_space(entry);
- XA_STATE(xas, &address_space->i_pages, curr);
+ XA_STATE(xas, &address_space->i_pages, offset);

xas_set_update(&xas, workingset_update_node);

xa_lock_irq(&address_space->i_pages);
- xas_for_each(&xas, old, end) {
+ xas_for_each(&xas, old, offset + min(end - curr, SWAP_ADDRESS_SPACE_PAGES)) {
if (!xa_is_value(old))
continue;
xas_store(&xas, NULL);
@@ -350,7 +351,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
{
struct folio *folio;

- folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (!IS_ERR(folio)) {
bool vma_ra = swap_use_vma_readahead();
bool readahead;
@@ -420,7 +421,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
si = get_swap_device(swp);
if (!si)
return ERR_PTR(-ENOENT);
- index = swp_offset(swp);
+ index = swap_cache_index(swp);
folio = filemap_get_folio(swap_address_space(swp), index);
put_swap_device(si);
return folio;
@@ -447,7 +448,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* that would confuse statistics.
*/
folio = filemap_get_folio(swap_address_space(entry),
- swp_offset(entry));
+ swap_cache_index(entry));
if (!IS_ERR(folio))
goto got_folio;

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0b0ae6e8c764..4f0e8b2ac8aa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -142,7 +142,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
struct folio *folio;
int ret = 0;

- folio = filemap_get_folio(swap_address_space(entry), offset);
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (IS_ERR(folio))
return 0;
/*
@@ -2158,7 +2158,7 @@ static int try_to_unuse(unsigned int type)
(i = find_next_to_unuse(si, i)) != 0) {

entry = swp_entry(type, i);
- folio = filemap_get_folio(swap_address_space(entry), i);
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (IS_ERR(folio))
continue;

@@ -3476,7 +3476,7 @@ EXPORT_SYMBOL_GPL(swapcache_mapping);

pgoff_t __folio_swap_cache_index(struct folio *folio)
{
- return swp_offset(folio->swap);
+ return swap_cache_index(folio->swap);
}
EXPORT_SYMBOL_GPL(__folio_swap_cache_index);

--
2.44.0


2024-05-02 09:12:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/12] mm: drop page_index and convert folio_index to use folio

On 02.05.24 10:49, Kairui Song wrote:
> From: Kairui Song <[email protected]>
>
> There are two helpers for retrieving the index within address space
> for mixed usage of swap cache and page cache:
>
> - page_index
> - folio_index (wrapper of page_index)
>
> This commit drops page_index, as we have eliminated all users, and
> converts folio_index to use folio internally.

The latter does not make sense. folio_index() already is using a folio
internally. Maybe a leftover from reshuffling/reworking patches?

--
Cheers,

David / dhildenb


2024-05-02 09:38:35

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH v4 11/12] mm: drop page_index and convert folio_index to use folio

On Thu, May 2, 2024 at 5:32 PM Kairui Song <[email protected]> wrote:
>
> On Thu, May 2, 2024 at 5:12 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 02.05.24 10:49, Kairui Song wrote:
> > > From: Kairui Song <[email protected]>
> > >
> > > There are two helpers for retrieving the index within address space
> > > for mixed usage of swap cache and page cache:
> > >
> > > - page_index
> > > - folio_index (wrapper of page_index)
> > >
> > > This commit drops page_index, as we have eliminated all users, and
> > > converts folio_index to use folio internally.
> >
> > The latter does not make sense. folio_index() already is using a folio
> > internally. Maybe a leftover from reshuffling/reworking patches?
>
> Hi, David,
>
> folio_index calls swapcache_index, and swapcache_index is defined as:
>
> #define swapcache_index(folio) __page_file_index(&(folio)->page)
>
> Where it casts the folio to page first, then call __page_file_index,
> __page_file_index is a function and works on pages.
>
> After this commit __page_file_index is converted to
> __folio_swap_cache_index. This change is a bit of trivial but we get
> rid of the internal page conversion.
>
> I can simplify the commit message, just say drop page_index to make
> the code cleaner, if this is confusing.

Ah, you are right folio_index is not a simple wrapper of page_index
indeed, that sentence in the commit message doesn't make sense, so it
should be deleted, my bad for this leftover.

2024-05-02 09:50:00

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH v4 11/12] mm: drop page_index and convert folio_index to use folio

On Thu, May 2, 2024 at 5:12 PM David Hildenbrand <[email protected]> wrote:
>
> On 02.05.24 10:49, Kairui Song wrote:
> > From: Kairui Song <[email protected]>
> >
> > There are two helpers for retrieving the index within address space
> > for mixed usage of swap cache and page cache:
> >
> > - page_index
> > - folio_index (wrapper of page_index)
> >
> > This commit drops page_index, as we have eliminated all users, and
> > converts folio_index to use folio internally.
>
> The latter does not make sense. folio_index() already is using a folio
> internally. Maybe a leftover from reshuffling/reworking patches?

Hi, David,

folio_index calls swapcache_index, and swapcache_index is defined as:

#define swapcache_index(folio) __page_file_index(&(folio)->page)

Where it casts the folio to page first, then call __page_file_index,
__page_file_index is a function and works on pages.

After this commit __page_file_index is converted to
__folio_swap_cache_index. This change is a bit of trivial but we get
rid of the internal page conversion.

I can simplify the commit message, just say drop page_index to make
the code cleaner, if this is confusing.

2024-05-02 11:09:23

by Ryusuke Konishi

[permalink] [raw]
Subject: Re: [PATCH v4 02/12] nilfs2: drop usage of page_index

On Thu, May 2, 2024 at 5:47 PM Kairui Song wrote:
>
> From: Kairui Song <[email protected]>
>
> page_index is only for mixed usage of page cache and swap cache, for
> pure page cache usage, the caller can just use page->index instead.
>
> It can't be a swap cache page here (being part of buffer head),
> so just drop it. And while we are at it, optimize the code by retrieving
> the offset of the buffer head within the folio directly using bh_offset,
> and get rid of the loop and usage of page helpers.
>
> Suggested-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Kairui Song <[email protected]>
> Cc: Ryusuke Konishi <[email protected]>
> Cc: [email protected]
> ---
> fs/nilfs2/bmap.c | 10 ++--------
> 1 file changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
> index 383f0afa2cea..cd14ea25968c 100644
> --- a/fs/nilfs2/bmap.c
> +++ b/fs/nilfs2/bmap.c
> @@ -450,15 +450,9 @@ int nilfs_bmap_test_and_clear_dirty(struct nilfs_bmap *bmap)
> __u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap,
> const struct buffer_head *bh)
> {
> - struct buffer_head *pbh;
> - __u64 key;
> + loff_t pos = folio_pos(bh->b_folio) + bh_offset(bh);
>
> - key = page_index(bh->b_page) << (PAGE_SHIFT -
> - bmap->b_inode->i_blkbits);
> - for (pbh = page_buffers(bh->b_page); pbh != bh; pbh = pbh->b_this_page)
> - key++;
> -
> - return key;
> + return pos >> bmap->b_inode->i_blkbits;
> }
>
> __u64 nilfs_bmap_find_target_seq(const struct nilfs_bmap *bmap, __u64 key)
> --
> 2.44.0

Looks good. Feel free to add:

Acked-by: Ryusuke Konishi <[email protected]>

Just to be sure, I also tested this change in different environments,
including 4k (page size) and smaller block sizes. And of course it's
working as expected and so far hasn't broken anything.

Thanks,
Ryusuke Konishi

2024-05-08 06:33:16

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 09/12] mm/swap: get the swap device offset directly

Kairui Song <[email protected]> writes:

> From: Kairui Song <[email protected]>
>
> folio_file_pos and page_file_offset are for mixed usage of swap cache
> and page cache, it can't be page cache here, so introduce a new helper
> to get the swap offset in swap device directly.
>
> Need to include swapops.h in mm/swap.h to ensure swp_offset is always
> defined before use.
>
> Signed-off-by: Kairui Song <[email protected]>

LGTM! Thanks!

Reviewed-by: "Huang, Ying" <[email protected]>

> ---
> mm/page_io.c | 6 +++---
> mm/swap.h | 9 +++++++++
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 46c603dddf04..a360857cf75d 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -280,7 +280,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
> * be temporary.
> */
> pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
> - ret, page_file_offset(page));
> + ret, swap_dev_pos(page_swap_entry(page)));
> for (p = 0; p < sio->pages; p++) {
> page = sio->bvec[p].bv_page;
> set_page_dirty(page);
> @@ -299,7 +299,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
> struct swap_iocb *sio = NULL;
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> struct file *swap_file = sis->swap_file;
> - loff_t pos = folio_file_pos(folio);
> + loff_t pos = swap_dev_pos(folio->swap);
>
> count_swpout_vm_event(folio);
> folio_start_writeback(folio);
> @@ -430,7 +430,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> {
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> struct swap_iocb *sio = NULL;
> - loff_t pos = folio_file_pos(folio);
> + loff_t pos = swap_dev_pos(folio->swap);
>
> if (plug)
> sio = *plug;
> diff --git a/mm/swap.h b/mm/swap.h
> index fc2f6ade7f80..82023ab93205 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -5,6 +5,7 @@
> struct mempolicy;
>
> #ifdef CONFIG_SWAP
> +#include <linux/swapops.h> /* for swp_offset */
> #include <linux/blk_types.h> /* for bio_end_io_t */
>
> /* linux/mm/page_io.c */
> @@ -31,6 +32,14 @@ extern struct address_space *swapper_spaces[];
> (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> >> SWAP_ADDRESS_SPACE_SHIFT])
>
> +/*
> + * Return the swap device position of the swap entry.
> + */
> +static inline loff_t swap_dev_pos(swp_entry_t entry)
> +{
> + return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> +}
> +
> void show_swap_cache_info(void);
> bool add_to_swap(struct folio *folio);
> void *get_shadow_from_swap_cache(swp_entry_t entry);

2024-05-08 06:35:07

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 10/12] mm: remove page_file_offset and folio_file_pos

Kairui Song <[email protected]> writes:

> From: Kairui Song <[email protected]>
>
> These two helpers were useful for mixed usage of swap cache and page
> cache, which help retrieve the corresponding file or swap device offset
> of a page or folio.
>
> They were introduced in commit f981c5950fa8 ("mm: methods for teaching
> filesystems about PG_swapcache pages") and used in commit d56b4ddf7781
> ("nfs: teach the NFS client how to treat PG_swapcache pages"), suppose
> to be used with direct_IO for swap over fs.
>
> But after commit e1209d3a7a67 ("mm: introduce ->swap_rw and use it
> for reads from SWP_FS_OPS swap-space"), swap with direct_IO is no more,
> and swap cache mapping is never exposed to fs.
>
> Now we have dropped all users of page_file_offset and folio_file_pos,
> so they can be deleted.
>
> Signed-off-by: Kairui Song <[email protected]>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <[email protected]>

> ---
> include/linux/pagemap.h | 17 -----------------
> 1 file changed, 17 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 850d32057939..a324582ea702 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -918,11 +918,6 @@ static inline loff_t page_offset(struct page *page)
> return ((loff_t)page->index) << PAGE_SHIFT;
> }
>
> -static inline loff_t page_file_offset(struct page *page)
> -{
> - return ((loff_t)page_index(page)) << PAGE_SHIFT;
> -}
> -
> /**
> * folio_pos - Returns the byte position of this folio in its file.
> * @folio: The folio.
> @@ -932,18 +927,6 @@ static inline loff_t folio_pos(struct folio *folio)
> return page_offset(&folio->page);
> }
>
> -/**
> - * folio_file_pos - Returns the byte position of this folio in its file.
> - * @folio: The folio.
> - *
> - * This differs from folio_pos() for folios which belong to a swap file.
> - * NFS is the only filesystem today which needs to use folio_file_pos().
> - */
> -static inline loff_t folio_file_pos(struct folio *folio)
> -{
> - return page_file_offset(&folio->page);
> -}
> -
> /*
> * Get the offset in PAGE_SIZE (even for hugetlb folios).
> */

2024-05-08 07:27:45

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] mm/swap: reduce swap cache search space

Kairui Song <[email protected]> writes:

> From: Kairui Song <[email protected]>
>
> Currently we use one swap_address_space for every 64M chunk to reduce lock
> contention, this is like having a set of smaller swap files inside one
> swap device. But when doing swap cache look up or insert, we are
> still using the offset of the whole large swap device. This is OK for
> correctness, as the offset (key) is unique.
>
> But Xarray is specially optimized for small indexes, it creates the
> radix tree levels lazily to be just enough to fit the largest key
> stored in one Xarray. So we are wasting tree nodes unnecessarily.
>
> For 64M chunk it should only take at most 3 levels to contain everything.
> But if we are using the offset from the whole swap device, the offset (key)
> value will be way beyond 64M, and so will the tree level.
>
> Optimize this by using a new helper swap_cache_index to get a swap
> entry's unique offset in its own 64M swap_address_space.
>
> I see a ~1% performance gain in benchmark and actual workload with
> high memory pressure.
>
> Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
> with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
> test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
> is enabled, as swap out path can never skip swap cache):
>
> Before:
> 6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
> 0inputs+0outputs (55major+33555018minor)pagefaults 0swaps
>
> After (1.8% faster):
> 6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
> 0inputs+0outputs (54major+33555027minor)pagefaults 0swaps
>
> Similar result with MySQL and sysbench using swap:
> Before:
> 94055.61 qps
>
> After (0.8% faster):
> 94834.91 qps
>
> Radix tree slab usage is also very slightly lower.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/huge_memory.c | 2 +-
> mm/memcontrol.c | 2 +-
> mm/mincore.c | 2 +-
> mm/shmem.c | 2 +-
> mm/swap.h | 15 +++++++++++++++
> mm/swap_state.c | 19 ++++++++++---------
> mm/swapfile.c | 6 +++---
> 7 files changed, 32 insertions(+), 16 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d35d526ed48f..45829cc049d2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2918,7 +2918,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> split_page_memcg(head, order, new_order);
>
> if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> - offset = swp_offset(folio->swap);
> + offset = swap_cache_index(folio->swap);
> swap_cache = swap_address_space(folio->swap);
> xa_lock(&swap_cache->i_pages);
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d11536ef59ef..81b005c459cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6165,7 +6165,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> * Because swap_cache_get_folio() updates some statistics counter,
> * we call find_get_page() with swapper_space directly.
> */
> - page = find_get_page(swap_address_space(ent), swp_offset(ent));
> + page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
> entry->val = ent.val;
>
> return page;
> diff --git a/mm/mincore.c b/mm/mincore.c
> index dad3622cc963..e31cf1bde614 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -139,7 +139,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> } else {
> #ifdef CONFIG_SWAP
> *vec = mincore_page(swap_address_space(entry),
> - swp_offset(entry));
> + swap_cache_index(entry));
> #else
> WARN_ON(1);
> *vec = 1;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index fa2a0ed97507..326315c12feb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1756,7 +1756,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>
> old = *foliop;
> entry = old->swap;
> - swap_index = swp_offset(entry);
> + swap_index = swap_cache_index(entry);
> swap_mapping = swap_address_space(entry);
>
> /*
> diff --git a/mm/swap.h b/mm/swap.h
> index 82023ab93205..93e3e1b58a7f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -27,6 +27,7 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
> /* One swap address space for each 64M swap space */
> #define SWAP_ADDRESS_SPACE_SHIFT 14
> #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
> +#define SWAP_ADDRESS_SPACE_MASK (BIT(SWAP_ADDRESS_SPACE_SHIFT) - 1)

#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
?

We can use BIT() in SWAP_ADDRESS_SPACE_PAGES definition.

> extern struct address_space *swapper_spaces[];
> #define swap_address_space(entry) \
> (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> @@ -40,6 +41,15 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
> return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> }
>
> +/*
> + * Return the swap cache index of the swap entry.
> + */
> +static inline pgoff_t swap_cache_index(swp_entry_t entry)
> +{
> + BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
> + return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> +}
> +
> void show_swap_cache_info(void);
> bool add_to_swap(struct folio *folio);
> void *get_shadow_from_swap_cache(swp_entry_t entry);
> @@ -86,6 +96,11 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
> return NULL;
> }
>
> +static inline pgoff_t swap_cache_index(swp_entry_t entry)
> +{
> + return 0;
> +}
> +
> static inline void show_swap_cache_info(void)
> {
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 642c30d8376c..09415d4c7843 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -72,7 +72,7 @@ void show_swap_cache_info(void)
> void *get_shadow_from_swap_cache(swp_entry_t entry)
> {
> struct address_space *address_space = swap_address_space(entry);
> - pgoff_t idx = swp_offset(entry);
> + pgoff_t idx = swap_cache_index(entry);
> void *shadow;
>
> shadow = xa_load(&address_space->i_pages, idx);
> @@ -89,7 +89,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> gfp_t gfp, void **shadowp)
> {
> struct address_space *address_space = swap_address_space(entry);
> - pgoff_t idx = swp_offset(entry);
> + pgoff_t idx = swap_cache_index(entry);
> XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> unsigned long i, nr = folio_nr_pages(folio);
> void *old;
> @@ -144,7 +144,7 @@ void __delete_from_swap_cache(struct folio *folio,
> struct address_space *address_space = swap_address_space(entry);
> int i;
> long nr = folio_nr_pages(folio);
> - pgoff_t idx = swp_offset(entry);
> + pgoff_t idx = swap_cache_index(entry);
> XA_STATE(xas, &address_space->i_pages, idx);
>
> xas_set_update(&xas, workingset_update_node);
> @@ -248,18 +248,19 @@ void delete_from_swap_cache(struct folio *folio)
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end)
> {
> - unsigned long curr = begin;
> + unsigned long curr = begin, offset;

Better to rename "offset" as "index" to avoid confusion?

> void *old;
>
> for (;;) {
> + offset = curr & SWAP_ADDRESS_SPACE_MASK;
> swp_entry_t entry = swp_entry(type, curr);
> struct address_space *address_space = swap_address_space(entry);
> - XA_STATE(xas, &address_space->i_pages, curr);
> + XA_STATE(xas, &address_space->i_pages, offset);
>
> xas_set_update(&xas, workingset_update_node);
>
> xa_lock_irq(&address_space->i_pages);
> - xas_for_each(&xas, old, end) {
> + xas_for_each(&xas, old, offset + min(end - curr, SWAP_ADDRESS_SPACE_PAGES)) {

Is there a bug in the original code? It doesn't check SWAP_ADDRESS_SPACE_PAGES.

And should it be changed to

xas_for_each(&xas, old, min(offset + end - curr, SWAP_ADDRESS_SPACE_PAGES))

?

> if (!xa_is_value(old))
> continue;
> xas_store(&xas, NULL);
> @@ -350,7 +351,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
> {
> struct folio *folio;
>
> - folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> if (!IS_ERR(folio)) {
> bool vma_ra = swap_use_vma_readahead();
> bool readahead;
> @@ -420,7 +421,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
> si = get_swap_device(swp);
> if (!si)
> return ERR_PTR(-ENOENT);
> - index = swp_offset(swp);
> + index = swap_cache_index(swp);
> folio = filemap_get_folio(swap_address_space(swp), index);
> put_swap_device(si);
> return folio;
> @@ -447,7 +448,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * that would confuse statistics.
> */
> folio = filemap_get_folio(swap_address_space(entry),
> - swp_offset(entry));
> + swap_cache_index(entry));
> if (!IS_ERR(folio))
> goto got_folio;
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0b0ae6e8c764..4f0e8b2ac8aa 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -142,7 +142,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> struct folio *folio;
> int ret = 0;
>
> - folio = filemap_get_folio(swap_address_space(entry), offset);
> + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> if (IS_ERR(folio))
> return 0;
> /*
> @@ -2158,7 +2158,7 @@ static int try_to_unuse(unsigned int type)
> (i = find_next_to_unuse(si, i)) != 0) {
>
> entry = swp_entry(type, i);
> - folio = filemap_get_folio(swap_address_space(entry), i);
> + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> if (IS_ERR(folio))
> continue;
>
> @@ -3476,7 +3476,7 @@ EXPORT_SYMBOL_GPL(swapcache_mapping);
>
> pgoff_t __folio_swap_cache_index(struct folio *folio)
> {
> - return swp_offset(folio->swap);
> + return swap_cache_index(folio->swap);
> }
> EXPORT_SYMBOL_GPL(__folio_swap_cache_index);

--
Best Regards,
Huang, Ying

2024-05-08 07:50:48

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] mm/swap: reduce swap cache search space

On Wed, May 8, 2024 at 3:27 PM Huang, Ying <[email protected]> wrote:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > Currently we use one swap_address_space for every 64M chunk to reduce lock
> > contention, this is like having a set of smaller swap files inside one
> > swap device. But when doing swap cache look up or insert, we are
> > still using the offset of the whole large swap device. This is OK for
> > correctness, as the offset (key) is unique.
> >
> > But Xarray is specially optimized for small indexes, it creates the
> > radix tree levels lazily to be just enough to fit the largest key
> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
> >
> > For 64M chunk it should only take at most 3 levels to contain everything. it should only take at most 3 levels to contain everything.
> > But if we are using the offset from the whole swap device, the offset (key)
> > value will be way beyond 64M, and so will the tree level.
> >
> > Optimize this by using a new helper swap_cache_index to get a swap
> > entry's unique offset in its own 64M swap_address_space.
> >
> > I see a ~1% performance gain in benchmark and actual workload with
> > high memory pressure.
> >
> > Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
> > with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
> > test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
> > is enabled, as swap out path can never skip swap cache):
> >
> > Before:
> > 6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
> > 0inputs+0outputs (55major+33555018minor)pagefaults 0swaps
> >
> > After (1.8% faster):
> > 6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
> > 0inputs+0outputs (54major+33555027minor)pagefaults 0swaps
> >
> > Similar result with MySQL and sysbench using swap:
> > Before:
> > 94055.61 qps
> >
> > After (0.8% faster):
> > 94834.91 qps
> >
> > Radix tree slab usage is also very slightly lower.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/huge_memory.c | 2 +-
> > mm/memcontrol.c | 2 +-
> > mm/mincore.c | 2 +-
> > mm/shmem.c | 2 +-
> > mm/swap.h | 15 +++++++++++++++
> > mm/swap_state.c | 19 ++++++++++---------
> > mm/swapfile.c | 6 +++---
> > 7 files changed, 32 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d35d526ed48f..45829cc049d2 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2918,7 +2918,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > split_page_memcg(head, order, new_order);
> >
> > if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> > - offset = swp_offset(folio->swap);
> > + offset = swap_cache_index(folio->swap);
> > swap_cache = swap_address_space(folio->swap);
> > xa_lock(&swap_cache->i_pages);
> > }
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d11536ef59ef..81b005c459cb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6165,7 +6165,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> > * Because swap_cache_get_folio() updates some statistics counter,
> > * we call find_get_page() with swapper_space directly.
> > */
> > - page = find_get_page(swap_address_space(ent), swp_offset(ent));
> > + page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
> > entry->val = ent.val;
> >
> > return page;
> > diff --git a/mm/mincore.c b/mm/mincore.c
> > index dad3622cc963..e31cf1bde614 100644
> > --- a/mm/mincore.c
> > +++ b/mm/mincore.c
> > @@ -139,7 +139,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > } else {
> > #ifdef CONFIG_SWAP
> > *vec = mincore_page(swap_address_space(entry),
> > - swp_offset(entry));
> > + swap_cache_index(entry));
> > #else
> > WARN_ON(1);
> > *vec = 1;
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index fa2a0ed97507..326315c12feb 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1756,7 +1756,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> >
> > old = *foliop;
> > entry = old->swap;
> > - swap_index = swp_offset(entry);
> > + swap_index = swap_cache_index(entry);
> > swap_mapping = swap_address_space(entry);
> >
> > /*
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 82023ab93205..93e3e1b58a7f 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -27,6 +27,7 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
> > /* One swap address space for each 64M swap space */
> > #define SWAP_ADDRESS_SPACE_SHIFT 14
> > #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
> > +#define SWAP_ADDRESS_SPACE_MASK (BIT(SWAP_ADDRESS_SPACE_SHIFT) - 1)
>
> #define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
> ?
>
> We can use BIT() in SWAP_ADDRESS_SPACE_PAGES definition.
>

I'll just use (SWAP_ADDRESS_SPACE_PAGES - 1) then, I was trying to
make the changes minimal, but prefered the BIT macro, a trivial
change.

> > extern struct address_space *swapper_spaces[];
> > #define swap_address_space(entry) \
> > (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> > @@ -40,6 +41,15 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
> > return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> > }
> >
> > +/*
> > + * Return the swap cache index of the swap entry.
> > + */
> > +static inline pgoff_t swap_cache_index(swp_entry_t entry)
> > +{
> > + BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
> > + return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> > +}
> > +
> > void show_swap_cache_info(void);
> > bool add_to_swap(struct folio *folio);
> > void *get_shadow_from_swap_cache(swp_entry_t entry);
> > @@ -86,6 +96,11 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
> > return NULL;
> > }
> >
> > +static inline pgoff_t swap_cache_index(swp_entry_t entry)
> > +{
> > + return 0;
> > +}
> > +
> > static inline void show_swap_cache_info(void)
> > {
> > }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 642c30d8376c..09415d4c7843 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -72,7 +72,7 @@ void show_swap_cache_info(void)
> > void *get_shadow_from_swap_cache(swp_entry_t entry)
> > {
> > struct address_space *address_space = swap_address_space(entry);
> > - pgoff_t idx = swp_offset(entry);
> > + pgoff_t idx = swap_cache_index(entry);
> > void *shadow;
> >
> > shadow = xa_load(&address_space->i_pages, idx);
> > @@ -89,7 +89,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > gfp_t gfp, void **shadowp)
> > {
> > struct address_space *address_space = swap_address_space(entry);
> > - pgoff_t idx = swp_offset(entry);
> > + pgoff_t idx = swap_cache_index(entry);
> > XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> > unsigned long i, nr = folio_nr_pages(folio);
> > void *old;
> > @@ -144,7 +144,7 @@ void __delete_from_swap_cache(struct folio *folio,
> > struct address_space *address_space = swap_address_space(entry);
> > int i;
> > long nr = folio_nr_pages(folio);
> > - pgoff_t idx = swp_offset(entry);
> > + pgoff_t idx = swap_cache_index(entry);
> > XA_STATE(xas, &address_space->i_pages, idx);
> >
> > xas_set_update(&xas, workingset_update_node);
> > @@ -248,18 +248,19 @@ void delete_from_swap_cache(struct folio *folio)
> > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > unsigned long end)
> > {
> > - unsigned long curr = begin;
> > + unsigned long curr = begin, offset;
>
> Better to rename "offset" as "index" to avoid confusion?

Good idea.

> > void *old;
> >
> > for (;;) {
> > + offset = curr & SWAP_ADDRESS_SPACE_MASK;
> > swp_entry_t entry = swp_entry(type, curr);
> > struct address_space *address_space = swap_address_space(entry);
> > - XA_STATE(xas, &address_space->i_pages, curr);
> > + XA_STATE(xas, &address_space->i_pages, offset);
> >
> > xas_set_update(&xas, workingset_update_node);
> >
> > xa_lock_irq(&address_space->i_pages);
> > - xas_for_each(&xas, old, end) {
> > + xas_for_each(&xas, old, offset + min(end - curr, SWAP_ADDRESS_SPACE_PAGES)) {
>
> Is there a bug in the original code? It doesn't check SWAP_ADDRESS_SPACE_PAGES.

That's OK, if the (end - curr) goes above SWAP_ADDRESS_SPACE_PAGES, it
means all content in current address_space needs to be purged.
xas_for_each will stop after it iterated all content in the current
address space. This is a bit hackish though.

>
> And should it be changed to
>
> xas_for_each(&xas, old, min(offset + end - curr, SWAP_ADDRESS_SPACE_PAGES))

It should be equivalent, as described above, but yeah, this looks
cleaner. I'll use your suggested code.

> ?
>
> > if (!xa_is_value(old))
> > continue;
> > xas_store(&xas, NULL);
> > @@ -350,7 +351,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
> > {
> > struct folio *folio;
> >
> > - folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> > + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > if (!IS_ERR(folio)) {
> > bool vma_ra = swap_use_vma_readahead();
> > bool readahead;
> > @@ -420,7 +421,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
> > si = get_swap_device(swp);
> > if (!si)
> > return ERR_PTR(-ENOENT);
> > - index = swp_offset(swp);
> > + index = swap_cache_index(swp);
> > folio = filemap_get_folio(swap_address_space(swp), index);
> > put_swap_device(si);
> > return folio;
> > @@ -447,7 +448,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > * that would confuse statistics.
> > */
> > folio = filemap_get_folio(swap_address_space(entry),
> > - swp_offset(entry));
> > + swap_cache_index(entry));
> > if (!IS_ERR(folio))
> > goto got_folio;
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 0b0ae6e8c764..4f0e8b2ac8aa 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -142,7 +142,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > struct folio *folio;
> > int ret = 0;
> >
> > - folio = filemap_get_folio(swap_address_space(entry), offset);
> > + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > if (IS_ERR(folio))
> > return 0;
> > /*
> > @@ -2158,7 +2158,7 @@ static int try_to_unuse(unsigned int type)
> > (i = find_next_to_unuse(si, i)) != 0) {
> >
> > entry = swp_entry(type, i);
> > - folio = filemap_get_folio(swap_address_space(entry), i);
> > + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > if (IS_ERR(folio))
> > continue;
> >
> > @@ -3476,7 +3476,7 @@ EXPORT_SYMBOL_GPL(swapcache_mapping);
> >
> > pgoff_t __folio_swap_cache_index(struct folio *folio)
> > {
> > - return swp_offset(folio->swap);
> > + return swap_cache_index(folio->swap);
> > }
> > EXPORT_SYMBOL_GPL(__folio_swap_cache_index);

Thanks for the suggestions!

>
> --
> Best Regards,
> Huang, Ying