LinuxLists.cc - [PATCH 0/8] mm/swap: optimize swap cache search space

[permalink] [raw]

Subject: [PATCH 5/8] cifs: drop usage of page_file_offset

From: Kairui Song <[email protected]>

page_file_offset is only needed for mixed usage of page cache and
swap cache, for pure page cache usage, the caller can just use
page_offset instead.

It can't be a swap cache page here, so just drop it.

Signed-off-by: Kairui Song <[email protected]>
Cc: Steve French <[email protected]>
Cc: Namjae Jeon <[email protected]>
Cc: Paulo Alcantara <[email protected]>
Cc: Shyam Prasad N <[email protected]>
Cc: Bharath SM <[email protected]>
---
fs/smb/client/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 16aadce492b2..73bbd761bf32 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -4749,7 +4749,7 @@ static int cifs_readpage_worker(struct file *file, struct page *page,
static int cifs_read_folio(struct file *file, struct folio *folio)
{
struct page *page = &folio->page;
- loff_t offset = page_file_offset(page);
+ loff_t offset = page_offset(page);
int rc = -EACCES;
unsigned int xid;

--
2.44.0

2024-04-17 16:14:39

[permalink] [raw]

Subject: Re: [PATCH 2/8] nilfs2: drop usage of page_index

On Thu, Apr 18, 2024 at 12:08:36AM +0800, Kairui Song wrote:
> +++ b/fs/nilfs2/bmap.c
> @@ -453,8 +453,7 @@ __u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap,
> struct buffer_head *pbh;
> __u64 key;
>
> - key = page_index(bh->b_page) << (PAGE_SHIFT -
> - bmap->b_inode->i_blkbits);
> + key = bh->b_page->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);

I'd prefer this were

key = bh->b_folio->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);

(pages only have a ->index field for historical reasons; I'm trying to
get rid of it)

2024-04-17 16:21:52

[permalink] [raw]

Subject: [PATCH 7/8] mm: drop page_index/page_file_offset and convert swap helpers to use folio

From: Kairui Song <[email protected]>

When applied on swap cache pages, page_index / page_file_offset was used
to retrieve the swap cache index or swap file offset of a page, and they
have their folio equivalence version: folio_index / folio_file_pos.

We have eliminated all users for page_index / page_file_offset, everything
is using folio_index / folio_file_pos now, so remove the old helpers.

Then convert the implementation of folio_index / folio_file_pos to
to use folio natively.

After this commit, all users that might encounter mixed usage of swap
cache and page cache will only use following two helpers:

folio_index (calls __folio_swap_cache_index)
folio_file_pos (calls __folio_swap_file_pos)

The offset in swap file and index in swap cache is still basically the
same thing at this moment, but will be different in following commits.

Signed-off-by: Kairui Song <[email protected]>
---
include/linux/mm.h | 13 -------------
include/linux/pagemap.h | 19 +++++++++----------
mm/swapfile.c | 13 +++++++++----
3 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0436b919f1c7..797480e76c9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2245,19 +2245,6 @@ static inline void *folio_address(const struct folio *folio)
return page_address(&folio->page);
}

-extern pgoff_t __page_file_index(struct page *page);
-
-/*
- * Return the pagecache index of the passed page. Regular pagecache pages
- * use ->index whereas swapcache pages use swp_offset(->private)
- */
-static inline pgoff_t page_index(struct page *page)
-{
- if (unlikely(PageSwapCache(page)))
- return __page_file_index(page);
- return page->index;
-}
-
/*
* Return true only if the page has been allocated with
* ALLOC_NO_WATERMARKS and the low watermark was not
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..313f3144406e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -780,7 +780,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
mapping_gfp_mask(mapping));
}

-#define swapcache_index(folio) __page_file_index(&(folio)->page)
+extern pgoff_t __folio_swap_cache_index(struct folio *folio);

/**
* folio_index - File index of a folio.
@@ -795,9 +795,9 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
*/
static inline pgoff_t folio_index(struct folio *folio)
{
- if (unlikely(folio_test_swapcache(folio)))
- return swapcache_index(folio);
- return folio->index;
+ if (unlikely(folio_test_swapcache(folio)))
+ return __folio_swap_cache_index(folio);
+ return folio->index;
}

/**
@@ -920,11 +920,6 @@ static inline loff_t page_offset(struct page *page)
return ((loff_t)page->index) << PAGE_SHIFT;
}

-static inline loff_t page_file_offset(struct page *page)
-{
- return ((loff_t)page_index(page)) << PAGE_SHIFT;
-}
-
/**
* folio_pos - Returns the byte position of this folio in its file.
* @folio: The folio.
@@ -934,6 +929,8 @@ static inline loff_t folio_pos(struct folio *folio)
return page_offset(&folio->page);
}

+extern loff_t __folio_swap_file_pos(struct folio *folio);
+
/**
* folio_file_pos - Returns the byte position of this folio in its file.
* @folio: The folio.
@@ -943,7 +940,9 @@ static inline loff_t folio_pos(struct folio *folio)
*/
static inline loff_t folio_file_pos(struct folio *folio)
{
- return page_file_offset(&folio->page);
+ if (unlikely(folio_test_swapcache(folio)))
+ return __folio_swap_file_pos(folio);
+ return ((loff_t)folio->index << PAGE_SHIFT);
}

/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4919423cce76..0c36a5c2400f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3419,12 +3419,17 @@ struct address_space *swapcache_mapping(struct folio *folio)
}
EXPORT_SYMBOL_GPL(swapcache_mapping);

-pgoff_t __page_file_index(struct page *page)
+pgoff_t __folio_swap_cache_index(struct folio *folio)
{
- swp_entry_t swap = page_swap_entry(page);
- return swp_offset(swap);
+ return swp_offset(folio->swap);
}
-EXPORT_SYMBOL_GPL(__page_file_index);
+EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
+
+loff_t __folio_swap_file_pos(struct folio *folio)
+{
+ return swap_file_pos(folio->swap);
+}
+EXPORT_SYMBOL_GPL(__folio_swap_file_pos);

/*
* add_swap_count_continuation - called when a swap count is duplicated
--
2.44.0

2024-04-17 16:27:57

[permalink] [raw]

Subject: [PATCH 3/8] f2fs: drop usage of page_index

From: Kairui Song <[email protected]>

page_index is needed for mixed usage of page cache and swap cache,
for pure page cache usage, the caller can just use page->index instead.

It can't be a swap cache page here, so just drop it.

Signed-off-by: Kairui Song <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: [email protected]
---
fs/f2fs/data.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index d9494b5fc7c1..12d5bbd18755 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2057,7 +2057,7 @@ static int f2fs_read_single_page(struct inode *inode, struct page *page,
sector_t block_nr;
int ret = 0;

- block_in_file = (sector_t)page_index(page);
+ block_in_file = (sector_t)page->index;
last_block = block_in_file + nr_pages;
last_block_in_file = bytes_to_blks(inode,
f2fs_readpage_limit(inode) + blocksize - 1);
@@ -4086,8 +4086,7 @@ void f2fs_clear_page_cache_dirty_tag(struct page *page)
unsigned long flags;

xa_lock_irqsave(&mapping->i_pages, flags);
- __xa_clear_mark(&mapping->i_pages, page_index(page),
- PAGECACHE_TAG_DIRTY);
+ __xa_clear_mark(&mapping->i_pages, page->index, PAGECACHE_TAG_DIRTY);
xa_unlock_irqrestore(&mapping->i_pages, flags);
}

--
2.44.0

2024-04-17 16:30:33

[permalink] [raw]

Subject: [PATCH 4/8] ceph: drop usage of page_index

From: Kairui Song <[email protected]>

page_index is needed for mixed usage of page cache and swap cache,
for pure page cache usage, the caller can just use page->index instead.

It can't be a swap cache page here, so just drop it.

Signed-off-by: Kairui Song <[email protected]>
Cc: Xiubo Li <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: [email protected]
---
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0e9f56eaba1e..570a9d634cc5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -141,7 +141,7 @@ __dcache_find_get_entry(struct dentry *parent, u64 idx,
if (ptr_pos >= i_size_read(dir))
return NULL;

- if (!cache_ctl->page || ptr_pgoff != page_index(cache_ctl->page)) {
+ if (!cache_ctl->page || ptr_pgoff != cache_ctl->page->index) {
ceph_readdir_cache_release(cache_ctl);
cache_ctl->page = find_lock_page(&dir->i_data, ptr_pgoff);
if (!cache_ctl->page) {
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 7b2e77517f23..1f92d3faaa6b 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1861,7 +1861,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn,
unsigned idx = ctl->index % nsize;
pgoff_t pgoff = ctl->index / nsize;

- if (!ctl->page || pgoff != page_index(ctl->page)) {
+ if (!ctl->page || pgoff != ctl->page->index) {
ceph_readdir_cache_release(ctl);
if (idx == 0)
ctl->page = grab_cache_page(&dir->i_data, pgoff);
--
2.44.0

2024-04-17 16:31:10

[permalink] [raw]

Subject: [PATCH 6/8] mm/swap: get the swap file offset directly

From: Kairui Song <[email protected]>

folio_file_pos and page_file_offset are for mixed usage of swap cache
and page cache, it can't be page cache here, so introduce a new helper
to get the swap offset in swap file directly.

Signed-off-by: Kairui Song <[email protected]>
---
mm/page_io.c | 6 +++---
mm/swap.h | 5 +++++
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index ae2b49055e43..93de5aadb438 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -279,7 +279,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
* be temporary.
*/
pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
- ret, page_file_offset(page));
+ ret, swap_file_pos(page_swap_entry(page)));
for (p = 0; p < sio->pages; p++) {
page = sio->bvec[p].bv_page;
set_page_dirty(page);
@@ -298,7 +298,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
struct swap_iocb *sio = NULL;
struct swap_info_struct *sis = swp_swap_info(folio->swap);
struct file *swap_file = sis->swap_file;
- loff_t pos = folio_file_pos(folio);
+ loff_t pos = swap_file_pos(folio->swap);

count_swpout_vm_event(folio);
folio_start_writeback(folio);
@@ -429,7 +429,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
struct swap_info_struct *sis = swp_swap_info(folio->swap);
struct swap_iocb *sio = NULL;
- loff_t pos = folio_file_pos(folio);
+ loff_t pos = swap_file_pos(folio->swap);

if (plug)
sio = *plug;
diff --git a/mm/swap.h b/mm/swap.h
index fc2f6ade7f80..2de83729aaa8 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -7,6 +7,11 @@ struct mempolicy;
#ifdef CONFIG_SWAP
#include <linux/blk_types.h> /* for bio_end_io_t */

+static inline loff_t swap_file_pos(swp_entry_t entry)
+{
+ return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
+}
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
--
2.44.0

2024-04-17 16:32:00

[permalink] [raw]

Subject: [PATCH 8/8] mm/swap: reduce swap cache search space

From: Kairui Song <[email protected]>

Currently we use one swap_address_space for every 64M chunk to reduce lock
contention, this is like having a set of smaller swap files inside one
big swap file. But when doing swap cache look up or insert, we are
still using the offset of the whole large swap file. This is OK for
correctness, as the offset (key) is unique.

But Xarray is specially optimized for small indexes, it creates the
radix tree levels lazily to be just enough to fit the largest key
stored in one Xarray. So we are wasting tree nodes unnecessarily.

For 64M chunk it should only take at most 3 levels to contain everything.
But we are using the offset from the whole swap file, so the offset (key)
value will be way beyond 64M, and so will the tree level.

Optimize this by using a new helper swap_cache_index to get a swap
entry's unique offset in its own 64M swap_address_space.

I see a ~1% performance gain in benchmark and actual workload with
high memory pressure.

Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
is enabled, as swap out path can never skip swap cache):

Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps

After (1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps

Similar result with MySQL and sysbench using swap:
Before:
94055.61 qps

After (0.8% faster):
94834.91 qps

Radix tree slab usage is also very slightly lower.

Signed-off-by: Kairui Song <[email protected]>
---
mm/huge_memory.c | 2 +-
mm/memcontrol.c | 2 +-
mm/mincore.c | 2 +-
mm/shmem.c | 2 +-
mm/swap.h | 7 +++++++
mm/swap_state.c | 12 ++++++------
mm/swapfile.c | 6 +++---
7 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..1208d60792f0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2903,7 +2903,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
split_page_memcg(head, order, new_order);

if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
- offset = swp_offset(folio->swap);
+ offset = swap_cache_index(folio->swap);
swap_cache = swap_address_space(folio->swap);
xa_lock(&swap_cache->i_pages);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fabce2b50c69..04d7be7f30dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5934,7 +5934,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
* Because swap_cache_get_folio() updates some statistics counter,
* we call find_get_page() with swapper_space directly.
*/
- page = find_get_page(swap_address_space(ent), swp_offset(ent));
+ page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
entry->val = ent.val;

return page;
diff --git a/mm/mincore.c b/mm/mincore.c
index dad3622cc963..e31cf1bde614 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -139,7 +139,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
} else {
#ifdef CONFIG_SWAP
*vec = mincore_page(swap_address_space(entry),
- swp_offset(entry));
+ swap_cache_index(entry));
#else
WARN_ON(1);
*vec = 1;
diff --git a/mm/shmem.c b/mm/shmem.c
index 0aad0d9a621b..cbe33ab52a73 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1762,7 +1762,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,

old = *foliop;
entry = old->swap;
- swap_index = swp_offset(entry);
+ swap_index = swap_cache_index(entry);
swap_mapping = swap_address_space(entry);

/*
diff --git a/mm/swap.h b/mm/swap.h
index 2de83729aaa8..6ef237d2b029 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -31,11 +31,18 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
/* One swap address space for each 64M swap space */
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
+#define SWAP_ADDRESS_SPACE_MASK (BIT(SWAP_ADDRESS_SPACE_SHIFT) - 1)
extern struct address_space *swapper_spaces[];
#define swap_address_space(entry) \
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])

+static inline pgoff_t swap_cache_index(swp_entry_t entry)
+{
+ BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
+ return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
+}
+
void show_swap_cache_info(void);
bool add_to_swap(struct folio *folio);
void *get_shadow_from_swap_cache(swp_entry_t entry);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bfc7e8c58a6d..9dbb54c72770 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -72,7 +72,7 @@ void show_swap_cache_info(void)
void *get_shadow_from_swap_cache(swp_entry_t entry)
{
struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
struct page *page;

page = xa_load(&address_space->i_pages, idx);
@@ -89,7 +89,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
unsigned long i, nr = folio_nr_pages(folio);
void *old;
@@ -144,7 +144,7 @@ void __delete_from_swap_cache(struct folio *folio,
struct address_space *address_space = swap_address_space(entry);
int i;
long nr = folio_nr_pages(folio);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
XA_STATE(xas, &address_space->i_pages, idx);

xas_set_update(&xas, workingset_update_node);
@@ -350,7 +350,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
{
struct folio *folio;

- folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (!IS_ERR(folio)) {
bool vma_ra = swap_use_vma_readahead();
bool readahead;
@@ -420,7 +420,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
si = get_swap_device(swp);
if (!si)
return ERR_PTR(-ENOENT);
- index = swp_offset(swp);
+ index = swap_cache_index(swp);
folio = filemap_get_folio(swap_address_space(swp), index);
put_swap_device(si);
return folio;
@@ -447,7 +447,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* that would confuse statistics.
*/
folio = filemap_get_folio(swap_address_space(entry),
- swp_offset(entry));
+ swap_cache_index(entry));
if (!IS_ERR(folio))
goto got_folio;

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0c36a5c2400f..2e8df95977b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -138,7 +138,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
struct folio *folio;
int ret = 0;

- folio = filemap_get_folio(swap_address_space(entry), offset);
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (IS_ERR(folio))
return 0;
/*
@@ -2110,7 +2110,7 @@ static int try_to_unuse(unsigned int type)
(i = find_next_to_unuse(si, i)) != 0) {

entry = swp_entry(type, i);
- folio = filemap_get_folio(swap_address_space(entry), i);
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (IS_ERR(folio))
continue;

@@ -3421,7 +3421,7 @@ EXPORT_SYMBOL_GPL(swapcache_mapping);

pgoff_t __folio_swap_cache_index(struct folio *folio)
{
- return swp_offset(folio->swap);
+ return swap_cache_index(folio->swap);
}
EXPORT_SYMBOL_GPL(__folio_swap_cache_index);

--
2.44.0

2024-04-17 18:51:50

[permalink] [raw]

Subject: Re: [PATCH 5/8] cifs: drop usage of page_file_offset

On Thu, Apr 18, 2024 at 12:08:39AM +0800, Kairui Song wrote:
> +++ b/fs/smb/client/file.c
> @@ -4749,7 +4749,7 @@ static int cifs_readpage_worker(struct file *file, struct page *page,
> static int cifs_read_folio(struct file *file, struct folio *folio)
> {
> struct page *page = &folio->page;
> - loff_t offset = page_file_offset(page);
> + loff_t offset = page_offset(page);

loff_t offset = folio_pos(folio);

2024-04-18 00:28:46

by Xiubo Li

[permalink] [raw]

Subject: Re: [PATCH 4/8] ceph: drop usage of page_index

On 4/18/24 00:08, Kairui Song wrote:
> From: Kairui Song <[email protected]>
>
> page_index is needed for mixed usage of page cache and swap cache,
> for pure page cache usage, the caller can just use page->index instead.
>
> It can't be a swap cache page here, so just drop it.
>
> Signed-off-by: Kairui Song <[email protected]>
> Cc: Xiubo Li <[email protected]>
> Cc: Ilya Dryomov <[email protected]>
> Cc: Jeff Layton <[email protected]>
> Cc: [email protected]
> ---
> fs/ceph/dir.c | 2 +-
> fs/ceph/inode.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 0e9f56eaba1e..570a9d634cc5 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -141,7 +141,7 @@ __dcache_find_get_entry(struct dentry *parent, u64 idx,
> if (ptr_pos >= i_size_read(dir))
> return NULL;
>
> - if (!cache_ctl->page || ptr_pgoff != page_index(cache_ctl->page)) {
> + if (!cache_ctl->page || ptr_pgoff != cache_ctl->page->index) {
> ceph_readdir_cache_release(cache_ctl);
> cache_ctl->page = find_lock_page(&dir->i_data, ptr_pgoff);
> if (!cache_ctl->page) {
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 7b2e77517f23..1f92d3faaa6b 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1861,7 +1861,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn,
> unsigned idx = ctl->index % nsize;
> pgoff_t pgoff = ctl->index / nsize;
>
> - if (!ctl->page || pgoff != page_index(ctl->page)) {
> + if (!ctl->page || pgoff != ctl->page->index) {

Hi Kairui,

Thanks for you patch and will it be doable to switch to folio_index()
instead ?

Cheers,

- Xiubo

> ceph_readdir_cache_release(ctl);
> if (idx == 0)
> ctl->page = grab_cache_page(&dir->i_data, pgoff);

2024-04-18 01:40:33

by Xiubo Li

[permalink] [raw]

Subject: Re: [PATCH 4/8] ceph: drop usage of page_index

On 4/18/24 09:30, Matthew Wilcox wrote:
> On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
>> Thanks for you patch and will it be doable to switch to folio_index()
>> instead ?
> No. Just use folio->index. You only need folio_index() if the folio
> might belong to the swapcache instead of a file.
>
Hmm, Okay.

Thanks

- Xiubo

2024-04-18 01:55:51

by Barry Song

[permalink] [raw]

Subject: Re: [PATCH 7/8] mm: drop page_index/page_file_offset and convert swap helpers to use folio

On Thu, Apr 18, 2024 at 4:12 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> When applied on swap cache pages, page_index / page_file_offset was used
> to retrieve the swap cache index or swap file offset of a page, and they
> have their folio equivalence version: folio_index / folio_file_pos.
>
> We have eliminated all users for page_index / page_file_offset, everything
> is using folio_index / folio_file_pos now, so remove the old helpers.
>
> Then convert the implementation of folio_index / folio_file_pos to
> to use folio natively.
>
> After this commit, all users that might encounter mixed usage of swap
> cache and page cache will only use following two helpers:
>
> folio_index (calls __folio_swap_cache_index)
> folio_file_pos (calls __folio_swap_file_pos)
>
> The offset in swap file and index in swap cache is still basically the
> same thing at this moment, but will be different in following commits.
>
> Signed-off-by: Kairui Song <[email protected]>

Hi Kairui, thanks !

I also find it rather odd that folio_file_page() is utilized for both
swp and file.

mm/memory.c <<do_swap_page>>
page = folio_file_page(folio, swp_offset(entry));
mm/swap_state.c <<swapin_readahead>>
return folio_file_page(folio, swp_offset(entry));
mm/swapfile.c <<unuse_pte>>
page = folio_file_page(folio, swp_offset(entry));

Do you believe it's worthwhile to tidy up?

> ---
> include/linux/mm.h | 13 -------------
> include/linux/pagemap.h | 19 +++++++++----------
> mm/swapfile.c | 13 +++++++++----
> 3 files changed, 18 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0436b919f1c7..797480e76c9c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2245,19 +2245,6 @@ static inline void *folio_address(const struct folio *folio)
> return page_address(&folio->page);
> }
>
> -extern pgoff_t __page_file_index(struct page *page);
> -
> -/*
> - * Return the pagecache index of the passed page. Regular pagecache pages
> - * use ->index whereas swapcache pages use swp_offset(->private)
> - */
> -static inline pgoff_t page_index(struct page *page)
> -{
> - if (unlikely(PageSwapCache(page)))
> - return __page_file_index(page);
> - return page->index;
> -}
> -
> /*
> * Return true only if the page has been allocated with
> * ALLOC_NO_WATERMARKS and the low watermark was not
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2df35e65557d..313f3144406e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -780,7 +780,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
> mapping_gfp_mask(mapping));
> }
>
> -#define swapcache_index(folio) __page_file_index(&(folio)->page)
> +extern pgoff_t __folio_swap_cache_index(struct folio *folio);
>
> /**
> * folio_index - File index of a folio.
> @@ -795,9 +795,9 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
> */
> static inline pgoff_t folio_index(struct folio *folio)
> {
> - if (unlikely(folio_test_swapcache(folio)))
> - return swapcache_index(folio);
> - return folio->index;
> + if (unlikely(folio_test_swapcache(folio)))
> + return __folio_swap_cache_index(folio);
> + return folio->index;
> }
>
> /**
> @@ -920,11 +920,6 @@ static inline loff_t page_offset(struct page *page)
> return ((loff_t)page->index) << PAGE_SHIFT;
> }
>
> -static inline loff_t page_file_offset(struct page *page)
> -{
> - return ((loff_t)page_index(page)) << PAGE_SHIFT;
> -}
> -
> /**
> * folio_pos - Returns the byte position of this folio in its file.
> * @folio: The folio.
> @@ -934,6 +929,8 @@ static inline loff_t folio_pos(struct folio *folio)
> return page_offset(&folio->page);
> }
>
> +extern loff_t __folio_swap_file_pos(struct folio *folio);
> +
> /**
> * folio_file_pos - Returns the byte position of this folio in its file.
> * @folio: The folio.
> @@ -943,7 +940,9 @@ static inline loff_t folio_pos(struct folio *folio)
> */
> static inline loff_t folio_file_pos(struct folio *folio)
> {
> - return page_file_offset(&folio->page);
> + if (unlikely(folio_test_swapcache(folio)))
> + return __folio_swap_file_pos(folio);
> + return ((loff_t)folio->index << PAGE_SHIFT);
> }
>
> /*
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4919423cce76..0c36a5c2400f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3419,12 +3419,17 @@ struct address_space *swapcache_mapping(struct folio *folio)
> }
> EXPORT_SYMBOL_GPL(swapcache_mapping);
>
> -pgoff_t __page_file_index(struct page *page)
> +pgoff_t __folio_swap_cache_index(struct folio *folio)
> {
> - swp_entry_t swap = page_swap_entry(page);
> - return swp_offset(swap);
> + return swp_offset(folio->swap);
> }
> -EXPORT_SYMBOL_GPL(__page_file_index);
> +EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
> +
> +loff_t __folio_swap_file_pos(struct folio *folio)
> +{
> + return swap_file_pos(folio->swap);
> +}
> +EXPORT_SYMBOL_GPL(__folio_swap_file_pos);
>
> /*
> * add_swap_count_continuation - called when a swap count is duplicated
> --
> 2.44.0
>

Thanks
Barry

2024-04-18 02:43:19

[permalink] [raw]

Subject: Re: [PATCH 7/8] mm: drop page_index/page_file_offset and convert swap helpers to use folio

On Thu, Apr 18, 2024 at 9:55 AM Barry Song <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 4:12 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > When applied on swap cache pages, page_index / page_file_offset was used
> > to retrieve the swap cache index or swap file offset of a page, and they
> > have their folio equivalence version: folio_index / folio_file_pos.
> >
> > We have eliminated all users for page_index / page_file_offset, everything
> > is using folio_index / folio_file_pos now, so remove the old helpers.
> >
> > Then convert the implementation of folio_index / folio_file_pos to
> > to use folio natively.
> >
> > After this commit, all users that might encounter mixed usage of swap
> > cache and page cache will only use following two helpers:
> >
> > folio_index (calls __folio_swap_cache_index)
> > folio_file_pos (calls __folio_swap_file_pos)
> >
> > The offset in swap file and index in swap cache is still basically the
> > same thing at this moment, but will be different in following commits.
> >
> > Signed-off-by: Kairui Song <[email protected]>
>
> Hi Kairui, thanks !
>
> I also find it rather odd that folio_file_page() is utilized for both
> swp and file.
>
> mm/memory.c <<do_swap_page>>
> page = folio_file_page(folio, swp_offset(entry));
> mm/swap_state.c <<swapin_readahead>>
> return folio_file_page(folio, swp_offset(entry));
> mm/swapfile.c <<unuse_pte>>
> page = folio_file_page(folio, swp_offset(entry));
>
> Do you believe it's worthwhile to tidy up?
>

Hi Barry,

I'm not sure about this. Using folio_file_page doesn't look too bad,
and it will be gone once we convert them to always use folio, this
shouldn't take too long.

2024-04-18 02:43:41

[permalink] [raw]

Subject: Re: [PATCH 2/8] nilfs2: drop usage of page_index

On Thu, Apr 18, 2024 at 12:14 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 12:08:36AM +0800, Kairui Song wrote:
> > +++ b/fs/nilfs2/bmap.c
> > @@ -453,8 +453,7 @@ __u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap,
> > struct buffer_head *pbh;
> > __u64 key;
> >
> > - key = page_index(bh->b_page) << (PAGE_SHIFT -
> > - bmap->b_inode->i_blkbits);
> > + key = bh->b_page->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);
>
> I'd prefer this were
>
> key = bh->b_folio->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);
>
> (pages only have a ->index field for historical reasons; I'm trying to
> get rid of it)
>

Good suggestion! For easier review I just copied the original logic
from page_index, I will update with folio conventions in V2.

2024-04-18 03:55:45

by Barry Song

[permalink] [raw]

Subject: Re: [PATCH 7/8] mm: drop page_index/page_file_offset and convert swap helpers to use folio

On Thu, Apr 18, 2024 at 3:31 PM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 01:55:28PM +1200, Barry Song wrote:
> > I also find it rather odd that folio_file_page() is utilized for both
> > swp and file.
> >
> > mm/memory.c <<do_swap_page>>
> > page = folio_file_page(folio, swp_offset(entry));
> > mm/swap_state.c <<swapin_readahead>>
> > return folio_file_page(folio, swp_offset(entry));
> > mm/swapfile.c <<unuse_pte>>
> > page = folio_file_page(folio, swp_offset(entry));
> >
> > Do you believe it's worthwhile to tidy up?
>
> Why do you find it odd? What would you propose instead?

i'd prefer something like folio_swap_page(folio, entry);

2024-04-18 07:27:03

[permalink] [raw]

Subject: Re: [PATCH 4/8] ceph: drop usage of page_index

On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
> Thanks for you patch and will it be doable to switch to folio_index()
> instead ?

No. Just use folio->index. You only need folio_index() if the folio
might belong to the swapcache instead of a file.

2024-04-18 08:17:03

[permalink] [raw]

Subject: Re: [PATCH 7/8] mm: drop page_index/page_file_offset and convert swap helpers to use folio

On Thu, Apr 18, 2024 at 01:55:28PM +1200, Barry Song wrote:
> I also find it rather odd that folio_file_page() is utilized for both
> swp and file.
>
> mm/memory.c <<do_swap_page>>
> page = folio_file_page(folio, swp_offset(entry));
> mm/swap_state.c <<swapin_readahead>>
> return folio_file_page(folio, swp_offset(entry));
> mm/swapfile.c <<unuse_pte>>
> page = folio_file_page(folio, swp_offset(entry));
>
> Do you believe it's worthwhile to tidy up?

Why do you find it odd? What would you propose instead?

2024-04-18 10:20:39

by Barry Song

[permalink] [raw]

Subject: Re: [PATCH 7/8] mm: drop page_index/page_file_offset and convert swap helpers to use folio

On Thu, Apr 18, 2024 at 2:42 PM Kairui Song <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 9:55 AM Barry Song <[email protected]> wrote:
> >
> > On Thu, Apr 18, 2024 at 4:12 AM Kairui Song <[email protected]> wrote:
> > >
> > > From: Kairui Song <[email protected]>
> > >
> > > When applied on swap cache pages, page_index / page_file_offset was used
> > > to retrieve the swap cache index or swap file offset of a page, and they
> > > have their folio equivalence version: folio_index / folio_file_pos.
> > >
> > > We have eliminated all users for page_index / page_file_offset, everything
> > > is using folio_index / folio_file_pos now, so remove the old helpers.
> > >
> > > Then convert the implementation of folio_index / folio_file_pos to
> > > to use folio natively.
> > >
> > > After this commit, all users that might encounter mixed usage of swap
> > > cache and page cache will only use following two helpers:
> > >
> > > folio_index (calls __folio_swap_cache_index)
> > > folio_file_pos (calls __folio_swap_file_pos)
> > >
> > > The offset in swap file and index in swap cache is still basically the
> > > same thing at this moment, but will be different in following commits.
> > >
> > > Signed-off-by: Kairui Song <[email protected]>
> >
> > Hi Kairui, thanks !
> >
> > I also find it rather odd that folio_file_page() is utilized for both
> > swp and file.
> >
> > mm/memory.c <<do_swap_page>>
> > page = folio_file_page(folio, swp_offset(entry));
> > mm/swap_state.c <<swapin_readahead>>
> > return folio_file_page(folio, swp_offset(entry));
> > mm/swapfile.c <<unuse_pte>>
> > page = folio_file_page(folio, swp_offset(entry));
> >
> > Do you believe it's worthwhile to tidy up?
> >
>
> Hi Barry,
>
> I'm not sure about this. Using folio_file_page doesn't look too bad,
> and it will be gone once we convert them to always use folio, this
> shouldn't take too long.

HI Kairui,
I am not quite sure this is going to be quite soon. our swap-in large
folios refault still have corner cases which can't map large folios even
we hit large folios in swapcache [1].
personally, i feel do_swap_page() isn't handling file, and those pages
are anon but not file. so a separate helper taking folio and entry as
arguments seem more readable as Chris even wants to remove
the assumption large folios have been to swapped to contiguous
swap offsets. anyway, it is just me :-）

[1] https://lore.kernel.org/linux-mm/[email protected]/

2024-04-18 18:22:22

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 8/8] mm/swap: reduce swap cache search space

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on ceph-client/testing]
[also build test ERROR on ceph-client/for-linus trondmy-nfs/linux-next konis-nilfs2/upstream jaegeuk-f2fs/dev-test jaegeuk-f2fs/dev cifs/for-next linus/master v6.9-rc4]
[cannot apply to akpm-mm/mm-everything next-20240418]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/NFS-remove-nfs_page_lengthg-and-usage-of-page_index/20240418-001343
base: https://github.com/ceph/ceph-client.git testing
patch link: https://lore.kernel.org/r/20240417160842.76665-9-ryncsn%40gmail.com
patch subject: [PATCH 8/8] mm/swap: reduce swap cache search space
config: i386-buildonly-randconfig-002-20240419 (https://download.01.org/0day-ci/archive/20240419/[email protected]/config)
compiler: gcc-9 (Ubuntu 9.5.0-4ubuntu2) 9.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240419/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/huge_memory.c: In function '__split_huge_page':
>> mm/huge_memory.c:2906:12: error: implicit declaration of function 'swap_cache_index' [-Werror=implicit-function-declaration]
2906 | offset = swap_cache_index(folio->swap);
| ^~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/swap_cache_index +2906 mm/huge_memory.c

2888
2889 static void __split_huge_page(struct page *page, struct list_head *list,
2890 pgoff_t end, unsigned int new_order)
2891 {
2892 struct folio *folio = page_folio(page);
2893 struct page *head = &folio->page;
2894 struct lruvec *lruvec;
2895 struct address_space *swap_cache = NULL;
2896 unsigned long offset = 0;
2897 int i, nr_dropped = 0;
2898 unsigned int new_nr = 1 << new_order;
2899 int order = folio_order(folio);
2900 unsigned int nr = 1 << order;
2901
2902 /* complete memcg works before add pages to LRU */
2903 split_page_memcg(head, order, new_order);
2904
2905 if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> 2906 offset = swap_cache_index(folio->swap);
2907 swap_cache = swap_address_space(folio->swap);
2908 xa_lock(&swap_cache->i_pages);
2909 }
2910
2911 /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
2912 lruvec = folio_lruvec_lock(folio);
2913
2914 ClearPageHasHWPoisoned(head);
2915
2916 for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
2917 __split_huge_page_tail(folio, i, lruvec, list, new_order);
2918 /* Some pages can be beyond EOF: drop them from page cache */
2919 if (head[i].index >= end) {
2920 struct folio *tail = page_folio(head + i);
2921
2922 if (shmem_mapping(folio->mapping))
2923 nr_dropped++;
2924 else if (folio_test_clear_dirty(tail))
2925 folio_account_cleaned(tail,
2926 inode_to_wb(folio->mapping->host));
2927 __filemap_remove_folio(tail, NULL);
2928 folio_put(tail);
2929 } else if (!PageAnon(page)) {
2930 __xa_store(&folio->mapping->i_pages, head[i].index,
2931 head + i, 0);
2932 } else if (swap_cache) {
2933 __xa_store(&swap_cache->i_pages, offset + i,
2934 head + i, 0);
2935 }
2936 }
2937
2938 if (!new_order)
2939 ClearPageCompound(head);
2940 else {
2941 struct folio *new_folio = (struct folio *)head;
2942
2943 folio_set_order(new_folio, new_order);
2944 }
2945 unlock_page_lruvec(lruvec);
2946 /* Caller disabled irqs, so they are still disabled here */
2947
2948 split_page_owner(head, order, new_order);
2949
2950 /* See comment in __split_huge_page_tail() */
2951 if (folio_test_anon(folio)) {
2952 /* Additional pin to swap cache */
2953 if (folio_test_swapcache(folio)) {
2954 folio_ref_add(folio, 1 + new_nr);
2955 xa_unlock(&swap_cache->i_pages);
2956 } else {
2957 folio_ref_inc(folio);
2958 }
2959 } else {
2960 /* Additional pin to page cache */
2961 folio_ref_add(folio, 1 + new_nr);
2962 xa_unlock(&folio->mapping->i_pages);
2963 }
2964 local_irq_enable();
2965
2966 if (nr_dropped)
2967 shmem_uncharge(folio->mapping->host, nr_dropped);
2968 remap_page(folio, nr);
2969
2970 if (folio_test_swapcache(folio))
2971 split_swap_cluster(folio->swap);
2972
2973 /*
2974 * set page to its compound_head when split to non order-0 pages, so
2975 * we can skip unlocking it below, since PG_locked is transferred to
2976 * the compound_head of the page and the caller will unlock it.
2977 */
2978 if (new_order)
2979 page = compound_head(page);
2980
2981 for (i = 0; i < nr; i += new_nr) {
2982 struct page *subpage = head + i;
2983 struct folio *new_folio = page_folio(subpage);
2984 if (subpage == page)
2985 continue;
2986 folio_unlock(new_folio);
2987
2988 /*
2989 * Subpages may be freed if there wasn't any mapping
2990 * like if add_to_swap() is running on a lru page that
2991 * had its mapping zapped. And freeing these pages
2992 * requires taking the lru_lock so we do the put_page
2993 * of the tail pages after the split is complete.
2994 */
2995 free_page_and_swap_cache(subpage);
2996 }
2997 }
2998

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-04-18 18:23:58

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 8/8] mm/swap: reduce swap cache search space

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on ceph-client/testing]
[also build test ERROR on ceph-client/for-linus trondmy-nfs/linux-next konis-nilfs2/upstream jaegeuk-f2fs/dev-test jaegeuk-f2fs/dev cifs/for-next linus/master v6.9-rc4]
[cannot apply to akpm-mm/mm-everything next-20240418]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/NFS-remove-nfs_page_lengthg-and-usage-of-page_index/20240418-001343
base: https://github.com/ceph/ceph-client.git testing
patch link: https://lore.kernel.org/r/20240417160842.76665-9-ryncsn%40gmail.com
patch subject: [PATCH 8/8] mm/swap: reduce swap cache search space
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240419/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240419/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/shmem.c: In function 'shmem_replace_folio':
>> mm/shmem.c:1765:22: error: implicit declaration of function 'swap_cache_index' [-Werror=implicit-function-declaration]
1765 | swap_index = swap_cache_index(entry);
| ^~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/swap_cache_index +1765 mm/shmem.c

1753
1754 static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
1755 struct shmem_inode_info *info, pgoff_t index)
1756 {
1757 struct folio *old, *new;
1758 struct address_space *swap_mapping;
1759 swp_entry_t entry;
1760 pgoff_t swap_index;
1761 int error;
1762
1763 old = *foliop;
1764 entry = old->swap;
> 1765 swap_index = swap_cache_index(entry);
1766 swap_mapping = swap_address_space(entry);
1767
1768 /*
1769 * We have arrived here because our zones are constrained, so don't
1770 * limit chance of success by further cpuset and node constraints.
1771 */
1772 gfp &= ~GFP_CONSTRAINT_MASK;
1773 VM_BUG_ON_FOLIO(folio_test_large(old), old);
1774 new = shmem_alloc_folio(gfp, info, index);
1775 if (!new)
1776 return -ENOMEM;
1777
1778 folio_get(new);
1779 folio_copy(new, old);
1780 flush_dcache_folio(new);
1781
1782 __folio_set_locked(new);
1783 __folio_set_swapbacked(new);
1784 folio_mark_uptodate(new);
1785 new->swap = entry;
1786 folio_set_swapcache(new);
1787
1788 /*
1789 * Our caller will very soon move newpage out of swapcache, but it's
1790 * a nice clean interface for us to replace oldpage by newpage there.
1791 */
1792 xa_lock_irq(&swap_mapping->i_pages);
1793 error = shmem_replace_entry(swap_mapping, swap_index, old, new);
1794 if (!error) {
1795 mem_cgroup_migrate(old, new);
1796 __lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1);
1797 __lruvec_stat_mod_folio(new, NR_SHMEM, 1);
1798 __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -1);
1799 __lruvec_stat_mod_folio(old, NR_SHMEM, -1);
1800 }
1801 xa_unlock_irq(&swap_mapping->i_pages);
1802
1803 if (unlikely(error)) {
1804 /*
1805 * Is this possible? I think not, now that our callers check
1806 * both PageSwapCache and page_private after getting page lock;
1807 * but be defensive. Reverse old to newpage for clear and free.
1808 */
1809 old = new;
1810 } else {
1811 folio_add_lru(new);
1812 *foliop = new;
1813 }
1814
1815 folio_clear_swapcache(old);
1816 old->private = NULL;
1817
1818 folio_unlock(old);
1819 folio_put_refs(old, 2);
1820 return error;
1821 }
1822

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-04-18 18:44:45

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 6/8] mm/swap: get the swap file offset directly

2024-04-22 07:57:04

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

Hi, Kairui,

Kairui Song <[email protected]> writes:

> From: Kairui Song <[email protected]>
>
> Currently we use one swap_address_space for every 64M chunk to reduce lock
> contention, this is like having a set of smaller swap files inside one
> big swap file. But when doing swap cache look up or insert, we are
> still using the offset of the whole large swap file. This is OK for
> correctness, as the offset (key) is unique.
>
> But Xarray is specially optimized for small indexes, it creates the
> redix tree levels lazily to be just enough to fit the largest key
> stored in one Xarray. So we are wasting tree nodes unnecessarily.
>
> For 64M chunk it should only take at most 3 level to contain everything.
> But we are using the offset from the whole swap file, so the offset (key)
> value will be way beyond 64M, and so will the tree level.
>
> Optimize this by reduce the swap cache search space into 64M scope.

In general, I think that it makes sense to reduce the depth of the
xarray.

One concern is that IIUC we make swap cache behaves like file cache if
possible. And your change makes swap cache and file cache diverge more.
Is it possible for us to keep them similar?

For example,

Is it possible to return the offset inside 64M range in
__page_file_index() (maybe rename it)?

Is it possible to add "start_offset" support in xarray, so "index"
will subtract "start_offset" before looking up / inserting?

Is it possible to use multiple range locks to protect one xarray to
improve the lock scalability? This is why we have multiple "struct
address_space" for one swap device. And, we may have same lock
contention issue for large files too.

I haven't look at the code in details. So, my idea may not make sense
at all. If so, sorry about that.

Hi, Matthew,

Can you teach me on this too?

--
Best Regards,
Huang, Ying

2024-04-22 15:20:56

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <[email protected]> wrote:
>
> Hi, Kairui,
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > Currently we use one swap_address_space for every 64M chunk to reduce lock
> > contention, this is like having a set of smaller swap files inside one
> > big swap file. But when doing swap cache look up or insert, we are
> > still using the offset of the whole large swap file. This is OK for
> > correctness, as the offset (key) is unique.
> >
> > But Xarray is specially optimized for small indexes, it creates the
> > redix tree levels lazily to be just enough to fit the largest key
> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
> >
> > For 64M chunk it should only take at most 3 level to contain everything.
> > But we are using the offset from the whole swap file, so the offset (key)
> > value will be way beyond 64M, and so will the tree level.
> >
> > Optimize this by reduce the swap cache search space into 64M scope.
>

Hi,

Thanks for the comments!

> In general, I think that it makes sense to reduce the depth of the
> xarray.
>
> One concern is that IIUC we make swap cache behaves like file cache if
> possible. And your change makes swap cache and file cache diverge more.
> Is it possible for us to keep them similar?

So far in this series, I think there is no problem for that, the two
main helpers for retrieving file & cache offset: folio_index and
folio_file_pos will work fine and be compatible with current users.

And if we convert to share filemap_* functions for swap cache / page
cache, they are mostly already accepting index as an argument so no
trouble at all.

>
> For example,
>
> Is it possible to return the offset inside 64M range in
> __page_file_index() (maybe rename it)?

Not sure what you mean by this, __page_file_index will be gone as we
convert to folio.
And this series did delete / rename it (it might not be easy to see
this, the usage of these helpers is not very well organized before
this series so some clean up is involved).
It was previously only used through page_index (deleted) /
folio_index, and, now folio_index will be returning the offset inside
the 64M range.

I guess I just did what you wanted? :)

My cover letter and commit message might be not clear enough, I can update it.

>
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?

xarray struct seems already very full, and this usage doesn't look
generic to me, might be better to fix this kind of issue case by case.

>
> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability? This is why we have multiple "struct
> address_space" for one swap device. And, we may have same lock
> contention issue for large files too.

Good question, this series can improve the tree depth issue for swap
cache, but contention in address space is still a thing.

A more generic solution might involve changes of xarray API or use
some other data struct?

(BTW I think reducing the search space and resolving lock contention
is not necessarily related, reducing the search space by having a
large table of small trees should still perform better for swap
cache).

>
> I haven't look at the code in details. So, my idea may not make sense
> at all. If so, sorry about that.
>
> Hi, Matthew,
>
> Can you teach me on this too?
>
> --
> Best Regards,
> Huang, Ying

2024-04-22 15:35:12

[permalink] [raw]

Subject: Re: [PATCH 4/8] ceph: drop usage of page_index

On Thu, Apr 18, 2024 at 9:40 AM Xiubo Li <[email protected]> wrote:
> On 4/18/24 09:30, Matthew Wilcox wrote:
> > On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
> >> Thanks for you patch and will it be doable to switch to folio_index()
> >> instead ?
> > No. Just use folio->index. You only need folio_index() if the folio
> > might belong to the swapcache instead of a file.
> >
> Hmm, Okay.
>
> Thanks
>
> - Xiubo
>

Hi Xiubo

Thanks for the comment,

As Matthew mentioned there is no need to use folio_index unless you
are access swapcache. And I found that ceph is not using folios
internally yet, needs a lot of conversions. So I think I'll just keep
using page->index here, later conversions may change it to
folio->index.

2024-04-23 00:34:59

by Xiubo Li

[permalink] [raw]

Subject: Re: [PATCH 4/8] ceph: drop usage of page_index

On 4/22/24 23:34, Kairui Song wrote:
> On Thu, Apr 18, 2024 at 9:40 AM Xiubo Li <[email protected]> wrote:
>> On 4/18/24 09:30, Matthew Wilcox wrote:
>>> On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
>>>> Thanks for you patch and will it be doable to switch to folio_index()
>>>> instead ?
>>> No. Just use folio->index. You only need folio_index() if the folio
>>> might belong to the swapcache instead of a file.
>>>
>> Hmm, Okay.
>>
>> Thanks
>>
>> - Xiubo
>>
> Hi Xiubo
>
> Thanks for the comment,
>
> As Matthew mentioned there is no need to use folio_index unless you
> are access swapcache. And I found that ceph is not using folios
> internally yet, needs a lot of conversions. So I think I'll just keep
> using page->index here, later conversions may change it to
> folio->index.
>
Sounds good to me and thanks for explaining.

Cheers

- Xiubo

2024-04-23 01:31:16

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

Kairui Song <[email protected]> writes:

> On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <[email protected]> wrote:
>>
>> Hi, Kairui,
>>
>> Kairui Song <[email protected]> writes:
>>
>> > From: Kairui Song <[email protected]>
>> >
>> > Currently we use one swap_address_space for every 64M chunk to reduce lock
>> > contention, this is like having a set of smaller swap files inside one
>> > big swap file. But when doing swap cache look up or insert, we are
>> > still using the offset of the whole large swap file. This is OK for
>> > correctness, as the offset (key) is unique.
>> >
>> > But Xarray is specially optimized for small indexes, it creates the
>> > redix tree levels lazily to be just enough to fit the largest key
>> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
>> >
>> > For 64M chunk it should only take at most 3 level to contain everything.
>> > But we are using the offset from the whole swap file, so the offset (key)
>> > value will be way beyond 64M, and so will the tree level.
>> >
>> > Optimize this by reduce the swap cache search space into 64M scope.
>>
>
> Hi,
>
> Thanks for the comments!
>
>> In general, I think that it makes sense to reduce the depth of the
>> xarray.
>>
>> One concern is that IIUC we make swap cache behaves like file cache if
>> possible. And your change makes swap cache and file cache diverge more.
>> Is it possible for us to keep them similar?
>
> So far in this series, I think there is no problem for that, the two
> main helpers for retrieving file & cache offset: folio_index and
> folio_file_pos will work fine and be compatible with current users.
>
> And if we convert to share filemap_* functions for swap cache / page
> cache, they are mostly already accepting index as an argument so no
> trouble at all.
>
>>
>> For example,
>>
>> Is it possible to return the offset inside 64M range in
>> __page_file_index() (maybe rename it)?
>
> Not sure what you mean by this, __page_file_index will be gone as we
> convert to folio.
> And this series did delete / rename it (it might not be easy to see
> this, the usage of these helpers is not very well organized before
> this series so some clean up is involved).
> It was previously only used through page_index (deleted) /
> folio_index, and, now folio_index will be returning the offset inside
> the 64M range.
>
> I guess I just did what you wanted? :)

Good!

> My cover letter and commit message might be not clear enough, I can update it.
>
>>
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> xarray struct seems already very full, and this usage doesn't look
> generic to me, might be better to fix this kind of issue case by case.

Just some open question.

>>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability? This is why we have multiple "struct
>> address_space" for one swap device. And, we may have same lock
>> contention issue for large files too.
>
> Good question, this series can improve the tree depth issue for swap
> cache, but contention in address space is still a thing.

The lock contention for swap cache has been reduced via using multiple
xarray in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks"). But it fixes that for swap cache only, not for file cache in
general. We have observed similar lock contention issue for file cache
too. And the method isn't perfect too, like the issue you found here.
In general, it's about what is "file" for swap device.

> A more generic solution might involve changes of xarray API or use
> some other data struct?
>
> (BTW I think reducing the search space and resolving lock contention
> is not necessarily related, reducing the search space by having a
> large table of small trees should still perform better for swap
> cache).

--
Best Regards,
Huang, Ying

2024-04-23 01:43:26

[permalink] [raw]

Subject: Re: [PATCH 6/8] mm/swap: get the swap file offset directly

Kairui Song <[email protected]> writes:

> From: Kairui Song <[email protected]>
>
> folio_file_pos and page_file_offset are for mixed usage of swap cache
> and page cache, it can't be page cache here, so introduce a new helper
> to get the swap offset in swap file directly.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/page_io.c | 6 +++---
> mm/swap.h | 5 +++++
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ae2b49055e43..93de5aadb438 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -279,7 +279,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
> * be temporary.
> */
> pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
> - ret, page_file_offset(page));
> + ret, swap_file_pos(page_swap_entry(page)));
> for (p = 0; p < sio->pages; p++) {
> page = sio->bvec[p].bv_page;
> set_page_dirty(page);
> @@ -298,7 +298,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
> struct swap_iocb *sio = NULL;
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> struct file *swap_file = sis->swap_file;
> - loff_t pos = folio_file_pos(folio);
> + loff_t pos = swap_file_pos(folio->swap);
>
> count_swpout_vm_event(folio);
> folio_start_writeback(folio);
> @@ -429,7 +429,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> {
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> struct swap_iocb *sio = NULL;
> - loff_t pos = folio_file_pos(folio);
> + loff_t pos = swap_file_pos(folio->swap);
>
> if (plug)
> sio = *plug;
> diff --git a/mm/swap.h b/mm/swap.h
> index fc2f6ade7f80..2de83729aaa8 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -7,6 +7,11 @@ struct mempolicy;
> #ifdef CONFIG_SWAP
> #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +static inline loff_t swap_file_pos(swp_entry_t entry)
> +{
> + return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> +}
> +
> /* linux/mm/page_io.c */
> int sio_pool_init(void);
> struct swap_iocb;

I feel that the file concept for swap is kind of confusing. From the
file cache point of view, one "struct address space" conresponds to one
file. If so, we have a simple file system on a swap device (block
device backed or file backed), where the size of each file is 64M. The
swap entry encode the file system (swap_type), the file name
(swap_offset >> SWAP_ADDRESS_SPACE_SHIFT), and the offset in file (lower
bits of swap_offset).

If the above definition is good, it's better to rename swap_file_pos()
to swap_dev_pos(), because it returns the swap device position of the
swap entry.

And, when we reaches consensus on the swap file related concept, we may
document it somewhere and review all naming in swap code to cleanup.

--
Best Regards,
Huang, Ying

2024-04-23 05:46:46

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?

We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
generalise it, but then we'd have to store that somewhere and there's
no obvious good place to store it that wouldn't enlarge struct xarray,
which I'd be reluctant to do.

> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability? This is why we have multiple "struct
> address_space" for one swap device. And, we may have same lock
> contention issue for large files too.

It's something I've considered. The issue is search marks. If we delete
an entry, we may have to walk all the way up the xarray clearing bits as
we go and I'd rather not grab a lock at each level. There's a convenient
4 byte hole between nr_values and parent where we could put it.

Oh, another issue is that we use i_pages.xa_lock to synchronise
address_space.nrpages, so I'm not sure that a per-node lock will help.

But I'm conscious that there are workloads which show contention on
xa_lock as their limiting factor, so I'm open to ideas to improve all
these things.

2024-04-23 13:34:28

[permalink] [raw]

Subject: Re: [PATCH 6/8] mm/swap: get the swap file offset directly

On Tue, Apr 23, 2024 at 9:43 AM Huang, Ying <[email protected]> wrote:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > folio_file_pos and page_file_offset are for mixed usage of swap cache
> > and page cache, it can't be page cache here, so introduce a new helper
> > to get the swap offset in swap file directly.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/page_io.c | 6 +++---
> > mm/swap.h | 5 +++++
> > 2 files changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index ae2b49055e43..93de5aadb438 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -279,7 +279,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
> > * be temporary.
> > */
> > pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
> > - ret, page_file_offset(page));
> > + ret, swap_file_pos(page_swap_entry(page)));
> > for (p = 0; p < sio->pages; p++) {
> > page = sio->bvec[p].bv_page;
> > set_page_dirty(page);
> > @@ -298,7 +298,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
> > struct swap_iocb *sio = NULL;
> > struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > struct file *swap_file = sis->swap_file;
> > - loff_t pos = folio_file_pos(folio);
> > + loff_t pos = swap_file_pos(folio->swap);
> >
> > count_swpout_vm_event(folio);
> > folio_start_writeback(folio);
> > @@ -429,7 +429,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> > {
> > struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > struct swap_iocb *sio = NULL;
> > - loff_t pos = folio_file_pos(folio);
> > + loff_t pos = swap_file_pos(folio->swap);
> >
> > if (plug)
> > sio = *plug;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index fc2f6ade7f80..2de83729aaa8 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -7,6 +7,11 @@ struct mempolicy;
> > #ifdef CONFIG_SWAP
> > #include <linux/blk_types.h> /* for bio_end_io_t */
> >
> > +static inline loff_t swap_file_pos(swp_entry_t entry)
> > +{
> > + return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> > +}
> > +
> > /* linux/mm/page_io.c */
> > int sio_pool_init(void);
> > struct swap_iocb;
>
> I feel that the file concept for swap is kind of confusing. From the
> file cache point of view, one "struct address space" conresponds to one
> file. If so, we have a simple file system on a swap device (block
> device backed or file backed), where the size of each file is 64M. The
> swap entry encode the file system (swap_type), the file name
> (swap_offset >> SWAP_ADDRESS_SPACE_SHIFT), and the offset in file (lower
> bits of swap_offset).
>
> If the above definition is good, it's better to rename swap_file_pos()
> to swap_dev_pos(), because it returns the swap device position of the
> swap entry.

Good suggestion! The definition looks good to me, swap_dev_pos also
looks better, "swap_file" looks confusing indeed.

>
> And, when we reaches consensus on the swap file related concept, we may
> document it somewhere and review all naming in swap code to cleanup.
>
> --
> Best Regards,
> Huang, Ying

2024-04-24 02:26:19

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

Hi, Matthew,

Matthew Wilcox <[email protected]> writes:

> On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> generalise it, but then we'd have to store that somewhere and there's
> no obvious good place to store it that wouldn't enlarge struct xarray,
> which I'd be reluctant to do.
>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability? This is why we have multiple "struct
>> address_space" for one swap device. And, we may have same lock
>> contention issue for large files too.
>
> It's something I've considered. The issue is search marks. If we delete
> an entry, we may have to walk all the way up the xarray clearing bits as
> we go and I'd rather not grab a lock at each level. There's a convenient
> 4 byte hole between nr_values and parent where we could put it.
>
> Oh, another issue is that we use i_pages.xa_lock to synchronise
> address_space.nrpages, so I'm not sure that a per-node lock will help.

Thanks for looking at this.

> But I'm conscious that there are workloads which show contention on
> xa_lock as their limiting factor, so I'm open to ideas to improve all
> these things.

I have no idea so far because my very limited knowledge about xarray.

--
Best Regards,
Huang, Ying

2024-04-26 23:16:25

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

Hi Ying,

On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
>
> Hi, Matthew,
>
> Matthew Wilcox <[email protected]> writes:
>
> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> >> Is it possible to add "start_offset" support in xarray, so "index"
> >> will subtract "start_offset" before looking up / inserting?
> >
> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > generalise it, but then we'd have to store that somewhere and there's
> > no obvious good place to store it that wouldn't enlarge struct xarray,
> > which I'd be reluctant to do.
> >
> >> Is it possible to use multiple range locks to protect one xarray to
> >> improve the lock scalability? This is why we have multiple "struct
> >> address_space" for one swap device. And, we may have same lock
> >> contention issue for large files too.
> >
> > It's something I've considered. The issue is search marks. If we delete
> > an entry, we may have to walk all the way up the xarray clearing bits as
> > we go and I'd rather not grab a lock at each level. There's a convenient
> > 4 byte hole between nr_values and parent where we could put it.
> >
> > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>
> Thanks for looking at this.
>
> > But I'm conscious that there are workloads which show contention on
> > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > these things.
>
> I have no idea so far because my very limited knowledge about xarray.

For the swap file usage, I have been considering an idea to remove the
index part of the xarray from swap cache. Swap cache is different from
file cache in a few aspects.
For one if we want to have a folio equivalent of "large swap entry".
Then the natural alignment of those swap offset on does not make
sense. Ideally we should be able to write the folio to un-aligned swap
file locations.

The other aspect for swap files is that, we already have different
data structures organized around swap offset, swap_map and
swap_cgroup. If we group the swap related data structure together. We
can add a pointer to a union of folio or a shadow swap entry. We can
use atomic updates on the swap struct member or breakdown the access
lock by ranges just like swap cluster does.

I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Chris

2024-04-28 01:16:33

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

Chris Li <[email protected]> writes:

> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
>>
>> Hi, Matthew,
>>
>> Matthew Wilcox <[email protected]> writes:
>>
>> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> >> Is it possible to add "start_offset" support in xarray, so "index"
>> >> will subtract "start_offset" before looking up / inserting?
>> >
>> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
>> > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
>> > generalise it, but then we'd have to store that somewhere and there's
>> > no obvious good place to store it that wouldn't enlarge struct xarray,
>> > which I'd be reluctant to do.
>> >
>> >> Is it possible to use multiple range locks to protect one xarray to
>> >> improve the lock scalability? This is why we have multiple "struct
>> >> address_space" for one swap device. And, we may have same lock
>> >> contention issue for large files too.
>> >
>> > It's something I've considered. The issue is search marks. If we delete
>> > an entry, we may have to walk all the way up the xarray clearing bits as
>> > we go and I'd rather not grab a lock at each level. There's a convenient
>> > 4 byte hole between nr_values and parent where we could put it.
>> >
>> > Oh, another issue is that we use i_pages.xa_lock to synchronise
>> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>>
>> Thanks for looking at this.
>>
>> > But I'm conscious that there are workloads which show contention on
>> > xa_lock as their limiting factor, so I'm open to ideas to improve all
>> > these things.
>>
>> I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>
> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry.

The shadow swap entry may be freed. So we need to prepare for that.
And, in current design, only swap_map[] is allocated if the swap space
isn't used. That needs to be considered too.

> We can use atomic updates on the swap struct member or breakdown the
> access lock by ranges just like swap cluster does.

The swap code uses xarray in a simple way. That gives us opportunity to
optimize. For example, it makes it easy to use multiple xarray
instances for one swap device.

> I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Good!

--
Best Regards,
Huang, Ying

2024-04-28 02:43:44

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > Hi Ying,
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry.
>
> The shadow swap entry may be freed. So we need to prepare for that.

Free the shadow swap entry will just set the pointer to NULL.
Are you concerned that the memory allocated for the pointer is not
free to the system after the shadow swap entry is free?

It will be subject to fragmentation on the free swap entry.
In that regard, xarray is also subject to fragmentation. It will not
free the internal node if the node has one xa_index not freed. Even if
the xarray node is freed to slab, at slab level there is fragmentation
as well, the backing page might not free to the system.

> And, in current design, only swap_map[] is allocated if the swap space
> isn't used. That needs to be considered too.

I am aware of that. I want to make the swap_map[] not static allocated
any more either.
The swap_map static allocation forces the rest of the swap data
structure to have other means to sparsely allocate their data
structure, repeating the fragmentation elsewhere, in different
ways.That is also the one major source of the pain point hacking on
the swap code. The data structure is spread into too many different
places.

> > We can use atomic updates on the swap struct member or breakdown the
> > access lock by ranges just like swap cluster does.
>
> The swap code uses xarray in a simple way. That gives us opportunity to
> optimize. For example, it makes it easy to use multiple xarray

The fixed swap offset range makes it like an array. There are many
ways to shard the array like swap entry, e.g. swap cluster is one way
to shard it. Multiple xarray is another way. We can also do multiple
xarray like sharding, or even more fancy ones.

Chris

2024-04-28 03:23:33

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

Chris Li <[email protected]> writes:

> On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > Hi Ying,
>> >
>> > For the swap file usage, I have been considering an idea to remove the
>> > index part of the xarray from swap cache. Swap cache is different from
>> > file cache in a few aspects.
>> > For one if we want to have a folio equivalent of "large swap entry".
>> > Then the natural alignment of those swap offset on does not make
>> > sense. Ideally we should be able to write the folio to un-aligned swap
>> > file locations.
>> >
>> > The other aspect for swap files is that, we already have different
>> > data structures organized around swap offset, swap_map and
>> > swap_cgroup. If we group the swap related data structure together. We
>> > can add a pointer to a union of folio or a shadow swap entry.
>>
>> The shadow swap entry may be freed. So we need to prepare for that.
>
> Free the shadow swap entry will just set the pointer to NULL.
> Are you concerned that the memory allocated for the pointer is not
> free to the system after the shadow swap entry is free?
>
> It will be subject to fragmentation on the free swap entry.
> In that regard, xarray is also subject to fragmentation. It will not
> free the internal node if the node has one xa_index not freed. Even if
> the xarray node is freed to slab, at slab level there is fragmentation
> as well, the backing page might not free to the system.

Sorry my words were confusing. What I wanted to say is that the xarray
node may be freed.

>> And, in current design, only swap_map[] is allocated if the swap space
>> isn't used. That needs to be considered too.
>
> I am aware of that. I want to make the swap_map[] not static allocated
> any more either.

Yes. That's possible.

> The swap_map static allocation forces the rest of the swap data
> structure to have other means to sparsely allocate their data
> structure, repeating the fragmentation elsewhere, in different
> ways.That is also the one major source of the pain point hacking on
> the swap code. The data structure is spread into too many different
> places.

Look forward to more details to compare :-)

>> > We can use atomic updates on the swap struct member or breakdown the
>> > access lock by ranges just like swap cluster does.
>>
>> The swap code uses xarray in a simple way. That gives us opportunity to
>> optimize. For example, it makes it easy to use multiple xarray
>
> The fixed swap offset range makes it like an array. There are many
> ways to shard the array like swap entry, e.g. swap cluster is one way
> to shard it. Multiple xarray is another way. We can also do multiple
> xarray like sharding, or even more fancy ones.

--
Best Regards,
Huang, Ying

2024-04-28 17:27:04

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

On Sat, Apr 27, 2024 at 8:23 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Chris Li <[email protected]> writes:
> > Free the shadow swap entry will just set the pointer to NULL.
> > Are you concerned that the memory allocated for the pointer is not
> > free to the system after the shadow swap entry is free?
> >
> > It will be subject to fragmentation on the free swap entry.
> > In that regard, xarray is also subject to fragmentation. It will not
> > free the internal node if the node has one xa_index not freed. Even if
> > the xarray node is freed to slab, at slab level there is fragmentation
> > as well, the backing page might not free to the system.
>
> Sorry my words were confusing. What I wanted to say is that the xarray
> node may be freed.

Somehow I get that is what you mean :-) My previous reply still
applies here. The xarray node freeing will be subject to the
fragmentation at slab level. The actual backing page might not release
to the kernel after the node freeing.

>
> >> And, in current design, only swap_map[] is allocated if the swap space
> >> isn't used. That needs to be considered too.
> >
> > I am aware of that. I want to make the swap_map[] not static allocated
> > any more either.
>
> Yes. That's possible.

Of course there will be a price to pay for that. The current swap_map
is only 1 byte per entry. That swap map count size per swap entry is
going to be hard to beat in the alternatives. Hopefully find the trade
off in other places.

>
> > The swap_map static allocation forces the rest of the swap data
> > structure to have other means to sparsely allocate their data
> > structure, repeating the fragmentation elsewhere, in different
> > ways.That is also the one major source of the pain point hacking on
> > the swap code. The data structure is spread into too many different
> > places.
>
> Look forward to more details to compare :-)

Sure. When I make more progress I will post it.

Chris

2024-04-28 17:37:35

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

On Sat, Apr 27, 2024 at 7:16 AM Chris Li <[email protected]> wrote:
>
> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
> >
> > Hi, Matthew,
> >
> > Matthew Wilcox <[email protected]> writes:
> >
> > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > >> Is it possible to add "start_offset" support in xarray, so "index"
> > >> will subtract "start_offset" before looking up / inserting?
> > >
> > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > > generalise it, but then we'd have to store that somewhere and there's
> > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > which I'd be reluctant to do.
> > >
> > >> Is it possible to use multiple range locks to protect one xarray to
> > >> improve the lock scalability? This is why we have multiple "struct
> > >> address_space" for one swap device. And, we may have same lock
> > >> contention issue for large files too.
> > >
> > > It's something I've considered. The issue is search marks. If we delete
> > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > we go and I'd rather not grab a lock at each level. There's a convenient
> > > 4 byte hole between nr_values and parent where we could put it.
> > >
> > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> >
> > Thanks for looking at this.
> >
> > > But I'm conscious that there are workloads which show contention on
> > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > these things.
> >
> > I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>

Hi Chris,

This sound interesting, I have a few questions though...

Are you suggesting we handle swap on file and swap on device
differently? Swap on file is much less frequently used than swap on
device I think.

And what do you mean "index part of the xarray"? If we need a cache,
xarray still seems one of the best choices to hold the content.

> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry. We can
> use atomic updates on the swap struct member or breakdown the access
> lock by ranges just like swap cluster does.
>
> I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Looking forward to it!

>
> Chris

2024-04-28 17:46:22

[permalink] [raw]

Subject: Re: [PATCH 0/8] mm/swap: optimize swap cache search space

On Mon, Apr 29, 2024 at 1:37 AM Kairui Song <[email protected]> wrote:
>
> On Sat, Apr 27, 2024 at 7:16 AM Chris Li <[email protected]> wrote:
> >
> > Hi Ying,
> >
> > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
> > >
> > > Hi, Matthew,
> > >
> > > Matthew Wilcox <[email protected]> writes:
> > >
> > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > > >> Is it possible to add "start_offset" support in xarray, so "index"
> > > >> will subtract "start_offset" before looking up / inserting?
> > > >
> > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > > > generalise it, but then we'd have to store that somewhere and there's
> > > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > > which I'd be reluctant to do.
> > > >
> > > >> Is it possible to use multiple range locks to protect one xarray to
> > > >> improve the lock scalability? This is why we have multiple "struct
> > > >> address_space" for one swap device. And, we may have same lock
> > > >> contention issue for large files too.
> > > >
> > > > It's something I've considered. The issue is search marks. If we delete
> > > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > > we go and I'd rather not grab a lock at each level. There's a convenient
> > > > 4 byte hole between nr_values and parent where we could put it.
> > > >
> > > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> > >
> > > Thanks for looking at this.
> > >
> > > > But I'm conscious that there are workloads which show contention on
> > > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > > these things.
> > >
> > > I have no idea so far because my very limited knowledge about xarray.
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
>
> Hi Chris,
>
> This sound interesting, I have a few questions though...
>
> Are you suggesting we handle swap on file and swap on device
> differently? Swap on file is much less frequently used than swap on
> device I think.
>
> And what do you mean "index part of the xarray"? If we need a cache,
> xarray still seems one of the best choices to hold the content.
>
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry. We can
> > use atomic updates on the swap struct member or breakdown the access
> > lock by ranges just like swap cluster does.

Oh, and BTW I'm also trying to breakdown the swap address space range
(from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to
12). It's a simple approach, but the coupling and increased memory
usage of address_space structure makes the performance go into
regression (about -2% for worst real world workload). I found this
part very performance sensitive, so basically I'm not making much
progress for the future items I mentioned in this cover letter. New
ideas could be very helpful!

> >
> > I want to discuss those ideas in the upcoming LSF/MM meet up as well.
>
> Looking forward to it!
>
> >
> > Chris

2024-04-29 05:51:00