From: Kairui Song <[email protected]>
Currently we use one swap_address_space for every 64M chunk to reduce lock
contention; this is like having a set of smaller swap files inside one
big swap file. But when doing a swap cache lookup or insert, we are
still using the offset into the whole large swap file. This is OK for
correctness, as the offset (key) is unique.
But Xarray is specially optimized for small indexes: it creates the
radix tree levels lazily, just deep enough to fit the largest key
stored in one Xarray. So we are wasting tree nodes unnecessarily.
A 64M chunk should only take at most 3 levels to contain everything,
but since we are using the offset from the whole swap file, the offset
(key) value will be way beyond 64M, and so will the tree depth.
Optimize this by reducing the swap cache search space to a 64M scope.
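To illustrate the idea, here is a minimal sketch of the masking that patch
8/8 introduces as swap_cache_index() (the name _sketch and the level math
below are illustrative only; it assumes the default XA_CHUNK_SHIFT of 6
and that swp_offset()'s declaration is visible to the header):

	/* Sketch of the 64M-scoped key; the real helper is added in mm/swap.h */
	static inline pgoff_t swap_cache_index_sketch(swp_entry_t entry)
	{
		/* keep only the offset inside one 64M swap_address_space */
		return swp_offset(entry) & ((1UL << SWAP_ADDRESS_SPACE_SHIFT) - 1);
	}

	/*
	 * With 6 index bits per xarray node, keys up to 2^14 (64M worth of
	 * pages) need ceil(14 / 6) = 3 levels, while raw offsets into a 128G
	 * swap file reach ~2^25 and need 5 levels.
	 */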
Test with `time memhog 128G` inside an 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped; tested 3 times, results are stable. The
test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
is enabled, as the swap-out path can never skip the swap cache):
Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps
After (+1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps
Similar results with MySQL and sysbench using swap:
Before:
94055.61 qps
After (+0.8% faster):
94834.91 qps
There is also a very slight drop in radix tree node slab usage:
Before: 303952K
After: 302224K
For this series:
There are multiple places that expect a mixed type of pages (page cache or
swap cache), e.g. migration and huge memory split; there are four helpers
for that:
- page_index
- page_file_offset
- folio_index
- folio_file_pos
So this series first cleans up the usage of page_index and
page_file_offset, then converts folio_index and folio_file_pos to be
compatible with the separate offsets. It then introduces a new helper,
swap_cache_index, for swap-internal usage, and replaces swp_offset with
swap_cache_index where it is used to retrieve a folio from the swap cache.
Ideally, we may also want to reduce SWAP_ADDRESS_SPACE_SHIFT from 14 to
12: the default Xarray chunk shift is 6, so we get 3-level trees instead
of 2-level trees just for 2 extra bits. But the swap cache is based on the
address_space struct, and with 4 times more of that metadata sparsely
distributed in memory it wastes more cachelines, so the performance gain
from this series is almost canceled out. So for now, just have a cleaner
separation of offsets.
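For reference, the level math behind that trade-off (a rough estimate,
assuming the default XA_CHUNK_SHIFT of 6, i.e. 6 index bits per node):

	SWAP_ADDRESS_SPACE_SHIFT = 14: keys up to 2^14, ceil(14 / 6) = 3 levels
	SWAP_ADDRESS_SPACE_SHIFT = 12: keys up to 2^12, ceil(12 / 6) = 2 levels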
Patch 1/8 - 6/8: Clean up usage of page_index and page_file_offset
Patch 7/8: Convert folio_index and folio_file_pos to be compatible with
separate offsets.
Patch 8/8: Introduce swap_cache_index and use it when doing lookups in the
swap cache.
This series is part of an effort to reduce swap cache overhead, and ultimately
remove SWP_SYNCHRONOUS_IO and unify swap cache usage, as proposed before:
https://lore.kernel.org/lkml/[email protected]/
Kairui Song (8):
NFS: remove nfs_page_lengthg and usage of page_index
nilfs2: drop usage of page_index
f2fs: drop usage of page_index
ceph: drop usage of page_index
cifs: drop usage of page_file_offset
mm/swap: get the swap file offset directly
mm: drop page_index/page_file_offset and convert swap helpers to use
folio
mm/swap: reduce swap cache search space
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 2 +-
fs/f2fs/data.c | 5 ++---
fs/nfs/internal.h | 19 -------------------
fs/nilfs2/bmap.c | 3 +--
fs/smb/client/file.c | 2 +-
include/linux/mm.h | 13 -------------
include/linux/pagemap.h | 19 +++++++++----------
mm/huge_memory.c | 2 +-
mm/memcontrol.c | 2 +-
mm/mincore.c | 2 +-
mm/page_io.c | 6 +++---
mm/shmem.c | 2 +-
mm/swap.h | 12 ++++++++++++
mm/swap_state.c | 12 ++++++------
mm/swapfile.c | 17 +++++++++++------
16 files changed, 51 insertions(+), 69 deletions(-)
--
2.44.0
From: Kairui Song <[email protected]>
page_index is only needed for mixed usage of page cache and swap cache;
for pure page cache usage, the caller can just use page->index instead.
It can't be a swap cache page here (it is attached to a buffer head),
so just drop it.
Signed-off-by: Kairui Song <[email protected]>
Cc: Ryusuke Konishi <[email protected]>
Cc: [email protected]
---
fs/nilfs2/bmap.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
index 383f0afa2cea..4594a4459862 100644
--- a/fs/nilfs2/bmap.c
+++ b/fs/nilfs2/bmap.c
@@ -453,8 +453,7 @@ __u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap,
struct buffer_head *pbh;
__u64 key;
- key = page_index(bh->b_page) << (PAGE_SHIFT -
- bmap->b_inode->i_blkbits);
+ key = bh->b_page->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);
for (pbh = page_buffers(bh->b_page); pbh != bh; pbh = pbh->b_this_page)
key++;
--
2.44.0
From: Kairui Song <[email protected]>
page_file_offset is only needed for mixed usage of page cache and
swap cache; for pure page cache usage, the caller can just use
page_offset instead.
It can't be a swap cache page here, so just drop it.
Signed-off-by: Kairui Song <[email protected]>
Cc: Steve French <[email protected]>
Cc: Namjae Jeon <[email protected]>
Cc: Paulo Alcantara <[email protected]>
Cc: Shyam Prasad N <[email protected]>
Cc: Bharath SM <[email protected]>
---
fs/smb/client/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 16aadce492b2..73bbd761bf32 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -4749,7 +4749,7 @@ static int cifs_readpage_worker(struct file *file, struct page *page,
static int cifs_read_folio(struct file *file, struct folio *folio)
{
struct page *page = &folio->page;
- loff_t offset = page_file_offset(page);
+ loff_t offset = page_offset(page);
int rc = -EACCES;
unsigned int xid;
--
2.44.0
On Thu, Apr 18, 2024 at 12:08:36AM +0800, Kairui Song wrote:
> +++ b/fs/nilfs2/bmap.c
> @@ -453,8 +453,7 @@ __u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap,
> struct buffer_head *pbh;
> __u64 key;
>
> - key = page_index(bh->b_page) << (PAGE_SHIFT -
> - bmap->b_inode->i_blkbits);
> + key = bh->b_page->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);
I'd prefer this were
key = bh->b_folio->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);
(pages only have a ->index field for historical reasons; I'm trying to
get rid of it)
From: Kairui Song <[email protected]>
When applied to swap cache pages, page_index / page_file_offset were used
to retrieve the swap cache index or swap file offset of a page, and they
have their folio equivalents: folio_index / folio_file_pos.
We have eliminated all users of page_index / page_file_offset; everything
is using folio_index / folio_file_pos now, so remove the old helpers.
Then convert the implementation of folio_index / folio_file_pos
to use folios natively.
After this commit, all users that might encounter mixed usage of swap
cache and page cache will only use the following two helpers:
folio_index (calls __folio_swap_cache_index)
folio_file_pos (calls __folio_swap_file_pos)
The offset in the swap file and the index in the swap cache are still
basically the same thing at this moment, but they will differ in following
commits.
Signed-off-by: Kairui Song <[email protected]>
---
include/linux/mm.h | 13 -------------
include/linux/pagemap.h | 19 +++++++++----------
mm/swapfile.c | 13 +++++++++----
3 files changed, 18 insertions(+), 27 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0436b919f1c7..797480e76c9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2245,19 +2245,6 @@ static inline void *folio_address(const struct folio *folio)
return page_address(&folio->page);
}
-extern pgoff_t __page_file_index(struct page *page);
-
-/*
- * Return the pagecache index of the passed page. Regular pagecache pages
- * use ->index whereas swapcache pages use swp_offset(->private)
- */
-static inline pgoff_t page_index(struct page *page)
-{
- if (unlikely(PageSwapCache(page)))
- return __page_file_index(page);
- return page->index;
-}
-
/*
* Return true only if the page has been allocated with
* ALLOC_NO_WATERMARKS and the low watermark was not
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..313f3144406e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -780,7 +780,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
mapping_gfp_mask(mapping));
}
-#define swapcache_index(folio) __page_file_index(&(folio)->page)
+extern pgoff_t __folio_swap_cache_index(struct folio *folio);
/**
* folio_index - File index of a folio.
@@ -795,9 +795,9 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
*/
static inline pgoff_t folio_index(struct folio *folio)
{
- if (unlikely(folio_test_swapcache(folio)))
- return swapcache_index(folio);
- return folio->index;
+ if (unlikely(folio_test_swapcache(folio)))
+ return __folio_swap_cache_index(folio);
+ return folio->index;
}
/**
@@ -920,11 +920,6 @@ static inline loff_t page_offset(struct page *page)
return ((loff_t)page->index) << PAGE_SHIFT;
}
-static inline loff_t page_file_offset(struct page *page)
-{
- return ((loff_t)page_index(page)) << PAGE_SHIFT;
-}
-
/**
* folio_pos - Returns the byte position of this folio in its file.
* @folio: The folio.
@@ -934,6 +929,8 @@ static inline loff_t folio_pos(struct folio *folio)
return page_offset(&folio->page);
}
+extern loff_t __folio_swap_file_pos(struct folio *folio);
+
/**
* folio_file_pos - Returns the byte position of this folio in its file.
* @folio: The folio.
@@ -943,7 +940,9 @@ static inline loff_t folio_pos(struct folio *folio)
*/
static inline loff_t folio_file_pos(struct folio *folio)
{
- return page_file_offset(&folio->page);
+ if (unlikely(folio_test_swapcache(folio)))
+ return __folio_swap_file_pos(folio);
+ return ((loff_t)folio->index << PAGE_SHIFT);
}
/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4919423cce76..0c36a5c2400f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3419,12 +3419,17 @@ struct address_space *swapcache_mapping(struct folio *folio)
}
EXPORT_SYMBOL_GPL(swapcache_mapping);
-pgoff_t __page_file_index(struct page *page)
+pgoff_t __folio_swap_cache_index(struct folio *folio)
{
- swp_entry_t swap = page_swap_entry(page);
- return swp_offset(swap);
+ return swp_offset(folio->swap);
}
-EXPORT_SYMBOL_GPL(__page_file_index);
+EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
+
+loff_t __folio_swap_file_pos(struct folio *folio)
+{
+ return swap_file_pos(folio->swap);
+}
+EXPORT_SYMBOL_GPL(__folio_swap_file_pos);
/*
* add_swap_count_continuation - called when a swap count is duplicated
--
2.44.0
From: Kairui Song <[email protected]>
page_index is only needed for mixed usage of page cache and swap cache;
for pure page cache usage, the caller can just use page->index instead.
It can't be a swap cache page here, so just drop it.
Signed-off-by: Kairui Song <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: [email protected]
---
fs/f2fs/data.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index d9494b5fc7c1..12d5bbd18755 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2057,7 +2057,7 @@ static int f2fs_read_single_page(struct inode *inode, struct page *page,
sector_t block_nr;
int ret = 0;
- block_in_file = (sector_t)page_index(page);
+ block_in_file = (sector_t)page->index;
last_block = block_in_file + nr_pages;
last_block_in_file = bytes_to_blks(inode,
f2fs_readpage_limit(inode) + blocksize - 1);
@@ -4086,8 +4086,7 @@ void f2fs_clear_page_cache_dirty_tag(struct page *page)
unsigned long flags;
xa_lock_irqsave(&mapping->i_pages, flags);
- __xa_clear_mark(&mapping->i_pages, page_index(page),
- PAGECACHE_TAG_DIRTY);
+ __xa_clear_mark(&mapping->i_pages, page->index, PAGECACHE_TAG_DIRTY);
xa_unlock_irqrestore(&mapping->i_pages, flags);
}
--
2.44.0
From: Kairui Song <[email protected]>
page_index is only needed for mixed usage of page cache and swap cache;
for pure page cache usage, the caller can just use page->index instead.
It can't be a swap cache page here, so just drop it.
Signed-off-by: Kairui Song <[email protected]>
Cc: Xiubo Li <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: [email protected]
---
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0e9f56eaba1e..570a9d634cc5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -141,7 +141,7 @@ __dcache_find_get_entry(struct dentry *parent, u64 idx,
if (ptr_pos >= i_size_read(dir))
return NULL;
- if (!cache_ctl->page || ptr_pgoff != page_index(cache_ctl->page)) {
+ if (!cache_ctl->page || ptr_pgoff != cache_ctl->page->index) {
ceph_readdir_cache_release(cache_ctl);
cache_ctl->page = find_lock_page(&dir->i_data, ptr_pgoff);
if (!cache_ctl->page) {
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 7b2e77517f23..1f92d3faaa6b 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1861,7 +1861,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn,
unsigned idx = ctl->index % nsize;
pgoff_t pgoff = ctl->index / nsize;
- if (!ctl->page || pgoff != page_index(ctl->page)) {
+ if (!ctl->page || pgoff != ctl->page->index) {
ceph_readdir_cache_release(ctl);
if (idx == 0)
ctl->page = grab_cache_page(&dir->i_data, pgoff);
--
2.44.0
From: Kairui Song <[email protected]>
folio_file_pos and page_file_offset are for mixed usage of swap cache
and page cache; it can't be a page cache page here, so introduce a new
helper to get the swap offset in the swap file directly.
Signed-off-by: Kairui Song <[email protected]>
---
mm/page_io.c | 6 +++---
mm/swap.h | 5 +++++
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index ae2b49055e43..93de5aadb438 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -279,7 +279,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
* be temporary.
*/
pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
- ret, page_file_offset(page));
+ ret, swap_file_pos(page_swap_entry(page)));
for (p = 0; p < sio->pages; p++) {
page = sio->bvec[p].bv_page;
set_page_dirty(page);
@@ -298,7 +298,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
struct swap_iocb *sio = NULL;
struct swap_info_struct *sis = swp_swap_info(folio->swap);
struct file *swap_file = sis->swap_file;
- loff_t pos = folio_file_pos(folio);
+ loff_t pos = swap_file_pos(folio->swap);
count_swpout_vm_event(folio);
folio_start_writeback(folio);
@@ -429,7 +429,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
struct swap_info_struct *sis = swp_swap_info(folio->swap);
struct swap_iocb *sio = NULL;
- loff_t pos = folio_file_pos(folio);
+ loff_t pos = swap_file_pos(folio->swap);
if (plug)
sio = *plug;
diff --git a/mm/swap.h b/mm/swap.h
index fc2f6ade7f80..2de83729aaa8 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -7,6 +7,11 @@ struct mempolicy;
#ifdef CONFIG_SWAP
#include <linux/blk_types.h> /* for bio_end_io_t */
+static inline loff_t swap_file_pos(swp_entry_t entry)
+{
+ return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
+}
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
--
2.44.0
From: Kairui Song <[email protected]>
Currently we use one swap_address_space for every 64M chunk to reduce lock
contention; this is like having a set of smaller swap files inside one
big swap file. But when doing a swap cache lookup or insert, we are
still using the offset into the whole large swap file. This is OK for
correctness, as the offset (key) is unique.
But Xarray is specially optimized for small indexes: it creates the
radix tree levels lazily, just deep enough to fit the largest key
stored in one Xarray. So we are wasting tree nodes unnecessarily.
A 64M chunk should only take at most 3 levels to contain everything,
but since we are using the offset from the whole swap file, the offset
(key) value will be way beyond 64M, and so will the tree depth.
Optimize this by using a new helper swap_cache_index to get a swap
entry's unique offset in its own 64M swap_address_space.
I see a ~1% performance gain in benchmarks and an actual workload with
high memory pressure.
Test with `time memhog 128G` inside an 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped; tested 3 times, results are stable. The
test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
is enabled, as the swap-out path can never skip the swap cache):
Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps
After (1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps
Similar results with MySQL and sysbench using swap:
Before:
94055.61 qps
After (0.8% faster):
94834.91 qps
Radix tree slab usage is also very slightly lower.
Signed-off-by: Kairui Song <[email protected]>
---
mm/huge_memory.c | 2 +-
mm/memcontrol.c | 2 +-
mm/mincore.c | 2 +-
mm/shmem.c | 2 +-
mm/swap.h | 7 +++++++
mm/swap_state.c | 12 ++++++------
mm/swapfile.c | 6 +++---
7 files changed, 20 insertions(+), 13 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..1208d60792f0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2903,7 +2903,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
split_page_memcg(head, order, new_order);
if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
- offset = swp_offset(folio->swap);
+ offset = swap_cache_index(folio->swap);
swap_cache = swap_address_space(folio->swap);
xa_lock(&swap_cache->i_pages);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fabce2b50c69..04d7be7f30dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5934,7 +5934,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
* Because swap_cache_get_folio() updates some statistics counter,
* we call find_get_page() with swapper_space directly.
*/
- page = find_get_page(swap_address_space(ent), swp_offset(ent));
+ page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
entry->val = ent.val;
return page;
diff --git a/mm/mincore.c b/mm/mincore.c
index dad3622cc963..e31cf1bde614 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -139,7 +139,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
} else {
#ifdef CONFIG_SWAP
*vec = mincore_page(swap_address_space(entry),
- swp_offset(entry));
+ swap_cache_index(entry));
#else
WARN_ON(1);
*vec = 1;
diff --git a/mm/shmem.c b/mm/shmem.c
index 0aad0d9a621b..cbe33ab52a73 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1762,7 +1762,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
old = *foliop;
entry = old->swap;
- swap_index = swp_offset(entry);
+ swap_index = swap_cache_index(entry);
swap_mapping = swap_address_space(entry);
/*
diff --git a/mm/swap.h b/mm/swap.h
index 2de83729aaa8..6ef237d2b029 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -31,11 +31,18 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
/* One swap address space for each 64M swap space */
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
+#define SWAP_ADDRESS_SPACE_MASK (BIT(SWAP_ADDRESS_SPACE_SHIFT) - 1)
extern struct address_space *swapper_spaces[];
#define swap_address_space(entry) \
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])
+static inline pgoff_t swap_cache_index(swp_entry_t entry)
+{
+ BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
+ return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
+}
+
void show_swap_cache_info(void);
bool add_to_swap(struct folio *folio);
void *get_shadow_from_swap_cache(swp_entry_t entry);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bfc7e8c58a6d..9dbb54c72770 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -72,7 +72,7 @@ void show_swap_cache_info(void)
void *get_shadow_from_swap_cache(swp_entry_t entry)
{
struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
struct page *page;
page = xa_load(&address_space->i_pages, idx);
@@ -89,7 +89,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
unsigned long i, nr = folio_nr_pages(folio);
void *old;
@@ -144,7 +144,7 @@ void __delete_from_swap_cache(struct folio *folio,
struct address_space *address_space = swap_address_space(entry);
int i;
long nr = folio_nr_pages(folio);
- pgoff_t idx = swp_offset(entry);
+ pgoff_t idx = swap_cache_index(entry);
XA_STATE(xas, &address_space->i_pages, idx);
xas_set_update(&xas, workingset_update_node);
@@ -350,7 +350,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
{
struct folio *folio;
- folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (!IS_ERR(folio)) {
bool vma_ra = swap_use_vma_readahead();
bool readahead;
@@ -420,7 +420,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
si = get_swap_device(swp);
if (!si)
return ERR_PTR(-ENOENT);
- index = swp_offset(swp);
+ index = swap_cache_index(swp);
folio = filemap_get_folio(swap_address_space(swp), index);
put_swap_device(si);
return folio;
@@ -447,7 +447,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* that would confuse statistics.
*/
folio = filemap_get_folio(swap_address_space(entry),
- swp_offset(entry));
+ swap_cache_index(entry));
if (!IS_ERR(folio))
goto got_folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0c36a5c2400f..2e8df95977b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -138,7 +138,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
struct folio *folio;
int ret = 0;
- folio = filemap_get_folio(swap_address_space(entry), offset);
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (IS_ERR(folio))
return 0;
/*
@@ -2110,7 +2110,7 @@ static int try_to_unuse(unsigned int type)
(i = find_next_to_unuse(si, i)) != 0) {
entry = swp_entry(type, i);
- folio = filemap_get_folio(swap_address_space(entry), i);
+ folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (IS_ERR(folio))
continue;
@@ -3421,7 +3421,7 @@ EXPORT_SYMBOL_GPL(swapcache_mapping);
pgoff_t __folio_swap_cache_index(struct folio *folio)
{
- return swp_offset(folio->swap);
+ return swap_cache_index(folio->swap);
}
EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
--
2.44.0
On Thu, Apr 18, 2024 at 12:08:39AM +0800, Kairui Song wrote:
> +++ b/fs/smb/client/file.c
> @@ -4749,7 +4749,7 @@ static int cifs_readpage_worker(struct file *file, struct page *page,
> static int cifs_read_folio(struct file *file, struct folio *folio)
> {
> struct page *page = &folio->page;
> - loff_t offset = page_file_offset(page);
> + loff_t offset = page_offset(page);
loff_t offset = folio_pos(folio);
On 4/18/24 00:08, Kairui Song wrote:
> From: Kairui Song <[email protected]>
>
> page_index is needed for mixed usage of page cache and swap cache,
> for pure page cache usage, the caller can just use page->index instead.
>
> It can't be a swap cache page here, so just drop it.
>
> Signed-off-by: Kairui Song <[email protected]>
> Cc: Xiubo Li <[email protected]>
> Cc: Ilya Dryomov <[email protected]>
> Cc: Jeff Layton <[email protected]>
> Cc: [email protected]
> ---
> fs/ceph/dir.c | 2 +-
> fs/ceph/inode.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 0e9f56eaba1e..570a9d634cc5 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -141,7 +141,7 @@ __dcache_find_get_entry(struct dentry *parent, u64 idx,
> if (ptr_pos >= i_size_read(dir))
> return NULL;
>
> - if (!cache_ctl->page || ptr_pgoff != page_index(cache_ctl->page)) {
> + if (!cache_ctl->page || ptr_pgoff != cache_ctl->page->index) {
> ceph_readdir_cache_release(cache_ctl);
> cache_ctl->page = find_lock_page(&dir->i_data, ptr_pgoff);
> if (!cache_ctl->page) {
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 7b2e77517f23..1f92d3faaa6b 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1861,7 +1861,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn,
> unsigned idx = ctl->index % nsize;
> pgoff_t pgoff = ctl->index / nsize;
>
> - if (!ctl->page || pgoff != page_index(ctl->page)) {
> + if (!ctl->page || pgoff != ctl->page->index) {
Hi Kairui,
Thanks for your patch, and would it be doable to switch to folio_index()
instead?
Cheers,
- Xiubo
> ceph_readdir_cache_release(ctl);
> if (idx == 0)
> ctl->page = grab_cache_page(&dir->i_data, pgoff);
On 4/18/24 09:30, Matthew Wilcox wrote:
> On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
>> Thanks for you patch and will it be doable to switch to folio_index()
>> instead ?
> No. Just use folio->index. You only need folio_index() if the folio
> might belong to the swapcache instead of a file.
>
Hmm, Okay.
Thanks
- Xiubo
On Thu, Apr 18, 2024 at 4:12 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> When applied on swap cache pages, page_index / page_file_offset was used
> to retrieve the swap cache index or swap file offset of a page, and they
> have their folio equivalence version: folio_index / folio_file_pos.
>
> We have eliminated all users for page_index / page_file_offset, everything
> is using folio_index / folio_file_pos now, so remove the old helpers.
>
> Then convert the implementation of folio_index / folio_file_pos to
> to use folio natively.
>
> After this commit, all users that might encounter mixed usage of swap
> cache and page cache will only use following two helpers:
>
> folio_index (calls __folio_swap_cache_index)
> folio_file_pos (calls __folio_swap_file_pos)
>
> The offset in swap file and index in swap cache is still basically the
> same thing at this moment, but will be different in following commits.
>
> Signed-off-by: Kairui Song <[email protected]>
Hi Kairui, thanks!
I also find it rather odd that folio_file_page() is utilized for both
swp and file.
mm/memory.c <<do_swap_page>>
page = folio_file_page(folio, swp_offset(entry));
mm/swap_state.c <<swapin_readahead>>
return folio_file_page(folio, swp_offset(entry));
mm/swapfile.c <<unuse_pte>>
page = folio_file_page(folio, swp_offset(entry));
Do you believe it's worthwhile to tidy up?
> ---
> include/linux/mm.h | 13 -------------
> include/linux/pagemap.h | 19 +++++++++----------
> mm/swapfile.c | 13 +++++++++----
> 3 files changed, 18 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0436b919f1c7..797480e76c9c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2245,19 +2245,6 @@ static inline void *folio_address(const struct folio *folio)
> return page_address(&folio->page);
> }
>
> -extern pgoff_t __page_file_index(struct page *page);
> -
> -/*
> - * Return the pagecache index of the passed page. Regular pagecache pages
> - * use ->index whereas swapcache pages use swp_offset(->private)
> - */
> -static inline pgoff_t page_index(struct page *page)
> -{
> - if (unlikely(PageSwapCache(page)))
> - return __page_file_index(page);
> - return page->index;
> -}
> -
> /*
> * Return true only if the page has been allocated with
> * ALLOC_NO_WATERMARKS and the low watermark was not
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2df35e65557d..313f3144406e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -780,7 +780,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
> mapping_gfp_mask(mapping));
> }
>
> -#define swapcache_index(folio) __page_file_index(&(folio)->page)
> +extern pgoff_t __folio_swap_cache_index(struct folio *folio);
>
> /**
> * folio_index - File index of a folio.
> @@ -795,9 +795,9 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
> */
> static inline pgoff_t folio_index(struct folio *folio)
> {
> - if (unlikely(folio_test_swapcache(folio)))
> - return swapcache_index(folio);
> - return folio->index;
> + if (unlikely(folio_test_swapcache(folio)))
> + return __folio_swap_cache_index(folio);
> + return folio->index;
> }
>
> /**
> @@ -920,11 +920,6 @@ static inline loff_t page_offset(struct page *page)
> return ((loff_t)page->index) << PAGE_SHIFT;
> }
>
> -static inline loff_t page_file_offset(struct page *page)
> -{
> - return ((loff_t)page_index(page)) << PAGE_SHIFT;
> -}
> -
> /**
> * folio_pos - Returns the byte position of this folio in its file.
> * @folio: The folio.
> @@ -934,6 +929,8 @@ static inline loff_t folio_pos(struct folio *folio)
> return page_offset(&folio->page);
> }
>
> +extern loff_t __folio_swap_file_pos(struct folio *folio);
> +
> /**
> * folio_file_pos - Returns the byte position of this folio in its file.
> * @folio: The folio.
> @@ -943,7 +940,9 @@ static inline loff_t folio_pos(struct folio *folio)
> */
> static inline loff_t folio_file_pos(struct folio *folio)
> {
> - return page_file_offset(&folio->page);
> + if (unlikely(folio_test_swapcache(folio)))
> + return __folio_swap_file_pos(folio);
> + return ((loff_t)folio->index << PAGE_SHIFT);
> }
>
> /*
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4919423cce76..0c36a5c2400f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3419,12 +3419,17 @@ struct address_space *swapcache_mapping(struct folio *folio)
> }
> EXPORT_SYMBOL_GPL(swapcache_mapping);
>
> -pgoff_t __page_file_index(struct page *page)
> +pgoff_t __folio_swap_cache_index(struct folio *folio)
> {
> - swp_entry_t swap = page_swap_entry(page);
> - return swp_offset(swap);
> + return swp_offset(folio->swap);
> }
> -EXPORT_SYMBOL_GPL(__page_file_index);
> +EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
> +
> +loff_t __folio_swap_file_pos(struct folio *folio)
> +{
> + return swap_file_pos(folio->swap);
> +}
> +EXPORT_SYMBOL_GPL(__folio_swap_file_pos);
>
> /*
> * add_swap_count_continuation - called when a swap count is duplicated
> --
> 2.44.0
>
Thanks
Barry
On Thu, Apr 18, 2024 at 9:55 AM Barry Song <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 4:12 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > When applied on swap cache pages, page_index / page_file_offset was used
> > to retrieve the swap cache index or swap file offset of a page, and they
> > have their folio equivalence version: folio_index / folio_file_pos.
> >
> > We have eliminated all users for page_index / page_file_offset, everything
> > is using folio_index / folio_file_pos now, so remove the old helpers.
> >
> > Then convert the implementation of folio_index / folio_file_pos to
> > to use folio natively.
> >
> > After this commit, all users that might encounter mixed usage of swap
> > cache and page cache will only use following two helpers:
> >
> > folio_index (calls __folio_swap_cache_index)
> > folio_file_pos (calls __folio_swap_file_pos)
> >
> > The offset in swap file and index in swap cache is still basically the
> > same thing at this moment, but will be different in following commits.
> >
> > Signed-off-by: Kairui Song <[email protected]>
>
> Hi Kairui, thanks !
>
> I also find it rather odd that folio_file_page() is utilized for both
> swp and file.
>
> mm/memory.c <<do_swap_page>>
> page = folio_file_page(folio, swp_offset(entry));
> mm/swap_state.c <<swapin_readahead>>
> return folio_file_page(folio, swp_offset(entry));
> mm/swapfile.c <<unuse_pte>>
> page = folio_file_page(folio, swp_offset(entry));
>
> Do you believe it's worthwhile to tidy up?
>
Hi Barry,
I'm not sure about this. Using folio_file_page doesn't look too bad,
and it will be gone once we convert these paths to always use folios;
this shouldn't take too long.
On Thu, Apr 18, 2024 at 12:14 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 12:08:36AM +0800, Kairui Song wrote:
> > +++ b/fs/nilfs2/bmap.c
> > @@ -453,8 +453,7 @@ __u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap,
> > struct buffer_head *pbh;
> > __u64 key;
> >
> > - key = page_index(bh->b_page) << (PAGE_SHIFT -
> > - bmap->b_inode->i_blkbits);
> > + key = bh->b_page->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);
>
> I'd prefer this were
>
> key = bh->b_folio->index << (PAGE_SHIFT - bmap->b_inode->i_blkbits);
>
> (pages only have a ->index field for historical reasons; I'm trying to
> get rid of it)
>
Good suggestion! For easier review I just copied the original logic
from page_index; I will update it to use folio conventions in V2.
On Thu, Apr 18, 2024 at 3:31 PM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 01:55:28PM +1200, Barry Song wrote:
> > I also find it rather odd that folio_file_page() is utilized for both
> > swp and file.
> >
> > mm/memory.c <<do_swap_page>>
> > page = folio_file_page(folio, swp_offset(entry));
> > mm/swap_state.c <<swapin_readahead>>
> > return folio_file_page(folio, swp_offset(entry));
> > mm/swapfile.c <<unuse_pte>>
> > page = folio_file_page(folio, swp_offset(entry));
> >
> > Do you believe it's worthwhile to tidy up?
>
> Why do you find it odd? What would you propose instead?
I'd prefer something like folio_swap_page(folio, entry);
On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
> Thanks for you patch and will it be doable to switch to folio_index()
> instead ?
No. Just use folio->index. You only need folio_index() if the folio
might belong to the swapcache instead of a file.
On Thu, Apr 18, 2024 at 01:55:28PM +1200, Barry Song wrote:
> I also find it rather odd that folio_file_page() is utilized for both
> swp and file.
>
> mm/memory.c <<do_swap_page>>
> page = folio_file_page(folio, swp_offset(entry));
> mm/swap_state.c <<swapin_readahead>>
> return folio_file_page(folio, swp_offset(entry));
> mm/swapfile.c <<unuse_pte>>
> page = folio_file_page(folio, swp_offset(entry));
>
> Do you believe it's worthwhile to tidy up?
Why do you find it odd? What would you propose instead?
On Thu, Apr 18, 2024 at 2:42 PM Kairui Song <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 9:55 AM Barry Song <[email protected]> wrote:
> >
> > On Thu, Apr 18, 2024 at 4:12 AM Kairui Song <[email protected]> wrote:
> > >
> > > From: Kairui Song <[email protected]>
> > >
> > > When applied on swap cache pages, page_index / page_file_offset was used
> > > to retrieve the swap cache index or swap file offset of a page, and they
> > > have their folio equivalence version: folio_index / folio_file_pos.
> > >
> > > We have eliminated all users for page_index / page_file_offset, everything
> > > is using folio_index / folio_file_pos now, so remove the old helpers.
> > >
> > > Then convert the implementation of folio_index / folio_file_pos to
> > > to use folio natively.
> > >
> > > After this commit, all users that might encounter mixed usage of swap
> > > cache and page cache will only use following two helpers:
> > >
> > > folio_index (calls __folio_swap_cache_index)
> > > folio_file_pos (calls __folio_swap_file_pos)
> > >
> > > The offset in swap file and index in swap cache is still basically the
> > > same thing at this moment, but will be different in following commits.
> > >
> > > Signed-off-by: Kairui Song <[email protected]>
> >
> > Hi Kairui, thanks !
> >
> > I also find it rather odd that folio_file_page() is utilized for both
> > swp and file.
> >
> > mm/memory.c <<do_swap_page>>
> > page = folio_file_page(folio, swp_offset(entry));
> > mm/swap_state.c <<swapin_readahead>>
> > return folio_file_page(folio, swp_offset(entry));
> > mm/swapfile.c <<unuse_pte>>
> > page = folio_file_page(folio, swp_offset(entry));
> >
> > Do you believe it's worthwhile to tidy up?
> >
>
> Hi Barry,
>
> I'm not sure about this. Using folio_file_page doesn't look too bad,
> and it will be gone once we convert them to always use folio, this
> shouldn't take too long.
Hi Kairui,
I am not quite sure this is going to happen soon. Our swap-in large
folio refault handling still has corner cases which can't map large folios
even when we hit large folios in the swapcache [1].
Personally, I feel do_swap_page() isn't handling a file, and those pages
are anon, not file pages. So a separate helper taking a folio and an entry
as arguments seems more readable, especially as Chris even wants to remove
the assumption that large folios have been swapped to contiguous
swap offsets. Anyway, it is just me :-)
[1] https://lore.kernel.org/linux-mm/[email protected]/
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on ceph-client/testing]
[also build test ERROR on ceph-client/for-linus trondmy-nfs/linux-next konis-nilfs2/upstream jaegeuk-f2fs/dev-test jaegeuk-f2fs/dev cifs/for-next linus/master v6.9-rc4]
[cannot apply to akpm-mm/mm-everything next-20240418]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/NFS-remove-nfs_page_lengthg-and-usage-of-page_index/20240418-001343
base: https://github.com/ceph/ceph-client.git testing
patch link: https://lore.kernel.org/r/20240417160842.76665-9-ryncsn%40gmail.com
patch subject: [PATCH 8/8] mm/swap: reduce swap cache search space
config: i386-buildonly-randconfig-002-20240419 (https://download.01.org/0day-ci/archive/20240419/[email protected]/config)
compiler: gcc-9 (Ubuntu 9.5.0-4ubuntu2) 9.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240419/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
mm/huge_memory.c: In function '__split_huge_page':
>> mm/huge_memory.c:2906:12: error: implicit declaration of function 'swap_cache_index' [-Werror=implicit-function-declaration]
2906 | offset = swap_cache_index(folio->swap);
| ^~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/swap_cache_index +2906 mm/huge_memory.c
2888
2889 static void __split_huge_page(struct page *page, struct list_head *list,
2890 pgoff_t end, unsigned int new_order)
2891 {
2892 struct folio *folio = page_folio(page);
2893 struct page *head = &folio->page;
2894 struct lruvec *lruvec;
2895 struct address_space *swap_cache = NULL;
2896 unsigned long offset = 0;
2897 int i, nr_dropped = 0;
2898 unsigned int new_nr = 1 << new_order;
2899 int order = folio_order(folio);
2900 unsigned int nr = 1 << order;
2901
2902 /* complete memcg works before add pages to LRU */
2903 split_page_memcg(head, order, new_order);
2904
2905 if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> 2906 offset = swap_cache_index(folio->swap);
2907 swap_cache = swap_address_space(folio->swap);
2908 xa_lock(&swap_cache->i_pages);
2909 }
2910
2911 /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
2912 lruvec = folio_lruvec_lock(folio);
2913
2914 ClearPageHasHWPoisoned(head);
2915
2916 for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
2917 __split_huge_page_tail(folio, i, lruvec, list, new_order);
2918 /* Some pages can be beyond EOF: drop them from page cache */
2919 if (head[i].index >= end) {
2920 struct folio *tail = page_folio(head + i);
2921
2922 if (shmem_mapping(folio->mapping))
2923 nr_dropped++;
2924 else if (folio_test_clear_dirty(tail))
2925 folio_account_cleaned(tail,
2926 inode_to_wb(folio->mapping->host));
2927 __filemap_remove_folio(tail, NULL);
2928 folio_put(tail);
2929 } else if (!PageAnon(page)) {
2930 __xa_store(&folio->mapping->i_pages, head[i].index,
2931 head + i, 0);
2932 } else if (swap_cache) {
2933 __xa_store(&swap_cache->i_pages, offset + i,
2934 head + i, 0);
2935 }
2936 }
2937
2938 if (!new_order)
2939 ClearPageCompound(head);
2940 else {
2941 struct folio *new_folio = (struct folio *)head;
2942
2943 folio_set_order(new_folio, new_order);
2944 }
2945 unlock_page_lruvec(lruvec);
2946 /* Caller disabled irqs, so they are still disabled here */
2947
2948 split_page_owner(head, order, new_order);
2949
2950 /* See comment in __split_huge_page_tail() */
2951 if (folio_test_anon(folio)) {
2952 /* Additional pin to swap cache */
2953 if (folio_test_swapcache(folio)) {
2954 folio_ref_add(folio, 1 + new_nr);
2955 xa_unlock(&swap_cache->i_pages);
2956 } else {
2957 folio_ref_inc(folio);
2958 }
2959 } else {
2960 /* Additional pin to page cache */
2961 folio_ref_add(folio, 1 + new_nr);
2962 xa_unlock(&folio->mapping->i_pages);
2963 }
2964 local_irq_enable();
2965
2966 if (nr_dropped)
2967 shmem_uncharge(folio->mapping->host, nr_dropped);
2968 remap_page(folio, nr);
2969
2970 if (folio_test_swapcache(folio))
2971 split_swap_cluster(folio->swap);
2972
2973 /*
2974 * set page to its compound_head when split to non order-0 pages, so
2975 * we can skip unlocking it below, since PG_locked is transferred to
2976 * the compound_head of the page and the caller will unlock it.
2977 */
2978 if (new_order)
2979 page = compound_head(page);
2980
2981 for (i = 0; i < nr; i += new_nr) {
2982 struct page *subpage = head + i;
2983 struct folio *new_folio = page_folio(subpage);
2984 if (subpage == page)
2985 continue;
2986 folio_unlock(new_folio);
2987
2988 /*
2989 * Subpages may be freed if there wasn't any mapping
2990 * like if add_to_swap() is running on a lru page that
2991 * had its mapping zapped. And freeing these pages
2992 * requires taking the lru_lock so we do the put_page
2993 * of the tail pages after the split is complete.
2994 */
2995 free_page_and_swap_cache(subpage);
2996 }
2997 }
2998
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on ceph-client/testing]
[also build test ERROR on ceph-client/for-linus trondmy-nfs/linux-next konis-nilfs2/upstream jaegeuk-f2fs/dev-test jaegeuk-f2fs/dev cifs/for-next linus/master v6.9-rc4]
[cannot apply to akpm-mm/mm-everything next-20240418]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/NFS-remove-nfs_page_lengthg-and-usage-of-page_index/20240418-001343
base: https://github.com/ceph/ceph-client.git testing
patch link: https://lore.kernel.org/r/20240417160842.76665-9-ryncsn%40gmail.com
patch subject: [PATCH 8/8] mm/swap: reduce swap cache search space
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240419/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240419/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
mm/shmem.c: In function 'shmem_replace_folio':
>> mm/shmem.c:1765:22: error: implicit declaration of function 'swap_cache_index' [-Werror=implicit-function-declaration]
1765 | swap_index = swap_cache_index(entry);
| ^~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/swap_cache_index +1765 mm/shmem.c
1753
1754 static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
1755 struct shmem_inode_info *info, pgoff_t index)
1756 {
1757 struct folio *old, *new;
1758 struct address_space *swap_mapping;
1759 swp_entry_t entry;
1760 pgoff_t swap_index;
1761 int error;
1762
1763 old = *foliop;
1764 entry = old->swap;
> 1765 swap_index = swap_cache_index(entry);
1766 swap_mapping = swap_address_space(entry);
1767
1768 /*
1769 * We have arrived here because our zones are constrained, so don't
1770 * limit chance of success by further cpuset and node constraints.
1771 */
1772 gfp &= ~GFP_CONSTRAINT_MASK;
1773 VM_BUG_ON_FOLIO(folio_test_large(old), old);
1774 new = shmem_alloc_folio(gfp, info, index);
1775 if (!new)
1776 return -ENOMEM;
1777
1778 folio_get(new);
1779 folio_copy(new, old);
1780 flush_dcache_folio(new);
1781
1782 __folio_set_locked(new);
1783 __folio_set_swapbacked(new);
1784 folio_mark_uptodate(new);
1785 new->swap = entry;
1786 folio_set_swapcache(new);
1787
1788 /*
1789 * Our caller will very soon move newpage out of swapcache, but it's
1790 * a nice clean interface for us to replace oldpage by newpage there.
1791 */
1792 xa_lock_irq(&swap_mapping->i_pages);
1793 error = shmem_replace_entry(swap_mapping, swap_index, old, new);
1794 if (!error) {
1795 mem_cgroup_migrate(old, new);
1796 __lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1);
1797 __lruvec_stat_mod_folio(new, NR_SHMEM, 1);
1798 __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -1);
1799 __lruvec_stat_mod_folio(old, NR_SHMEM, -1);
1800 }
1801 xa_unlock_irq(&swap_mapping->i_pages);
1802
1803 if (unlikely(error)) {
1804 /*
1805 * Is this possible? I think not, now that our callers check
1806 * both PageSwapCache and page_private after getting page lock;
1807 * but be defensive. Reverse old to newpage for clear and free.
1808 */
1809 old = new;
1810 } else {
1811 folio_add_lru(new);
1812 *foliop = new;
1813 }
1814
1815 folio_clear_swapcache(old);
1816 old->private = NULL;
1817
1818 folio_unlock(old);
1819 folio_put_refs(old, 2);
1820 return error;
1821 }
1822
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on ceph-client/testing]
[also build test ERROR on ceph-client/for-linus trondmy-nfs/linux-next konis-nilfs2/upstream jaegeuk-f2fs/dev-test jaegeuk-f2fs/dev cifs/for-next linus/master v6.9-rc4 next-20240418]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/NFS-remove-nfs_page_lengthg-and-usage-of-page_index/20240418-001343
base: https://github.com/ceph/ceph-client.git testing
patch link: https://lore.kernel.org/r/20240417160842.76665-7-ryncsn%40gmail.com
patch subject: [PATCH 6/8] mm/swap: get the swap file offset directly
config: alpha-defconfig (https://download.01.org/0day-ci/archive/20240419/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240419/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
In file included from mm/shmem.c:43:
mm/swap.h: In function 'swap_file_pos':
>> mm/swap.h:12:25: error: implicit declaration of function 'swp_offset'; did you mean 'pmd_offset'? [-Werror=implicit-function-declaration]
12 | return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
| ^~~~~~~~~~
| pmd_offset
In file included from mm/shmem.c:68:
include/linux/swapops.h: At top level:
>> include/linux/swapops.h:107:23: error: conflicting types for 'swp_offset'; have 'long unsigned int(swp_entry_t)'
107 | static inline pgoff_t swp_offset(swp_entry_t entry)
| ^~~~~~~~~~
mm/swap.h:12:25: note: previous implicit declaration of 'swp_offset' with type 'int()'
12 | return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
| ^~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from mm/show_mem.c:19:
mm/swap.h: In function 'swap_file_pos':
>> mm/swap.h:12:25: error: implicit declaration of function 'swp_offset'; did you mean 'pmd_offset'? [-Werror=implicit-function-declaration]
12 | return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
| ^~~~~~~~~~
| pmd_offset
cc1: some warnings being treated as errors
vim +12 mm/swap.h
9
10 static inline loff_t swap_file_pos(swp_entry_t entry)
11 {
> 12 return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
13 }
14
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi, Kairui,
Kairui Song <[email protected]> writes:
> From: Kairui Song <[email protected]>
>
> Currently we use one swap_address_space for every 64M chunk to reduce lock
> contention, this is like having a set of smaller swap files inside one
> big swap file. But when doing swap cache look up or insert, we are
> still using the offset of the whole large swap file. This is OK for
> correctness, as the offset (key) is unique.
>
> But Xarray is specially optimized for small indexes, it creates the
> redix tree levels lazily to be just enough to fit the largest key
> stored in one Xarray. So we are wasting tree nodes unnecessarily.
>
> For 64M chunk it should only take at most 3 level to contain everything.
> But we are using the offset from the whole swap file, so the offset (key)
> value will be way beyond 64M, and so will the tree level.
>
> Optimize this by reduce the swap cache search space into 64M scope.
In general, I think that it makes sense to reduce the depth of the
xarray.
One concern is that, IIUC, we make the swap cache behave like the file cache
where possible. And your change makes the swap cache and file cache diverge
more. Is it possible for us to keep them similar?
For example,
Is it possible to return the offset inside the 64M range in
__page_file_index() (maybe rename it)?
Is it possible to add "start_offset" support in xarray, so "index"
will subtract "start_offset" before looking up / inserting?
Is it possible to use multiple range locks to protect one xarray to
improve the lock scalability? This is why we have multiple "struct
address_space" for one swap device. And, we may have same lock
contention issue for large files too.
I haven't looked at the code in detail. So my idea may not make sense
at all. If so, sorry about that.
Hi, Matthew,
Can you teach me on this too?
--
Best Regards,
Huang, Ying
On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <[email protected]> wrote:
>
> Hi, Kairui,
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > Currently we use one swap_address_space for every 64M chunk to reduce lock
> > contention, this is like having a set of smaller swap files inside one
> > big swap file. But when doing swap cache look up or insert, we are
> > still using the offset of the whole large swap file. This is OK for
> > correctness, as the offset (key) is unique.
> >
> > But Xarray is specially optimized for small indexes, it creates the
> > redix tree levels lazily to be just enough to fit the largest key
> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
> >
> > For 64M chunk it should only take at most 3 level to contain everything.
> > But we are using the offset from the whole swap file, so the offset (key)
> > value will be way beyond 64M, and so will the tree level.
> >
> > Optimize this by reduce the swap cache search space into 64M scope.
>
Hi,
Thanks for the comments!
> In general, I think that it makes sense to reduce the depth of the
> xarray.
>
> One concern is that IIUC we make swap cache behaves like file cache if
> possible. And your change makes swap cache and file cache diverge more.
> Is it possible for us to keep them similar?
So far in this series, I think there is no problem with that: the two
main helpers for retrieving the file & cache offset, folio_index and
folio_file_pos, will work fine and stay compatible with current users.
And if we convert to sharing filemap_* functions for swap cache / page
cache, they mostly already accept an index as an argument, so no
trouble at all.
>
> For example,
>
> Is it possible to return the offset inside 64M range in
> __page_file_index() (maybe rename it)?
I'm not sure what you mean by this; __page_file_index will be gone as we
convert to folios.
And this series did delete / rename it (it might not be easy to see
this, as the usage of these helpers was not very well organized before
this series, so some cleanup is involved).
It was previously only used through page_index (deleted) /
folio_index, and now folio_index will be returning the offset inside
the 64M range.
I guess I just did what you wanted? :)
My cover letter and commit message might be not clear enough, I can update it.
>
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?
The xarray struct already seems very full, and this usage doesn't look
generic to me; it might be better to fix this kind of issue case by case.
>
> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability? This is why we have multiple "struct
> address_space" for one swap device. And, we may have same lock
> contention issue for large files too.
Good question; this series can improve the tree depth issue for the swap
cache, but contention on the address space is still a thing.
A more generic solution might involve changes to the xarray API or the use
of some other data structure?
(BTW, I think reducing the search space and resolving lock contention
are not necessarily related; reducing the search space by having a
large table of small trees should still perform better for the swap
cache.)
>
> I haven't look at the code in details. So, my idea may not make sense
> at all. If so, sorry about that.
>
> Hi, Matthew,
>
> Can you teach me on this too?
>
> --
> Best Regards,
> Huang, Ying
On Thu, Apr 18, 2024 at 9:40 AM Xiubo Li <[email protected]> wrote:
> On 4/18/24 09:30, Matthew Wilcox wrote:
> > On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
> >> Thanks for you patch and will it be doable to switch to folio_index()
> >> instead ?
> > No. Just use folio->index. You only need folio_index() if the folio
> > might belong to the swapcache instead of a file.
> >
> Hmm, Okay.
>
> Thanks
>
> - Xiubo
>
Hi Xiubo
Thanks for the comment,
As Matthew mentioned, there is no need to use folio_index unless you
are accessing the swapcache. And I found that ceph is not using folios
internally yet and needs a lot of conversions. So I think I'll just keep
using page->index here; later conversions may change it to
folio->index.
On 4/22/24 23:34, Kairui Song wrote:
> On Thu, Apr 18, 2024 at 9:40 AM Xiubo Li <[email protected]> wrote:
>> On 4/18/24 09:30, Matthew Wilcox wrote:
>>> On Thu, Apr 18, 2024 at 08:28:22AM +0800, Xiubo Li wrote:
>>>> Thanks for you patch and will it be doable to switch to folio_index()
>>>> instead ?
>>> No. Just use folio->index. You only need folio_index() if the folio
>>> might belong to the swapcache instead of a file.
>>>
>> Hmm, Okay.
>>
>> Thanks
>>
>> - Xiubo
>>
> Hi Xiubo
>
> Thanks for the comment,
>
> As Matthew mentioned there is no need to use folio_index unless you
> are access swapcache. And I found that ceph is not using folios
> internally yet, needs a lot of conversions. So I think I'll just keep
> using page->index here, later conversions may change it to
> folio->index.
>
Sounds good to me and thanks for explaining.
Cheers
- Xiubo
Kairui Song <[email protected]> writes:
> On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <[email protected]> wrote:
>>
>> Hi, Kairui,
>>
>> Kairui Song <[email protected]> writes:
>>
>> > From: Kairui Song <[email protected]>
>> >
>> > Currently we use one swap_address_space for every 64M chunk to reduce lock
>> > contention, this is like having a set of smaller swap files inside one
>> > big swap file. But when doing swap cache look up or insert, we are
>> > still using the offset of the whole large swap file. This is OK for
>> > correctness, as the offset (key) is unique.
>> >
>> > But Xarray is specially optimized for small indexes, it creates the
>> > redix tree levels lazily to be just enough to fit the largest key
>> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
>> >
>> > For 64M chunk it should only take at most 3 level to contain everything.
>> > But we are using the offset from the whole swap file, so the offset (key)
>> > value will be way beyond 64M, and so will the tree level.
>> >
>> > Optimize this by reduce the swap cache search space into 64M scope.
>>
>
> Hi,
>
> Thanks for the comments!
>
>> In general, I think that it makes sense to reduce the depth of the
>> xarray.
>>
>> One concern is that IIUC we make swap cache behaves like file cache if
>> possible. And your change makes swap cache and file cache diverge more.
>> Is it possible for us to keep them similar?
>
> So far in this series, I think there is no problem for that, the two
> main helpers for retrieving file & cache offset: folio_index and
> folio_file_pos will work fine and be compatible with current users.
>
> And if we convert to share filemap_* functions for swap cache / page
> cache, they are mostly already accepting index as an argument so no
> trouble at all.
>
>>
>> For example,
>>
>> Is it possible to return the offset inside 64M range in
>> __page_file_index() (maybe rename it)?
>
> Not sure what you mean by this, __page_file_index will be gone as we
> convert to folio.
> And this series did delete / rename it (it might not be easy to see
> this, the usage of these helpers is not very well organized before
> this series so some clean up is involved).
> It was previously only used through page_index (deleted) /
> folio_index, and, now folio_index will be returning the offset inside
> the 64M range.
>
> I guess I just did what you wanted? :)
Good!
> My cover letter and commit message might be not clear enough, I can update it.
>
>>
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> xarray struct seems already very full, and this usage doesn't look
> generic to me, might be better to fix this kind of issue case by case.
Just some open questions.
>>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability? This is why we have multiple "struct
>> address_space" for one swap device. And, we may have same lock
>> contention issue for large files too.
>
> Good question, this series can improve the tree depth issue for swap
> cache, but contention in address space is still a thing.
The lock contention for the swap cache has been reduced by using multiple
xarrays in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks"). But that fixes it for the swap cache only, not for the file
cache in general. We have observed a similar lock contention issue for
the file cache too. And the method isn't perfect either, as shown by the
issue you found here. In general, it's about what a "file" is for a swap
device.
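For reference, the split from that commit looks roughly like this today
(simplified from mm/swap_state.c and the swap headers; per-device
bookkeeping such as the number of spaces is omitted):

  /* One array of address_space per swap device; the high bits of the
   * swap offset select which 64MB space (and which xa_lock) is used. */
  struct address_space *swapper_spaces[MAX_SWAPFILES];

  #define swap_address_space(entry)                               \
          (&swapper_spaces[swp_type(entry)][swp_offset(entry)     \
                  >> SWAP_ADDRESS_SPACE_SHIFT])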
> A more generic solution might involve changes of xarray API or use
> some other data struct?
>
> (BTW I think reducing the search space and resolving lock contention
> is not necessarily related, reducing the search space by having a
> large table of small trees should still perform better for swap
> cache).
--
Best Regards,
Huang, Ying
Kairui Song <[email protected]> writes:
> From: Kairui Song <[email protected]>
>
> folio_file_pos and page_file_offset are for mixed usage of swap cache
> and page cache, it can't be page cache here, so introduce a new helper
> to get the swap offset in swap file directly.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/page_io.c | 6 +++---
> mm/swap.h | 5 +++++
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ae2b49055e43..93de5aadb438 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -279,7 +279,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
> * be temporary.
> */
> pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
> - ret, page_file_offset(page));
> + ret, swap_file_pos(page_swap_entry(page)));
> for (p = 0; p < sio->pages; p++) {
> page = sio->bvec[p].bv_page;
> set_page_dirty(page);
> @@ -298,7 +298,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
> struct swap_iocb *sio = NULL;
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> struct file *swap_file = sis->swap_file;
> - loff_t pos = folio_file_pos(folio);
> + loff_t pos = swap_file_pos(folio->swap);
>
> count_swpout_vm_event(folio);
> folio_start_writeback(folio);
> @@ -429,7 +429,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> {
> struct swap_info_struct *sis = swp_swap_info(folio->swap);
> struct swap_iocb *sio = NULL;
> - loff_t pos = folio_file_pos(folio);
> + loff_t pos = swap_file_pos(folio->swap);
>
> if (plug)
> sio = *plug;
> diff --git a/mm/swap.h b/mm/swap.h
> index fc2f6ade7f80..2de83729aaa8 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -7,6 +7,11 @@ struct mempolicy;
> #ifdef CONFIG_SWAP
> #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +static inline loff_t swap_file_pos(swp_entry_t entry)
> +{
> + return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> +}
> +
> /* linux/mm/page_io.c */
> int sio_pool_init(void);
> struct swap_iocb;
I feel that the file concept for swap is kind of confusing. From the
file cache point of view, one "struct address_space" corresponds to one
file. If so, we have a simple file system on a swap device (block
device backed or file backed), where the size of each file is 64M. The
swap entry encodes the file system (swap_type), the file name
(swap_offset >> SWAP_ADDRESS_SPACE_SHIFT), and the offset within the file
(lower bits of swap_offset).
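Purely as an illustration of that mapping (hypothetical helper names,
not something proposed in the series):

  static inline unsigned int swap_fs(swp_entry_t entry)        /* "file system" */
  {
          return swp_type(entry);
  }

  static inline pgoff_t swap_file_name(swp_entry_t entry)      /* "file name" */
  {
          return swp_offset(entry) >> SWAP_ADDRESS_SPACE_SHIFT;
  }

  static inline pgoff_t swap_off_in_file(swp_entry_t entry)    /* offset in "file" */
  {
          return swp_offset(entry) & (SWAP_ADDRESS_SPACE_PAGES - 1);
  }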
If the above definition is good, it's better to rename swap_file_pos()
to swap_dev_pos(), because it returns the swap device position of the
swap entry.
And, once we reach consensus on the swap file related concepts, we may
document them somewhere and review all naming in the swap code for
cleanup.
--
Best Regards,
Huang, Ying
On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?
We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
generalise it, but then we'd have to store that somewhere and there's
no obvious good place to store it that wouldn't enlarge struct xarray,
which I'd be reluctant to do.
> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability? This is why we have multiple "struct
> address_space" for one swap device. And, we may have same lock
> contention issue for large files too.
It's something I've considered. The issue is search marks. If we delete
an entry, we may have to walk all the way up the xarray clearing bits as
we go and I'd rather not grab a lock at each level. There's a convenient
4 byte hole between nr_values and parent where we could put it.
Oh, another issue is that we use i_pages.xa_lock to synchronise
address_space.nrpages, so I'm not sure that a per-node lock will help.
But I'm conscious that there are workloads which show contention on
xa_lock as their limiting factor, so I'm open to ideas to improve all
these things.
On Tue, Apr 23, 2024 at 9:43 AM Huang, Ying <[email protected]> wrote:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > folio_file_pos and page_file_offset are for mixed usage of swap cache
> > and page cache, it can't be page cache here, so introduce a new helper
> > to get the swap offset in swap file directly.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/page_io.c | 6 +++---
> > mm/swap.h | 5 +++++
> > 2 files changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index ae2b49055e43..93de5aadb438 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -279,7 +279,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
> > * be temporary.
> > */
> > pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
> > - ret, page_file_offset(page));
> > + ret, swap_file_pos(page_swap_entry(page)));
> > for (p = 0; p < sio->pages; p++) {
> > page = sio->bvec[p].bv_page;
> > set_page_dirty(page);
> > @@ -298,7 +298,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc
> > struct swap_iocb *sio = NULL;
> > struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > struct file *swap_file = sis->swap_file;
> > - loff_t pos = folio_file_pos(folio);
> > + loff_t pos = swap_file_pos(folio->swap);
> >
> > count_swpout_vm_event(folio);
> > folio_start_writeback(folio);
> > @@ -429,7 +429,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> > {
> > struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > struct swap_iocb *sio = NULL;
> > - loff_t pos = folio_file_pos(folio);
> > + loff_t pos = swap_file_pos(folio->swap);
> >
> > if (plug)
> > sio = *plug;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index fc2f6ade7f80..2de83729aaa8 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -7,6 +7,11 @@ struct mempolicy;
> > #ifdef CONFIG_SWAP
> > #include <linux/blk_types.h> /* for bio_end_io_t */
> >
> > +static inline loff_t swap_file_pos(swp_entry_t entry)
> > +{
> > + return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> > +}
> > +
> > /* linux/mm/page_io.c */
> > int sio_pool_init(void);
> > struct swap_iocb;
>
> I feel that the file concept for swap is kind of confusing. From the
> file cache point of view, one "struct address space" conresponds to one
> file. If so, we have a simple file system on a swap device (block
> device backed or file backed), where the size of each file is 64M. The
> swap entry encode the file system (swap_type), the file name
> (swap_offset >> SWAP_ADDRESS_SPACE_SHIFT), and the offset in file (lower
> bits of swap_offset).
>
> If the above definition is good, it's better to rename swap_file_pos()
> to swap_dev_pos(), because it returns the swap device position of the
> swap entry.
Good suggestion! The definition looks good to me, and swap_dev_pos also
looks better; "swap_file" is indeed confusing.
>
> And, when we reaches consensus on the swap file related concept, we may
> document it somewhere and review all naming in swap code to cleanup.
>
> --
> Best Regards,
> Huang, Ying
Hi, Matthew,
Matthew Wilcox <[email protected]> writes:
> On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> generalise it, but then we'd have to store that somewhere and there's
> no obvious good place to store it that wouldn't enlarge struct xarray,
> which I'd be reluctant to do.
>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability? This is why we have multiple "struct
>> address_space" for one swap device. And, we may have same lock
>> contention issue for large files too.
>
> It's something I've considered. The issue is search marks. If we delete
> an entry, we may have to walk all the way up the xarray clearing bits as
> we go and I'd rather not grab a lock at each level. There's a convenient
> 4 byte hole between nr_values and parent where we could put it.
>
> Oh, another issue is that we use i_pages.xa_lock to synchronise
> address_space.nrpages, so I'm not sure that a per-node lock will help.
Thanks for looking at this.
> But I'm conscious that there are workloads which show contention on
> xa_lock as their limiting factor, so I'm open to ideas to improve all
> these things.
I have no idea so far because of my very limited knowledge about xarray.
--
Best Regards,
Huang, Ying
Hi Ying,
On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
>
> Hi, Matthew,
>
> Matthew Wilcox <[email protected]> writes:
>
> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> >> Is it possible to add "start_offset" support in xarray, so "index"
> >> will subtract "start_offset" before looking up / inserting?
> >
> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > generalise it, but then we'd have to store that somewhere and there's
> > no obvious good place to store it that wouldn't enlarge struct xarray,
> > which I'd be reluctant to do.
> >
> >> Is it possible to use multiple range locks to protect one xarray to
> >> improve the lock scalability? This is why we have multiple "struct
> >> address_space" for one swap device. And, we may have same lock
> >> contention issue for large files too.
> >
> > It's something I've considered. The issue is search marks. If we delete
> > an entry, we may have to walk all the way up the xarray clearing bits as
> > we go and I'd rather not grab a lock at each level. There's a convenient
> > 4 byte hole between nr_values and parent where we could put it.
> >
> > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>
> Thanks for looking at this.
>
> > But I'm conscious that there are workloads which show contention on
> > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > these things.
>
> I have no idea so far because my very limited knowledge about xarray.
For the swap file usage, I have been considering an idea to remove the
index part of the xarray from the swap cache. The swap cache is different
from the file cache in a few aspects.
For one, if we want to have a folio equivalent of a "large swap entry",
then the natural alignment of those swap offsets does not make sense.
Ideally we should be able to write the folio to unaligned swap file
locations.
The other aspect for swap files is that we already have different
data structures organized around the swap offset, swap_map and
swap_cgroup. If we group the swap related data structures together, we
can add a pointer to a union of a folio or a shadow swap entry. We can
use atomic updates on the swap struct members or break down the access
lock by ranges just like the swap cluster does.
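A very rough sketch of what such a grouped per-offset structure could
look like (hypothetical names and layout, just to make the idea
concrete):

  struct swap_desc {
          unsigned char   count;          /* what swap_map[offset] tracks today */
          unsigned short  cgroup;         /* what swap_cgroup tracks today */
          union {
                  struct folio    *folio;  /* folio in swap cache, if any */
                  void            *shadow; /* workingset shadow after reclaim */
          };
          /* The union could be updated with cmpxchg, or protected by a
           * per-range lock similar to the swap cluster lock. A low tag
           * bit could distinguish a folio pointer from a shadow entry. */
  };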
I want to discuss those ideas in the upcoming LSF/MM meet up as well.
Chris
Chris Li <[email protected]> writes:
> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
>>
>> Hi, Matthew,
>>
>> Matthew Wilcox <[email protected]> writes:
>>
>> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> >> Is it possible to add "start_offset" support in xarray, so "index"
>> >> will subtract "start_offset" before looking up / inserting?
>> >
>> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
>> > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
>> > generalise it, but then we'd have to store that somewhere and there's
>> > no obvious good place to store it that wouldn't enlarge struct xarray,
>> > which I'd be reluctant to do.
>> >
>> >> Is it possible to use multiple range locks to protect one xarray to
>> >> improve the lock scalability? This is why we have multiple "struct
>> >> address_space" for one swap device. And, we may have same lock
>> >> contention issue for large files too.
>> >
>> > It's something I've considered. The issue is search marks. If we delete
>> > an entry, we may have to walk all the way up the xarray clearing bits as
>> > we go and I'd rather not grab a lock at each level. There's a convenient
>> > 4 byte hole between nr_values and parent where we could put it.
>> >
>> > Oh, another issue is that we use i_pages.xa_lock to synchronise
>> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>>
>> Thanks for looking at this.
>>
>> > But I'm conscious that there are workloads which show contention on
>> > xa_lock as their limiting factor, so I'm open to ideas to improve all
>> > these things.
>>
>> I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>
> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry.
The shadow swap entry may be freed. So we need to prepare for that.
And, in the current design, only swap_map[] is allocated if the swap space
isn't used. That needs to be considered too.
> We can use atomic updates on the swap struct member or breakdown the
> access lock by ranges just like swap cluster does.
The swap code uses the xarray in a simple way. That gives us an
opportunity to optimize. For example, it makes it easy to use multiple
xarray instances for one swap device.
> I want to discuss those ideas in the upcoming LSF/MM meet up as well.
Good!
--
Best Regards,
Huang, Ying
On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > Hi Ying,
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry.
>
> The shadow swap entry may be freed. So we need to prepare for that.
Freeing the shadow swap entry will just set the pointer to NULL.
Are you concerned that the memory allocated for the pointer is not
freed to the system after the shadow swap entry is freed?
It will be subject to fragmentation among the free swap entries.
In that regard, the xarray is also subject to fragmentation: it will not
free an internal node if the node still has one xa_index in use. Even if
the xarray node is freed to the slab, there is fragmentation at the slab
level as well; the backing page might not be freed to the system.
> And, in current design, only swap_map[] is allocated if the swap space
> isn't used. That needs to be considered too.
I am aware of that. I want to make swap_map[] not statically allocated
any more either.
The static allocation of swap_map forces the rest of the swap data
structures to use other means to sparsely allocate their data,
repeating the fragmentation elsewhere in different ways. That is also
one major source of pain when hacking on the swap code: the data
structures are spread across too many different places.
> > We can use atomic updates on the swap struct member or breakdown the
> > access lock by ranges just like swap cluster does.
>
> The swap code uses xarray in a simple way. That gives us opportunity to
> optimize. For example, it makes it easy to use multiple xarray
The fixed swap offset range makes it like an array. There are many
ways to shard such an array of swap entries, e.g. the swap cluster is
one way to shard it, and multiple xarrays are another. We can also do
multiple-xarray-like sharding, or even fancier schemes.
Chris
Chris Li <[email protected]> writes:
> On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > Hi Ying,
>> >
>> > For the swap file usage, I have been considering an idea to remove the
>> > index part of the xarray from swap cache. Swap cache is different from
>> > file cache in a few aspects.
>> > For one if we want to have a folio equivalent of "large swap entry".
>> > Then the natural alignment of those swap offset on does not make
>> > sense. Ideally we should be able to write the folio to un-aligned swap
>> > file locations.
>> >
>> > The other aspect for swap files is that, we already have different
>> > data structures organized around swap offset, swap_map and
>> > swap_cgroup. If we group the swap related data structure together. We
>> > can add a pointer to a union of folio or a shadow swap entry.
>>
>> The shadow swap entry may be freed. So we need to prepare for that.
>
> Free the shadow swap entry will just set the pointer to NULL.
> Are you concerned that the memory allocated for the pointer is not
> free to the system after the shadow swap entry is free?
>
> It will be subject to fragmentation on the free swap entry.
> In that regard, xarray is also subject to fragmentation. It will not
> free the internal node if the node has one xa_index not freed. Even if
> the xarray node is freed to slab, at slab level there is fragmentation
> as well, the backing page might not free to the system.
Sorry my words were confusing. What I wanted to say is that the xarray
node may be freed.
>> And, in current design, only swap_map[] is allocated if the swap space
>> isn't used. That needs to be considered too.
>
> I am aware of that. I want to make the swap_map[] not static allocated
> any more either.
Yes. That's possible.
> The swap_map static allocation forces the rest of the swap data
> structure to have other means to sparsely allocate their data
> structure, repeating the fragmentation elsewhere, in different
> ways.That is also the one major source of the pain point hacking on
> the swap code. The data structure is spread into too many different
> places.
Looking forward to more details to compare :-)
>> > We can use atomic updates on the swap struct member or breakdown the
>> > access lock by ranges just like swap cluster does.
>>
>> The swap code uses xarray in a simple way. That gives us opportunity to
>> optimize. For example, it makes it easy to use multiple xarray
>
> The fixed swap offset range makes it like an array. There are many
> ways to shard the array like swap entry, e.g. swap cluster is one way
> to shard it. Multiple xarray is another way. We can also do multiple
> xarray like sharding, or even more fancy ones.
--
Best Regards,
Huang, Ying
On Sat, Apr 27, 2024 at 8:23 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Chris Li <[email protected]> writes:
> > Free the shadow swap entry will just set the pointer to NULL.
> > Are you concerned that the memory allocated for the pointer is not
> > free to the system after the shadow swap entry is free?
> >
> > It will be subject to fragmentation on the free swap entry.
> > In that regard, xarray is also subject to fragmentation. It will not
> > free the internal node if the node has one xa_index not freed. Even if
> > the xarray node is freed to slab, at slab level there is fragmentation
> > as well, the backing page might not free to the system.
>
> Sorry my words were confusing. What I wanted to say is that the xarray
> node may be freed.
Somehow I figured that was what you meant :-) My previous reply still
applies here: freeing xarray nodes is subject to fragmentation at the
slab level, and the actual backing page might not be released to the
kernel after the node is freed.
>
> >> And, in current design, only swap_map[] is allocated if the swap space
> >> isn't used. That needs to be considered too.
> >
> > I am aware of that. I want to make the swap_map[] not static allocated
> > any more either.
>
> Yes. That's possible.
Of course there will be a price to pay for that. The current swap_map
is only 1 byte per entry; that per-entry size for the swap count is
going to be hard to beat with any alternative. Hopefully we can find
the trade-off in other places.
>
> > The swap_map static allocation forces the rest of the swap data
> > structure to have other means to sparsely allocate their data
> > structure, repeating the fragmentation elsewhere, in different
> > ways.That is also the one major source of the pain point hacking on
> > the swap code. The data structure is spread into too many different
> > places.
>
> Look forward to more details to compare :-)
Sure. When I make more progress I will post it.
Chris
On Sat, Apr 27, 2024 at 7:16 AM Chris Li <[email protected]> wrote:
>
> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
> >
> > Hi, Matthew,
> >
> > Matthew Wilcox <[email protected]> writes:
> >
> > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > >> Is it possible to add "start_offset" support in xarray, so "index"
> > >> will subtract "start_offset" before looking up / inserting?
> > >
> > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > > generalise it, but then we'd have to store that somewhere and there's
> > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > which I'd be reluctant to do.
> > >
> > >> Is it possible to use multiple range locks to protect one xarray to
> > >> improve the lock scalability? This is why we have multiple "struct
> > >> address_space" for one swap device. And, we may have same lock
> > >> contention issue for large files too.
> > >
> > > It's something I've considered. The issue is search marks. If we delete
> > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > we go and I'd rather not grab a lock at each level. There's a convenient
> > > 4 byte hole between nr_values and parent where we could put it.
> > >
> > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> >
> > Thanks for looking at this.
> >
> > > But I'm conscious that there are workloads which show contention on
> > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > these things.
> >
> > I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>
Hi Chris,
This sounds interesting, I have a few questions though...
Are you suggesting we handle swap on a file and swap on a device
differently? Swap on a file is much less frequently used than swap on
a device, I think.
And what do you mean by the "index part of the xarray"? If we need a
cache, the xarray still seems one of the best choices to hold the
content.
> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry. We can
> use atomic updates on the swap struct member or breakdown the access
> lock by ranges just like swap cluster does.
>
> I want to discuss those ideas in the upcoming LSF/MM meet up as well.
Looking forward to it!
>
> Chris
On Mon, Apr 29, 2024 at 1:37 AM Kairui Song <[email protected]> wrote:
>
> On Sat, Apr 27, 2024 at 7:16 AM Chris Li <[email protected]> wrote:
> >
> > Hi Ying,
> >
> > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
> > >
> > > Hi, Matthew,
> > >
> > > Matthew Wilcox <[email protected]> writes:
> > >
> > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > > >> Is it possible to add "start_offset" support in xarray, so "index"
> > > >> will subtract "start_offset" before looking up / inserting?
> > > >
> > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > > > generalise it, but then we'd have to store that somewhere and there's
> > > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > > which I'd be reluctant to do.
> > > >
> > > >> Is it possible to use multiple range locks to protect one xarray to
> > > >> improve the lock scalability? This is why we have multiple "struct
> > > >> address_space" for one swap device. And, we may have same lock
> > > >> contention issue for large files too.
> > > >
> > > > It's something I've considered. The issue is search marks. If we delete
> > > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > > we go and I'd rather not grab a lock at each level. There's a convenient
> > > > 4 byte hole between nr_values and parent where we could put it.
> > > >
> > > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> > >
> > > Thanks for looking at this.
> > >
> > > > But I'm conscious that there are workloads which show contention on
> > > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > > these things.
> > >
> > > I have no idea so far because my very limited knowledge about xarray.
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
>
> Hi Chris,
>
> This sound interesting, I have a few questions though...
>
> Are you suggesting we handle swap on file and swap on device
> differently? Swap on file is much less frequently used than swap on
> device I think.
>
> And what do you mean "index part of the xarray"? If we need a cache,
> xarray still seems one of the best choices to hold the content.
>
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry. We can
> > use atomic updates on the swap struct member or breakdown the access
> > lock by ranges just like swap cluster does.
Oh, and BTW I'm also trying to break down the swap address space range
(from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to
12). It's a simple approach, but the coupling and the increased memory
usage of the address_space structure make the performance regress
(about -2% for the worst real-world workload). I found this
part very performance sensitive, so basically I'm not making much
progress on the future items I mentioned in this cover letter. New
ideas could be very helpful!
> >
> > I want to discuss those ideas in the upcoming LSF/MM meet up as well.
>
> Looking forward to it!
>
> >
> > Chris
On Sun, Apr 28, 2024 at 10:37 AM Kairui Song <[email protected]> wrote:
>
> On Sat, Apr 27, 2024 at 7:16 AM Chris Li <[email protected]> wrote:
> >
> > Hi Ying,
> >
> > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <[email protected]> wrote:
> > >
> > > Hi, Matthew,
> > >
> > > Matthew Wilcox <[email protected]> writes:
> > >
> > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > > >> Is it possible to add "start_offset" support in xarray, so "index"
> > > >> will subtract "start_offset" before looking up / inserting?
> > > >
> > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > > > generalise it, but then we'd have to store that somewhere and there's
> > > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > > which I'd be reluctant to do.
> > > >
> > > >> Is it possible to use multiple range locks to protect one xarray to
> > > >> improve the lock scalability? This is why we have multiple "struct
> > > >> address_space" for one swap device. And, we may have same lock
> > > >> contention issue for large files too.
> > > >
> > > > It's something I've considered. The issue is search marks. If we delete
> > > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > > we go and I'd rather not grab a lock at each level. There's a convenient
> > > > 4 byte hole between nr_values and parent where we could put it.
> > > >
> > > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> > >
> > > Thanks for looking at this.
> > >
> > > > But I'm conscious that there are workloads which show contention on
> > > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > > these things.
> > >
> > > I have no idea so far because my very limited knowledge about xarray.
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
>
> Hi Chris,
>
> This sound interesting, I have a few questions though...
>
> Are you suggesting we handle swap on file and swap on device
> differently? Swap on file is much less frequently used than swap on
> device I think.
That is not what I have in mind. The swap struct idea does not
distinguish between a swap file and a swap device. BTW, I sometimes use
swap on a file because I did not allocate a swap partition in advance.
>
> And what do you mean "index part of the xarray"? If we need a cache,
> xarray still seems one of the best choices to hold the content.
We still need to look up swap file offset -> folio. However, if we
allocate each swap offset a "struct swap", then the folio lookup can
be as simple as getting the swap struct by offset, then doing an atomic
read of swap_struct->folio.
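Something along these lines, assuming a per-offset descriptor like the
struct swap_desc sketched earlier in the thread (again hypothetical,
just to show the shape of the lookup):

  static inline struct folio *swap_lookup_folio(struct swap_desc *descs,
                                                pgoff_t offset)
  {
          /* READ_ONCE stands in for whatever atomic scheme ends up
           * being used; NULL or a tagged shadow value would mean the
           * entry is not in the swap cache. */
          return READ_ONCE(descs[offset].folio);
  }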
Not sure how you came to the conclusion of "best choice"? It is one
choice, but it has its drawbacks. The natural alignment requirement of
the xarray, e.g. that 2M large swap entries need to be written at 2M
aligned offsets, is an unnecessary restriction. If we allocate the
"struct swap" ourselves, we have more flexibility.
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry. We can
> > use atomic updates on the swap struct member or breakdown the access
> > lock by ranges just like swap cluster does.
> >
> > I want to discuss those ideas in the upcoming LSF/MM meet up as well.
>
> Looking forward to it!
Thanks, I will post more when I make more progress on that.
>
> Oh, and BTW I'm also trying to breakdown the swap address space range
> (from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to
> 12). It's a simple approach, but the coupling and increased memory
> usage of address_space structure makes the performance go into
> regression (about -2% for worst real world workload). I found this
Yes, that sounds plausible.
> part very performance sensitive, so basically I'm not making much
> progress for the future items I mentioned in this cover letter. New
> ideas could be very helpful!
>
The swap_struct idea is very different from what you are trying to do
in this series. It is more related to my LSF/MM topic on the swap
backend overhaul. It is a longer-term and bigger undertaking.
Chris