Shmem will support large folio allocation [1] [2] for better performance;
however, memory reclaim still splits these precious large folios when trying
to swap out shmem, which may lead to memory fragmentation and loses the
benefit of large folios for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting them, and the large folio swap-in series [3] is queued in the
mm-unstable branch. Hence this patch set adds large folio swap-out and swap-in
support for shmem.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/T/
Changes from RFC:
- Rebased to the latest mm-unstable.
- Dropped the counter name fixing patch, which was queued into the
  mm-hotfixes-stable branch.
Baolin Wang (7):
mm: vmscan: add validation before splitting shmem large folio
mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
flag setting
mm: shmem: support large folio allocation for shmem_replace_folio()
mm: shmem: extend shmem_partial_swap_usage() to support large folio
swap
mm: add new 'orders' parameter for find_get_entries() and
find_lock_entries()
mm: shmem: use swap_free_nr() to free shmem swap entries
mm: shmem: support large folio swap out
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
include/linux/swap.h | 4 +-
include/linux/writeback.h | 1 +
mm/filemap.c | 27 ++++++-
mm/internal.h | 4 +-
mm/shmem.c | 58 ++++++++------
mm/swapfile.c | 98 ++++++++++++-----------
mm/truncate.c | 8 +-
mm/vmscan.c | 22 ++++-
9 files changed, 140 insertions(+), 83 deletions(-)
--
2.39.3
Add a check for available swap space before splitting a shmem large folio, to
avoid a redundant split, since we cannot write the shmem folio to the swap
device in that case.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/vmscan.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c0429fd6c573..9146fd0dc61e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1255,6 +1255,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
}
} else if (folio_test_swapbacked(folio) &&
folio_test_large(folio)) {
+
+ /*
+ * Do not split shmem folio if no swap memory
+ * available.
+ */
+ if (!total_swap_pages)
+ goto activate_locked;
+
/* Split shmem folio */
if (split_folio_to_list(folio, folio_list))
goto keep_locked;
--
2.39.3
To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for
shmem swap entries.
While we are at it, use folio_nr_pages() to get the number of pages of the
folio, as a preparation.
Signed-off-by: Baolin Wang <[email protected]>
---
include/linux/swap.h | 4 +-
mm/shmem.c | 6 ++-
mm/swapfile.c | 98 +++++++++++++++++++++++---------------------
3 files changed, 57 insertions(+), 51 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3df75d62a835..4a76ab0b4a7f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -477,7 +477,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t);
+extern void swap_shmem_alloc(swp_entry_t, int);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
@@ -544,7 +544,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline void swap_shmem_alloc(swp_entry_t swp)
+static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
{
}
diff --git a/mm/shmem.c b/mm/shmem.c
index d9a11950c586..174d8ae25b9b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1433,6 +1433,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
swp_entry_t swap;
pgoff_t index;
+ int nr_pages;
/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -1465,6 +1466,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
}
index = folio->index;
+ nr_pages = folio_nr_pages(folio);
/*
* This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
@@ -1517,8 +1519,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
if (add_to_swap_cache(folio, swap,
__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
NULL) == 0) {
- shmem_recalc_inode(inode, 0, 1);
- swap_shmem_alloc(swap);
+ shmem_recalc_inode(inode, 0, nr_pages);
+ swap_shmem_alloc(swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9c6d8e557c0f..1dde413264e2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3362,62 +3362,58 @@ void si_swapinfo(struct sysinfo *val)
* - swap-cache reference is requested but the entry is not used. -> ENOENT
* - swap-mapped reference requested but needs continued swap count. -> ENOMEM
*/
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate(struct swap_info_struct *p, unsigned long offset,
+ int nr, unsigned char usage)
{
- struct swap_info_struct *p;
struct swap_cluster_info *ci;
- unsigned long offset;
unsigned char count;
unsigned char has_cache;
- int err;
+ int err, i;
- p = swp_swap_info(entry);
-
- offset = swp_offset(entry);
ci = lock_cluster_or_swap_info(p, offset);
- count = p->swap_map[offset];
-
- /*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
- */
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
- }
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
- err = 0;
-
- if (usage == SWAP_HAS_CACHE) {
+ for (i = 0; i < nr; i++) {
+ count = p->swap_map[offset + i];
- /* set SWAP_HAS_CACHE if there is no cache and entry is used */
- if (!has_cache && count)
- has_cache = SWAP_HAS_CACHE;
- else if (has_cache) /* someone else added cache */
- err = -EEXIST;
- else /* no users remaining */
+ /*
+ * swapin_readahead() doesn't check if a swap entry is valid, so the
+ * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+ */
+ if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
err = -ENOENT;
+ break;
+ }
- } else if (count || has_cache) {
+ has_cache = count & SWAP_HAS_CACHE;
+ count &= ~SWAP_HAS_CACHE;
+ err = 0;
+
+ if (usage == SWAP_HAS_CACHE) {
+ /* set SWAP_HAS_CACHE if there is no cache and entry is used */
+ if (!has_cache && count)
+ has_cache = SWAP_HAS_CACHE;
+ else if (has_cache) /* someone else added cache */
+ err = -EEXIST;
+ else /* no users remaining */
+ err = -ENOENT;
+ } else if (count || has_cache) {
+ if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ count += usage;
+ else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
+ err = -EINVAL;
+ else if (swap_count_continued(p, offset + i, count))
+ count = COUNT_CONTINUED;
+ else
+ err = -ENOMEM;
+ } else
+ err = -ENOENT; /* unused swap entry */
- if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
- count += usage;
- else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
- err = -EINVAL;
- else if (swap_count_continued(p, offset, count))
- count = COUNT_CONTINUED;
- else
- err = -ENOMEM;
- } else
- err = -ENOENT; /* unused swap entry */
+ if (err)
+ break;
- if (!err)
- WRITE_ONCE(p->swap_map[offset], count | has_cache);
+ WRITE_ONCE(p->swap_map[offset + i], count | has_cache);
+ }
-unlock_out:
unlock_cluster_or_swap_info(p, ci);
return err;
}
@@ -3426,9 +3422,12 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
* Help swapoff by noting that swap entry belongs to shmem/tmpfs
* (in which case its reference count is never incremented).
*/
-void swap_shmem_alloc(swp_entry_t entry)
+void swap_shmem_alloc(swp_entry_t entry, int nr)
{
- __swap_duplicate(entry, SWAP_MAP_SHMEM);
+ struct swap_info_struct *p = swp_swap_info(entry);
+ unsigned long offset = swp_offset(entry);
+
+ __swap_duplicate(p, offset, nr, SWAP_MAP_SHMEM);
}
/*
@@ -3440,9 +3439,11 @@ void swap_shmem_alloc(swp_entry_t entry)
*/
int swap_duplicate(swp_entry_t entry)
{
+ struct swap_info_struct *p = swp_swap_info(entry);
+ unsigned long offset = swp_offset(entry);
int err = 0;
- while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
+ while (!err && __swap_duplicate(p, offset, 1, 1) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
@@ -3457,7 +3458,10 @@ int swap_duplicate(swp_entry_t entry)
*/
int swapcache_prepare(swp_entry_t entry)
{
- return __swap_duplicate(entry, SWAP_HAS_CACHE);
+ struct swap_info_struct *p = swp_swap_info(entry);
+ unsigned long offset = swp_offset(entry);
+
+ return __swap_duplicate(p, offset, 1, SWAP_HAS_CACHE);
}
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
--
2.39.3
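For reference, a minimal usage sketch of the batched helper once a whole large
folio has been added to the swap cache (mirroring the shmem_writepage() hunk
above; here 'swap' is assumed to be the first swap entry of the naturally
aligned cluster allocated for the folio):

		int nr_pages = folio_nr_pages(folio);

		/* account nr_pages newly swapped pages for this inode */
		shmem_recalc_inode(inode, 0, nr_pages);
		/* set SWAP_MAP_SHMEM on all nr_pages swap_map entries backing the folio */
		swap_shmem_alloc(swap, nr_pages);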
To support large folio swap-in for shmem in the following patches, add large
folio allocation for the new replacement folio in shmem_replace_folio(), and
update the statistics using the number of pages in the folio.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 174d8ae25b9b..eefdf5c61c04 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1889,7 +1889,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
*/
gfp &= ~GFP_CONSTRAINT_MASK;
VM_BUG_ON_FOLIO(folio_test_large(old), old);
- new = shmem_alloc_folio(gfp, 0, info, index);
+ new = shmem_alloc_folio(gfp, folio_order(old), info, index);
if (!new)
return -ENOMEM;
@@ -1910,11 +1910,13 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
xa_lock_irq(&swap_mapping->i_pages);
error = shmem_replace_entry(swap_mapping, swap_index, old, new);
if (!error) {
+ int nr_pages = folio_nr_pages(old);
+
mem_cgroup_migrate(old, new);
- __lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1);
- __lruvec_stat_mod_folio(new, NR_SHMEM, 1);
- __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -1);
- __lruvec_stat_mod_folio(old, NR_SHMEM, -1);
+ __lruvec_stat_mod_folio(new, NR_FILE_PAGES, nr_pages);
+ __lruvec_stat_mod_folio(new, NR_SHMEM, nr_pages);
+ __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -nr_pages);
+ __lruvec_stat_mod_folio(old, NR_SHMEM, -nr_pages);
}
xa_unlock_irq(&swap_mapping->i_pages);
--
2.39.3
To support shmem large folio swapout in the following patches, use
xa_get_order() to get the order of the swap entry when calculating the swap
usage of shmem.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index eefdf5c61c04..0ac71580decb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -865,13 +865,16 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
struct page *page;
unsigned long swapped = 0;
unsigned long max = end - 1;
+ int order;
rcu_read_lock();
xas_for_each(&xas, page, max) {
if (xas_retry(&xas, page))
continue;
- if (xa_is_value(page))
- swapped++;
+ if (xa_is_value(page)) {
+ order = xa_get_order(xas.xa, xas.xa_index);
+ swapped += 1 << order;
+ }
if (xas.xa_index == max)
break;
if (need_resched()) {
--
2.39.3
As a preparation for supporting shmem large folio swapout, use swap_free_nr()
to free the contiguous swap entries of a shmem large folio when the large
folio is swapped in from the swap cache.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 28ba603d87b8..33af3b2e5ecf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1950,6 +1950,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
struct address_space *mapping = inode->i_mapping;
swp_entry_t swapin_error;
void *old;
+ int nr_pages;
swapin_error = make_poisoned_swp_entry();
old = xa_cmpxchg_irq(&mapping->i_pages, index,
@@ -1958,6 +1959,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
if (old != swp_to_radix_entry(swap))
return;
+ nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
delete_from_swap_cache(folio);
/*
@@ -1965,8 +1967,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
* won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
* in shmem_evict_inode().
*/
- shmem_recalc_inode(inode, -1, -1);
- swap_free(swap);
+ shmem_recalc_inode(inode, -nr_pages, -nr_pages);
+ swap_free_nr(swap, nr_pages);
}
/*
@@ -1985,7 +1987,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct swap_info_struct *si;
struct folio *folio = NULL;
swp_entry_t swap;
- int error;
+ int error, nr_pages;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
swap = radix_to_swp_entry(*foliop);
@@ -2032,6 +2034,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
folio_wait_writeback(folio);
+ nr_pages = folio_nr_pages(folio);
/*
* Some architectures may have to restore extra metadata to the
@@ -2050,14 +2053,14 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (error)
goto failed;
- shmem_recalc_inode(inode, 0, -1);
+ shmem_recalc_inode(inode, 0, -nr_pages);
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
delete_from_swap_cache(folio);
folio_mark_dirty(folio);
- swap_free(swap);
+ swap_free_nr(swap, nr_pages);
put_swap_device(si);
*foliop = folio;
--
2.39.3
In the following patches, shmem will support swapping out large folios, which
means shmem mappings may contain large-order swap entries. Add an 'orders'
array parameter to find_get_entries() and find_lock_entries() so that callers
can obtain the order of shmem swap entries, which helps when releasing shmem
large folio swap entries.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/filemap.c | 27 +++++++++++++++++++++++++--
mm/internal.h | 4 ++--
mm/shmem.c | 17 +++++++++--------
mm/truncate.c | 8 ++++----
4 files changed, 40 insertions(+), 16 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 37061aafd191..47fcd9ee6012 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2036,14 +2036,24 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
* Return: The number of entries which were found.
*/
unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
- pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
+ pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices,
+ int *orders)
{
XA_STATE(xas, &mapping->i_pages, *start);
struct folio *folio;
+ int order;
rcu_read_lock();
while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
indices[fbatch->nr] = xas.xa_index;
+ if (orders) {
+ if (!xa_is_value(folio))
+ order = folio_order(folio);
+ else
+ order = xa_get_order(xas.xa, xas.xa_index);
+
+ orders[fbatch->nr] = order;
+ }
if (!folio_batch_add(fbatch, folio))
break;
}
@@ -2056,6 +2066,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
folio = fbatch->folios[idx];
if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
+ else if (orders)
+ nr = 1 << orders[idx];
*start = indices[idx] + nr;
}
return folio_batch_count(fbatch);
@@ -2082,10 +2094,12 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
* Return: The number of entries which were found.
*/
unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
- pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
+ pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices,
+ int *orders)
{
XA_STATE(xas, &mapping->i_pages, *start);
struct folio *folio;
+ int order;
rcu_read_lock();
while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
@@ -2099,9 +2113,16 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
if (folio->mapping != mapping ||
folio_test_writeback(folio))
goto unlock;
+ if (orders)
+ order = folio_order(folio);
VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
folio);
+ } else if (orders) {
+ order = xa_get_order(xas.xa, xas.xa_index);
}
+
+ if (orders)
+ orders[fbatch->nr] = order;
indices[fbatch->nr] = xas.xa_index;
if (!folio_batch_add(fbatch, folio))
break;
@@ -2120,6 +2141,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
folio = fbatch->folios[idx];
if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
+ else if (orders)
+ nr = 1 << orders[idx];
*start = indices[idx] + nr;
}
return folio_batch_count(fbatch);
diff --git a/mm/internal.h b/mm/internal.h
index 3419c329b3bc..0b5adb6c33cc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -339,9 +339,9 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
}
unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
- pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
+ pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders);
unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
- pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
+ pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders);
void filemap_free_folio(struct address_space *mapping, struct folio *folio);
int truncate_inode_folio(struct address_space *mapping, struct folio *folio);
bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
diff --git a/mm/shmem.c b/mm/shmem.c
index 0ac71580decb..28ba603d87b8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -840,14 +840,14 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
* Remove swap entry from page cache, free the swap and its page cache.
*/
static int shmem_free_swap(struct address_space *mapping,
- pgoff_t index, void *radswap)
+ pgoff_t index, void *radswap, int order)
{
void *old;
old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
if (old != radswap)
return -ENOENT;
- free_swap_and_cache(radix_to_swp_entry(radswap));
+ free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
return 0;
}
@@ -981,6 +981,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
pgoff_t end = (lend + 1) >> PAGE_SHIFT;
struct folio_batch fbatch;
pgoff_t indices[PAGEVEC_SIZE];
+ int orders[PAGEVEC_SIZE];
struct folio *folio;
bool same_folio;
long nr_swaps_freed = 0;
@@ -996,15 +997,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
folio_batch_init(&fbatch);
index = start;
while (index < end && find_lock_entries(mapping, &index, end - 1,
- &fbatch, indices)) {
+ &fbatch, indices, orders)) {
for (i = 0; i < folio_batch_count(&fbatch); i++) {
folio = fbatch.folios[i];
if (xa_is_value(folio)) {
if (unfalloc)
continue;
- nr_swaps_freed += !shmem_free_swap(mapping,
- indices[i], folio);
+ if (!shmem_free_swap(mapping, indices[i], folio, orders[i]))
+ nr_swaps_freed += 1 << orders[i];
continue;
}
@@ -1058,7 +1059,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
cond_resched();
if (!find_get_entries(mapping, &index, end - 1, &fbatch,
- indices)) {
+ indices, orders)) {
/* If all gone or hole-punch or unfalloc, we're done */
if (index == start || end != -1)
break;
@@ -1072,12 +1073,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
if (xa_is_value(folio)) {
if (unfalloc)
continue;
- if (shmem_free_swap(mapping, indices[i], folio)) {
+ if (shmem_free_swap(mapping, indices[i], folio, orders[i])) {
/* Swap was replaced by page: retry */
index = indices[i];
break;
}
- nr_swaps_freed++;
+ nr_swaps_freed += 1 << orders[i];
continue;
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 5ce62a939e55..3a4bc9dba451 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -352,7 +352,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
folio_batch_init(&fbatch);
index = start;
while (index < end && find_lock_entries(mapping, &index, end - 1,
- &fbatch, indices)) {
+ &fbatch, indices, NULL)) {
truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
for (i = 0; i < folio_batch_count(&fbatch); i++)
truncate_cleanup_folio(fbatch.folios[i]);
@@ -392,7 +392,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
while (index < end) {
cond_resched();
if (!find_get_entries(mapping, &index, end - 1, &fbatch,
- indices)) {
+ indices, NULL)) {
/* If all gone from start onwards, we're done */
if (index == start)
break;
@@ -496,7 +496,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
int i;
folio_batch_init(&fbatch);
- while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
+ while (find_lock_entries(mapping, &index, end, &fbatch, indices, NULL)) {
for (i = 0; i < folio_batch_count(&fbatch); i++) {
struct folio *folio = fbatch.folios[i];
@@ -622,7 +622,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
folio_batch_init(&fbatch);
index = start;
- while (find_get_entries(mapping, &index, end, &fbatch, indices)) {
+ while (find_get_entries(mapping, &index, end, &fbatch, indices, NULL)) {
for (i = 0; i < folio_batch_count(&fbatch); i++) {
struct folio *folio = fbatch.folios[i];
--
2.39.3
Shmem will support large folio allocation [1] [2] for better performance;
however, memory reclaim still splits these precious large folios when trying
to swap out shmem, which may lead to memory fragmentation and loses the
benefit of large folios for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting them, hence this patch supports large folio swap-out for shmem.
Note that the i915_gem_shmem driver still needs its folios to be split when
swapping, so add a new 'split_large_folio' flag to writeback_control to
indicate splitting the large folio.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
include/linux/writeback.h | 1 +
mm/shmem.c | 3 +--
mm/vmscan.c | 14 ++++++++++++--
4 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index c5e1c718a6d2..c66cb9c585e1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -308,6 +308,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
.range_start = 0,
.range_end = LLONG_MAX,
.for_reclaim = 1,
+ .split_large_folio = 1,
};
unsigned long i;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 112d806ddbe4..6f2599244ae0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -63,6 +63,7 @@ struct writeback_control {
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned unpinned_netfs_wb:1; /* Cleared I_PINNING_NETFS_WB */
+ unsigned split_large_folio:1; /* Split large folio for shmem writeback */
/*
* When writeback IOs are bounced through async layers, only the
diff --git a/mm/shmem.c b/mm/shmem.c
index 33af3b2e5ecf..22a5116888ce 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -776,7 +776,6 @@ static int shmem_add_to_page_cache(struct folio *folio,
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
- VM_BUG_ON(expected && folio_test_large(folio));
folio_ref_add(folio, nr);
folio->mapping = mapping;
@@ -1460,7 +1459,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
* "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
* and its shmem_writeback() needs them to be split when swapping.
*/
- if (folio_test_large(folio)) {
+ if (wbc->split_large_folio && folio_test_large(folio)) {
/* Ensure the subpages are still dirty */
folio_test_set_dirty(folio);
if (split_huge_page(page) < 0)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9146fd0dc61e..3523fd2dc524 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1263,8 +1263,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (!total_swap_pages)
goto activate_locked;
- /* Split shmem folio */
- if (split_folio_to_list(folio, folio_list))
+ /*
+ * Only split shmem folio when CONFIG_THP_SWAP
+ * is not enabled.
+ */
+ if (!IS_ENABLED(CONFIG_THP_SWAP) &&
+ split_folio_to_list(folio, folio_list))
goto keep_locked;
}
@@ -1366,10 +1370,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
+try_pageout:
switch (pageout(folio, mapping, &plug)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
+ if (shmem_mapping(mapping) && folio_test_large(folio) &&
+ !split_folio_to_list(folio, folio_list)) {
+ nr_pages = 1;
+ goto try_pageout;
+ }
goto activate_locked;
case PAGE_SUCCESS:
stat->nr_pageout += nr_pages;
--
2.39.3
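For readability, a minimal sketch of the swap-backed large folio branch in
shrink_folio_list() with patches 1 and 7 both applied (reconstructed from the
hunks above, not a verbatim copy of the tree):

		} else if (folio_test_swapbacked(folio) &&
			   folio_test_large(folio)) {
			/* Do not split shmem folio if no swap space available. */
			if (!total_swap_pages)
				goto activate_locked;

			/* Only split shmem folio when CONFIG_THP_SWAP is not enabled. */
			if (!IS_ENABLED(CONFIG_THP_SWAP) &&
			    split_folio_to_list(folio, folio_list))
				goto keep_locked;
		}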
Hi Baolin,
On Thu, Jun 06, 2024 at 07:58:54PM +0800, Baolin Wang wrote:
> To support shmem large folio swapout in the following patches, using
> xa_get_order() to get the order of the swap entry to calculate the swap
> usage of shmem.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> mm/shmem.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index eefdf5c61c04..0ac71580decb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -865,13 +865,16 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
> struct page *page;
> unsigned long swapped = 0;
> unsigned long max = end - 1;
> + int order;
>
> rcu_read_lock();
> xas_for_each(&xas, page, max) {
> if (xas_retry(&xas, page))
> continue;
> - if (xa_is_value(page))
> - swapped++;
> + if (xa_is_value(page)) {
> + order = xa_get_order(xas.xa, xas.xa_index);
> + swapped += 1 << order;
I'd get rid of order and simply do:
swapped += 1UL << xa_get_order()
> + }
> if (xas.xa_index == max)
> break;
> if (need_resched()) {
> --
> 2.39.3
>
Hi Baolin,
On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote:
> In the following patches, shmem will support the swap out of large folios,
> which means the shmem mappings may contain large order swap entries, so an
> 'orders' array is added for find_get_entries() and find_lock_entries() to
> obtain the order size of shmem swap entries, which will help in the release
> of shmem large folio swap entries.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> mm/filemap.c | 27 +++++++++++++++++++++++++--
> mm/internal.h | 4 ++--
> mm/shmem.c | 17 +++++++++--------
> mm/truncate.c | 8 ++++----
> 4 files changed, 40 insertions(+), 16 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 37061aafd191..47fcd9ee6012 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2036,14 +2036,24 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
> * Return: The number of entries which were found.
> */
> unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices,
> + int *orders)
> {
> XA_STATE(xas, &mapping->i_pages, *start);
> struct folio *folio;
> + int order;
>
> rcu_read_lock();
> while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
> indices[fbatch->nr] = xas.xa_index;
> + if (orders) {
> + if (!xa_is_value(folio))
> + order = folio_order(folio);
> + else
> + order = xa_get_order(xas.xa, xas.xa_index);
> +
> + orders[fbatch->nr] = order;
> + }
> if (!folio_batch_add(fbatch, folio))
> break;
> }
> @@ -2056,6 +2066,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> folio = fbatch->folios[idx];
> if (!xa_is_value(folio))
> nr = folio_nr_pages(folio);
> + else if (orders)
> + nr = 1 << orders[idx];
> *start = indices[idx] + nr;
> }
> return folio_batch_count(fbatch);
> @@ -2082,10 +2094,12 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> * Return: The number of entries which were found.
> */
> unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices,
> + int *orders)
> {
> XA_STATE(xas, &mapping->i_pages, *start);
> struct folio *folio;
> + int order;
>
> rcu_read_lock();
> while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
> @@ -2099,9 +2113,16 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> if (folio->mapping != mapping ||
> folio_test_writeback(folio))
> goto unlock;
> + if (orders)
> + order = folio_order(folio);
> VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
> folio);
> + } else if (orders) {
> + order = xa_get_order(xas.xa, xas.xa_index);
> }
> +
> + if (orders)
> + orders[fbatch->nr] = order;
> indices[fbatch->nr] = xas.xa_index;
> if (!folio_batch_add(fbatch, folio))
> break;
> @@ -2120,6 +2141,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> folio = fbatch->folios[idx];
> if (!xa_is_value(folio))
> nr = folio_nr_pages(folio);
> + else if (orders)
> + nr = 1 << orders[idx];
> *start = indices[idx] + nr;
> }
> return folio_batch_count(fbatch);
> diff --git a/mm/internal.h b/mm/internal.h
> index 3419c329b3bc..0b5adb6c33cc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -339,9 +339,9 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
> }
>
> unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders);
> unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders);
> void filemap_free_folio(struct address_space *mapping, struct folio *folio);
> int truncate_inode_folio(struct address_space *mapping, struct folio *folio);
> bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0ac71580decb..28ba603d87b8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -840,14 +840,14 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
> * Remove swap entry from page cache, free the swap and its page cache.
> */
> static int shmem_free_swap(struct address_space *mapping,
> - pgoff_t index, void *radswap)
> + pgoff_t index, void *radswap, int order)
> {
> void *old;
Matthew Wilcox suggested [1] returning the number of pages freed in shmem_free_swap().
[1] https://lore.kernel.org/all/[email protected]/
Which I submitted here:
https://lore.kernel.org/all/[email protected]/
Do you agree with the suggestion? If so, could we update my patch to use
free_swap_and_cache_nr() or include that here?
>
> old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
> if (old != radswap)
> return -ENOENT;
> - free_swap_and_cache(radix_to_swp_entry(radswap));
> + free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
> return 0;
> }
>
> @@ -981,6 +981,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> pgoff_t end = (lend + 1) >> PAGE_SHIFT;
> struct folio_batch fbatch;
> pgoff_t indices[PAGEVEC_SIZE];
> + int orders[PAGEVEC_SIZE];
> struct folio *folio;
> bool same_folio;
> long nr_swaps_freed = 0;
> @@ -996,15 +997,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> folio_batch_init(&fbatch);
> index = start;
> while (index < end && find_lock_entries(mapping, &index, end - 1,
> - &fbatch, indices)) {
> + &fbatch, indices, orders)) {
> for (i = 0; i < folio_batch_count(&fbatch); i++) {
> folio = fbatch.folios[i];
>
> if (xa_is_value(folio)) {
> if (unfalloc)
> continue;
> - nr_swaps_freed += !shmem_free_swap(mapping,
> - indices[i], folio);
> + if (!shmem_free_swap(mapping, indices[i], folio, orders[i]))
> + nr_swaps_freed += 1 << orders[i];
> continue;
> }
>
> @@ -1058,7 +1059,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> cond_resched();
>
> if (!find_get_entries(mapping, &index, end - 1, &fbatch,
> - indices)) {
> + indices, orders)) {
> /* If all gone or hole-punch or unfalloc, we're done */
> if (index == start || end != -1)
> break;
> @@ -1072,12 +1073,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> if (xa_is_value(folio)) {
> if (unfalloc)
> continue;
> - if (shmem_free_swap(mapping, indices[i], folio)) {
> + if (shmem_free_swap(mapping, indices[i], folio, orders[i])) {
> /* Swap was replaced by page: retry */
> index = indices[i];
> break;
> }
> - nr_swaps_freed++;
> + nr_swaps_freed += 1 << orders[i];
> continue;
> }
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 5ce62a939e55..3a4bc9dba451 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -352,7 +352,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> folio_batch_init(&fbatch);
> index = start;
> while (index < end && find_lock_entries(mapping, &index, end - 1,
> - &fbatch, indices)) {
> + &fbatch, indices, NULL)) {
> truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
> for (i = 0; i < folio_batch_count(&fbatch); i++)
> truncate_cleanup_folio(fbatch.folios[i]);
> @@ -392,7 +392,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> while (index < end) {
> cond_resched();
> if (!find_get_entries(mapping, &index, end - 1, &fbatch,
> - indices)) {
> + indices, NULL)) {
> /* If all gone from start onwards, we're done */
> if (index == start)
> break;
> @@ -496,7 +496,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
> int i;
>
> folio_batch_init(&fbatch);
> - while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
> + while (find_lock_entries(mapping, &index, end, &fbatch, indices, NULL)) {
> for (i = 0; i < folio_batch_count(&fbatch); i++) {
> struct folio *folio = fbatch.folios[i];
>
> @@ -622,7 +622,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>
> folio_batch_init(&fbatch);
> index = start;
> - while (find_get_entries(mapping, &index, end, &fbatch, indices)) {
> + while (find_get_entries(mapping, &index, end, &fbatch, indices, NULL)) {
> for (i = 0; i < folio_batch_count(&fbatch); i++) {
> struct folio *folio = fbatch.folios[i];
>
> --
> 2.39.3
>
Daniel
On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote:
> In the following patches, shmem will support the swap out of large folios,
> which means the shmem mappings may contain large order swap entries, so an
> 'orders' array is added for find_get_entries() and find_lock_entries() to
> obtain the order size of shmem swap entries, which will help in the release
> of shmem large folio swap entries.
I am not a fan. I was hoping that 'order' would be encoded in the swap
entry, not passed as a separate parameter.
As I understand it, we currently have a free bit, or
swp_to_radix_entry() would not work. We can use that as detailed
here to encode the order in a single bit.
https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
On 2024/6/10 22:53, Daniel Gomez wrote:
> Hi Baolin,
> On Thu, Jun 06, 2024 at 07:58:54PM +0800, Baolin Wang wrote:
>> To support shmem large folio swapout in the following patches, using
>> xa_get_order() to get the order of the swap entry to calculate the swap
>> usage of shmem.
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/shmem.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index eefdf5c61c04..0ac71580decb 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -865,13 +865,16 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>> struct page *page;
>> unsigned long swapped = 0;
>> unsigned long max = end - 1;
>> + int order;
>>
>> rcu_read_lock();
>> xas_for_each(&xas, page, max) {
>> if (xas_retry(&xas, page))
>> continue;
>> - if (xa_is_value(page))
>> - swapped++;
>> + if (xa_is_value(page)) {
>> + order = xa_get_order(xas.xa, xas.xa_index);
>> + swapped += 1 << order;
>
> I'd get rid of order and simply do:
>
> swapped += 1UL << xa_get_order()
OK. Will do.
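i.e. the hunk becomes simply (a sketch of the next version, not the posted
patch):

		if (xa_is_value(page))
			swapped += 1UL << xa_get_order(xas.xa, xas.xa_index);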
On 2024/6/10 23:23, Daniel Gomez wrote:
> Hi Baolin,
>
> On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote:
>> In the following patches, shmem will support the swap out of large folios,
>> which means the shmem mappings may contain large order swap entries, so an
>> 'orders' array is added for find_get_entries() and find_lock_entries() to
>> obtain the order size of shmem swap entries, which will help in the release
>> of shmem large folio swap entries.
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/filemap.c | 27 +++++++++++++++++++++++++--
>> mm/internal.h | 4 ++--
>> mm/shmem.c | 17 +++++++++--------
>> mm/truncate.c | 8 ++++----
>> 4 files changed, 40 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 37061aafd191..47fcd9ee6012 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2036,14 +2036,24 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
>> * Return: The number of entries which were found.
>> */
>> unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
>> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
>> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices,
>> + int *orders)
>> {
>> XA_STATE(xas, &mapping->i_pages, *start);
>> struct folio *folio;
>> + int order;
>>
>> rcu_read_lock();
>> while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
>> indices[fbatch->nr] = xas.xa_index;
>> + if (orders) {
>> + if (!xa_is_value(folio))
>> + order = folio_order(folio);
>> + else
>> + order = xa_get_order(xas.xa, xas.xa_index);
>> +
>> + orders[fbatch->nr] = order;
>> + }
>> if (!folio_batch_add(fbatch, folio))
>> break;
>> }
>> @@ -2056,6 +2066,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
>> folio = fbatch->folios[idx];
>> if (!xa_is_value(folio))
>> nr = folio_nr_pages(folio);
>> + else if (orders)
>> + nr = 1 << orders[idx];
>> *start = indices[idx] + nr;
>> }
>> return folio_batch_count(fbatch);
>> @@ -2082,10 +2094,12 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
>> * Return: The number of entries which were found.
>> */
>> unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
>> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices,
>> + int *orders)
>> {
>> XA_STATE(xas, &mapping->i_pages, *start);
>> struct folio *folio;
>> + int order;
>>
>> rcu_read_lock();
>> while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
>> @@ -2099,9 +2113,16 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>> if (folio->mapping != mapping ||
>> folio_test_writeback(folio))
>> goto unlock;
>> + if (orders)
>> + order = folio_order(folio);
>> VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
>> folio);
>> + } else if (orders) {
>> + order = xa_get_order(xas.xa, xas.xa_index);
>> }
>> +
>> + if (orders)
>> + orders[fbatch->nr] = order;
>> indices[fbatch->nr] = xas.xa_index;
>> if (!folio_batch_add(fbatch, folio))
>> break;
>> @@ -2120,6 +2141,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>> folio = fbatch->folios[idx];
>> if (!xa_is_value(folio))
>> nr = folio_nr_pages(folio);
>> + else if (orders)
>> + nr = 1 << orders[idx];
>> *start = indices[idx] + nr;
>> }
>> return folio_batch_count(fbatch);
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 3419c329b3bc..0b5adb6c33cc 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -339,9 +339,9 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
>> }
>>
>> unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
>> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders);
>> unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
>> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
>> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders);
>> void filemap_free_folio(struct address_space *mapping, struct folio *folio);
>> int truncate_inode_folio(struct address_space *mapping, struct folio *folio);
>> bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 0ac71580decb..28ba603d87b8 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -840,14 +840,14 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
>> * Remove swap entry from page cache, free the swap and its page cache.
>> */
>> static int shmem_free_swap(struct address_space *mapping,
>> - pgoff_t index, void *radswap)
>> + pgoff_t index, void *radswap, int order)
>> {
>> void *old;
>
> Matthew Wilcox suggested [1] returning the number of pages freed in shmem_free_swap().
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Which I submitted here:
> https://lore.kernel.org/all/[email protected]/
>
> Do you agree with the suggestion? If so, could we update my patch to use
> free_swap_and_cache_nr() or include that here?
Yes, this looks good to me. But we still need some modifications to
find_lock_entries() and find_get_entries() to update '*start' correctly.
I will include your changes in this patch in the next version.
Thanks.
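For illustration, a rough sketch of that direction, assuming the suggested
interface of returning the number of pages freed on top of
free_swap_and_cache_nr() (the exact form in the next version may differ):

	static long shmem_free_swap(struct address_space *mapping,
				    pgoff_t index, void *radswap)
	{
		int order = xa_get_order(&mapping->i_pages, index);
		void *old;

		old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
		if (old != radswap)
			return 0;
		/* free all swap entries covered by this large-order entry */
		free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);

		return 1 << order;
	}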
On 2024/6/11 00:59, Matthew Wilcox wrote:
> On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote:
>> In the following patches, shmem will support the swap out of large folios,
>> which means the shmem mappings may contain large order swap entries, so an
>> 'orders' array is added for find_get_entries() and find_lock_entries() to
>> obtain the order size of shmem swap entries, which will help in the release
>> of shmem large folio swap entries.
>
> I am not a fan.
With Daniel's suggestion, I think I can drop the 'order' parameter if
you don't like it.
> I was hoping that 'order' would be encoded in the swap
> entry, not passed as a separate parameter.
>
> As I understand it, we currently have a free bit, or
> swp_to_radix_entry() would not work. We can use that as detailed
> here to encode the order in a single bit.
>
> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
OK. This seems to deserve a separate patch set. I will look at your
suggestion in detail. Thanks.
On Thu, 6 Jun 2024, Baolin Wang wrote:
> Shmem will support large folio allocation [1] [2] to get a better performance,
> however, the memory reclaim still splits the precious large folios when trying
> to swap-out shmem, which may lead to the memory fragmentation issue and can not
> take advantage of the large folio for shmeme.
>
> Moreover, the swap code already supports for swapping out large folio without
> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> Hence this patch set also supports the large folio swap-out and swap-in for
> shmem.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
> [3] https://lore.kernel.org/all/[email protected]/T/
>
> Changes from RFC:
> - Rebased to the latest mm-unstable.
> - Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
> branch.
>
> Baolin Wang (7):
> mm: vmscan: add validation before spliting shmem large folio
> mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
> flag setting
> mm: shmem: support large folio allocation for shmem_replace_folio()
> mm: shmem: extend shmem_partial_swap_usage() to support large folio
> swap
> mm: add new 'orders' parameter for find_get_entries() and
> find_lock_entries()
> mm: shmem: use swap_free_nr() to free shmem swap entries
> mm: shmem: support large folio swap out
>
> drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
> include/linux/swap.h | 4 +-
> include/linux/writeback.h | 1 +
> mm/filemap.c | 27 ++++++-
> mm/internal.h | 4 +-
> mm/shmem.c | 58 ++++++++------
> mm/swapfile.c | 98 ++++++++++++-----------
> mm/truncate.c | 8 +-
> mm/vmscan.c | 22 ++++-
> 9 files changed, 140 insertions(+), 83 deletions(-)
I wanted to have some tests running, while looking through these
and your shmem mTHP patches; but I wasted too much time on that by
applying these on top and hitting crashes, OOMs and dreadful thrashing -
testing did not get very far at all.
Perhaps all easily fixed, but I don't have more time to spend on it,
and think this series cannot expect to go into 6.11: I'll have another
try with it next cycle.
I really must turn my attention to your shmem mTHP series: no doubt
I'll have minor adjustments to ask there - but several other people
are also waiting for me to respond (or given up on me completely).
The little crash fix needed in this series appears to be:
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2053,7 +2053,8 @@ static int shmem_swapin_folio(struct ino
goto failed;
}
- error = shmem_add_to_page_cache(folio, mapping, index,
+ error = shmem_add_to_page_cache(folio, mapping,
+ round_down(index, nr_pages),
swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
Then the OOMs and dreadful thrashing are due to refcount confusion:
I did not even glance at these patches to work out what's wanted,
but a printk in __remove_mapping() showed that folio->_refcount was
1024 where 513 was expected, so reclaim was freeing none of them.
Hugh
Hi Hugh,
On 2024/6/12 13:46, Hugh Dickins wrote:
> On Thu, 6 Jun 2024, Baolin Wang wrote:
>
>> Shmem will support large folio allocation [1] [2] to get a better performance,
>> however, the memory reclaim still splits the precious large folios when trying
>> to swap-out shmem, which may lead to the memory fragmentation issue and can not
>> take advantage of the large folio for shmeme.
>>
>> Moreover, the swap code already supports for swapping out large folio without
>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>> Hence this patch set also supports the large folio swap-out and swap-in for
>> shmem.
>>
>> [1] https://lore.kernel.org/all/[email protected]/
>> [2] https://lore.kernel.org/all/[email protected]/
>> [3] https://lore.kernel.org/all/[email protected]/T/
>>
>> Changes from RFC:
>> - Rebased to the latest mm-unstable.
>> - Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
>> branch.
>>
>> Baolin Wang (7):
>> mm: vmscan: add validation before spliting shmem large folio
>> mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
>> flag setting
>> mm: shmem: support large folio allocation for shmem_replace_folio()
>> mm: shmem: extend shmem_partial_swap_usage() to support large folio
>> swap
>> mm: add new 'orders' parameter for find_get_entries() and
>> find_lock_entries()
>> mm: shmem: use swap_free_nr() to free shmem swap entries
>> mm: shmem: support large folio swap out
>>
>> drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
>> include/linux/swap.h | 4 +-
>> include/linux/writeback.h | 1 +
>> mm/filemap.c | 27 ++++++-
>> mm/internal.h | 4 +-
>> mm/shmem.c | 58 ++++++++------
>> mm/swapfile.c | 98 ++++++++++++-----------
>> mm/truncate.c | 8 +-
>> mm/vmscan.c | 22 ++++-
>> 9 files changed, 140 insertions(+), 83 deletions(-)
>
> I wanted to have some tests running, while looking through these
> and your shmem mTHP patches; but I wasted too much time on that by
> applying these on top and hitting crash, OOMs and dreadful thrashing -
> testing did not get very far at all.
Thanks for testing. I am sorry I did not find these issues in my own testing.
> Perhaps all easily fixed, but I don't have more time to spend on it,
> and think this series cannot expect to go into 6.11: I'll have another
> try with it next cycle.
>
> I really must turn my attention to your shmem mTHP series: no doubt
> I'll have minor adjustments to ask there - but several other people
> are also waiting for me to respond (or given up on me completely).
Sure. Thanks.
>
> The little crash fix needed in this series appears to be:
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2053,7 +2053,8 @@ static int shmem_swapin_folio(struct ino
> goto failed;
> }
>
> - error = shmem_add_to_page_cache(folio, mapping, index,
> + error = shmem_add_to_page_cache(folio, mapping,
> + round_down(index, nr_pages),
> swp_to_radix_entry(swap), gfp);
> if (error)
> goto failed;
Good catch. I missed this.
> Then the OOMs and dreadful thrashing are due to refcount confusion:
> I did not even glance at these patches to work out what's wanted,
> but a printk in __remove_mapping() showed that folio->_refcount was
> 1024 where 513 was expected, so reclaim was freeing none of them.
I will look at this issue and continue to do more testing before sending
out a new version. Thanks.