2024-05-08 22:41:49

by Barry Song

Subject: [PATCH v4 0/6] large folios swap-in: handle refault cases first

From: Barry Song <[email protected]>

This patchset is extracted from the large folio swap-in series[1], primarily
addressing the handling of scenarios involving large folios in the swap
cache. Currently, it is particularly focused on the refault of an mTHP which
is still undergoing reclamation. This split aims to streamline code review
and expedite the integration of this segment into the MM tree.

It relies on Ryan's swap-out series[2], leveraging the helper function
swap_pte_batch() introduced by that series.

Presently, do_swap_page only encounters a large folio in the swap
cache before the large folio is released by vmscan. However, the code
should remain equally useful once we support large folio swap-in via
swapin_readahead(). This approach can effectively reduce page faults
and eliminate most redundant checks and early exits for MTE restoration
in recent MTE patchset[3].

The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead()
will be split into separate patch sets and sent at a later time.

-v4:
- collect acked-by/reviewed-by of Ryan, "Huang, Ying", Chris, David and
Khalid, many thanks!
- Simplify reuse code in do_swap_page() by checking refcount==1, per
David;
- Initialize large folio-related variables later in do_swap_page(), per
Ryan;
- define swap_free() as swap_free_nr(1) per Ying and Ryan.

-v3:
- optimize swap_free_nr() using a bitmap consisting of a single "long", per
"Huang, Ying";
- drop swap_free() as suggested by "Huang, Ying", now hibernation can get
batched;
- lots of cleanup in do_swap_page() as commented by Ryan Roberts and "Huang,
Ying";
- handle arch_do_swap_page() with nr pages even though the only platform
which needs it, sparc, doesn't enable THP_SWAP, as suggested by "Huang,
Ying";
- introduce pte_move_swp_offset() as suggested by "Huang, Ying";
- drop the "any_shared" of checking swap entries with respect to David's
comment;
- drop the counter of swapin_refault and keep it for debug purpose per
Ying
- collect reviewed-by tags
Link:
https://lore.kernel.org/linux-mm/[email protected]/

-v2:
- rebase on top of mm-unstable in which Ryan's swap_pte_batch() has changed
a lot.
- remove folio_add_new_anon_rmap() for !folio_test_anon()
as currently large folios are always anon(refault).
- add mTHP swpin refault counters
Link:
https://lore.kernel.org/linux-mm/[email protected]/

-v1:
Link: https://lore.kernel.org/linux-mm/[email protected]/

Differences from the original large folios swap-in series
- collect Reviewed-by and Acked-by tags;
- rename swap_nr_free() to swap_free_nr(), according to Ryan;
- limit the maximum kernel stack usage for swap_free_nr(), per Ryan;
- add an output argument to swap_pte_batch() to expose whether all entries
are exclusive;
- many cleanup refinements; handle the corner case where a folio's virtual
address might not be naturally aligned

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/

Barry Song (3):
mm: remove the implementation of swap_free() and always use
swap_free_nr()
mm: introduce pte_move_swp_offset() helper which can move offset
bidirectionally
mm: introduce arch_do_swap_page_nr() which allows restore metadata for
nr pages

Chuanhua Han (3):
mm: swap: introduce swap_free_nr() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: swap: entirely map large folios found in swapcache

include/linux/pgtable.h | 26 +++++++++++++-----
include/linux/swap.h | 9 +++++--
kernel/power/swap.c | 5 ++--
mm/internal.h | 25 ++++++++++++++---
mm/memory.c | 60 +++++++++++++++++++++++++++++++++--------
mm/swapfile.c | 48 +++++++++++++++++++++++++++++----
6 files changed, 142 insertions(+), 31 deletions(-)

--
2.34.1



2024-05-08 22:42:13

by Barry Song

Subject: [PATCH v4 2/6] mm: remove the implementation of swap_free() and always use swap_free_nr()

From: Barry Song <[email protected]>

To streamline maintenance efforts, we propose removing the implementation
of swap_free(). Instead, we can simply invoke swap_free_nr() with nr
set to 1. swap_free_nr() is designed with a bitmap consisting of only
one long, resulting in overhead that can be ignored for cases where nr
equals 1.

A prime candidate for leveraging swap_free_nr() lies within
kernel/power/swap.c. Implementing this change facilitates the adoption
of batch processing for hibernation.

Suggested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Chris Li <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
include/linux/swap.h | 10 +++++-----
kernel/power/swap.c | 5 ++---
mm/swapfile.c | 17 ++++-------------
3 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1d35e92d7e9..48131b869a4d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
-extern void swap_free(swp_entry_t);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
return 0;
}

-static inline void swap_free(swp_entry_t swp)
-{
-}
-
static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}
@@ -613,6 +608,11 @@ static inline void free_swap_and_cache(swp_entry_t entry)
free_swap_and_cache_nr(entry, 1);
}

+static inline void swap_free(swp_entry_t entry)
+{
+ swap_free_nr(entry, 1);
+}
+
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 5bc04bfe2db1..75bc9e3f9d59 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)

while ((node = swsusp_extents.rb_node)) {
struct swsusp_extent *ext;
- unsigned long offset;

ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
- for (offset = ext->start; offset <= ext->end; offset++)
- swap_free(swp_entry(swap, offset));
+ swap_free_nr(swp_entry(swap, ext->start),
+ ext->end - ext->start + 1);

kfree(ext);
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ec12f2b9d229..99e701620562 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
swap_range_free(p, offset, 1);
}

-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free(swp_entry_t entry)
-{
- struct swap_info_struct *p;
-
- p = _swap_info_get(entry);
- if (p)
- __swap_entry_free(p, entry);
-}
-
static void cluster_swap_free_nr(struct swap_info_struct *sis,
unsigned long offset, int nr_pages)
{
@@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
unlock_cluster_or_swap_info(sis, ci);
}

+/*
+ * Caller has made sure that the swap device corresponding to entry
+ * is still around or has not been recycled.
+ */
void swap_free_nr(swp_entry_t entry, int nr_pages)
{
int nr;
--
2.34.1


2024-05-08 22:42:15

by Barry Song

Subject: [PATCH v4 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()

From: Chuanhua Han <[email protected]>

While swapping in a large folio, we need to free the swap entries covering
the whole folio. To avoid frequently acquiring and releasing swap locks, it
is better to introduce an API for batched freeing. This new function,
swap_free_nr(), is designed to efficiently handle the various scenarios of
releasing a specified number, nr, of swap entries.
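
For illustration only, the following is a minimal userspace sketch of the
batching idea rather than the kernel implementation; all names here
(release_entry(), free_slot(), free_nr(), BATCH) are hypothetical stand-ins.
The pattern is to record which entries actually became free in a single-long
bitmap, then release them in one pass per batch:

#include <stdio.h>
#include <limits.h>

#define BATCH   (sizeof(unsigned long) * CHAR_BIT)     /* analogous to BITS_PER_LONG */

/* Hypothetical stand-in: drop one usage count, return 1 if the entry became free. */
static int release_entry(unsigned long offset)
{
        return 1;
}

/* Hypothetical stand-in for handing a free slot back to the allocator. */
static void free_slot(unsigned long offset)
{
        printf("free slot %lu\n", offset);
}

/* Free nr entries starting at offset, doing the bookkeeping one BATCH at a time. */
static void free_nr(unsigned long offset, int nr)
{
        while (nr) {
                unsigned long to_free = 0;
                int i, n = nr < (int)BATCH ? nr : (int)BATCH;

                for (i = 0; i < n; i++)
                        if (release_entry(offset + i))
                                to_free |= 1UL << i;

                for (i = 0; i < n; i++)
                        if (to_free & (1UL << i))
                                free_slot(offset + i);

                offset += n;
                nr -= n;
        }
}

int main(void)
{
        free_nr(100, 70);       /* on 64-bit this runs as two batches: 64 + 6 */
        return 0;
}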

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Acked-by: Chris Li <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
---
include/linux/swap.h | 5 +++++
mm/swapfile.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..d1d35e92d7e9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
+extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
@@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
{
}

+static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+}
+
static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6ca215fb92f..ec12f2b9d229 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
__swap_entry_free(p, entry);
}

+static void cluster_swap_free_nr(struct swap_info_struct *sis,
+ unsigned long offset, int nr_pages)
+{
+ struct swap_cluster_info *ci;
+ DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
+ int i, nr;
+
+ ci = lock_cluster_or_swap_info(sis, offset);
+ while (nr_pages) {
+ nr = min(BITS_PER_LONG, nr_pages);
+ for (i = 0; i < nr; i++) {
+ if (!__swap_entry_free_locked(sis, offset + i, 1))
+ bitmap_set(to_free, i, 1);
+ }
+ if (!bitmap_empty(to_free, BITS_PER_LONG)) {
+ unlock_cluster_or_swap_info(sis, ci);
+ for_each_set_bit(i, to_free, BITS_PER_LONG)
+ free_swap_slot(swp_entry(sis->type, offset + i));
+ if (nr == nr_pages)
+ return;
+ bitmap_clear(to_free, 0, BITS_PER_LONG);
+ ci = lock_cluster_or_swap_info(sis, offset);
+ }
+ offset += nr;
+ nr_pages -= nr;
+ }
+ unlock_cluster_or_swap_info(sis, ci);
+}
+
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+ int nr;
+ struct swap_info_struct *sis;
+ unsigned long offset = swp_offset(entry);
+
+ sis = _swap_info_get(entry);
+ if (!sis)
+ return;
+
+ while (nr_pages) {
+ nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+ cluster_swap_free_nr(sis, offset, nr);
+ offset += nr;
+ nr_pages -= nr;
+ }
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
--
2.34.1


2024-05-08 22:42:25

by Barry Song

Subject: [PATCH v4 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

From: Barry Song <[email protected]>

We may need to obtain the first swap pte_t from a swap pte_t located in the
middle of a large folio. For instance, this can occur within the context of
do_swap_page(), where a page fault can potentially hit any PTE of a large
folio. To address this, this patch introduces pte_move_swp_offset(), a
function capable of moving the offset bidirectionally by a specified delta
argument. Consequently, pte_next_swp_offset() simply invokes it with
delta = 1.
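
As a rough standalone illustration of the offset arithmetic only (the toy
entry layout below is made up for the example, and the real helper
additionally preserves swp pte bits such as soft-dirty and uffd-wp):

#include <stdio.h>

/* Toy swap entry layout: type in the top bits, offset in the low bits. */
#define TOY_OFFSET_BITS 24
#define TOY_OFFSET_MASK ((1UL << TOY_OFFSET_BITS) - 1)

static unsigned long toy_mk_entry(unsigned long type, unsigned long offset)
{
        return (type << TOY_OFFSET_BITS) | (offset & TOY_OFFSET_MASK);
}

/* Move the offset field forward (delta > 0) or backward (delta < 0), keeping the type. */
static unsigned long toy_move_offset(unsigned long entry, long delta)
{
        unsigned long type = entry >> TOY_OFFSET_BITS;
        unsigned long offset = entry & TOY_OFFSET_MASK;

        return toy_mk_entry(type, offset + delta);
}

int main(void)
{
        unsigned long e = toy_mk_entry(1, 1000);

        /* A fault on the 5th PTE of a large folio can recover the folio's first entry. */
        printf("first offset: %lu\n", toy_move_offset(e, -5) & TOY_OFFSET_MASK);
        return 0;
}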

Suggested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
---
mm/internal.h | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b2c75b12014e..17b0a1824948 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
}

/**
- * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * pte_move_swp_offset - Move the swap entry offset field of a swap pte
+ * forward or backward by delta
* @pte: The initial pte state; is_swap_pte(pte) must be true and
* non_swap_entry() must be false.
+ * @delta: The direction and the offset we are moving; forward if delta
+ * is positive; backward if delta is negative
*
- * Increments the swap offset, while maintaining all other fields, including
+ * Moves the swap offset, while maintaining all other fields, including
* swap type, and any swp pte bits. The resulting pte is returned.
*/
-static inline pte_t pte_next_swp_offset(pte_t pte)
+static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
swp_entry_t entry = pte_to_swp_entry(pte);
pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
- (swp_offset(entry) + 1)));
+ (swp_offset(entry) + delta)));

if (pte_swp_soft_dirty(pte))
new = pte_swp_mksoft_dirty(new);
@@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
return new;
}

+
+/**
+ * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * @pte: The initial pte state; is_swap_pte(pte) must be true and
+ * non_swap_entry() must be false.
+ *
+ * Increments the swap offset, while maintaining all other fields, including
+ * swap type, and any swp pte bits. The resulting pte is returned.
+ */
+static inline pte_t pte_next_swp_offset(pte_t pte)
+{
+ return pte_move_swp_offset(pte, 1);
+}
+
/**
* swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
* @start_ptep: Page table pointer for the first entry.
--
2.34.1


2024-05-08 22:42:48

by Barry Song

Subject: [PATCH v4 5/6] mm: swap: make should_try_to_free_swap() support large-folio

From: Chuanhua Han <[email protected]>

The function should_try_to_free_swap() operates under the assumption
that swap-in always occurs at the normal page granularity,
i.e., folio_nr_pages() = 1. However, in reality, for large folios,
add_to_swap_cache() will invoke folio_ref_add(folio, nr). To accommodate
large folio swap-in, this patch eliminates this assumption.
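
A quick worked example, assuming the swap cache holds one reference per
subpage on top of the single reference taken by the fault path:

        order-0 folio:  1 + 1  = 2    (the old hard-coded check)
        order-4 folio:  1 + 16 = 17   (now covered by 1 + folio_nr_pages(folio))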

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Acked-by: Chris Li <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b51c059af0b0..d9434df24d62 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3877,7 +3877,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
* reference only in case it's likely that we'll be the exlusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == 2;
+ folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}

static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
--
2.34.1


2024-05-08 22:52:15

by Barry Song

Subject: [PATCH v4 6/6] mm: swap: entirely map large folios found in swapcache

From: Chuanhua Han <[email protected]>

When a large folio is found in the swapcache, the current implementation
requires calling do_swap_page() nr_pages times, resulting in nr_pages
page faults. This patch opts to map the entire large folio at once to
minimize page faults. Additionally, redundant checks and early exits
for ARM64 MTE restoring are removed.
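
For intuition only, here is a standalone sketch of the containment check,
with hypothetical constants and helpers; the real code additionally verifies
via swap_pte_batch() that all nr swap PTEs still belong to this folio before
batch-mapping:

#include <stdio.h>

#define TOY_PAGE_SIZE   4096UL
#define TOY_PMD_SIZE    (512 * TOY_PAGE_SIZE)
#define TOY_PMD_MASK    (~(TOY_PMD_SIZE - 1))

/* Can the whole nr-page folio be mapped in one go, given a fault on its idx-th page? */
static int can_batch_map(unsigned long address, unsigned long idx, int nr,
                         unsigned long vm_start, unsigned long vm_end)
{
        unsigned long start = address - idx * TOY_PAGE_SIZE;
        unsigned long end = start + nr * TOY_PAGE_SIZE;
        unsigned long pmd_start = address & TOY_PMD_MASK;
        unsigned long pmd_end = pmd_start + TOY_PMD_SIZE;
        unsigned long low = vm_start > pmd_start ? vm_start : pmd_start;
        unsigned long high = vm_end < pmd_end ? vm_end : pmd_end;

        return start >= low && end <= high;
}

int main(void)
{
        /* Fault on the 3rd page of a 16-page folio, VMA covering [1 MiB, 9 MiB). */
        unsigned long addr = 0x200000UL + 3 * TOY_PAGE_SIZE;

        printf("%s\n", can_batch_map(addr, 3, 16, 0x100000UL, 0x900000UL) ?
               "map the whole folio" : "fall back to a single page");
        return 0;
}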

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
---
mm/memory.c | 59 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 48 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d9434df24d62..8b9e4cab93ed 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages;
+ unsigned long page_idx;
+ unsigned long address;
+ pte_t *ptep;

if (!pte_unmap_same(vmf))
goto out;
@@ -4166,6 +4170,38 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}

+ nr_pages = 1;
+ page_idx = 0;
+ address = vmf->address;
+ ptep = vmf->pte;
+ if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+ int nr = folio_nr_pages(folio);
+ unsigned long idx = folio_page_idx(folio, page);
+ unsigned long folio_start = address - idx * PAGE_SIZE;
+ unsigned long folio_end = folio_start + nr * PAGE_SIZE;
+ pte_t *folio_ptep;
+ pte_t folio_pte;
+
+ if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
+ goto check_folio;
+ if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
+ goto check_folio;
+
+ folio_ptep = vmf->pte - idx;
+ folio_pte = ptep_get(folio_ptep);
+ if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
+ swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+ goto check_folio;
+
+ page_idx = idx;
+ address = folio_start;
+ ptep = folio_ptep;
+ nr_pages = nr;
+ entry = folio->swap;
+ page = &folio->page;
+ }
+
+check_folio:
/*
* PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
* must never point at an anonymous page in the swapcache that is
@@ -4225,12 +4261,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_free_nr(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);

/*
@@ -4247,27 +4283,28 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ folio_ref_add(folio, nr_pages - 1);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
pte = pte_mkuffd_wp(pte);
- vmf->orig_pte = pte;
+ vmf->orig_pte = pte_advance_pfn(pte, page_idx);

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_add_new_anon_rmap(folio, vma, address);
folio_add_lru_vma(folio, vma);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
rmap_flags);
}

VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
- pte, vmf->orig_pte, 1);
+ set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
+ arch_do_swap_page_nr(vma->vm_mm, vma, address,
+ pte, pte, nr_pages);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4291,7 +4328,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.34.1


2024-05-08 22:52:20

by Barry Song

Subject: [PATCH v4 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages

From: Barry Song <[email protected]>

Should do_swap_page() have the capability to directly map a large folio,
metadata restoration becomes necessary for a specified number of pages
denoted as nr. It's important to highlight that metadata restoration is
solely required by the SPARC platform, which, however, does not enable
THP_SWAP. Consequently, in the present kernel configuration, there exists
no practical scenario where restoring metadata for nr pages is needed.
Platforms implementing THP_SWAP might invoke this function
with nr values exceeding 1, subsequent to do_swap_page() successfully
mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr()
functions remain empty.

Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
---
include/linux/pgtable.h | 26 ++++++++++++++++++++------
mm/memory.c | 3 ++-
2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 18019f037bae..463e84c3de26 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1084,6 +1084,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
})

#ifndef __HAVE_ARCH_DO_SWAP_PAGE
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ pte_t pte, pte_t oldpte,
+ int nr)
+{
+
+}
+#else
/*
* Some architectures support metadata associated with a page. When a
* page is being swapped out, this metadata must be saved so it can be
@@ -1092,12 +1101,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
* page as metadata for the page. arch_do_swap_page() can restore this
* metadata when a page is swapped back in.
*/
-static inline void arch_do_swap_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long addr,
- pte_t pte, pte_t oldpte)
-{
-
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ pte_t pte, pte_t oldpte,
+ int nr)
+{
+ for (int i = 0; i < nr; i++) {
+ arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
+ pte_advance_pfn(pte, i),
+ pte_advance_pfn(oldpte, i));
+ }
}
#endif

diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eae..b51c059af0b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4266,7 +4266,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
+ pte, vmf->orig_pte, 1);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
--
2.34.1


2024-05-09 07:48:45

by Huang, Ying

Subject: Re: [PATCH v4 6/6] mm: swap: entirely map large folios found in swapcache

Barry Song <[email protected]> writes:

> From: Chuanhua Han <[email protected]>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> Reviewed-by: Ryan Roberts <[email protected]>

LGTM, Thanks! Feel free to add

Reviewed-by: "Huang, Ying" <[email protected]>

in the future version.

> ---
> mm/memory.c | 59 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 48 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d9434df24d62..8b9e4cab93ed 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages;
> + unsigned long page_idx;
> + unsigned long address;
> + pte_t *ptep;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4166,6 +4170,38 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out_nomap;
> }
>
> + nr_pages = 1;
> + page_idx = 0;
> + address = vmf->address;
> + ptep = vmf->pte;
> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> + int nr = folio_nr_pages(folio);
> + unsigned long idx = folio_page_idx(folio, page);
> + unsigned long folio_start = address - idx * PAGE_SIZE;
> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> + pte_t *folio_ptep;
> + pte_t folio_pte;
> +
> + if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> + goto check_folio;
> + if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
> + goto check_folio;
> +
> + folio_ptep = vmf->pte - idx;
> + folio_pte = ptep_get(folio_ptep);
> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> + goto check_folio;
> +
> + page_idx = idx;
> + address = folio_start;
> + ptep = folio_ptep;
> + nr_pages = nr;
> + entry = folio->swap;
> + page = &folio->page;
> + }
> +
> +check_folio:
> /*
> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> * must never point at an anonymous page in the swapcache that is
> @@ -4225,12 +4261,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_free_nr(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4247,27 +4283,28 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + folio_ref_add(folio, nr_pages - 1);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> pte = pte_mkuffd_wp(pte);
> - vmf->orig_pte = pte;
> + vmf->orig_pte = pte_advance_pfn(pte, page_idx);
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> - pte, vmf->orig_pte, 1);
> + set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
> + arch_do_swap_page_nr(vma->vm_mm, vma, address,
> + pte, pte, nr_pages);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4291,7 +4328,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);

--
Best Regards,
Huang, Ying

2024-05-10 09:14:59

by Ryan Roberts

Subject: Re: [PATCH v4 2/6] mm: remove the implementation of swap_free() and always use swap_free_nr()

On 08/05/2024 23:40, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> To streamline maintenance efforts, we propose removing the implementation
> of swap_free(). Instead, we can simply invoke swap_free_nr() with nr
> set to 1. swap_free_nr() is designed with a bitmap consisting of only
> one long, resulting in overhead that can be ignored for cases where nr
> equals 1.
>
> A prime candidate for leveraging swap_free_nr() lies within
> kernel/power/swap.c. Implementing this change facilitates the adoption
> of batch processing for hibernation.
>
> Suggested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> Reviewed-by: "Huang, Ying" <[email protected]>
> Acked-by: Chris Li <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Pavel Machek <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Christoph Hellwig <[email protected]>

Reviewed-by: Ryan Roberts <[email protected]>

> ---
> include/linux/swap.h | 10 +++++-----
> kernel/power/swap.c | 5 ++---
> mm/swapfile.c | 17 ++++-------------
> 3 files changed, 11 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index d1d35e92d7e9..48131b869a4d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> -extern void swap_free(swp_entry_t);
> extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
> return 0;
> }
>
> -static inline void swap_free(swp_entry_t swp)
> -{
> -}
> -
> static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
> }
> @@ -613,6 +608,11 @@ static inline void free_swap_and_cache(swp_entry_t entry)
> free_swap_and_cache_nr(entry, 1);
> }
>
> +static inline void swap_free(swp_entry_t entry)
> +{
> + swap_free_nr(entry, 1);
> +}
> +
> #ifdef CONFIG_MEMCG
> static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> {
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 5bc04bfe2db1..75bc9e3f9d59 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
>
> while ((node = swsusp_extents.rb_node)) {
> struct swsusp_extent *ext;
> - unsigned long offset;
>
> ext = rb_entry(node, struct swsusp_extent, node);
> rb_erase(node, &swsusp_extents);
> - for (offset = ext->start; offset <= ext->end; offset++)
> - swap_free(swp_entry(swap, offset));
> + swap_free_nr(swp_entry(swap, ext->start),
> + ext->end - ext->start + 1);
>
> kfree(ext);
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ec12f2b9d229..99e701620562 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> swap_range_free(p, offset, 1);
> }
>
> -/*
> - * Caller has made sure that the swap device corresponding to entry
> - * is still around or has not been recycled.
> - */
> -void swap_free(swp_entry_t entry)
> -{
> - struct swap_info_struct *p;
> -
> - p = _swap_info_get(entry);
> - if (p)
> - __swap_entry_free(p, entry);
> -}
> -
> static void cluster_swap_free_nr(struct swap_info_struct *sis,
> unsigned long offset, int nr_pages)
> {
> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
> unlock_cluster_or_swap_info(sis, ci);
> }
>
> +/*
> + * Caller has made sure that the swap device corresponding to entry
> + * is still around or has not been recycled.
> + */
> void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
> int nr;


2024-05-21 21:22:02

by Barry Song

Subject: Re: [PATCH v4 0/6] large folios swap-in: handle refault cases first

Hi Andrew,

This patchset missed the merge window, but I've tried and found that it still
applies cleanly to today's mm-unstable. Would you like me to resend it or just
proceed with using this v4 version?

Thanks
Barry

On Thu, May 9, 2024 at 10:41 AM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> This patchset is extracted from the large folio swap-in series[1], primarily
> addressing the handling of scenarios involving large folios in the swap
> cache. Currently, it is particularly focused on the refault of an mTHP which
> is still undergoing reclamation. This split aims to streamline code review
> and expedite the integration of this segment into the MM tree.
>
> It relies on Ryan's swap-out series[2], leveraging the helper function
> swap_pte_batch() introduced by that series.
>
> Presently, do_swap_page only encounters a large folio in the swap
> cache before the large folio is released by vmscan. However, the code
> should remain equally useful once we support large folio swap-in via
> swapin_readahead(). This approach can effectively reduce page faults
> and eliminate most redundant checks and early exits for MTE restoration
> in recent MTE patchset[3].
>
> The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead()
> will be split into separate patch sets and sent at a later time.
>
> -v4:
> - collect acked-by/reviewed-by of Ryan, "Huang, Ying", Chris, David and
> Khalid, many thanks!
> - Simplify reuse code in do_swap_page() by checking refcount==1, per
> David;
> - Initialize large folio-related variables later in do_swap_page(), per
> Ryan;
> - define swap_free() as swap_free_nr(1) per Ying and Ryan.
>
> -v3:
> - optimize swap_free_nr using bitmap with single one "long"; "Huang, Ying"
> - drop swap_free() as suggested by "Huang, Ying", now hibernation can get
> batched;
> - lots of cleanup in do_swap_page() as commented by Ryan Roberts and "Huang,
> Ying";
> - handle arch_do_swap_page() with nr pages though the only platform which
> needs it, sparc, doesn't support THP_SWAPOUT as suggested by "Huang,
> Ying";
> - introduce pte_move_swp_offset() as suggested by "Huang, Ying";
> - drop the "any_shared" of checking swap entries with respect to David's
> comment;
> - drop the counter of swapin_refault and keep it for debug purpose per
> Ying
> - collect reviewed-by tags
> Link:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> -v2:
> - rebase on top of mm-unstable in which Ryan's swap_pte_batch() has changed
> a lot.
> - remove folio_add_new_anon_rmap() for !folio_test_anon()
> as currently large folios are always anon(refault).
> - add mTHP swpin refault counters
> Link:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> -v1:
> Link: https://lore.kernel.org/linux-mm/[email protected]/
>
> Differences with the original large folios swap-in series
> - collect r-o-b, acked;
> - rename swap_nr_free to swap_free_nr, according to Ryan;
> - limit the maximum kernel stack usage for swap_free_nr, Ryan;
> - add output argument in swap_pte_batch to expose if all entries are
> exclusive
> - many clean refinements, handle the corner case folio's virtual addr
> might not be naturally aligned
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
>
> Barry Song (3):
> mm: remove the implementation of swap_free() and always use
> swap_free_nr()
> mm: introduce pte_move_swp_offset() helper which can move offset
> bidirectionally
> mm: introduce arch_do_swap_page_nr() which allows restore metadata for
> nr pages
>
> Chuanhua Han (3):
> mm: swap: introduce swap_free_nr() for batched swap_free()
> mm: swap: make should_try_to_free_swap() support large-folio
> mm: swap: entirely map large folios found in swapcache
>
> include/linux/pgtable.h | 26 +++++++++++++-----
> include/linux/swap.h | 9 +++++--
> kernel/power/swap.c | 5 ++--
> mm/internal.h | 25 ++++++++++++++---
> mm/memory.c | 60 +++++++++++++++++++++++++++++++++--------
> mm/swapfile.c | 48 +++++++++++++++++++++++++++++----
> 6 files changed, 142 insertions(+), 31 deletions(-)
>
> --
> 2.34.1
>

2024-05-21 21:59:42

by Andrew Morton

Subject: Re: [PATCH v4 0/6] large folios swap-in: handle refault cases first

On Wed, 22 May 2024 09:21:38 +1200 Barry Song <[email protected]> wrote:

> This patchset missed the merge window, but I've tried and found that it still
> applies cleanly to today's mm-unstable. Would you like me to resend it or just
> proceed with using this v4 version?

It's in my post merge window backlog pile. I'll let you know when I
get to it ;)