2024-05-03 00:50:52

by Barry Song

Subject: [PATCH v3 0/6] large folios swap-in: handle refault cases first

From: Barry Song <[email protected]>

This patchset is extracted from the large folio swap-in series[1] and primarily
addresses the handling of large folios that are already in the swap cache. At
present it focuses on the refault case, where an mTHP that is still undergoing
reclamation is faulted back in. Splitting this part out should streamline code
review and expedite the integration of this segment into the MM tree.

It relies on Ryan's swap-out series[2], leveraging the helper function
swap_pte_batch() introduced by that series.

Presently, do_swap_page() only encounters a large folio in the swap
cache before that folio is released by vmscan. However, the code
should remain equally useful once we support large folio swap-in via
swapin_readahead(). This approach can effectively reduce page faults
and eliminate most of the redundant checks and early exits for MTE
restoration in the recent MTE patchset[3].

The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead()
will be split into separate patch sets and sent at a later time.

-v3:
- optimize swap_free_nr() to use a bitmap consisting of a single long, per
  "Huang, Ying";
- drop swap_free() as suggested by "Huang, Ying"; hibernation can now be
  batched;
- lots of cleanup in do_swap_page() as commented by Ryan Roberts and "Huang,
  Ying";
- handle arch_do_swap_page() with nr pages even though the only platform
  which needs it, sparc, doesn't support THP_SWAP, as suggested by "Huang,
  Ying";
- introduce pte_move_swp_offset() as suggested by "Huang, Ying";
- drop the "any_shared" output when checking swap entries, per David's
  comment;
- drop the swapin_refault counter and keep it only for debugging purposes,
  per Ying;
- collect Reviewed-by tags.

-v2:
- rebase on top of mm-unstable, in which Ryan's swap_pte_batch() has changed
  a lot;
- remove folio_add_new_anon_rmap() for !folio_test_anon(), as currently
  large folios are always anon (refault);
- add mTHP swpin refault counters.
Link: https://lore.kernel.org/linux-mm/[email protected]/

-v1:
Link: https://lore.kernel.org/linux-mm/[email protected]/

Differences from the original large folios swap-in series:
- collect Reviewed-by and Acked-by tags;
- rename swap_nr_free to swap_free_nr, according to Ryan;
- limit the maximum kernel stack usage for swap_free_nr, per Ryan;
- add an output argument in swap_pte_batch to expose whether all entries are
  exclusive;
- many cleanups and refinements; handle the corner case where a folio's
  virtual address might not be naturally aligned.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/

Barry Song (3):
mm: remove swap_free() and always use swap_free_nr()
mm: introduce pte_move_swp_offset() helper which can move offset
bidirectionally
mm: introduce arch_do_swap_page_nr() which allows restore metadata for
nr pages

Chuanhua Han (3):
mm: swap: introduce swap_free_nr() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: swap: entirely map large folios found in swapcache

include/linux/pgtable.h | 26 ++++++++++++++----
include/linux/swap.h | 4 +--
kernel/power/swap.c | 7 ++---
mm/internal.h | 25 ++++++++++++++---
mm/memory.c | 61 +++++++++++++++++++++++++++++++++--------
mm/rmap.c | 4 +--
mm/shmem.c | 4 +--
mm/swapfile.c | 50 +++++++++++++++++++++++++++++----
8 files changed, 143 insertions(+), 38 deletions(-)

--
2.34.1



2024-05-03 00:51:04

by Barry Song

Subject: [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()

From: Chuanhua Han <[email protected]>

While swapping in a large folio, we need to free swaps related to the whole
folio. To avoid frequently acquiring and releasing swap locks, it is better
to introduce an API for batched free.
Furthermore, this new function, swap_free_nr(), is designed to efficiently
handle various scenarios for releasing a specified number, nr, of swap
entries.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/swap.h | 5 +++++
mm/swapfile.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..d1d35e92d7e9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
+extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
@@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
{
}

+static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+}
+
static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6ca215fb92f..ec12f2b9d229 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
__swap_entry_free(p, entry);
}

+static void cluster_swap_free_nr(struct swap_info_struct *sis,
+ unsigned long offset, int nr_pages)
+{
+ struct swap_cluster_info *ci;
+ DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
+ int i, nr;
+
+ ci = lock_cluster_or_swap_info(sis, offset);
+ while (nr_pages) {
+ nr = min(BITS_PER_LONG, nr_pages);
+ for (i = 0; i < nr; i++) {
+ if (!__swap_entry_free_locked(sis, offset + i, 1))
+ bitmap_set(to_free, i, 1);
+ }
+ if (!bitmap_empty(to_free, BITS_PER_LONG)) {
+ unlock_cluster_or_swap_info(sis, ci);
+ for_each_set_bit(i, to_free, BITS_PER_LONG)
+ free_swap_slot(swp_entry(sis->type, offset + i));
+ if (nr == nr_pages)
+ return;
+ bitmap_clear(to_free, 0, BITS_PER_LONG);
+ ci = lock_cluster_or_swap_info(sis, offset);
+ }
+ offset += nr;
+ nr_pages -= nr;
+ }
+ unlock_cluster_or_swap_info(sis, ci);
+}
+
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+ int nr;
+ struct swap_info_struct *sis;
+ unsigned long offset = swp_offset(entry);
+
+ sis = _swap_info_get(entry);
+ if (!sis)
+ return;
+
+ while (nr_pages) {
+ nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+ cluster_swap_free_nr(sis, offset, nr);
+ offset += nr;
+ nr_pages -= nr;
+ }
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
--
2.34.1


2024-05-03 00:51:15

by Barry Song

Subject: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

From: Barry Song <[email protected]>

To streamline maintenance efforts, we propose discontinuing the use of
swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
to 1. This adjustment offers the advantage of enabling batch processing
within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
a bitmap consisting of only one long, resulting in overhead that can be
ignored for cases where nr equals 1.

Suggested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Hugh Dickins <[email protected]>
---
include/linux/swap.h | 5 -----
kernel/power/swap.c | 7 +++----
mm/memory.c | 2 +-
mm/rmap.c | 4 ++--
mm/shmem.c | 4 ++--
mm/swapfile.c | 19 +++++--------------
6 files changed, 13 insertions(+), 28 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1d35e92d7e9..f03cb446124e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
-extern void swap_free(swp_entry_t);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
return 0;
}

-static inline void swap_free(swp_entry_t swp)
-{
-}
-
static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 5bc04bfe2db1..6befaa88a342 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
offset = swp_offset(get_swap_page_of_type(swap));
if (offset) {
if (swsusp_extents_insert(offset))
- swap_free(swp_entry(swap, offset));
+ swap_free_nr(swp_entry(swap, offset), 1);
else
return swapdev_block(swap, offset);
}
@@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)

while ((node = swsusp_extents.rb_node)) {
struct swsusp_extent *ext;
- unsigned long offset;

ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
- for (offset = ext->start; offset <= ext->end; offset++)
- swap_free(swp_entry(swap, offset));
+ swap_free_nr(swp_entry(swap, ext->start),
+ ext->end - ext->start + 1);

kfree(ext);
}
diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eae..f033eb3528ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_free_nr(entry, 1);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

diff --git a/mm/rmap.c b/mm/rmap.c
index 087a79f1f611..39ec7742acec 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_done_err;
}
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- swap_free(entry);
+ swap_free_nr(entry, 1);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_done_err;
}
@@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* See folio_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
folio_try_share_anon_rmap_pte(folio, subpage)) {
- swap_free(entry);
+ swap_free_nr(entry, 1);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_done_err;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index fa2a0ed97507..bfc8a2beb24f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
* in shmem_evict_inode().
*/
shmem_recalc_inode(inode, -1, -1);
- swap_free(swap);
+ swap_free_nr(swap, 1);
}

/*
@@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,

delete_from_swap_cache(folio);
folio_mark_dirty(folio);
- swap_free(swap);
+ swap_free_nr(swap, 1);
put_swap_device(si);

*foliop = folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ec12f2b9d229..ddcd0f24b9a1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
swap_range_free(p, offset, 1);
}

-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free(swp_entry_t entry)
-{
- struct swap_info_struct *p;
-
- p = _swap_info_get(entry);
- if (p)
- __swap_entry_free(p, entry);
-}
-
static void cluster_swap_free_nr(struct swap_info_struct *sis,
unsigned long offset, int nr_pages)
{
@@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
unlock_cluster_or_swap_info(sis, ci);
}

+/*
+ * Caller has made sure that the swap device corresponding to entry
+ * is still around or has not been recycled.
+ */
void swap_free_nr(swp_entry_t entry, int nr_pages)
{
int nr;
@@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mkuffd_wp(new_pte);
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
- swap_free(entry);
+ swap_free_nr(entry, 1);
out:
if (pte)
pte_unmap_unlock(pte, ptl);
--
2.34.1


2024-05-03 00:51:27

by Barry Song

Subject: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

From: Barry Song <[email protected]>

There could arise a necessity to obtain the first pte_t from a swap
pte_t located in the middle. For instance, this may occur within the
context of do_swap_page(), where a page fault can potentially occur in
any PTE of a large folio. To address this, the following patch introduces
pte_move_swp_offset(), a function capable of bidirectional movement by
a specified delta argument. Consequently, pte_increment_swp_offset()
will directly invoke it with delta = 1.

Suggested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/internal.h | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c5552d35d995..cfe4aed66a5c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
}

/**
- * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * pte_move_swp_offset - Move the swap entry offset field of a swap pte
+ * forward or backward by delta
* @pte: The initial pte state; is_swap_pte(pte) must be true and
* non_swap_entry() must be false.
+ * @delta: The direction and the offset we are moving; forward if delta
+ * is positive; backward if delta is negative
*
- * Increments the swap offset, while maintaining all other fields, including
+ * Moves the swap offset, while maintaining all other fields, including
* swap type, and any swp pte bits. The resulting pte is returned.
*/
-static inline pte_t pte_next_swp_offset(pte_t pte)
+static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
swp_entry_t entry = pte_to_swp_entry(pte);
pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
- (swp_offset(entry) + 1)));
+ (swp_offset(entry) + delta)));

if (pte_swp_soft_dirty(pte))
new = pte_swp_mksoft_dirty(new);
@@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
return new;
}

+
+/**
+ * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * @pte: The initial pte state; is_swap_pte(pte) must be true and
+ * non_swap_entry() must be false.
+ *
+ * Increments the swap offset, while maintaining all other fields, including
+ * swap type, and any swp pte bits. The resulting pte is returned.
+ */
+static inline pte_t pte_next_swp_offset(pte_t pte)
+{
+ return pte_move_swp_offset(pte, 1);
+}
+
/**
* swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
* @start_ptep: Page table pointer for the first entry.
--
2.34.1


2024-05-03 00:51:43

by Barry Song

Subject: [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages

From: Barry Song <[email protected]>

Should do_swap_page() have the capability to directly map a large folio,
metadata restoration becomes necessary for a specified number of pages
denoted as nr. It's important to highlight that metadata restoration is
solely required by the SPARC platform, which, however, does not enable
THP_SWAP. Consequently, in the present kernel configuration, there
exists no practical scenario where users necessitate the restoration of
nr metadata. Platforms implementing THP_SWAP might invoke this function
with nr values exceeding 1, subsequent to do_swap_page() successfully
mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr()
functions remain empty.

Cc: Khalid Aziz <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/pgtable.h | 26 ++++++++++++++++++++------
mm/memory.c | 3 ++-
2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 18019f037bae..463e84c3de26 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1084,6 +1084,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
})

#ifndef __HAVE_ARCH_DO_SWAP_PAGE
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ pte_t pte, pte_t oldpte,
+ int nr)
+{
+
+}
+#else
/*
* Some architectures support metadata associated with a page. When a
* page is being swapped out, this metadata must be saved so it can be
@@ -1092,12 +1101,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
* page as metadata for the page. arch_do_swap_page() can restore this
* metadata when a page is swapped back in.
*/
-static inline void arch_do_swap_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long addr,
- pte_t pte, pte_t oldpte)
-{
-
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ pte_t pte, pte_t oldpte,
+ int nr)
+{
+ for (int i = 0; i < nr; i++) {
+ arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
+ pte_advance_pfn(pte, i),
+ pte_advance_pfn(oldpte, i));
+ }
}
#endif

diff --git a/mm/memory.c b/mm/memory.c
index f033eb3528ba..74cdefd58f5f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4266,7 +4266,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
+ pte, vmf->orig_pte, 1);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
--
2.34.1


2024-05-03 00:51:48

by Barry Song

Subject: [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio

From: Chuanhua Han <[email protected]>

The function should_try_to_free_swap() operates under the assumption
that swap-in always occurs at the normal page granularity,
i.e., folio_nr_pages() = 1. However, in reality, for large folios,
add_to_swap_cache() will invoke folio_ref_add(folio, nr). To accommodate
large folio swap-in, this patch eliminates this assumption.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Acked-by: Chris Li <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 74cdefd58f5f..22e7c33cc747 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3877,7 +3877,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
* reference only in case it's likely that we'll be the exlusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == 2;
+ folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}

static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
--
2.34.1


2024-05-03 00:52:02

by Barry Song

Subject: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

From: Chuanhua Han <[email protected]>

When a large folio is found in the swapcache, the current implementation
requires calling do_swap_page() nr_pages times, resulting in nr_pages
page faults. This patch opts to map the entire large folio at once to
minimize page faults. Additionally, redundant checks and early exits
for ARM64 MTE restoring are removed.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 48 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 22e7c33cc747..940fdbe69fa1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages = 1;
+ unsigned long page_idx = 0;
+ unsigned long address = vmf->address;
+ pte_t *ptep;

if (!pte_unmap_same(vmf))
goto out;
@@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}

+ ptep = vmf->pte;
+ if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+ int nr = folio_nr_pages(folio);
+ unsigned long idx = folio_page_idx(folio, page);
+ unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
+ unsigned long folio_end = folio_start + nr * PAGE_SIZE;
+ pte_t *folio_ptep;
+ pte_t folio_pte;
+
+ if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
+ goto check_folio;
+ if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
+ goto check_folio;
+
+ folio_ptep = vmf->pte - idx;
+ folio_pte = ptep_get(folio_ptep);
+ if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
+ swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+ goto check_folio;
+
+ page_idx = idx;
+ address = folio_start;
+ ptep = folio_ptep;
+ nr_pages = nr;
+ entry = folio->swap;
+ page = &folio->page;
+ }
+
+check_folio:
+
/*
* PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
* must never point at an anonymous page in the swapcache that is
@@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free_nr(entry, 1);
+ swap_free_nr(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);

/*
@@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || folio_ref_count(folio) == 1)) {
+ (exclusive || (folio_ref_count(folio) == nr_pages &&
+ folio_nr_pages(folio) == nr_pages))) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
pte = pte_mkuffd_wp(pte);
- vmf->orig_pte = pte;
+ vmf->orig_pte = pte_advance_pfn(pte, page_idx);

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_add_new_anon_rmap(folio, vma, address);
folio_add_lru_vma(folio, vma);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
rmap_flags);
}

VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
- pte, vmf->orig_pte, 1);
+ set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
+ arch_do_swap_page_nr(vma->vm_mm, vma, address,
+ pte, pte, nr_pages);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.34.1


2024-05-03 09:26:24

by Ryan Roberts

Subject: Re: [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()

On 03/05/2024 01:50, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> Furthermore, this new function, swap_free_nr(), is designed to efficiently
> handle various scenarios for releasing a specified number, nr, of swap
> entries.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>

This looks much better!

Reviewed-by: Ryan Roberts <[email protected]>


> ---
> include/linux/swap.h | 5 +++++
> mm/swapfile.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..d1d35e92d7e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> {
> }
>
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> {
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6ca215fb92f..ec12f2b9d229 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
> __swap_entry_free(p, entry);
> }
>
> +static void cluster_swap_free_nr(struct swap_info_struct *sis,
> + unsigned long offset, int nr_pages)
> +{
> + struct swap_cluster_info *ci;
> + DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
> + int i, nr;
> +
> + ci = lock_cluster_or_swap_info(sis, offset);
> + while (nr_pages) {
> + nr = min(BITS_PER_LONG, nr_pages);
> + for (i = 0; i < nr; i++) {
> + if (!__swap_entry_free_locked(sis, offset + i, 1))
> + bitmap_set(to_free, i, 1);
> + }
> + if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> + unlock_cluster_or_swap_info(sis, ci);
> + for_each_set_bit(i, to_free, BITS_PER_LONG)
> + free_swap_slot(swp_entry(sis->type, offset + i));
> + if (nr == nr_pages)
> + return;
> + bitmap_clear(to_free, 0, BITS_PER_LONG);
> + ci = lock_cluster_or_swap_info(sis, offset);
> + }
> + offset += nr;
> + nr_pages -= nr;
> + }
> + unlock_cluster_or_swap_info(sis, ci);
> +}
> +
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> + int nr;
> + struct swap_info_struct *sis;
> + unsigned long offset = swp_offset(entry);
> +
> + sis = _swap_info_get(entry);
> + if (!sis)
> + return;
> +
> + while (nr_pages) {
> + nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> + cluster_swap_free_nr(sis, offset, nr);
> + offset += nr;
> + nr_pages -= nr;
> + }
> +}
> +
> /*
> * Called after dropping swapcache to decrease refcnt to swap entries.
> */


2024-05-03 09:31:28

by Ryan Roberts

Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On 03/05/2024 01:50, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> To streamline maintenance efforts, we propose discontinuing the use of
> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
> to 1. This adjustment offers the advantage of enabling batch processing
> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
> a bitmap consisting of only one long, resulting in overhead that can be
> ignored for cases where nr equals 1.
>
> Suggested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Pavel Machek <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> ---
> include/linux/swap.h | 5 -----
> kernel/power/swap.c | 7 +++----
> mm/memory.c | 2 +-
> mm/rmap.c | 4 ++--
> mm/shmem.c | 4 ++--
> mm/swapfile.c | 19 +++++--------------
> 6 files changed, 13 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index d1d35e92d7e9..f03cb446124e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> -extern void swap_free(swp_entry_t);

I wonder if it would be cleaner to:

#define swap_free(entry) swap_free_nr((entry), 1)

To save all the churn for the callsites that just want to pass a single entry?
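
Or equivalently a static inline wrapper; just a sketch of the same idea, not
something I've tested:

static inline void swap_free(swp_entry_t entry)
{
        /* single-entry convenience wrapper around the batched API */
        swap_free_nr(entry, 1);
}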

> extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
> return 0;
> }
>
> -static inline void swap_free(swp_entry_t swp)
> -{
> -}
> -
> static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
> }
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 5bc04bfe2db1..6befaa88a342 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
> offset = swp_offset(get_swap_page_of_type(swap));
> if (offset) {
> if (swsusp_extents_insert(offset))
> - swap_free(swp_entry(swap, offset));
> + swap_free_nr(swp_entry(swap, offset), 1);
> else
> return swapdev_block(swap, offset);
> }
> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
>
> while ((node = swsusp_extents.rb_node)) {
> struct swsusp_extent *ext;
> - unsigned long offset;
>
> ext = rb_entry(node, struct swsusp_extent, node);
> rb_erase(node, &swsusp_extents);
> - for (offset = ext->start; offset <= ext->end; offset++)
> - swap_free(swp_entry(swap, offset));
> + swap_free_nr(swp_entry(swap, ext->start),
> + ext->end - ext->start + 1);
>
> kfree(ext);
> }
> diff --git a/mm/memory.c b/mm/memory.c
> index eea6e4984eae..f033eb3528ba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_free_nr(entry, 1);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 087a79f1f611..39ec7742acec 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> goto walk_done_err;
> }
> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> - swap_free(entry);
> + swap_free_nr(entry, 1);
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_done_err;
> }
> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> /* See folio_try_share_anon_rmap(): clear PTE first. */
> if (anon_exclusive &&
> folio_try_share_anon_rmap_pte(folio, subpage)) {
> - swap_free(entry);
> + swap_free_nr(entry, 1);
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_done_err;
> }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index fa2a0ed97507..bfc8a2beb24f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> * in shmem_evict_inode().
> */
> shmem_recalc_inode(inode, -1, -1);
> - swap_free(swap);
> + swap_free_nr(swap, 1);
> }
>
> /*
> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>
> delete_from_swap_cache(folio);
> folio_mark_dirty(folio);
> - swap_free(swap);
> + swap_free_nr(swap, 1);
> put_swap_device(si);
>
> *foliop = folio;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ec12f2b9d229..ddcd0f24b9a1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> swap_range_free(p, offset, 1);
> }
>
> -/*
> - * Caller has made sure that the swap device corresponding to entry
> - * is still around or has not been recycled.
> - */
> -void swap_free(swp_entry_t entry)
> -{
> - struct swap_info_struct *p;
> -
> - p = _swap_info_get(entry);
> - if (p)
> - __swap_entry_free(p, entry);
> -}
> -
> static void cluster_swap_free_nr(struct swap_info_struct *sis,
> unsigned long offset, int nr_pages)
> {
> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
> unlock_cluster_or_swap_info(sis, ci);
> }
>
> +/*
> + * Caller has made sure that the swap device corresponding to entry
> + * is still around or has not been recycled.
> + */
> void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
> int nr;
> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> new_pte = pte_mkuffd_wp(new_pte);
> setpte:
> set_pte_at(vma->vm_mm, addr, pte, new_pte);
> - swap_free(entry);
> + swap_free_nr(entry, 1);
> out:
> if (pte)
> pte_unmap_unlock(pte, ptl);


2024-05-03 09:42:14

by Ryan Roberts

Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On 03/05/2024 01:50, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()

You mean pte_next_swp_offset()?

> will directly invoke it with delta = 1.
>
> Suggested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/internal.h | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> }
>
> /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + * forward or backward by delta
> * @pte: The initial pte state; is_swap_pte(pte) must be true and
> * non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + * is positive; backward if delta is negative
> *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
> * swap type, and any swp pte bits. The resulting pte is returned.
> */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)

We have equivalent functions for pfn:

pte_next_pfn()
pte_advance_pfn()

Although the latter takes an unsigned long and only moves forward currently. I
wonder if it makes sense to have their naming and semantics match? i.e. change
pte_advance_pfn() to pte_move_pfn() and let it move backwards too.

I guess we don't have a need for that and it adds more churn.

Anyway:

Reviewed-by: Ryan Roberts <[email protected]>


> {
> swp_entry_t entry = pte_to_swp_entry(pte);
> pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> - (swp_offset(entry) + 1)));
> + (swp_offset(entry) + delta)));
>
> if (pte_swp_soft_dirty(pte))
> new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> return new;
> }
>
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + * non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> + return pte_move_swp_offset(pte, 1);
> +}
> +
> /**
> * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> * @start_ptep: Page table pointer for the first entry.


2024-05-03 10:02:21

by Ryan Roberts

Subject: Re: [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages

On 03/05/2024 01:50, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> Should do_swap_page() have the capability to directly map a large folio,
> metadata restoration becomes necessary for a specified number of pages
> denoted as nr. It's important to highlight that metadata restoration is
> solely required by the SPARC platform, which, however, does not enable
> THP_SWAP. Consequently, in the present kernel configuration, there
> exists no practical scenario where users necessitate the restoration of
> nr metadata. Platforms implementing THP_SWAP might invoke this function
> with nr values exceeding 1, subsequent to do_swap_page() successfully
> mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr()
> functions remain empty.
>
> Cc: Khalid Aziz <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: Andreas Larsson <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/pgtable.h | 26 ++++++++++++++++++++------
> mm/memory.c | 3 ++-
> 2 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 18019f037bae..463e84c3de26 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1084,6 +1084,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
> })
>
> #ifndef __HAVE_ARCH_DO_SWAP_PAGE
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + pte_t pte, pte_t oldpte,
> + int nr)
> +{
> +
> +}
> +#else
> /*
> * Some architectures support metadata associated with a page. When a
> * page is being swapped out, this metadata must be saved so it can be
> @@ -1092,12 +1101,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
> * page as metadata for the page. arch_do_swap_page() can restore this
> * metadata when a page is swapped back in.
> */
> -static inline void arch_do_swap_page(struct mm_struct *mm,
> - struct vm_area_struct *vma,
> - unsigned long addr,
> - pte_t pte, pte_t oldpte)

This hook seems very similar to arch_swap_restore(); I wonder if it makes
sense to merge them. Out of scope for this patch series though.


> -{
> -
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + pte_t pte, pte_t oldpte,
> + int nr)
> +{
> + for (int i = 0; i < nr; i++) {
> + arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
> + pte_advance_pfn(pte, i),
> + pte_advance_pfn(oldpte, i));

It seems a bit odd to create a batched version of this, but not allow arches to
take advantage. Although I guess your point is that only SPARC implements it and
on that platform nr will always be 1? So no point right now? So this is just a
convenience for do_swap_page()? Makes sense.
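
If an arch ever did want to batch this, I guess the hook could get its own
override guard so the loop is only the generic fallback; a purely hypothetical
sketch (__HAVE_ARCH_DO_SWAP_PAGE_NR does not exist today):

#ifndef __HAVE_ARCH_DO_SWAP_PAGE_NR
static inline void arch_do_swap_page_nr(struct mm_struct *mm,
                                        struct vm_area_struct *vma,
                                        unsigned long addr,
                                        pte_t pte, pte_t oldpte,
                                        int nr)
{
        /* generic fallback: restore metadata one page at a time */
        for (int i = 0; i < nr; i++)
                arch_do_swap_page(mm, vma, addr + i * PAGE_SIZE,
                                  pte_advance_pfn(pte, i),
                                  pte_advance_pfn(oldpte, i));
}
#endif

But given only sparc implements arch_do_swap_page() and nr is always 1 there,
that's not needed now.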

Reviewed-by: Ryan Roberts <[email protected]>

> + }
> }
> #endif
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f033eb3528ba..74cdefd58f5f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4266,7 +4266,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> + pte, vmf->orig_pte, 1);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {


2024-05-03 10:50:19

by Ryan Roberts

Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On 03/05/2024 01:50, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>

With the suggested changes below:

Reviewed-by: Ryan Roberts <[email protected]>

> ---
> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 48 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 22e7c33cc747..940fdbe69fa1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long page_idx = 0;
> + unsigned long address = vmf->address;
> + pte_t *ptep;

nit: Personally I'd prefer all these to get initialised just before the "if
(folio_test_large()..." block below. That way it is clear they are fresh (in
case any logic between here and there makes an adjustment) and it's clear that
they are only to be used after that block (the compiler will warn if using an
uninitialized value).
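
Something like this, declared right above that block (sketch only):

        int nr_pages = 1;
        unsigned long page_idx = 0;
        unsigned long address = vmf->address;
        pte_t *ptep = vmf->pte;

        if (folio_test_large(folio) && folio_test_swapcache(folio)) {
                ...
        }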

>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out_nomap;
> }
>
> + ptep = vmf->pte;
> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> + int nr = folio_nr_pages(folio);
> + unsigned long idx = folio_page_idx(folio, page);
> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> + pte_t *folio_ptep;
> + pte_t folio_pte;
> +
> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> + goto check_folio;
> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> + goto check_folio;
> +
> + folio_ptep = vmf->pte - idx;
> + folio_pte = ptep_get(folio_ptep);
> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> + goto check_folio;
> +
> + page_idx = idx;
> + address = folio_start;
> + ptep = folio_ptep;
> + nr_pages = nr;
> + entry = folio->swap;
> + page = &folio->page;
> + }
> +
> +check_folio:

Is this still the correct label name, given the checks are now above the new
block? Perhaps "one_page" or something like that?

> +
> /*
> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> * must never point at an anonymous page in the swapcache that is
> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free_nr(entry, 1);
> + swap_free_nr(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> + folio_nr_pages(folio) == nr_pages))) {

I think in practice there is no change here? If nr_pages > 1 then the folio is
in the swapcache, so there is an extra ref on it? I agree with the change for
robustness' sake. Just checking my understanding.

> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> pte = pte_mkuffd_wp(pte);
> - vmf->orig_pte = pte;
> + vmf->orig_pte = pte_advance_pfn(pte, page_idx);
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> - pte, vmf->orig_pte, 1);
> + set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
> + arch_do_swap_page_nr(vma->vm_mm, vma, address,
> + pte, pte, nr_pages);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);


2024-05-03 20:27:04

by Chris Li

Subject: Re: [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()

Hi Barry,

Looks good. Looking forward to the follow-up change that lets the batched free
skip the swap slot cache. Since all the entries are from the same swap device,
it does not need the sort function. Currently it goes through a lot of locking
and unlocking inside the loop.

Acked-by: Chris Li <[email protected]>

Chris

On Thu, May 2, 2024 at 5:50 PM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> Furthermore, this new function, swap_free_nr(), is designed to efficiently
> handle various scenarios for releasing a specified number, nr, of swap
> entries.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/swap.h | 5 +++++
> mm/swapfile.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..d1d35e92d7e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> {
> }
>
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> {
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6ca215fb92f..ec12f2b9d229 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
> __swap_entry_free(p, entry);
> }
>
> +static void cluster_swap_free_nr(struct swap_info_struct *sis,
> + unsigned long offset, int nr_pages)
> +{
> + struct swap_cluster_info *ci;
> + DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
> + int i, nr;
> +
> + ci = lock_cluster_or_swap_info(sis, offset);
> + while (nr_pages) {
> + nr = min(BITS_PER_LONG, nr_pages);
> + for (i = 0; i < nr; i++) {
> + if (!__swap_entry_free_locked(sis, offset + i, 1))
> + bitmap_set(to_free, i, 1);
> + }
> + if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> + unlock_cluster_or_swap_info(sis, ci);
> + for_each_set_bit(i, to_free, BITS_PER_LONG)
> + free_swap_slot(swp_entry(sis->type, offset + i));
> + if (nr == nr_pages)
> + return;
> + bitmap_clear(to_free, 0, BITS_PER_LONG);
> + ci = lock_cluster_or_swap_info(sis, offset);
> + }
> + offset += nr;
> + nr_pages -= nr;
> + }
> + unlock_cluster_or_swap_info(sis, ci);
> +}
> +
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> + int nr;
> + struct swap_info_struct *sis;
> + unsigned long offset = swp_offset(entry);
> +
> + sis = _swap_info_get(entry);
> + if (!sis)
> + return;
> +
> + while (nr_pages) {
> + nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> + cluster_swap_free_nr(sis, offset, nr);
> + offset += nr;
> + nr_pages -= nr;
> + }
> +}
> +
> /*
> * Called after dropping swapcache to decrease refcnt to swap entries.
> */
> --
> 2.34.1
>

2024-05-03 20:44:18

by Chris Li

Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On Fri, May 3, 2024 at 2:31 AM Ryan Roberts <[email protected]> wrote:
>
> On 03/05/2024 01:50, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > To streamline maintenance efforts, we propose discontinuing the use of
> > swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
> > to 1. This adjustment offers the advantage of enabling batch processing
> > within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
> > a bitmap consisting of only one long, resulting in overhead that can be
> > ignored for cases where nr equals 1.
> >
> > Suggested-by: "Huang, Ying" <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > Cc: "Rafael J. Wysocki" <[email protected]>
> > Cc: Pavel Machek <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > ---
> > include/linux/swap.h | 5 -----
> > kernel/power/swap.c | 7 +++----
> > mm/memory.c | 2 +-
> > mm/rmap.c | 4 ++--
> > mm/shmem.c | 4 ++--
> > mm/swapfile.c | 19 +++++--------------
> > 6 files changed, 13 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index d1d35e92d7e9..f03cb446124e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > extern int swapcache_prepare(swp_entry_t);
> > -extern void swap_free(swp_entry_t);
>
> I wonder if it would be cleaner to:
>
> #define swap_free(entry) swap_free_nr((entry), 1)
>
> To save all the churn for the callsites that just want to pass a single entry?
>
Either way works; it will produce the same machine code. I have a
slight inclination to just drop the swap_free(entry) API so that it
discourages callers from doing a for loop over swap_free().
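
The kernel/power/swap.c hunk above is a good example of exactly that pattern
going away:

        for (offset = ext->start; offset <= ext->end; offset++)
                swap_free(swp_entry(swap, offset));

becomes a single call:

        swap_free_nr(swp_entry(swap, ext->start),
                     ext->end - ext->start + 1);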

Acked-by: Chris Li <[email protected]>

Chris

2024-05-03 20:51:33

by Chris Li

Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On Thu, May 2, 2024 at 5:51 PM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()
> will directly invoke it with delta = 1.

BTW, this patch has a conflict with the latest mm-unstable. You might
need to refresh it.

I do not see the delta = -1 usage case here. Maybe merge this patch
with the follow-up patch that uses delta = -1?
Does delta only make sense as 1 and -1?

This patch doesn't seem strictly necessary to me.

Chris

>
> Suggested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/internal.h | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> }
>
> /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + * forward or backward by delta
> * @pte: The initial pte state; is_swap_pte(pte) must be true and
> * non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + * is positive; backward if delta is negative
> *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
> * swap type, and any swp pte bits. The resulting pte is returned.
> */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> {
> swp_entry_t entry = pte_to_swp_entry(pte);
> pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> - (swp_offset(entry) + 1)));
> + (swp_offset(entry) + delta)));
>
> if (pte_swp_soft_dirty(pte))
> new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> return new;
> }
>
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + * non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> + return pte_move_swp_offset(pte, 1);
> +}
> +
> /**
> * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> * @start_ptep: Page table pointer for the first entry.
> --
> 2.34.1
>
>

2024-05-03 23:07:56

by Barry Song

Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On Sat, May 4, 2024 at 4:51 AM Chris Li <[email protected]> wrote:
>
> On Thu, May 2, 2024 at 5:51 PM Barry Song <[email protected]> wrote:
> >
> > From: Barry Song <[email protected]>
> >
> > There could arise a necessity to obtain the first pte_t from a swap
> > pte_t located in the middle. For instance, this may occur within the
> > context of do_swap_page(), where a page fault can potentially occur in
> > any PTE of a large folio. To address this, the following patch introduces
> > pte_move_swp_offset(), a function capable of bidirectional movement by
> > a specified delta argument. Consequently, pte_increment_swp_offset()
> > will directly invoke it with delta = 1.
>
> BTW, this patch has a conflict with the latest mm-unstable. You might
> need to refresh it.
>
> I do not see the delta = -1 usage case here. Maybe merge this patch
> with the follow-up patch that uses delta = -1?
> Does delta only make sense as 1 and -1?

Nope. delta can be any value from -1 to -(nr_pages - 1). This is used to
get the first pte from vmf->pte in patch 6.
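
For reference, the negative-delta use in patch 6/6 is:

        folio_ptep = vmf->pte - idx;
        folio_pte = ptep_get(folio_ptep);
        if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
            swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
                goto check_folio;

where idx is the faulting page's index within the folio, so the offset can be
moved back by anything from 1 to nr_pages - 1.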

It might be better to keep this as a separate patch, as it is introducing a
new helper; do_swap_page() might not be the only user in the future?

>
> This patch doesn't seem strictly necessary to me.
>
> Chris
>
> >
> > Suggested-by: "Huang, Ying" <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/internal.h | 25 +++++++++++++++++++++----
> > 1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index c5552d35d995..cfe4aed66a5c 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> > }
> >
> > /**
> > - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> > + * forward or backward by delta
> > * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > * non_swap_entry() must be false.
> > + * @delta: The direction and the offset we are moving; forward if delta
> > + * is positive; backward if delta is negative
> > *
> > - * Increments the swap offset, while maintaining all other fields, including
> > + * Moves the swap offset, while maintaining all other fields, including
> > * swap type, and any swp pte bits. The resulting pte is returned.
> > */
> > -static inline pte_t pte_next_swp_offset(pte_t pte)
> > +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> > {
> > swp_entry_t entry = pte_to_swp_entry(pte);
> > pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> > - (swp_offset(entry) + 1)));
> > + (swp_offset(entry) + delta)));
> >
> > if (pte_swp_soft_dirty(pte))
> > new = pte_swp_mksoft_dirty(new);
> > @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> > return new;
> > }
> >
> > +
> > +/**
> > + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > + * non_swap_entry() must be false.
> > + *
> > + * Increments the swap offset, while maintaining all other fields, including
> > + * swap type, and any swp pte bits. The resulting pte is returned.
> > + */
> > +static inline pte_t pte_next_swp_offset(pte_t pte)
> > +{
> > + return pte_move_swp_offset(pte, 1);
> > +}
> > +
> > /**
> > * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> > * @start_ptep: Page table pointer for the first entry.
> > --
> > 2.34.1

Thanks
Barry

2024-05-03 23:24:20

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
>
> On 03/05/2024 01:50, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
>
> With the suggested changes below:
>
> Reviewed-by: Ryan Roberts <[email protected]>
>
> > ---
> > mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 48 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 22e7c33cc747..940fdbe69fa1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long page_idx = 0;
> > + unsigned long address = vmf->address;
> > + pte_t *ptep;
>
> nit: Personally I'd prefer all these to get initialised just before the "if
> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> any logic between here and there makes an adjustment) and its clear that they
> are only to be used after that block (the compiler will warn if using an
> uninitialized value).

right. I agree this will make the code more readable.

>
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > goto out_nomap;
> > }
> >
> > + ptep = vmf->pte;
> > + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > + int nr = folio_nr_pages(folio);
> > + unsigned long idx = folio_page_idx(folio, page);
> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > + pte_t *folio_ptep;
> > + pte_t folio_pte;
> > +
> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > + goto check_folio;
> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > + goto check_folio;
> > +
> > + folio_ptep = vmf->pte - idx;
> > + folio_pte = ptep_get(folio_ptep);
> > + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> > + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> > + goto check_folio;
> > +
> > + page_idx = idx;
> > + address = folio_start;
> > + ptep = folio_ptep;
> > + nr_pages = nr;
> > + entry = folio->swap;
> > + page = &folio->page;
> > + }
> > +
> > +check_folio:
>
> Is this still the correct label name, given the checks are now above the new
> block? Perhaps "one_page" or something like that?

Not quite sure about this, as the code after a "one_page" label could still
handle multiple pages. On the other hand, it seems we really are checking
the folio right after "check_folio" :-)


	BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
	BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));

	/*
	 * Check under PT lock (to protect against concurrent fork() sharing
	 * the swap entry concurrently) for certainly exclusive pages.
	 */
	if (!folio_test_ksm(folio)) {


>
> > +
> > /*
> > * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> > * must never point at an anonymous page in the swapcache that is
> > @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free_nr(entry, 1);
> > + swap_free_nr(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> > + folio_nr_pages(folio) == nr_pages))) {
>
> I think in practice there is no change here? If nr_pages > 1 then the folio is
> in the swapcache, so there is an extra ref on it? I agree with the change for
> robustness sake. Just checking my understanding.

This is the code showing we reuse (mkwrite) a folio only when either
1. we meet a small folio and we are the only one hitting that small folio
2. we meet a large folio and we are the only one hitting that large folio

Any corner case besides the above two seems difficult. For example,

if we hit a large folio in the swapcache but can't entirely map it
(nr_pages == 1) due to a partial unmap, we would have folio_ref_count(folio)
== nr_pages == 1. In that case, lacking the folio_nr_pages(folio) == nr_pages
check might lead to mkwrite() on a single pte within a partially unmapped
large folio. I'm not quite sure this is wrong, but it seems fragile and hard
to reason about.

>
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > pte = pte_mkuffd_wp(pte);
> > - vmf->orig_pte = pte;
> > + vmf->orig_pte = pte_advance_pfn(pte, page_idx);
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> > + folio_add_new_anon_rmap(folio, vma, address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> > rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> > - pte, vmf->orig_pte, 1);
> > + set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
> > + arch_do_swap_page_nr(vma->vm_mm, vma, address,
> > + pte, pte, nr_pages);
> >
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>

Thanks
Barry

2024-05-03 23:40:37

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <[email protected]> wrote:
>
> On 03/05/2024 01:50, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > There could arise a necessity to obtain the first pte_t from a swap
> > pte_t located in the middle. For instance, this may occur within the
> > context of do_swap_page(), where a page fault can potentially occur in
> > any PTE of a large folio. To address this, the following patch introduces
> > pte_move_swp_offset(), a function capable of bidirectional movement by
> > a specified delta argument. Consequently, pte_increment_swp_offset()
>
> You mean pte_next_swp_offset()?

yes.

>
> > will directly invoke it with delta = 1.
> >
> > Suggested-by: "Huang, Ying" <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/internal.h | 25 +++++++++++++++++++++----
> > 1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index c5552d35d995..cfe4aed66a5c 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> > }
> >
> > /**
> > - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> > + * forward or backward by delta
> > * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > * non_swap_entry() must be false.
> > + * @delta: The direction and the offset we are moving; forward if delta
> > + * is positive; backward if delta is negative
> > *
> > - * Increments the swap offset, while maintaining all other fields, including
> > + * Moves the swap offset, while maintaining all other fields, including
> > * swap type, and any swp pte bits. The resulting pte is returned.
> > */
> > -static inline pte_t pte_next_swp_offset(pte_t pte)
> > +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>
> We have equivalent functions for pfn:
>
> pte_next_pfn()
> pte_advance_pfn()
>
> Although the latter takes an unsigned long and only moves forward currently. I
> wonder if it makes sense to have their naming and semantics match? i.e. change
> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>
> I guess we don't have a need for that and it adds more churn.

We might have a need in the below case.
A forks B, then A and B share large folios. B unmaps/exits, and the large
folios of process A become single-mapped.
Right now, when writing to A's folios, we CoW A's large folios into many
small folios. I believe we could reuse the entire large folio instead of
taking nr_pages CoW page faults.
In this case, we might want to get the first PTE from vmf->pte.

Another case might be:
A forks B, and we write to either A or B; we might CoW an entire large
folio instead of CoWing nr_pages small folios.

Case 1 seems more useful. I might have a go after some days; then we might
see pte_move_pfn().

>
> Anyway:
>
> Reviewed-by: Ryan Roberts <[email protected]>

thanks!

>
>
> > {
> > swp_entry_t entry = pte_to_swp_entry(pte);
> > pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> > - (swp_offset(entry) + 1)));
> > + (swp_offset(entry) + delta)));
> >
> > if (pte_swp_soft_dirty(pte))
> > new = pte_swp_mksoft_dirty(new);
> > @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> > return new;
> > }
> >
> > +
> > +/**
> > + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > + * non_swap_entry() must be false.
> > + *
> > + * Increments the swap offset, while maintaining all other fields, including
> > + * swap type, and any swp pte bits. The resulting pte is returned.
> > + */
> > +static inline pte_t pte_next_swp_offset(pte_t pte)
> > +{
> > + return pte_move_swp_offset(pte, 1);
> > +}
> > +
> > /**
> > * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> > * @start_ptep: Page table pointer for the first entry.
>

Barry

2024-05-04 04:04:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On Fri, May 03, 2024 at 01:37:06PM -0700, Chris Li wrote:
> Either way works. It will produce the same machine code. I have a
> slight inclination to just drop swap_free(entry) API so that it
> discourages the caller to do a for loop over swap_free().

Then just add the number of entries parameter to swap_free and do away
with the separate swap_free_nr.


2024-05-04 04:27:32

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On Sat, May 4, 2024 at 12:03 PM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, May 03, 2024 at 01:37:06PM -0700, Chris Li wrote:
> > Either way works. It will produce the same machine code. I have a
> > slight inclination to just drop swap_free(entry) API so that it
> > discourages the caller to do a for loop over swap_free().
>
> Then just ad the number of entries parameter to swap_free and do away
> with the separate swap_free_nr.

swap_free_nr() isn't separate; after this patch, it is the only one left,
and there won't be a swap_free() any more. It seems you want to directly
"rename" it to swap_free()?

2024-05-04 04:47:28

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On Sat, May 4, 2024 at 12:29 PM Christoph Hellwig <[email protected]> wrote:
>
> On Sat, May 04, 2024 at 12:27:11PM +0800, Barry Song wrote:
> > swap_free_nr() isn't separate, after this patch, it is the only one left.
> > there won't be swap_free() any more. it seems you want to directly
> > "rename" it to swap_free()?
>
> Yes. Avoid the pointless suffix if it is the only variant.

Well, it seems you are right. We usually use a suffix to differentiate
two or more cases, but now that there is only one case left, the suffix
no longer seems useful.

One more problem is that free_swap_and_cache_nr() and
swap_free_nr() are not quite aligned:

extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);

static inline void free_swap_and_cache(swp_entry_t entry)
{
	free_swap_and_cache_nr(entry, 1);
}

The problem space is the same, so I feel we could also drop
free_swap_and_cache_nr() and simply add the nr parameter there as well?
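
Just to illustrate that direction (a sketch only, not something in this
series), both entry points would then take an explicit count:

	/* callers pass nr = 1 for a single entry */
	extern void swap_free(swp_entry_t entry, int nr);
	extern void free_swap_and_cache(swp_entry_t entry, int nr);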

2024-05-04 05:46:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On Sat, May 04, 2024 at 12:27:11PM +0800, Barry Song wrote:
> swap_free_nr() isn't separate, after this patch, it is the only one left.
> there won't be swap_free() any more. it seems you want to directly
> "rename" it to swap_free()?

Yes. Avoid the pointless suffix if it is the only variant.

2024-05-06 08:09:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On 04.05.24 01:40, Barry Song wrote:
> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 03/05/2024 01:50, Barry Song wrote:
>>> From: Barry Song <[email protected]>
>>>
>>> There could arise a necessity to obtain the first pte_t from a swap
>>> pte_t located in the middle. For instance, this may occur within the
>>> context of do_swap_page(), where a page fault can potentially occur in
>>> any PTE of a large folio. To address this, the following patch introduces
>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>
>> You mean pte_next_swp_offset()?
>
> yes.
>
>>
>>> will directly invoke it with delta = 1.
>>>
>>> Suggested-by: "Huang, Ying" <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> ---
>>> mm/internal.h | 25 +++++++++++++++++++++----
>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index c5552d35d995..cfe4aed66a5c 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>> }
>>>
>>> /**
>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>> + * forward or backward by delta
>>> * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>> * non_swap_entry() must be false.
>>> + * @delta: The direction and the offset we are moving; forward if delta
>>> + * is positive; backward if delta is negative
>>> *
>>> - * Increments the swap offset, while maintaining all other fields, including
>>> + * Moves the swap offset, while maintaining all other fields, including
>>> * swap type, and any swp pte bits. The resulting pte is returned.
>>> */
>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>
>> We have equivalent functions for pfn:
>>
>> pte_next_pfn()
>> pte_advance_pfn()
>>
>> Although the latter takes an unsigned long and only moves forward currently. I
>> wonder if it makes sense to have their naming and semantics match? i.e. change
>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>
>> I guess we don't have a need for that and it adds more churn.
>
> we might have a need in the below case.
> A forks B, then A and B share large folios. B unmap/exit, then large
> folios of process
> A become single-mapped.
> Right now, while writing A's folios, we are CoWing A's large folios
> into many small
> folios. I believe we can reuse the entire large folios instead of doing nr_pages
> CoW and page faults.
> In this case, we might want to get the first PTE from vmf->pte.

Once we have COW reuse for large folios in place (I think you know that
I am working on that), it might make sense to "COW-reuse around",
meaning we check whether some neighboring PTEs map the same large folio
and map them writable as well. But whether it's really worth it, given the
increased page fault latency, is to be decided separately.


>
> Another case, might be
> A forks B, and we write either A or B, we might CoW an entire large
> folios instead
> CoWing nr_pages small folios.
>
> case 1 seems more useful, I might have a go after some days. then we might
> see pte_move_pfn().
> see pte_move_pfn().
pte_move_pfn() does sound odd to me. It might not be required to
implement the optimization described above (it's easier to simply read
another PTE, check whether it maps the same large folio, and batch from
there).

--
Cheers,

David / dhildenb


2024-05-06 08:15:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio

On 03.05.24 02:50, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> The function should_try_to_free_swap() operates under the assumption
> that swap-in always occurs at the normal page granularity,
> i.e., folio_nr_pages() = 1. However, in reality, for large folios,
> add_to_swap_cache() will invoke folio_ref_add(folio, nr). To accommodate
> large folio swap-in, this patch eliminates this assumption.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> Acked-by: Chris Li <[email protected]>
> Reviewed-by: Ryan Roberts <[email protected]>
> Reviewed-by: "Huang, Ying" <[email protected]>
> ---
> mm/memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 74cdefd58f5f..22e7c33cc747 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3877,7 +3877,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
> * reference only in case it's likely that we'll be the exlusive user.
> */
> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> - folio_ref_count(folio) == 2;
> + folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> }
>
> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)


Reviewed-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2024-05-06 08:30:46

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <[email protected]> wrote:
>
> On 04.05.24 01:40, Barry Song wrote:
> > On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 03/05/2024 01:50, Barry Song wrote:
> >>> From: Barry Song <[email protected]>
> >>>
> >>> There could arise a necessity to obtain the first pte_t from a swap
> >>> pte_t located in the middle. For instance, this may occur within the
> >>> context of do_swap_page(), where a page fault can potentially occur in
> >>> any PTE of a large folio. To address this, the following patch introduces
> >>> pte_move_swp_offset(), a function capable of bidirectional movement by
> >>> a specified delta argument. Consequently, pte_increment_swp_offset()
> >>
> >> You mean pte_next_swp_offset()?
> >
> > yes.
> >
> >>
> >>> will directly invoke it with delta = 1.
> >>>
> >>> Suggested-by: "Huang, Ying" <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> ---
> >>> mm/internal.h | 25 +++++++++++++++++++++----
> >>> 1 file changed, 21 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/internal.h b/mm/internal.h
> >>> index c5552d35d995..cfe4aed66a5c 100644
> >>> --- a/mm/internal.h
> >>> +++ b/mm/internal.h
> >>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>> }
> >>>
> >>> /**
> >>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> >>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> >>> + * forward or backward by delta
> >>> * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >>> * non_swap_entry() must be false.
> >>> + * @delta: The direction and the offset we are moving; forward if delta
> >>> + * is positive; backward if delta is negative
> >>> *
> >>> - * Increments the swap offset, while maintaining all other fields, including
> >>> + * Moves the swap offset, while maintaining all other fields, including
> >>> * swap type, and any swp pte bits. The resulting pte is returned.
> >>> */
> >>> -static inline pte_t pte_next_swp_offset(pte_t pte)
> >>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >>
> >> We have equivalent functions for pfn:
> >>
> >> pte_next_pfn()
> >> pte_advance_pfn()
> >>
> >> Although the latter takes an unsigned long and only moves forward currently. I
> >> wonder if it makes sense to have their naming and semantics match? i.e change
> >> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
> >>
> >> I guess we don't have a need for that and it adds more churn.
> >
> > we might have a need in the below case.
> > A forks B, then A and B share large folios. B unmap/exit, then large
> > folios of process
> > A become single-mapped.
> > Right now, while writing A's folios, we are CoWing A's large folios
> > into many small
> > folios. I believe we can reuse the entire large folios instead of doing nr_pages
> > CoW and page faults.
> > In this case, we might want to get the first PTE from vmf->pte.
>
> Once we have COW reuse for large folios in place (I think you know that
> I am working on that), it might make sense to "COW-reuse around",

TBH, I didn't know you were working on that. Please Cc me next time :-)

> meaning we look if some neighboring PTEs map the same large folio and
> map them writable as well. But if it's really worth it, increasing page
> fault latency, is to be decided separately.

On the other hand, we eliminate the latency of the remaining nr_pages - 1
page faults. Perhaps we can find a more cost-effective way to indicate that
a large folio is probably singly mapped, and only attempt the "multi-PTE"
reuse in the page fault while that condition holds, so we don't always pay
the extra latency?
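
Something along these lines, purely as a sketch (assuming a heuristic such
as folio_likely_mapped_shared() would be acceptable for this):

	/* only pay the extra PTE scanning when the folio is likely exclusive */
	bool try_batched_reuse = !folio_test_ksm(folio) &&
				 !folio_likely_mapped_shared(folio);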

>
>
> >
> > Another case, might be
> > A forks B, and we write either A or B, we might CoW an entire large
> > folios instead
> > CoWing nr_pages small folios.
> >
> > case 1 seems more useful, I might have a go after some days. then we might
> > see pte_move_pfn().
> pte_move_pfn() does sound odd to me. It might not be required to
> implement the optimization described above. (it's easier to simply read
> another PTE, check if it maps the same large folio, and to batch from there)
>

It appears your proposal suggests partial reuse: if we have a large folio
covering 16 PTEs, you might consider reusing only, say, 4 of them by
examining the PTEs "around" the fault, not necessarily all 16. Please
correct me if my understanding is wrong.

Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
pass that first PTE to folio_pte_batch() to ensure a consistent nr_pages,
thus enabling reuse of the entire folio at once.
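
Just to show what I had in mind, a purely hypothetical helper (it does not
exist today, and this assumes the architecture provides pte_pgprot()):

	/* move a present pte's pfn forward or backward by delta */
	static inline pte_t pte_move_pfn(pte_t pte, long delta)
	{
		return pfn_pte(pte_pfn(pte) + delta, pte_pgprot(pte));
	}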

> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

2024-05-06 08:32:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On 06.05.24 10:20, Barry Song wrote:
> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 04.05.24 01:40, Barry Song wrote:
>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>> From: Barry Song <[email protected]>
>>>>>
>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>
>>>> You mean pte_next_swp_offset()?
>>>
>>> yes.
>>>
>>>>
>>>>> will directly invoke it with delta = 1.
>>>>>
>>>>> Suggested-by: "Huang, Ying" <[email protected]>
>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>> ---
>>>>> mm/internal.h | 25 +++++++++++++++++++++----
>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>> }
>>>>>
>>>>> /**
>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>> + * forward or backward by delta
>>>>> * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>> * non_swap_entry() must be false.
>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>> + * is positive; backward if delta is negative
>>>>> *
>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>> * swap type, and any swp pte bits. The resulting pte is returned.
>>>>> */
>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>
>>>> We have equivalent functions for pfn:
>>>>
>>>> pte_next_pfn()
>>>> pte_advance_pfn()
>>>>
>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>
>>>> I guess we don't have a need for that and it adds more churn.
>>>
>>> we might have a need in the below case.
>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>> folios of process
>>> A become single-mapped.
>>> Right now, while writing A's folios, we are CoWing A's large folios
>>> into many small
>>> folios. I believe we can reuse the entire large folios instead of doing nr_pages
>>> CoW and page faults.
>>> In this case, we might want to get the first PTE from vmf->pte.
>>
>> Once we have COW reuse for large folios in place (I think you know that
>> I am working on that), it might make sense to "COW-reuse around",
>
> TBH, I don't know if you are working on that. please Cc me next time :-)

I could have sworn I mentioned it to you already :)

See

https://lore.kernel.org/linux-mm/[email protected]/T/

I'll follow up on that soonish (now that batching is upstream and the
large mapcount is on its way upstream).

>
>> meaning we look if some neighboring PTEs map the same large folio and
>> map them writable as well. But if it's really worth it, increasing page
>> fault latency, is to be decided separately.
>
> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
> Perhaps we can discover a more cost-effective method to signify that a large
> folio is probably singly mapped?

Yes, precisely what I am up to!

> and only call "multi-PTEs" reuse while that
> condition is true in PF and avoid increasing latency always?

I'm thinking along those lines:

If we detect that it's exclusive, we can certainly map the current
PTE writable. Then, we can decide how much (and whether) we want to
fault-around writable as an optimization.

For smallish large folios, it might make sense to try faulting around
most of the folio.

For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might
not want to fault around the whole thing -- especially if there is
little benefit to be had from contig-pte bits.
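
As a purely hypothetical sketch of such a policy (the constant name is made
up for illustration):

	/* cap how many ptes we try to map around on a write fault */
	#define COW_REUSE_AROUND_MAX_PTES	16

	unsigned int nr = min_t(unsigned int, folio_nr_pages(folio),
				COW_REUSE_AROUND_MAX_PTES);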

>
>>
>>
>>>
>>> Another case, might be
>>> A forks B, and we write either A or B, we might CoW an entire large
>>> folios instead
>>> CoWing nr_pages small folios.
>>>
>>> case 1 seems more useful, I might have a go after some days. then we might
>>> see pte_move_pfn().
>> pte_move_pfn() does sound odd to me. It might not be required to
>> implement the optimization described above. (it's easier to simply read
>> another PTE, check if it maps the same large folio, and to batch from there)
>>
>
> It appears that your proposal suggests potential reusability as follows: if we
> have a large folio containing 16 PTEs, you might consider reusing only 4 by
> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
> if my understanding is wrong.
>
> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
> in nr_pages, thus enabling complete reuse of the whole folio.

Simply doing a vm_normal_folio(pte - X) == folio check and then trying to
batch from there might be easier and cleaner.
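
A rough sketch, just to make that concrete (assuming the PT lock is held
and idx is the faulting page's index within the large folio):

	pte_t *first_ptep = vmf->pte - idx;
	pte_t first_pte = ptep_get(first_ptep);

	if (pte_present(first_pte) &&
	    vm_normal_folio(vma, vmf->address - idx * PAGE_SIZE,
			    first_pte) == folio) {
		/*
		 * batch over the folio's ptes starting at first_ptep,
		 * e.g. via folio_pte_batch(), and reuse the whole range
		 */
	}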

--
Cheers,

David / dhildenb


2024-05-06 12:06:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On 03.05.24 02:50, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 48 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 22e7c33cc747..940fdbe69fa1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long page_idx = 0;
> + unsigned long address = vmf->address;
> + pte_t *ptep;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out_nomap;
> }
>
> + ptep = vmf->pte;
> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> + int nr = folio_nr_pages(folio);
> + unsigned long idx = folio_page_idx(folio, page);
> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> + pte_t *folio_ptep;
> + pte_t folio_pte;
> +
> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> + goto check_folio;
> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> + goto check_folio;
> +
> + folio_ptep = vmf->pte - idx;
> + folio_pte = ptep_get(folio_ptep);
> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> + goto check_folio;
> +
> + page_idx = idx;
> + address = folio_start;
> + ptep = folio_ptep;
> + nr_pages = nr;
> + entry = folio->swap;
> + page = &folio->page;
> + }
> +
> +check_folio:
> +
> /*
> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> * must never point at an anonymous page in the swapcache that is
> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free_nr(entry, 1);
> + swap_free_nr(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> + folio_nr_pages(folio) == nr_pages))) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;

I fail to convince myself that this change is correct, and if it is
correct, it's confusing (I think there is a dependency on
folio_free_swap() having been called and succeeding, such that we don't
have a folio that is in the swapcache at this point).

Why can't we move the folio_ref_add() after this check and just leave
the check as it is?

"folio_ref_count(folio) == 1" is as clear as it gets: we hold the single
reference, so we can do with this thing whatever we want: it's certainly
exclusive. No swapcache, no other people mapping it.


--
Cheers,

David / dhildenb


2024-05-06 12:07:43

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On 04.05.24 01:23, Barry Song wrote:
> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 03/05/2024 01:50, Barry Song wrote:
>>> From: Chuanhua Han <[email protected]>
>>>
>>> When a large folio is found in the swapcache, the current implementation
>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>> page faults. This patch opts to map the entire large folio at once to
>>> minimize page faults. Additionally, redundant checks and early exits
>>> for ARM64 MTE restoring are removed.
>>>
>>> Signed-off-by: Chuanhua Han <[email protected]>
>>> Co-developed-by: Barry Song <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>
>> With the suggested changes below:
>>
>> Reviewed-by: Ryan Roberts <[email protected]>
>>
>>> ---
>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>> 1 file changed, 48 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 22e7c33cc747..940fdbe69fa1 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> pte_t pte;
>>> vm_fault_t ret = 0;
>>> void *shadow = NULL;
>>> + int nr_pages = 1;
>>> + unsigned long page_idx = 0;
>>> + unsigned long address = vmf->address;
>>> + pte_t *ptep;
>>
>> nit: Personally I'd prefer all these to get initialised just before the "if
>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>> any logic between here and there makes an adjustment) and its clear that they
>> are only to be used after that block (the compiler will warn if using an
>> uninitialized value).
>
> right. I agree this will make the code more readable.
>
>>
>>>
>>> if (!pte_unmap_same(vmf))
>>> goto out;
>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> goto out_nomap;
>>> }
>>>
>>> + ptep = vmf->pte;
>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>> + int nr = folio_nr_pages(folio);
>>> + unsigned long idx = folio_page_idx(folio, page);
>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>> + pte_t *folio_ptep;
>>> + pte_t folio_pte;
>>> +
>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>> + goto check_folio;
>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>> + goto check_folio;
>>> +
>>> + folio_ptep = vmf->pte - idx;
>>> + folio_pte = ptep_get(folio_ptep);
>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>> + goto check_folio;
>>> +
>>> + page_idx = idx;
>>> + address = folio_start;
>>> + ptep = folio_ptep;
>>> + nr_pages = nr;
>>> + entry = folio->swap;
>>> + page = &folio->page;
>>> + }
>>> +
>>> +check_folio:
>>
>> Is this still the correct label name, given the checks are now above the new
>> block? Perhaps "one_page" or something like that?
>
> not quite sure about this, as the code after one_page can be multiple_pages.
> On the other hand, it seems we are really checking folio after "check_folio"
> :-)
>
>
> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>
> /*
> * Check under PT lock (to protect against concurrent fork() sharing
> * the swap entry concurrently) for certainly exclusive pages.
> */
> if (!folio_test_ksm(folio)) {
>
>
>>
>>> +
>>> /*
>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>> * must never point at an anonymous page in the swapcache that is
>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * We're already holding a reference on the page but haven't mapped it
>>> * yet.
>>> */
>>> - swap_free_nr(entry, 1);
>>> + swap_free_nr(entry, nr_pages);
>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>> folio_free_swap(folio);
>>>
>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> + folio_ref_add(folio, nr_pages - 1);
>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>> pte = mk_pte(page, vma->vm_page_prot);
>>>
>>> /*
>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * exclusivity.
>>> */
>>> if (!folio_test_ksm(folio) &&
>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>> + folio_nr_pages(folio) == nr_pages))) {
>>
>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>> in the swapcache, so there is an extra ref on it? I agree with the change for
>> robustness sake. Just checking my understanding.
>
> This is the code showing we are reusing/(mkwrite) a folio either
> 1. we meet a small folio and we are the only one hitting the small folio
> 2. we meet a large folio and we are the only one hitting the large folio
>
> any corner cases besides the above two seems difficult. for example,
>
> while we hit a large folio in swapcache but if we can't entirely map it
> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> == nr_pages == 1

No, there would be other references from the swapcache and
folio_ref_count(folio) > 1. See my other reply.

--
Cheers,

David / dhildenb


2024-05-06 12:47:56

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Tue, May 7, 2024 at 12:05 AM David Hildenbrand <[email protected]> wrote:
>
> On 03.05.24 02:50, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 48 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 22e7c33cc747..940fdbe69fa1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long page_idx = 0;
> > + unsigned long address = vmf->address;
> > + pte_t *ptep;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > goto out_nomap;
> > }
> >
> > + ptep = vmf->pte;
> > + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > + int nr = folio_nr_pages(folio);
> > + unsigned long idx = folio_page_idx(folio, page);
> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > + pte_t *folio_ptep;
> > + pte_t folio_pte;
> > +
> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > + goto check_folio;
> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > + goto check_folio;
> > +
> > + folio_ptep = vmf->pte - idx;
> > + folio_pte = ptep_get(folio_ptep);
> > + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> > + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> > + goto check_folio;
> > +
> > + page_idx = idx;
> > + address = folio_start;
> > + ptep = folio_ptep;
> > + nr_pages = nr;
> > + entry = folio->swap;
> > + page = &folio->page;
> > + }
> > +
> > +check_folio:
> > +
> > /*
> > * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> > * must never point at an anonymous page in the swapcache that is
> > @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free_nr(entry, 1);
> > + swap_free_nr(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> > + folio_nr_pages(folio) == nr_pages))) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
>
> I fail to convince myself that this change is correct, and if it is
> correct, it's confusing (I think there is a dependency on
> folio_free_swap() having been called and succeeding, such that we don't
> have a folio that is in the swapcache at this point).
>
> Why can't we move the folio_ref_add() after this check and just leave
> the check as it is?
>
> "folio_ref_count(folio) == 1" is as clear as it gets: we hold the single
> reference, so we can do with this thing whatever we want: it's certainly
> exclusive. No swapcache, no other people mapping it.

Right.
I believe the code works correctly but is a bit confusing. As you said,
we might move folio_ref_add() after the folio_ref_count(folio) == 1 check.

>
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-05-06 12:58:42

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Tue, May 7, 2024 at 12:38 AM Barry Song <[email protected]> wrote:
>
> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <[email protected]> wrote:
> >
> > On 04.05.24 01:23, Barry Song wrote:
> > >> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
> > >>
> > >> On 03/05/2024 01:50, Barry Song wrote:
> > >>> From: Chuanhua Han <[email protected]>
> > >>>
> > >>> When a large folio is found in the swapcache, the current implementation
> > >>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > >>> page faults. This patch opts to map the entire large folio at once to
> > >>> minimize page faults. Additionally, redundant checks and early exits
> > >>> for ARM64 MTE restoring are removed.
> > >>>
> > >>> Signed-off-by: Chuanhua Han <[email protected]>
> > >>> Co-developed-by: Barry Song <[email protected]>
> > >>> Signed-off-by: Barry Song <[email protected]>
> > >>
> > >> With the suggested changes below:
> > >>
> > >> Reviewed-by: Ryan Roberts <[email protected]>
> > >>
> > >>> ---
> > >>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> > >>> 1 file changed, 48 insertions(+), 12 deletions(-)
> > >>>
> > >>> diff --git a/mm/memory.c b/mm/memory.c
> > >>> index 22e7c33cc747..940fdbe69fa1 100644
> > >>> --- a/mm/memory.c
> > >>> +++ b/mm/memory.c
> > >>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>> pte_t pte;
> > >>> vm_fault_t ret = 0;
> > >>> void *shadow = NULL;
> > >>> + int nr_pages = 1;
> > >>> + unsigned long page_idx = 0;
> > >>> + unsigned long address = vmf->address;
> > >>> + pte_t *ptep;
> > >>
> > >> nit: Personally I'd prefer all these to get initialised just before the "if
> > >> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> > >> any logic between here and there makes an adjustment) and its clear that they
> > >> are only to be used after that block (the compiler will warn if using an
> > >> uninitialized value).
> > >
> > > right. I agree this will make the code more readable.
> > >
> > >>
> > >>>
> > >>> if (!pte_unmap_same(vmf))
> > >>> goto out;
> > >>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>> goto out_nomap;
> > >>> }
> > >>>
> > >>> + ptep = vmf->pte;
> > >>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > >>> + int nr = folio_nr_pages(folio);
> > >>> + unsigned long idx = folio_page_idx(folio, page);
> > >>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > >>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > >>> + pte_t *folio_ptep;
> > >>> + pte_t folio_pte;
> > >>> +
> > >>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > >>> + goto check_folio;
> > >>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > >>> + goto check_folio;
> > >>> +
> > >>> + folio_ptep = vmf->pte - idx;
> > >>> + folio_pte = ptep_get(folio_ptep);
> > >>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> > >>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> > >>> + goto check_folio;
> > >>> +
> > >>> + page_idx = idx;
> > >>> + address = folio_start;
> > >>> + ptep = folio_ptep;
> > >>> + nr_pages = nr;
> > >>> + entry = folio->swap;
> > >>> + page = &folio->page;
> > >>> + }
> > >>> +
> > >>> +check_folio:
> > >>
> > >> Is this still the correct label name, given the checks are now above the new
> > >> block? Perhaps "one_page" or something like that?
> > >
> > > not quite sure about this, as the code after one_page can be multiple_pages.
> > > On the other hand, it seems we are really checking folio after "check_folio"
> > > :-)
> > >
> > >
> > > BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> > > BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> > >
> > > /*
> > > * Check under PT lock (to protect against concurrent fork() sharing
> > > * the swap entry concurrently) for certainly exclusive pages.
> > > */
> > > if (!folio_test_ksm(folio)) {
> > >
> > >
> > >>
> > >>> +
> > >>> /*
> > >>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> > >>> * must never point at an anonymous page in the swapcache that is
> > >>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>> * We're already holding a reference on the page but haven't mapped it
> > >>> * yet.
> > >>> */
> > >>> - swap_free_nr(entry, 1);
> > >>> + swap_free_nr(entry, nr_pages);
> > >>> if (should_try_to_free_swap(folio, vma, vmf->flags))
> > >>> folio_free_swap(folio);
> > >>>
> > >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > >>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > >>> + folio_ref_add(folio, nr_pages - 1);
> > >>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > >>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > >>> pte = mk_pte(page, vma->vm_page_prot);
> > >>>
> > >>> /*
> > >>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>> * exclusivity.
> > >>> */
> > >>> if (!folio_test_ksm(folio) &&
> > >>> - (exclusive || folio_ref_count(folio) == 1)) {
> > >>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> > >>> + folio_nr_pages(folio) == nr_pages))) {
> > >>
> > >> I think in practice there is no change here? If nr_pages > 1 then the folio is
> > >> in the swapcache, so there is an extra ref on it? I agree with the change for
> > >> robustness sake. Just checking my understanding.
> > >
> > > This is the code showing we are reusing/(mkwrite) a folio either
> > > 1. we meet a small folio and we are the only one hitting the small folio
> > > 2. we meet a large folio and we are the only one hitting the large folio
> > >
> > > any corner cases besides the above two seems difficult. for example,
> > >
> > > while we hit a large folio in swapcache but if we can't entirely map it
> > > (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> > > == nr_pages == 1
> >
> > No, there would be other references from the swapcache and
> > folio_ref_count(folio) > 1. See my other reply.
>
> right. can be clearer by:

Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
folio_ref_count(folio) == 1 and move folio_ref_add(folio, nr_pages - 1)
after the check?

One case is that we have a large folio with 16 PTEs and we unmap 15 of its
swap PTE entries, so only one swap entry is left. Then we hit the large
folio in the swapcache, but since we have only one PTE we will map only
that one PTE. Without folio_nr_pages(folio) == nr_pages, we would reuse the
large folio for that single PTE; with it, do_wp_page() will migrate the
large folio to a small one?

It's 1AM and I'm tired and sleepy, so I'm not quite sure I am correct.
I look forward to seeing your reply tomorrow morning :-)

>
> @@ -4263,7 +4264,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - folio_ref_add(folio, nr_pages - 1);
> add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> pte = mk_pte(page, vma->vm_page_prot);
> @@ -4275,14 +4275,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || (folio_ref_count(folio) == nr_pages &&
> - folio_nr_pages(folio) == nr_pages))) {
> + (exclusive || folio_ref_count(folio) == 1)) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> + folio_ref_add(folio, nr_pages - 1);
> flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
>
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

2024-05-06 13:07:33

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <[email protected]> wrote:
>
> On 04.05.24 01:23, Barry Song wrote:
> > On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 03/05/2024 01:50, Barry Song wrote:
> >>> From: Chuanhua Han <[email protected]>
> >>>
> >>> When a large folio is found in the swapcache, the current implementation
> >>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>> page faults. This patch opts to map the entire large folio at once to
> >>> minimize page faults. Additionally, redundant checks and early exits
> >>> for ARM64 MTE restoring are removed.
> >>>
> >>> Signed-off-by: Chuanhua Han <[email protected]>
> >>> Co-developed-by: Barry Song <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>
> >> With the suggested changes below:
> >>
> >> Reviewed-by: Ryan Roberts <[email protected]>
> >>
> >>> ---
> >>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>> 1 file changed, 48 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 22e7c33cc747..940fdbe69fa1 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> pte_t pte;
> >>> vm_fault_t ret = 0;
> >>> void *shadow = NULL;
> >>> + int nr_pages = 1;
> >>> + unsigned long page_idx = 0;
> >>> + unsigned long address = vmf->address;
> >>> + pte_t *ptep;
> >>
> >> nit: Personally I'd prefer all these to get initialised just before the "if
> >> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> >> any logic between here and there makes an adjustment) and its clear that they
> >> are only to be used after that block (the compiler will warn if using an
> >> uninitialized value).
> >
> > right. I agree this will make the code more readable.
> >
> >>
> >>>
> >>> if (!pte_unmap_same(vmf))
> >>> goto out;
> >>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> goto out_nomap;
> >>> }
> >>>
> >>> + ptep = vmf->pte;
> >>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>> + int nr = folio_nr_pages(folio);
> >>> + unsigned long idx = folio_page_idx(folio, page);
> >>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>> + pte_t *folio_ptep;
> >>> + pte_t folio_pte;
> >>> +
> >>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>> + goto check_folio;
> >>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>> + goto check_folio;
> >>> +
> >>> + folio_ptep = vmf->pte - idx;
> >>> + folio_pte = ptep_get(folio_ptep);
> >>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>> + goto check_folio;
> >>> +
> >>> + page_idx = idx;
> >>> + address = folio_start;
> >>> + ptep = folio_ptep;
> >>> + nr_pages = nr;
> >>> + entry = folio->swap;
> >>> + page = &folio->page;
> >>> + }
> >>> +
> >>> +check_folio:
> >>
> >> Is this still the correct label name, given the checks are now above the new
> >> block? Perhaps "one_page" or something like that?
> >
> > not quite sure about this, as the code after one_page can be multiple_pages.
> > On the other hand, it seems we are really checking folio after "check_folio"
> > :-)
> >
> >
> > BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> > BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >
> > /*
> > * Check under PT lock (to protect against concurrent fork() sharing
> > * the swap entry concurrently) for certainly exclusive pages.
> > */
> > if (!folio_test_ksm(folio)) {
> >
> >
> >>
> >>> +
> >>> /*
> >>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>> * must never point at an anonymous page in the swapcache that is
> >>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> * We're already holding a reference on the page but haven't mapped it
> >>> * yet.
> >>> */
> >>> - swap_free_nr(entry, 1);
> >>> + swap_free_nr(entry, nr_pages);
> >>> if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>> folio_free_swap(folio);
> >>>
> >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>> + folio_ref_add(folio, nr_pages - 1);
> >>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>> pte = mk_pte(page, vma->vm_page_prot);
> >>>
> >>> /*
> >>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> * exclusivity.
> >>> */
> >>> if (!folio_test_ksm(folio) &&
> >>> - (exclusive || folio_ref_count(folio) == 1)) {
> >>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>> + folio_nr_pages(folio) == nr_pages))) {
> >>
> >> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >> in the swapcache, so there is an extra ref on it? I agree with the change for
> >> robustness sake. Just checking my understanding.
> >
> > This is the code showing we are reusing/(mkwrite) a folio either
> > 1. we meet a small folio and we are the only one hitting the small folio
> > 2. we meet a large folio and we are the only one hitting the large folio
> >
> > any corner cases besides the above two seems difficult. for example,
> >
> > while we hit a large folio in swapcache but if we can't entirely map it
> > (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> > == nr_pages == 1
>
> No, there would be other references from the swapcache and
> folio_ref_count(folio) > 1. See my other reply.

right. This can be made clearer by:

@@ -4263,7 +4264,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- folio_ref_add(folio, nr_pages - 1);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);
@@ -4275,14 +4275,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || (folio_ref_count(folio) == nr_pages &&
- folio_nr_pages(folio) == nr_pages))) {
+ (exclusive || folio_ref_count(folio) == 1)) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
+ folio_ref_add(folio, nr_pages - 1);
flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
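
To make the reordering concrete, here is a throwaway userspace sketch of the
refcounting (the names are made up, and it deliberately ignores the
folio_nr_pages(folio) == nr_pages half of the old check, which is what the
rest of this thread is about). The point is simply that the check now runs
before we take the extra nr_pages - 1 references, so an exclusive folio is
seen with exactly our one reference, assuming folio_free_swap() has already
dropped the swapcache reference:

#include <stdbool.h>
#include <stdio.h>

/* model the refcount do_swap_page() observes at the exclusivity check */
static bool reuse_check(int other_refs, int nr_pages, bool add_before_check)
{
        int refcount = 1 + other_refs;  /* our reference plus any others */

        if (add_before_check)           /* old ordering: folio_ref_add() first */
                refcount += nr_pages - 1;

        return add_before_check ? refcount == nr_pages : refcount == 1;
}

int main(void)
{
        /* exclusive large folio, batch of 4: both orderings say "reuse" (1 1) */
        printf("%d %d\n", reuse_check(0, 4, true), reuse_check(0, 4, false));
        /* something else (e.g. the swapcache) still holds a reference (0 0) */
        printf("%d %d\n", reuse_check(1, 4, true), reuse_check(1, 4, false));
        return 0;
}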


>
> --
> Cheers,
>
> David / dhildenb
>

2024-05-06 13:17:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On 06.05.24 14:58, Barry Song wrote:
> On Tue, May 7, 2024 at 12:38 AM Barry Song <[email protected]> wrote:
>>
>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 04.05.24 01:23, Barry Song wrote:
>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>> From: Chuanhua Han <[email protected]>
>>>>>>
>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>> for ARM64 MTE restoring are removed.
>>>>>>
>>>>>> Signed-off-by: Chuanhua Han <[email protected]>
>>>>>> Co-developed-by: Barry Song <[email protected]>
>>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>>
>>>>> With the suggested changes below:
>>>>>
>>>>> Reviewed-by: Ryan Roberts <[email protected]>
>>>>>
>>>>>> ---
>>>>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>> 1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> pte_t pte;
>>>>>> vm_fault_t ret = 0;
>>>>>> void *shadow = NULL;
>>>>>> + int nr_pages = 1;
>>>>>> + unsigned long page_idx = 0;
>>>>>> + unsigned long address = vmf->address;
>>>>>> + pte_t *ptep;
>>>>>
>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>>>>> any logic between here and there makes an adjustment) and its clear that they
>>>>> are only to be used after that block (the compiler will warn if using an
>>>>> uninitialized value).
>>>>
>>>> right. I agree this will make the code more readable.
>>>>
>>>>>
>>>>>>
>>>>>> if (!pte_unmap_same(vmf))
>>>>>> goto out;
>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> goto out_nomap;
>>>>>> }
>>>>>>
>>>>>> + ptep = vmf->pte;
>>>>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>> + int nr = folio_nr_pages(folio);
>>>>>> + unsigned long idx = folio_page_idx(folio, page);
>>>>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>> + pte_t *folio_ptep;
>>>>>> + pte_t folio_pte;
>>>>>> +
>>>>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>> + goto check_folio;
>>>>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>> + goto check_folio;
>>>>>> +
>>>>>> + folio_ptep = vmf->pte - idx;
>>>>>> + folio_pte = ptep_get(folio_ptep);
>>>>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>> + goto check_folio;
>>>>>> +
>>>>>> + page_idx = idx;
>>>>>> + address = folio_start;
>>>>>> + ptep = folio_ptep;
>>>>>> + nr_pages = nr;
>>>>>> + entry = folio->swap;
>>>>>> + page = &folio->page;
>>>>>> + }
>>>>>> +
>>>>>> +check_folio:
>>>>>
>>>>> Is this still the correct label name, given the checks are now above the new
>>>>> block? Perhaps "one_page" or something like that?
>>>>
>>>> not quite sure about this, as the code after one_page can be multiple_pages.
>>>> On the other hand, it seems we are really checking folio after "check_folio"
>>>> :-)
>>>>
>>>>
>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>
>>>> /*
>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>> */
>>>> if (!folio_test_ksm(folio)) {
>>>>
>>>>
>>>>>
>>>>>> +
>>>>>> /*
>>>>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>> * must never point at an anonymous page in the swapcache that is
>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> * We're already holding a reference on the page but haven't mapped it
>>>>>> * yet.
>>>>>> */
>>>>>> - swap_free_nr(entry, 1);
>>>>>> + swap_free_nr(entry, nr_pages);
>>>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>> folio_free_swap(folio);
>>>>>>
>>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>> + folio_ref_add(folio, nr_pages - 1);
>>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>> pte = mk_pte(page, vma->vm_page_prot);
>>>>>>
>>>>>> /*
>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> * exclusivity.
>>>>>> */
>>>>>> if (!folio_test_ksm(folio) &&
>>>>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>>>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>> + folio_nr_pages(folio) == nr_pages))) {
>>>>>
>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>> robustness sake. Just checking my understanding.
>>>>
>>>> This is the code showing we are reusing/(mkwrite) a folio either
>>>> 1. we meet a small folio and we are the only one hitting the small folio
>>>> 2. we meet a large folio and we are the only one hitting the large folio
>>>>
>>>> any corner cases besides the above two seems difficult. for example,
>>>>
>>>> while we hit a large folio in swapcache but if we can't entirely map it
>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
>>>> == nr_pages == 1
>>>
>>> No, there would be other references from the swapcache and
>>> folio_ref_count(folio) > 1. See my other reply.
>>
>> right. can be clearer by:
>
> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?

I don't think that we will "need" it.

>
> one case is that we have a large folio with 16 PTEs, and we unmap
> 15 swap PTE entries, thus we have only one swap entry left. Then
> we hit the large folio in swapcache. but we have only one PTE thus we will
> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
> large folio for one PTE. with it, do_wp_page() will migrate the large
> folio to a small one?

We will set PAE bit and do_wp_page() will unconditionally reuse that page.

Note that this is the same as if we had pte_swp_exclusive() set and
would have run into "exclusive=true" here.

If we'd want a similar "optimization" as we have in
wp_can_reuse_anon_folio(), you'd want something like

exclusive || (folio_ref_count(folio) == 1 &&
              (!folio_test_large(folio) || nr_pages > 1))

... but I am not sure if that is really worth the complexity here.

>
> 1AM, tired and sleepy. not quite sure I am correct.
> I look forward to seeing your reply tomorrow morning :-)

Heh, no need to dream about this ;)

--
Cheers,

David / dhildenb


2024-05-06 16:53:53

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages

On 5/2/24 18:50, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> Should do_swap_page() have the capability to directly map a large folio,
> metadata restoration becomes necessary for a specified number of pages
> denoted as nr. It's important to highlight that metadata restoration is
> solely required by the SPARC platform, which, however, does not enable
> THP_SWAP. Consequently, in the present kernel configuration, there
> exists no practical scenario where users necessitate the restoration of
> nr metadata. Platforms implementing THP_SWAP might invoke this function
> with nr values exceeding 1, subsequent to do_swap_page() successfully
> mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr()
> functions remain empty.
>
> Cc: Khalid Aziz <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: Andreas Larsson <[email protected]>
> Signed-off-by: Barry Song <[email protected]>

Looks good to me.

Reviewed-by: Khalid Aziz <[email protected]>


> ---
> include/linux/pgtable.h | 26 ++++++++++++++++++++------
> mm/memory.c | 3 ++-
> 2 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 18019f037bae..463e84c3de26 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1084,6 +1084,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
> })
>
> #ifndef __HAVE_ARCH_DO_SWAP_PAGE
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + pte_t pte, pte_t oldpte,
> + int nr)
> +{
> +
> +}
> +#else
> /*
> * Some architectures support metadata associated with a page. When a
> * page is being swapped out, this metadata must be saved so it can be
> @@ -1092,12 +1101,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
> * page as metadata for the page. arch_do_swap_page() can restore this
> * metadata when a page is swapped back in.
> */
> -static inline void arch_do_swap_page(struct mm_struct *mm,
> - struct vm_area_struct *vma,
> - unsigned long addr,
> - pte_t pte, pte_t oldpte)
> -{
> -
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + pte_t pte, pte_t oldpte,
> + int nr)
> +{
> + for (int i = 0; i < nr; i++) {
> + arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
> + pte_advance_pfn(pte, i),
> + pte_advance_pfn(oldpte, i));
> + }
> }
> #endif
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f033eb3528ba..74cdefd58f5f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4266,7 +4266,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> + pte, vmf->orig_pte, 1);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {


2024-05-06 22:58:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <[email protected]> wrote:
>
> On 06.05.24 14:58, Barry Song wrote:
> > On Tue, May 7, 2024 at 12:38 AM Barry Song <[email protected]> wrote:
> >>
> >> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 04.05.24 01:23, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Chuanhua Han <[email protected]>
> >>>>>>
> >>>>>> When a large folio is found in the swapcache, the current implementation
> >>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>>>>> page faults. This patch opts to map the entire large folio at once to
> >>>>>> minimize page faults. Additionally, redundant checks and early exits
> >>>>>> for ARM64 MTE restoring are removed.
> >>>>>>
> >>>>>> Signed-off-by: Chuanhua Han <[email protected]>
> >>>>>> Co-developed-by: Barry Song <[email protected]>
> >>>>>> Signed-off-by: Barry Song <[email protected]>
> >>>>>
> >>>>> With the suggested changes below:
> >>>>>
> >>>>> Reviewed-by: Ryan Roberts <[email protected]>
> >>>>>
> >>>>>> ---
> >>>>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>>>> 1 file changed, 48 insertions(+), 12 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 22e7c33cc747..940fdbe69fa1 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>> pte_t pte;
> >>>>>> vm_fault_t ret = 0;
> >>>>>> void *shadow = NULL;
> >>>>>> + int nr_pages = 1;
> >>>>>> + unsigned long page_idx = 0;
> >>>>>> + unsigned long address = vmf->address;
> >>>>>> + pte_t *ptep;
> >>>>>
> >>>>> nit: Personally I'd prefer all these to get initialised just before the "if
> >>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> >>>>> any logic between here and there makes an adjustment) and its clear that they
> >>>>> are only to be used after that block (the compiler will warn if using an
> >>>>> uninitialized value).
> >>>>
> >>>> right. I agree this will make the code more readable.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> if (!pte_unmap_same(vmf))
> >>>>>> goto out;
> >>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>> goto out_nomap;
> >>>>>> }
> >>>>>>
> >>>>>> + ptep = vmf->pte;
> >>>>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>>>>> + int nr = folio_nr_pages(folio);
> >>>>>> + unsigned long idx = folio_page_idx(folio, page);
> >>>>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>>>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>>>>> + pte_t *folio_ptep;
> >>>>>> + pte_t folio_pte;
> >>>>>> +
> >>>>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>>>>> + goto check_folio;
> >>>>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>>>>> + goto check_folio;
> >>>>>> +
> >>>>>> + folio_ptep = vmf->pte - idx;
> >>>>>> + folio_pte = ptep_get(folio_ptep);
> >>>>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>>>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>>>>> + goto check_folio;
> >>>>>> +
> >>>>>> + page_idx = idx;
> >>>>>> + address = folio_start;
> >>>>>> + ptep = folio_ptep;
> >>>>>> + nr_pages = nr;
> >>>>>> + entry = folio->swap;
> >>>>>> + page = &folio->page;
> >>>>>> + }
> >>>>>> +
> >>>>>> +check_folio:
> >>>>>
> >>>>> Is this still the correct label name, given the checks are now above the new
> >>>>> block? Perhaps "one_page" or something like that?
> >>>>
> >>>> not quite sure about this, as the code after one_page can be multiple_pages.
> >>>> On the other hand, it seems we are really checking folio after "check_folio"
> >>>> :-)
> >>>>
> >>>>
> >>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> >>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >>>>
> >>>> /*
> >>>> * Check under PT lock (to protect against concurrent fork() sharing
> >>>> * the swap entry concurrently) for certainly exclusive pages.
> >>>> */
> >>>> if (!folio_test_ksm(folio)) {
> >>>>
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> /*
> >>>>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>>>> * must never point at an anonymous page in the swapcache that is
> >>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>> * We're already holding a reference on the page but haven't mapped it
> >>>>>> * yet.
> >>>>>> */
> >>>>>> - swap_free_nr(entry, 1);
> >>>>>> + swap_free_nr(entry, nr_pages);
> >>>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>>>> folio_free_swap(folio);
> >>>>>>
> >>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>>>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>>>>> + folio_ref_add(folio, nr_pages - 1);
> >>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>>>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>>>> pte = mk_pte(page, vma->vm_page_prot);
> >>>>>>
> >>>>>> /*
> >>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>> * exclusivity.
> >>>>>> */
> >>>>>> if (!folio_test_ksm(folio) &&
> >>>>>> - (exclusive || folio_ref_count(folio) == 1)) {
> >>>>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>>>>> + folio_nr_pages(folio) == nr_pages))) {
> >>>>>
> >>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
> >>>>> robustness sake. Just checking my understanding.
> >>>>
> >>>> This is the code showing we are reusing/(mkwrite) a folio either
> >>>> 1. we meet a small folio and we are the only one hitting the small folio
> >>>> 2. we meet a large folio and we are the only one hitting the large folio
> >>>>
> >>>> any corner cases besides the above two seems difficult. for example,
> >>>>
> >>>> while we hit a large folio in swapcache but if we can't entirely map it
> >>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> >>>> == nr_pages == 1
> >>>
> >>> No, there would be other references from the swapcache and
> >>> folio_ref_count(folio) > 1. See my other reply.
> >>
> >> right. can be clearer by:
> >
> > Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> > folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?
>
> I don't think that we will "need" it.
>
> >
> > one case is that we have a large folio with 16 PTEs, and we unmap
> > 15 swap PTE entries, thus we have only one swap entry left. Then
> > we hit the large folio in swapcache. but we have only one PTE thus we will
> > map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
> > large folio for one PTE. with it, do_wp_page() will migrate the large
> > folio to a small one?
>
> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
>
> Note that this is the same as if we had pte_swp_exclusive() set and
> would have run into "exclusive=true" here.
>
> If we'd want a similar "optimization" as we have in
> wp_can_reuse_anon_folio(), you'd want something like
>
> exclusive || (folio_ref_count(folio) == 1 &&
> (!folio_test_large(folio) || nr_pages > 1)

I feel like

A : !folio_test_large(folio) || nr_pages > 1

equals

B: folio_nr_pages(folio) == nr_pages

if the folio is small, folio_test_large(folio) is false, so both A and B are true;
if the folio is large and we map the whole large folio, A is true
because nr_pages > 1, and B is also true;
if the folio is large and we map only a single PTE, A is false,
and B is also false, because nr_pages == 1 while folio_nr_pages(folio) > 1;

right?

However, I agree that delving into this complexity might not be necessary
at the moment.
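
A throwaway truth table makes this easy to check under the assumption the
patch actually enforces, i.e. nr_pages is either 1 or folio_nr_pages(folio),
because the batch path only fires when the whole folio can be mapped (plain
userspace C with stand-in values, not the kernel helpers):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        /* the only combinations this patch can produce */
        struct { int folio_nr, nr_pages; } c[] = {
                {  1,  1 },     /* small folio */
                { 16, 16 },     /* large folio, whole folio mapped */
                { 16,  1 },     /* large folio, only a single PTE mapped */
        };

        for (unsigned int i = 0; i < sizeof(c) / sizeof(c[0]); i++) {
                bool large = c[i].folio_nr > 1;         /* folio_test_large() */
                bool a = !large || c[i].nr_pages > 1;
                bool b = c[i].folio_nr == c[i].nr_pages;

                printf("folio_nr=%2d nr_pages=%2d  A=%d B=%d\n",
                       c[i].folio_nr, c[i].nr_pages, a, b);
        }
        return 0;
}

A and B agree on every row here; they could only diverge if nr_pages were
allowed to take other values, e.g. a future partial batch.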

>
> ... but I am not sure if that is really worth the complexity here.
>
> >
> > 1AM, tired and sleepy. not quite sure I am correct.
> > I look forward to seeing your reply tomorrow morning :-)
>
> Heh, no need to dream about this ;)
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

2024-05-07 08:14:58

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On 06/05/2024 09:31, David Hildenbrand wrote:
> On 06.05.24 10:20, Barry Song wrote:
>> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 04.05.24 01:40, Barry Song wrote:
>>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>> From: Barry Song <[email protected]>
>>>>>>
>>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>>
>>>>> You mean pte_next_swp_offset()?
>>>>
>>>> yes.
>>>>
>>>>>
>>>>>> will directly invoke it with delta = 1.
>>>>>>
>>>>>> Suggested-by: "Huang, Ying" <[email protected]>
>>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>>> ---
>>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
>>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>>> --- a/mm/internal.h
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
>>>>>> *folio, unsigned long addr,
>>>>>>    }
>>>>>>
>>>>>>    /**
>>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
>>>>>> pte.
>>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>>> + *    forward or backward by delta
>>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>>     *    non_swap_entry() must be false.
>>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>>> + *    is positive; backward if delta is negative
>>>>>>     *
>>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>>     */
>>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>>
>>>>> We have equivalent functions for pfn:
>>>>>
>>>>>     pte_next_pfn()
>>>>>     pte_advance_pfn()
>>>>>
>>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>>
>>>>> I guess we don't have a need for that and it adds more churn.
>>>>
>>>> we might have a need in the below case.
>>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>>> folios of process
>>>> A become single-mapped.
>>>> Right now, while writing A's folios, we are CoWing A's large folios
>>>> into many small
>>>> folios. I believe we can reuse the entire large folios instead of doing
>>>> nr_pages
>>>> CoW and page faults.
>>>> In this case, we might want to get the first PTE from vmf->pte.
>>>
>>> Once we have COW reuse for large folios in place (I think you know that
>>> I am working on that), it might make sense to "COW-reuse around",
>>
>> TBH, I don't know if you are working on that. please Cc me next time :-)
>
> I could have sworn I mentioned it to you already :)
>
> See
>
> https://lore.kernel.org/linux-mm/[email protected]/T/
>
> I'll follow-up on that soonish (now that batching is upstream and the large
> mapcount is on its way upstream).
>
>>
>>> meaning we look if some neighboring PTEs map the same large folio and
>>> map them writable as well. But if it's really worth it, increasing page
>>> fault latency, is to be decided separately.
>>
>> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
>> Perhaps we can discover a more cost-effective method to signify that a large
>> folio is probably singly mapped?
>
> Yes, precisely what I am up to!
>
>> and only call "multi-PTEs" reuse while that
>> condition is true in PF and avoid increasing latency always?
>
> I'm thinking along those lines:
>
> > If we detect that it's exclusive, we can certainly map the current PTE
> writable. Then, we can decide how much (and if) we want to fault-around writable
> as an optimization.
>
> For smallish large folios, it might make sense to try faulting around most of
> the folio.
>
> For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
> to fault around the whole thing -- especially if there is little benefit to be
> had from contig-pte bits.
>
>>
>>>
>>>
>>>>
>>>> Another case, might be
>>>> A forks B, and we write either A or B, we might CoW an entire large
>>>> folios instead
>>>> CoWing nr_pages small folios.
>>>>
>>>> case 1 seems more useful, I might have a go after some days. then we might
>>>> see pte_move_pfn().
>>> pte_move_pfn() does sound odd to me.

Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
Perhaps just pte_advance_swp_offset() with a negative value is clearer about
what it's doing?
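
(For intuition only, a toy userspace model of "advance the swap offset by a
signed delta"; the field packing and names below are invented and are not the
real swp pte encoding:)

#include <stdio.h>

typedef unsigned long long toy_pte_t;

#define TOY_OFFSET_BITS 58
#define TOY_OFFSET_MASK ((1ULL << TOY_OFFSET_BITS) - 1)

/* pack a made-up (type, offset) pair into one word */
static toy_pte_t toy_swp_pte(unsigned long long type, unsigned long long offset)
{
        return (type << TOY_OFFSET_BITS) | (offset & TOY_OFFSET_MASK);
}

/* move the offset field forward (delta > 0) or backward (delta < 0) */
static toy_pte_t toy_move_swp_offset(toy_pte_t pte, long long delta)
{
        unsigned long long type = pte >> TOY_OFFSET_BITS;
        unsigned long long offset = pte & TOY_OFFSET_MASK;

        return toy_swp_pte(type, offset + delta);
}

int main(void)
{
        toy_pte_t pte = toy_swp_pte(1, 100);

        /* faulting PTE is index 3 of its folio: step back to the first entry */
        toy_pte_t first = toy_move_swp_offset(pte, -3);
        /* the old pte_next_swp_offset() behaviour is just delta == 1 */
        toy_pte_t next = toy_move_swp_offset(pte, 1);

        printf("offset %llu -> first %llu, next %llu\n",
               pte & TOY_OFFSET_MASK,
               first & TOY_OFFSET_MASK, next & TOY_OFFSET_MASK);
        return 0;
}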

>>> It might not be required to
>>> implement the optimization described above. (it's easier to simply read
>>> another PTE, check if it maps the same large folio, and to batch from there)

Yes agreed.

>>>
>>
>> It appears that your proposal suggests potential reusability as follows: if we
>> have a large folio containing 16 PTEs, you might consider reusing only 4 by
>> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
>> if my understanding is wrong.
>>
>> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
>> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
>> in nr_pages, thus enabling complete reuse of the whole folio.
>
> Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
> there might be easier and cleaner.
>


2024-05-07 08:17:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On 04/05/2024 00:23, Barry Song wrote:
> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 03/05/2024 01:50, Barry Song wrote:
>>> From: Chuanhua Han <[email protected]>
>>>
>>> When a large folio is found in the swapcache, the current implementation
>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>> page faults. This patch opts to map the entire large folio at once to
>>> minimize page faults. Additionally, redundant checks and early exits
>>> for ARM64 MTE restoring are removed.
>>>
>>> Signed-off-by: Chuanhua Han <[email protected]>
>>> Co-developed-by: Barry Song <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>
>> With the suggested changes below:
>>
>> Reviewed-by: Ryan Roberts <[email protected]>
>>
>>> ---
>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>> 1 file changed, 48 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 22e7c33cc747..940fdbe69fa1 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> pte_t pte;
>>> vm_fault_t ret = 0;
>>> void *shadow = NULL;
>>> + int nr_pages = 1;
>>> + unsigned long page_idx = 0;
>>> + unsigned long address = vmf->address;
>>> + pte_t *ptep;
>>
>> nit: Personally I'd prefer all these to get initialised just before the "if
>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>> any logic between here and there makes an adjustment) and its clear that they
>> are only to be used after that block (the compiler will warn if using an
>> uninitialized value).
>
> right. I agree this will make the code more readable.
>
>>
>>>
>>> if (!pte_unmap_same(vmf))
>>> goto out;
>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> goto out_nomap;
>>> }
>>>
>>> + ptep = vmf->pte;
>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>> + int nr = folio_nr_pages(folio);
>>> + unsigned long idx = folio_page_idx(folio, page);
>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>> + pte_t *folio_ptep;
>>> + pte_t folio_pte;
>>> +
>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>> + goto check_folio;
>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>> + goto check_folio;
>>> +
>>> + folio_ptep = vmf->pte - idx;
>>> + folio_pte = ptep_get(folio_ptep);
>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>> + goto check_folio;
>>> +
>>> + page_idx = idx;
>>> + address = folio_start;
>>> + ptep = folio_ptep;
>>> + nr_pages = nr;
>>> + entry = folio->swap;
>>> + page = &folio->page;
>>> + }
>>> +
>>> +check_folio:
>>
>> Is this still the correct label name, given the checks are now above the new
>> block? Perhaps "one_page" or something like that?
>
> not quite sure about this, as the code after one_page can be multiple_pages.
> On the other hand, it seems we are really checking folio after "check_folio"
> :-)

Yeah fair enough. Ignore my comment.

>
>
> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>
> /*
> * Check under PT lock (to protect against concurrent fork() sharing
> * the swap entry concurrently) for certainly exclusive pages.
> */
> if (!folio_test_ksm(folio)) {
>
>
>>
>>> +
>>> /*
>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>> * must never point at an anonymous page in the swapcache that is
>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * We're already holding a reference on the page but haven't mapped it
>>> * yet.
>>> */
>>> - swap_free_nr(entry, 1);
>>> + swap_free_nr(entry, nr_pages);
>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>> folio_free_swap(folio);
>>>
>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> + folio_ref_add(folio, nr_pages - 1);
>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>> pte = mk_pte(page, vma->vm_page_prot);
>>>
>>> /*
>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * exclusivity.
>>> */
>>> if (!folio_test_ksm(folio) &&
>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>> + folio_nr_pages(folio) == nr_pages))) {
>>
>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>> in the swapcache, so there is an extra ref on it? I agree with the change for
>> robustness sake. Just checking my understanding.
>
> This is the code showing we are reusing/(mkwrite) a folio either
> 1. we meet a small folio and we are the only one hitting the small folio
> 2. we meet a large folio and we are the only one hitting the large folio
>
> any corner cases besides the above two seems difficult. for example,
>
> while we hit a large folio in swapcache but if we can't entirely map it
> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> == nr_pages == 1, in this case, lacking folio_nr_pages(folio) == nr_pages
> might lead to mkwrite() on a single pte within a partially unmapped large
> folio. not quite sure this is wrong, but seems buggy and arduous.
>
>>
>>> if (vmf->flags & FAULT_FLAG_WRITE) {
>>> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>> vmf->flags &= ~FAULT_FLAG_WRITE;
>>> }
>>> rmap_flags |= RMAP_EXCLUSIVE;
>>> }
>>> - flush_icache_page(vma, page);
>>> + flush_icache_pages(vma, page, nr_pages);
>>> if (pte_swp_soft_dirty(vmf->orig_pte))
>>> pte = pte_mksoft_dirty(pte);
>>> if (pte_swp_uffd_wp(vmf->orig_pte))
>>> pte = pte_mkuffd_wp(pte);
>>> - vmf->orig_pte = pte;
>>> + vmf->orig_pte = pte_advance_pfn(pte, page_idx);
>>>
>>> /* ksm created a completely new copy */
>>> if (unlikely(folio != swapcache && swapcache)) {
>>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> + folio_add_new_anon_rmap(folio, vma, address);
>>> folio_add_lru_vma(folio, vma);
>>> } else {
>>> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
>>> rmap_flags);
>>> }
>>>
>>> VM_BUG_ON(!folio_test_anon(folio) ||
>>> (pte_write(pte) && !PageAnonExclusive(page)));
>>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>> - arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
>>> - pte, vmf->orig_pte, 1);
>>> + set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
>>> + arch_do_swap_page_nr(vma->vm_mm, vma, address,
>>> + pte, pte, nr_pages);
>>>
>>> folio_unlock(folio);
>>> if (folio != swapcache && swapcache) {
>>> @@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>
>>> /* No need to invalidate - it was non-present before */
>>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> + update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
>>> unlock:
>>> if (vmf->pte)
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>
>
> Thanks
> Barry


2024-05-07 08:25:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On 07.05.24 00:58, Barry Song wrote:
> On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 06.05.24 14:58, Barry Song wrote:
>>> On Tue, May 7, 2024 at 12:38 AM Barry Song <[email protected]> wrote:
>>>>
>>>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 04.05.24 01:23, Barry Song wrote:
>>>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>
>>>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>>>> From: Chuanhua Han <[email protected]>
>>>>>>>>
>>>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>>>> for ARM64 MTE restoring are removed.
>>>>>>>>
>>>>>>>> Signed-off-by: Chuanhua Han <[email protected]>
>>>>>>>> Co-developed-by: Barry Song <[email protected]>
>>>>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>>>>
>>>>>>> With the suggested changes below:
>>>>>>>
>>>>>>> Reviewed-by: Ryan Roberts <[email protected]>
>>>>>>>
>>>>>>>> ---
>>>>>>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>>>> 1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> pte_t pte;
>>>>>>>> vm_fault_t ret = 0;
>>>>>>>> void *shadow = NULL;
>>>>>>>> + int nr_pages = 1;
>>>>>>>> + unsigned long page_idx = 0;
>>>>>>>> + unsigned long address = vmf->address;
>>>>>>>> + pte_t *ptep;
>>>>>>>
>>>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>>>>>>> any logic between here and there makes an adjustment) and its clear that they
>>>>>>> are only to be used after that block (the compiler will warn if using an
>>>>>>> uninitialized value).
>>>>>>
>>>>>> right. I agree this will make the code more readable.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> if (!pte_unmap_same(vmf))
>>>>>>>> goto out;
>>>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> goto out_nomap;
>>>>>>>> }
>>>>>>>>
>>>>>>>> + ptep = vmf->pte;
>>>>>>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>>>> + int nr = folio_nr_pages(folio);
>>>>>>>> + unsigned long idx = folio_page_idx(folio, page);
>>>>>>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>>>> + pte_t *folio_ptep;
>>>>>>>> + pte_t folio_pte;
>>>>>>>> +
>>>>>>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>>>> + goto check_folio;
>>>>>>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>>>> + goto check_folio;
>>>>>>>> +
>>>>>>>> + folio_ptep = vmf->pte - idx;
>>>>>>>> + folio_pte = ptep_get(folio_ptep);
>>>>>>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>>>> + goto check_folio;
>>>>>>>> +
>>>>>>>> + page_idx = idx;
>>>>>>>> + address = folio_start;
>>>>>>>> + ptep = folio_ptep;
>>>>>>>> + nr_pages = nr;
>>>>>>>> + entry = folio->swap;
>>>>>>>> + page = &folio->page;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> +check_folio:
>>>>>>>
>>>>>>> Is this still the correct label name, given the checks are now above the new
>>>>>>> block? Perhaps "one_page" or something like that?
>>>>>>
>>>>>> not quite sure about this, as the code after one_page can be multiple_pages.
>>>>>> On the other hand, it seems we are really checking folio after "check_folio"
>>>>>> :-)
>>>>>>
>>>>>>
>>>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>>>
>>>>>> /*
>>>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>>>> */
>>>>>> if (!folio_test_ksm(folio)) {
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> /*
>>>>>>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>>>> * must never point at an anonymous page in the swapcache that is
>>>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> * We're already holding a reference on the page but haven't mapped it
>>>>>>>> * yet.
>>>>>>>> */
>>>>>>>> - swap_free_nr(entry, 1);
>>>>>>>> + swap_free_nr(entry, nr_pages);
>>>>>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>>>> folio_free_swap(folio);
>>>>>>>>
>>>>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>>>> + folio_ref_add(folio, nr_pages - 1);
>>>>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>>>> pte = mk_pte(page, vma->vm_page_prot);
>>>>>>>>
>>>>>>>> /*
>>>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> * exclusivity.
>>>>>>>> */
>>>>>>>> if (!folio_test_ksm(folio) &&
>>>>>>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>>>>>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>>>> + folio_nr_pages(folio) == nr_pages))) {
>>>>>>>
>>>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>>>> robustness sake. Just checking my understanding.
>>>>>>
>>>>>> This is the code showing we are reusing/(mkwrite) a folio either
>>>>>> 1. we meet a small folio and we are the only one hitting the small folio
>>>>>> 2. we meet a large folio and we are the only one hitting the large folio
>>>>>>
>>>>>> any corner cases besides the above two seems difficult. for example,
>>>>>>
>>>>>> while we hit a large folio in swapcache but if we can't entirely map it
>>>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
>>>>>> == nr_pages == 1
>>>>>
>>>>> No, there would be other references from the swapcache and
>>>>> folio_ref_count(folio) > 1. See my other reply.
>>>>
>>>> right. can be clearer by:
>>>
>>> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
>>> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?
>>
>> I don't think that we will "need" it.
>>
>>>
>>> one case is that we have a large folio with 16 PTEs, and we unmap
>>> 15 swap PTE entries, thus we have only one swap entry left. Then
>>> we hit the large folio in swapcache. but we have only one PTE thus we will
>>> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
>>> large folio for one PTE. with it, do_wp_page() will migrate the large
>>> folio to a small one?
>>
>> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
>>
>> Note that this is the same as if we had pte_swp_exclusive() set and
>> would have run into "exclusive=true" here.
>>
>> If we'd want a similar "optimization" as we have in
>> wp_can_reuse_anon_folio(), you'd want something like
>>
>> exclusive || (folio_ref_count(folio) == 1 &&
>> (!folio_test_large(folio) || nr_pages > 1)
>
> I feel like
>
> A : !folio_test_large(folio) || nr_pages > 1
>
> equals
>
> B: folio_nr_pages(folio) == nr_pages
>
> if folio is small, folio_test_large(folio) is false, both A and B will be true;
> if folio is large, and we map the whole large folio, A will be true
> because of nr_pages > 1;
> B is also true;
> if folio is large, and we map single one PTE, A will be false;
> B is also false, because nr_pages == 1 but folio_nr_pages(folio) > 1;
>
> right?

Let's assume a single subpage of a large folio is no longer mapped.
Then, we'd have:

nr_pages == folio_nr_pages(folio) - 1.

You could simply map+reuse most of the folio without COWing.

Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
just easy to detect that the folio is exclusive (folio_ref_count(folio)
== 1 before mapping anything).

If you really want to mimic what do_wp_page() currently does, you should
have:

exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))

Personally, I think we should keep it simple here and use:

exclusive || folio_ref_count(folio) == 1

IMHO, that's as clear as it gets.
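
Putting the three candidates side by side in a throwaway userspace sketch, in
case that helps; it only evaluates the expressions for a few hypothetical
states and says nothing about which of those states are actually reachable in
do_swap_page():

#include <stdbool.h>
#include <stdio.h>

struct state {
        const char *desc;
        bool exclusive;         /* pte_swp_exclusive() was set */
        int refcount;           /* folio_ref_count() before folio_ref_add() */
        bool large;             /* folio_test_large() */
        int nr_pages;           /* PTEs about to be mapped */
};

static void eval(const struct state *s)
{
        bool simple = s->exclusive || s->refcount == 1;
        bool like_do_wp_page = s->exclusive ||
                               (s->refcount == 1 && !s->large);
        bool batch_opt = s->exclusive ||
                         (s->refcount == 1 &&
                          (!s->large || s->nr_pages > 1));

        printf("%-38s simple=%d do_wp_page-like=%d batch-opt=%d\n",
               s->desc, simple, like_do_wp_page, batch_opt);
}

int main(void)
{
        const struct state cases[] = {
                { "small folio, sole reference",        false, 1, false, 1 },
                { "large folio, whole folio, sole ref", false, 1, true, 16 },
                { "large folio, single PTE, sole ref",  false, 1, true, 1 },
                { "swap pte was exclusive",             true, 2, true, 1 },
        };

        for (unsigned int i = 0; i < sizeof(cases) / sizeof(cases[0]); i++)
                eval(&cases[i]);
        return 0;
}

The do_wp_page-like form never reuses a large folio via the refcount path,
the batch-opt form additionally refuses the single-PTE case, and the simple
form reuses in every sole-reference state above.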

--
Cheers,

David / dhildenb


2024-05-07 08:25:21

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On Tue, May 7, 2024 at 8:14 PM Ryan Roberts <[email protected]> wrote:
>
> On 06/05/2024 09:31, David Hildenbrand wrote:
> > On 06.05.24 10:20, Barry Song wrote:
> >> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 04.05.24 01:40, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <[email protected]> wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Barry Song <[email protected]>
> >>>>>>
> >>>>>> There could arise a necessity to obtain the first pte_t from a swap
> >>>>>> pte_t located in the middle. For instance, this may occur within the
> >>>>>> context of do_swap_page(), where a page fault can potentially occur in
> >>>>>> any PTE of a large folio. To address this, the following patch introduces
> >>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
> >>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
> >>>>>
> >>>>> You mean pte_next_swp_offset()?
> >>>>
> >>>> yes.
> >>>>
> >>>>>
> >>>>>> will directly invoke it with delta = 1.
> >>>>>>
> >>>>>> Suggested-by: "Huang, Ying" <[email protected]>
> >>>>>> Signed-off-by: Barry Song <[email protected]>
> >>>>>> ---
> >>>>>> mm/internal.h | 25 +++++++++++++++++++++----
> >>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>>>> index c5552d35d995..cfe4aed66a5c 100644
> >>>>>> --- a/mm/internal.h
> >>>>>> +++ b/mm/internal.h
> >>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
> >>>>>> *folio, unsigned long addr,
> >>>>>> }
> >>>>>>
> >>>>>> /**
> >>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
> >>>>>> pte.
> >>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> >>>>>> + * forward or backward by delta
> >>>>>> * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >>>>>> * non_swap_entry() must be false.
> >>>>>> + * @delta: The direction and the offset we are moving; forward if delta
> >>>>>> + * is positive; backward if delta is negative
> >>>>>> *
> >>>>>> - * Increments the swap offset, while maintaining all other fields, including
> >>>>>> + * Moves the swap offset, while maintaining all other fields, including
> >>>>>> * swap type, and any swp pte bits. The resulting pte is returned.
> >>>>>> */
> >>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
> >>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >>>>>
> >>>>> We have equivalent functions for pfn:
> >>>>>
> >>>>> pte_next_pfn()
> >>>>> pte_advance_pfn()
> >>>>>
> >>>>> Although the latter takes an unsigned long and only moves forward currently. I
> >>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
> >>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
> >>>>>
> >>>>> I guess we don't have a need for that and it adds more churn.
> >>>>
> >>>> we might have a need in the below case.
> >>>> A forks B, then A and B share large folios. B unmap/exit, then large
> >>>> folios of process
> >>>> A become single-mapped.
> >>>> Right now, while writing A's folios, we are CoWing A's large folios
> >>>> into many small
> >>>> folios. I believe we can reuse the entire large folios instead of doing
> >>>> nr_pages
> >>>> CoW and page faults.
> >>>> In this case, we might want to get the first PTE from vmf->pte.
> >>>
> >>> Once we have COW reuse for large folios in place (I think you know that
> >>> I am working on that), it might make sense to "COW-reuse around",
> >>
> >> TBH, I don't know if you are working on that. please Cc me next time :-)
> >
> > I could have sworn I mentioned it to you already :)
> >
> > See
> >
> > https://lore.kernel.org/linux-mm/[email protected]/T/
> >
> > I'll follow-up on that soonish (now that batching is upstream and the large
> > mapcount is on its way upstream).
> >
> >>
> >>> meaning we look if some neighboring PTEs map the same large folio and
> >>> map them writable as well. But if it's really worth it, increasing page
> >>> fault latency, is to be decided separately.
> >>
> >> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
> >> Perhaps we can discover a more cost-effective method to signify that a large
> >> folio is probably singly mapped?
> >
> > Yes, precisely what I am up to!
> >
> >> and only call "multi-PTEs" reuse while that
> >> condition is true in PF and avoid increasing latency always?
> >
> > I'm thinking along those lines:
> >
> > If we detect that it's exclusive, we can certainly map the current PTE
> > writable. Then, we can decide how much (and if) we want to fault-around writable
> > as an optimization.
> >
> > For smallish large folios, it might make sense to try faulting around most of
> > the folio.
> >
> > For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
> > to fault around the whole thing -- especially if there is little benefit to be
> > had from contig-pte bits.
> >
> >>
> >>>
> >>>
> >>>>
> >>>> Another case, might be
> >>>> A forks B, and we write either A or B, we might CoW an entire large
> >>>> folios instead
> >>>> CoWing nr_pages small folios.
> >>>>
> >>>> case 1 seems more useful, I might have a go after some days. then we might
> >>>> see pte_move_pfn().
> >>> pte_move_pfn() does sound odd to me.
>
> Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
> Perhaps just pte_advance_swp_offset() with a negative value is clearer about
> what its doing?
>

I am not a native speaker, but the dictionary says

advance:
move forward in a purposeful way.
a forward movement.

Now we are moving backward or forward :-)

> >>> It might not be required to
> >>> implement the optimization described above. (it's easier to simply read
> >>> another PTE, check if it maps the same large folio, and to batch from there)
>
> Yes agreed.
>
> >>>
> >>
> >> It appears that your proposal suggests potential reusability as follows: if we
> >> have a large folio containing 16 PTEs, you might consider reusing only 4 by
> >> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
> >> if my understanding is wrong.
> >>
> >> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
> >> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
> >> in nr_pages, thus enabling complete reuse of the whole folio.
> >
> > Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
> > there might be easier and cleaner.
> >
>

2024-05-07 08:45:09

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Tue, May 7, 2024 at 8:24 PM David Hildenbrand <[email protected]> wrote:
>
> On 07.05.24 00:58, Barry Song wrote:
> > On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 06.05.24 14:58, Barry Song wrote:
> >>> On Tue, May 7, 2024 at 12:38 AM Barry Song <[email protected]> wrote:
> >>>>
> >>>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 04.05.24 01:23, Barry Song wrote:
> >>>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>>>> From: Chuanhua Han <[email protected]>
> >>>>>>>>
> >>>>>>>> When a large folio is found in the swapcache, the current implementation
> >>>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>>>>>>> page faults. This patch opts to map the entire large folio at once to
> >>>>>>>> minimize page faults. Additionally, redundant checks and early exits
> >>>>>>>> for ARM64 MTE restoring are removed.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Chuanhua Han <[email protected]>
> >>>>>>>> Co-developed-by: Barry Song <[email protected]>
> >>>>>>>> Signed-off-by: Barry Song <[email protected]>
> >>>>>>>
> >>>>>>> With the suggested changes below:
> >>>>>>>
> >>>>>>> Reviewed-by: Ryan Roberts <[email protected]>
> >>>>>>>
> >>>>>>>> ---
> >>>>>>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>>>>>> 1 file changed, 48 insertions(+), 12 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
> >>>>>>>> --- a/mm/memory.c
> >>>>>>>> +++ b/mm/memory.c
> >>>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>> pte_t pte;
> >>>>>>>> vm_fault_t ret = 0;
> >>>>>>>> void *shadow = NULL;
> >>>>>>>> + int nr_pages = 1;
> >>>>>>>> + unsigned long page_idx = 0;
> >>>>>>>> + unsigned long address = vmf->address;
> >>>>>>>> + pte_t *ptep;
> >>>>>>>
> >>>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
> >>>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> >>>>>>> any logic between here and there makes an adjustment) and its clear that they
> >>>>>>> are only to be used after that block (the compiler will warn if using an
> >>>>>>> uninitialized value).
> >>>>>>
> >>>>>> right. I agree this will make the code more readable.
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> if (!pte_unmap_same(vmf))
> >>>>>>>> goto out;
> >>>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>> goto out_nomap;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> + ptep = vmf->pte;
> >>>>>>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>>>>>>> + int nr = folio_nr_pages(folio);
> >>>>>>>> + unsigned long idx = folio_page_idx(folio, page);
> >>>>>>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>>>>>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>>>>>>> + pte_t *folio_ptep;
> >>>>>>>> + pte_t folio_pte;
> >>>>>>>> +
> >>>>>>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>>>>>>> + goto check_folio;
> >>>>>>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>>>>>>> + goto check_folio;
> >>>>>>>> +
> >>>>>>>> + folio_ptep = vmf->pte - idx;
> >>>>>>>> + folio_pte = ptep_get(folio_ptep);
> >>>>>>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>>>>>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>>>>>>> + goto check_folio;
> >>>>>>>> +
> >>>>>>>> + page_idx = idx;
> >>>>>>>> + address = folio_start;
> >>>>>>>> + ptep = folio_ptep;
> >>>>>>>> + nr_pages = nr;
> >>>>>>>> + entry = folio->swap;
> >>>>>>>> + page = &folio->page;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> +check_folio:
> >>>>>>>
> >>>>>>> Is this still the correct label name, given the checks are now above the new
> >>>>>>> block? Perhaps "one_page" or something like that?
> >>>>>>
> >>>>>> not quite sure about this, as the code after one_page can be multiple_pages.
> >>>>>> On the other hand, it seems we are really checking folio after "check_folio"
> >>>>>> :-)
> >>>>>>
> >>>>>>
> >>>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> >>>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >>>>>>
> >>>>>> /*
> >>>>>> * Check under PT lock (to protect against concurrent fork() sharing
> >>>>>> * the swap entry concurrently) for certainly exclusive pages.
> >>>>>> */
> >>>>>> if (!folio_test_ksm(folio)) {
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> +
> >>>>>>>> /*
> >>>>>>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>>>>>> * must never point at an anonymous page in the swapcache that is
> >>>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>> * We're already holding a reference on the page but haven't mapped it
> >>>>>>>> * yet.
> >>>>>>>> */
> >>>>>>>> - swap_free_nr(entry, 1);
> >>>>>>>> + swap_free_nr(entry, nr_pages);
> >>>>>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>>>>>> folio_free_swap(folio);
> >>>>>>>>
> >>>>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>>>>>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>>>>>>> + folio_ref_add(folio, nr_pages - 1);
> >>>>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>>>>>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>>>>>> pte = mk_pte(page, vma->vm_page_prot);
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>> * exclusivity.
> >>>>>>>> */
> >>>>>>>> if (!folio_test_ksm(folio) &&
> >>>>>>>> - (exclusive || folio_ref_count(folio) == 1)) {
> >>>>>>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>>>>>>> + folio_nr_pages(folio) == nr_pages))) {
> >>>>>>>
> >>>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >>>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
> >>>>>>> robustness sake. Just checking my understanding.
> >>>>>>
> >>>>>> This is the code showing we are reusing/(mkwrite) a folio either
> >>>>>> 1. we meet a small folio and we are the only one hitting the small folio
> >>>>>> 2. we meet a large folio and we are the only one hitting the large folio
> >>>>>>
> >>>>>> any corner cases besides the above two seems difficult. for example,
> >>>>>>
> >>>>>> while we hit a large folio in swapcache but if we can't entirely map it
> >>>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> >>>>>> == nr_pages == 1
> >>>>>
> >>>>> No, there would be other references from the swapcache and
> >>>>> folio_ref_count(folio) > 1. See my other reply.
> >>>>
> >>>> right. can be clearer by:
> >>>
> >>> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> >>> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?
> >>
> >> I don't think that we will "need" it.
> >>
> >>>
> >>> one case is that we have a large folio with 16 PTEs, and we unmap
> >>> 15 swap PTE entries, thus we have only one swap entry left. Then
> >>> we hit the large folio in swapcache. but we have only one PTE thus we will
> >>> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
> >>> large folio for one PTE. with it, do_wp_page() will migrate the large
> >>> folio to a small one?
> >>
> >> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
> >>
> >> Note that this is the same as if we had pte_swp_exclusive() set and
> >> would have run into "exclusive=true" here.
> >>
> >> If we'd want a similar "optimization" as we have in
> >> wp_can_reuse_anon_folio(), you'd want something like
> >>
> >> exclusive || (folio_ref_count(folio) == 1 &&
> >> (!folio_test_large(folio) || nr_pages > 1)
> >
> > I feel like
> >
> > A : !folio_test_large(folio) || nr_pages > 1
> >
> > equals
> >
> > B: folio_nr_pages(folio) == nr_pages
> >
> > if folio is small, folio_test_large(folio) is false, both A and B will be true;
> > if folio is large, and we map the whole large folio, A will be true
> > because of nr_pages > 1;
> > B is also true;
> > if folio is large, and we map single one PTE, A will be false;
> > B is also false, because nr_pages == 1 but folio_nr_pages(folio) > 1;
> >
> > right?
>
> Let's assume a single subpage of a large folio is no longer mapped.
> Then, we'd have:
>
> nr_pages == folio_nr_pages(folio) - 1.
>
> You could simply map+reuse most of the folio without COWing.

Yes. This is good, but the PTE which is no longer mapped could be any
one within the nr_pages PTEs, so it could be quite tricky for
set_ptes().

>
> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
> just easy to detect that the folio is exclusive (folio_ref_count(folio)
> == 1 before mapping anything).
>
> If you really want to mimic what do_wp_page() currently does, you should
> have:
>
> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))

I actually dislike how do_wp_page() handles the reuse of a large folio
which is entirely mapped. For example, A forks B, B exits, and we then
write to A's large folio: we get nr_pages CoW faults into small folios.
Ideally, we could reuse the whole large folio for writing.

>
> Personally, I think we should keep it simple here and use:
>
> exclusive || folio_ref_count(folio) == 1

I feel this is still better than
exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
as we reuse the whole large folio; the do_wp_page() behaviour
doesn't have this.

>
> IMHO, that's as clear as it gets.

I agree this is clear. But I wonder if there is room to optimize.

Using your example,
"Let's assume a single subpage of a large folio is no longer mapped."

Take a large folio with 16 PTEs where we have unmapped one of them,
so 15 swap entries are left.

With "exclusive || folio_ref_count(folio) == 1" only, the first PTE
that faults will reuse the whole large folio; the remaining 14 faults
will each allocate a small folio (the swapcache has been dropped by
then), so we use 16 + 14 = 30 pages of memory.

With either
A: !folio_test_large(folio) || nr_pages > 1
or
B: folio_nr_pages(folio) == nr_pages

we consume 15 pages.
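
To put the three checks discussed in this sub-thread side by side (just a
sketch, using the names from the do_swap_page() hunk above):

	/* (a) this patch (v3) */
	exclusive || (folio_ref_count(folio) == nr_pages &&
		      folio_nr_pages(folio) == nr_pages)

	/* (b) strictly mimic do_wp_page() */
	exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))

	/* (c) keep it simple, as suggested above */
	exclusive || folio_ref_count(folio) == 1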

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

2024-05-07 08:59:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

>> Let's assume a single subpage of a large folio is no longer mapped.
>> Then, we'd have:
>>
>> nr_pages == folio_nr_pages(folio) - 1.
>>
>> You could simply map+reuse most of the folio without COWing.
>
> yes. This is good but the pte which is no longer mapped could be
> anyone within the nr_pages PTEs. so it could be quite tricky for
> set_ptes.

The swap batching logic should take care of that, otherwise it would be
buggy.

>
>>
>> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
>> just easy to detect that the folio is exclusive (folio_ref_count(folio)
>> == 1 before mapping anything).
>>
>> If you really want to mimic what do_wp_page() currently does, you should
>> have:
>>
>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
>
> I actually dislike the part that do_wp_page() handles the reuse of a large
> folio which is entirely mapped. For example, A forks B, B exit, we write
> A's large folio, we get nr_pages CoW of small folios. Ideally, we can
> reuse the whole folios for writing.

Yes, see the link I shared to what I am working on. There isn't really a
question that what we do right now needs to be improved, and all these
scenarios are pretty clear.

>
>>
>> Personally, I think we should keep it simple here and use:
>>
>> exclusive || folio_ref_count(folio) == 1
>
> I feel this is still better than
> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> as we reuse the whole large folio. the do_wp_page() behaviour
> doesn't have this.

Yes, but there is the comment "Same logic as in do_wp_page();". We
already ran into issues having different COW reuse logic all over the
place. For this case here, I don't care if we leave it as

"exclusive || folio_ref_count(folio) == 1"

But let's not try inventing new stuff here.

--
Cheers,

David / dhildenb


2024-05-07 09:25:03

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Tue, May 7, 2024 at 8:59 PM David Hildenbrand <[email protected]> wrote:
>
> >> Let's assume a single subpage of a large folio is no longer mapped.
> >> Then, we'd have:
> >>
> >> nr_pages == folio_nr_pages(folio) - 1.
> >>
> >> You could simply map+reuse most of the folio without COWing.
> >
> > yes. This is good but the pte which is no longer mapped could be
> > anyone within the nr_pages PTEs. so it could be quite tricky for
> > set_ptes.
>
> The swap batching logic should take care of that, otherwise it would be
> buggy.

When you mention "it would be buggy," are you also referring to the current
fallback approach, or only to the future patch which might be able to
map/reuse "nr_pages - 1" pages?

The current patch falls back to setting nr_pages = 1 without mapping or
reusing the other nr_pages - 1. I assume your concern doesn't refer to
this fallback?

>
> >
> >>
> >> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
> >> just easy to detect that the folio is exclusive (folio_ref_count(folio)
> >> == 1 before mapping anything).
> >>
> >> If you really want to mimic what do_wp_page() currently does, you should
> >> have:
> >>
> >> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> >
> > I actually dislike the part that do_wp_page() handles the reuse of a large
> > folio which is entirely mapped. For example, A forks B, B exit, we write
> > A's large folio, we get nr_pages CoW of small folios. Ideally, we can
> > reuse the whole folios for writing.
>
> Yes, see the link I shared to what I am working on. There isn't really a
> question if what we do right now needs to be improved and all these
> scenarios are pretty obvious clear.

Great! I plan to dedicate more time to reviewing your work.

>
> >
> >>
> >> Personally, I think we should keep it simple here and use:
> >>
> >> exclusive || folio_ref_count(folio) == 1
> >
> > I feel this is still better than
> > exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> > as we reuse the whole large folio. the do_wp_page() behaviour
> > doesn't have this.
>
> Yes, but there is the comment "Same logic as in do_wp_page();". We
> already ran into issues having different COW reuse logic all over the
> place. For this case here, I don't care if we leave it as
>
> "exclusive || folio_ref_count(folio) == 1"

I'm perfectly fine with using the code for this patchset and maybe looking
for other opportunities for potential optimization as an incremental
patchset, for example, reusing the remaining PTEs as suggested by you -
"simply map+reuse most of the folio without COWing."

>
> But let's not try inventing new stuff here.

It seems you ignored and snipped my "16 + 14" pages vs. "15" pages
example though. But once we support "simply map+reuse most of the
folio without COWing", the "16 + 14" problem is resolved and we
consume 16 pages instead.

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

2024-05-07 09:47:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

On 07/05/2024 09:24, Barry Song wrote:
> On Tue, May 7, 2024 at 8:14 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 06/05/2024 09:31, David Hildenbrand wrote:
>>> On 06.05.24 10:20, Barry Song wrote:
>>>> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 04.05.24 01:40, Barry Song wrote:
>>>>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>
>>>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>>>> From: Barry Song <[email protected]>
>>>>>>>>
>>>>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>>>>
>>>>>>> You mean pte_next_swp_offset()?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>
>>>>>>>> will directly invoke it with delta = 1.
>>>>>>>>
>>>>>>>> Suggested-by: "Huang, Ying" <[email protected]>
>>>>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>>>>> ---
>>>>>>>> mm/internal.h | 25 +++++++++++++++++++++----
>>>>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>>>>> --- a/mm/internal.h
>>>>>>>> +++ b/mm/internal.h
>>>>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
>>>>>>>> *folio, unsigned long addr,
>>>>>>>> }
>>>>>>>>
>>>>>>>> /**
>>>>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
>>>>>>>> pte.
>>>>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>>>>> + * forward or backward by delta
>>>>>>>> * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>>>> * non_swap_entry() must be false.
>>>>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>>>>> + * is positive; backward if delta is negative
>>>>>>>> *
>>>>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>>>> * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>>>> */
>>>>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>>>>
>>>>>>> We have equivalent functions for pfn:
>>>>>>>
>>>>>>> pte_next_pfn()
>>>>>>> pte_advance_pfn()
>>>>>>>
>>>>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>>>>
>>>>>>> I guess we don't have a need for that and it adds more churn.
>>>>>>
>>>>>> we might have a need in the below case.
>>>>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>>>>> folios of process
>>>>>> A become single-mapped.
>>>>>> Right now, while writing A's folios, we are CoWing A's large folios
>>>>>> into many small
>>>>>> folios. I believe we can reuse the entire large folios instead of doing
>>>>>> nr_pages
>>>>>> CoW and page faults.
>>>>>> In this case, we might want to get the first PTE from vmf->pte.
>>>>>
>>>>> Once we have COW reuse for large folios in place (I think you know that
>>>>> I am working on that), it might make sense to "COW-reuse around",
>>>>
>>>> TBH, I don't know if you are working on that. please Cc me next time :-)
>>>
>>> I could have sworn I mentioned it to you already :)
>>>
>>> See
>>>
>>> https://lore.kernel.org/linux-mm/[email protected]/T/
>>>
>>> I'll follow-up on that soonish (now that batching is upstream and the large
>>> mapcount is on its way upstream).
>>>
>>>>
>>>>> meaning we look if some neighboring PTEs map the same large folio and
>>>>> map them writable as well. But if it's really worth it, increasing page
>>>>> fault latency, is to be decided separately.
>>>>
>>>> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
>>>> Perhaps we can discover a more cost-effective method to signify that a large
>>>> folio is probably singly mapped?
>>>
>>> Yes, precisely what I am up to!
>>>
>>>> and only call "multi-PTEs" reuse while that
>>>> condition is true in PF and avoid increasing latency always?
>>>
>>> I'm thinking along those lines:
>>>
>>> If we detect that it's exclusive, we can certainly mapped the current PTE
>>> writable. Then, we can decide how much (and if) we want to fault-around writable
>>> as an optimization.
>>>
>>> For smallish large folios, it might make sense to try faulting around most of
>>> the folio.
>>>
>>> For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
>>> to fault around the whole thing -- especially if there is little benefit to be
>>> had from contig-pte bits.
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Another case, might be
>>>>>> A forks B, and we write either A or B, we might CoW an entire large
>>>>>> folios instead
>>>>>> CoWing nr_pages small folios.
>>>>>>
>>>>>> case 1 seems more useful, I might have a go after some days. then we might
>>>>>> see pte_move_pfn().
>>>>> pte_move_pfn() does sound odd to me.
>>
>> Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
>> Perhaps just pte_advance_swp_offset() with a negative value is clearer about
>> what its doing?
>>
>
> I am not a native speaker. but dictionary says
>
> advance:
> move forward in a purposeful way.
> a forward movement.
>
> Now we are moving backward or forward :-)

Sure, but if you pass a negative value then you are moving forwards by a
negative amount ;-)

Anyway, forget I said anything - it's not important.

>
>>>>> It might not be required to
>>>>> implement the optimization described above. (it's easier to simply read
>>>>> another PTE, check if it maps the same large folio, and to batch from there)
>>
>> Yes agreed.
>>
>>>>>
>>>>
>>>> It appears that your proposal suggests potential reusability as follows: if we
>>>> have a large folio containing 16 PTEs, you might consider reusing only 4 by
>>>> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
>>>> if my understanding is wrong.
>>>>
>>>> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
>>>> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
>>>> in nr_pages, thus enabling complete reuse of the whole folio.
>>>
>>> Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
>>> there might be easier and cleaner.
>>>
>>


2024-05-07 10:50:05

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On Tue, May 7, 2024 at 6:39 PM David Hildenbrand <[email protected]> wrote:
>
> On 07.05.24 11:24, Barry Song wrote:
> > On Tue, May 7, 2024 at 8:59 PM David Hildenbrand <[email protected]> wrote:
> >>
> >>>> Let's assume a single subpage of a large folio is no longer mapped.
> >>>> Then, we'd have:
> >>>>
> >>>> nr_pages == folio_nr_pages(folio) - 1.
> >>>>
> >>>> You could simply map+reuse most of the folio without COWing.
> >>>
> >>> yes. This is good but the pte which is no longer mapped could be
> >>> anyone within the nr_pages PTEs. so it could be quite tricky for
> >>> set_ptes.
> >>
> >> The swap batching logic should take care of that, otherwise it would be
> >> buggy.
> >
> > When you mention "it would be buggy," are you also referring to the current
> > fallback approach? or only refer to the future patch which might be able
> > to map/reuse "nr_pages - 1" pages?
>
> swap_pte_batch() should not skip any holes. So consequently, set_ptes()
> should do the right thing. (regarding your comment "could be quite tricky
> for set_ptes")
>
> So I think that should be working as expected.

Maybe not. Take a look at my current code: I goto check_folio with
nr_pages = 1 if swap_pte_batch(folio_ptep, nr, folio_pte) !=
folio_nr_pages(folio).

+	nr_pages = 1;
+	...
+	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+		int nr = folio_nr_pages(folio);
+		...
+		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
+		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+			goto check_folio; /* read here: I am falling back to nr_pages = 1 */
+
+		...
+		nr_pages = nr;

The fallback (=1) works. But it seems you are proposing to set
nr_pages = swap_pte_batch(folio_ptep, nr, folio_pte)
when swap_pte_batch(folio_ptep, nr, folio_pte) > 1 &&
swap_pte_batch(folio_ptep, nr, folio_pte) < nr_pages?

>
> >
> > The current patch falls back to setting nr_pages = 1 without mapping or
> > reusing nr_pages - 1. I feel your concern doesn't refer to this fallback?
> >
> >>
> >>>
> >>>>
> >>>> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
> >>>> just easy to detect that the folio is exclusive (folio_ref_count(folio)
> >>>> == 1 before mapping anything).
> >>>>
> >>>> If you really want to mimic what do_wp_page() currently does, you should
> >>>> have:
> >>>>
> >>>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> >>>
> >>> I actually dislike the part that do_wp_page() handles the reuse of a large
> >>> folio which is entirely mapped. For example, A forks B, B exit, we write
> >>> A's large folio, we get nr_pages CoW of small folios. Ideally, we can
> >>> reuse the whole folios for writing.
> >>
> >> Yes, see the link I shared to what I am working on. There isn't really a
> >> question if what we do right now needs to be improved and all these
> >> scenarios are pretty obvious clear.
> >
> > Great! I plan to dedicate more time to reviewing your work.
>
> Nice! And there will be a lot of follow-up optimization work I won't
> tackle immediately regarding COW (COW-reuse around, maybe sometimes we
> want to COW bigger chunks).
>
> I still have making PageAnonExclusive a per-folio flag on my TODO list,
> that will help the COW-reuse around case a lot.
>
> >
> >>
> >>>
> >>>>
> >>>> Personally, I think we should keep it simple here and use:
> >>>>
> >>>> exclusive || folio_ref_count(folio) == 1
> >>>
> >>> I feel this is still better than
> >>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> >>> as we reuse the whole large folio. the do_wp_page() behaviour
> >>> doesn't have this.
> >>
> >> Yes, but there is the comment "Same logic as in do_wp_page();". We
> >> already ran into issues having different COW reuse logic all over the
> >> place. For this case here, I don't care if we leave it as
> >>
> >> "exclusive || folio_ref_count(folio) == 1"
> >
> > I'm perfectly fine with using the code for this patchset and maybe
> > looking for other
> > opportunities for potential optimization as an incremental patchset,
> > for example,
> > reusing the remaining PTEs as suggested by you - "simply map+reuse most of
> > the folio without COWing."
> >
> >>
> >> But let's not try inventing new stuff here.
> >
> > It seems you ignored and snipped my "16 + 14" pages and "15" pages
> > example though. but once we support "simply map+reuse most of the
> > folio without COWing", the "16+14" problem can be resolved, instead,
> > we consume 16 pages.
>
>
> Oh, sorry for skipping that, for me it was rather clear: the partially
> mapped folios will be on the deferred split list and the excess memory
> can (and will be) reclaimed when there is need. So this temporary memory
> consumption is usually not a problem in practice. But yes, something to
> optimize (just like COW reuse in general).
>
> --
> Cheers,
>
> David / dhildenb
>

2024-05-07 11:04:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

On 07.05.24 11:24, Barry Song wrote:
> On Tue, May 7, 2024 at 8:59 PM David Hildenbrand <[email protected]> wrote:
>>
>>>> Let's assume a single subpage of a large folio is no longer mapped.
>>>> Then, we'd have:
>>>>
>>>> nr_pages == folio_nr_pages(folio) - 1.
>>>>
>>>> You could simply map+reuse most of the folio without COWing.
>>>
>>> yes. This is good but the pte which is no longer mapped could be
>>> anyone within the nr_pages PTEs. so it could be quite tricky for
>>> set_ptes.
>>
>> The swap batching logic should take care of that, otherwise it would be
>> buggy.
>
> When you mention "it would be buggy," are you also referring to the current
> fallback approach? or only refer to the future patch which might be able
> to map/reuse "nr_pages - 1" pages?

swap_pte_batch() should not skip any holes. So consequently, set_ptes()
should do the right thing. (regarding your comment "could be quite tricky
for set_ptes")

So I think that should be working as expected.

>
> The current patch falls back to setting nr_pages = 1 without mapping or
> reusing nr_pages - 1. I feel your concern doesn't refer to this fallback?
>
>>
>>>
>>>>
>>>> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
>>>> just easy to detect that the folio is exclusive (folio_ref_count(folio)
>>>> == 1 before mapping anything).
>>>>
>>>> If you really want to mimic what do_wp_page() currently does, you should
>>>> have:
>>>>
>>>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
>>>
>>> I actually dislike the part that do_wp_page() handles the reuse of a large
>>> folio which is entirely mapped. For example, A forks B, B exit, we write
>>> A's large folio, we get nr_pages CoW of small folios. Ideally, we can
>>> reuse the whole folios for writing.
>>
>> Yes, see the link I shared to what I am working on. There isn't really a
>> question if what we do right now needs to be improved and all these
>> scenarios are pretty obvious clear.
>
> Great! I plan to dedicate more time to reviewing your work.

Nice! And there will be a lot of follow-up optimization work I won't
tackle immediately regarding COW (COW-reuse around, maybe sometimes we
want to COW bigger chunks).

I still have making PageAnonExclusive a per-folio flag on my TODO list,
that will help the COW-reuse around case a lot.

>
>>
>>>
>>>>
>>>> Personally, I think we should keep it simple here and use:
>>>>
>>>> exclusive || folio_ref_count(folio) == 1
>>>
>>> I feel this is still better than
>>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
>>> as we reuse the whole large folio. the do_wp_page() behaviour
>>> doesn't have this.
>>
>> Yes, but there is the comment "Same logic as in do_wp_page();". We
>> already ran into issues having different COW reuse logic all over the
>> place. For this case here, I don't care if we leave it as
>>
>> "exclusive || folio_ref_count(folio) == 1"
>
> I'm perfectly fine with using the code for this patchset and maybe
> looking for other
> opportunities for potential optimization as an incremental patchset,
> for example,
> reusing the remaining PTEs as suggested by you - "simply map+reuse most of
> the folio without COWing."
>
>>
>> But let's not try inventing new stuff here.
>
> It seems you ignored and snipped my "16 + 14" pages and "15" pages
> example though. but once we support "simply map+reuse most of the
> folio without COWing", the "16+14" problem can be resolved, instead,
> we consume 16 pages.


Oh, sorry for skipping that, for me it was rather clear: the partially
mapped folios will be on the deferred split list and the excess memory
can (and will be) reclaimed when there is need. So this temporary memory
consumption is usually not a problem in practice. But yes, something to
optimize (just like COW reuse in general).

--
Cheers,

David / dhildenb


2024-05-08 07:37:56

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()

Barry Song <[email protected]> writes:

> From: Chuanhua Han <[email protected]>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> Furthermore, this new function, swap_free_nr(), is designed to efficiently
> handle various scenarios for releasing a specified number, nr, of swap
> entries.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <[email protected]>

> ---
> include/linux/swap.h | 5 +++++
> mm/swapfile.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..d1d35e92d7e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> {
> }
>
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> {
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6ca215fb92f..ec12f2b9d229 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
> __swap_entry_free(p, entry);
> }
>
> +static void cluster_swap_free_nr(struct swap_info_struct *sis,
> + unsigned long offset, int nr_pages)
> +{
> + struct swap_cluster_info *ci;
> + DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
> + int i, nr;
> +
> + ci = lock_cluster_or_swap_info(sis, offset);
> + while (nr_pages) {
> + nr = min(BITS_PER_LONG, nr_pages);
> + for (i = 0; i < nr; i++) {
> + if (!__swap_entry_free_locked(sis, offset + i, 1))
> + bitmap_set(to_free, i, 1);
> + }
> + if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> + unlock_cluster_or_swap_info(sis, ci);
> + for_each_set_bit(i, to_free, BITS_PER_LONG)
> + free_swap_slot(swp_entry(sis->type, offset + i));
> + if (nr == nr_pages)
> + return;
> + bitmap_clear(to_free, 0, BITS_PER_LONG);
> + ci = lock_cluster_or_swap_info(sis, offset);
> + }
> + offset += nr;
> + nr_pages -= nr;
> + }
> + unlock_cluster_or_swap_info(sis, ci);
> +}
> +
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> + int nr;
> + struct swap_info_struct *sis;
> + unsigned long offset = swp_offset(entry);
> +
> + sis = _swap_info_get(entry);
> + if (!sis)
> + return;
> +
> + while (nr_pages) {
> + nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> + cluster_swap_free_nr(sis, offset, nr);
> + offset += nr;
> + nr_pages -= nr;
> + }
> +}
> +
> /*
> * Called after dropping swapcache to decrease refcnt to swap entries.
> */
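
As an illustration of the splitting above, here is a minimal user-space
sketch (not kernel code; SWAPFILE_CLUSTER and BITS_PER_LONG are stand-in
values) showing how a range of entries is first cut at cluster boundaries
and then handled in bitmap-sized chunks:

#include <stdio.h>

#define SWAPFILE_CLUSTER 512	/* stand-in cluster size */
#define BITS_PER_LONG    64	/* stand-in bitmap width */

static void show_batches(unsigned long offset, int nr_pages)
{
	while (nr_pages) {
		/* largest piece that stays inside the current cluster,
		 * mirroring the min_t() in swap_free_nr() above */
		int nr = SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER;

		if (nr > nr_pages)
			nr = nr_pages;
		printf("cluster piece: offset %lu, %d entries (freed in chunks of up to %d)\n",
		       offset, nr, BITS_PER_LONG);
		offset += nr;
		nr_pages -= nr;
	}
}

int main(void)
{
	/* a 16-entry mTHP whose swap entries straddle a cluster boundary */
	show_batches(SWAPFILE_CLUSTER - 4, 16);
	return 0;
}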

2024-05-08 08:05:40

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

Ryan Roberts <[email protected]> writes:

> On 03/05/2024 01:50, Barry Song wrote:
>> From: Barry Song <[email protected]>
>>
>> To streamline maintenance efforts, we propose discontinuing the use of
>> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
>> to 1. This adjustment offers the advantage of enabling batch processing
>> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
>> a bitmap consisting of only one long, resulting in overhead that can be
>> ignored for cases where nr equals 1.
>>
>> Suggested-by: "Huang, Ying" <[email protected]>
>> Signed-off-by: Barry Song <[email protected]>
>> Cc: "Rafael J. Wysocki" <[email protected]>
>> Cc: Pavel Machek <[email protected]>
>> Cc: Len Brown <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>> ---
>> include/linux/swap.h | 5 -----
>> kernel/power/swap.c | 7 +++----
>> mm/memory.c | 2 +-
>> mm/rmap.c | 4 ++--
>> mm/shmem.c | 4 ++--
>> mm/swapfile.c | 19 +++++--------------
>> 6 files changed, 13 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index d1d35e92d7e9..f03cb446124e 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>> extern void swap_shmem_alloc(swp_entry_t);
>> extern int swap_duplicate(swp_entry_t);
>> extern int swapcache_prepare(swp_entry_t);
>> -extern void swap_free(swp_entry_t);
>
> I wonder if it would be cleaner to:
>
> #define swap_free(entry) swap_free_nr((entry), 1)
>
> To save all the churn for the callsites that just want to pass a single entry?

I prefer this way. Although I prefer inline functions.
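
i.e. something along the lines of this sketch:

static inline void swap_free(swp_entry_t entry)
{
	swap_free_nr(entry, 1);
}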

Otherwise, LGTM. Feel free to add

Reviewed-by: "Huang, Ying" <[email protected]>

in the future version.

>> extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
>> return 0;
>> }
>>
>> -static inline void swap_free(swp_entry_t swp)
>> -{
>> -}
>> -
>> static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
>> {
>> }
>> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
>> index 5bc04bfe2db1..6befaa88a342 100644
>> --- a/kernel/power/swap.c
>> +++ b/kernel/power/swap.c
>> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
>> offset = swp_offset(get_swap_page_of_type(swap));
>> if (offset) {
>> if (swsusp_extents_insert(offset))
>> - swap_free(swp_entry(swap, offset));
>> + swap_free_nr(swp_entry(swap, offset), 1);
>> else
>> return swapdev_block(swap, offset);
>> }
>> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
>>
>> while ((node = swsusp_extents.rb_node)) {
>> struct swsusp_extent *ext;
>> - unsigned long offset;
>>
>> ext = rb_entry(node, struct swsusp_extent, node);
>> rb_erase(node, &swsusp_extents);
>> - for (offset = ext->start; offset <= ext->end; offset++)
>> - swap_free(swp_entry(swap, offset));
>> + swap_free_nr(swp_entry(swap, ext->start),
>> + ext->end - ext->start + 1);
>>
>> kfree(ext);
>> }
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eea6e4984eae..f033eb3528ba 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> * We're already holding a reference on the page but haven't mapped it
>> * yet.
>> */
>> - swap_free(entry);
>> + swap_free_nr(entry, 1);
>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>> folio_free_swap(folio);
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 087a79f1f611..39ec7742acec 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> goto walk_done_err;
>> }
>> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>> - swap_free(entry);
>> + swap_free_nr(entry, 1);
>> set_pte_at(mm, address, pvmw.pte, pteval);
>> goto walk_done_err;
>> }
>> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> /* See folio_try_share_anon_rmap(): clear PTE first. */
>> if (anon_exclusive &&
>> folio_try_share_anon_rmap_pte(folio, subpage)) {
>> - swap_free(entry);
>> + swap_free_nr(entry, 1);
>> set_pte_at(mm, address, pvmw.pte, pteval);
>> goto walk_done_err;
>> }
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index fa2a0ed97507..bfc8a2beb24f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>> * in shmem_evict_inode().
>> */
>> shmem_recalc_inode(inode, -1, -1);
>> - swap_free(swap);
>> + swap_free_nr(swap, 1);
>> }
>>
>> /*
>> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>
>> delete_from_swap_cache(folio);
>> folio_mark_dirty(folio);
>> - swap_free(swap);
>> + swap_free_nr(swap, 1);
>> put_swap_device(si);
>>
>> *foliop = folio;
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index ec12f2b9d229..ddcd0f24b9a1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>> swap_range_free(p, offset, 1);
>> }
>>
>> -/*
>> - * Caller has made sure that the swap device corresponding to entry
>> - * is still around or has not been recycled.
>> - */
>> -void swap_free(swp_entry_t entry)
>> -{
>> - struct swap_info_struct *p;
>> -
>> - p = _swap_info_get(entry);
>> - if (p)
>> - __swap_entry_free(p, entry);
>> -}
>> -
>> static void cluster_swap_free_nr(struct swap_info_struct *sis,
>> unsigned long offset, int nr_pages)
>> {
>> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
>> unlock_cluster_or_swap_info(sis, ci);
>> }
>>
>> +/*
>> + * Caller has made sure that the swap device corresponding to entry
>> + * is still around or has not been recycled.
>> + */
>> void swap_free_nr(swp_entry_t entry, int nr_pages)
>> {
>> int nr;
>> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>> new_pte = pte_mkuffd_wp(new_pte);
>> setpte:
>> set_pte_at(vma->vm_mm, addr, pte, new_pte);
>> - swap_free(entry);
>> + swap_free_nr(entry, 1);
>> out:
>> if (pte)
>> pte_unmap_unlock(pte, ptl);

--
Best Regards,
Huang, Ying

2024-05-08 08:11:06

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

Barry Song <[email protected]> writes:

> From: Barry Song <[email protected]>
>
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()
> will directly invoke it with delta = 1.
>
> Suggested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Barry Song <[email protected]>

LGTM, Thanks! Feel free to add

Reviewed-by: "Huang, Ying" <[email protected]>

in the future version.

> ---
> mm/internal.h | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> }
>
> /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + * forward or backward by delta
> * @pte: The initial pte state; is_swap_pte(pte) must be true and
> * non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + * is positive; backward if delta is negative
> *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
> * swap type, and any swp pte bits. The resulting pte is returned.
> */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> {
> swp_entry_t entry = pte_to_swp_entry(pte);
> pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> - (swp_offset(entry) + 1)));
> + (swp_offset(entry) + delta)));
>
> if (pte_swp_soft_dirty(pte))
> new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> return new;
> }
>
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + * non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> + return pte_move_swp_offset(pte, 1);
> +}
> +
> /**
> * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> * @start_ptep: Page table pointer for the first entry.
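
For context, patch 6/6 in this series uses the new helper roughly like
this to recover the folio's first swap pte when the fault hits the middle
of a large folio (a sketch using the names from that hunk):

	folio_ptep = vmf->pte - idx;
	folio_pte = ptep_get(folio_ptep);
	if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
	    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
		goto check_folio;	/* fall back to mapping a single page */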

--
Best Regards,
Huang, Ying

2024-05-08 08:30:38

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On Wed, May 8, 2024 at 7:58 PM Huang, Ying <[email protected]> wrote:
>
> Ryan Roberts <[email protected]> writes:
>
> > On 03/05/2024 01:50, Barry Song wrote:
> >> From: Barry Song <[email protected]>
> >>
> >> To streamline maintenance efforts, we propose discontinuing the use of
> >> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
> >> to 1. This adjustment offers the advantage of enabling batch processing
> >> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
> >> a bitmap consisting of only one long, resulting in overhead that can be
> >> ignored for cases where nr equals 1.
> >>
> >> Suggested-by: "Huang, Ying" <[email protected]>
> >> Signed-off-by: Barry Song <[email protected]>
> >> Cc: "Rafael J. Wysocki" <[email protected]>
> >> Cc: Pavel Machek <[email protected]>
> >> Cc: Len Brown <[email protected]>
> >> Cc: Hugh Dickins <[email protected]>
> >> ---
> >> include/linux/swap.h | 5 -----
> >> kernel/power/swap.c | 7 +++----
> >> mm/memory.c | 2 +-
> >> mm/rmap.c | 4 ++--
> >> mm/shmem.c | 4 ++--
> >> mm/swapfile.c | 19 +++++--------------
> >> 6 files changed, 13 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> index d1d35e92d7e9..f03cb446124e 100644
> >> --- a/include/linux/swap.h
> >> +++ b/include/linux/swap.h
> >> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> >> extern void swap_shmem_alloc(swp_entry_t);
> >> extern int swap_duplicate(swp_entry_t);
> >> extern int swapcache_prepare(swp_entry_t);
> >> -extern void swap_free(swp_entry_t);
> >
> > I wonder if it would be cleaner to:
> >
> > #define swap_free(entry) swap_free_nr((entry), 1)
> >
> > To save all the churn for the callsites that just want to pass a single entry?
>
> I prefer this way. Although I prefer inline functions.

Yes, using static inline is preferable. I've recently submitted
checkpatch/codestyle patches for this, which can be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=39c58d5ed036
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=8379bf0b0e1f5

Using static inline aligns with the established rule.

>
> Otherwise, LGTM. Feel free to add
>
> Reviewed-by: "Huang, Ying" <[email protected]>

Thanks!

>
> in the future version.

I believe Christoph's vote leans towards simply removing swap_free_nr()
and renaming it to swap_free(), adding the new parameter as follows:

void swap_free(swp_entry_t entry, int nr);

Now I see Ryan and you prefer

static inline void swap_free(swp_entry_t entry)
{
	swap_free_nr(entry, 1);
}

Chris slightly favors discouraging the use of swap_free() without the
new parameter. Removing swap_free() can address this concern.

It seems that maintaining swap_free() and having it call swap_free_nr() with
a default value of 1 received the most support.

To align with free_swap_and_cache() and free_swap_and_cache_nr(),
I'll proceed with the "static inline" approach in the new version. Please
voice any objections you may have, Christoph, Chris.

>
> >> extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
> >> return 0;
> >> }
> >>
> >> -static inline void swap_free(swp_entry_t swp)
> >> -{
> >> -}
> >> -
> >> static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> {
> >> }
> >> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> >> index 5bc04bfe2db1..6befaa88a342 100644
> >> --- a/kernel/power/swap.c
> >> +++ b/kernel/power/swap.c
> >> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
> >> offset = swp_offset(get_swap_page_of_type(swap));
> >> if (offset) {
> >> if (swsusp_extents_insert(offset))
> >> - swap_free(swp_entry(swap, offset));
> >> + swap_free_nr(swp_entry(swap, offset), 1);
> >> else
> >> return swapdev_block(swap, offset);
> >> }
> >> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
> >>
> >> while ((node = swsusp_extents.rb_node)) {
> >> struct swsusp_extent *ext;
> >> - unsigned long offset;
> >>
> >> ext = rb_entry(node, struct swsusp_extent, node);
> >> rb_erase(node, &swsusp_extents);
> >> - for (offset = ext->start; offset <= ext->end; offset++)
> >> - swap_free(swp_entry(swap, offset));
> >> + swap_free_nr(swp_entry(swap, ext->start),
> >> + ext->end - ext->start + 1);
> >>
> >> kfree(ext);
> >> }
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index eea6e4984eae..f033eb3528ba 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> * We're already holding a reference on the page but haven't mapped it
> >> * yet.
> >> */
> >> - swap_free(entry);
> >> + swap_free_nr(entry, 1);
> >> if (should_try_to_free_swap(folio, vma, vmf->flags))
> >> folio_free_swap(folio);
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 087a79f1f611..39ec7742acec 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >> goto walk_done_err;
> >> }
> >> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> >> - swap_free(entry);
> >> + swap_free_nr(entry, 1);
> >> set_pte_at(mm, address, pvmw.pte, pteval);
> >> goto walk_done_err;
> >> }
> >> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >> /* See folio_try_share_anon_rmap(): clear PTE first. */
> >> if (anon_exclusive &&
> >> folio_try_share_anon_rmap_pte(folio, subpage)) {
> >> - swap_free(entry);
> >> + swap_free_nr(entry, 1);
> >> set_pte_at(mm, address, pvmw.pte, pteval);
> >> goto walk_done_err;
> >> }
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index fa2a0ed97507..bfc8a2beb24f 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> >> * in shmem_evict_inode().
> >> */
> >> shmem_recalc_inode(inode, -1, -1);
> >> - swap_free(swap);
> >> + swap_free_nr(swap, 1);
> >> }
> >>
> >> /*
> >> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>
> >> delete_from_swap_cache(folio);
> >> folio_mark_dirty(folio);
> >> - swap_free(swap);
> >> + swap_free_nr(swap, 1);
> >> put_swap_device(si);
> >>
> >> *foliop = folio;
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index ec12f2b9d229..ddcd0f24b9a1 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> >> swap_range_free(p, offset, 1);
> >> }
> >>
> >> -/*
> >> - * Caller has made sure that the swap device corresponding to entry
> >> - * is still around or has not been recycled.
> >> - */
> >> -void swap_free(swp_entry_t entry)
> >> -{
> >> - struct swap_info_struct *p;
> >> -
> >> - p = _swap_info_get(entry);
> >> - if (p)
> >> - __swap_entry_free(p, entry);
> >> -}
> >> -
> >> static void cluster_swap_free_nr(struct swap_info_struct *sis,
> >> unsigned long offset, int nr_pages)
> >> {
> >> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
> >> unlock_cluster_or_swap_info(sis, ci);
> >> }
> >>
> >> +/*
> >> + * Caller has made sure that the swap device corresponding to entry
> >> + * is still around or has not been recycled.
> >> + */
> >> void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> {
> >> int nr;
> >> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> >> new_pte = pte_mkuffd_wp(new_pte);
> >> setpte:
> >> set_pte_at(vma->vm_mm, addr, pte, new_pte);
> >> - swap_free(entry);
> >> + swap_free_nr(entry, 1);
> >> out:
> >> if (pte)
> >> pte_unmap_unlock(pte, ptl);
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-05-08 09:10:58

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()

On 08/05/2024 09:30, Barry Song wrote:
> On Wed, May 8, 2024 at 7:58 PM Huang, Ying <[email protected]> wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> On 03/05/2024 01:50, Barry Song wrote:
>>>> From: Barry Song <[email protected]>
>>>>
>>>> To streamline maintenance efforts, we propose discontinuing the use of
>>>> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
>>>> to 1. This adjustment offers the advantage of enabling batch processing
>>>> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
>>>> a bitmap consisting of only one long, resulting in overhead that can be
>>>> ignored for cases where nr equals 1.
>>>>
>>>> Suggested-by: "Huang, Ying" <[email protected]>
>>>> Signed-off-by: Barry Song <[email protected]>
>>>> Cc: "Rafael J. Wysocki" <[email protected]>
>>>> Cc: Pavel Machek <[email protected]>
>>>> Cc: Len Brown <[email protected]>
>>>> Cc: Hugh Dickins <[email protected]>
>>>> ---
>>>> include/linux/swap.h | 5 -----
>>>> kernel/power/swap.c | 7 +++----
>>>> mm/memory.c | 2 +-
>>>> mm/rmap.c | 4 ++--
>>>> mm/shmem.c | 4 ++--
>>>> mm/swapfile.c | 19 +++++--------------
>>>> 6 files changed, 13 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index d1d35e92d7e9..f03cb446124e 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>>> extern void swap_shmem_alloc(swp_entry_t);
>>>> extern int swap_duplicate(swp_entry_t);
>>>> extern int swapcache_prepare(swp_entry_t);
>>>> -extern void swap_free(swp_entry_t);
>>>
>>> I wonder if it would be cleaner to:
>>>
>>> #define swap_free(entry) swap_free_nr((entry), 1)
>>>
>>> To save all the churn for the callsites that just want to pass a single entry?
>>
>> I prefer this way. Although I prefer inline functions.

Yes, I agree an inline function is the better approach.

>
> Yes, using static inline is preferable. I've recently submitted
> a checkpatch/codestyle for this, which can be found at:
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=39c58d5ed036
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=8379bf0b0e1f5
>
> Using static inline aligns with the established rule.
>
>>
>> Otherwise, LGTM. Feel free to add
>>
>> Reviewed-by: "Huang, Ying" <[email protected]>
>
> Thanks!
>
>>
>> in the future version.
>
> I believe Christoph's vote leans towards simply removing swap_free_nr
> and renaming it to swap_free, while adding a new parameter as follows.
>
> void swap_free(swp_entry_t entry, int nr);
> {
> }
>
> now I see Ryan and you prefer
>
> static inline swap_free()
> {
> swap_free_nr(...., 1)
> }
>
> Chris slightly favors discouraging the use of swap_free() without the
> new parameter. Removing swap_free() can address this concern.
>
> It seems that maintaining swap_free() and having it call swap_free_nr() with
> a default value of 1 received the most support.
>
> To align with free_swap_and_cache() and free_swap_and_cache_nr(),
> I'll proceed with the "static inline" approach in the new version. Please
> voice any objections you may have, Christoph, Chris.

I'm happy with either route. If you end up adding a nr param to swap_free() then
it would also be good to give free_swap_and_cache_nr() the same treatment.

>
>>
>>>> extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>>>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>> extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>>> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
>>>> return 0;
>>>> }
>>>>
>>>> -static inline void swap_free(swp_entry_t swp)
>>>> -{
>>>> -}
>>>> -
>>>> static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
>>>> {
>>>> }
>>>> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
>>>> index 5bc04bfe2db1..6befaa88a342 100644
>>>> --- a/kernel/power/swap.c
>>>> +++ b/kernel/power/swap.c
>>>> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
>>>> offset = swp_offset(get_swap_page_of_type(swap));
>>>> if (offset) {
>>>> if (swsusp_extents_insert(offset))
>>>> - swap_free(swp_entry(swap, offset));
>>>> + swap_free_nr(swp_entry(swap, offset), 1);
>>>> else
>>>> return swapdev_block(swap, offset);
>>>> }
>>>> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
>>>>
>>>> while ((node = swsusp_extents.rb_node)) {
>>>> struct swsusp_extent *ext;
>>>> - unsigned long offset;
>>>>
>>>> ext = rb_entry(node, struct swsusp_extent, node);
>>>> rb_erase(node, &swsusp_extents);
>>>> - for (offset = ext->start; offset <= ext->end; offset++)
>>>> - swap_free(swp_entry(swap, offset));
>>>> + swap_free_nr(swp_entry(swap, ext->start),
>>>> + ext->end - ext->start + 1);
>>>>
>>>> kfree(ext);
>>>> }
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index eea6e4984eae..f033eb3528ba 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> * We're already holding a reference on the page but haven't mapped it
>>>> * yet.
>>>> */
>>>> - swap_free(entry);
>>>> + swap_free_nr(entry, 1);
>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>> folio_free_swap(folio);
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 087a79f1f611..39ec7742acec 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>> goto walk_done_err;
>>>> }
>>>> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>>>> - swap_free(entry);
>>>> + swap_free_nr(entry, 1);
>>>> set_pte_at(mm, address, pvmw.pte, pteval);
>>>> goto walk_done_err;
>>>> }
>>>> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>> /* See folio_try_share_anon_rmap(): clear PTE first. */
>>>> if (anon_exclusive &&
>>>> folio_try_share_anon_rmap_pte(folio, subpage)) {
>>>> - swap_free(entry);
>>>> + swap_free_nr(entry, 1);
>>>> set_pte_at(mm, address, pvmw.pte, pteval);
>>>> goto walk_done_err;
>>>> }
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index fa2a0ed97507..bfc8a2beb24f 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>> * in shmem_evict_inode().
>>>> */
>>>> shmem_recalc_inode(inode, -1, -1);
>>>> - swap_free(swap);
>>>> + swap_free_nr(swap, 1);
>>>> }
>>>>
>>>> /*
>>>> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>
>>>> delete_from_swap_cache(folio);
>>>> folio_mark_dirty(folio);
>>>> - swap_free(swap);
>>>> + swap_free_nr(swap, 1);
>>>> put_swap_device(si);
>>>>
>>>> *foliop = folio;
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index ec12f2b9d229..ddcd0f24b9a1 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>>>> swap_range_free(p, offset, 1);
>>>> }
>>>>
>>>> -/*
>>>> - * Caller has made sure that the swap device corresponding to entry
>>>> - * is still around or has not been recycled.
>>>> - */
>>>> -void swap_free(swp_entry_t entry)
>>>> -{
>>>> - struct swap_info_struct *p;
>>>> -
>>>> - p = _swap_info_get(entry);
>>>> - if (p)
>>>> - __swap_entry_free(p, entry);
>>>> -}
>>>> -
>>>> static void cluster_swap_free_nr(struct swap_info_struct *sis,
>>>> unsigned long offset, int nr_pages)
>>>> {
>>>> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
>>>> unlock_cluster_or_swap_info(sis, ci);
>>>> }
>>>>
>>>> +/*
>>>> + * Caller has made sure that the swap device corresponding to entry
>>>> + * is still around or has not been recycled.
>>>> + */
>>>> void swap_free_nr(swp_entry_t entry, int nr_pages)
>>>> {
>>>> int nr;
>>>> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>>>> new_pte = pte_mkuffd_wp(new_pte);
>>>> setpte:
>>>> set_pte_at(vma->vm_mm, addr, pte, new_pte);
>>>> - swap_free(entry);
>>>> + swap_free_nr(entry, 1);
>>>> out:
>>>> if (pte)
>>>> pte_unmap_unlock(pte, ptl);
>>
>> --
>> Best Regards,
>> Huang, Ying
>
> Thanks
> Barry