From: Barry Song <[email protected]>
-v2:
* lots of code cleanup according to Chris's comments, thanks!
* collect Chris's ack tags, thanks!
* address David's comment on moving to use folio_add_new_anon_rmap
for !folio_test_anon in do_swap_page, thanks!
* remove the MADV_PAGEOUT patch from this series as Ryan will
intergrate it into swap-out series
* Apply Kairui's work of "mm/swap: fix race when skipping swapcache"
on large folios swap-in as well
* fixed corrupted data(zero-filled data) in two races: zswap and
a part of entries are in swapcache while some others are not
in by checking SWAP_HAS_CACHE while swapping in a large folio
-v1:
https://lore.kernel.org/all/[email protected]/#t
On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to back-
ground, its most memory might be swapped-out.
Now we have mTHP features, unfortunately, if we don't support large folios
swap-in, once those large folios are swapped-out, we immediately lose the
performance gain we can get through large folios and hardware optimization
such as CONT-PTE.
In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
before swap-out, if some memory were normal pages, but when swapping in, we
can also swap-in them as large folios. But this might require I/O happen at
some random places in swap devices. So we limit the large folios swap-in to
those areas which were large folios before swapping-out, aka, swaps are also
contiguous in swapdevice. On the other hand, in OPPO's product, we've deployed
anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
compress and decompress large folios as a whole, which help improve compression
ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
save large objects whose original size are 64KiB for example (related patches
are coming). So it is also a good choice for us to support swap-in large
folios for those compressed large objects as a large folio can be decompressed
all together.
Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
with MTE" to this series as it might help review.
[1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
https://lore.kernel.org/linux-mm/[email protected]/
[2] OnePlusOSS / android_kernel_oneplus_sm8550
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
Barry Song (2):
arm64: mm: swap: support THP_SWAP on hardware with MTE
mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for
large folios swap-in
Chuanhua Han (3):
mm: swap: introduce swap_nr_free() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: support large folios swapin as a whole
arch/arm64/include/asm/pgtable.h | 19 +--
arch/arm64/mm/mteswap.c | 43 +++++++
include/linux/huge_mm.h | 12 --
include/linux/pgtable.h | 2 +-
include/linux/swap.h | 7 ++
mm/memory.c | 193 +++++++++++++++++++++++++------
mm/page_io.c | 2 +-
mm/swap.h | 1 +
mm/swap_slots.c | 2 +-
mm/swapfile.c | 152 ++++++++++++++++--------
10 files changed, 319 insertions(+), 114 deletions(-)
--
2.34.1
From: Barry Song <[email protected]>
Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
MTE as the MTE code works with the assumption tags save/restore is
always handling a folio with only one page.
The limitation should be removed as more and more ARM64 SoCs have
this feature. Co-existence of MTE and THP_SWAP becomes more and
more important.
This patch makes MTE tags saving support large folios, then we don't
need to split large folios into base pages for swapping out on ARM64
SoCs with MTE any more.
arch_prepare_to_swap() should take folio rather than page as parameter
because we support THP swap-out as a whole. It saves tags for all
pages in a large folio.
As now we are restoring tags based-on folio, in arch_swap_restore(),
we may increase some extra loops and early-exitings while refaulting
a large folio which is still in swapcache in do_swap_page(). In case
a large folio has nr pages, do_swap_page() will only set the PTE of
the particular page which is causing the page fault.
Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
will loop nr times for those subpages in the folio. So right now the
algorithmic complexity becomes O(nr^2).
Once we support mapping large folios in do_swap_page(), extra loops
and early-exitings will decrease while not being completely removed
as a large folio might get partially tagged in corner cases such as,
1. a large folio in swapcache can be partially unmapped, thus, MTE
tags for the unmapped pages will be invalidated;
2. users might use mprotect() to set MTEs on a part of a large folio.
arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
who needed it.
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: "Mike Rapoport (IBM)" <[email protected]>
Cc: Hugh Dickins <[email protected]>
CC: "Aneesh Kumar K.V" <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Acked-by: Chris Li <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 19 ++------------
arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
include/linux/huge_mm.h | 12 ---------
include/linux/pgtable.h | 2 +-
mm/page_io.c | 2 +-
mm/swap_slots.c | 2 +-
6 files changed, 48 insertions(+), 32 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 401087e8a43d..7a54750770b8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline bool arch_thp_swp_supported(void)
-{
- return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
/*
* Outside of a few very special situations (e.g. hibernation), we always
* use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
#ifdef CONFIG_ARM64_MTE
#define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
- if (system_supports_mte())
- return mte_save_tags(page);
- return 0;
-}
+extern int arch_prepare_to_swap(struct folio *folio);
#define __HAVE_ARCH_SWAP_INVALIDATE
static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
}
#define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
- if (system_supports_mte())
- mte_restore_tags(entry, &folio->page);
-}
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
#endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..295836fef620 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
mte_free_tag_storage(tags);
}
+static inline void __mte_invalidate_tags(struct page *page)
+{
+ swp_entry_t entry = page_swap_entry(page);
+
+ mte_invalidate_tags(swp_type(entry), swp_offset(entry));
+}
+
void mte_invalidate_tags_area(int type)
{
swp_entry_t entry = swp_entry(type, 0);
@@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
}
xa_unlock(&mte_pages);
}
+
+int arch_prepare_to_swap(struct folio *folio)
+{
+ long i, nr;
+ int err;
+
+ if (!system_supports_mte())
+ return 0;
+
+ nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ err = mte_save_tags(folio_page(folio, i));
+ if (err)
+ goto out;
+ }
+ return 0;
+
+out:
+ while (i--)
+ __mte_invalidate_tags(folio_page(folio, i));
+ return err;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+ if (system_supports_mte()) {
+ long i, nr = folio_nr_pages(folio);
+
+ entry.val -= swp_offset(entry) & (nr - 1);
+ for (i = 0; i < nr; i++) {
+ mte_restore_tags(entry, folio_page(folio, i));
+ entry.val++;
+ }
+ }
+}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de0c89105076..e04b93c43965 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
#define split_folio(f) split_folio_to_order(f, 0)
-/*
- * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
- * limitations in the implementation like arm64 MTE can override this to
- * false
- */
-#ifndef arch_thp_swp_supported
-static inline bool arch_thp_swp_supported(void)
-{
- return true;
-}
-#endif
-
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a36cf4e124b0..ec7efce0f3f0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1052,7 +1052,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
* prototypes must be defined in the arch-specific asm/pgtable.h file.
*/
#ifndef __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
+static inline int arch_prepare_to_swap(struct folio *folio)
{
return 0;
}
diff --git a/mm/page_io.c b/mm/page_io.c
index ae2b49055e43..a9a7c236aecc 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
* Arch code may have to preserve more data than just the page
* contents, e.g. memory tags.
*/
- ret = arch_prepare_to_swap(&folio->page);
+ ret = arch_prepare_to_swap(folio);
if (ret) {
folio_mark_dirty(folio);
folio_unlock(folio);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 90973ce7881d..53abeaf1371d 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
entry.val = 0;
if (folio_test_large(folio)) {
- if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+ if (IS_ENABLED(CONFIG_THP_SWAP))
get_swap_pages(1, &entry, folio_nr_pages(folio));
goto out;
}
--
2.34.1
From: Chuanhua Han <[email protected]>
While swapping in a large folio, we need to free swaps related to the whole
folio. To avoid frequently acquiring and releasing swap locks, it is better
to introduce an API for batched free.
Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/swap.h | 6 ++++++
mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 25f6368be078..b3581c976e5f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
+extern void swap_nr_free(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern int free_swap_and_cache(swp_entry_t);
int swap_type_of(dev_t device, sector_t offset);
@@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
{
}
+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+
+}
+
static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350..c0c058ee7b69 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1340,6 +1340,41 @@ void swap_free(swp_entry_t entry)
__swap_entry_free(p, entry);
}
+/*
+ * Called after swapping in a large folio, batched free swap entries
+ * for this large folio, entry should be for the first subpage and
+ * its offset is aligned with nr_pages
+ */
+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+ int i;
+ struct swap_cluster_info *ci;
+ struct swap_info_struct *p;
+ unsigned type = swp_type(entry);
+ unsigned long offset = swp_offset(entry);
+ DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
+
+ /* all swap entries are within a cluster for mTHP */
+ VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+ if (nr_pages == 1) {
+ swap_free(entry);
+ return;
+ }
+
+ p = _swap_info_get(entry);
+
+ ci = lock_cluster(p, offset);
+ for (i = 0; i < nr_pages; i++) {
+ if (__swap_entry_free_locked(p, offset + i, 1))
+ __bitmap_set(usage, i, 1);
+ }
+ unlock_cluster(ci);
+
+ for_each_clear_bit(i, usage, nr_pages)
+ free_swap_slot(swp_entry(type, offset + i));
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
--
2.34.1
From: Chuanhua Han <[email protected]>
should_try_to_free_swap() works with an assumption that swap-in is always done
at normal page granularity, aka, folio_nr_pages = 1. To support large folio
swap-in, this patch removes the assumption.
Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Acked-by: Chris Li <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 319b3be05e75..90b08b7cbaac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3904,7 +3904,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
* reference only in case it's likely that we'll be the exlusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == 2;
+ folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}
static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
--
2.34.1
From: Chuanhua Han <[email protected]>
On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to back-
ground, its most memory might be swapped-out.
Now we have mTHP features, unfortunately, if we don't support large folios
swap-in, once those large folios are swapped-out, we immediately lose the
performance gain we can get through large folios and hardware optimization
such as CONT-PTE.
This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
to those contiguous swaps which were likely swapped out from mTHP as a whole.
On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
case. It doesn't support swapin_readahead as large folios yet.
Right now, we are re-faulting large folios which are still in swapcache as a
whole, this can effectively decrease extra loops and early-exitings which we
have increased in arch_swap_restore() while supporting MTE restore for folios
rather than page. On the other hand, it can also decrease do_swap_page as PTEs
used to be set one by one even we hit a large folio in swapcache.
Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 191 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 157 insertions(+), 34 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 90b08b7cbaac..471689ce4e91 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -104,9 +104,16 @@ struct page *mem_map;
EXPORT_SYMBOL(mem_map);
#endif
+/* A choice of behaviors for alloc_anon_folio() */
+enum behavior {
+ DO_SWAP_PAGE,
+ DO_ANON_PAGE,
+};
+
static vm_fault_t do_fault(struct vm_fault *vmf);
static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
static bool vmf_pte_changed(struct vm_fault *vmf);
+static struct folio *alloc_anon_folio(struct vm_fault *vmf, enum behavior behavior);
/*
* Return true if the original pte was a uffd-wp pte marker (so the pte was
@@ -3974,6 +3981,52 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}
+/*
+ * check a range of PTEs are completely swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE.
+ * pte must be first one in the range
+ */
+static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
+{
+ int i;
+ struct swap_info_struct *si;
+ swp_entry_t entry;
+ unsigned type;
+ pgoff_t start_offset;
+ char has_cache;
+
+ entry = pte_to_swp_entry(ptep_get_lockless(pte));
+ if (non_swap_entry(entry))
+ return false;
+ start_offset = swp_offset(entry);
+ if (start_offset % nr_pages)
+ return false;
+
+ si = swp_swap_info(entry);
+ type = swp_type(entry);
+ has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
+ for (i = 1; i < nr_pages; i++) {
+ entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
+ if (non_swap_entry(entry))
+ return false;
+ if (swp_offset(entry) != start_offset + i)
+ return false;
+ if (swp_type(entry) != type)
+ return false;
+ /*
+ * while allocating a large folio and doing swap_read_folio for the
+ * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
+ * doesn't have swapcache. We need to ensure all PTEs have no cache
+ * as well, otherwise, we might go to swap devices while the content
+ * is in swapcache
+ */
+ if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
+ return false;
+ }
+
+ return true;
+}
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3995,6 +4048,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages = 1;
+ unsigned long start_address;
+ pte_t *start_pte;
if (!pte_unmap_same(vmf))
goto out;
@@ -4058,28 +4114,32 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread may
- * finish swapin first, free the entry, and swapout
- * reusing the same entry. It's undetectable as
- * pte_same() returns true due to entry reuse.
- */
- if (swapcache_prepare(entry)) {
- /* Relax a bit to prevent rapid repeated page faults */
- schedule_timeout_uninterruptible(1);
- goto out;
- }
- need_clear_cache = true;
-
/* skip swapcache */
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
+ folio = alloc_anon_folio(vmf, DO_SWAP_PAGE);
page = &folio->page;
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
+ if (folio_test_large(folio)) {
+ nr_pages = folio_nr_pages(folio);
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
+ }
+
+ /*
+ * Prevent parallel swapin from proceeding with
+ * the cache flag. Otherwise, another thread may
+ * finish swapin first, free the entry, and swapout
+ * reusing the same entry. It's undetectable as
+ * pte_same() returns true due to entry reuse.
+ */
+ if (swapcache_prepare_nr(entry, nr_pages)) {
+ /* Relax a bit to prevent rapid repeated page faults */
+ schedule_timeout_uninterruptible(1);
+ goto out;
+ }
+ need_clear_cache = true;
+
if (mem_cgroup_swapin_charge_folio(folio,
vma->vm_mm, GFP_KERNEL,
entry)) {
@@ -4185,6 +4245,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+
+ start_address = vmf->address;
+ start_pte = vmf->pte;
+ if (folio_test_large(folio)) {
+ unsigned long nr = folio_nr_pages(folio);
+ unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+ pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
+
+ /*
+ * case 1: we are allocating large_folio, try to map it as a whole
+ * iff the swap entries are still entirely mapped;
+ * case 2: we hit a large folio in swapcache, and all swap entries
+ * are still entirely mapped, try to map a large folio as a whole.
+ * otherwise, map only the faulting page within the large folio
+ * which is swapcache
+ */
+ if (!is_pte_range_contig_swap(aligned_pte, nr)) {
+ if (nr_pages > 1) /* ptes have changed for case 1 */
+ goto out_nomap;
+ goto check_pte;
+ }
+
+ start_address = addr;
+ start_pte = aligned_pte;
+ /*
+ * the below has been done before swap_read_folio()
+ * for case 1
+ */
+ if (unlikely(folio == swapcache)) {
+ nr_pages = nr;
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
+ page = &folio->page;
+ }
+ }
+
+check_pte:
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;
@@ -4252,12 +4348,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_nr_free(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
pte = mk_pte(page, vma->vm_page_prot);
/*
@@ -4267,14 +4365,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || folio_ref_count(folio) == 1)) {
+ (exclusive || folio_ref_count(folio) == nr_pages)) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
@@ -4283,17 +4381,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_add_new_anon_rmap(folio, vma, start_address);
folio_add_lru_vma(folio, vma);
+ } else if (!folio_test_anon(folio)) {
+ folio_add_new_anon_rmap(folio, vma, start_address);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
rmap_flags);
}
VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+ arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4310,6 +4410,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
if (vmf->flags & FAULT_FLAG_WRITE) {
+ if (nr_pages > 1)
+ vmf->orig_pte = ptep_get(vmf->pte);
+
ret |= do_wp_page(vmf);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
@@ -4317,14 +4420,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
/* Clear the swap cache pin for direct swapin after PTL unlock */
if (need_clear_cache)
- swapcache_clear(si, entry);
+ swapcache_clear_nr(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
@@ -4340,7 +4443,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_put(swapcache);
}
if (need_clear_cache)
- swapcache_clear(si, entry);
+ swapcache_clear_nr(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
@@ -4358,7 +4461,7 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
return true;
}
-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf, enum behavior behavior)
{
struct vm_area_struct *vma = vmf->vma;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -4376,6 +4479,19 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
if (unlikely(userfaultfd_armed(vma)))
goto fallback;
+ /*
+ * a large folio being swapped-in could be partially in
+ * zswap and partially in swap devices, zswap doesn't
+ * support large folios yet, we might get corrupted
+ * zero-filled data by reading all subpages from swap
+ * devices while some of them are actually in zswap
+ */
+ if (behavior == DO_SWAP_PAGE && is_zswap_enabled())
+ goto fallback;
+
+ if (unlikely(behavior != DO_ANON_PAGE && behavior != DO_SWAP_PAGE))
+ return ERR_PTR(-EINVAL);
+
/*
* Get a list of all the (large) orders below PMD_ORDER that are enabled
* for this vma. Then filter out the orders that can't be allocated over
@@ -4393,15 +4509,22 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
return ERR_PTR(-EAGAIN);
/*
- * Find the highest order where the aligned range is completely
- * pte_none(). Note that all remaining orders will be completely
+ * For do_anonymous_page, find the highest order where the aligned range is
+ * completely pte_none(). Note that all remaining orders will be completely
* pte_none().
+ * For do_swap_page, find the highest order where the aligned range is
+ * completely swap entries with contiguous swap offsets.
*/
order = highest_order(orders);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (pte_range_none(pte + pte_index(addr), 1 << order))
- break;
+ if (behavior == DO_ANON_PAGE) {
+ if (pte_range_none(pte + pte_index(addr), 1 << order))
+ break;
+ } else {
+ if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
+ break;
+ }
order = next_order(&orders, order);
}
@@ -4485,7 +4608,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (unlikely(anon_vma_prepare(vma)))
goto oom;
/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
- folio = alloc_anon_folio(vmf);
+ folio = alloc_anon_folio(vmf, DO_ANON_PAGE);
if (IS_ERR(folio))
return 0;
if (!folio)
--
2.34.1
From: Barry Song <[email protected]>
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
one entry only, to support large folio swap-in, we need to handle multiple
swap entries.
Cc: Kairui Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/swap.h | 1 +
mm/swap.h | 1 +
mm/swapfile.c | 117 ++++++++++++++++++++++++++-----------------
3 files changed, 72 insertions(+), 47 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b3581c976e5f..2691c739d9a4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
+extern int swapcache_prepare_nr(swp_entry_t, int nr);
extern void swap_free(swp_entry_t);
extern void swap_nr_free(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
diff --git a/mm/swap.h b/mm/swap.h
index fc2f6ade7f80..1cec991efcda 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *swap_cache_get_folio(swp_entry_t entry,
struct vm_area_struct *vma, unsigned long addr);
struct folio *filemap_get_incore_folio(struct address_space *mapping,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c0c058ee7b69..c8c8b6dbaeda 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3308,7 +3308,7 @@ void si_swapinfo(struct sysinfo *val)
}
/*
- * Verify that a swap entry is valid and increment its swap map count.
+ * Verify that nr swap entries are valid and increment their swap map count.
*
* Returns error code in following case.
* - success -> 0
@@ -3318,66 +3318,73 @@ void si_swapinfo(struct sysinfo *val)
* - swap-cache reference is requested but the entry is not used. -> ENOENT
* - swap-mapped reference requested but needs continued swap count. -> ENOMEM
*/
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
{
struct swap_info_struct *p;
struct swap_cluster_info *ci;
unsigned long offset;
- unsigned char count;
- unsigned char has_cache;
- int err;
+ unsigned char count[SWAPFILE_CLUSTER];
+ unsigned char has_cache[SWAPFILE_CLUSTER];
+ int err, i;
p = swp_swap_info(entry);
offset = swp_offset(entry);
ci = lock_cluster_or_swap_info(p, offset);
- count = p->swap_map[offset];
-
- /*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
- */
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
- }
+ for (i = 0; i < nr; i++) {
+ count[i] = p->swap_map[offset + i];
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
- err = 0;
-
- if (usage == SWAP_HAS_CACHE) {
-
- /* set SWAP_HAS_CACHE if there is no cache and entry is used */
- if (!has_cache && count)
- has_cache = SWAP_HAS_CACHE;
- else if (has_cache) /* someone else added cache */
- err = -EEXIST;
- else /* no users remaining */
+ /*
+ * swapin_readahead() doesn't check if a swap entry is valid, so the
+ * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+ */
+ if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
err = -ENOENT;
+ goto unlock_out;
+ }
- } else if (count || has_cache) {
-
- if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
- count += usage;
- else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
- err = -EINVAL;
- else if (swap_count_continued(p, offset, count))
- count = COUNT_CONTINUED;
- else
- err = -ENOMEM;
- } else
- err = -ENOENT; /* unused swap entry */
-
- if (!err)
- WRITE_ONCE(p->swap_map[offset], count | has_cache);
+ has_cache[i] = count[i] & SWAP_HAS_CACHE;
+ count[i] &= ~SWAP_HAS_CACHE;
+ err = 0;
+
+ if (usage == SWAP_HAS_CACHE) {
+
+ /* set SWAP_HAS_CACHE if there is no cache and entry is used */
+ if (!has_cache[i] && count[i])
+ has_cache[i] = SWAP_HAS_CACHE;
+ else if (has_cache[i]) /* someone else added cache */
+ err = -EEXIST;
+ else /* no users remaining */
+ err = -ENOENT;
+ } else if (count[i] || has_cache[i]) {
+
+ if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ count[i] += usage;
+ else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
+ err = -EINVAL;
+ else if (swap_count_continued(p, offset + i, count[i]))
+ count[i] = COUNT_CONTINUED;
+ else
+ err = -ENOMEM;
+ } else
+ err = -ENOENT; /* unused swap entry */
+ }
+ if (!err) {
+ for (i = 0; i < nr; i++)
+ WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
+ }
unlock_out:
unlock_cluster_or_swap_info(p, ci);
return err;
}
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+ return __swap_duplicate_nr(entry, 1, usage);
+}
+
/*
* Help swapoff by noting that swap entry belongs to shmem/tmpfs
* (in which case its reference count is never incremented).
@@ -3416,17 +3423,33 @@ int swapcache_prepare(swp_entry_t entry)
return __swap_duplicate(entry, SWAP_HAS_CACHE);
}
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+int swapcache_prepare_nr(swp_entry_t entry, int nr)
+{
+ return __swap_duplicate_nr(entry, nr, SWAP_HAS_CACHE);
+}
+
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr)
{
struct swap_cluster_info *ci;
unsigned long offset = swp_offset(entry);
- unsigned char usage;
+ unsigned char usage[SWAPFILE_CLUSTER];
+ int i;
ci = lock_cluster_or_swap_info(si, offset);
- usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
+ for (i = 0; i < nr; i++)
+ usage[i] = __swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE);
unlock_cluster_or_swap_info(si, ci);
- if (!usage)
- free_swap_slot(entry);
+ for (i = 0; i < nr; i++) {
+ if (!usage[i]) {
+ free_swap_slot(entry);
+ entry.val++;
+ }
+ }
+}
+
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+ swapcache_clear_nr(si, entry, 1);
}
struct swap_info_struct *swp_swap_info(swp_entry_t entry)
--
2.34.1
On Wed, Feb 28, 2024 at 4:39 PM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> one entry only, to support large folio swap-in, we need to handle multiple
> swap entries.
>
> Cc: Kairui Song <[email protected]>
> Cc: "Huang, Ying" <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Chris Li <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Yosry Ahmed <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: SeongJae Park <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/swap.h | 1 +
> mm/swap.h | 1 +
> mm/swapfile.c | 117 ++++++++++++++++++++++++++-----------------
> 3 files changed, 72 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b3581c976e5f..2691c739d9a4 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> +extern int swapcache_prepare_nr(swp_entry_t, int nr);
> extern void swap_free(swp_entry_t);
> extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> diff --git a/mm/swap.h b/mm/swap.h
> index fc2f6ade7f80..1cec991efcda 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
> struct folio *swap_cache_get_folio(swp_entry_t entry,
> struct vm_area_struct *vma, unsigned long addr);
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c0c058ee7b69..c8c8b6dbaeda 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3308,7 +3308,7 @@ void si_swapinfo(struct sysinfo *val)
> }
>
> /*
> - * Verify that a swap entry is valid and increment its swap map count.
> + * Verify that nr swap entries are valid and increment their swap map count.
> *
> * Returns error code in following case.
> * - success -> 0
> @@ -3318,66 +3318,73 @@ void si_swapinfo(struct sysinfo *val)
> * - swap-cache reference is requested but the entry is not used. -> ENOENT
> * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
> */
> -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> +static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
> {
> struct swap_info_struct *p;
> struct swap_cluster_info *ci;
> unsigned long offset;
> - unsigned char count;
> - unsigned char has_cache;
> - int err;
> + unsigned char count[SWAPFILE_CLUSTER];
> + unsigned char has_cache[SWAPFILE_CLUSTER];
I am not closely following this series, but a couple of things caught my eyes.
Is this reasonable for stack usage?
> + int err, i;
>
> p = swp_swap_info(entry);
>
> offset = swp_offset(entry);
> ci = lock_cluster_or_swap_info(p, offset);
>
> - count = p->swap_map[offset];
> -
> - /*
> - * swapin_readahead() doesn't check if a swap entry is valid, so the
> - * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> - */
> - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
> - err = -ENOENT;
> - goto unlock_out;
> - }
> + for (i = 0; i < nr; i++) {
> + count[i] = p->swap_map[offset + i];
>
> - has_cache = count & SWAP_HAS_CACHE;
> - count &= ~SWAP_HAS_CACHE;
> - err = 0;
> -
> - if (usage == SWAP_HAS_CACHE) {
> -
> - /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> - if (!has_cache && count)
> - has_cache = SWAP_HAS_CACHE;
> - else if (has_cache) /* someone else added cache */
> - err = -EEXIST;
> - else /* no users remaining */
> + /*
> + * swapin_readahead() doesn't check if a swap entry is valid, so the
> + * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> + */
> + if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
> err = -ENOENT;
> + goto unlock_out;
> + }
Here we immediately exit if there is an error, but we don't below, we
just keep overwriting the error every iteration as far as I can tell.
Also, it doesn't seem like we do any cleanups if we hit an error
halfway through. Should we undo previously updated swap entries, or am
I missing something here?
>
> - } else if (count || has_cache) {
> -
> - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> - count += usage;
> - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> - err = -EINVAL;
> - else if (swap_count_continued(p, offset, count))
> - count = COUNT_CONTINUED;
> - else
> - err = -ENOMEM;
> - } else
> - err = -ENOENT; /* unused swap entry */
> -
> - if (!err)
> - WRITE_ONCE(p->swap_map[offset], count | has_cache);
> + has_cache[i] = count[i] & SWAP_HAS_CACHE;
> + count[i] &= ~SWAP_HAS_CACHE;
> + err = 0;
> +
> + if (usage == SWAP_HAS_CACHE) {
> +
> + /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> + if (!has_cache[i] && count[i])
> + has_cache[i] = SWAP_HAS_CACHE;
> + else if (has_cache[i]) /* someone else added cache */
> + err = -EEXIST;
> + else /* no users remaining */
> + err = -ENOENT;
> + } else if (count[i] || has_cache[i]) {
> +
> + if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> + count[i] += usage;
> + else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> + err = -EINVAL;
> + else if (swap_count_continued(p, offset + i, count[i]))
> + count[i] = COUNT_CONTINUED;
> + else
> + err = -ENOMEM;
> + } else
> + err = -ENOENT; /* unused swap entry */
> + }
>
> + if (!err) {
> + for (i = 0; i < nr; i++)
> + WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
> + }
> unlock_out:
> unlock_cluster_or_swap_info(p, ci);
> return err;
> }
On Thu, Feb 29, 2024 at 1:52 PM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Feb 28, 2024 at 4:39 PM Barry Song <[email protected]> wrote:
> >
> > From: Barry Song <[email protected]>
> >
> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> > one entry only, to support large folio swap-in, we need to handle multiple
> > swap entries.
> >
> > Cc: Kairui Song <[email protected]>
> > Cc: "Huang, Ying" <[email protected]>
> > Cc: Yu Zhao <[email protected]>
> > Cc: David Hildenbrand <[email protected]>
> > Cc: Chris Li <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Matthew Wilcox (Oracle) <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > Cc: Yosry Ahmed <[email protected]>
> > Cc: Yu Zhao <[email protected]>
> > Cc: SeongJae Park <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/swap.h | 1 +
> > mm/swap.h | 1 +
> > mm/swapfile.c | 117 ++++++++++++++++++++++++++-----------------
> > 3 files changed, 72 insertions(+), 47 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index b3581c976e5f..2691c739d9a4 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > extern int swapcache_prepare(swp_entry_t);
> > +extern int swapcache_prepare_nr(swp_entry_t, int nr);
> > extern void swap_free(swp_entry_t);
> > extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> > extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > diff --git a/mm/swap.h b/mm/swap.h
> > index fc2f6ade7f80..1cec991efcda 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
> > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > unsigned long end);
> > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> > +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
> > struct folio *swap_cache_get_folio(swp_entry_t entry,
> > struct vm_area_struct *vma, unsigned long addr);
> > struct folio *filemap_get_incore_folio(struct address_space *mapping,
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index c0c058ee7b69..c8c8b6dbaeda 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3308,7 +3308,7 @@ void si_swapinfo(struct sysinfo *val)
> > }
> >
> > /*
> > - * Verify that a swap entry is valid and increment its swap map count.
> > + * Verify that nr swap entries are valid and increment their swap map count.
> > *
> > * Returns error code in following case.
> > * - success -> 0
> > @@ -3318,66 +3318,73 @@ void si_swapinfo(struct sysinfo *val)
> > * - swap-cache reference is requested but the entry is not used. -> ENOENT
> > * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
> > */
> > -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> > +static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
> > {
> > struct swap_info_struct *p;
> > struct swap_cluster_info *ci;
> > unsigned long offset;
> > - unsigned char count;
> > - unsigned char has_cache;
> > - int err;
> > + unsigned char count[SWAPFILE_CLUSTER];
> > + unsigned char has_cache[SWAPFILE_CLUSTER];
>
Hi Yosry,
Thanks for reviewing!
> I am not closely following this series, but a couple of things caught my eyes.
>
> Is this reasonable for stack usage?
SWAPFILE_CLUSTER is not huge. typically 512 or 256.
#ifdef CONFIG_THP_SWAP
#define SWAPFILE_CLUSTER HPAGE_PMD_NR
#define swap_entry_size(size) (size)
#else
#define SWAPFILE_CLUSTER 256
If this is still a concern, I can move it to a bitmap.
>
> > + int err, i;
> >
> > p = swp_swap_info(entry);
> >
> > offset = swp_offset(entry);
> > ci = lock_cluster_or_swap_info(p, offset);
> >
> > - count = p->swap_map[offset];
> > -
> > - /*
> > - * swapin_readahead() doesn't check if a swap entry is valid, so the
> > - * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > - */
> > - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
> > - err = -ENOENT;
> > - goto unlock_out;
> > - }
> > + for (i = 0; i < nr; i++) {
> > + count[i] = p->swap_map[offset + i];
> >
> > - has_cache = count & SWAP_HAS_CACHE;
> > - count &= ~SWAP_HAS_CACHE;
> > - err = 0;
> > -
> > - if (usage == SWAP_HAS_CACHE) {
> > -
> > - /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > - if (!has_cache && count)
> > - has_cache = SWAP_HAS_CACHE;
> > - else if (has_cache) /* someone else added cache */
> > - err = -EEXIST;
> > - else /* no users remaining */
> > + /*
> > + * swapin_readahead() doesn't check if a swap entry is valid, so the
> > + * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > + */
> > + if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
> > err = -ENOENT;
> > + goto unlock_out;
> > + }
>
> Here we immediately exit if there is an error, but we don't below, we
> just keep overwriting the error every iteration as far as I can tell.
> Also, it doesn't seem like we do any cleanups if we hit an error
> halfway through. Should we undo previously updated swap entries, or am
> I missing something here?
we are safely immediately exiting because we don't change swap_map
till we finish all checks. while all checks are done, we write them by
WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
at the end.
>
> >
> > - } else if (count || has_cache) {
> > -
> > - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > - count += usage;
> > - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > - err = -EINVAL;
> > - else if (swap_count_continued(p, offset, count))
> > - count = COUNT_CONTINUED;
> > - else
> > - err = -ENOMEM;
> > - } else
> > - err = -ENOENT; /* unused swap entry */
> > -
> > - if (!err)
> > - WRITE_ONCE(p->swap_map[offset], count | has_cache);
> > + has_cache[i] = count[i] & SWAP_HAS_CACHE;
> > + count[i] &= ~SWAP_HAS_CACHE;
> > + err = 0;
> > +
> > + if (usage == SWAP_HAS_CACHE) {
> > +
> > + /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > + if (!has_cache[i] && count[i])
> > + has_cache[i] = SWAP_HAS_CACHE;
> > + else if (has_cache[i]) /* someone else added cache */
> > + err = -EEXIST;
> > + else /* no users remaining */
> > + err = -ENOENT;
> > + } else if (count[i] || has_cache[i]) {
> > +
> > + if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > + count[i] += usage;
> > + else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > + err = -EINVAL;
> > + else if (swap_count_continued(p, offset + i, count[i]))
> > + count[i] = COUNT_CONTINUED;
> > + else
> > + err = -ENOMEM;
> > + } else
> > + err = -ENOENT; /* unused swap entry */
> > + }
> >
> > + if (!err) {
> > + for (i = 0; i < nr; i++)
> > + WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
Here is the place where we really write data. Before that, we only
touched temp variables.
> > + }
> > unlock_out:
> > unlock_cluster_or_swap_info(p, ci);
> > return err;
> > }
thanks
Barry
On Wed, Feb 28, 2024 at 4:59 PM Barry Song <[email protected]> wrote:
>
> On Thu, Feb 29, 2024 at 1:52 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Wed, Feb 28, 2024 at 4:39 PM Barry Song <[email protected]> wrote:
> > >
> > > From: Barry Song <[email protected]>
> > >
> > > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> > > one entry only, to support large folio swap-in, we need to handle multiple
> > > swap entries.
> > >
> > > Cc: Kairui Song <[email protected]>
> > > Cc: "Huang, Ying" <[email protected]>
> > > Cc: Yu Zhao <[email protected]>
> > > Cc: David Hildenbrand <[email protected]>
> > > Cc: Chris Li <[email protected]>
> > > Cc: Hugh Dickins <[email protected]>
> > > Cc: Johannes Weiner <[email protected]>
> > > Cc: Matthew Wilcox (Oracle) <[email protected]>
> > > Cc: Michal Hocko <[email protected]>
> > > Cc: Minchan Kim <[email protected]>
> > > Cc: Yosry Ahmed <[email protected]>
> > > Cc: Yu Zhao <[email protected]>
> > > Cc: SeongJae Park <[email protected]>
> > > Signed-off-by: Barry Song <[email protected]>
> > > ---
> > > include/linux/swap.h | 1 +
> > > mm/swap.h | 1 +
> > > mm/swapfile.c | 117 ++++++++++++++++++++++++++-----------------
> > > 3 files changed, 72 insertions(+), 47 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index b3581c976e5f..2691c739d9a4 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > > extern void swap_shmem_alloc(swp_entry_t);
> > > extern int swap_duplicate(swp_entry_t);
> > > extern int swapcache_prepare(swp_entry_t);
> > > +extern int swapcache_prepare_nr(swp_entry_t, int nr);
> > > extern void swap_free(swp_entry_t);
> > > extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> > > extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > > diff --git a/mm/swap.h b/mm/swap.h
> > > index fc2f6ade7f80..1cec991efcda 100644
> > > --- a/mm/swap.h
> > > +++ b/mm/swap.h
> > > @@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
> > > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > > unsigned long end);
> > > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> > > +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
> > > struct folio *swap_cache_get_folio(swp_entry_t entry,
> > > struct vm_area_struct *vma, unsigned long addr);
> > > struct folio *filemap_get_incore_folio(struct address_space *mapping,
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index c0c058ee7b69..c8c8b6dbaeda 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -3308,7 +3308,7 @@ void si_swapinfo(struct sysinfo *val)
> > > }
> > >
> > > /*
> > > - * Verify that a swap entry is valid and increment its swap map count.
> > > + * Verify that nr swap entries are valid and increment their swap map count.
> > > *
> > > * Returns error code in following case.
> > > * - success -> 0
> > > @@ -3318,66 +3318,73 @@ void si_swapinfo(struct sysinfo *val)
> > > * - swap-cache reference is requested but the entry is not used. -> ENOENT
> > > * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
> > > */
> > > -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> > > +static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
> > > {
> > > struct swap_info_struct *p;
> > > struct swap_cluster_info *ci;
> > > unsigned long offset;
> > > - unsigned char count;
> > > - unsigned char has_cache;
> > > - int err;
> > > + unsigned char count[SWAPFILE_CLUSTER];
> > > + unsigned char has_cache[SWAPFILE_CLUSTER];
> >
>
> Hi Yosry,
>
> Thanks for reviewing!
>
> > I am not closely following this series, but a couple of things caught my eyes.
> >
> > Is this reasonable for stack usage?
>
> SWAPFILE_CLUSTER is not huge. typically 512 or 256.
So that's 1K of stack usage out of 16K total on x86. I think this may
be a lot for a single function to use, but perhaps others will
disagree.
>
> #ifdef CONFIG_THP_SWAP
> #define SWAPFILE_CLUSTER HPAGE_PMD_NR
>
> #define swap_entry_size(size) (size)
> #else
> #define SWAPFILE_CLUSTER 256
>
> If this is still a concern, I can move it to a bitmap.
>
> >
> > > + int err, i;
> > >
> > > p = swp_swap_info(entry);
> > >
> > > offset = swp_offset(entry);
> > > ci = lock_cluster_or_swap_info(p, offset);
> > >
> > > - count = p->swap_map[offset];
> > > -
> > > - /*
> > > - * swapin_readahead() doesn't check if a swap entry is valid, so the
> > > - * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > > - */
> > > - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
> > > - err = -ENOENT;
> > > - goto unlock_out;
> > > - }
> > > + for (i = 0; i < nr; i++) {
> > > + count[i] = p->swap_map[offset + i];
> > >
> > > - has_cache = count & SWAP_HAS_CACHE;
> > > - count &= ~SWAP_HAS_CACHE;
> > > - err = 0;
> > > -
> > > - if (usage == SWAP_HAS_CACHE) {
> > > -
> > > - /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > > - if (!has_cache && count)
> > > - has_cache = SWAP_HAS_CACHE;
> > > - else if (has_cache) /* someone else added cache */
> > > - err = -EEXIST;
> > > - else /* no users remaining */
> > > + /*
> > > + * swapin_readahead() doesn't check if a swap entry is valid, so the
> > > + * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > > + */
> > > + if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
> > > err = -ENOENT;
> > > + goto unlock_out;
> > > + }
> >
>
>
> > Here we immediately exit if there is an error, but we don't below, we
> > just keep overwriting the error every iteration as far as I can tell.
> > Also, it doesn't seem like we do any cleanups if we hit an error
> > halfway through. Should we undo previously updated swap entries, or am
> > I missing something here?
>
> we are safely immediately exiting because we don't change swap_map
> till we finish all checks. while all checks are done, we write them by
> WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
> at the end.
I see, but I think we may be overwriting the error from each iteration below?
>
> >
> > >
> > > - } else if (count || has_cache) {
> > > -
> > > - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > > - count += usage;
> > > - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > > - err = -EINVAL;
> > > - else if (swap_count_continued(p, offset, count))
> > > - count = COUNT_CONTINUED;
> > > - else
> > > - err = -ENOMEM;
> > > - } else
> > > - err = -ENOENT; /* unused swap entry */
> > > -
> > > - if (!err)
> > > - WRITE_ONCE(p->swap_map[offset], count | has_cache);
> > > + has_cache[i] = count[i] & SWAP_HAS_CACHE;
> > > + count[i] &= ~SWAP_HAS_CACHE;
> > > + err = 0;
> > > +
> > > + if (usage == SWAP_HAS_CACHE) {
> > > +
> > > + /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > > + if (!has_cache[i] && count[i])
> > > + has_cache[i] = SWAP_HAS_CACHE;
> > > + else if (has_cache[i]) /* someone else added cache */
> > > + err = -EEXIST;
> > > + else /* no users remaining */
> > > + err = -ENOENT;
> > > + } else if (count[i] || has_cache[i]) {
> > > +
> > > + if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > > + count[i] += usage;
> > > + else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > > + err = -EINVAL;
> > > + else if (swap_count_continued(p, offset + i, count[i]))
> > > + count[i] = COUNT_CONTINUED;
> > > + else
> > > + err = -ENOMEM;
> > > + } else
> > > + err = -ENOENT; /* unused swap entry */
> > > + }
> > >
> > > + if (!err) {
> > > + for (i = 0; i < nr; i++)
> > > + WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
>
> Here is the place where we really write data. Before that, we only
> touched temp variables.
>
> > > + }
> > > unlock_out:
> > > unlock_cluster_or_swap_info(p, ci);
> > > return err;
> > > }
>
> thanks
> Barry
On Thu, Feb 29, 2024 at 2:12 PM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Feb 28, 2024 at 4:59 PM Barry Song <[email protected]> wrote:
> >
> > On Thu, Feb 29, 2024 at 1:52 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Wed, Feb 28, 2024 at 4:39 PM Barry Song <[email protected]> wrote:
> > > >
> > > > From: Barry Song <[email protected]>
> > > >
> > > > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> > > > one entry only, to support large folio swap-in, we need to handle multiple
> > > > swap entries.
> > > >
> > > > Cc: Kairui Song <[email protected]>
> > > > Cc: "Huang, Ying" <[email protected]>
> > > > Cc: Yu Zhao <[email protected]>
> > > > Cc: David Hildenbrand <[email protected]>
> > > > Cc: Chris Li <[email protected]>
> > > > Cc: Hugh Dickins <[email protected]>
> > > > Cc: Johannes Weiner <[email protected]>
> > > > Cc: Matthew Wilcox (Oracle) <[email protected]>
> > > > Cc: Michal Hocko <[email protected]>
> > > > Cc: Minchan Kim <[email protected]>
> > > > Cc: Yosry Ahmed <[email protected]>
> > > > Cc: Yu Zhao <[email protected]>
> > > > Cc: SeongJae Park <[email protected]>
> > > > Signed-off-by: Barry Song <[email protected]>
> > > > ---
> > > > include/linux/swap.h | 1 +
> > > > mm/swap.h | 1 +
> > > > mm/swapfile.c | 117 ++++++++++++++++++++++++++-----------------
> > > > 3 files changed, 72 insertions(+), 47 deletions(-)
> > > >
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index b3581c976e5f..2691c739d9a4 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > > > extern void swap_shmem_alloc(swp_entry_t);
> > > > extern int swap_duplicate(swp_entry_t);
> > > > extern int swapcache_prepare(swp_entry_t);
> > > > +extern int swapcache_prepare_nr(swp_entry_t, int nr);
> > > > extern void swap_free(swp_entry_t);
> > > > extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> > > > extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > > > diff --git a/mm/swap.h b/mm/swap.h
> > > > index fc2f6ade7f80..1cec991efcda 100644
> > > > --- a/mm/swap.h
> > > > +++ b/mm/swap.h
> > > > @@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
> > > > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > > > unsigned long end);
> > > > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> > > > +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
> > > > struct folio *swap_cache_get_folio(swp_entry_t entry,
> > > > struct vm_area_struct *vma, unsigned long addr);
> > > > struct folio *filemap_get_incore_folio(struct address_space *mapping,
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index c0c058ee7b69..c8c8b6dbaeda 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> > > > @@ -3308,7 +3308,7 @@ void si_swapinfo(struct sysinfo *val)
> > > > }
> > > >
> > > > /*
> > > > - * Verify that a swap entry is valid and increment its swap map count.
> > > > + * Verify that nr swap entries are valid and increment their swap map count.
> > > > *
> > > > * Returns error code in following case.
> > > > * - success -> 0
> > > > @@ -3318,66 +3318,73 @@ void si_swapinfo(struct sysinfo *val)
> > > > * - swap-cache reference is requested but the entry is not used. -> ENOENT
> > > > * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
> > > > */
> > > > -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> > > > +static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
> > > > {
> > > > struct swap_info_struct *p;
> > > > struct swap_cluster_info *ci;
> > > > unsigned long offset;
> > > > - unsigned char count;
> > > > - unsigned char has_cache;
> > > > - int err;
> > > > + unsigned char count[SWAPFILE_CLUSTER];
> > > > + unsigned char has_cache[SWAPFILE_CLUSTER];
> > >
> >
> > Hi Yosry,
> >
> > Thanks for reviewing!
> >
> > > I am not closely following this series, but a couple of things caught my eyes.
> > >
> > > Is this reasonable for stack usage?
> >
> > SWAPFILE_CLUSTER is not huge. typically 512 or 256.
>
> So that's 1K of stack usage out of 16K total on x86. I think this may
> be a lot for a single function to use, but perhaps others will
> disagree.
>
> >
> > #ifdef CONFIG_THP_SWAP
> > #define SWAPFILE_CLUSTER HPAGE_PMD_NR
> >
> > #define swap_entry_size(size) (size)
> > #else
> > #define SWAPFILE_CLUSTER 256
> >
> > If this is still a concern, I can move it to a bitmap.
> >
> > >
> > > > + int err, i;
> > > >
> > > > p = swp_swap_info(entry);
> > > >
> > > > offset = swp_offset(entry);
> > > > ci = lock_cluster_or_swap_info(p, offset);
> > > >
> > > > - count = p->swap_map[offset];
> > > > -
> > > > - /*
> > > > - * swapin_readahead() doesn't check if a swap entry is valid, so the
> > > > - * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > > > - */
> > > > - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
> > > > - err = -ENOENT;
> > > > - goto unlock_out;
> > > > - }
> > > > + for (i = 0; i < nr; i++) {
> > > > + count[i] = p->swap_map[offset + i];
> > > >
> > > > - has_cache = count & SWAP_HAS_CACHE;
> > > > - count &= ~SWAP_HAS_CACHE;
> > > > - err = 0;
> > > > -
> > > > - if (usage == SWAP_HAS_CACHE) {
> > > > -
> > > > - /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > > > - if (!has_cache && count)
> > > > - has_cache = SWAP_HAS_CACHE;
> > > > - else if (has_cache) /* someone else added cache */
> > > > - err = -EEXIST;
> > > > - else /* no users remaining */
> > > > + /*
> > > > + * swapin_readahead() doesn't check if a swap entry is valid, so the
> > > > + * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > > > + */
> > > > + if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
> > > > err = -ENOENT;
> > > > + goto unlock_out;
> > > > + }
> > >
> >
> >
> > > Here we immediately exit if there is an error, but we don't below, we
> > > just keep overwriting the error every iteration as far as I can tell.
> > > Also, it doesn't seem like we do any cleanups if we hit an error
> > > halfway through. Should we undo previously updated swap entries, or am
> > > I missing something here?
> >
> > we are safely immediately exiting because we don't change swap_map
> > till we finish all checks. while all checks are done, we write them by
> > WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
> > at the end.
>
> I see, but I think we may be overwriting the error from each iteration below?
you are right, Yosry, i used to have the below to break. don't know
when I accidentally
dropped it :-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c8c8b6dbaeda..091966056aec 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3369,6 +3369,9 @@ static int __swap_duplicate_nr(swp_entry_t
entry, int nr, unsigned char usage)
err = -ENOMEM;
} else
err = -ENOENT; /* unused swap entry */
+
+ if (err)
+ break;
}
if (!err) {
>
> >
> > >
> > > >
> > > > - } else if (count || has_cache) {
> > > > -
> > > > - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > > > - count += usage;
> > > > - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > > > - err = -EINVAL;
> > > > - else if (swap_count_continued(p, offset, count))
> > > > - count = COUNT_CONTINUED;
> > > > - else
> > > > - err = -ENOMEM;
> > > > - } else
> > > > - err = -ENOENT; /* unused swap entry */
> > > > -
> > > > - if (!err)
> > > > - WRITE_ONCE(p->swap_map[offset], count | has_cache);
> > > > + has_cache[i] = count[i] & SWAP_HAS_CACHE;
> > > > + count[i] &= ~SWAP_HAS_CACHE;
> > > > + err = 0;
> > > > +
> > > > + if (usage == SWAP_HAS_CACHE) {
> > > > +
> > > > + /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > > > + if (!has_cache[i] && count[i])
> > > > + has_cache[i] = SWAP_HAS_CACHE;
> > > > + else if (has_cache[i]) /* someone else added cache */
> > > > + err = -EEXIST;
> > > > + else /* no users remaining */
> > > > + err = -ENOENT;
> > > > + } else if (count[i] || has_cache[i]) {
> > > > +
> > > > + if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > > > + count[i] += usage;
> > > > + else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > > > + err = -EINVAL;
> > > > + else if (swap_count_continued(p, offset + i, count[i]))
> > > > + count[i] = COUNT_CONTINUED;
> > > > + else
> > > > + err = -ENOMEM;
> > > > + } else
> > > > + err = -ENOENT; /* unused swap entry */
> > > > + }
> > > >
> > > > + if (!err) {
> > > > + for (i = 0; i < nr; i++)
> > > > + WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
> >
> > Here is the place where we really write data. Before that, we only
> > touched temp variables.
> >
> > > > + }
> > > > unlock_out:
> > > > unlock_cluster_or_swap_info(p, ci);
> > > > return err;
> > > > }
> >
thanks
Barry
On Thu, Feb 29, 2024 at 1:39 PM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a whole.
>
> On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page. On the other hand, it can also decrease do_swap_page as PTEs
> used to be set one by one even we hit a large folio in swapcache.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 191 ++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 157 insertions(+), 34 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 90b08b7cbaac..471689ce4e91 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -104,9 +104,16 @@ struct page *mem_map;
> EXPORT_SYMBOL(mem_map);
> #endif
>
> +/* A choice of behaviors for alloc_anon_folio() */
> +enum behavior {
> + DO_SWAP_PAGE,
> + DO_ANON_PAGE,
> +};
> +
> static vm_fault_t do_fault(struct vm_fault *vmf);
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> static bool vmf_pte_changed(struct vm_fault *vmf);
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf, enum behavior behavior);
>
> /*
> * Return true if the original pte was a uffd-wp pte marker (so the pte was
> @@ -3974,6 +3981,52 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
> }
>
> +/*
> + * check a range of PTEs are completely swap entries with
> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> + * pte must be first one in the range
> + */
> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> +{
> + int i;
> + struct swap_info_struct *si;
> + swp_entry_t entry;
> + unsigned type;
> + pgoff_t start_offset;
> + char has_cache;
> +
> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> + if (non_swap_entry(entry))
> + return false;
> + start_offset = swp_offset(entry);
> + if (start_offset % nr_pages)
> + return false;
> +
> + si = swp_swap_info(entry);
> + type = swp_type(entry);
> + has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> + for (i = 1; i < nr_pages; i++) {
> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> + if (non_swap_entry(entry))
> + return false;
> + if (swp_offset(entry) != start_offset + i)
> + return false;
> + if (swp_type(entry) != type)
> + return false;
> + /*
> + * while allocating a large folio and doing swap_read_folio for the
> + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> + * doesn't have swapcache. We need to ensure all PTEs have no cache
> + * as well, otherwise, we might go to swap devices while the content
> + * is in swapcache
> + */
> + if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> + return false;
> + }
> +
> + return true;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3995,6 +4048,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address;
> + pte_t *start_pte;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4058,28 +4114,32 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> - /*
> - * Prevent parallel swapin from proceeding with
> - * the cache flag. Otherwise, another thread may
> - * finish swapin first, free the entry, and swapout
> - * reusing the same entry. It's undetectable as
> - * pte_same() returns true due to entry reuse.
> - */
> - if (swapcache_prepare(entry)) {
> - /* Relax a bit to prevent rapid repeated page faults */
> - schedule_timeout_uninterruptible(1);
> - goto out;
> - }
> - need_clear_cache = true;
> -
> /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + folio = alloc_anon_folio(vmf, DO_SWAP_PAGE);
> page = &folio->page;
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> + if (folio_test_large(folio)) {
> + nr_pages = folio_nr_pages(folio);
> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> + }
> +
> + /*
> + * Prevent parallel swapin from proceeding with
> + * the cache flag. Otherwise, another thread may
> + * finish swapin first, free the entry, and swapout
> + * reusing the same entry. It's undetectable as
> + * pte_same() returns true due to entry reuse.
> + */
> + if (swapcache_prepare_nr(entry, nr_pages)) {
> + /* Relax a bit to prevent rapid repeated page faults */
> + schedule_timeout_uninterruptible(1);
> + goto out;
> + }
> + need_clear_cache = true;
> +
> if (mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> @@ -4185,6 +4245,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
> +
> + start_address = vmf->address;
> + start_pte = vmf->pte;
> + if (folio_test_large(folio)) {
> + unsigned long nr = folio_nr_pages(folio);
> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> + pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> +
> + /*
> + * case 1: we are allocating large_folio, try to map it as a whole
> + * iff the swap entries are still entirely mapped;
> + * case 2: we hit a large folio in swapcache, and all swap entries
> + * are still entirely mapped, try to map a large folio as a whole.
> + * otherwise, map only the faulting page within the large folio
> + * which is swapcache
> + */
> + if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> + if (nr_pages > 1) /* ptes have changed for case 1 */
> + goto out_nomap;
> + goto check_pte;
> + }
> +
> + start_address = addr;
> + start_pte = aligned_pte;
> + /*
> + * the below has been done before swap_read_folio()
> + * for case 1
> + */
> + if (unlikely(folio == swapcache)) {
> + nr_pages = nr;
> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> + page = &folio->page;
> + }
> + }
> +
> +check_pte:
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4252,12 +4348,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_nr_free(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4267,14 +4365,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4283,17 +4381,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, start_address);
> folio_add_lru_vma(folio, vma);
> + } else if (!folio_test_anon(folio)) {
> + folio_add_new_anon_rmap(folio, vma, start_address);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4310,6 +4410,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> if (vmf->flags & FAULT_FLAG_WRITE) {
> + if (nr_pages > 1)
> + vmf->orig_pte = ptep_get(vmf->pte);
> +
> ret |= do_wp_page(vmf);
> if (ret & VM_FAULT_ERROR)
> ret &= VM_FAULT_ERROR;
> @@ -4317,14 +4420,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
> /* Clear the swap cache pin for direct swapin after PTL unlock */
> if (need_clear_cache)
> - swapcache_clear(si, entry);
> + swapcache_clear_nr(si, entry, nr_pages);
> if (si)
> put_swap_device(si);
> return ret;
> @@ -4340,7 +4443,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_put(swapcache);
> }
> if (need_clear_cache)
> - swapcache_clear(si, entry);
> + swapcache_clear_nr(si, entry, nr_pages);
> if (si)
> put_swap_device(si);
> return ret;
> @@ -4358,7 +4461,7 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> return true;
> }
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf, enum behavior behavior)
> {
> struct vm_area_struct *vma = vmf->vma;
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -4376,6 +4479,19 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> if (unlikely(userfaultfd_armed(vma)))
> goto fallback;
>
> + /*
> + * a large folio being swapped-in could be partially in
> + * zswap and partially in swap devices, zswap doesn't
> + * support large folios yet, we might get corrupted
> + * zero-filled data by reading all subpages from swap
> + * devices while some of them are actually in zswap
> + */
> + if (behavior == DO_SWAP_PAGE && is_zswap_enabled())
> + goto fallback;
> +
> + if (unlikely(behavior != DO_ANON_PAGE && behavior != DO_SWAP_PAGE))
> + return ERR_PTR(-EINVAL);
> +
> /*
> * Get a list of all the (large) orders below PMD_ORDER that are enabled
> * for this vma. Then filter out the orders that can't be allocated over
> @@ -4393,15 +4509,22 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> return ERR_PTR(-EAGAIN);
>
> /*
> - * Find the highest order where the aligned range is completely
> - * pte_none(). Note that all remaining orders will be completely
> + * For do_anonymous_page, find the highest order where the aligned range is
> + * completely pte_none(). Note that all remaining orders will be completely
> * pte_none().
> + * For do_swap_page, find the highest order where the aligned range is
> + * completely swap entries with contiguous swap offsets.
> */
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> - break;
> + if (behavior == DO_ANON_PAGE) {
> + if (pte_range_none(pte + pte_index(addr), 1 << order))
> + break;
> + } else {
> + if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> + break;
> + }
> order = next_order(&orders, order);
> }
We have a problem here. alloc_anon_folio() is charging folio
/* Try allocating the highest of the remaining orders. */
gfp = vma_thp_gfp_mask(vma);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
folio = vma_alloc_folio(gfp, order, vma, addr, true);
if (folio) {
if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
folio_put(folio);
goto next;
}
folio_throttle_swaprate(folio, gfp);
clear_huge_page(&folio->page, vmf->address, 1 << order);
return folio;
}
next:
order = next_order(&orders, order);
}
This is necessary for DO_ANON_PAGE. but for DO_SWAP_PAGE, this is
wrong.
because in do_swap_page, mem_cgroup_swapin_charge_folio() is done again.
if (mem_cgroup_swapin_charge_folio(folio,
vma->vm_mm, GFP_KERNEL,
entry)) {
ret = VM_FAULT_OOM;
goto out_page;
}
So in the do_swap_page() case, charging is done twice.
will get it fixed in v3.
This is also true for folio_prealloc() at the end of alloc_anon_folio()
>
> @@ -4485,7 +4608,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, DO_ANON_PAGE);
> if (IS_ERR(folio))
> return 0;
> if (!folio)
> --
> 2.34.1
>
Thanks
Barry