2024-03-04 08:14:27

by Barry Song

Subject: [RFC PATCH v3 0/5] mm: support large folios swap-in

From: Barry Song <[email protected]>

-v3:
* avoid over-writing err in __swap_duplicate_nr, pointed out by Yosry,
thanks!
* fix the issue where the folio is charged twice in do_swap_page by
separating alloc_anon_folio and alloc_swap_folio, as the two now differ
in several ways:
* memcg charging
* whether the allocated folio is cleared

-v2:
https://lore.kernel.org/linux-mm/[email protected]/
* lots of code cleanup according to Chris's comments, thanks!
* collect Chris's ack tags, thanks!
* address David's comment on moving to use folio_add_new_anon_rmap
for !folio_test_anon in do_swap_page, thanks!
* remove the MADV_PAGEOUT patch from this series as Ryan will
integrate it into his swap-out series
* apply Kairui's work of "mm/swap: fix race when skipping swapcache"
to large folio swap-in as well
* fix corrupted (zero-filled) data in two races by checking
SWAP_HAS_CACHE while swapping in a large folio: the zswap case, and
the case where part of the entries are in the swapcache while the
others are not

-v1:
https://lore.kernel.org/all/[email protected]/#t

On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to the
background, most of its memory might be swapped out.

Now we have mTHP features, but unfortunately, if we don't support large folio
swap-in, once those large folios are swapped out, we immediately lose the
performance gain we can get from large folios and hardware optimizations
such as CONT-PTE.

In theory, we don't need to rely on Ryan's swap-out patchset[1]. That is to say,
even if some memory was swapped out as normal pages, we could still swap it
back in as large folios. But this might require I/O at random places in swap
devices. So we limit large folio swap-in to those areas which were large folios
before being swapped out, i.e. whose swap entries are also contiguous in the
swap device. On the other hand, in OPPO's products, we've deployed anon large
folios on millions of phones[2]. We enhanced zsmalloc and zRAM to compress and
decompress large folios as a whole, which helps improve the compression ratio
and decrease CPU consumption significantly. In zsmalloc and zRAM we can save
large objects whose original size is, for example, 64KiB (related patches are
coming). So it is also a good choice for us to support swapping in large folios
for those compressed large objects, as a large folio can be decompressed all
together.

Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
with MTE" to this series as it might help review.

[1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
https://lore.kernel.org/linux-mm/[email protected]/
[2] OnePlusOSS / android_kernel_oneplus_sm8550
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

Barry Song (2):
arm64: mm: swap: support THP_SWAP on hardware with MTE
mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for
large folios swap-in

Chuanhua Han (3):
mm: swap: introduce swap_nr_free() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: support large folios swapin as a whole

arch/arm64/include/asm/pgtable.h | 19 +--
arch/arm64/mm/mteswap.c | 43 ++++++
include/linux/huge_mm.h | 12 --
include/linux/pgtable.h | 2 +-
include/linux/swap.h | 7 +
mm/memory.c | 252 ++++++++++++++++++++++++++-----
mm/page_io.c | 2 +-
mm/swap.h | 1 +
mm/swap_slots.c | 2 +-
mm/swapfile.c | 153 +++++++++++++------
10 files changed, 376 insertions(+), 117 deletions(-)

--
2.34.1



2024-03-04 08:14:53

by Barry Song

Subject: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

From: Chuanhua Han <[email protected]>

While swapping in a large folio, we need to free the swap entries for the
whole folio. To avoid frequently acquiring and releasing swap locks, it is
better to introduce an API for batched freeing.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/swap.h | 6 ++++++
mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2955f7a78d8d..d6ab27929458 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
+extern void swap_nr_free(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern int free_swap_and_cache(swp_entry_t);
int swap_type_of(dev_t device, sector_t offset);
@@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
{
}

+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+
+}
+
static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3f594be83b58..244106998a69 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
__swap_entry_free(p, entry);
}

+/*
+ * Called after swapping in a large folio, batched free swap entries
+ * for this large folio, entry should be for the first subpage and
+ * its offset is aligned with nr_pages
+ */
+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+ int i;
+ struct swap_cluster_info *ci;
+ struct swap_info_struct *p;
+ unsigned type = swp_type(entry);
+ unsigned long offset = swp_offset(entry);
+ DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
+
+ /* all swap entries are within a cluster for mTHP */
+ VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+ if (nr_pages == 1) {
+ swap_free(entry);
+ return;
+ }
+
+ p = _swap_info_get(entry);
+
+ ci = lock_cluster(p, offset);
+ for (i = 0; i < nr_pages; i++) {
+ if (__swap_entry_free_locked(p, offset + i, 1))
+ __bitmap_set(usage, i, 1);
+ }
+ unlock_cluster(ci);
+
+ for_each_clear_bit(i, usage, nr_pages)
+ free_swap_slot(swp_entry(type, offset + i));
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
--
2.34.1


2024-03-04 08:15:35

by Barry Song

Subject: [RFC PATCH v3 4/5] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in

From: Barry Song <[email protected]>

Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
one entry only. To support large folio swap-in, we need to handle multiple
swap entries.

Cc: Kairui Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/swap.h | 1 +
mm/swap.h | 1 +
mm/swapfile.c | 118 ++++++++++++++++++++++++++-----------------
3 files changed, 74 insertions(+), 46 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d6ab27929458..22105f0fe2d4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
+extern int swapcache_prepare_nr(swp_entry_t entry, int nr);
extern void swap_free(swp_entry_t);
extern void swap_nr_free(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
diff --git a/mm/swap.h b/mm/swap.h
index fc2f6ade7f80..1cec991efcda 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *swap_cache_get_folio(swp_entry_t entry,
struct vm_area_struct *vma, unsigned long addr);
struct folio *filemap_get_incore_folio(struct address_space *mapping,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 244106998a69..bae1b8165b11 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3309,7 +3309,7 @@ void si_swapinfo(struct sysinfo *val)
}

/*
- * Verify that a swap entry is valid and increment its swap map count.
+ * Verify that nr swap entries are valid and increment their swap map count.
*
* Returns error code in following case.
* - success -> 0
@@ -3319,66 +3319,76 @@ void si_swapinfo(struct sysinfo *val)
* - swap-cache reference is requested but the entry is not used. -> ENOENT
* - swap-mapped reference requested but needs continued swap count. -> ENOMEM
*/
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
{
struct swap_info_struct *p;
struct swap_cluster_info *ci;
unsigned long offset;
- unsigned char count;
- unsigned char has_cache;
- int err;
+ unsigned char count[SWAPFILE_CLUSTER];
+ unsigned char has_cache[SWAPFILE_CLUSTER];
+ int err, i;

p = swp_swap_info(entry);

offset = swp_offset(entry);
ci = lock_cluster_or_swap_info(p, offset);

- count = p->swap_map[offset];
-
- /*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
- */
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
- }
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
- err = 0;
-
- if (usage == SWAP_HAS_CACHE) {
+ for (i = 0; i < nr; i++) {
+ count[i] = p->swap_map[offset + i];

- /* set SWAP_HAS_CACHE if there is no cache and entry is used */
- if (!has_cache && count)
- has_cache = SWAP_HAS_CACHE;
- else if (has_cache) /* someone else added cache */
- err = -EEXIST;
- else /* no users remaining */
+ /*
+ * swapin_readahead() doesn't check if a swap entry is valid, so the
+ * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+ */
+ if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
err = -ENOENT;
+ goto unlock_out;
+ }

- } else if (count || has_cache) {
-
- if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
- count += usage;
- else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
- err = -EINVAL;
- else if (swap_count_continued(p, offset, count))
- count = COUNT_CONTINUED;
- else
- err = -ENOMEM;
- } else
- err = -ENOENT; /* unused swap entry */
+ has_cache[i] = count[i] & SWAP_HAS_CACHE;
+ count[i] &= ~SWAP_HAS_CACHE;
+ err = 0;
+
+ if (usage == SWAP_HAS_CACHE) {
+
+ /* set SWAP_HAS_CACHE if there is no cache and entry is used */
+ if (!has_cache[i] && count[i])
+ has_cache[i] = SWAP_HAS_CACHE;
+ else if (has_cache[i]) /* someone else added cache */
+ err = -EEXIST;
+ else /* no users remaining */
+ err = -ENOENT;
+ } else if (count[i] || has_cache[i]) {
+
+ if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ count[i] += usage;
+ else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
+ err = -EINVAL;
+ else if (swap_count_continued(p, offset + i, count[i]))
+ count[i] = COUNT_CONTINUED;
+ else
+ err = -ENOMEM;
+ } else
+ err = -ENOENT; /* unused swap entry */

- if (!err)
- WRITE_ONCE(p->swap_map[offset], count | has_cache);
+ if (err)
+ break;
+ }

+ if (!err) {
+ for (i = 0; i < nr; i++)
+ WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
+ }
unlock_out:
unlock_cluster_or_swap_info(p, ci);
return err;
}

+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+ return __swap_duplicate_nr(entry, 1, usage);
+}
+
/*
* Help swapoff by noting that swap entry belongs to shmem/tmpfs
* (in which case its reference count is never incremented).
@@ -3417,17 +3427,33 @@ int swapcache_prepare(swp_entry_t entry)
return __swap_duplicate(entry, SWAP_HAS_CACHE);
}

-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+int swapcache_prepare_nr(swp_entry_t entry, int nr)
+{
+ return __swap_duplicate_nr(entry, nr, SWAP_HAS_CACHE);
+}
+
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr)
{
struct swap_cluster_info *ci;
unsigned long offset = swp_offset(entry);
- unsigned char usage;
+ unsigned char usage[SWAPFILE_CLUSTER];
+ int i;

ci = lock_cluster_or_swap_info(si, offset);
- usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
+ for (i = 0; i < nr; i++)
+ usage[i] = __swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE);
unlock_cluster_or_swap_info(si, ci);
- if (!usage)
- free_swap_slot(entry);
+ for (i = 0; i < nr; i++) {
+ if (!usage[i]) {
+ free_swap_slot(entry);
+ entry.val++;
+ }
+ }
+}
+
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+ swapcache_clear_nr(si, entry, 1);
}

struct swap_info_struct *swp_swap_info(swp_entry_t entry)
--
2.34.1


2024-03-04 08:15:43

by Barry Song

Subject: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

From: Chuanhua Han <[email protected]>

On an embedded system like Android, more than half of anon memory is
actually in swap devices such as zRAM. For example, while an app is
switched to the background, most of its memory might be swapped out.

Now we have mTHP features, but unfortunately, if we don't support large
folio swap-in, once those large folios are swapped out, we immediately lose
the performance gain we can get from large folios and hardware optimizations
such as CONT-PTE.

This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
to those contiguous swaps which were likely swapped out from mTHP as a
whole.

Meanwhile, the current implementation only covers the SWAP_SYNCHRONOUS
case. It doesn't support large folios in swapin_readahead() yet, since that
kind of shared memory is much smaller than memory mapped by a single process.

Right now, we re-fault large folios which are still in the swapcache as a
whole. This effectively reduces the extra loops and early exits which we
introduced in arch_swap_restore() while supporting MTE restore for folios
rather than pages. It also reduces the number of do_swap_page() runs, since
PTEs used to be set one by one even when we hit a large folio in the swapcache.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 212 insertions(+), 38 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e0d34d705e07..501ede745ef3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}

+/*
+ * check a range of PTEs are completely swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE.
+ * pte must be first one in the range
+ */
+static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
+{
+ int i;
+ struct swap_info_struct *si;
+ swp_entry_t entry;
+ unsigned type;
+ pgoff_t start_offset;
+ char has_cache;
+
+ entry = pte_to_swp_entry(ptep_get_lockless(pte));
+ if (non_swap_entry(entry))
+ return false;
+ start_offset = swp_offset(entry);
+ if (start_offset % nr_pages)
+ return false;
+
+ si = swp_swap_info(entry);
+ type = swp_type(entry);
+ has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
+ for (i = 1; i < nr_pages; i++) {
+ entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
+ if (non_swap_entry(entry))
+ return false;
+ if (swp_offset(entry) != start_offset + i)
+ return false;
+ if (swp_type(entry) != type)
+ return false;
+ /*
+ * while allocating a large folio and doing swap_read_folio for the
+ * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
+ * doesn't have swapcache. We need to ensure all PTEs have no cache
+ * as well, otherwise, we might go to swap devices while the content
+ * is in swapcache
+ */
+ if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
+ return false;
+ }
+
+ return true;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Get a list of all the (large) orders below PMD_ORDER that are enabled
+ * for this vma. Then filter out the orders that can't be allocated over
+ * the faulting address and still be fully contained in the vma.
+ */
+static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long orders;
+
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
+ BIT(PMD_ORDER) - 1);
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ return orders;
+}
+#endif
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ unsigned long orders;
+ struct folio *folio;
+ unsigned long addr;
+ pte_t *pte;
+ gfp_t gfp;
+ int order;
+
+ /*
+ * If uffd is active for the vma we need per-page fault fidelity to
+ * maintain the uffd semantics.
+ */
+ if (unlikely(userfaultfd_armed(vma)))
+ goto fallback;
+
+ /*
+ * a large folio being swapped-in could be partially in
+ * zswap and partially in swap devices, zswap doesn't
+ * support large folios yet, we might get corrupted
+ * zero-filled data by reading all subpages from swap
+ * devices while some of them are actually in zswap
+ */
+ if (is_zswap_enabled())
+ goto fallback;
+
+ orders = get_alloc_folio_orders(vmf);
+ if (!orders)
+ goto fallback;
+
+ pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+ if (unlikely(!pte))
+ goto fallback;
+
+ /*
+ * For do_swap_page, find the highest order where the aligned range is
+ * completely swap entries with contiguous swap offsets.
+ */
+ order = highest_order(orders);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ pte_unmap(pte);
+
+ /* Try allocating the highest of the remaining orders. */
+ gfp = vma_thp_gfp_mask(vma);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ folio = vma_alloc_folio(gfp, order, vma, addr, true);
+ if (folio)
+ return folio;
+ order = next_order(&orders, order);
+ }
+
+fallback:
+#endif
+ return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+}
+
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages = 1;
+ unsigned long start_address;
+ pte_t *start_pte;

if (!pte_unmap_same(vmf))
goto out;
@@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread may
- * finish swapin first, free the entry, and swapout
- * reusing the same entry. It's undetectable as
- * pte_same() returns true due to entry reuse.
- */
- if (swapcache_prepare(entry)) {
- /* Relax a bit to prevent rapid repeated page faults */
- schedule_timeout_uninterruptible(1);
- goto out;
- }
- need_clear_cache = true;
-
/* skip swapcache */
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
+ folio = alloc_swap_folio(vmf);
page = &folio->page;
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);

+ if (folio_test_large(folio)) {
+ nr_pages = folio_nr_pages(folio);
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
+ }
+
+ /*
+ * Prevent parallel swapin from proceeding with
+ * the cache flag. Otherwise, another thread may
+ * finish swapin first, free the entry, and swapout
+ * reusing the same entry. It's undetectable as
+ * pte_same() returns true due to entry reuse.
+ */
+ if (swapcache_prepare_nr(entry, nr_pages)) {
+ /* Relax a bit to prevent rapid repeated page faults */
+ schedule_timeout_uninterruptible(1);
+ goto out;
+ }
+ need_clear_cache = true;
+
if (mem_cgroup_swapin_charge_folio(folio,
vma->vm_mm, GFP_KERNEL,
entry)) {
ret = VM_FAULT_OOM;
goto out_page;
}
- mem_cgroup_swapin_uncharge_swap(entry);
+
+ for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
+ mem_cgroup_swapin_uncharge_swap(e);

shadow = get_shadow_from_swap_cache(entry);
if (shadow)
@@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+
+ start_address = vmf->address;
+ start_pte = vmf->pte;
+ if (start_pte && folio_test_large(folio)) {
+ unsigned long nr = folio_nr_pages(folio);
+ unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+ pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
+
+ /*
+ * case 1: we are allocating large_folio, try to map it as a whole
+ * iff the swap entries are still entirely mapped;
+ * case 2: we hit a large folio in swapcache, and all swap entries
+ * are still entirely mapped, try to map a large folio as a whole.
+ * otherwise, map only the faulting page within the large folio
+ * which is swapcache
+ */
+ if (!is_pte_range_contig_swap(aligned_pte, nr)) {
+ if (nr_pages > 1) /* ptes have changed for case 1 */
+ goto out_nomap;
+ goto check_pte;
+ }
+
+ start_address = addr;
+ start_pte = aligned_pte;
+ /*
+ * the below has been done before swap_read_folio()
+ * for case 1
+ */
+ if (unlikely(folio == swapcache)) {
+ nr_pages = nr;
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
+ page = &folio->page;
+ }
+ }
+
+check_pte:
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

@@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_nr_free(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
pte = mk_pte(page, vma->vm_page_prot);

/*
@@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || folio_ref_count(folio) == 1)) {
+ (exclusive || folio_ref_count(folio) == nr_pages)) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
@@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_add_new_anon_rmap(folio, vma, start_address);
folio_add_lru_vma(folio, vma);
+ } else if (!folio_test_anon(folio)) {
+ folio_add_new_anon_rmap(folio, vma, start_address);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
rmap_flags);
}

VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+ arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

if (vmf->flags & FAULT_FLAG_WRITE) {
+ if (nr_pages > 1)
+ vmf->orig_pte = ptep_get(vmf->pte);
+
ret |= do_wp_page(vmf);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
@@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
/* Clear the swap cache pin for direct swapin after PTL unlock */
if (need_clear_cache)
- swapcache_clear(si, entry);
+ swapcache_clear_nr(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
@@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_put(swapcache);
}
if (need_clear_cache)
- swapcache_clear(si, entry);
+ swapcache_clear_nr(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
@@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
if (unlikely(userfaultfd_armed(vma)))
goto fallback;

- /*
- * Get a list of all the (large) orders below PMD_ORDER that are enabled
- * for this vma. Then filter out the orders that can't be allocated over
- * the faulting address and still be fully contained in the vma.
- */
- orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
- BIT(PMD_ORDER) - 1);
- orders = thp_vma_suitable_orders(vma, vmf->address, orders);
-
+ orders = get_alloc_folio_orders(vmf);
if (!orders)
goto fallback;

--
2.34.1


2024-03-04 08:16:02

by Barry Song

Subject: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

From: Barry Song <[email protected]>

Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
THP_SWAP on ARM64, but it doesn't enable THP_SWAP on hardware with
MTE, as the MTE code works under the assumption that tag save/restore
always handles a folio with only one page.

The limitation should be removed as more and more ARM64 SoCs have
this feature. Co-existence of MTE and THP_SWAP becomes more and
more important.

This patch makes MTE tags saving support large folios, then we don't
need to split large folios into base pages for swapping out on ARM64
SoCs with MTE any more.

arch_prepare_to_swap() should take folio rather than page as parameter
because we support THP swap-out as a whole. It saves tags for all
pages in a large folio.

As we are now restoring tags based on folios, in arch_swap_restore()
we may add some extra loops and early exits while refaulting a large
folio which is still in the swapcache in do_swap_page(). If a large
folio has nr pages, do_swap_page() will only set the PTE of the
particular page which is causing the page fault.
Thus do_swap_page() runs nr times, and each time arch_swap_restore()
will loop over the nr subpages in the folio. So right now the
algorithmic complexity becomes O(nr^2).

Once we support mapping large folios in do_swap_page(), the extra loops
and early exits will decrease, though they won't be completely removed,
as a large folio might only be partially tagged in corner cases such as:
1. a large folio in swapcache can be partially unmapped, thus, MTE
tags for the unmapped pages will be invalidated;
2. users might use mprotect() to set MTEs on a part of a large folio.

arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
who needed it.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: "Mike Rapoport (IBM)" <[email protected]>
Cc: Hugh Dickins <[email protected]>
CC: "Aneesh Kumar K.V" <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Acked-by: Chris Li <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 19 ++------------
arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
include/linux/huge_mm.h | 12 ---------
include/linux/pgtable.h | 2 +-
mm/page_io.c | 2 +-
mm/swap_slots.c | 2 +-
6 files changed, 48 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 401087e8a43d..7a54750770b8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-static inline bool arch_thp_swp_supported(void)
-{
- return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
/*
* Outside of a few very special situations (e.g. hibernation), we always
* use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
#ifdef CONFIG_ARM64_MTE

#define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
- if (system_supports_mte())
- return mte_save_tags(page);
- return 0;
-}
+extern int arch_prepare_to_swap(struct folio *folio);

#define __HAVE_ARCH_SWAP_INVALIDATE
static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
}

#define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
- if (system_supports_mte())
- mte_restore_tags(entry, &folio->page);
-}
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);

#endif /* CONFIG_ARM64_MTE */

diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..295836fef620 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
mte_free_tag_storage(tags);
}

+static inline void __mte_invalidate_tags(struct page *page)
+{
+ swp_entry_t entry = page_swap_entry(page);
+
+ mte_invalidate_tags(swp_type(entry), swp_offset(entry));
+}
+
void mte_invalidate_tags_area(int type)
{
swp_entry_t entry = swp_entry(type, 0);
@@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
}
xa_unlock(&mte_pages);
}
+
+int arch_prepare_to_swap(struct folio *folio)
+{
+ long i, nr;
+ int err;
+
+ if (!system_supports_mte())
+ return 0;
+
+ nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ err = mte_save_tags(folio_page(folio, i));
+ if (err)
+ goto out;
+ }
+ return 0;
+
+out:
+ while (i--)
+ __mte_invalidate_tags(folio_page(folio, i));
+ return err;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+ if (system_supports_mte()) {
+ long i, nr = folio_nr_pages(folio);
+
+ entry.val -= swp_offset(entry) & (nr - 1);
+ for (i = 0; i < nr; i++) {
+ mte_restore_tags(entry, folio_page(folio, i));
+ entry.val++;
+ }
+ }
+}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de0c89105076..e04b93c43965 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
#define split_folio(f) split_folio_to_order(f, 0)

-/*
- * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
- * limitations in the implementation like arm64 MTE can override this to
- * false
- */
-#ifndef arch_thp_swp_supported
-static inline bool arch_thp_swp_supported(void)
-{
- return true;
-}
-#endif
-
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e1b22903f709..bfcfe3386934 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
* prototypes must be defined in the arch-specific asm/pgtable.h file.
*/
#ifndef __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
+static inline int arch_prepare_to_swap(struct folio *folio)
{
return 0;
}
diff --git a/mm/page_io.c b/mm/page_io.c
index ae2b49055e43..a9a7c236aecc 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
* Arch code may have to preserve more data than just the page
* contents, e.g. memory tags.
*/
- ret = arch_prepare_to_swap(&folio->page);
+ ret = arch_prepare_to_swap(folio);
if (ret) {
folio_mark_dirty(folio);
folio_unlock(folio);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 90973ce7881d..53abeaf1371d 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
entry.val = 0;

if (folio_test_large(folio)) {
- if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+ if (IS_ENABLED(CONFIG_THP_SWAP))
get_swap_pages(1, &entry, folio_nr_pages(folio));
goto out;
}
--
2.34.1


2024-03-04 08:16:36

by Barry Song

[permalink] [raw]
Subject: [RFC PATCH v3 3/5] mm: swap: make should_try_to_free_swap() support large-folio

From: Chuanhua Han <[email protected]>

should_try_to_free_swap() works under the assumption that swap-in is always
done at normal page granularity, i.e., folio_nr_pages = 1. To support large
folio swap-in, this patch removes that assumption.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Acked-by: Chris Li <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index abd4f33d62c9..e0d34d705e07 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3837,7 +3837,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
* reference only in case it's likely that we'll be the exlusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == 2;
+ folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}

static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
--
2.34.1


2024-03-11 16:56:36

by Ryan Roberts

Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On 04/03/2024 08:13, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> THP_SWAP on ARM64, but it doesn't enable THP_SWAP on hardware with
> MTE, as the MTE code works under the assumption that tag save/restore
> always handles a folio with only one page.
>
> The limitation should be removed as more and more ARM64 SoCs have
> this feature. Co-existence of MTE and THP_SWAP becomes more and
> more important.
>
> This patch makes MTE tags saving support large folios, then we don't
> need to split large folios into base pages for swapping out on ARM64
> SoCs with MTE any more.
>
> arch_prepare_to_swap() should take folio rather than page as parameter
> because we support THP swap-out as a whole. It saves tags for all
> pages in a large folio.
>
> As we are now restoring tags based on folios, in arch_swap_restore()
> we may add some extra loops and early exits while refaulting a large
> folio which is still in the swapcache in do_swap_page(). If a large
> folio has nr pages, do_swap_page() will only set the PTE of the
> particular page which is causing the page fault.
> Thus do_swap_page() runs nr times, and each time arch_swap_restore()
> will loop over the nr subpages in the folio. So right now the
> algorithmic complexity becomes O(nr^2).
>
> Once we support mapping large folios in do_swap_page(), the extra loops
> and early exits will decrease, though they won't be completely removed,
> as a large folio might only be partially tagged in corner cases such as:
> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> tags for the unmapped pages will be invalidated;
> 2. users might use mprotect() to set MTEs on a part of a large folio.
>
> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> who needed it.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Ryan Roberts <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Kemeng Shi <[email protected]>
> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> Cc: Peter Collingbourne <[email protected]>
> Cc: Steven Price <[email protected]>
> Cc: Yosry Ahmed <[email protected]>
> Cc: Peter Xu <[email protected]>
> Cc: Lorenzo Stoakes <[email protected]>
> Cc: "Mike Rapoport (IBM)" <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> CC: "Aneesh Kumar K.V" <[email protected]>
> Cc: Rick Edgecombe <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> Reviewed-by: Steven Price <[email protected]>
> Acked-by: Chris Li <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 19 ++------------
> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
> include/linux/huge_mm.h | 12 ---------
> include/linux/pgtable.h | 2 +-
> mm/page_io.c | 2 +-
> mm/swap_slots.c | 2 +-
> 6 files changed, 48 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 401087e8a43d..7a54750770b8 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static inline bool arch_thp_swp_supported(void)
> -{
> - return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
> /*
> * Outside of a few very special situations (e.g. hibernation), we always
> * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> #ifdef CONFIG_ARM64_MTE
>
> #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> - if (system_supports_mte())
> - return mte_save_tags(page);
> - return 0;
> -}
> +extern int arch_prepare_to_swap(struct folio *folio);
>
> #define __HAVE_ARCH_SWAP_INVALIDATE
> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
> }
>
> #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> - if (system_supports_mte())
> - mte_restore_tags(entry, &folio->page);
> -}
> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>
> #endif /* CONFIG_ARM64_MTE */
>
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..295836fef620 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> mte_free_tag_storage(tags);
> }
>
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> + swp_entry_t entry = page_swap_entry(page);
> +
> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
> void mte_invalidate_tags_area(int type)
> {
> swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
> }
> xa_unlock(&mte_pages);
> }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> + long i, nr;
> + int err;
> +
> + if (!system_supports_mte())
> + return 0;
> +
> + nr = folio_nr_pages(folio);
> +
> + for (i = 0; i < nr; i++) {
> + err = mte_save_tags(folio_page(folio, i));
> + if (err)
> + goto out;
> + }
> + return 0;
> +
> +out:
> + while (i--)
> + __mte_invalidate_tags(folio_page(folio, i));
> + return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)

I'm still not a fan of the fact that entry could be anywhere within the folio.

> +{
> + if (system_supports_mte()) {

nit: if you do:

if (!system_supports_mte())
return;

It will be consistent with arch_prepare_to_swap() and reduce the indentation of
the main body.

> + long i, nr = folio_nr_pages(folio);
> +
> + entry.val -= swp_offset(entry) & (nr - 1);

This assumes that folios are always stored in swap with natural alignment. Is
that definitely a safe assumption? My swap-out series is currently ensuring that
folios are swapped-out naturally aligned, but that is an implementation detail.

Your cover note for swap-in says that you could technically swap in a large
folio without it having been swapped-out large. If you chose to do that in
future, this would break, right? I don't think it's good to couple the swap
storage layout to the folio order that you want to swap into. Perhaps that's an
argument for passing each *page* to this function with its exact, corresponding
swap entry?
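
For illustration, the per-page shape could be as simple as the below rough
sketch (untested; the caller would then pass the exact swap entry for each
subpage it restores):

	void arch_swap_restore(swp_entry_t entry, struct page *page)
	{
		if (system_supports_mte())
			mte_restore_tags(entry, page);
	}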

> + for (i = 0; i < nr; i++) {
> + mte_restore_tags(entry, folio_page(folio, i));
> + entry.val++;
> + }
> + }
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index de0c89105076..e04b93c43965 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> #define split_folio(f) split_folio_to_order(f, 0)
>
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> - return true;
> -}
> -#endif
> -
> #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e1b22903f709..bfcfe3386934 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> */
> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
> {
> return 0;
> }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ae2b49055e43..a9a7c236aecc 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> * Arch code may have to preserve more data than just the page
> * contents, e.g. memory tags.
> */
> - ret = arch_prepare_to_swap(&folio->page);
> + ret = arch_prepare_to_swap(folio);
> if (ret) {
> folio_mark_dirty(folio);
> folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 90973ce7881d..53abeaf1371d 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> entry.val = 0;
>
> if (folio_test_large(folio)) {
> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> + if (IS_ENABLED(CONFIG_THP_SWAP))
> get_swap_pages(1, &entry, folio_nr_pages(folio));
> goto out;
> }


2024-03-11 18:55:31

by Ryan Roberts

Subject: Re: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

On 04/03/2024 08:13, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> While swapping in a large folio, we need to free the swap entries for the
> whole folio. To avoid frequently acquiring and releasing swap locks, it is
> better to introduce an API for batched freeing.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/swap.h | 6 ++++++
> mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2955f7a78d8d..d6ab27929458 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern void swap_free(swp_entry_t);
> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);

nit: In my swap-out v4 series, I've created a batched version of
free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
scheme doesn't really work when applied to free_swap_and_cache().

> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern int free_swap_and_cache(swp_entry_t);
> int swap_type_of(dev_t device, sector_t offset);
> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> {
> }
>
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> +
> +}
> +
> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> {
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3f594be83b58..244106998a69 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> __swap_entry_free(p, entry);
> }
>
> +/*
> + * Called after swapping in a large folio, batched free swap entries
> + * for this large folio, entry should be for the first subpage and
> + * its offset is aligned with nr_pages
> + */
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> + int i;
> + struct swap_cluster_info *ci;
> + struct swap_info_struct *p;
> + unsigned type = swp_type(entry);

nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
int" or at least it did for me when I did something similar in my swap-out patch
set.

> + unsigned long offset = swp_offset(entry);
> + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };

I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
increases. But the only other way I can think of is to explicitly loop over
fixed size chunks, and that's not much better.
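
For reference, the explicit fixed-size-chunk shape would be roughly the
following (untested sketch; the chunk size is arbitrary and hypothetical):

	#define SWAP_NR_FREE_CHUNK	64	/* just bounds the stack usage */

	while (nr_pages) {
		int nr = min(nr_pages, SWAP_NR_FREE_CHUNK);
		DECLARE_BITMAP(usage, SWAP_NR_FREE_CHUNK) = { 0 };

		/* ... same per-entry free/bitmap logic as above, for nr entries ... */

		offset += nr;
		nr_pages -= nr;
	}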

> +
> + /* all swap entries are within a cluster for mTHP */
> + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> +
> + if (nr_pages == 1) {
> + swap_free(entry);
> + return;
> + }
> +
> + p = _swap_info_get(entry);

You need to handle this returning NULL, like swap_free() does.
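
i.e. something like:

	p = _swap_info_get(entry);
	if (!p)
		return;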

> +
> + ci = lock_cluster(p, offset);

The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
by rotating media, and clusters are not in use, it will lock the whole swap
info. But your new version only calls lock_cluster() which won't lock anything
if clusters are not in use. So I think this is a locking bug.
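
i.e. I'd expect something along these lines instead (untested):

	ci = lock_cluster_or_swap_info(p, offset);
	for (i = 0; i < nr_pages; i++) {
		if (__swap_entry_free_locked(p, offset + i, 1))
			__bitmap_set(usage, i, 1);
	}
	unlock_cluster_or_swap_info(p, ci);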

> + for (i = 0; i < nr_pages; i++) {
> + if (__swap_entry_free_locked(p, offset + i, 1))
> + __bitmap_set(usage, i, 1);
> + }
> + unlock_cluster(ci);
> +
> + for_each_clear_bit(i, usage, nr_pages)
> + free_swap_slot(swp_entry(type, offset + i));
> +}
> +
> /*
> * Called after dropping swapcache to decrease refcnt to swap entries.
> */

Thanks,
Ryan


2024-03-12 12:34:46

by Ryan Roberts

Subject: Re: [RFC PATCH v3 3/5] mm: swap: make should_try_to_free_swap() support large-folio

On 04/03/2024 08:13, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> should_try_to_free_swap() works under the assumption that swap-in is always
> done at normal page granularity, i.e., folio_nr_pages = 1. To support large
> folio swap-in, this patch removes that assumption.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> Acked-by: Chris Li <[email protected]>
> ---
> mm/memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index abd4f33d62c9..e0d34d705e07 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3837,7 +3837,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
> * reference only in case it's likely that we'll be the exlusive user.
> */
> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> - folio_ref_count(folio) == 2;
> + folio_ref_count(folio) == (1 + folio_nr_pages(folio));

I don't think this is correct; one reference has just been added to the folio in
do_swap_page(), either by getting from swapcache (swap_cache_get_folio()) or by
allocating. If it came from the swapcache, it could be a large folio, because we
swapped out a large folio and never removed it from swapcache. But in that case,
others may have partially mapped it, so the refcount could legitimately equal
the number of pages while still not being exclusively mapped.

I'm guessing this logic is trying to estimate when we are likely exclusive so
that we remove from swapcache (release ref) and can then reuse rather than CoW
the folio? The main CoW path currently CoWs page-by-page even for large folios,
and with Barry's recent patch, even the last page gets copied. So not sure what
this change is really trying to achieve?


> }
>
> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)


2024-03-12 15:36:02

by Ryan Roberts

Subject: Re: [RFC PATCH v3 4/5] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in

On 04/03/2024 08:13, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> one entry only. To support large folio swap-in, we need to handle multiple
> swap entries.
>
> Cc: Kairui Song <[email protected]>
> Cc: "Huang, Ying" <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Chris Li <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Yosry Ahmed <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: SeongJae Park <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/swap.h | 1 +
> mm/swap.h | 1 +
> mm/swapfile.c | 118 ++++++++++++++++++++++++++-----------------
> 3 files changed, 74 insertions(+), 46 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index d6ab27929458..22105f0fe2d4 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> +extern int swapcache_prepare_nr(swp_entry_t entry, int nr);
> extern void swap_free(swp_entry_t);
> extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> diff --git a/mm/swap.h b/mm/swap.h
> index fc2f6ade7f80..1cec991efcda 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
> struct folio *swap_cache_get_folio(swp_entry_t entry,
> struct vm_area_struct *vma, unsigned long addr);
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 244106998a69..bae1b8165b11 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3309,7 +3309,7 @@ void si_swapinfo(struct sysinfo *val)
> }
>
> /*
> - * Verify that a swap entry is valid and increment its swap map count.
> + * Verify that nr swap entries are valid and increment their swap map count.
> *
> * Returns error code in following case.
> * - success -> 0
> @@ -3319,66 +3319,76 @@ void si_swapinfo(struct sysinfo *val)
> * - swap-cache reference is requested but the entry is not used. -> ENOENT
> * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
> */
> -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> +static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)

Perhaps it's better to pass order instead of nr to all these functions, to make
it clearer that entry should be naturally aligned and cover a power-of-2 number
of pages, no bigger than SWAPFILE_CLUSTER?

> {
> struct swap_info_struct *p;
> struct swap_cluster_info *ci;
> unsigned long offset;
> - unsigned char count;
> - unsigned char has_cache;
> - int err;
> + unsigned char count[SWAPFILE_CLUSTER];
> + unsigned char has_cache[SWAPFILE_CLUSTER];

I'm not sure this 1K stack buffer is a good idea?

Could you split it slightly differently so that loop 1 just does error checking
and bails out if an error is found, and loop 2 does the new value calculation
and writeback? Then you don't need these arrays.
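
Roughly this shape (hand-waving the details, untested):

	/* pass 1: validate only; bail out before modifying anything */
	for (i = 0; i < nr; i++) {
		unsigned char count = p->swap_map[offset + i];

		if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
			err = -ENOENT;
			goto unlock_out;
		}
		/* ... the remaining error checks, with no writeback ... */
	}

	/* pass 2: no failure is possible now; compute and write back */
	for (i = 0; i < nr; i++) {
		unsigned char count = p->swap_map[offset + i];
		unsigned char has_cache = count & SWAP_HAS_CACHE;

		count &= ~SWAP_HAS_CACHE;
		/* ... apply usage to count/has_cache as today ... */
		WRITE_ONCE(p->swap_map[offset + i], count | has_cache);
	}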

> + int err, i;
>
> p = swp_swap_info(entry);
>
> offset = swp_offset(entry);
> ci = lock_cluster_or_swap_info(p, offset);
>
> - count = p->swap_map[offset];
> -
> - /*
> - * swapin_readahead() doesn't check if a swap entry is valid, so the
> - * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> - */
> - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
> - err = -ENOENT;
> - goto unlock_out;
> - }
> -
> - has_cache = count & SWAP_HAS_CACHE;
> - count &= ~SWAP_HAS_CACHE;
> - err = 0;
> -
> - if (usage == SWAP_HAS_CACHE) {
> + for (i = 0; i < nr; i++) {
> + count[i] = p->swap_map[offset + i];
>
> - /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> - if (!has_cache && count)
> - has_cache = SWAP_HAS_CACHE;
> - else if (has_cache) /* someone else added cache */
> - err = -EEXIST;
> - else /* no users remaining */
> + /*
> + * swapin_readahead() doesn't check if a swap entry is valid, so the
> + * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> + */
> + if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
> err = -ENOENT;
> + goto unlock_out;
> + }
>
> - } else if (count || has_cache) {
> -
> - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> - count += usage;
> - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> - err = -EINVAL;
> - else if (swap_count_continued(p, offset, count))
> - count = COUNT_CONTINUED;
> - else
> - err = -ENOMEM;
> - } else
> - err = -ENOENT; /* unused swap entry */
> + has_cache[i] = count[i] & SWAP_HAS_CACHE;
> + count[i] &= ~SWAP_HAS_CACHE;
> + err = 0;
> +
> + if (usage == SWAP_HAS_CACHE) {
> +
> + /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> + if (!has_cache[i] && count[i])
> + has_cache[i] = SWAP_HAS_CACHE;
> + else if (has_cache[i]) /* someone else added cache */
> + err = -EEXIST;
> + else /* no users remaining */
> + err = -ENOENT;
> + } else if (count[i] || has_cache[i]) {
> +
> + if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> + count[i] += usage;
> + else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> + err = -EINVAL;
> + else if (swap_count_continued(p, offset + i, count[i]))
> + count[i] = COUNT_CONTINUED;
> + else
> + err = -ENOMEM;
> + } else
> + err = -ENOENT; /* unused swap entry */
>
> - if (!err)
> - WRITE_ONCE(p->swap_map[offset], count | has_cache);
> + if (err)
> + break;
> + }
>
> + if (!err) {
> + for (i = 0; i < nr; i++)
> + WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
> + }
> unlock_out:
> unlock_cluster_or_swap_info(p, ci);
> return err;
> }
>
> +static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> +{
> + return __swap_duplicate_nr(entry, 1, usage);
> +}
> +
> /*
> * Help swapoff by noting that swap entry belongs to shmem/tmpfs
> * (in which case its reference count is never incremented).
> @@ -3417,17 +3427,33 @@ int swapcache_prepare(swp_entry_t entry)
> return __swap_duplicate(entry, SWAP_HAS_CACHE);
> }
>
> -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> +int swapcache_prepare_nr(swp_entry_t entry, int nr)
> +{
> + return __swap_duplicate_nr(entry, nr, SWAP_HAS_CACHE);
> +}
> +
> +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr)
> {
> struct swap_cluster_info *ci;
> unsigned long offset = swp_offset(entry);
> - unsigned char usage;
> + unsigned char usage[SWAPFILE_CLUSTER];
> + int i;
>
> ci = lock_cluster_or_swap_info(si, offset);
> - usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
> + for (i = 0; i < nr; i++)
> + usage[i] = __swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE);
> unlock_cluster_or_swap_info(si, ci);
> - if (!usage)
> - free_swap_slot(entry);
> + for (i = 0; i < nr; i++) {
> + if (!usage[i]) {
> + free_swap_slot(entry);
> + entry.val++;
> + }
> + }
> +}

This is pretty similar to swap_nr_free() which you added in patch 2. Except
swap_nr_free() passes 1 as last param to __swap_entry_free_locked() and this
passes SWAP_HAS_CACHE. Perhaps there should be a common helper? I think
swap_nr_free()'s usage bitmap is preferable to this version's char array too.
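
Something like the below is what I have in mind - totally untested, and the
helper name is made up, just to illustrate the shape:

	static void swap_entries_free_nr(struct swap_info_struct *si,
					 swp_entry_t entry, int nr,
					 unsigned char usage)
	{
		DECLARE_BITMAP(to_free, SWAPFILE_CLUSTER) = { 0 };
		unsigned long offset = swp_offset(entry);
		struct swap_cluster_info *ci;
		int i;

		ci = lock_cluster_or_swap_info(si, offset);
		for (i = 0; i < nr; i++) {
			/* remember the entries whose usage dropped to zero */
			if (!__swap_entry_free_locked(si, offset + i, usage))
				__bitmap_set(to_free, i, 1);
		}
		unlock_cluster_or_swap_info(si, ci);

		for_each_set_bit(i, to_free, nr)
			free_swap_slot(swp_entry(swp_type(entry), offset + i));
	}

Then swap_nr_free() would pass 1 as usage and swapcache_clear_nr() would pass
SWAP_HAS_CACHE.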

Thanks,
Ryan

> +
> +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> +{
> + swapcache_clear_nr(si, entry, 1);
> }
>
> struct swap_info_struct *swp_swap_info(swp_entry_t entry)


2024-03-12 16:33:24

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On 04/03/2024 08:13, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> On an embedded system like Android, more than half of anon memory is
> actually in swap devices such as zRAM. For example, while an app is
> switched to background, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a
> whole.
>
> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet since this
> kind of shared memory is much less than memory mapped by single process.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page. On the other hand, it can also decrease do_swap_page as
> PTEs used to be set one by one even we hit a large folio in swapcache.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 212 insertions(+), 38 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e0d34d705e07..501ede745ef3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
> }
>
> +/*
> + * check a range of PTEs are completely swap entries with
> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> + * pte must be first one in the range
> + */
> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> +{
> + int i;
> + struct swap_info_struct *si;
> + swp_entry_t entry;
> + unsigned type;
> + pgoff_t start_offset;
> + char has_cache;
> +
> + entry = pte_to_swp_entry(ptep_get_lockless(pte));

Given you are getting entry locklessly, I expect it could change under you? So
probably need to check that it's a swap entry, etc. first?

> + if (non_swap_entry(entry))
> + return false;
> + start_offset = swp_offset(entry);
> + if (start_offset % nr_pages)
> + return false;
> +
> + si = swp_swap_info(entry);

What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
then swap_map may have been freed when you read it below. Holding the PTL can
sometimes prevent it, but I don't think you're holding that here (you're using
ptep_get_lockless()). Perhaps get_swap_device()/put_swap_device() can help?
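
i.e. something like this (untested):

	si = get_swap_device(entry);
	if (!si)
		return false;
	/* ... the swap_map checks below, with put_swap_device(si) on every
	   return path ... */

get_swap_device() takes a reference on the swap device, so swap_map can't be
freed under you while you walk it.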

> + type = swp_type(entry);
> + has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> + for (i = 1; i < nr_pages; i++) {
> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> + if (non_swap_entry(entry))
> + return false;
> + if (swp_offset(entry) != start_offset + i)
> + return false;
> + if (swp_type(entry) != type)
> + return false;
> + /*
> + * while allocating a large folio and doing swap_read_folio for the
> + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> + * doesn't have swapcache. We need to ensure all PTEs have no cache
> + * as well, otherwise, we might go to swap devices while the content
> + * is in swapcache
> + */
> + if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> + return false;
> + }
> +
> + return true;
> +}

I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
PTL is held, and you are lockless here. Thought it might be of interest though.

[1] https://lore.kernel.org/linux-mm/[email protected]/

> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> + * for this vma. Then filter out the orders that can't be allocated over
> + * the faulting address and still be fully contained in the vma.
> + */
> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long orders;
> +
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> + BIT(PMD_ORDER) - 1);
> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> + return orders;
> +}
> +#endif
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + unsigned long orders;
> + struct folio *folio;
> + unsigned long addr;
> + pte_t *pte;
> + gfp_t gfp;
> + int order;
> +
> + /*
> + * If uffd is active for the vma we need per-page fault fidelity to
> + * maintain the uffd semantics.
> + */
> + if (unlikely(userfaultfd_armed(vma)))
> + goto fallback;
> +
> + /*
> + * a large folio being swapped-in could be partially in
> + * zswap and partially in swap devices, zswap doesn't
> + * support large folios yet, we might get corrupted
> + * zero-filled data by reading all subpages from swap
> + * devices while some of them are actually in zswap
> + */
> + if (is_zswap_enabled())
> + goto fallback;
> +
> + orders = get_alloc_folio_orders(vmf);
> + if (!orders)
> + goto fallback;
> +
> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);

Could also briefly take PTL here, then is_pte_range_contig_swap() could be
merged with an enhanced swap_pte_batch()?

> + if (unlikely(!pte))
> + goto fallback;
> +
> + /*
> + * For do_swap_page, find the highest order where the aligned range is
> + * completely swap entries with contiguous swap offsets.
> + */
> + order = highest_order(orders);
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> + break;
> + order = next_order(&orders, order);
> + }

So in the common case, swap-in will pull in the same size of folio as was
swapped-out. Is that definitely the right policy for all folio sizes? Certainly
it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
it makes sense for 2M THP; As the size increases the chances of actually needing
all of the folio reduces so chances are we are wasting IO. There are similar
arguments for CoW, where we currently copy 1 page per fault - it probably makes
sense to copy the whole folio up to a certain size.

Thanks,
Ryan

> +
> + pte_unmap(pte);
> +
> + /* Try allocating the highest of the remaining orders. */
> + gfp = vma_thp_gfp_mask(vma);
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> + if (folio)
> + return folio;
> + order = next_order(&orders, order);
> + }
> +
> +fallback:
> +#endif
> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> +}
> +
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address;
> + pte_t *start_pte;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> - /*
> - * Prevent parallel swapin from proceeding with
> - * the cache flag. Otherwise, another thread may
> - * finish swapin first, free the entry, and swapout
> - * reusing the same entry. It's undetectable as
> - * pte_same() returns true due to entry reuse.
> - */
> - if (swapcache_prepare(entry)) {
> - /* Relax a bit to prevent rapid repeated page faults */
> - schedule_timeout_uninterruptible(1);
> - goto out;
> - }
> - need_clear_cache = true;
> -
> /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + folio = alloc_swap_folio(vmf);
> page = &folio->page;
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> + if (folio_test_large(folio)) {
> + nr_pages = folio_nr_pages(folio);
> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> + }
> +
> + /*
> + * Prevent parallel swapin from proceeding with
> + * the cache flag. Otherwise, another thread may
> + * finish swapin first, free the entry, and swapout
> + * reusing the same entry. It's undetectable as
> + * pte_same() returns true due to entry reuse.
> + */
> + if (swapcache_prepare_nr(entry, nr_pages)) {
> + /* Relax a bit to prevent rapid repeated page faults */
> + schedule_timeout_uninterruptible(1);
> + goto out;
> + }
> + need_clear_cache = true;
> +
> if (mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> ret = VM_FAULT_OOM;
> goto out_page;
> }
> - mem_cgroup_swapin_uncharge_swap(entry);
> +
> + for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> + mem_cgroup_swapin_uncharge_swap(e);
>
> shadow = get_shadow_from_swap_cache(entry);
> if (shadow)
> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
> +
> + start_address = vmf->address;
> + start_pte = vmf->pte;
> + if (start_pte && folio_test_large(folio)) {
> + unsigned long nr = folio_nr_pages(folio);
> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> + pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> +
> + /*
> + * case 1: we are allocating large_folio, try to map it as a whole
> + * iff the swap entries are still entirely mapped;
> + * case 2: we hit a large folio in swapcache, and all swap entries
> + * are still entirely mapped, try to map a large folio as a whole.
> + * otherwise, map only the faulting page within the large folio
> + * which is swapcache
> + */
> + if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> + if (nr_pages > 1) /* ptes have changed for case 1 */
> + goto out_nomap;
> + goto check_pte;
> + }
> +
> + start_address = addr;
> + start_pte = aligned_pte;
> + /*
> + * the below has been done before swap_read_folio()
> + * for case 1
> + */
> + if (unlikely(folio == swapcache)) {
> + nr_pages = nr;
> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> + page = &folio->page;
> + }
> + }
> +
> +check_pte:
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_nr_free(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, start_address);
> folio_add_lru_vma(folio, vma);
> + } else if (!folio_test_anon(folio)) {
> + folio_add_new_anon_rmap(folio, vma, start_address);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> if (vmf->flags & FAULT_FLAG_WRITE) {
> + if (nr_pages > 1)
> + vmf->orig_pte = ptep_get(vmf->pte);
> +
> ret |= do_wp_page(vmf);
> if (ret & VM_FAULT_ERROR)
> ret &= VM_FAULT_ERROR;
> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
> /* Clear the swap cache pin for direct swapin after PTL unlock */
> if (need_clear_cache)
> - swapcache_clear(si, entry);
> + swapcache_clear_nr(si, entry, nr_pages);
> if (si)
> put_swap_device(si);
> return ret;
> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_put(swapcache);
> }
> if (need_clear_cache)
> - swapcache_clear(si, entry);
> + swapcache_clear_nr(si, entry, nr_pages);
> if (si)
> put_swap_device(si);
> return ret;
> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> if (unlikely(userfaultfd_armed(vma)))
> goto fallback;
>
> - /*
> - * Get a list of all the (large) orders below PMD_ORDER that are enabled
> - * for this vma. Then filter out the orders that can't be allocated over
> - * the faulting address and still be fully contained in the vma.
> - */
> - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> - BIT(PMD_ORDER) - 1);
> - orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> -
> + orders = get_alloc_folio_orders(vmf);
> if (!orders)
> goto fallback;
>


2024-03-13 02:22:00

by Chuanhua Han

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/5] mm: swap: make should_try_to_free_swap() support large-folio

hi, Ryan Roberts

On 2024/3/12 20:34, Ryan Roberts wrote:
> On 04/03/2024 08:13, Barry Song wrote:
>> From: Chuanhua Han <[email protected]>
>>
>> should_try_to_free_swap() works with an assumption that swap-in is always done
>> at normal page granularity, aka, folio_nr_pages = 1. To support large folio
>> swap-in, this patch removes the assumption.
>>
>> Signed-off-by: Chuanhua Han <[email protected]>
>> Co-developed-by: Barry Song <[email protected]>
>> Signed-off-by: Barry Song <[email protected]>
>> Acked-by: Chris Li <[email protected]>
>> ---
>> mm/memory.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index abd4f33d62c9..e0d34d705e07 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3837,7 +3837,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
>> * reference only in case it's likely that we'll be the exlusive user.
>> */
>> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
>> - folio_ref_count(folio) == 2;
>> + folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> I don't think this is correct; one reference has just been added to the folio in
> do_swap_page(), either by getting from swapcache (swap_cache_get_folio()) or by
> allocating. If it came from the swapcache, it could be a large folio, because we
> swapped out a large folio and never removed it from swapcache. But in that case,
> others may have partially mapped it, so the refcount could legitimately equal
> the number of pages while still not being exclusively mapped.
>
> I'm guessing this logic is trying to estimate when we are likely exclusive so
> that we remove from swapcache (release ref) and can then reuse rather than CoW
> the folio? The main CoW path currently CoWs page-by-page even for large folios,
> and with Barry's recent patch, even the last page gets copied. So not sure what
> this change is really trying to achieve?
>
First, if it is a large folio in the swap cache, then its refcount is at
least folio_nr_pages(folio):


For example, in add_to_swap_cache path:

int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
                        gfp_t gfp, void **shadowp)
{
        struct address_space *address_space = swap_address_space(entry);
        pgoff_t idx = swp_offset(entry);
        XA_STATE_ORDER(xas, &address_space->i_pages, idx,
folio_order(folio));
        unsigned long i, nr = folio_nr_pages(folio); <---
        void *old;
        ...
        folio_ref_add(folio, nr); <---
        folio_set_swapcache(folio);
        ...
}


Then in the do_swap_page path:

	if (should_try_to_free_swap(folio, vma, vmf->flags))
		folio_free_swap(folio);

It also indicates that only a folio in the swap cache will call
folio_free_swap() to delete it from the swap cache, so I feel like this
patch is necessary!?
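
For example, assuming nothing else holds a reference: for an order-2 folio
(nr = 4) that is only in the swap cache, add_to_swap_cache() has taken 4
references and do_swap_page() takes one more when it looks the folio up, so
folio_ref_count(folio) == 5 == 1 + folio_nr_pages(folio).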

>> }
>>
>> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)

Thanks,

Chuanhua


2024-03-13 09:10:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/5] mm: swap: make should_try_to_free_swap() support large-folio

On 13/03/2024 02:21, Chuanhua Han wrote:
> hi, Ryan Roberts
>
> On 2024/3/12 20:34, Ryan Roberts wrote:
>> On 04/03/2024 08:13, Barry Song wrote:
>>> From: Chuanhua Han <[email protected]>
>>>
>>> should_try_to_free_swap() works with an assumption that swap-in is always done
>>> at normal page granularity, aka, folio_nr_pages = 1. To support large folio
>>> swap-in, this patch removes the assumption.
>>>
>>> Signed-off-by: Chuanhua Han <[email protected]>
>>> Co-developed-by: Barry Song <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> Acked-by: Chris Li <[email protected]>
>>> ---
>>> mm/memory.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index abd4f33d62c9..e0d34d705e07 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3837,7 +3837,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
>>> * reference only in case it's likely that we'll be the exlusive user.
>>> */
>>> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
>>> - folio_ref_count(folio) == 2;
>>> + folio_ref_count(folio) == (1 + folio_nr_pages(folio));
>> I don't think this is correct; one reference has just been added to the folio in
>> do_swap_page(), either by getting from swapcache (swap_cache_get_folio()) or by
>> allocating. If it came from the swapcache, it could be a large folio, because we
>> swapped out a large folio and never removed it from swapcache. But in that case,
>> others may have partially mapped it, so the refcount could legitimately equal
>> the number of pages while still not being exclusively mapped.
>>
>> I'm guessing this logic is trying to estimate when we are likely exclusive so
>> that we remove from swapcache (release ref) and can then reuse rather than CoW
>> the folio? The main CoW path currently CoWs page-by-page even for large folios,
>> and with Barry's recent patch, even the last page gets copied. So not sure what
>> this change is really trying to achieve?
>>
> First, if it is a large folio in the swap cache, then its refcont is at
> least folio_nr_pages(folio) :  

Ahh! Sorry, I had it backwards - was thinking there would be 1 ref for the swap
cache, and you were assuming 1 ref per page taken by do_swap_page(). I
understand now. On this basis:

Reviewed-by: Ryan Roberts <[email protected]>

>
>
> For example, in add_to_swap_cache path:
>
> int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
>                         gfp_t gfp, void **shadowp)
> {
>         struct address_space *address_space = swap_address_space(entry);
>         pgoff_t idx = swp_offset(entry);
>         XA_STATE_ORDER(xas, &address_space->i_pages, idx,
> folio_order(folio));
>         unsigned long i, nr = folio_nr_pages(folio); <---
>         void *old;
>         ...
>         folio_ref_add(folio, nr); <---
>         folio_set_swapcache(folio);
>         ...
> }
>
>
> *
>
> Then in the do_swap_page path:
>
> * if (should_try_to_free_swap(folio, vma, vmf->flags))
>         folio_free_swap(folio);
> *
>
> * It also indicates that only folio in the swap cache will call
> folio_free_swap
> * to delete it from the swap cache, So I feel like this patch is
> necessary!?
>
>>> }
>>>
>>> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
>
> Thanks,
>
> Chuanhua
>


2024-03-13 09:24:59

by Chuanhua Han

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/5] mm: swap: make should_try_to_free_swap() support large-folio

Hi Ryan,
On Wed, Mar 13, 2024 at 17:10, Ryan Roberts <[email protected]> wrote:
>
> On 13/03/2024 02:21, Chuanhua Han wrote:
> > hi, Ryan Roberts
> >
> > On 2024/3/12 20:34, Ryan Roberts wrote:
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Chuanhua Han <[email protected]>
> >>>
> >>> should_try_to_free_swap() works with an assumption that swap-in is always done
> >>> at normal page granularity, aka, folio_nr_pages = 1. To support large folio
> >>> swap-in, this patch removes the assumption.
> >>>
> >>> Signed-off-by: Chuanhua Han <[email protected]>
> >>> Co-developed-by: Barry Song <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> Acked-by: Chris Li <[email protected]>
> >>> ---
> >>> mm/memory.c | 2 +-
> >>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index abd4f33d62c9..e0d34d705e07 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3837,7 +3837,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
> >>> * reference only in case it's likely that we'll be the exlusive user.
> >>> */
> >>> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> >>> - folio_ref_count(folio) == 2;
> >>> + folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> >> I don't think this is correct; one reference has just been added to the folio in
> >> do_swap_page(), either by getting from swapcache (swap_cache_get_folio()) or by
> >> allocating. If it came from the swapcache, it could be a large folio, because we
> >> swapped out a large folio and never removed it from swapcache. But in that case,
> >> others may have partially mapped it, so the refcount could legitimately equal
> >> the number of pages while still not being exclusively mapped.
> >>
> >> I'm guessing this logic is trying to estimate when we are likely exclusive so
> >> that we remove from swapcache (release ref) and can then reuse rather than CoW
> >> the folio? The main CoW path currently CoWs page-by-page even for large folios,
> >> and with Barry's recent patch, even the last page gets copied. So not sure what
> >> this change is really trying to achieve?
> >>
> > First, if it is a large folio in the swap cache, then its refcont is at
> > least folio_nr_pages(folio) :
>
> Ahh! Sorry, I had it backwards - was thinking there would be 1 ref for the swap
> cache, and you were assuming 1 ref per page taken by do_swap_page(). I
> understand now. On this basis:
>
> Reviewed-by: Ryan Roberts <[email protected]>

Thank you for your review!
>
> >
> >
> > For example, in add_to_swap_cache path:
> >
> > int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > gfp_t gfp, void **shadowp)
> > {
> > struct address_space *address_space = swap_address_space(entry);
> > pgoff_t idx = swp_offset(entry);
> > XA_STATE_ORDER(xas, &address_space->i_pages, idx,
> > folio_order(folio));
> > unsigned long i, nr = folio_nr_pages(folio); <---
> > void *old;
> > ...
> > folio_ref_add(folio, nr); <---
> > folio_set_swapcache(folio);
> > ...
> > }
> >
> >
> > *
> >
> > Then in the do_swap_page path:
> >
> > * if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> > *
> >
> > * It also indicates that only folio in the swap cache will call
> > folio_free_swap
> > * to delete it from the swap cache, So I feel like this patch is
> > necessary!?
> >
> >>> }
> >>>
> >>> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> >
> > Thanks,
> >
> > Chuanhua
> >
>
>
Thanks,
Chuanhua

2024-03-14 12:56:44

by Chuanhua Han

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Wed, Mar 13, 2024 at 00:33, Ryan Roberts <[email protected]> wrote:
>
> On 04/03/2024 08:13, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > On an embedded system like Android, more than half of anon memory is
> > actually in swap devices such as zRAM. For example, while an app is
> > switched to background, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a
> > whole.
> >
> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet since this
> > kind of shared memory is much less than memory mapped by single process.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page. On the other hand, it can also decrease do_swap_page as
> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 212 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e0d34d705e07..501ede745ef3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > return VM_FAULT_SIGBUS;
> > }
> >
> > +/*
> > + * check a range of PTEs are completely swap entries with
> > + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> > + * pte must be first one in the range
> > + */
> > +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> > +{
> > + int i;
> > + struct swap_info_struct *si;
> > + swp_entry_t entry;
> > + unsigned type;
> > + pgoff_t start_offset;
> > + char has_cache;
> > +
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte));
>
> Given you are getting entry locklessly, I expect it could change under you? So
> probably need to check that its a swap entry, etc. first?
The following non_swap_entry() check tests whether it is a swap entry.
>
> > + if (non_swap_entry(entry))
> > + return false;
> > + start_offset = swp_offset(entry);
> > + if (start_offset % nr_pages)
> > + return false;
> > +
> > + si = swp_swap_info(entry);
>
> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
> then swap_map may have been freed when you read it below. Holding the PTL can
> sometimes prevent it, but I don't think you're holding that here (you're using
> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
Thank you for your review, you are right! This place really needs
get_swap_device()/put_swap_device().
>
> > + type = swp_type(entry);
> > + has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> > + for (i = 1; i < nr_pages; i++) {
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> > + if (non_swap_entry(entry))
> > + return false;
> > + if (swp_offset(entry) != start_offset + i)
> > + return false;
> > + if (swp_type(entry) != type)
> > + return false;
> > + /*
> > + * while allocating a large folio and doing swap_read_folio for the
> > + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> > + * doesn't have swapcache. We need to ensure all PTEs have no cache
> > + * as well, otherwise, we might go to swap devices while the content
> > + * is in swapcache
> > + */
> > + if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> > + return false;
> > + }
> > +
> > + return true;
> > +}
>
> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
> PTL is held, and you are lockless here. Thought it might be of interest though.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
Thanks. It's probably similar to ours, but as you said we are lockless
here, and we need to check has_cache.
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +/*
> > + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > + * for this vma. Then filter out the orders that can't be allocated over
> > + * the faulting address and still be fully contained in the vma.
> > + */
> > +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + unsigned long orders;
> > +
> > + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> > + BIT(PMD_ORDER) - 1);
> > + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > + return orders;
> > +}
> > +#endif
> > +
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > + unsigned long orders;
> > + struct folio *folio;
> > + unsigned long addr;
> > + pte_t *pte;
> > + gfp_t gfp;
> > + int order;
> > +
> > + /*
> > + * If uffd is active for the vma we need per-page fault fidelity to
> > + * maintain the uffd semantics.
> > + */
> > + if (unlikely(userfaultfd_armed(vma)))
> > + goto fallback;
> > +
> > + /*
> > + * a large folio being swapped-in could be partially in
> > + * zswap and partially in swap devices, zswap doesn't
> > + * support large folios yet, we might get corrupted
> > + * zero-filled data by reading all subpages from swap
> > + * devices while some of them are actually in zswap
> > + */
> > + if (is_zswap_enabled())
> > + goto fallback;
> > +
> > + orders = get_alloc_folio_orders(vmf);
> > + if (!orders)
> > + goto fallback;
> > +
> > + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>
> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
> merged with an enhanced swap_pte_batch()?
Yes, it's easy to take the lock here, but I'm wondering if it's
necessary, because when we actually set the pte in do_swap_page, we'll
hold the PTL to check whether the pte has changed.
>
> > + if (unlikely(!pte))
> > + goto fallback;
> > +
> > + /*
> > + * For do_swap_page, find the highest order where the aligned range is
> > + * completely swap entries with contiguous swap offsets.
> > + */
> > + order = highest_order(orders);
> > + while (orders) {
> > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > + if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> > + break;
> > + order = next_order(&orders, order);
> > + }
>
> So in the common case, swap-in will pull in the same size of folio as was
> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> it makes sense for 2M THP; As the size increases the chances of actually needing
> all of the folio reduces so chances are we are wasting IO. There are similar
> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> sense to copy the whole folio up to a certain size.
For 2M THP, the IO overhead may not necessarily be large. :)
1. If the 2M THP is stored contiguously in the swap device, the IO
overhead may not be very large (e.g. submitting a bio with a single
bio_vec at a time).
2. If the process really needs the whole 2M of data, one page fault may
perform much better than many.
3. For swap devices like zram, using 2M THP might also improve
decompression efficiency.

On the other hand, if the process only needs a small part of the 2M
data (e.g. it frequently uses a single 4K page and never touches the
rest), this is indeed a lot of wasted effort! :(
>
> Thanks,
> Ryan
>
> > +
> > + pte_unmap(pte);
> > +
> > + /* Try allocating the highest of the remaining orders. */
> > + gfp = vma_thp_gfp_mask(vma);
> > + while (orders) {
> > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > + if (folio)
> > + return folio;
> > + order = next_order(&orders, order);
> > + }
> > +
> > +fallback:
> > +#endif
> > + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> > +}
> > +
> > +
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address;
> > + pte_t *start_pte;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > - /*
> > - * Prevent parallel swapin from proceeding with
> > - * the cache flag. Otherwise, another thread may
> > - * finish swapin first, free the entry, and swapout
> > - * reusing the same entry. It's undetectable as
> > - * pte_same() returns true due to entry reuse.
> > - */
> > - if (swapcache_prepare(entry)) {
> > - /* Relax a bit to prevent rapid repeated page faults */
> > - schedule_timeout_uninterruptible(1);
> > - goto out;
> > - }
> > - need_clear_cache = true;
> > -
> > /* skip swapcache */
> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > - vma, vmf->address, false);
> > + folio = alloc_swap_folio(vmf);
> > page = &folio->page;
> > if (folio) {
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> >
> > + if (folio_test_large(folio)) {
> > + nr_pages = folio_nr_pages(folio);
> > + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > + }
> > +
> > + /*
> > + * Prevent parallel swapin from proceeding with
> > + * the cache flag. Otherwise, another thread may
> > + * finish swapin first, free the entry, and swapout
> > + * reusing the same entry. It's undetectable as
> > + * pte_same() returns true due to entry reuse.
> > + */
> > + if (swapcache_prepare_nr(entry, nr_pages)) {
> > + /* Relax a bit to prevent rapid repeated page faults */
> > + schedule_timeout_uninterruptible(1);
> > + goto out;
> > + }
> > + need_clear_cache = true;
> > +
> > if (mem_cgroup_swapin_charge_folio(folio,
> > vma->vm_mm, GFP_KERNEL,
> > entry)) {
> > ret = VM_FAULT_OOM;
> > goto out_page;
> > }
> > - mem_cgroup_swapin_uncharge_swap(entry);
> > +
> > + for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> > + mem_cgroup_swapin_uncharge_swap(e);
> >
> > shadow = get_shadow_from_swap_cache(entry);
> > if (shadow)
> > @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
> > +
> > + start_address = vmf->address;
> > + start_pte = vmf->pte;
> > + if (start_pte && folio_test_large(folio)) {
> > + unsigned long nr = folio_nr_pages(folio);
> > + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > + pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> > +
> > + /*
> > + * case 1: we are allocating large_folio, try to map it as a whole
> > + * iff the swap entries are still entirely mapped;
> > + * case 2: we hit a large folio in swapcache, and all swap entries
> > + * are still entirely mapped, try to map a large folio as a whole.
> > + * otherwise, map only the faulting page within the large folio
> > + * which is swapcache
> > + */
> > + if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> > + if (nr_pages > 1) /* ptes have changed for case 1 */
> > + goto out_nomap;
> > + goto check_pte;
> > + }
> > +
> > + start_address = addr;
> > + start_pte = aligned_pte;
> > + /*
> > + * the below has been done before swap_read_folio()
> > + * for case 1
> > + */
> > + if (unlikely(folio == swapcache)) {
> > + nr_pages = nr;
> > + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > + page = &folio->page;
> > + }
> > + }
> > +
> > +check_pte:
> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > goto out_nomap;
> >
> > @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free(entry);
> > + swap_nr_free(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || folio_ref_count(folio) == nr_pages)) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> > + folio_add_new_anon_rmap(folio, vma, start_address);
> > folio_add_lru_vma(folio, vma);
> > + } else if (!folio_test_anon(folio)) {
> > + folio_add_new_anon_rmap(folio, vma, start_address);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > + if (nr_pages > 1)
> > + vmf->orig_pte = ptep_get(vmf->pte);
> > +
> > ret |= do_wp_page(vmf);
> > if (ret & VM_FAULT_ERROR)
> > ret &= VM_FAULT_ERROR;
> > @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > out:
> > /* Clear the swap cache pin for direct swapin after PTL unlock */
> > if (need_clear_cache)
> > - swapcache_clear(si, entry);
> > + swapcache_clear_nr(si, entry, nr_pages);
> > if (si)
> > put_swap_device(si);
> > return ret;
> > @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio_put(swapcache);
> > }
> > if (need_clear_cache)
> > - swapcache_clear(si, entry);
> > + swapcache_clear_nr(si, entry, nr_pages);
> > if (si)
> > put_swap_device(si);
> > return ret;
> > @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > if (unlikely(userfaultfd_armed(vma)))
> > goto fallback;
> >
> > - /*
> > - * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > - * for this vma. Then filter out the orders that can't be allocated over
> > - * the faulting address and still be fully contained in the vma.
> > - */
> > - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> > - BIT(PMD_ORDER) - 1);
> > - orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > -
> > + orders = get_alloc_folio_orders(vmf);
> > if (!orders)
> > goto fallback;
> >
>
>


--
Thanks,
Chuanhua

2024-03-14 13:22:40

by Chuanhua Han

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

On Tue, Mar 12, 2024 at 02:51, Ryan Roberts <[email protected]> wrote:
>
> On 04/03/2024 08:13, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/swap.h | 6 ++++++
> > mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
> > 2 files changed, 41 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2955f7a78d8d..d6ab27929458 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > extern int swapcache_prepare(swp_entry_t);
> > extern void swap_free(swp_entry_t);
> > +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
>
> nit: In my swap-out v4 series, I've created a batched version of
> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
> scheme doesn't really work when applied to free_swap_and_cache().
Thanks for your suggestions; for the next version, we'll see which
naming scheme is more appropriate!
>
> > extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > extern int free_swap_and_cache(swp_entry_t);
> > int swap_type_of(dev_t device, sector_t offset);
> > @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> > {
> > }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > +
> > +}
> > +
> > static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > {
> > }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 3f594be83b58..244106998a69 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> > __swap_entry_free(p, entry);
> > }
> >
> > +/*
> > + * Called after swapping in a large folio, batched free swap entries
> > + * for this large folio, entry should be for the first subpage and
> > + * its offset is aligned with nr_pages
> > + */
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > + int i;
> > + struct swap_cluster_info *ci;
> > + struct swap_info_struct *p;
> > + unsigned type = swp_type(entry);
>
> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
> int" or at least it did for me when I did something similar in my swap-out patch
> set.
Gee, thanks for pointing that out!
>
> > + unsigned long offset = swp_offset(entry);
> > + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
>
> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> increases. But the only other way I can think of is to explicitly loop over
> fixed size chunks, and that's not much better.
Is it possible to save kernel stack by using a bitmap here? If
SWAPFILE_CLUSTER = 512, we consume only (512/64) * 8 = 64 bytes.
>
> > +
> > + /* all swap entries are within a cluster for mTHP */
> > + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > +
> > + if (nr_pages == 1) {
> > + swap_free(entry);
> > + return;
> > + }
> > +
> > + p = _swap_info_get(entry);
>
> You need to handle this returning NULL, like swap_free() does.
Yes, you're right! We did forget to check for NULL here.
>
> > +
> > + ci = lock_cluster(p, offset);
>
> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> by rotating media, and clusters are not in use, it will lock the whole swap
> info. But your new version only calls lock_cluster() which won't lock anything
> if clusters are not in use. So I think this is a locking bug.
Again, you're right, it's a bug!
>
> > + for (i = 0; i < nr_pages; i++) {
> > + if (__swap_entry_free_locked(p, offset + i, 1))
> > + __bitmap_set(usage, i, 1);
> > + }
> > + unlock_cluster(ci);
> > +
> > + for_each_clear_bit(i, usage, nr_pages)
> > + free_swap_slot(swp_entry(type, offset + i));
> > +}
> > +
> > /*
> > * Called after dropping swapcache to decrease refcnt to swap entries.
> > */
>
> Thanks,
> Ryan
>
>


--
Thanks,
Chuanhua

2024-03-14 13:43:54

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

On 14/03/2024 13:12, Chuanhua Han wrote:
> On Tue, Mar 12, 2024 at 02:51, Ryan Roberts <[email protected]> wrote:
>>
>> On 04/03/2024 08:13, Barry Song wrote:
>>> From: Chuanhua Han <[email protected]>
>>>
>>> While swapping in a large folio, we need to free swaps related to the whole
>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
>>> to introduce an API for batched free.
>>>
>>> Signed-off-by: Chuanhua Han <[email protected]>
>>> Co-developed-by: Barry Song <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> ---
>>> include/linux/swap.h | 6 ++++++
>>> mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
>>> 2 files changed, 41 insertions(+)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 2955f7a78d8d..d6ab27929458 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>>> extern int swap_duplicate(swp_entry_t);
>>> extern int swapcache_prepare(swp_entry_t);
>>> extern void swap_free(swp_entry_t);
>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
>>
>> nit: In my swap-out v4 series, I've created a batched version of
>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
>> scheme doesn't really work when applied to free_swap_and_cache().
> Thanks for your suggestions, and for the next version, we'll see which
> package is more appropriate!
>>
>>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>> extern int free_swap_and_cache(swp_entry_t);
>>> int swap_type_of(dev_t device, sector_t offset);
>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
>>> {
>>> }
>>>
>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>> +{
>>> +
>>> +}
>>> +
>>> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>>> {
>>> }
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 3f594be83b58..244106998a69 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
>>> __swap_entry_free(p, entry);
>>> }
>>>
>>> +/*
>>> + * Called after swapping in a large folio, batched free swap entries
>>> + * for this large folio, entry should be for the first subpage and
>>> + * its offset is aligned with nr_pages
>>> + */
>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>> +{
>>> + int i;
>>> + struct swap_cluster_info *ci;
>>> + struct swap_info_struct *p;
>>> + unsigned type = swp_type(entry);
>>
>> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
>> int" or at least it did for me when I did something similar in my swap-out patch
>> set.
> Gee, thanks for pointing that out!
>>
>>> + unsigned long offset = swp_offset(entry);
>>> + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
>>
>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
>> increases. But the only other way I can think of is to explicitly loop over
>> fixed size chunks, and that's not much better.
> Is it possible to save kernel stack better by using bit_map here? If
> SWAPFILE_CLUSTER=512, we consume only (512/64)*8= 64 bytes.

I'm not sure I've understood what you are saying? You're already using
DECLARE_BITMAP(), so it's already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?

I actually did a bad job of trying to express a couple of different points:

- Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
Certainly not for arm64, but not sure about other architectures. For example if
an arch had 64K pages with 8192 entries per THP and supported THP_SWAP, that's 1K
for the bitmap, which is now looking pretty big for the stack.

- Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
define a fixed stack size (e.g. 64 bytes -> 512 entries). Then free the range of
entries in batches no bigger than this size. This approach could also allow
removing the constraint that the range has to be aligned and fit in a single
cluster. Personally I think an approach like this would be much more robust, in
return for a tiny bit more complexity.
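
Something like this totally untested sketch is what I have in mind (the batch
size and the function name are just for illustration):

	#define SWAP_FREE_BATCH	512	/* 64 bytes of bitmap on the stack */

	void swap_free_nr(swp_entry_t entry, int nr_pages)
	{
		unsigned long offset = swp_offset(entry);
		unsigned int type = swp_type(entry);
		struct swap_info_struct *p;
		int batch, i;

		p = _swap_info_get(entry);
		if (!p)
			return;

		while (nr_pages) {
			DECLARE_BITMAP(usage, SWAP_FREE_BATCH) = { 0 };
			struct swap_cluster_info *ci;

			/* never cross a cluster or the fixed batch size */
			batch = min3(nr_pages, SWAP_FREE_BATCH,
				     (int)(SWAPFILE_CLUSTER -
					   offset % SWAPFILE_CLUSTER));

			ci = lock_cluster_or_swap_info(p, offset);
			for (i = 0; i < batch; i++) {
				if (__swap_entry_free_locked(p, offset + i, 1))
					__bitmap_set(usage, i, 1);
			}
			unlock_cluster_or_swap_info(p, ci);

			for_each_clear_bit(i, usage, batch)
				free_swap_slot(swp_entry(type, offset + i));

			offset += batch;
			nr_pages -= batch;
		}
	}

That keeps the stack usage bounded regardless of SWAPFILE_CLUSTER, and the
range no longer has to be aligned or fit in a single cluster.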

>>
>>> +
>>> + /* all swap entries are within a cluster for mTHP */
>>> + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>>> +
>>> + if (nr_pages == 1) {
>>> + swap_free(entry);
>>> + return;
>>> + }
>>> +
>>> + p = _swap_info_get(entry);
>>
>> You need to handle this returning NULL, like swap_free() does.
> Yes, you're right! We did forget to judge NULL here.
>>
>>> +
>>> + ci = lock_cluster(p, offset);
>>
>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
>> by rotating media, and clusters are not in use, it will lock the whole swap
>> info. But your new version only calls lock_cluster() which won't lock anything
>> if clusters are not in use. So I think this is a locking bug.
> Again, you're right, it's bug!
>>
>>> + for (i = 0; i < nr_pages; i++) {
>>> + if (__swap_entry_free_locked(p, offset + i, 1))
>>> + __bitmap_set(usage, i, 1);
>>> + }
>>> + unlock_cluster(ci);
>>> +
>>> + for_each_clear_bit(i, usage, nr_pages)
>>> + free_swap_slot(swp_entry(type, offset + i));
>>> +}
>>> +
>>> /*
>>> * Called after dropping swapcache to decrease refcnt to swap entries.
>>> */
>>
>> Thanks,
>> Ryan
>>
>>
>
>


2024-03-14 14:48:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On 14/03/2024 12:56, Chuanhua Han wrote:
> Ryan Roberts <[email protected]> 于2024年3月13日周三 00:33写道:
>>
>> On 04/03/2024 08:13, Barry Song wrote:
>>> From: Chuanhua Han <[email protected]>
>>>
>>> On an embedded system like Android, more than half of anon memory is
>>> actually in swap devices such as zRAM. For example, while an app is
>>> switched to background, its most memory might be swapped-out.
>>>
>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>> performance gain we can get through large folios and hardware optimization
>>> such as CONT-PTE.
>>>
>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>>> to those contiguous swaps which were likely swapped out from mTHP as a
>>> whole.
>>>
>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>>> case. It doesn't support swapin_readahead as large folios yet since this
>>> kind of shared memory is much less than memory mapped by single process.
>>>
>>> Right now, we are re-faulting large folios which are still in swapcache as a
>>> whole, this can effectively decrease extra loops and early-exitings which we
>>> have increased in arch_swap_restore() while supporting MTE restore for folios
>>> rather than page. On the other hand, it can also decrease do_swap_page as
>>> PTEs used to be set one by one even we hit a large folio in swapcache.
>>>
>>> Signed-off-by: Chuanhua Han <[email protected]>
>>> Co-developed-by: Barry Song <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> ---
>>> mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
>>> 1 file changed, 212 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index e0d34d705e07..501ede745ef3 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>>> return VM_FAULT_SIGBUS;
>>> }
>>>
>>> +/*
>>> + * check a range of PTEs are completely swap entries with
>>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
>>> + * pte must be first one in the range
>>> + */
>>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
>>> +{
>>> + int i;
>>> + struct swap_info_struct *si;
>>> + swp_entry_t entry;
>>> + unsigned type;
>>> + pgoff_t start_offset;
>>> + char has_cache;
>>> +
>>> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
>>
>> Given you are getting entry locklessly, I expect it could change under you? So
>> probably need to check that its a swap entry, etc. first?
> The following non_swap_entry checks to see if it is a swap entry.

No, it checks if something already known to be a "swap entry" type is actually
describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:

static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;
}


So you need to do something like:

	pte_t ptent = ptep_get_lockless(pte);

	if (pte_none(ptent) || pte_present(ptent))
		return false;
	entry = pte_to_swp_entry(ptent);
	if (non_swap_entry(entry))
		return false;
	...

>>
>>> + if (non_swap_entry(entry))
>>> + return false;
>>> + start_offset = swp_offset(entry);
>>> + if (start_offset % nr_pages)
>>> + return false;
>>> +
>>> + si = swp_swap_info(entry);
>>
>> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
>> then swap_map may have been freed when you read it below. Holding the PTL can
>> sometimes prevent it, but I don't think you're holding that here (you're using
>> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
> Thank you for your review,you are righit! this place reaally needs
> get_swap_device()/put_swap_device().
>>
>>> + type = swp_type(entry);
>>> + has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
>>> + for (i = 1; i < nr_pages; i++) {
>>> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
>>> + if (non_swap_entry(entry))
>>> + return false;
>>> + if (swp_offset(entry) != start_offset + i)
>>> + return false;
>>> + if (swp_type(entry) != type)
>>> + return false;
>>> + /*
>>> + * while allocating a large folio and doing swap_read_folio for the
>>> + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
>>> + * doesn't have swapcache. We need to ensure all PTEs have no cache
>>> + * as well, otherwise, we might go to swap devices while the content
>>> + * is in swapcache
>>> + */
>>> + if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
>>> + return false;
>>> + }
>>> +
>>> + return true;
>>> +}
>>
>> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
>> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
>> PTL is held, and you are lockless here. Thought it might be of interest though.
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>
> Thanks. It's probably similar to ours, but as you said we are lockless
> here, and we need to check has_cache.
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +/*
>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>> + * for this vma. Then filter out the orders that can't be allocated over
>>> + * the faulting address and still be fully contained in the vma.
>>> + */
>>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
>>> +{
>>> + struct vm_area_struct *vma = vmf->vma;
>>> + unsigned long orders;
>>> +
>>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>> + BIT(PMD_ORDER) - 1);
>>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>> + return orders;
>>> +}
>>> +#endif
>>> +
>>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>> +{
>>> + struct vm_area_struct *vma = vmf->vma;
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> + unsigned long orders;
>>> + struct folio *folio;
>>> + unsigned long addr;
>>> + pte_t *pte;
>>> + gfp_t gfp;
>>> + int order;
>>> +
>>> + /*
>>> + * If uffd is active for the vma we need per-page fault fidelity to
>>> + * maintain the uffd semantics.
>>> + */
>>> + if (unlikely(userfaultfd_armed(vma)))
>>> + goto fallback;
>>> +
>>> + /*
>>> + * a large folio being swapped-in could be partially in
>>> + * zswap and partially in swap devices, zswap doesn't
>>> + * support large folios yet, we might get corrupted
>>> + * zero-filled data by reading all subpages from swap
>>> + * devices while some of them are actually in zswap
>>> + */
>>> + if (is_zswap_enabled())
>>> + goto fallback;
>>> +
>>> + orders = get_alloc_folio_orders(vmf);
>>> + if (!orders)
>>> + goto fallback;
>>> +
>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>
>> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
>> merged with an enhanced swap_pte_batch()?
> Yes, it's easy to use a lock here, but I'm wondering if it's
> necessary, because when we actually set pte in do_swap_page, we'll
> hold PTL to check if the pte changes.
>>
>>> + if (unlikely(!pte))
>>> + goto fallback;
>>> +
>>> + /*
>>> + * For do_swap_page, find the highest order where the aligned range is
>>> + * completely swap entries with contiguous swap offsets.
>>> + */
>>> + order = highest_order(orders);
>>> + while (orders) {
>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> + if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
>>> + break;
>>> + order = next_order(&orders, order);
>>> + }
>>
>> So in the common case, swap-in will pull in the same size of folio as was
>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> all of the folio reduces so chances are we are wasting IO. There are similar
>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> sense to copy the whole folio up to a certain size.
> For 2M THP, the IO overhead may not necessarily be large? :)
> 1. If the 2M THP is stored contiguously in the swap device, the IO
> overhead may not be very large (such as submitting a bio with one
> bio_vec at a time).
> 2. If the process really needs this 2M of data, one page fault may perform
> much better than multiple faults.
> 3. For swap devices like zram, using 2M THP might also improve
> decompression efficiency.
>
> On the other hand, if the process only needs a small part of the 2M
> data (such as frequently using only one 4K page while the rest of the
> data is never accessed), this is indeed giving a lark to catch a kite! :(

Yes indeed. It's not always clear-cut what the best thing to do is. It would be
good to hear from others on this.

>>
>> Thanks,
>> Ryan
>>
>>> +
>>> + pte_unmap(pte);
>>> +
>>> + /* Try allocating the highest of the remaining orders. */
>>> + gfp = vma_thp_gfp_mask(vma);
>>> + while (orders) {
>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> + if (folio)
>>> + return folio;
>>> + order = next_order(&orders, order);
>>> + }
>>> +
>>> +fallback:
>>> +#endif
>>> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
>>> +}
>>> +
>>> +
>>> /*
>>> * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>> * but allow concurrent faults), and pte mapped but not yet locked.
>>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> pte_t pte;
>>> vm_fault_t ret = 0;
>>> void *shadow = NULL;
>>> + int nr_pages = 1;
>>> + unsigned long start_address;
>>> + pte_t *start_pte;
>>>
>>> if (!pte_unmap_same(vmf))
>>> goto out;
>>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> if (!folio) {
>>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>> __swap_count(entry) == 1) {
>>> - /*
>>> - * Prevent parallel swapin from proceeding with
>>> - * the cache flag. Otherwise, another thread may
>>> - * finish swapin first, free the entry, and swapout
>>> - * reusing the same entry. It's undetectable as
>>> - * pte_same() returns true due to entry reuse.
>>> - */
>>> - if (swapcache_prepare(entry)) {
>>> - /* Relax a bit to prevent rapid repeated page faults */
>>> - schedule_timeout_uninterruptible(1);
>>> - goto out;
>>> - }
>>> - need_clear_cache = true;
>>> -
>>> /* skip swapcache */
>>> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>>> - vma, vmf->address, false);
>>> + folio = alloc_swap_folio(vmf);
>>> page = &folio->page;
>>> if (folio) {
>>> __folio_set_locked(folio);
>>> __folio_set_swapbacked(folio);
>>>
>>> + if (folio_test_large(folio)) {
>>> + nr_pages = folio_nr_pages(folio);
>>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>> + }
>>> +
>>> + /*
>>> + * Prevent parallel swapin from proceeding with
>>> + * the cache flag. Otherwise, another thread may
>>> + * finish swapin first, free the entry, and swapout
>>> + * reusing the same entry. It's undetectable as
>>> + * pte_same() returns true due to entry reuse.
>>> + */
>>> + if (swapcache_prepare_nr(entry, nr_pages)) {
>>> + /* Relax a bit to prevent rapid repeated page faults */
>>> + schedule_timeout_uninterruptible(1);
>>> + goto out;
>>> + }
>>> + need_clear_cache = true;
>>> +
>>> if (mem_cgroup_swapin_charge_folio(folio,
>>> vma->vm_mm, GFP_KERNEL,
>>> entry)) {
>>> ret = VM_FAULT_OOM;
>>> goto out_page;
>>> }
>>> - mem_cgroup_swapin_uncharge_swap(entry);
>>> +
>>> + for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
>>> + mem_cgroup_swapin_uncharge_swap(e);
>>>
>>> shadow = get_shadow_from_swap_cache(entry);
>>> if (shadow)
>>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> */
>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>> &vmf->ptl);
>>> +
>>> + start_address = vmf->address;
>>> + start_pte = vmf->pte;
>>> + if (start_pte && folio_test_large(folio)) {
>>> + unsigned long nr = folio_nr_pages(folio);
>>> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
>>> + pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>>> +
>>> + /*
>>> + * case 1: we are allocating large_folio, try to map it as a whole
>>> + * iff the swap entries are still entirely mapped;
>>> + * case 2: we hit a large folio in swapcache, and all swap entries
>>> + * are still entirely mapped, try to map a large folio as a whole.
>>> + * otherwise, map only the faulting page within the large folio
>>> + * which is swapcache
>>> + */
>>> + if (!is_pte_range_contig_swap(aligned_pte, nr)) {
>>> + if (nr_pages > 1) /* ptes have changed for case 1 */
>>> + goto out_nomap;
>>> + goto check_pte;
>>> + }
>>> +
>>> + start_address = addr;
>>> + start_pte = aligned_pte;
>>> + /*
>>> + * the below has been done before swap_read_folio()
>>> + * for case 1
>>> + */
>>> + if (unlikely(folio == swapcache)) {
>>> + nr_pages = nr;
>>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>> + page = &folio->page;
>>> + }
>>> + }
>>> +
>>> +check_pte:
>>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>> goto out_nomap;
>>>
>>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * We're already holding a reference on the page but haven't mapped it
>>> * yet.
>>> */
>>> - swap_free(entry);
>>> + swap_nr_free(entry, nr_pages);
>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>> folio_free_swap(folio);
>>>
>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> + folio_ref_add(folio, nr_pages - 1);
>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>> +
>>> pte = mk_pte(page, vma->vm_page_prot);
>>>
>>> /*
>>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * exclusivity.
>>> */
>>> if (!folio_test_ksm(folio) &&
>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>> + (exclusive || folio_ref_count(folio) == nr_pages)) {
>>> if (vmf->flags & FAULT_FLAG_WRITE) {
>>> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>> vmf->flags &= ~FAULT_FLAG_WRITE;
>>> }
>>> rmap_flags |= RMAP_EXCLUSIVE;
>>> }
>>> - flush_icache_page(vma, page);
>>> + flush_icache_pages(vma, page, nr_pages);
>>> if (pte_swp_soft_dirty(vmf->orig_pte))
>>> pte = pte_mksoft_dirty(pte);
>>> if (pte_swp_uffd_wp(vmf->orig_pte))
>>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>
>>> /* ksm created a completely new copy */
>>> if (unlikely(folio != swapcache && swapcache)) {
>>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> + folio_add_new_anon_rmap(folio, vma, start_address);
>>> folio_add_lru_vma(folio, vma);
>>> + } else if (!folio_test_anon(folio)) {
>>> + folio_add_new_anon_rmap(folio, vma, start_address);
>>> } else {
>>> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>>> rmap_flags);
>>> }
>>>
>>> VM_BUG_ON(!folio_test_anon(folio) ||
>>> (pte_write(pte) && !PageAnonExclusive(page)));
>>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>>> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>>>
>>> folio_unlock(folio);
>>> if (folio != swapcache && swapcache) {
>>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>
>>> if (vmf->flags & FAULT_FLAG_WRITE) {
>>> + if (nr_pages > 1)
>>> + vmf->orig_pte = ptep_get(vmf->pte);
>>> +
>>> ret |= do_wp_page(vmf);
>>> if (ret & VM_FAULT_ERROR)
>>> ret &= VM_FAULT_ERROR;
>>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>
>>> /* No need to invalidate - it was non-present before */
>>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>>> unlock:
>>> if (vmf->pte)
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> out:
>>> /* Clear the swap cache pin for direct swapin after PTL unlock */
>>> if (need_clear_cache)
>>> - swapcache_clear(si, entry);
>>> + swapcache_clear_nr(si, entry, nr_pages);
>>> if (si)
>>> put_swap_device(si);
>>> return ret;
>>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> folio_put(swapcache);
>>> }
>>> if (need_clear_cache)
>>> - swapcache_clear(si, entry);
>>> + swapcache_clear_nr(si, entry, nr_pages);
>>> if (si)
>>> put_swap_device(si);
>>> return ret;
>>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>> if (unlikely(userfaultfd_armed(vma)))
>>> goto fallback;
>>>
>>> - /*
>>> - * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>> - * for this vma. Then filter out the orders that can't be allocated over
>>> - * the faulting address and still be fully contained in the vma.
>>> - */
>>> - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>> - BIT(PMD_ORDER) - 1);
>>> - orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>> -
>>> + orders = get_alloc_folio_orders(vmf);
>>> if (!orders)
>>> goto fallback;
>>>
>>
>>
>
>


2024-03-14 20:44:18

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Fri, Mar 15, 2024 at 2:57 AM Ryan Roberts <[email protected]> wrote:
>
> On 14/03/2024 12:56, Chuanhua Han wrote:
> > Ryan Roberts <[email protected]> 于2024年3月13日周三 00:33写道:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Chuanhua Han <[email protected]>
> >>>
> >>> On an embedded system like Android, more than half of anon memory is
> >>> actually in swap devices such as zRAM. For example, while an app is
> >>> switched to the background, most of its memory might be swapped out.
> >>>
> >>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>> performance gain we can get through large folios and hardware optimization
> >>> such as CONT-PTE.
> >>>
> >>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >>> to those contiguous swaps which were likely swapped out from mTHP as a
> >>> whole.
> >>>
> >>> Meanwhile, the current implementation only covers the SWAP_SYNCHRONOUS
> >>> case. It doesn't support swapin_readahead with large folios yet, since this
> >>> kind of shared memory is much smaller than memory mapped by a single process.
> >>>
> >>> Right now, we re-fault large folios which are still in the swapcache as a
> >>> whole. This effectively decreases the extra loops and early exits which we
> >>> introduced in arch_swap_restore() while supporting MTE restore for folios
> >>> rather than pages. On the other hand, it also decreases do_swap_page() calls,
> >>> as PTEs used to be set one by one even when we hit a large folio in swapcache.
> >>>
> >>> Signed-off-by: Chuanhua Han <[email protected]>
> >>> Co-developed-by: Barry Song <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> ---
> >>> mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
> >>> 1 file changed, 212 insertions(+), 38 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index e0d34d705e07..501ede745ef3 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >>> return VM_FAULT_SIGBUS;
> >>> }
> >>>
> >>> +/*
> >>> + * check a range of PTEs are completely swap entries with
> >>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> >>> + * pte must be first one in the range
> >>> + */
> >>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> >>> +{
> >>> + int i;
> >>> + struct swap_info_struct *si;
> >>> + swp_entry_t entry;
> >>> + unsigned type;
> >>> + pgoff_t start_offset;
> >>> + char has_cache;
> >>> +
> >>> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> >>
> >> Given you are getting entry locklessly, I expect it could change under you? So
> >> probably need to check that its a swap entry, etc. first?
> > The following non_swap_entry checks to see if it is a swap entry.
>
> No, it checks if something already known to be a "swap entry" type is actually
> describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
> entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:
>
> static inline int non_swap_entry(swp_entry_t entry)
> {
> return swp_type(entry) >= MAX_SWAPFILES;
> }
>
>
> So you need to do something like:
>
> pte = ptep_get_lockless(pte);
> if (pte_none(pte) || pte_present(pte))
> return false;


Indeed, I noticed that a couple of days ago, but it turned out that it
didn't cause any issues, because the condition which follows,
'if (swp_offset(entry) != start_offset + i)', cannot be true :-)

I do agree it needs a fix here, maybe by:

if (!is_swap_pte(pte))
        return false?

> entry = pte_to_swp_entry(pte);
> if (non_swap_entry(entry))
> return false;
> ...
>
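
Putting your suggestion and the is_swap_pte() check together, the next spin
would probably have is_pte_range_contig_swap() look roughly like the below.
This is a completely untested sketch; it also pins the swap device with
get_swap_device()/put_swap_device() so swap_map can't go away under us:

static bool is_pte_range_contig_swap(pte_t *ptep, int nr_pages)
{
        struct swap_info_struct *si;
        swp_entry_t entry;
        pgoff_t start_offset;
        unsigned int type;
        char has_cache;
        bool ret = false;
        pte_t pte;
        int i;

        pte = ptep_get_lockless(ptep);
        if (!is_swap_pte(pte))
                return false;
        entry = pte_to_swp_entry(pte);
        if (non_swap_entry(entry))
                return false;
        start_offset = swp_offset(entry);
        if (start_offset % nr_pages)
                return false;

        /* pin the device so swapoff can't free swap_map under us */
        si = get_swap_device(entry);
        if (!si)
                return false;

        type = swp_type(entry);
        has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
        for (i = 1; i < nr_pages; i++) {
                pte = ptep_get_lockless(ptep + i);
                if (!is_swap_pte(pte))
                        goto out;
                entry = pte_to_swp_entry(pte);
                if (non_swap_entry(entry) ||
                    swp_type(entry) != type ||
                    swp_offset(entry) != start_offset + i)
                        goto out;
                /* all entries must agree on SWAP_HAS_CACHE, as before */
                if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
                        goto out;
        }
        ret = true;
out:
        put_swap_device(si);
        return ret;
}
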
> >>
> >>> + if (non_swap_entry(entry))
> >>> + return false;
> >>> + start_offset = swp_offset(entry);
> >>> + if (start_offset % nr_pages)
> >>> + return false;
> >>> +
> >>> + si = swp_swap_info(entry);
> >>
> >> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
> >> then swap_map may have been freed when you read it below. Holding the PTL can
> >> sometimes prevent it, but I don't think you're holding that here (you're using
> >> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
> > Thank you for your review, you are right! This place really needs
> > get_swap_device()/put_swap_device().
> >>
> >>> + type = swp_type(entry);
> >>> + has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> >>> + for (i = 1; i < nr_pages; i++) {
> >>> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> >>> + if (non_swap_entry(entry))
> >>> + return false;
> >>> + if (swp_offset(entry) != start_offset + i)
> >>> + return false;
> >>> + if (swp_type(entry) != type)
> >>> + return false;
> >>> + /*
> >>> + * while allocating a large folio and doing swap_read_folio for the
> >>> + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> >>> + * doesn't have swapcache. We need to ensure all PTEs have no cache
> >>> + * as well, otherwise, we might go to swap devices while the content
> >>> + * is in swapcache
> >>> + */
> >>> + if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> >>> + return false;
> >>> + }
> >>> +
> >>> + return true;
> >>> +}
> >>
> >> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
> >> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
> >> PTL is held, and you are lockless here. Thought it might be of interest though.
> >>
> >> [1] https://lore.kernel.org/linux-mm/[email protected]/
> >>
> > Thanks. It's probably similar to ours, but as you said we are lockless
> > here, and we need to check has_cache.
> >>> +
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +/*
> >>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> + * for this vma. Then filter out the orders that can't be allocated over
> >>> + * the faulting address and still be fully contained in the vma.
> >>> + */
> >>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> >>> +{
> >>> + struct vm_area_struct *vma = vmf->vma;
> >>> + unsigned long orders;
> >>> +
> >>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> + BIT(PMD_ORDER) - 1);
> >>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> + return orders;
> >>> +}
> >>> +#endif
> >>> +
> >>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>> +{
> >>> + struct vm_area_struct *vma = vmf->vma;
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> + unsigned long orders;
> >>> + struct folio *folio;
> >>> + unsigned long addr;
> >>> + pte_t *pte;
> >>> + gfp_t gfp;
> >>> + int order;
> >>> +
> >>> + /*
> >>> + * If uffd is active for the vma we need per-page fault fidelity to
> >>> + * maintain the uffd semantics.
> >>> + */
> >>> + if (unlikely(userfaultfd_armed(vma)))
> >>> + goto fallback;
> >>> +
> >>> + /*
> >>> + * a large folio being swapped-in could be partially in
> >>> + * zswap and partially in swap devices, zswap doesn't
> >>> + * support large folios yet, we might get corrupted
> >>> + * zero-filled data by reading all subpages from swap
> >>> + * devices while some of them are actually in zswap
> >>> + */
> >>> + if (is_zswap_enabled())
> >>> + goto fallback;
> >>> +
> >>> + orders = get_alloc_folio_orders(vmf);
> >>> + if (!orders)
> >>> + goto fallback;
> >>> +
> >>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>
> >> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
> >> merged with an enhanced swap_pte_batch()?
> > Yes, it's easy to use a lock here, but I'm wondering if it's
> > necessary, because when we actually set pte in do_swap_page, we'll
> > hold PTL to check if the pte changes.
> >>
> >>> + if (unlikely(!pte))
> >>> + goto fallback;
> >>> +
> >>> + /*
> >>> + * For do_swap_page, find the highest order where the aligned range is
> >>> + * completely swap entries with contiguous swap offsets.
> >>> + */
> >>> + order = highest_order(orders);
> >>> + while (orders) {
> >>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> + if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> >>> + break;
> >>> + order = next_order(&orders, order);
> >>> + }
> >>
> >> So in the common case, swap-in will pull in the same size of folio as was
> >> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> all of the folio reduces so chances are we are wasting IO. There are similar
> >> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> sense to copy the whole folio up to a certain size.
> > For 2M THP, the IO overhead may not necessarily be large? :)
> > 1. If the 2M THP is stored contiguously in the swap device, the IO
> > overhead may not be very large (such as submitting a bio with one
> > bio_vec at a time).
> > 2. If the process really needs this 2M of data, one page fault may perform
> > much better than multiple faults.
> > 3. For swap devices like zram, using 2M THP might also improve
> > decompression efficiency.
> >
> > On the other hand, if the process only needs a small part of the 2M
> > data (such as frequently using only one 4K page while the rest of the
> > data is never accessed), this is indeed giving a lark to catch a kite! :(
>
> Yes indeed. It's not always clear-cut what the best thing to do is. It would be
> good to hear from others on this.
>
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> +
> >>> + pte_unmap(pte);
> >>> +
> >>> + /* Try allocating the highest of the remaining orders. */
> >>> + gfp = vma_thp_gfp_mask(vma);
> >>> + while (orders) {
> >>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >>> + if (folio)
> >>> + return folio;
> >>> + order = next_order(&orders, order);
> >>> + }
> >>> +
> >>> +fallback:
> >>> +#endif
> >>> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> >>> +}
> >>> +
> >>> +
> >>> /*
> >>> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>> * but allow concurrent faults), and pte mapped but not yet locked.
> >>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> pte_t pte;
> >>> vm_fault_t ret = 0;
> >>> void *shadow = NULL;
> >>> + int nr_pages = 1;
> >>> + unsigned long start_address;
> >>> + pte_t *start_pte;
> >>>
> >>> if (!pte_unmap_same(vmf))
> >>> goto out;
> >>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> if (!folio) {
> >>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >>> __swap_count(entry) == 1) {
> >>> - /*
> >>> - * Prevent parallel swapin from proceeding with
> >>> - * the cache flag. Otherwise, another thread may
> >>> - * finish swapin first, free the entry, and swapout
> >>> - * reusing the same entry. It's undetectable as
> >>> - * pte_same() returns true due to entry reuse.
> >>> - */
> >>> - if (swapcache_prepare(entry)) {
> >>> - /* Relax a bit to prevent rapid repeated page faults */
> >>> - schedule_timeout_uninterruptible(1);
> >>> - goto out;
> >>> - }
> >>> - need_clear_cache = true;
> >>> -
> >>> /* skip swapcache */
> >>> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >>> - vma, vmf->address, false);
> >>> + folio = alloc_swap_folio(vmf);
> >>> page = &folio->page;
> >>> if (folio) {
> >>> __folio_set_locked(folio);
> >>> __folio_set_swapbacked(folio);
> >>>
> >>> + if (folio_test_large(folio)) {
> >>> + nr_pages = folio_nr_pages(folio);
> >>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> + }
> >>> +
> >>> + /*
> >>> + * Prevent parallel swapin from proceeding with
> >>> + * the cache flag. Otherwise, another thread may
> >>> + * finish swapin first, free the entry, and swapout
> >>> + * reusing the same entry. It's undetectable as
> >>> + * pte_same() returns true due to entry reuse.
> >>> + */
> >>> + if (swapcache_prepare_nr(entry, nr_pages)) {
> >>> + /* Relax a bit to prevent rapid repeated page faults */
> >>> + schedule_timeout_uninterruptible(1);
> >>> + goto out;
> >>> + }
> >>> + need_clear_cache = true;
> >>> +
> >>> if (mem_cgroup_swapin_charge_folio(folio,
> >>> vma->vm_mm, GFP_KERNEL,
> >>> entry)) {
> >>> ret = VM_FAULT_OOM;
> >>> goto out_page;
> >>> }
> >>> - mem_cgroup_swapin_uncharge_swap(entry);
> >>> +
> >>> + for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> >>> + mem_cgroup_swapin_uncharge_swap(e);
> >>>
> >>> shadow = get_shadow_from_swap_cache(entry);
> >>> if (shadow)
> >>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> */
> >>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >>> &vmf->ptl);
> >>> +
> >>> + start_address = vmf->address;
> >>> + start_pte = vmf->pte;
> >>> + if (start_pte && folio_test_large(folio)) {
> >>> + unsigned long nr = folio_nr_pages(folio);
> >>> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> >>> + pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> >>> +
> >>> + /*
> >>> + * case 1: we are allocating large_folio, try to map it as a whole
> >>> + * iff the swap entries are still entirely mapped;
> >>> + * case 2: we hit a large folio in swapcache, and all swap entries
> >>> + * are still entirely mapped, try to map a large folio as a whole.
> >>> + * otherwise, map only the faulting page within the large folio
> >>> + * which is swapcache
> >>> + */
> >>> + if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> >>> + if (nr_pages > 1) /* ptes have changed for case 1 */
> >>> + goto out_nomap;
> >>> + goto check_pte;
> >>> + }
> >>> +
> >>> + start_address = addr;
> >>> + start_pte = aligned_pte;
> >>> + /*
> >>> + * the below has been done before swap_read_folio()
> >>> + * for case 1
> >>> + */
> >>> + if (unlikely(folio == swapcache)) {
> >>> + nr_pages = nr;
> >>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> + page = &folio->page;
> >>> + }
> >>> + }
> >>> +
> >>> +check_pte:
> >>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >>> goto out_nomap;
> >>>
> >>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> * We're already holding a reference on the page but haven't mapped it
> >>> * yet.
> >>> */
> >>> - swap_free(entry);
> >>> + swap_nr_free(entry, nr_pages);
> >>> if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>> folio_free_swap(folio);
> >>>
> >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>> + folio_ref_add(folio, nr_pages - 1);
> >>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>> +
> >>> pte = mk_pte(page, vma->vm_page_prot);
> >>>
> >>> /*
> >>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> * exclusivity.
> >>> */
> >>> if (!folio_test_ksm(folio) &&
> >>> - (exclusive || folio_ref_count(folio) == 1)) {
> >>> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> >>> if (vmf->flags & FAULT_FLAG_WRITE) {
> >>> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >>> vmf->flags &= ~FAULT_FLAG_WRITE;
> >>> }
> >>> rmap_flags |= RMAP_EXCLUSIVE;
> >>> }
> >>> - flush_icache_page(vma, page);
> >>> + flush_icache_pages(vma, page, nr_pages);
> >>> if (pte_swp_soft_dirty(vmf->orig_pte))
> >>> pte = pte_mksoft_dirty(pte);
> >>> if (pte_swp_uffd_wp(vmf->orig_pte))
> >>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>
> >>> /* ksm created a completely new copy */
> >>> if (unlikely(folio != swapcache && swapcache)) {
> >>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> >>> + folio_add_new_anon_rmap(folio, vma, start_address);
> >>> folio_add_lru_vma(folio, vma);
> >>> + } else if (!folio_test_anon(folio)) {
> >>> + folio_add_new_anon_rmap(folio, vma, start_address);
> >>> } else {
> >>> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> >>> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >>> rmap_flags);
> >>> }
> >>>
> >>> VM_BUG_ON(!folio_test_anon(folio) ||
> >>> (pte_write(pte) && !PageAnonExclusive(page)));
> >>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >>> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> >>> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >>>
> >>> folio_unlock(folio);
> >>> if (folio != swapcache && swapcache) {
> >>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> }
> >>>
> >>> if (vmf->flags & FAULT_FLAG_WRITE) {
> >>> + if (nr_pages > 1)
> >>> + vmf->orig_pte = ptep_get(vmf->pte);
> >>> +
> >>> ret |= do_wp_page(vmf);
> >>> if (ret & VM_FAULT_ERROR)
> >>> ret &= VM_FAULT_ERROR;
> >>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> }
> >>>
> >>> /* No need to invalidate - it was non-present before */
> >>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >>> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >>> unlock:
> >>> if (vmf->pte)
> >>> pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>> out:
> >>> /* Clear the swap cache pin for direct swapin after PTL unlock */
> >>> if (need_clear_cache)
> >>> - swapcache_clear(si, entry);
> >>> + swapcache_clear_nr(si, entry, nr_pages);
> >>> if (si)
> >>> put_swap_device(si);
> >>> return ret;
> >>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> folio_put(swapcache);
> >>> }
> >>> if (need_clear_cache)
> >>> - swapcache_clear(si, entry);
> >>> + swapcache_clear_nr(si, entry, nr_pages);
> >>> if (si)
> >>> put_swap_device(si);
> >>> return ret;
> >>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>> if (unlikely(userfaultfd_armed(vma)))
> >>> goto fallback;
> >>>
> >>> - /*
> >>> - * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> - * for this vma. Then filter out the orders that can't be allocated over
> >>> - * the faulting address and still be fully contained in the vma.
> >>> - */
> >>> - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> - BIT(PMD_ORDER) - 1);
> >>> - orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> -
> >>> + orders = get_alloc_folio_orders(vmf);
> >>> if (!orders)
> >>> goto fallback;
> >>>

Thanks
Barry

2024-03-15 01:16:48

by Chuanhua Han

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Ryan Roberts <[email protected]> 于2024年3月14日周四 21:57写道:
>
> On 14/03/2024 12:56, Chuanhua Han wrote:
> > Ryan Roberts <[email protected]> 于2024年3月13日周三 00:33写道:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Chuanhua Han <[email protected]>
> >>>
> >>> On an embedded system like Android, more than half of anon memory is
> >>> actually in swap devices such as zRAM. For example, while an app is
> >>> switched to the background, most of its memory might be swapped out.
> >>>
> >>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>> performance gain we can get through large folios and hardware optimization
> >>> such as CONT-PTE.
> >>>
> >>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >>> to those contiguous swaps which were likely swapped out from mTHP as a
> >>> whole.
> >>>
> >>> Meanwhile, the current implementation only covers the SWAP_SYNCHRONOUS
> >>> case. It doesn't support swapin_readahead with large folios yet, since this
> >>> kind of shared memory is much smaller than memory mapped by a single process.
> >>>
> >>> Right now, we re-fault large folios which are still in the swapcache as a
> >>> whole. This effectively decreases the extra loops and early exits which we
> >>> introduced in arch_swap_restore() while supporting MTE restore for folios
> >>> rather than pages. On the other hand, it also decreases do_swap_page() calls,
> >>> as PTEs used to be set one by one even when we hit a large folio in swapcache.
> >>>
> >>> Signed-off-by: Chuanhua Han <[email protected]>
> >>> Co-developed-by: Barry Song <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> ---
> >>> mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
> >>> 1 file changed, 212 insertions(+), 38 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index e0d34d705e07..501ede745ef3 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >>> return VM_FAULT_SIGBUS;
> >>> }
> >>>
> >>> +/*
> >>> + * check a range of PTEs are completely swap entries with
> >>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> >>> + * pte must be first one in the range
> >>> + */
> >>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> >>> +{
> >>> + int i;
> >>> + struct swap_info_struct *si;
> >>> + swp_entry_t entry;
> >>> + unsigned type;
> >>> + pgoff_t start_offset;
> >>> + char has_cache;
> >>> +
> >>> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> >>
> >> Given you are getting entry locklessly, I expect it could change under you? So
> >> probably need to check that its a swap entry, etc. first?
> > The following non_swap_entry checks to see if it is a swap entry.
>
> No, it checks if something already known to be a "swap entry" type is actually
> describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
> entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:
>
> static inline int non_swap_entry(swp_entry_t entry)
> {
> return swp_type(entry) >= MAX_SWAPFILES;
> }
>
>
> So you need to do something like:
>
> pte = ptep_get_lockless(pte);
> if (pte_none(pte) || pte_present(pte))
> return false;
> entry = pte_to_swp_entry(pte);
> if (non_swap_entry(entry))
> return false;
> ...
>
Indeed, this will be more accurate, thank you very much for your advice!
> >>
> >>> + if (non_swap_entry(entry))
> >>> + return false;
> >>> + start_offset = swp_offset(entry);
> >>> + if (start_offset % nr_pages)
> >>> + return false;
> >>> +
> >>> + si = swp_swap_info(entry);
> >>
> >> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
> >> then swap_map may have been freed when you read it below. Holding the PTL can
> >> sometimes prevent it, but I don't think you're holding that here (you're using
> >> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
> > Thank you for your review, you are right! This place really needs
> > get_swap_device()/put_swap_device().
> >>
> >>> + type = swp_type(entry);
> >>> + has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> >>> + for (i = 1; i < nr_pages; i++) {
> >>> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> >>> + if (non_swap_entry(entry))
> >>> + return false;
> >>> + if (swp_offset(entry) != start_offset + i)
> >>> + return false;
> >>> + if (swp_type(entry) != type)
> >>> + return false;
> >>> + /*
> >>> + * while allocating a large folio and doing swap_read_folio for the
> >>> + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> >>> + * doesn't have swapcache. We need to ensure all PTEs have no cache
> >>> + * as well, otherwise, we might go to swap devices while the content
> >>> + * is in swapcache
> >>> + */
> >>> + if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> >>> + return false;
> >>> + }
> >>> +
> >>> + return true;
> >>> +}
> >>
> >> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
> >> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
> >> PTL is held, and you are lockless here. Thought it might be of interest though.
> >>
> >> [1] https://lore.kernel.org/linux-mm/[email protected]/
> >>
> > Thanks. It's probably similar to ours, but as you said we are lockless
> > here, and we need to check has_cache.
> >>> +
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +/*
> >>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> + * for this vma. Then filter out the orders that can't be allocated over
> >>> + * the faulting address and still be fully contained in the vma.
> >>> + */
> >>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> >>> +{
> >>> + struct vm_area_struct *vma = vmf->vma;
> >>> + unsigned long orders;
> >>> +
> >>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> + BIT(PMD_ORDER) - 1);
> >>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> + return orders;
> >>> +}
> >>> +#endif
> >>> +
> >>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>> +{
> >>> + struct vm_area_struct *vma = vmf->vma;
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> + unsigned long orders;
> >>> + struct folio *folio;
> >>> + unsigned long addr;
> >>> + pte_t *pte;
> >>> + gfp_t gfp;
> >>> + int order;
> >>> +
> >>> + /*
> >>> + * If uffd is active for the vma we need per-page fault fidelity to
> >>> + * maintain the uffd semantics.
> >>> + */
> >>> + if (unlikely(userfaultfd_armed(vma)))
> >>> + goto fallback;
> >>> +
> >>> + /*
> >>> + * a large folio being swapped-in could be partially in
> >>> + * zswap and partially in swap devices, zswap doesn't
> >>> + * support large folios yet, we might get corrupted
> >>> + * zero-filled data by reading all subpages from swap
> >>> + * devices while some of them are actually in zswap
> >>> + */
> >>> + if (is_zswap_enabled())
> >>> + goto fallback;
> >>> +
> >>> + orders = get_alloc_folio_orders(vmf);
> >>> + if (!orders)
> >>> + goto fallback;
> >>> +
> >>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>
> >> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
> >> merged with an enhanced swap_pte_batch()?
> > Yes, it's easy to use a lock here, but I'm wondering if it's
> > necessary, because when we actually set pte in do_swap_page, we'll
> > hold PTL to check if the pte changes.
> >>
> >>> + if (unlikely(!pte))
> >>> + goto fallback;
> >>> +
> >>> + /*
> >>> + * For do_swap_page, find the highest order where the aligned range is
> >>> + * completely swap entries with contiguous swap offsets.
> >>> + */
> >>> + order = highest_order(orders);
> >>> + while (orders) {
> >>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> + if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> >>> + break;
> >>> + order = next_order(&orders, order);
> >>> + }
> >>
> >> So in the common case, swap-in will pull in the same size of folio as was
> >> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> all of the folio reduces so chances are we are wasting IO. There are similar
> >> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> sense to copy the whole folio up to a certain size.
> > For 2M THP, the IO overhead may not necessarily be large? :)
> > 1. If the 2M THP is stored contiguously in the swap device, the IO
> > overhead may not be very large (such as submitting a bio with one
> > bio_vec at a time).
> > 2. If the process really needs this 2M of data, one page fault may perform
> > much better than multiple faults.
> > 3. For swap devices like zram, using 2M THP might also improve
> > decompression efficiency.
> >
> > On the other hand, if the process only needs a small part of the 2M
> > data (such as frequently using only one 4K page while the rest of the
> > data is never accessed), this is indeed giving a lark to catch a kite! :(
>
> Yes indeed. It's not always clear-cut what the best thing to do is. It would be
> good to hear from others on this.
>
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> +
> >>> + pte_unmap(pte);
> >>> +
> >>> + /* Try allocating the highest of the remaining orders. */
> >>> + gfp = vma_thp_gfp_mask(vma);
> >>> + while (orders) {
> >>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >>> + if (folio)
> >>> + return folio;
> >>> + order = next_order(&orders, order);
> >>> + }
> >>> +
> >>> +fallback:
> >>> +#endif
> >>> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> >>> +}
> >>> +
> >>> +
> >>> /*
> >>> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>> * but allow concurrent faults), and pte mapped but not yet locked.
> >>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> pte_t pte;
> >>> vm_fault_t ret = 0;
> >>> void *shadow = NULL;
> >>> + int nr_pages = 1;
> >>> + unsigned long start_address;
> >>> + pte_t *start_pte;
> >>>
> >>> if (!pte_unmap_same(vmf))
> >>> goto out;
> >>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> if (!folio) {
> >>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >>> __swap_count(entry) == 1) {
> >>> - /*
> >>> - * Prevent parallel swapin from proceeding with
> >>> - * the cache flag. Otherwise, another thread may
> >>> - * finish swapin first, free the entry, and swapout
> >>> - * reusing the same entry. It's undetectable as
> >>> - * pte_same() returns true due to entry reuse.
> >>> - */
> >>> - if (swapcache_prepare(entry)) {
> >>> - /* Relax a bit to prevent rapid repeated page faults */
> >>> - schedule_timeout_uninterruptible(1);
> >>> - goto out;
> >>> - }
> >>> - need_clear_cache = true;
> >>> -
> >>> /* skip swapcache */
> >>> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >>> - vma, vmf->address, false);
> >>> + folio = alloc_swap_folio(vmf);
> >>> page = &folio->page;
> >>> if (folio) {
> >>> __folio_set_locked(folio);
> >>> __folio_set_swapbacked(folio);
> >>>
> >>> + if (folio_test_large(folio)) {
> >>> + nr_pages = folio_nr_pages(folio);
> >>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> + }
> >>> +
> >>> + /*
> >>> + * Prevent parallel swapin from proceeding with
> >>> + * the cache flag. Otherwise, another thread may
> >>> + * finish swapin first, free the entry, and swapout
> >>> + * reusing the same entry. It's undetectable as
> >>> + * pte_same() returns true due to entry reuse.
> >>> + */
> >>> + if (swapcache_prepare_nr(entry, nr_pages)) {
> >>> + /* Relax a bit to prevent rapid repeated page faults */
> >>> + schedule_timeout_uninterruptible(1);
> >>> + goto out;
> >>> + }
> >>> + need_clear_cache = true;
> >>> +
> >>> if (mem_cgroup_swapin_charge_folio(folio,
> >>> vma->vm_mm, GFP_KERNEL,
> >>> entry)) {
> >>> ret = VM_FAULT_OOM;
> >>> goto out_page;
> >>> }
> >>> - mem_cgroup_swapin_uncharge_swap(entry);
> >>> +
> >>> + for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> >>> + mem_cgroup_swapin_uncharge_swap(e);
> >>>
> >>> shadow = get_shadow_from_swap_cache(entry);
> >>> if (shadow)
> >>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> */
> >>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >>> &vmf->ptl);
> >>> +
> >>> + start_address = vmf->address;
> >>> + start_pte = vmf->pte;
> >>> + if (start_pte && folio_test_large(folio)) {
> >>> + unsigned long nr = folio_nr_pages(folio);
> >>> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> >>> + pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> >>> +
> >>> + /*
> >>> + * case 1: we are allocating large_folio, try to map it as a whole
> >>> + * iff the swap entries are still entirely mapped;
> >>> + * case 2: we hit a large folio in swapcache, and all swap entries
> >>> + * are still entirely mapped, try to map a large folio as a whole.
> >>> + * otherwise, map only the faulting page within the large folio
> >>> + * which is swapcache
> >>> + */
> >>> + if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> >>> + if (nr_pages > 1) /* ptes have changed for case 1 */
> >>> + goto out_nomap;
> >>> + goto check_pte;
> >>> + }
> >>> +
> >>> + start_address = addr;
> >>> + start_pte = aligned_pte;
> >>> + /*
> >>> + * the below has been done before swap_read_folio()
> >>> + * for case 1
> >>> + */
> >>> + if (unlikely(folio == swapcache)) {
> >>> + nr_pages = nr;
> >>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> + page = &folio->page;
> >>> + }
> >>> + }
> >>> +
> >>> +check_pte:
> >>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >>> goto out_nomap;
> >>>
> >>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> * We're already holding a reference on the page but haven't mapped it
> >>> * yet.
> >>> */
> >>> - swap_free(entry);
> >>> + swap_nr_free(entry, nr_pages);
> >>> if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>> folio_free_swap(folio);
> >>>
> >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>> + folio_ref_add(folio, nr_pages - 1);
> >>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>> +
> >>> pte = mk_pte(page, vma->vm_page_prot);
> >>>
> >>> /*
> >>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> * exclusivity.
> >>> */
> >>> if (!folio_test_ksm(folio) &&
> >>> - (exclusive || folio_ref_count(folio) == 1)) {
> >>> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> >>> if (vmf->flags & FAULT_FLAG_WRITE) {
> >>> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >>> vmf->flags &= ~FAULT_FLAG_WRITE;
> >>> }
> >>> rmap_flags |= RMAP_EXCLUSIVE;
> >>> }
> >>> - flush_icache_page(vma, page);
> >>> + flush_icache_pages(vma, page, nr_pages);
> >>> if (pte_swp_soft_dirty(vmf->orig_pte))
> >>> pte = pte_mksoft_dirty(pte);
> >>> if (pte_swp_uffd_wp(vmf->orig_pte))
> >>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>
> >>> /* ksm created a completely new copy */
> >>> if (unlikely(folio != swapcache && swapcache)) {
> >>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> >>> + folio_add_new_anon_rmap(folio, vma, start_address);
> >>> folio_add_lru_vma(folio, vma);
> >>> + } else if (!folio_test_anon(folio)) {
> >>> + folio_add_new_anon_rmap(folio, vma, start_address);
> >>> } else {
> >>> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> >>> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >>> rmap_flags);
> >>> }
> >>>
> >>> VM_BUG_ON(!folio_test_anon(folio) ||
> >>> (pte_write(pte) && !PageAnonExclusive(page)));
> >>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >>> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> >>> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >>>
> >>> folio_unlock(folio);
> >>> if (folio != swapcache && swapcache) {
> >>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> }
> >>>
> >>> if (vmf->flags & FAULT_FLAG_WRITE) {
> >>> + if (nr_pages > 1)
> >>> + vmf->orig_pte = ptep_get(vmf->pte);
> >>> +
> >>> ret |= do_wp_page(vmf);
> >>> if (ret & VM_FAULT_ERROR)
> >>> ret &= VM_FAULT_ERROR;
> >>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> }
> >>>
> >>> /* No need to invalidate - it was non-present before */
> >>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >>> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >>> unlock:
> >>> if (vmf->pte)
> >>> pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>> out:
> >>> /* Clear the swap cache pin for direct swapin after PTL unlock */
> >>> if (need_clear_cache)
> >>> - swapcache_clear(si, entry);
> >>> + swapcache_clear_nr(si, entry, nr_pages);
> >>> if (si)
> >>> put_swap_device(si);
> >>> return ret;
> >>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> folio_put(swapcache);
> >>> }
> >>> if (need_clear_cache)
> >>> - swapcache_clear(si, entry);
> >>> + swapcache_clear_nr(si, entry, nr_pages);
> >>> if (si)
> >>> put_swap_device(si);
> >>> return ret;
> >>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>> if (unlikely(userfaultfd_armed(vma)))
> >>> goto fallback;
> >>>
> >>> - /*
> >>> - * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> - * for this vma. Then filter out the orders that can't be allocated over
> >>> - * the faulting address and still be fully contained in the vma.
> >>> - */
> >>> - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> - BIT(PMD_ORDER) - 1);
> >>> - orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> -
> >>> + orders = get_alloc_folio_orders(vmf);
> >>> if (!orders)
> >>> goto fallback;
> >>>
> >>
> >>
> >
> >
>


--
Thanks,
Chuanhua

2024-03-15 08:34:56

by Chuanhua Han

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

Ryan Roberts <[email protected]> 于2024年3月14日周四 21:43写道:
>
> On 14/03/2024 13:12, Chuanhua Han wrote:
> > Ryan Roberts <[email protected]> 于2024年3月12日周二 02:51写道:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Chuanhua Han <[email protected]>
> >>>
> >>> While swapping in a large folio, we need to free swaps related to the whole
> >>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>> to introduce an API for batched free.
> >>>
> >>> Signed-off-by: Chuanhua Han <[email protected]>
> >>> Co-developed-by: Barry Song <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> ---
> >>> include/linux/swap.h | 6 ++++++
> >>> mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
> >>> 2 files changed, 41 insertions(+)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index 2955f7a78d8d..d6ab27929458 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>> extern int swap_duplicate(swp_entry_t);
> >>> extern int swapcache_prepare(swp_entry_t);
> >>> extern void swap_free(swp_entry_t);
> >>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> >>
> >> nit: In my swap-out v4 series, I've created a batched version of
> >> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> >> preferable to align the naming schemes - i.e. call this swap_free_nr() Your
> >> scheme doesn't really work when applied to free_swap_and_cache().
> > Thanks for your suggestions, and for the next version we'll see which
> > naming is more appropriate!
> >>
> >>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>> extern int free_swap_and_cache(swp_entry_t);
> >>> int swap_type_of(dev_t device, sector_t offset);
> >>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> >>> {
> >>> }
> >>>
> >>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> +
> >>> +}
> >>> +
> >>> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>> {
> >>> }
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index 3f594be83b58..244106998a69 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> >>> __swap_entry_free(p, entry);
> >>> }
> >>>
> >>> +/*
> >>> + * Called after swapping in a large folio, batched free swap entries
> >>> + * for this large folio, entry should be for the first subpage and
> >>> + * its offset is aligned with nr_pages
> >>> + */
> >>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> + int i;
> >>> + struct swap_cluster_info *ci;
> >>> + struct swap_info_struct *p;
> >>> + unsigned type = swp_type(entry);
> >>
> >> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
> >> int" or at least it did for me when I did something similar in my swap-out patch
> >> set.
> > Gee, thanks for pointing that out!
> >>
> >>> + unsigned long offset = swp_offset(entry);
> >>> + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> >>
> >> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> >> increases. But the only other way I can think of is to explicitly loop over
> >> fixed size chunks, and that's not much better.
> > Is it possible to save kernel stack by using a bitmap here? If
> > SWAPFILE_CLUSTER=512, we consume only (512/64)*8 = 64 bytes.
>
> I'm not sure I've understood what you are saying? You're already using
> DECLARE_BITMAP(), so it's already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
>
> I actually did a bad job of trying to express a couple of different points:
>
> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
> Certainly not for arm64, but not sure about other architectures. For example if
> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
> for the bitmap, which is now looking pretty big for the stack.
I agree with you. The current bitmap grows linearly with
SWAPFILE_CLUSTER, which may cause the kernel stack to swell.
I need to think of a way to save more memory.
>
> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
> define a fixed stack size (e.g. 64 bytes -> 512 entries). Then free the range of
> entries in batches no bigger than this size. This approach could also allow
> removing the constraint that the range has to be aligned and fit in a single
> cluster. Personally I think an approach like this would be much more robust, in
> return for a tiny bit more complexity.
Because we cannot determine how many swap entries a cluster has for a
given architecture or configuration, how large would that fixed-size
variable need to be?
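
Or do you mean something like the below untested sketch? It caps the
on-stack bitmap at a made-up SWAP_BATCH_NR of 512 entries (64 bytes) and
walks the range in chunks of that size, so the stack usage no longer
depends on SWAPFILE_CLUSTER; it also folds in the _swap_info_get() NULL
check and lock_cluster_or_swap_info() you pointed out:

#define SWAP_BATCH_NR	512	/* fixed: 512 entries -> 64 bytes of stack */

void swap_nr_free(swp_entry_t entry, int nr_pages)
{
        DECLARE_BITMAP(usage, SWAP_BATCH_NR);
        unsigned long offset = swp_offset(entry);
        unsigned int type = swp_type(entry);
        struct swap_cluster_info *ci;
        struct swap_info_struct *p;
        int i, nr;

        /* as in the RFC, all entries are assumed to sit in one cluster */
        VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);

        p = _swap_info_get(entry);
        if (!p)
                return;

        while (nr_pages) {
                nr = min(nr_pages, SWAP_BATCH_NR);
                bitmap_zero(usage, nr);

                ci = lock_cluster_or_swap_info(p, offset);
                for (i = 0; i < nr; i++) {
                        if (__swap_entry_free_locked(p, offset + i, 1))
                                __bitmap_set(usage, i, 1);
                }
                unlock_cluster_or_swap_info(p, ci);

                /* entries whose usage dropped to zero go to the slot cache */
                for_each_clear_bit(i, usage, nr)
                        free_swap_slot(swp_entry(type, offset + i));

                offset += nr;
                nr_pages -= nr;
        }
}
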
>
> >>
> >>> +
> >>> + /* all swap entries are within a cluster for mTHP */
> >>> + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>> +
> >>> + if (nr_pages == 1) {
> >>> + swap_free(entry);
> >>> + return;
> >>> + }
> >>> +
> >>> + p = _swap_info_get(entry);
> >>
> >> You need to handle this returning NULL, like swap_free() does.
> > Yes, you're right! We did forget to check for NULL here.
> >>
> >>> +
> >>> + ci = lock_cluster(p, offset);
> >>
> >> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> >> by rotating media, and clusters are not in use, it will lock the whole swap
> >> info. But your new version only calls lock_cluster() which won't lock anything
> >> if clusters are not in use. So I think this is a locking bug.
> > Again, you're right, it's a bug!
> >>
> >>> + for (i = 0; i < nr_pages; i++) {
> >>> + if (__swap_entry_free_locked(p, offset + i, 1))
> >>> + __bitmap_set(usage, i, 1);
> >>> + }
> >>> + unlock_cluster(ci);
> >>> +
> >>> + for_each_clear_bit(i, usage, nr_pages)
> >>> + free_swap_slot(swp_entry(type, offset + i));
> >>> +}
> >>> +
> >>> /*
> >>> * Called after dropping swapcache to decrease refcnt to swap entries.
> >>> */
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >
> >
>


--
Thanks,
Chuanhua

2024-03-15 08:43:56

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Barry Song <[email protected]> writes:

> From: Chuanhua Han <[email protected]>
>
> On an embedded system like Android, more than half of anon memory is
> actually in swap devices such as zRAM. For example, while an app is
> switched to the background, most of its memory might be swapped out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a
> whole.
>
> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet since this
> kind of shared memory is much less than memory mapped by single process.

In contrast, I still think that it's better to start with the normal swap-in
path, then expand to the SWP_SYNCHRONOUS_IO case.

In the normal swap-in path, we can take advantage of swap readahead
information to determine the swapped-in large folio order. That is, if
the return value of swapin_nr_pages() > 1, then we can try to allocate
and swap in a large folio.
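
For illustration, the order selection might look roughly like the sketch
below. swapin_nr_pages() is the existing readahead heuristic (currently
static in mm/swap_state.c), and the clamping reuses the helpers that
alloc_anon_folio() already uses. This is only an idea sketch, not code
from this patchset.

/*
 * Idea sketch only (not from this series): derive a candidate swap-in
 * folio order from the readahead window, then clamp it to the orders
 * enabled for this VMA.
 */
static int swapin_ra_order(struct vm_fault *vmf, swp_entry_t entry)
{
	struct vm_area_struct *vma = vmf->vma;
	int nr = swapin_nr_pages(swp_offset(entry));	/* readahead window */
	int order = nr > 1 ? ilog2(nr) : 0;
	unsigned long orders;

	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
					  BIT(order + 1) - 1);
	orders = thp_vma_suitable_orders(vma, vmf->address, orders);

	return orders ? highest_order(orders) : 0;
}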

To do that, we need to track whether the sub-pages are accessed. I
guess we need that information for large file folio readahead too.

Hi, Matthew,

Can you help us with tracking whether the sub-pages of a readahead large
folio have been accessed?

> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page. On the other hand, it can also decrease do_swap_page as
> PTEs used to be set one by one even we hit a large folio in swapcache.
>

--
Best Regards,
Huang, Ying

2024-03-15 08:54:52

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > From: Chuanhua Han <[email protected]>
> >
> > On an embedded system like Android, more than half of anon memory is
> > actually in swap devices such as zRAM. For example, while an app is
> > switched to background, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a
> > whole.
> >
> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet since this
> > kind of shared memory is much less than memory mapped by single process.
>
> In contrast, I still think that it's better to start with normal swap-in
> path, then expand to SWAP_SYCHRONOUS case.

I'd rather try the reverse direction, as non-sync anon memory is only around
3% on a phone in my observation.

>
> In normal swap-in path, we can take advantage of swap readahead
> information to determine the swapped-in large folio order. That is, if
> the return value of swapin_nr_pages() > 1, then we can try to allocate
> and swapin a large folio.

I am not quite sure we still need to depend on this. In do_anonymous_page(),
we have broken the assumption and allocated a large folio directly.

On the other hand, compressing/decompressing large folios as a whole rather
than page by page can save a large percentage of CPU time and achieve a
much better (lower) compression ratio. With a hardware accelerator, this is
even faster.

So I'd rather get large folio swap-in involved more aggressively than
depend on readahead.

>
> To do that, we need to track whether the sub-pages are accessed. I
> guess we need that information for large file folio readahead too.
>
> Hi, Matthew,
>
> Can you help us on tracking whether the sub-pages of a readahead large
> folio has been accessed?
>
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page. On the other hand, it can also decrease do_swap_page as
> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-03-15 09:17:57

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Barry Song <[email protected]> writes:

> On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > From: Chuanhua Han <[email protected]>
>> >
>> > On an embedded system like Android, more than half of anon memory is
>> > actually in swap devices such as zRAM. For example, while an app is
>> > switched to background, its most memory might be swapped-out.
>> >
>> > Now we have mTHP features, unfortunately, if we don't support large folios
>> > swap-in, once those large folios are swapped-out, we immediately lose the
>> > performance gain we can get through large folios and hardware optimization
>> > such as CONT-PTE.
>> >
>> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>> > to those contiguous swaps which were likely swapped out from mTHP as a
>> > whole.
>> >
>> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>> > case. It doesn't support swapin_readahead as large folios yet since this
>> > kind of shared memory is much less than memory mapped by single process.
>>
>> In contrast, I still think that it's better to start with normal swap-in
>> path, then expand to SWAP_SYCHRONOUS case.
>
> I'd rather try the reverse direction as non-sync anon memory is only around
> 3% in a phone as my observation.

Phones are not the only platform that Linux runs on.

>>
>> In normal swap-in path, we can take advantage of swap readahead
>> information to determine the swapped-in large folio order. That is, if
>> the return value of swapin_nr_pages() > 1, then we can try to allocate
>> and swapin a large folio.
>
> I am not quite sure we still need to depend on this. in do_anon_page,
> we have broken the assumption and allocated a large folio directly.

I don't think that we have a sophisticated policy to allocate large
folios. Large folios could waste memory for some workloads, so I don't
think that it's a good idea to always allocate large folios.

Readahead gives us an opportunity to play with the policy.

> On the other hand, compressing/decompressing large folios as a
> whole rather than doing it one by one can save a large percent of
> CPUs and provide a much lower compression ratio. With a hardware
> accelerator, this is even faster.

I am not against supporting large folios for compressing/decompressing.

I just suggest doing that later, after we play with the normal swap-in
path. The SWP_SYNCHRONOUS_IO swap-in code is an optimization based on
normal swap. So it seems natural to support large folio swap-in for the
normal swap-in path first.

> So I'd rather more aggressively get large folios swap-in involved
> than depending on readahead.

We can take advantage of the readahead algorithm in the SWP_SYNCHRONOUS_IO
optimization too. The sub-pages that are not accessed by the page fault can
be treated as readahead. I think that is a better policy than always
allocating a large folio.

>>
>> To do that, we need to track whether the sub-pages are accessed. I
>> guess we need that information for large file folio readahead too.
>>
>> Hi, Matthew,
>>
>> Can you help us on tracking whether the sub-pages of a readahead large
>> folio has been accessed?
>>
>> > Right now, we are re-faulting large folios which are still in swapcache as a
>> > whole, this can effectively decrease extra loops and early-exitings which we
>> > have increased in arch_swap_restore() while supporting MTE restore for folios
>> > rather than page. On the other hand, it can also decrease do_swap_page as
>> > PTEs used to be set one by one even we hit a large folio in swapcache.
>> >
>>
--
Best Regards,
Huang, Ying

2024-03-15 10:04:28

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > From: Chuanhua Han <[email protected]>
> >> >
> >> > On an embedded system like Android, more than half of anon memory is
> >> > actually in swap devices such as zRAM. For example, while an app is
> >> > switched to background, its most memory might be swapped-out.
> >> >
> >> > Now we have mTHP features, unfortunately, if we don't support large folios
> >> > swap-in, once those large folios are swapped-out, we immediately lose the
> >> > performance gain we can get through large folios and hardware optimization
> >> > such as CONT-PTE.
> >> >
> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> > whole.
> >> >
> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >> > case. It doesn't support swapin_readahead as large folios yet since this
> >> > kind of shared memory is much less than memory mapped by single process.
> >>
> >> In contrast, I still think that it's better to start with normal swap-in
> >> path, then expand to SWAP_SYCHRONOUS case.
> >
> > I'd rather try the reverse direction as non-sync anon memory is only around
> > 3% in a phone as my observation.
>
> Phone is not the only platform that Linux is running on.

I suppose it's generally true that forked shared anonymous pages only
constitute a small portion of all anonymous pages; the majority of
anonymous pages are within a single process.

I agree phones are not the only platform. But Rome wasn't built in a day.
I can only get started on hardware which I can easily reach and for which
I have enough hardware/test resources. So we may take the first step,
which can be applied on a real product and improve its performance, and
then step by step broaden it and make it widely useful to various areas
which I can't reach :-)

So probably we can have a sysfs "enable" entry with default "n", or have a
maximum swap-in order as Ryan suggested [1] at the beginning:

"
So in the common case, swap-in will pull in the same size of folio as was
swapped-out. Is that definitely the right policy for all folio sizes? Certainly
it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
it makes sense for 2M THP; As the size increases the chances of actually needing
all of the folio reduces so chances are we are wasting IO. There are similar
arguments for CoW, where we currently copy 1 page per fault - it probably makes
sense to copy the whole folio up to a certain size.
"

>
> >>
> >> In normal swap-in path, we can take advantage of swap readahead
> >> information to determine the swapped-in large folio order. That is, if
> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
> >> and swapin a large folio.
> >
> > I am not quite sure we still need to depend on this. in do_anon_page,
> > we have broken the assumption and allocated a large folio directly.
>
> I don't think that we have a sophisticated policy to allocate large
> folio. Large folio could waste memory for some workloads, so I think
> that it's a good idea to allocate large folio always.

I agree, but we still have the below check, just like do_anonymous_page() has:

orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);

In do_anonymous_page(), we don't worry so much about the waste; the same
logic also applies to do_swap_page().

>
> Readahead gives us an opportunity to play with the policy.

I feel the rules of the game have somehow changed with an upper limit for
the swap-in size. For example, if the upper limit is order 4, it limits the
folio size to 64KiB, which is still a proper size for ARM64, whose base
page can be 64KiB.

On the other hand, while swapping out large folios, we will always compress
them as a whole (the zsmalloc/zRAM patch will come in a couple of days). If
we choose to decompress a subpage instead of a large folio in
do_swap_page(), we might need to decompress nr_pages times. For example:

For a 16*4KiB large folio, it is saved as one large object in zsmalloc
(with the coming patch). If we swap in a small folio, we decompress the
whole large object; next time, we will still need to decompress that large
object again. So it is more sensible to swap in a large folio if we find
those swap entries are contiguous and were allocated by a large-folio
swap-out.

>
> > On the other hand, compressing/decompressing large folios as a
> > whole rather than doing it one by one can save a large percent of
> > CPUs and provide a much lower compression ratio. With a hardware
> > accelerator, this is even faster.
>
> I am not against to support large folio for compressing/decompressing.
>
> I just suggest to do that later, after we play with normal swap-in.
> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
> swap. So, it seems natural to support large folio swap-in for normal
> swap-in firstly.

I feel like SWP_SYNCHRONOUS_IO is a simpler case and even more "normal"
than the swapcache path, since it is the majority. On the other hand, a lot
of modification is required for the swapcache path. In OPPO's code [1], we
did bring up both paths, but the swapcache path is much, much more
complicated than the SYNC path and hasn't shown a really noticeable
improvement.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12

>
> > So I'd rather more aggressively get large folios swap-in involved
> > than depending on readahead.
>
> We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
> optimization too. The sub-pages that is not accessed by page fault can
> be treated as readahead. I think that is a better policy than
> allocating large folio always.

Considering the zsmalloc optimization, it would be a better choice to
always allocate large folios if we find those swap entries are for a
swapped-out large folio, as by decompressing just once we get all the
subpages. Some hardware accelerators are even able to decompress a large
folio with multiple hardware threads; for example, 16 hardware threads can
decompress each subpage of a large folio at the same time, so it is just as
fast as decompressing one subpage.

For platforms without the above optimizations, a proper upper limit will
help them disable large folio swap-in or decrease its impact. For example,
if the upper limit is order 0, we have effectively removed this patchset;
if the upper limit is order 2, it is just like the base page size being
16KiB.

>
> >>
> >> To do that, we need to track whether the sub-pages are accessed. I
> >> guess we need that information for large file folio readahead too.
> >>
> >> Hi, Matthew,
> >>
> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> folio has been accessed?
> >>
> >> > Right now, we are re-faulting large folios which are still in swapcache as a
> >> > whole, this can effectively decrease extra loops and early-exitings which we
> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
> >> > rather than page. On the other hand, it can also decrease do_swap_page as
> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >> >
> >>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-03-15 10:58:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

On 15/03/2024 08:34, Chuanhua Han wrote:
> Ryan Roberts <[email protected]> 于2024年3月14日周四 21:43写道:
>>
>> On 14/03/2024 13:12, Chuanhua Han wrote:
>>> Ryan Roberts <[email protected]> 于2024年3月12日周二 02:51写道:
>>>>
>>>> On 04/03/2024 08:13, Barry Song wrote:
>>>>> From: Chuanhua Han <[email protected]>
>>>>>
>>>>> While swapping in a large folio, we need to free swaps related to the whole
>>>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
>>>>> to introduce an API for batched free.
>>>>>
>>>>> Signed-off-by: Chuanhua Han <[email protected]>
>>>>> Co-developed-by: Barry Song <[email protected]>
>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>> ---
>>>>> include/linux/swap.h | 6 ++++++
>>>>> mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
>>>>> 2 files changed, 41 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index 2955f7a78d8d..d6ab27929458 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>>>>> extern int swap_duplicate(swp_entry_t);
>>>>> extern int swapcache_prepare(swp_entry_t);
>>>>> extern void swap_free(swp_entry_t);
>>>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
>>>>
>>>> nit: In my swap-out v4 series, I've created a batched version of
>>>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
>>>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
>>>> scheme doesn't really work when applied to free_swap_and_cache().
>>> Thanks for your suggestions, and for the next version, we'll see which
>>> package is more appropriate!
>>>>
>>>>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>>> extern int free_swap_and_cache(swp_entry_t);
>>>>> int swap_type_of(dev_t device, sector_t offset);
>>>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
>>>>> {
>>>>> }
>>>>>
>>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>>>> +{
>>>>> +
>>>>> +}
>>>>> +
>>>>> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>>>>> {
>>>>> }
>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>> index 3f594be83b58..244106998a69 100644
>>>>> --- a/mm/swapfile.c
>>>>> +++ b/mm/swapfile.c
>>>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
>>>>> __swap_entry_free(p, entry);
>>>>> }
>>>>>
>>>>> +/*
>>>>> + * Called after swapping in a large folio, batched free swap entries
>>>>> + * for this large folio, entry should be for the first subpage and
>>>>> + * its offset is aligned with nr_pages
>>>>> + */
>>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>>>> +{
>>>>> + int i;
>>>>> + struct swap_cluster_info *ci;
>>>>> + struct swap_info_struct *p;
>>>>> + unsigned type = swp_type(entry);
>>>>
>>>> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
>>>> int" or at least it did for me when I did something similar in my swap-out patch
>>>> set.
>>> Gee, thanks for pointing that out!
>>>>
>>>>> + unsigned long offset = swp_offset(entry);
>>>>> + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
>>>>
>>>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
>>>> increases. But the only other way I can think of is to explicitly loop over
>>>> fixed size chunks, and that's not much better.
>>> Is it possible to save kernel stack better by using bit_map here? If
>>> SWAPFILE_CLUSTER=512, we consume only (512/64)*8= 64 bytes.
>>
>> I'm not sure I've understood what you are saying? You're already using
>> DECLARE_BITMAP(), so its already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
>>
>> I actually did a bad job of trying to express a couple of different points:
>>
>> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
>> Certainly not for arm64, but not sure about other architectures. For example if
>> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
>> for the bitmap, which is now looking pretty big for the stack.
> I agree with you.The current bit_map grows linearly with the
> SWAPFILE_CLUSTER, which may cause the kernel stack to swell.
> I need to think of a way to save more memory .
>>
>> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
>> define a fixed stack size (e.g. 64 bytes -> 512 entries). Then free the range of
>> entries in batches no bigger than this size. This approach could also allow
>> removing the constraint that the range has to be aligned and fit in a single
>> cluster. Personally I think an approach like this would be much more robust, in
>> return for a tiny bit more complexity.
> Because we cannot determine how many swap entries a cluster has in an
> architecture or a configuration, we do not know how large the variable
> needs to be defined?

My point is that we could define a fixed size, then loop through the passed in
range, operating on batches of that fixed size. You could even take into
consideration the cluster boundaries so that you take the correct lock for every
batch and can drop the "must be naturally aligned, must be no bigger than
cluster size" constraint.


>>
>>>>
>>>>> +
>>>>> + /* all swap entries are within a cluster for mTHP */
>>>>> + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>>>>> +
>>>>> + if (nr_pages == 1) {
>>>>> + swap_free(entry);
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + p = _swap_info_get(entry);
>>>>
>>>> You need to handle this returning NULL, like swap_free() does.
>>> Yes, you're right! We did forget to judge NULL here.
>>>>
>>>>> +
>>>>> + ci = lock_cluster(p, offset);
>>>>
>>>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
>>>> by rotating media, and clusters are not in use, it will lock the whole swap
>>>> info. But your new version only calls lock_cluster() which won't lock anything
>>>> if clusters are not in use. So I think this is a locking bug.
>>> Again, you're right, it's bug!
>>>>
>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>> + if (__swap_entry_free_locked(p, offset + i, 1))
>>>>> + __bitmap_set(usage, i, 1);
>>>>> + }
>>>>> + unlock_cluster(ci);
>>>>> +
>>>>> + for_each_clear_bit(i, usage, nr_pages)
>>>>> + free_swap_slot(swp_entry(type, offset + i));
>>>>> +}
>>>>> +
>>>>> /*
>>>>> * Called after dropping swapcache to decrease refcnt to swap entries.
>>>>> */
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>
>>>
>>
>
>


2024-03-15 10:59:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On 14/03/2024 20:43, Barry Song wrote:
> On Fri, Mar 15, 2024 at 2:57 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 14/03/2024 12:56, Chuanhua Han wrote:
>>> Ryan Roberts <[email protected]> 于2024年3月13日周三 00:33写道:
>>>>
>>>> On 04/03/2024 08:13, Barry Song wrote:
>>>>> From: Chuanhua Han <[email protected]>
>>>>>
>>>>> On an embedded system like Android, more than half of anon memory is
>>>>> actually in swap devices such as zRAM. For example, while an app is
>>>>> switched to background, its most memory might be swapped-out.
>>>>>
>>>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>>>> performance gain we can get through large folios and hardware optimization
>>>>> such as CONT-PTE.
>>>>>
>>>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>>>>> to those contiguous swaps which were likely swapped out from mTHP as a
>>>>> whole.
>>>>>
>>>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>>>>> case. It doesn't support swapin_readahead as large folios yet since this
>>>>> kind of shared memory is much less than memory mapped by single process.
>>>>>
>>>>> Right now, we are re-faulting large folios which are still in swapcache as a
>>>>> whole, this can effectively decrease extra loops and early-exitings which we
>>>>> have increased in arch_swap_restore() while supporting MTE restore for folios
>>>>> rather than page. On the other hand, it can also decrease do_swap_page as
>>>>> PTEs used to be set one by one even we hit a large folio in swapcache.
>>>>>
>>>>> Signed-off-by: Chuanhua Han <[email protected]>
>>>>> Co-developed-by: Barry Song <[email protected]>
>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>> ---
>>>>> mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
>>>>> 1 file changed, 212 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index e0d34d705e07..501ede745ef3 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>>>>> return VM_FAULT_SIGBUS;
>>>>> }
>>>>>
>>>>> +/*
>>>>> + * check a range of PTEs are completely swap entries with
>>>>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
>>>>> + * pte must be first one in the range
>>>>> + */
>>>>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
>>>>> +{
>>>>> + int i;
>>>>> + struct swap_info_struct *si;
>>>>> + swp_entry_t entry;
>>>>> + unsigned type;
>>>>> + pgoff_t start_offset;
>>>>> + char has_cache;
>>>>> +
>>>>> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
>>>>
>>>> Given you are getting entry locklessly, I expect it could change under you? So
>>>> probably need to check that its a swap entry, etc. first?
>>> The following non_swap_entry checks to see if it is a swap entry.
>>
>> No, it checks if something already known to be a "swap entry" type is actually
>> describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
>> entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:
>>
>> static inline int non_swap_entry(swp_entry_t entry)
>> {
>> return swp_type(entry) >= MAX_SWAPFILES;
>> }
>>
>>
>> So you need to do something like:
>>
>> pte = ptep_get_lockless(pte);
>> if (pte_none(pte) || !pte_present(pte))
>> return false;
>
>
> Indeed, I noticed that a couple of days ago, but it turned out that it
> didn't cause any issues
> because the condition following 'if (swp_offset(entry) != start_offset
> + i)' cannot be true :-)
>
> I do agree it needs a fix here. maybe by
>
> if (!is_swap_pte(pte))
> return false?

Nice! I hadn't noticed is_swap_pte().
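
Putting the two suggestions together, the corrected prologue would look
roughly like this (a sketch, not the final patch):

	/* read the PTE once and bail out early if it isn't a swap PTE */
	pte_t ptent = ptep_get_lockless(pte);
	swp_entry_t entry;

	if (!is_swap_pte(ptent))
		return false;
	entry = pte_to_swp_entry(ptent);
	if (non_swap_entry(entry))
		return false;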

>
>> entry = pte_to_swp_entry(pte);
>> if (non_swap_entry(entry))
>> return false;
>> ...
>>
>>>>
>>>>> + if (non_swap_entry(entry))
>>>>> + return false;
>>>>> + start_offset = swp_offset(entry);
>>>>> + if (start_offset % nr_pages)
>>>>> + return false;
>>>>> +
>>>>> + si = swp_swap_info(entry);
>>>>
>>>> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
>>>> then swap_map may have been freed when you read it below. Holding the PTL can
>>>> sometimes prevent it, but I don't think you're holding that here (you're using
>>>> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
>>> Thank you for your review,you are righit! this place reaally needs
>>> get_swap_device()/put_swap_device().
>>>>
>>>>> + type = swp_type(entry);
>>>>> + has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
>>>>> + for (i = 1; i < nr_pages; i++) {
>>>>> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
>>>>> + if (non_swap_entry(entry))
>>>>> + return false;
>>>>> + if (swp_offset(entry) != start_offset + i)
>>>>> + return false;
>>>>> + if (swp_type(entry) != type)
>>>>> + return false;
>>>>> + /*
>>>>> + * while allocating a large folio and doing swap_read_folio for the
>>>>> + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
>>>>> + * doesn't have swapcache. We need to ensure all PTEs have no cache
>>>>> + * as well, otherwise, we might go to swap devices while the content
>>>>> + * is in swapcache
>>>>> + */
>>>>> + if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
>>>>> + return false;
>>>>> + }
>>>>> +
>>>>> + return true;
>>>>> +}
>>>>
>>>> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
>>>> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
>>>> PTL is held, and you are lockless here. Thought it might be of interest though.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>> Thanks. It's probably simily to ours, but as you said we are lockless
>>> here, and we need to check has_cache.
>>>>> +
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +/*
>>>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>> + * for this vma. Then filter out the orders that can't be allocated over
>>>>> + * the faulting address and still be fully contained in the vma.
>>>>> + */
>>>>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
>>>>> +{
>>>>> + struct vm_area_struct *vma = vmf->vma;
>>>>> + unsigned long orders;
>>>>> +
>>>>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>>> + BIT(PMD_ORDER) - 1);
>>>>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>>> + return orders;
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>> +{
>>>>> + struct vm_area_struct *vma = vmf->vma;
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> + unsigned long orders;
>>>>> + struct folio *folio;
>>>>> + unsigned long addr;
>>>>> + pte_t *pte;
>>>>> + gfp_t gfp;
>>>>> + int order;
>>>>> +
>>>>> + /*
>>>>> + * If uffd is active for the vma we need per-page fault fidelity to
>>>>> + * maintain the uffd semantics.
>>>>> + */
>>>>> + if (unlikely(userfaultfd_armed(vma)))
>>>>> + goto fallback;
>>>>> +
>>>>> + /*
>>>>> + * a large folio being swapped-in could be partially in
>>>>> + * zswap and partially in swap devices, zswap doesn't
>>>>> + * support large folios yet, we might get corrupted
>>>>> + * zero-filled data by reading all subpages from swap
>>>>> + * devices while some of them are actually in zswap
>>>>> + */
>>>>> + if (is_zswap_enabled())
>>>>> + goto fallback;
>>>>> +
>>>>> + orders = get_alloc_folio_orders(vmf);
>>>>> + if (!orders)
>>>>> + goto fallback;
>>>>> +
>>>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>>
>>>> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
>>>> merged with an enhanced swap_pte_batch()?
>>> Yes, it's easy to use a lock here, but I'm wondering if it's
>>> necessary, because when we actually set pte in do_swap_page, we'll
>>> hold PTL to check if the pte changes.
>>>>
>>>>> + if (unlikely(!pte))
>>>>> + goto fallback;
>>>>> +
>>>>> + /*
>>>>> + * For do_swap_page, find the highest order where the aligned range is
>>>>> + * completely swap entries with contiguous swap offsets.
>>>>> + */
>>>>> + order = highest_order(orders);
>>>>> + while (orders) {
>>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> + if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
>>>>> + break;
>>>>> + order = next_order(&orders, order);
>>>>> + }
>>>>
>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>> sense to copy the whole folio up to a certain size.
>>> For 2M THP, IO overhead may not necessarily be large? :)
>>> 1.If 2M THP are continuously stored in the swap device, the IO
>>> overhead may not be very large (such as submitting bio with one
>>> bio_vec at a time).
>>> 2.If the process really needs this 2M data, one page-fault may perform
>>> much better than multiple.
>>> 3.For swap devices like zram,using 2M THP might also improve
>>> decompression efficiency.
>>>
>>> On the other hand, if the process only needs a small part of the 2M
>>> data (such as only frequent use of 4K page, the rest of the data is
>>> never accessed), This is indeed give a lark to catch a kite! :(
>>
>> Yes indeed. It's not always clear-cut what the best thing to do is. It would be
>> good to hear from others on this.
>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> +
>>>>> + pte_unmap(pte);
>>>>> +
>>>>> + /* Try allocating the highest of the remaining orders. */
>>>>> + gfp = vma_thp_gfp_mask(vma);
>>>>> + while (orders) {
>>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>>> + if (folio)
>>>>> + return folio;
>>>>> + order = next_order(&orders, order);
>>>>> + }
>>>>> +
>>>>> +fallback:
>>>>> +#endif
>>>>> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
>>>>> +}
>>>>> +
>>>>> +
>>>>> /*
>>>>> * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>>>> * but allow concurrent faults), and pte mapped but not yet locked.
>>>>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> pte_t pte;
>>>>> vm_fault_t ret = 0;
>>>>> void *shadow = NULL;
>>>>> + int nr_pages = 1;
>>>>> + unsigned long start_address;
>>>>> + pte_t *start_pte;
>>>>>
>>>>> if (!pte_unmap_same(vmf))
>>>>> goto out;
>>>>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> if (!folio) {
>>>>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>>> __swap_count(entry) == 1) {
>>>>> - /*
>>>>> - * Prevent parallel swapin from proceeding with
>>>>> - * the cache flag. Otherwise, another thread may
>>>>> - * finish swapin first, free the entry, and swapout
>>>>> - * reusing the same entry. It's undetectable as
>>>>> - * pte_same() returns true due to entry reuse.
>>>>> - */
>>>>> - if (swapcache_prepare(entry)) {
>>>>> - /* Relax a bit to prevent rapid repeated page faults */
>>>>> - schedule_timeout_uninterruptible(1);
>>>>> - goto out;
>>>>> - }
>>>>> - need_clear_cache = true;
>>>>> -
>>>>> /* skip swapcache */
>>>>> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>>>>> - vma, vmf->address, false);
>>>>> + folio = alloc_swap_folio(vmf);
>>>>> page = &folio->page;
>>>>> if (folio) {
>>>>> __folio_set_locked(folio);
>>>>> __folio_set_swapbacked(folio);
>>>>>
>>>>> + if (folio_test_large(folio)) {
>>>>> + nr_pages = folio_nr_pages(folio);
>>>>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>>>> + }
>>>>> +
>>>>> + /*
>>>>> + * Prevent parallel swapin from proceeding with
>>>>> + * the cache flag. Otherwise, another thread may
>>>>> + * finish swapin first, free the entry, and swapout
>>>>> + * reusing the same entry. It's undetectable as
>>>>> + * pte_same() returns true due to entry reuse.
>>>>> + */
>>>>> + if (swapcache_prepare_nr(entry, nr_pages)) {
>>>>> + /* Relax a bit to prevent rapid repeated page faults */
>>>>> + schedule_timeout_uninterruptible(1);
>>>>> + goto out;
>>>>> + }
>>>>> + need_clear_cache = true;
>>>>> +
>>>>> if (mem_cgroup_swapin_charge_folio(folio,
>>>>> vma->vm_mm, GFP_KERNEL,
>>>>> entry)) {
>>>>> ret = VM_FAULT_OOM;
>>>>> goto out_page;
>>>>> }
>>>>> - mem_cgroup_swapin_uncharge_swap(entry);
>>>>> +
>>>>> + for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
>>>>> + mem_cgroup_swapin_uncharge_swap(e);
>>>>>
>>>>> shadow = get_shadow_from_swap_cache(entry);
>>>>> if (shadow)
>>>>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> */
>>>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>>> &vmf->ptl);
>>>>> +
>>>>> + start_address = vmf->address;
>>>>> + start_pte = vmf->pte;
>>>>> + if (start_pte && folio_test_large(folio)) {
>>>>> + unsigned long nr = folio_nr_pages(folio);
>>>>> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
>>>>> + pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>>>>> +
>>>>> + /*
>>>>> + * case 1: we are allocating large_folio, try to map it as a whole
>>>>> + * iff the swap entries are still entirely mapped;
>>>>> + * case 2: we hit a large folio in swapcache, and all swap entries
>>>>> + * are still entirely mapped, try to map a large folio as a whole.
>>>>> + * otherwise, map only the faulting page within the large folio
>>>>> + * which is swapcache
>>>>> + */
>>>>> + if (!is_pte_range_contig_swap(aligned_pte, nr)) {
>>>>> + if (nr_pages > 1) /* ptes have changed for case 1 */
>>>>> + goto out_nomap;
>>>>> + goto check_pte;
>>>>> + }
>>>>> +
>>>>> + start_address = addr;
>>>>> + start_pte = aligned_pte;
>>>>> + /*
>>>>> + * the below has been done before swap_read_folio()
>>>>> + * for case 1
>>>>> + */
>>>>> + if (unlikely(folio == swapcache)) {
>>>>> + nr_pages = nr;
>>>>> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>>>> + page = &folio->page;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> +check_pte:
>>>>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>>>> goto out_nomap;
>>>>>
>>>>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> * We're already holding a reference on the page but haven't mapped it
>>>>> * yet.
>>>>> */
>>>>> - swap_free(entry);
>>>>> + swap_nr_free(entry, nr_pages);
>>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>> folio_free_swap(folio);
>>>>>
>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>> + folio_ref_add(folio, nr_pages - 1);
>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>> +
>>>>> pte = mk_pte(page, vma->vm_page_prot);
>>>>>
>>>>> /*
>>>>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> * exclusivity.
>>>>> */
>>>>> if (!folio_test_ksm(folio) &&
>>>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>>>> + (exclusive || folio_ref_count(folio) == nr_pages)) {
>>>>> if (vmf->flags & FAULT_FLAG_WRITE) {
>>>>> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>>>> vmf->flags &= ~FAULT_FLAG_WRITE;
>>>>> }
>>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>> }
>>>>> - flush_icache_page(vma, page);
>>>>> + flush_icache_pages(vma, page, nr_pages);
>>>>> if (pte_swp_soft_dirty(vmf->orig_pte))
>>>>> pte = pte_mksoft_dirty(pte);
>>>>> if (pte_swp_uffd_wp(vmf->orig_pte))
>>>>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>
>>>>> /* ksm created a completely new copy */
>>>>> if (unlikely(folio != swapcache && swapcache)) {
>>>>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
>>>>> + folio_add_new_anon_rmap(folio, vma, start_address);
>>>>> folio_add_lru_vma(folio, vma);
>>>>> + } else if (!folio_test_anon(folio)) {
>>>>> + folio_add_new_anon_rmap(folio, vma, start_address);
>>>>> } else {
>>>>> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>>>> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>>>>> rmap_flags);
>>>>> }
>>>>>
>>>>> VM_BUG_ON(!folio_test_anon(folio) ||
>>>>> (pte_write(pte) && !PageAnonExclusive(page)));
>>>>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>>>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>>>> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>>>>> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>>>>>
>>>>> folio_unlock(folio);
>>>>> if (folio != swapcache && swapcache) {
>>>>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> }
>>>>>
>>>>> if (vmf->flags & FAULT_FLAG_WRITE) {
>>>>> + if (nr_pages > 1)
>>>>> + vmf->orig_pte = ptep_get(vmf->pte);
>>>>> +
>>>>> ret |= do_wp_page(vmf);
>>>>> if (ret & VM_FAULT_ERROR)
>>>>> ret &= VM_FAULT_ERROR;
>>>>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> }
>>>>>
>>>>> /* No need to invalidate - it was non-present before */
>>>>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>>>> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>>>>> unlock:
>>>>> if (vmf->pte)
>>>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>> out:
>>>>> /* Clear the swap cache pin for direct swapin after PTL unlock */
>>>>> if (need_clear_cache)
>>>>> - swapcache_clear(si, entry);
>>>>> + swapcache_clear_nr(si, entry, nr_pages);
>>>>> if (si)
>>>>> put_swap_device(si);
>>>>> return ret;
>>>>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> folio_put(swapcache);
>>>>> }
>>>>> if (need_clear_cache)
>>>>> - swapcache_clear(si, entry);
>>>>> + swapcache_clear_nr(si, entry, nr_pages);
>>>>> if (si)
>>>>> put_swap_device(si);
>>>>> return ret;
>>>>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>>> if (unlikely(userfaultfd_armed(vma)))
>>>>> goto fallback;
>>>>>
>>>>> - /*
>>>>> - * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>> - * for this vma. Then filter out the orders that can't be allocated over
>>>>> - * the faulting address and still be fully contained in the vma.
>>>>> - */
>>>>> - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>>> - BIT(PMD_ORDER) - 1);
>>>>> - orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>>> -
>>>>> + orders = get_alloc_folio_orders(vmf);
>>>>> if (!orders)
>>>>> goto fallback;
>>>>>
>
> Thanks
> Barry


2024-03-15 12:07:10

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On 15/03/2024 10:01, Barry Song wrote:
> On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>>> On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <[email protected]> wrote:
>>>>
>>>> Barry Song <[email protected]> writes:
>>>>
>>>>> From: Chuanhua Han <[email protected]>
>>>>>
>>>>> On an embedded system like Android, more than half of anon memory is
>>>>> actually in swap devices such as zRAM. For example, while an app is
>>>>> switched to background, its most memory might be swapped-out.
>>>>>
>>>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>>>> performance gain we can get through large folios and hardware optimization
>>>>> such as CONT-PTE.
>>>>>
>>>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>>>>> to those contiguous swaps which were likely swapped out from mTHP as a
>>>>> whole.
>>>>>
>>>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>>>>> case. It doesn't support swapin_readahead as large folios yet since this
>>>>> kind of shared memory is much less than memory mapped by single process.
>>>>
>>>> In contrast, I still think that it's better to start with normal swap-in
>>>> path, then expand to SWAP_SYCHRONOUS case.
>>>
>>> I'd rather try the reverse direction as non-sync anon memory is only around
>>> 3% in a phone as my observation.
>>
>> Phone is not the only platform that Linux is running on.
>
> I suppose it's generally true that forked shared anonymous pages only
> constitute a
> small portion of all anonymous pages. The majority of anonymous pages are within
> a single process.
>
> I agree phones are not the only platform. But Rome wasn't built in a
> day. I can only get
> started on a hardware which I can easily reach and have enough hardware/test
> resources on it. So we may take the first step which can be applied on
> a real product
> and improve its performance, and step by step, we broaden it and make it
> widely useful to various areas in which I can't reach :-)
>
> so probably we can have a sysfs "enable" entry with default "n" or
> have a maximum
> swap-in order as Ryan's suggestion [1] at the beginning,

I wasn't necessarily suggesting that we should hard-code an upper limit. I was
just pointing out that we likely need some policy somewhere, because the right
thing very likely depends on the folio size and workload. And there is likely
a similar policy needed for CoW.

We already have per-thp-size directories in sysfs, so there is a natural place
to add new controls as you suggest - that would fit well. Of course if we can
avoid exposing yet more controls that would be preferable.

>
> "
> So in the common case, swap-in will pull in the same size of folio as was
> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> it makes sense for 2M THP; As the size increases the chances of actually needing
> all of the folio reduces so chances are we are wasting IO. There are similar
> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> sense to copy the whole folio up to a certain size.
> "
>


2024-03-17 06:12:12

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Sat, Mar 16, 2024 at 1:06 AM Ryan Roberts <[email protected]> wrote:
>
> On 15/03/2024 10:01, Barry Song wrote:
> > On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >>> On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intelcom> wrote:
> >>>>
> >>>> Barry Song <[email protected]> writes:
> >>>>
> >>>>> From: Chuanhua Han <[email protected]>
> >>>>>
> >>>>> On an embedded system like Android, more than half of anon memory is
> >>>>> actually in swap devices such as zRAM. For example, while an app is
> >>>>> switched to background, its most memory might be swapped-out.
> >>>>>
> >>>>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>>>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>>>> performance gain we can get through large folios and hardware optimization
> >>>>> such as CONT-PTE.
> >>>>>
> >>>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >>>>> to those contiguous swaps which were likely swapped out from mTHP as a
> >>>>> whole.
> >>>>>
> >>>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >>>>> case. It doesn't support swapin_readahead as large folios yet since this
> >>>>> kind of shared memory is much less than memory mapped by single process.
> >>>>
> >>>> In contrast, I still think that it's better to start with normal swap-in
> >>>> path, then expand to SWAP_SYCHRONOUS case.
> >>>
> >>> I'd rather try the reverse direction as non-sync anon memory is only around
> >>> 3% in a phone as my observation.
> >>
> >> Phone is not the only platform that Linux is running on.
> >
> > I suppose it's generally true that forked shared anonymous pages only
> > constitute a
> > small portion of all anonymous pages. The majority of anonymous pages are within
> > a single process.
> >
> > I agree phones are not the only platform. But Rome wasn't built in a
> > day. I can only get
> > started on a hardware which I can easily reach and have enough hardware/test
> > resources on it. So we may take the first step which can be applied on
> > a real product
> > and improve its performance, and step by step, we broaden it and make it
> > widely useful to various areas in which I can't reach :-)
> >
> > so probably we can have a sysfs "enable" entry with default "n" or
> > have a maximum
> > swap-in order as Ryan's suggestion [1] at the beginning,
>
> I wasn't neccessarily suggesting that we should hard-code an upper limit. I was
> just pointing out that we likely need some policy somewhere because the right
> thing very likely depends on the folio size and workload. And there is likely
> similar policy needed for CoW.
>
> We already have per-thp-size directories in sysfs, so there is a natural place
> to add new controls as you suggest - that would fit well. Of course if we can
> avoid exposing yet more controls that would be preferable.
>
> >
> > "
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> > "

Right now we have an "enable" entry for each size, for example:
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enable

For the phone case, it would be quite simple: just enable 64KiB (or also
16KiB) and allow swapping in 64KiB (or also 16KiB) folios, so it doesn't
need any new controls, since do_swap_page() does the same checks as
do_anonymous_page() does. And we have actually deployed 64KiB-only swap-out
and swap-in on millions of real phones.

Considering other user scenarios which might want larger folios such as
2MiB or 1MiB but only want smaller swap-in folio sizes, I could add a new
swap-in control like:

/sys/kernel/mm/transparent_hugepage/hugepages-64kB/swapin

which can be 1 or 0.

With this, it seems safer for the patchset to land, given that I don't have
the ability to extensively test it on Linux servers.
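
A rough sketch of how such a per-size toggle could gate the order selection
in alloc_swap_folio() (swapin_allowed_orders() is hypothetical, standing in
for whatever mask the proposed sysfs attribute would maintain; it is not an
existing kernel symbol):

	/*
	 * Hypothetical sketch only: mask the candidate orders with a
	 * per-size swap-in toggle before picking an order.
	 */
	orders = get_alloc_folio_orders(vmf);
	orders &= swapin_allowed_orders();	/* from hugepages-<size>kB/swapin */
	if (!orders)
		goto fallback;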

Thanks
Barry

2024-03-18 01:28:54

by Chuanhua Han

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

Ryan Roberts <[email protected]> 于2024年3月15日周五 18:57写道:
>
> On 15/03/2024 08:34, Chuanhua Han wrote:
> > Ryan Roberts <[email protected]> 于2024年3月14日周四 21:43写道:
> >>
> >> On 14/03/2024 13:12, Chuanhua Han wrote:
> >>> Ryan Roberts <[email protected]> 于2024年3月12日周二 02:51写道:
> >>>>
> >>>> On 04/03/2024 08:13, Barry Song wrote:
> >>>>> From: Chuanhua Han <[email protected]>
> >>>>>
> >>>>> While swapping in a large folio, we need to free swaps related to the whole
> >>>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>>>> to introduce an API for batched free.
> >>>>>
> >>>>> Signed-off-by: Chuanhua Han <[email protected]>
> >>>>> Co-developed-by: Barry Song <[email protected]>
> >>>>> Signed-off-by: Barry Song <[email protected]>
> >>>>> ---
> >>>>> include/linux/swap.h | 6 ++++++
> >>>>> mm/swapfile.c | 35 +++++++++++++++++++++++++++++++++++
> >>>>> 2 files changed, 41 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>>>> index 2955f7a78d8d..d6ab27929458 100644
> >>>>> --- a/include/linux/swap.h
> >>>>> +++ b/include/linux/swap.h
> >>>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>>>> extern int swap_duplicate(swp_entry_t);
> >>>>> extern int swapcache_prepare(swp_entry_t);
> >>>>> extern void swap_free(swp_entry_t);
> >>>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> >>>>
> >>>> nit: In my swap-out v4 series, I've created a batched version of
> >>>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> >>>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
> >>>> scheme doesn't really work when applied to free_swap_and_cache().
> >>> Thanks for your suggestions, and for the next version, we'll see which
> >>> package is more appropriate!
> >>>>
> >>>>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>>>> extern int free_swap_and_cache(swp_entry_t);
> >>>>> int swap_type_of(dev_t device, sector_t offset);
> >>>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> >>>>> {
> >>>>> }
> >>>>>
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> +
> >>>>> +}
> >>>>> +
> >>>>> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>>>> {
> >>>>> }
> >>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>>>> index 3f594be83b58..244106998a69 100644
> >>>>> --- a/mm/swapfile.c
> >>>>> +++ b/mm/swapfile.c
> >>>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> >>>>> __swap_entry_free(p, entry);
> >>>>> }
> >>>>>
> >>>>> +/*
> >>>>> + * Called after swapping in a large folio, batched free swap entries
> >>>>> + * for this large folio, entry should be for the first subpage and
> >>>>> + * its offset is aligned with nr_pages
> >>>>> + */
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> + int i;
> >>>>> + struct swap_cluster_info *ci;
> >>>>> + struct swap_info_struct *p;
> >>>>> + unsigned type = swp_type(entry);
> >>>>
> >>>> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
> >>>> int" or at least it did for me when I did something similar in my swap-out patch
> >>>> set.
> >>> Gee, thanks for pointing that out!
> >>>>
> >>>>> + unsigned long offset = swp_offset(entry);
> >>>>> + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> >>>>
> >>>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> >>>> increases. But the only other way I can think of is to explicitly loop over
> >>>> fixed size chunks, and that's not much better.
> >>> Is it possible to save kernel stack better by using bit_map here? If
> >>> SWAPFILE_CLUSTER=512, we consume only (512/64)*8= 64 bytes.
> >>
> >> I'm not sure I've understood what you are saying? You're already using
> >> DECLARE_BITMAP(), so its already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
> >>
> >> I actually did a bad job of trying to express a couple of different points:
> >>
> >> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
> >> Certainly not for arm64, but not sure about other architectures. For example if
> >> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
> >> for the bitmap, which is now looking pretty big for the stack.
> > I agree with you.The current bit_map grows linearly with the
> > SWAPFILE_CLUSTER, which may cause the kernel stack to swell.
> > I need to think of a way to save more memory .
> >>
> >> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
> >> define a fixed stack size (e.g. 64 bytes -> 512 entries). Then free the range of
> >> entries in batches no bigger than this size. This approach could also allow
> >> removing the constraint that the range has to be aligned and fit in a single
> >> cluster. Personally I think an approach like this would be much more robust, in
> >> return for a tiny bit more complexity.
> > Because we cannot determine how many swap entries a cluster has in an
> > architecture or a configuration, we do not know how large the variable
> > needs to be defined?
>
> My point is that we could define a fixed size, then loop through the passed in
> range, operating on batches of that fixed size. You could even take into
> consideration the cluster boundaries so that you take the correct lock for every
> batch and can drop the "must be naturally aligned, must be no bigger than
> cluster size" constraint.

Thank you. I understand it!

>
>
> >>
> >>>>
> >>>>> +
> >>>>> + /* all swap entries are within a cluster for mTHP */
> >>>>> + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>>>> +
> >>>>> + if (nr_pages == 1) {
> >>>>> + swap_free(entry);
> >>>>> + return;
> >>>>> + }
> >>>>> +
> >>>>> + p = _swap_info_get(entry);
> >>>>
> >>>> You need to handle this returning NULL, like swap_free() does.
> >>> Yes, you're right! We did forget to judge NULL here.
> >>>>
> >>>>> +
> >>>>> + ci = lock_cluster(p, offset);
> >>>>
> >>>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> >>>> by rotating media, and clusters are not in use, it will lock the whole swap
> >>>> info. But your new version only calls lock_cluster() which won't lock anything
> >>>> if clusters are not in use. So I think this is a locking bug.
> >>> Again, you're right, it's bug!
> >>>>
> >>>>> + for (i = 0; i < nr_pages; i++) {
> >>>>> + if (__swap_entry_free_locked(p, offset + i, 1))
> >>>>> + __bitmap_set(usage, i, 1);
> >>>>> + }
> >>>>> + unlock_cluster(ci);
> >>>>> +
> >>>>> + for_each_clear_bit(i, usage, nr_pages)
> >>>>> + free_swap_slot(swp_entry(type, offset + i));
> >>>>> +}
> >>>>> +
> >>>>> /*
> >>>>> * Called after dropping swapcache to decrease refcnt to swap entries.
> >>>>> */
> >>>>
> >>>> Thanks,
> >>>> Ryan
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>


--
Thanks,
Chuanhua

2024-03-18 01:54:12

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Barry Song <[email protected]> writes:

> On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Barry Song <[email protected]> writes:
>> >>
>> >> > From: Chuanhua Han <[email protected]>
>> >> >
>> >> > On an embedded system like Android, more than half of anon memory is
>> >> > actually in swap devices such as zRAM. For example, while an app is
>> >> > switched to background, its most memory might be swapped-out.
>> >> >
>> >> > Now we have mTHP features, unfortunately, if we don't support large folios
>> >> > swap-in, once those large folios are swapped-out, we immediately lose the
>> >> > performance gain we can get through large folios and hardware optimization
>> >> > such as CONT-PTE.
>> >> >
>> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>> >> > to those contiguous swaps which were likely swapped out from mTHP as a
>> >> > whole.
>> >> >
>> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>> >> > case. It doesn't support swapin_readahead as large folios yet since this
>> >> > kind of shared memory is much less than memory mapped by single process.
>> >>
>> >> In contrast, I still think that it's better to start with normal swap-in
>> >> path, then expand to SWAP_SYCHRONOUS case.
>> >
>> > I'd rather try the reverse direction as non-sync anon memory is only around
>> > 3% in a phone as my observation.
>>
>> Phone is not the only platform that Linux is running on.
>
> I suppose it's generally true that forked shared anonymous pages only
> constitute a
> small portion of all anonymous pages. The majority of anonymous pages are within
> a single process.

Yes. But IIUC, SWP_SYNCHRONOUS_IO is quite limited; it is set only
for memory-backed swap devices.

> I agree phones are not the only platform. But Rome wasn't built in a
> day. I can only get
> started on a hardware which I can easily reach and have enough hardware/test
> resources on it. So we may take the first step which can be applied on
> a real product
> and improve its performance, and step by step, we broaden it and make it
> widely useful to various areas in which I can't reach :-)

We must guarantee the normal swap path runs correctly and has no
performance regression when developing SWP_SYNCHRONOUS_IO optimization.
So we have to put some effort on the normal path test anyway.

> so probably we can have a sysfs "enable" entry with default "n" or
> have a maximum
> swap-in order as Ryan's suggestion [1] at the beginning,
>
> "
> So in the common case, swap-in will pull in the same size of folio as was
> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> it makes sense for 2M THP; As the size increases the chances of actually needing
> all of the folio reduces so chances are we are wasting IO. There are similar
> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> sense to copy the whole folio up to a certain size.
> "
>
>>
>> >>
>> >> In normal swap-in path, we can take advantage of swap readahead
>> >> information to determine the swapped-in large folio order. That is, if
>> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
>> >> and swapin a large folio.
>> >
>> > I am not quite sure we still need to depend on this. in do_anon_page,
>> > we have broken the assumption and allocated a large folio directly.
>>
>> I don't think that we have a sophisticated policy to allocate large
>> folio. Large folio could waste memory for some workloads, so I think
>> that it's a good idea to allocate large folio always.
>
> I agree, but we still have the below check, just like do_anon_page() has:
>
> 	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> 					  BIT(PMD_ORDER) - 1);
> 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>
> in do_anon_page, we don't worry about the waste so much, the same
> logic also applies to do_swap_page().

As I said, "readahead" may help us from application/user specific
configuration in most cases. It can be a starting point of "using mTHP
automatically when it helps and not cause many issues".

>>
>> Readahead gives us an opportunity to play with the policy.
>
> I feel somehow the rules of the game have changed with an upper limit
> for swap-in size. For example, if the upper limit is order 4, it limits
> folio size to 64KiB, which is still a reasonable size for ARM64, whose
> base page can be 64KiB.
>
> on the other hand, while swapping out large folios, we will always
> compress them as a whole(zsmalloc/zram patch will come in a
> couple of days), if we choose to decompress a subpage instead of
> a large folio in do_swap_page(), we might need to decompress
> nr_pages times. for example,
>
> For large folios 16*4KiB, they are saved as a large object in zsmalloc(with
> the coming patch), if we swap in a small folio, we decompress the large
> object; next time, we will still need to decompress a large object. so
> it is more sensible to swap in a large folio if we find those
> swap entries are contiguous and were allocated by a large folio swap-out.

I understand that there are some special requirements for ZRAM. But I
don't think it's a good idea to force the general code to fit the
requirements of a specific swap device too much. This is one of the
reasons that I think that we should start with normal swap devices, then
try to optimize for some specific devices.

>>
>> > On the other hand, compressing/decompressing large folios as a
>> > whole rather than doing it one by one can save a large percent of
>> > CPUs and provide a much lower compression ratio. With a hardware
>> > accelerator, this is even faster.
>>
>> I am not against to support large folio for compressing/decompressing.
>>
>> I just suggest to do that later, after we play with normal swap-in.
>> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
>> swap. So, it seems natural to support large folio swap-in for normal
>> swap-in firstly.
>
> I feel like SWP_SYNCHRONOUS_IO is a simpler case and even more "normal"
> than the swapcache path since it is the majority.

I don't think so. Most PC and server systems use !SWP_SYNCHRONOUS_IO
swap devices.

> and on the other hand, a lot of modification is required for the
> swapcache path. In OPPO's code[1], we did bring up both paths, but the
> swapcache path is much, much more complicated than the SYNC path and
> hasn't shown a really noticeable improvement.
>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12

That's great. Please clean up the code and post it to the mailing list.
Why doesn't it help? IIUC, it can optimize TLB usage at least.

>>
>> > So I'd rather more aggressively get large folios swap-in involved
>> > than depending on readahead.
>>
>> We can take advantage of the readahead algorithm in the SWP_SYNCHRONOUS_IO
>> optimization too. The sub-pages that are not accessed by the page fault
>> can be treated as readahead. I think that is a better policy than always
>> allocating large folios.
>
> Considering the zsmalloc optimization, it would be a better choice to
> always allocate
> large folios if we find those swap entries are for a swapped-out large folio. as
> decompressing just once, we get all subpages.
> Some hardware accelerators are even able to decompress a large folio with
> multi-hardware threads, for example, 16 hardware threads can compress
> each subpage of a large folio at the same time, it is just as fast as
> decompressing
> one subpage.
>
> for platforms without the above optimizations, a proper upper limit
> will help them
> disable the large folios swap-in or decrease the impact. For example,
> if the upper
> limit is 0-order, we are just removing this patchset. if the upper
> limit is 2 orders, we
> are just like BASE_PAGE size is 16KiB.
>
>>
>> >>
>> >> To do that, we need to track whether the sub-pages are accessed. I
>> >> guess we need that information for large file folio readahead too.
>> >>
>> >> Hi, Matthew,
>> >>
>> >> Can you help us on tracking whether the sub-pages of a readahead large
>> >> folio has been accessed?
>> >>
>> >> > Right now, we are re-faulting large folios which are still in swapcache as a
>> >> > whole, this can effectively decrease extra loops and early-exitings which we
>> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
>> >> > rather than page. On the other hand, it can also decrease do_swap_page as
>> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
>> >> >
>> >>

--
Best Regards,
Huang, Ying

2024-03-18 02:41:48

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Mon, Mar 18, 2024 at 2:54 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >> Barry Song <[email protected]> writes:
> >> >>
> >> >> > From: Chuanhua Han <[email protected]>
> >> >> >
> >> >> > On an embedded system like Android, more than half of anon memory is
> >> >> > actually in swap devices such as zRAM. For example, while an app is
> >> >> > switched to background, its most memory might be swapped-out.
> >> >> >
> >> >> > Now we have mTHP features, unfortunately, if we don't support large folios
> >> >> > swap-in, once those large folios are swapped-out, we immediately lose the
> >> >> > performance gain we can get through large folios and hardware optimization
> >> >> > such as CONT-PTE.
> >> >> >
> >> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> >> > whole.
> >> >> >
> >> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >> >> > case. It doesn't support swapin_readahead as large folios yet since this
> >> >> > kind of shared memory is much less than memory mapped by single process.
> >> >>
> >> >> In contrast, I still think that it's better to start with normal swap-in
> >> >> path, then expand to SWAP_SYCHRONOUS case.
> >> >
> >> > I'd rather try the reverse direction as non-sync anon memory is only around
> >> > 3% in a phone as my observation.
> >>
> >> Phone is not the only platform that Linux is running on.
> >
> > I suppose it's generally true that forked shared anonymous pages only
> > constitute a
> > small portion of all anonymous pages. The majority of anonymous pages are within
> > a single process.
>
> Yes. But IIUC, SWP_SYNCHRONOUS_IO is quite limited; it is set only
> for memory-backed swap devices.

SWP_SYNCHRONOUS_IO is the most common case for embedded Linux. Note that
almost all Android/embedded devices use zRAM rather than a disk for swap.

And we can have an upper-limit order or a new control like
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/swapin
and set it to 0 by default at first.

>
> > I agree phones are not the only platform. But Rome wasn't built in a
> > day. I can only get
> > started on a hardware which I can easily reach and have enough hardware/test
> > resources on it. So we may take the first step which can be applied on
> > a real product
> > and improve its performance, and step by step, we broaden it and make it
> > widely useful to various areas in which I can't reach :-)
>
> We must guarantee the normal swap path runs correctly and has no
> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> So we have to put some effort on the normal path test anyway.
>
> > so probably we can have a sysfs "enable" entry with default "n" or
> > have a maximum
> > swap-in order as Ryan's suggestion [1] at the beginning,
> >
> > "
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> > "
> >
> >>
> >> >>
> >> >> In normal swap-in path, we can take advantage of swap readahead
> >> >> information to determine the swapped-in large folio order. That is, if
> >> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
> >> >> and swapin a large folio.
> >> >
> >> > I am not quite sure we still need to depend on this. in do_anon_page,
> >> > we have broken the assumption and allocated a large folio directly.
> >>
> >> I don't think that we have a sophisticated policy to allocate large
> >> folio. Large folio could waste memory for some workloads, so I think
> >> that it's a good idea to allocate large folio always.
> >
> > i agree, but we still have the below check just like do_anon_page() has it,
> >
> > orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> > BIT(PMD_ORDER) - 1);
> > orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >
> > in do_anon_page, we don't worry about the waste so much, the same
> > logic also applies to do_swap_page().
>
> As I said, "readahead" may help us from application/user specific
> configuration in most cases. It can be a starting point of "using mTHP
> automatically when it helps and not cause many issues".

I'd rather start from the simpler code path and deliver real improvements
on phones & embedded Linux, which our team can actually reach :-)

>
> >>
> >> Readahead gives us an opportunity to play with the policy.
> >
> > I feel somehow the rules of the game have changed with an upper
> > limit for swap-in size. for example, if the upper limit is 4 order,
> > it limits folio size to 64KiB which is still a proper size for ARM64
> > whose base page can be 64KiB.
> >
> > on the other hand, while swapping out large folios, we will always
> > compress them as a whole(zsmalloc/zram patch will come in a
> > couple of days), if we choose to decompress a subpage instead of
> > a large folio in do_swap_page(), we might need to decompress
> > nr_pages times. for example,
> >
> > For large folios 16*4KiB, they are saved as a large object in zsmalloc(with
> > the coming patch), if we swap in a small folio, we decompress the large
> > object; next time, we will still need to decompress a large object. so
> > it is more sensible to swap in a large folio if we find those
> > swap entries are contiguous and were allocated by a large folio swap-out.
>
> I understand that there are some special requirements for ZRAM. But I
> don't think it's a good idea to force the general code to fit the
> requirements of a specific swap device too much. This is one of the
> reasons that I think that we should start with normal swap devices, then
> try to optimize for some specific devices.

I agree, but we have a good start. zRAM is not a specific device; it
widely represents embedded Linux.

>
> >>
> >> > On the other hand, compressing/decompressing large folios as a
> >> > whole rather than doing it one by one can save a large percent of
> >> > CPUs and provide a much lower compression ratio. With a hardware
> >> > accelerator, this is even faster.
> >>
> >> I am not against to support large folio for compressing/decompressing.
> >>
> >> I just suggest to do that later, after we play with normal swap-in.
> >> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
> >> swap. So, it seems natural to support large folio swap-in for normal
> >> swap-in firstly.
> >
> > I feel like SWAP_SYCHRONOUS is a simpler case and even more "normal"
> > than the swapcache path since it is the majority.
>
> I don't think so. Most PC and server systems use !SWP_SYNCHRONOUS_IO
> swap devices.

The problem is that our team is entirely focused on phones; we won't have
any resources or bandwidth for PC and server. A more realistic goal is that
we at least let the solution benefit phones and similar embedded Linux
first, and then, step by step, extend it to more areas such as PC and server.

I'd be quite happy if you or other people could join in on the PC and
server side.

>
> > and on the other hand, a lot
> > of modification is required for the swapcache path. in OPPO's code[1], we did
> > bring-up both paths, but the swapcache path is much much more complicated
> > than the SYNC path and hasn't really noticeable improvement.
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12
>
> That's great. Please clean up the code and post it to the mailing list.
> Why doesn't it help? IIUC, it can optimize TLB usage at least.

I agree this can help, but most anon pages are single-process mapped; only
a few pages go through the readahead code path on phones. That's why there
was no noticeable improvement in the end.

I understand all the benefits you mentioned for changing readahead, but
because those kinds of pages are really rare, improving that path doesn't
really help Android devices.

>
> >>
> >> > So I'd rather more aggressively get large folios swap-in involved
> >> > than depending on readahead.
> >>
> >> We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
> >> optimization too. The sub-pages that is not accessed by page fault can
> >> be treated as readahead. I think that is a better policy than
> >> allocating large folio always.

This is always true even in do_anonymous_page(), but we don't worry about
that too much as we have per-size control; the workload has the chance to
set its preferences.
	/*
	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
	 * for this vma. Then filter out the orders that can't be allocated over
	 * the faulting address and still be fully contained in the vma.
	 */
	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
					  BIT(PMD_ORDER) - 1);
	orders = thp_vma_suitable_orders(vma, vmf->address, orders);

On the other hand, we are not always allocating large folios; we allocate
a large folio only when the swapped-out folio was large. This is quite
important for embedded Linux, where swap happens so often that more than
half of memory can be in swap. If we swap folios out as large but swap
them back in as small, we immediately lose all the advantages such as
fewer page faults, CONT-PTE, etc.
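
To make the point concrete, a rough stand-alone sketch of that check (not the
actual patch; the struct and function names here are invented for
illustration): only swap in a large folio when the swap entries covering the
faulting range sit in the same swap device, are contiguous and naturally
aligned, i.e. they look like a folio that was swapped out as a whole.

#include <stdbool.h>
#include <stdio.h>

struct swp_entry_sketch {
	unsigned int type;	/* swap device */
	unsigned long offset;	/* slot within the device */
};

static bool can_swapin_large(const struct swp_entry_sketch *e, unsigned int nr)
{
	unsigned long start = e[0].offset;

	if (start % nr)			/* must be naturally aligned */
		return false;

	for (unsigned int i = 1; i < nr; i++) {
		if (e[i].type != e[0].type)
			return false;
		if (e[i].offset != start + i)	/* must be contiguous */
			return false;
	}
	return true;
}

int main(void)
{
	struct swp_entry_sketch e[4] = {
		{ 0, 64 }, { 0, 65 }, { 0, 66 }, { 0, 67 },
	};

	printf("large swap-in possible: %d\n", can_swapin_large(e, 4));
	return 0;
}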

> >
> > Considering the zsmalloc optimization, it would be a better choice to
> > always allocate
> > large folios if we find those swap entries are for a swapped-out large folio. as
> > decompressing just once, we get all subpages.
> > Some hardware accelerators are even able to decompress a large folio with
> > multi-hardware threads, for example, 16 hardware threads can compress
> > each subpage of a large folio at the same time, it is just as fast as
> > decompressing
> > one subpage.
> >
> > for platforms without the above optimizations, a proper upper limit
> > will help them
> > disable the large folios swap-in or decrease the impact. For example,
> > if the upper
> > limit is 0-order, we are just removing this patchset. if the upper
> > limit is 2 orders, we
> > are just like BASE_PAGE size is 16KiB.
> >
> >>
> >> >>
> >> >> To do that, we need to track whether the sub-pages are accessed. I
> >> >> guess we need that information for large file folio readahead too.
> >> >>
> >> >> Hi, Matthew,
> >> >>
> >> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> >> folio has been accessed?
> >> >>
> >> >> > Right now, we are re-faulting large folios which are still in swapcache as a
> >> >> > whole, this can effectively decrease extra loops and early-exitings which we
> >> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
> >> >> > rather than page. On the other hand, it can also decrease do_swap_page as
> >> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >> >> >
> >> >>
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-03-18 16:45:41

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

>>> I agree phones are not the only platform. But Rome wasn't built in a
>>> day. I can only get
>>> started on a hardware which I can easily reach and have enough hardware/test
>>> resources on it. So we may take the first step which can be applied on
>>> a real product
>>> and improve its performance, and step by step, we broaden it and make it
>>> widely useful to various areas in which I can't reach :-)
>>
>> We must guarantee the normal swap path runs correctly and has no
>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> So we have to put some effort on the normal path test anyway.
>>
>>> so probably we can have a sysfs "enable" entry with default "n" or
>>> have a maximum
>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>
>>> "
>>> So in the common case, swap-in will pull in the same size of folio as was
>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>> sense to copy the whole folio up to a certain size.
>>> "

I thought about this a bit more. No clear conclusions, but hoped this might help
the discussion around policy:

The decision about the size of the THP is made at first fault, with some help
from user space and in future we might make decisions to split based on
munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
the THP out at some point in its lifetime should not impact on its size. It's
just being moved around in the system and the reason for our original decision
should still hold.

So from that PoV, it would be good to swap-in to the same size that was
swapped-out. But we only kind-of keep that information around, via the swap
entry contiguity and alignment. With that scheme it is possible that multiple
virtually adjacent but not physically contiguous folios get swapped-out to
adjacent swap slot ranges and then they would be swapped-in to a single, larger
folio. This is not ideal, and I think it would be valuable to try to maintain
the original folio size information with the swap slot. One way to do this would
be to store the original order for which the cluster was allocated in the
cluster. Then we at least know that a given swap slot is either for a folio of
that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
steal a bit from swap_map to determine which case it is? Or are there better
approaches?
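
A hypothetical sketch of that idea, purely to illustrate the data layout being
discussed (this is not the actual kernel structure; the field and flag names
are invented): the cluster remembers the order it was allocated for, and one
stolen bit per swap_map entry says whether a slot belongs to a folio of that
order or to an order-0 fallback.

/* sketch only: per-cluster order plus a stolen per-entry bit */
struct swap_cluster_sketch {
	unsigned int order;		/* order the cluster was allocated for */
	unsigned char swap_map[512];	/* simplified per-slot count bytes */
};

/* hypothetical stolen bit; stealing it would halve the max swap count */
#define SWAP_MAP_LARGE	0x20

static inline unsigned int slot_order(const struct swap_cluster_sketch *ci,
				      unsigned int slot)
{
	return (ci->swap_map[slot] & SWAP_MAP_LARGE) ? ci->order : 0;
}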

Next we (I?) have concerns about wasting IO by swapping-in folios that are too
large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
say yes but I have no data. But on the other hand, memory is aged and
swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
original allocation size policy is good (it currently isn't) then a folio should
be sized to cover temporally close memory and if we need to access some of it,
chances are we need all of it.

If we think the IO concern is legitimate then we could define a threshold size
(sysfs?) for when we start swapping-in the folio in chunks. And how big should
those chunks be - one page, or the threshold size itself? Probably the latter?
And perhaps that threshold could also be used by zRAM to decide its upper limit
for compression chunk.

Perhaps we can learn from khugepaged here? I think it has programmable
thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
THP? I guess that exists for the same concerns about increased IO pressure?


If we think we will ever be swapping-in folios in chunks less than their
original size, then we need a separate mechanism to re-foliate them. We have
discussed a khugepaged-like approach for doing this asynchronously in the
background. I know that scares the Android folks, but David has suggested that
this could well be very cheap compared with khugepaged, because it would be
entirely limited to a single pgtable, so we only need the PTL. If we need this
mechanism anyway, perhaps we should develop it and see how it performs if
swap-in remains order-0? Although I guess that would imply not being able to
benefit from compressing THPs for the zRAM case.

I see all this as orthogonal to synchronous vs asynchronous swap devices. I
think the latter just implies that you might want to do some readahead to try to
cover up the latency? If swap is moving towards being folio-orientated, then
readahead also surely needs to be folio-orientated, but I think that should be
the only major difference.

Anyway, just some thoughts!

Thanks,
Ryan


2024-03-18 22:36:03

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 4/5] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in

On Wed, Mar 13, 2024 at 4:35 AM Ryan Roberts <[email protected]> wrote:
>
> On 04/03/2024 08:13, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> > one entry only, to support large folio swap-in, we need to handle multiple
> > swap entries.
> >
> > Cc: Kairui Song <[email protected]>
> > Cc: "Huang, Ying" <[email protected]>
> > Cc: David Hildenbrand <[email protected]>
> > Cc: Chris Li <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Matthew Wilcox (Oracle) <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > Cc: Yosry Ahmed <[email protected]>
> > Cc: Yu Zhao <[email protected]>
> > Cc: SeongJae Park <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/swap.h | 1 +
> > mm/swap.h | 1 +
> > mm/swapfile.c | 118 ++++++++++++++++++++++++++-----------------
> > 3 files changed, 74 insertions(+), 46 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index d6ab27929458..22105f0fe2d4 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > extern int swapcache_prepare(swp_entry_t);
> > +extern int swapcache_prepare_nr(swp_entry_t entry, int nr);
> > extern void swap_free(swp_entry_t);
> > extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> > extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > diff --git a/mm/swap.h b/mm/swap.h
> > index fc2f6ade7f80..1cec991efcda 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio);
> > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > unsigned long end);
> > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> > +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
> > struct folio *swap_cache_get_folio(swp_entry_t entry,
> > struct vm_area_struct *vma, unsigned long addr);
> > struct folio *filemap_get_incore_folio(struct address_space *mapping,
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 244106998a69..bae1b8165b11 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3309,7 +3309,7 @@ void si_swapinfo(struct sysinfo *val)
> > }
> >
> > /*
> > - * Verify that a swap entry is valid and increment its swap map count.
> > + * Verify that nr swap entries are valid and increment their swap map count.
> > *
> > * Returns error code in following case.
> > * - success -> 0
> > @@ -3319,66 +3319,76 @@ void si_swapinfo(struct sysinfo *val)
> > * - swap-cache reference is requested but the entry is not used. -> ENOENT
> > * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
> > */
> > -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> > +static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage)
>
> perhaps it's better to pass order instead of nr to all these functions to make it
> clearer that entry should be naturally aligned and be a power-of-2 number of
> pages, no bigger than SWAPFILE_CLUSTER?
>
> > {
> > struct swap_info_struct *p;
> > struct swap_cluster_info *ci;
> > unsigned long offset;
> > - unsigned char count;
> > - unsigned char has_cache;
> > - int err;
> > + unsigned char count[SWAPFILE_CLUSTER];
> > + unsigned char has_cache[SWAPFILE_CLUSTER];
>
> I'm not sure this 1K stack buffer is a good idea?
>
> Could you split it slightly differently so that loop 1 just does error checking
> and bails out if an error is found, and loop 2 does the new value calculation
> and writeback? Then you don't need these arrays.

Right, we can totally remove those arrays by re-reading swap_map.
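
Something like this two-pass shape, sketched stand-alone here with a fake
swap_map array just to show the structure (the real code would of course keep
the cluster/swap_info lock held across both loops and use the existing
SWAP_HAS_CACHE/swap_count() definitions):

#include <stdio.h>

#define SWAP_HAS_CACHE	0x40
#define NR		4

static unsigned char swap_map[NR] = { 1, 1, 1, 1 };

static int swapcache_prepare_range(int nr)
{
	int i;

	/* pass 1: only validate; bail out before modifying anything */
	for (i = 0; i < nr; i++) {
		unsigned char count = swap_map[i];

		if (count & SWAP_HAS_CACHE)
			return -1;	/* someone else added the cache */
		if (!count)
			return -2;	/* unused swap entry */
	}

	/* pass 2: no error is possible any more, so just write back */
	for (i = 0; i < nr; i++)
		swap_map[i] |= SWAP_HAS_CACHE;

	return 0;
}

int main(void)
{
	printf("prepare: %d\n", swapcache_prepare_range(NR));
	return 0;
}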

>
> > + int err, i;
> >
> > p = swp_swap_info(entry);
> >
> > offset = swp_offset(entry);
> > ci = lock_cluster_or_swap_info(p, offset);
> >
> > - count = p->swap_map[offset];
> > -
> > - /*
> > - * swapin_readahead() doesn't check if a swap entry is valid, so the
> > - * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > - */
> > - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
> > - err = -ENOENT;
> > - goto unlock_out;
> > - }
> > -
> > - has_cache = count & SWAP_HAS_CACHE;
> > - count &= ~SWAP_HAS_CACHE;
> > - err = 0;
> > -
> > - if (usage == SWAP_HAS_CACHE) {
> > + for (i = 0; i < nr; i++) {
> > + count[i] = p->swap_map[offset + i];
> >
> > - /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > - if (!has_cache && count)
> > - has_cache = SWAP_HAS_CACHE;
> > - else if (has_cache) /* someone else added cache */
> > - err = -EEXIST;
> > - else /* no users remaining */
> > + /*
> > + * swapin_readahead() doesn't check if a swap entry is valid, so the
> > + * swap entry could be SWAP_MAP_BAD. Check here with lock held.
> > + */
> > + if (unlikely(swap_count(count[i]) == SWAP_MAP_BAD)) {
> > err = -ENOENT;
> > + goto unlock_out;
> > + }
> >
> > - } else if (count || has_cache) {
> > -
> > - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > - count += usage;
> > - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > - err = -EINVAL;
> > - else if (swap_count_continued(p, offset, count))
> > - count = COUNT_CONTINUED;
> > - else
> > - err = -ENOMEM;
> > - } else
> > - err = -ENOENT; /* unused swap entry */
> > + has_cache[i] = count[i] & SWAP_HAS_CACHE;
> > + count[i] &= ~SWAP_HAS_CACHE;
> > + err = 0;
> > +
> > + if (usage == SWAP_HAS_CACHE) {
> > +
> > + /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > + if (!has_cache[i] && count[i])
> > + has_cache[i] = SWAP_HAS_CACHE;
> > + else if (has_cache[i]) /* someone else added cache */
> > + err = -EEXIST;
> > + else /* no users remaining */
> > + err = -ENOENT;
> > + } else if (count[i] || has_cache[i]) {
> > +
> > + if ((count[i] & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
> > + count[i] += usage;
> > + else if ((count[i] & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
> > + err = -EINVAL;
> > + else if (swap_count_continued(p, offset + i, count[i]))
> > + count[i] = COUNT_CONTINUED;
> > + else
> > + err = -ENOMEM;
> > + } else
> > + err = -ENOENT; /* unused swap entry */
> >
> > - if (!err)
> > - WRITE_ONCE(p->swap_map[offset], count | has_cache);
> > + if (err)
> > + break;
> > + }
> >
> > + if (!err) {
> > + for (i = 0; i < nr; i++)
> > + WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]);
> > + }
> > unlock_out:
> > unlock_cluster_or_swap_info(p, ci);
> > return err;
> > }
> >
> > +static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> > +{
> > + return __swap_duplicate_nr(entry, 1, usage);
> > +}
> > +
> > /*
> > * Help swapoff by noting that swap entry belongs to shmem/tmpfs
> > * (in which case its reference count is never incremented).
> > @@ -3417,17 +3427,33 @@ int swapcache_prepare(swp_entry_t entry)
> > return __swap_duplicate(entry, SWAP_HAS_CACHE);
> > }
> >
> > -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> > +int swapcache_prepare_nr(swp_entry_t entry, int nr)
> > +{
> > + return __swap_duplicate_nr(entry, nr, SWAP_HAS_CACHE);
> > +}
> > +
> > +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr)
> > {
> > struct swap_cluster_info *ci;
> > unsigned long offset = swp_offset(entry);
> > - unsigned char usage;
> > + unsigned char usage[SWAPFILE_CLUSTER];
> > + int i;
> >
> > ci = lock_cluster_or_swap_info(si, offset);
> > - usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
> > + for (i = 0; i < nr; i++)
> > + usage[i] = __swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE);
> > unlock_cluster_or_swap_info(si, ci);
> > - if (!usage)
> > - free_swap_slot(entry);
> > + for (i = 0; i < nr; i++) {
> > + if (!usage[i]) {
> > + free_swap_slot(entry);
> > + entry.val++;
> > + }
> > + }
> > +}
>
> This is pretty similar to swap_nr_free() which you added in patch 2. Except
> swap_nr_free() passes 1 as last param to __swap_entry_free_locked() and this
> passes SWAP_HAS_CACHE. Perhaps there should be a common helper? I think
> swap_nr_free()'s usage bitmap is preferable to this version's char array too.

Right.
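
For illustration, a stand-alone sketch of such a common helper (names like
swap_entries_put() and entry_drop_usage() are made up; the real version would
operate on swap_info_struct under lock_cluster_or_swap_info()): one routine
drops `usage` from nr contiguous entries under the lock, records which entries
reached zero in a bitmap, and frees those slots after unlocking, so both
swap_nr_free() and swapcache_clear_nr() could call it with usage 1 or
SWAP_HAS_CACHE respectively.

#include <stdio.h>

#define NR_MAX 64	/* the bitmap below covers at most 64 entries */

static unsigned char swap_map[NR_MAX];

/* stand-in for __swap_entry_free_locked(): returns the remaining usage */
static unsigned char entry_drop_usage(unsigned long off, unsigned char usage)
{
	swap_map[off] -= usage;
	return swap_map[off];
}

static void free_swap_slot_stub(unsigned long off)
{
	printf("free slot %lu\n", off);
}

static void swap_entries_put(unsigned long offset, int nr, unsigned char usage)
{
	unsigned long pending = 0;	/* bit i set: entry i reached zero */
	int i;

	/* lock_cluster_or_swap_info() would be taken here */
	for (i = 0; i < nr; i++)
		if (!entry_drop_usage(offset + i, usage))
			pending |= 1UL << i;
	/* unlock_cluster_or_swap_info() here */

	for (i = 0; i < nr; i++)
		if (pending & (1UL << i))
			free_swap_slot_stub(offset + i);
}

int main(void)
{
	for (int i = 0; i < 4; i++)
		swap_map[i] = 1;
	swap_entries_put(0, 4, 1);	/* swap_nr_free()-style usage */
	return 0;
}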

>
> Thanks,
> Ryan
>
> > +
> > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> > +{
> > + swapcache_clear_nr(si, entry, 1);
> > }
> >
> > struct swap_info_struct *swp_swap_info(swp_entry_t entry)

Thanks
Barry

2024-03-19 06:27:52

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <[email protected]> wrote:
>
> >>> I agree phones are not the only platform. But Rome wasn't built in a
> >>> day. I can only get
> >>> started on a hardware which I can easily reach and have enough hardware/test
> >>> resources on it. So we may take the first step which can be applied on
> >>> a real product
> >>> and improve its performance, and step by step, we broaden it and make it
> >>> widely useful to various areas in which I can't reach :-)
> >>
> >> We must guarantee the normal swap path runs correctly and has no
> >> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> So we have to put some effort on the normal path test anyway.
> >>
> >>> so probably we can have a sysfs "enable" entry with default "n" or
> >>> have a maximum
> >>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>
> >>> "
> >>> So in the common case, swap-in will pull in the same size of folio as was
> >>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>> sense to copy the whole folio up to a certain size.
> >>> "
>
> I thought about this a bit more. No clear conclusions, but hoped this might help
> the discussion around policy:
>
> The decision about the size of the THP is made at first fault, with some help
> from user space and in future we might make decisions to split based on
> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> the THP out at some point in its lifetime should not impact on its size. It's
> just being moved around in the system and the reason for our original decision
> should still hold.

Indeed, this is an ideal framework for smartphones and likely for embedded
Linux systems broadly, which utilize zRAM. We set the mTHP size to 64KiB to
leverage CONT-PTE, given that more than half of the memory on phones may
frequently swap out and swap in (for instance, when opening and switching
between apps). The ideal approach would be to adhere to the decision made
in do_anonymous_page().

>
> So from that PoV, it would be good to swap-in to the same size that was
> swapped-out. But we only kind-of keep that information around, via the swap
> entry contiguity and alignment. With that scheme it is possible that multiple
> virtually adjacent but not physically contiguous folios get swapped-out to
> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> folio. This is not ideal, and I think it would be valuable to try to maintain
> the original folio size information with the swap slot. One way to do this would
> be to store the original order for which the cluster was allocated in the
> cluster. Then we at least know that a given swap slot is either for a folio of
> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> steal a bit from swap_map to determine which case it is? Or are there better
> approaches?

In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
swapin_readahead() even when __swap_count(entry) equals 1. This leads to
two scenarios: swap_vma_readahead and swap_cluster_readahead.

In swap_vma_readahead, when blk_queue_nonrot, physical contiguity doesn't
appear to be a critical concern. However, for swap_cluster_readahead, the
focus shifts towards the potential impact of physical discontiguity.

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			      struct vm_fault *vmf)
{
	struct mempolicy *mpol;
	pgoff_t ilx;
	struct folio *folio;

	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
	folio = swap_use_vma_readahead() ?
		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
	mpol_cond_put(mpol);

	if (!folio)
		return NULL;
	return folio_file_page(folio, swp_offset(entry));
}

In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
rendering physical contiguity less of a concern. Moreover, instances where
swap_readahead() is accessed are rare, typically occurring only in scenarios
involving forked but non-CoWed memory.

So I think large folio swap-in will need at least three steps:

1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), which has a very
   clear model and no complex I/O issues;
2. on nonrot block devices (bdev_nonrot == true), which care less about
   I/O contiguity;
3. on rot block devices, which do care about I/O contiguity.

This patchset primarily addresses systems utilizing SWP_SYNCHRONOUS_IO
(type 1), such as Android and embedded Linux, where a straightforward
model can be established with minimal complexity regarding I/O issues.

>
> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
> say yes but I have no data. But on the other hand, memory is aged and
> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
> original allocation size policy is good (it currently isn't) then a folio should
> be sized to cover temporally close memory and if we need to access some of it,
> chances are we need all of it.
>
> If we think the IO concern is legitimate then we could define a threshold size
> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
> those chunks be - one page, or the threshold size itself? Probably the latter?
> And perhaps that threshold could also be used by zRAM to decide its upper limit
> for compression chunk.


Agreed. What about introducing a parameter like
/sys/kernel/mm/transparent_hugepage/max_swapin_order
giving users the opportunity to fine-tune it according to their needs? For
type-1 users specifically, setting it to any value above 4 would be
beneficial. If there is still a lack of tuning for desktop and server
environments (type 2 and type 3), the default value could be set to 0.
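
A tiny sketch of how such a knob could be applied (the sysfs path and
max_swapin_order are only a proposal at this point, and the bitmask layout
below is just an assumption for illustration): every candidate order above the
limit is masked off, so 0 disables large folio swap-in entirely.

#include <stdio.h>

/* proposed knob: /sys/kernel/mm/transparent_hugepage/max_swapin_order */
static unsigned int max_swapin_order = 4;

/* orders is a bitmask: bit n set means order n is a candidate */
static unsigned long clamp_swapin_orders(unsigned long orders)
{
	unsigned long allowed = (1UL << (max_swapin_order + 1)) - 2;

	return orders & allowed;	/* keep only orders 1..max_swapin_order */
}

int main(void)
{
	unsigned long orders = (1UL << 9) - 2;	/* orders 1..8 enabled */

	printf("before: %#lx after: %#lx\n",
	       orders, clamp_swapin_orders(orders));
	return 0;
}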

>
> Perhaps we can learn from khugepaged here? I think it has programmable
> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
> THP? I guess that exists for the same concerns about increased IO pressure?
>
>
> If we think we will ever be swapping-in folios in chunks less than their
> original size, then we need a separate mechanism to re-foliate them. We have
> discussed a khugepaged-like approach for doing this asynchronously in the
> background. I know that scares the Android folks, but David has suggested that
> this could well be very cheap compared with khugepaged, because it would be
> entirely limited to a single pgtable, so we only need the PTL. If we need this
> mechanism anyway, perhaps we should develop it and see how it performs if
> swap-in remains order-0? Although I guess that would imply not being able to
> benefit from compressing THPs for the zRAM case.

The effectiveness of the collapse operation relies on large folios being
formed stably to ensure optimal performance. In embedded systems, where
more than half of the memory may be allocated to zRAM, folios might be
swapped out before collapsing or immediately after the collapse operation.
It seems a TAO-like optimization to decrease fallback and latency is more
effective.

>
> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
> think the latter just implies that you might want to do some readahead to try to
> cover up the latency? If swap is moving towards being folio-orientated, then
> readahead also surely needs to be folio-orientated, but I think that should be
> the only major difference.
>
> Anyway, just some thoughts!

Thank you very much for your valuable and insightful deliberations.

>
> Thanks,
> Ryan
>

Thanks
Barry

2024-03-19 09:06:09

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On 19/03/2024 06:27, Barry Song wrote:
> On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <[email protected]> wrote:
>>
>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>> day. I can only get
>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>> resources on it. So we may take the first step which can be applied on
>>>>> a real product
>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>> widely useful to various areas in which I can't reach :-)
>>>>
>>>> We must guarantee the normal swap path runs correctly and has no
>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>> So we have to put some effort on the normal path test anyway.
>>>>
>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>> have a maximum
>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>
>>>>> "
>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>> sense to copy the whole folio up to a certain size.
>>>>> "
>>
>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> the discussion around policy:
>>
>> The decision about the size of the THP is made at first fault, with some help
>> from user space and in future we might make decisions to split based on
>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> the THP out at some point in its lifetime should not impact on its size. It's
>> just being moved around in the system and the reason for our original decision
>> should still hold.
>
> Indeed, this is an ideal framework for smartphones and likely for
> widely embedded
> Linux systems utilizing zRAM. We set the mTHP size to 64KiB to
> leverage CONT-PTE,
> given that more than half of the memory on phones may frequently swap out and
> swap in (for instance, when opening and switching between apps). The
> ideal approach
> would involve adhering to the decision made in do_anonymous_page().
>
>>
>> So from that PoV, it would be good to swap-in to the same size that was
>> swapped-out. But we only kind-of keep that information around, via the swap
>> entry contiguity and alignment. With that scheme it is possible that multiple
>> virtually adjacent but not physically contiguous folios get swapped-out to
>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> the original folio size information with the swap slot. One way to do this would
>> be to store the original order for which the cluster was allocated in the
>> cluster. Then we at least know that a given swap slot is either for a folio of
>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> steal a bit from swap_map to determine which case it is? Or are there better
>> approaches?
>
> In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
> swap_readahead()
> even when __swap_count(entry) equals 1. This leads to two scenarios:
> swap_vma_readahead
> and swap_cluster_readahead.
>
> In swap_vma_readahead, when blk_queue_nonrot, physical contiguity
> doesn't appear to be a
> critical concern. However, for swap_cluster_readahead, the focus
> shifts towards the potential
> impact of physical discontiguity.

When you talk about "physical [dis]contiguity" I think you are talking about
contiguity of the swap entries in the swap device? Both paths currently allocate
order-0 folios to swap into, so neither have a concept of physical contiguity in
memory at the moment.

As I understand it, roughly the aim is to readahead by cluster for rotating
disks to reduce seek time, and readahead by virtual address for non-rotating
devices since there is no seek time cost. Correct?

Note that today, swap-out only supports (2M) THP if the swap device is
non-rotating. If it is rotating, the THP is first split. My swap-out series
maintains this policy for mTHP. So I think we only really care about
swap_vma_readahead() here; we want to teach it to figure out the order of the
swap entries and swap them into folios of the same order (with a fallback to
order-0 if allocation fails).

>
> struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> 			      struct vm_fault *vmf)
> {
> 	struct mempolicy *mpol;
> 	pgoff_t ilx;
> 	struct folio *folio;
>
> 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> 	folio = swap_use_vma_readahead() ?
> 		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> 		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> 	mpol_cond_put(mpol);
>
> 	if (!folio)
> 		return NULL;
> 	return folio_file_page(folio, swp_offset(entry));
> }
>
> In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
> rendering physical contiguity less of a concern. Moreover, instances where
> swap_readahead() is accessed are rare, typically occurring only in scenarios
> involving forked but non-CoWed memory.

Yes understood. What I'm hearing is that for Android at least, stealing a bit
from swap_map to remember if a swap entry is the order marked in the cluster or
order-0 won't be noticed because almost all entries have swap count == 1. From
memory, I think swap_map is 8 bits, and 2 bits are currently stolen, leaving 6
bits (count = 64) before having to move to the swap map continuation stuff. Does
anyone know what workloads provoke this overflow? What are the consequences of
reducing that count to 32?

>
> So I think large folios swap-in will at least need three steps
>
> 1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), this has a very
> clear model and has no complex I/O issue.
> 2. on nonrot block device(bdev_nonrot == true), it cares less about
> I/O contiguity.
> 3. on rot block devices which care about I/O contiguity.

I don't think we care about (3); if the device rotates, we will have split the
folio at swap-out, so we are only concerned with swapping-in order-0 folios.

>
> This patchset primarily addresses the systems utilizing
> SWP_SYNCHRONOUS_IO(type1),
> such as Android and embedded Linux, a straightforward model is established,
> with minimal complexity regarding I/O issues.

Understood. But your implication is that making swap_vma_readahead() large folio
swap-in aware will be complex. I think we can remember the original order in the
swap device, then it shouldn't be too difficult - conceptually at least.

>
>>
>> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
>> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
>> say yes but I have no data. But on the other hand, memory is aged and
>> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
>> original allocation size policy is good (it currently isn't) then a folio should
>> be sized to cover temporally close memory and if we need to access some of it,
>> chances are we need all of it.
>>
>> If we think the IO concern is legitimate then we could define a threshold size
>> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
>> those chunks be - one page, or the threshold size itself? Probably the latter?
>> And perhaps that threshold could also be used by zRAM to decide its upper limit
>> for compression chunk.
>
>
> Agreed. What about introducing a parameter like
> /sys/kernel/mm/transparent_hugepage/max_swapin_order
> giving users the opportunity to fine-tune it according to their needs. For type1
> users specifically, setting it to any value above 4 would be
> beneficial. If there's
> still a lack of tuning for desktop and server environments (type 2 and type 3),
> the default value could be set to 0.

This sort of thing sounds sensible to me. But I have a history of proposing
crappy sysfs interfaces :) So I'd like to hear from others - I suspect it will
take a fair bit of discussion before we converge. Having data to show that this
threshold is needed would also help (i.e. demonstration that the intuition that
swapping in a 2M folio is often counter-productive to performance).

>
>>
>> Perhaps we can learn from khugepaged here? I think it has programmable
>> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
>> THP? I guess that exists for the same concerns about increased IO pressure?
>>
>>
>> If we think we will ever be swapping-in folios in chunks less than their
>> original size, then we need a separate mechanism to re-foliate them. We have
>> discussed a khugepaged-like approach for doing this asynchronously in the
>> background. I know that scares the Android folks, but David has suggested that
>> this could well be very cheap compared with khugepaged, because it would be
>> entirely limited to a single pgtable, so we only need the PTL. If we need this
>> mechanism anyway, perhaps we should develop it and see how it performs if
>> swap-in remains order-0? Although I guess that would imply not being able to
>> benefit from compressing THPs for the zRAM case.
>
> The effectiveness of collapse operation relies on the stability of
> forming large folios
> to ensure optimal performance. In embedded systems, where more than half of the
> memory may be allocated to zRAM, folios might undergo swapping out before
> collapsing or immediately after the collapse operation. It seems a
> TAO-like optimization
> to decrease fallback and latency is more effective.

Sorry, I'm not sure I've understood what you are saying here.

>
>>
>> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
>> think the latter just implies that you might want to do some readahead to try to
>> cover up the latency? If swap is moving towards being folio-orientated, then
>> readahead also surely needs to be folio-orientated, but I think that should be
>> the only major difference.
>>
>> Anyway, just some thoughts!
>
> Thank you very much for your valuable and insightful deliberations.
>
>>
>> Thanks,
>> Ryan
>>
>
> Thanks
> Barry


2024-03-19 09:22:24

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Ryan Roberts <[email protected]> writes:

>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>> day. I can only get
>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>> resources on it. So we may take the first step which can be applied on
>>>> a real product
>>>> and improve its performance, and step by step, we broaden it and make it
>>>> widely useful to various areas in which I can't reach :-)
>>>
>>> We must guarantee the normal swap path runs correctly and has no
>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>> So we have to put some effort on the normal path test anyway.
>>>
>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>> have a maximum
>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>
>>>> "
>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>> sense to copy the whole folio up to a certain size.
>>>> "
>
> I thought about this a bit more. No clear conclusions, but hoped this might help
> the discussion around policy:
>
> The decision about the size of the THP is made at first fault, with some help
> from user space and in future we might make decisions to split based on
> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> the THP out at some point in its lifetime should not impact on its size. It's
> just being moved around in the system and the reason for our original decision
> should still hold.
>
> So from that PoV, it would be good to swap-in to the same size that was
> swapped-out.

Sorry, I don't agree with this. It's better to swap in and swap out at
the smallest size if the page is only accessed seldom, to avoid wasting
memory.

> But we only kind-of keep that information around, via the swap
> entry contiguity and alignment. With that scheme it is possible that multiple
> virtually adjacent but not physically contiguous folios get swapped-out to
> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> folio. This is not ideal, and I think it would be valuable to try to maintain
> the original folio size information with the swap slot. One way to do this would
> be to store the original order for which the cluster was allocated in the
> cluster. Then we at least know that a given swap slot is either for a folio of
> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> steal a bit from swap_map to determine which case it is? Or are there better
> approaches?

[snip]

--
Best Regards,
Huang, Ying

2024-03-19 12:19:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On 19/03/2024 09:20, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>> day. I can only get
>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>> resources on it. So we may take the first step which can be applied on
>>>>> a real product
>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>> widely useful to various areas in which I can't reach :-)
>>>>
>>>> We must guarantee the normal swap path runs correctly and has no
>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>> So we have to put some effort on the normal path test anyway.
>>>>
>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>> have a maximum
>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>
>>>>> "
>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>> sense to copy the whole folio up to a certain size.
>>>>> "
>>
>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> the discussion around policy:
>>
>> The decision about the size of the THP is made at first fault, with some help
>> from user space and in future we might make decisions to split based on
>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> the THP out at some point in its lifetime should not impact on its size. It's
>> just being moved around in the system and the reason for our original decision
>> should still hold.
>>
>> So from that PoV, it would be good to swap-in to the same size that was
>> swapped-out.
>
> Sorry, I don't agree with this. It's better to swap-in and swap-out in
> smallest size if the page is only accessed seldom to avoid to waste
> memory.

If we want to optimize only for memory consumption, I'm sure there are many
things we would do differently. We need to find a balance between memory and
performance. The benefits of folios are well documented and the kernel is
heading in the direction of managing memory in variable-sized blocks. So I don't
think it's as simple as saying we should always swap-in the smallest possible
amount of memory.

You also said we should swap *out* in the smallest size possible. Have I
misunderstood you? I thought the case for swapping out a whole folio
without splitting was well established and non-controversial?

>
>> But we only kind-of keep that information around, via the swap
>> entry contiguity and alignment. With that scheme it is possible that multiple
>> virtually adjacent but not physically contiguous folios get swapped-out to
>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> the original folio size information with the swap slot. One way to do this would
>> be to store the original order for which the cluster was allocated in the
>> cluster. Then we at least know that a given swap slot is either for a folio of
>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> steal a bit from swap_map to determine which case it is? Or are there better
>> approaches?
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying


2024-03-20 02:21:02

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Ryan Roberts <[email protected]> writes:

> On 19/03/2024 09:20, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>>> day. I can only get
>>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>>> resources on it. So we may take the first step which can be applied on
>>>>>> a real product
>>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>>> widely useful to various areas in which I can't reach :-)
>>>>>
>>>>> We must guarantee the normal swap path runs correctly and has no
>>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>>> So we have to put some effort on the normal path test anyway.
>>>>>
>>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>>> have a maximum
>>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>>
>>>>>> "
>>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>>> sense to copy the whole folio up to a certain size.
>>>>>> "
>>>
>>> I thought about this a bit more. No clear conclusions, but hoped this might help
>>> the discussion around policy:
>>>
>>> The decision about the size of the THP is made at first fault, with some help
>>> from user space and in future we might make decisions to split based on
>>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>>> the THP out at some point in its lifetime should not impact on its size. It's
>>> just being moved around in the system and the reason for our original decision
>>> should still hold.
>>>
>>> So from that PoV, it would be good to swap-in to the same size that was
>>> swapped-out.
>>
>> Sorry, I don't agree with this. It's better to swap-in and swap-out in
>> smallest size if the page is only accessed seldom to avoid to waste
>> memory.
>
> If we want to optimize only for memory consumption, I'm sure there are many
> things we would do differently. We need to find a balance between memory and
> performance. The benefits of folios are well documented and the kernel is
> heading in the direction of managing memory in variable-sized blocks. So I don't
> think it's as simple as saying we should always swap-in the smallest possible
> amount of memory.

It's conditional, that is,

"if the page is only accessed seldom"

Then, the page that is swapped in will be swapped out again soon, and the
adjacent pages in the same large folio will not be accessed during this
period.

So, I suggest creating an algorithm that decides the swap-in order
automatically based on swap-readahead information. It can detect the
situation above via a reduced swap-readahead window size. And if the page
is accessed for quite a long time, and the adjacent pages in the same
large folio are accessed too, the swap-readahead window will grow and a
larger swap-in order will be used.

> You also said we should swap *out* in smallest size possible. Have I
> misunderstood you? I thought the case for swapping-out a whole folio without
> splitting was well established and non-controversial?

That is conditional too.

>>
>>> But we only kind-of keep that information around, via the swap
>>> entry contiguity and alignment. With that scheme it is possible that multiple
>>> virtually adjacent but not physically contiguous folios get swapped-out to
>>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>>> folio. This is not ideal, and I think it would be valuable to try to maintain
>>> the original folio size information with the swap slot. One way to do this would
>>> be to store the original order for which the cluster was allocated in the
>>> cluster. Then we at least know that a given swap slot is either for a folio of
>>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>>> steal a bit from swap_map to determine which case it is? Or are there better
>>> approaches?
>>
>> [snip]

--
Best Regards,
Huang, Ying

2024-03-20 02:51:01

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <[email protected]> wrote:
>
> Ryan Roberts <[email protected]> writes:
>
> > On 19/03/2024 09:20, Huang, Ying wrote:
> >> Ryan Roberts <[email protected]> writes:
> >>
> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >>>>>> day. I can only get
> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
> >>>>>> resources on it. So we may take the first step which can be applied on
> >>>>>> a real product
> >>>>>> and improve its performance, and step by step, we broaden it and make it
> >>>>>> widely useful to various areas in which I can't reach :-)
> >>>>>
> >>>>> We must guarantee the normal swap path runs correctly and has no
> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >>>>> So we have to put some effort on the normal path test anyway.
> >>>>>
> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >>>>>> have a maximum
> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>>>>
> >>>>>> "
> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>>>>> sense to copy the whole folio up to a certain size.
> >>>>>> "
> >>>
> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
> >>> the discussion around policy:
> >>>
> >>> The decision about the size of the THP is made at first fault, with some help
> >>> from user space and in future we might make decisions to split based on
> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >>> the THP out at some point in its lifetime should not impact on its size. It's
> >>> just being moved around in the system and the reason for our original decision
> >>> should still hold.
> >>>
> >>> So from that PoV, it would be good to swap-in to the same size that was
> >>> swapped-out.
> >>
> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
> >> smallest size if the page is only accessed seldom to avoid to waste
> >> memory.
> >
> > If we want to optimize only for memory consumption, I'm sure there are many
> > things we would do differently. We need to find a balance between memory and
> > performance. The benefits of folios are well documented and the kernel is
> > heading in the direction of managing memory in variable-sized blocks. So I don't
> > think it's as simple as saying we should always swap-in the smallest possible
> > amount of memory.
>
> It's conditional, that is,
>
> "if the page is only accessed seldom"
>
> Then, the page swapped-in will be swapped-out soon and adjacent pages in
> the same large folio will not be accessed during this period.
>
> So, I suggest to create an algorithm to decide swap-in order based on
> swap-readahead information automatically. It can detect the situation
> above via reduced swap readahead window size. And, if the page is
> accessed for quite long time, and the adjacent pages in the same large
> folio are accessed too, swap-readahead window will increase and large
> swap-in order will be used.

The original size chosen by do_anonymous_page() should be honored,
considering it embodies a decision influenced not only by sysfs settings
and per-vma HUGEPAGE hints but also by architectural characteristics, for
example CONT-PTE.

The model you're proposing may offer memory-saving benefits or reduce
I/O, but it entirely disassociates the swap-in size from the size prior
to swap-out. Moreover, there's no guarantee that a large folio sized by
the readahead window is contiguous in swap and can be added to the swap
cache, as we are currently dealing with folio->swap instead of
subpage->swap.
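
To make that contiguity requirement concrete, here is a minimal sketch of
the kind of check needed before attempting a large swap-in; the helper
name is invented, and the caller is assumed to hold the PTE lock and to
have already confirmed the faulting PTE is a swap entry:

static bool swap_pte_range_contiguous(pte_t *pte, int nr)
{
        swp_entry_t first = pte_to_swp_entry(ptep_get(pte));
        unsigned long offset = swp_offset(first);
        int i;

        for (i = 1; i < nr; i++) {
                pte_t ptent = ptep_get(pte + i);
                swp_entry_t entry;

                /* every PTE must be a swap entry ... */
                if (!is_swap_pte(ptent))
                        return false;
                entry = pte_to_swp_entry(ptent);
                /* ... a genuine one, not e.g. a migration entry ... */
                if (non_swap_entry(entry))
                        return false;
                /* ... on the same device, at consecutive offsets */
                if (swp_type(entry) != swp_type(first) ||
                    swp_offset(entry) != offset + i)
                        return false;
        }
        return true;
}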

Incidentally, do_anonymous_page() serves as the initial location for allocating
large folios. Given that memory conservation is a significant consideration in
do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

A large folio, by its nature, represents a high-quality resource that has the
potential to leverage hardware characteristics for the benefit of the
entire system.
Conversely, I don't believe that a randomly determined size dictated by the
readahead window possesses the same advantageous qualities.

SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever;
their needs should also be respected.

> > You also said we should swap *out* in smallest size possible. Have I
> > misunderstood you? I thought the case for swapping-out a whole folio without
> > splitting was well established and non-controversial?
>
> That is conditional too.
>
> >>
> >>> But we only kind-of keep that information around, via the swap
> >>> entry contiguity and alignment. With that scheme it is possible that multiple
> >>> virtually adjacent but not physically contiguous folios get swapped-out to
> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
> >>> the original folio size information with the swap slot. One way to do this would
> >>> be to store the original order for which the cluster was allocated in the
> >>> cluster. Then we at least know that a given swap slot is either for a folio of
> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >>> steal a bit from swap_map to determine which case it is? Or are there better
> >>> approaches?
> >>
> >> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-03-20 06:22:59

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Barry Song <[email protected]> writes:

> On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <[email protected]> wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> Ryan Roberts <[email protected]> writes:
>> >>
>> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >>>>>> day. I can only get
>> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
>> >>>>>> resources on it. So we may take the first step which can be applied on
>> >>>>>> a real product
>> >>>>>> and improve its performance, and step by step, we broaden it and make it
>> >>>>>> widely useful to various areas in which I can't reach :-)
>> >>>>>
>> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >>>>> So we have to put some effort on the normal path test anyway.
>> >>>>>
>> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>> >>>>>> have a maximum
>> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>> >>>>>>
>> >>>>>> "
>> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >>>>>> sense to copy the whole folio up to a certain size.
>> >>>>>> "
>> >>>
>> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >>> the discussion around policy:
>> >>>
>> >>> The decision about the size of the THP is made at first fault, with some help
>> >>> from user space and in future we might make decisions to split based on
>> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >>> just being moved around in the system and the reason for our original decision
>> >>> should still hold.
>> >>>
>> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >>> swapped-out.
>> >>
>> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
>> >> smallest size if the page is only accessed seldom to avoid to waste
>> >> memory.
>> >
>> > If we want to optimize only for memory consumption, I'm sure there are many
>> > things we would do differently. We need to find a balance between memory and
>> > performance. The benefits of folios are well documented and the kernel is
>> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> > think it's as simple as saying we should always swap-in the smallest possible
>> > amount of memory.
>>
>> It's conditional, that is,
>>
>> "if the page is only accessed seldom"
>>
>> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> the same large folio will not be accessed during this period.
>>
>> So, I suggest to create an algorithm to decide swap-in order based on
>> swap-readahead information automatically. It can detect the situation
>> above via reduced swap readahead window size. And, if the page is
>> accessed for quite long time, and the adjacent pages in the same large
>> folio are accessed too, swap-readahead window will increase and large
>> swap-in order will be used.
>
> The original size of do_anonymous_page() should be honored, considering it
> embodies a decision influenced by not only sysfs settings and per-vma
> HUGEPAGE hints but also architectural characteristics, for example
> CONT-PTE.
>
> The model you're proposing may offer memory-saving benefits or reduce I/O,
> but it entirely disassociates the size of the swap in from the size prior to the
> swap out.

Readahead isn't the only factor in determining the folio order. For
example, we must respect the "never" policy and always allocate order-0
folios there. There's no requirement to use the swap-out order for
swap-in either. Memory allocation has a different performance character
from storage reading.

> Moreover, there's no guarantee that the large folio generated by
> the readahead window is contiguous in the swap and can be added to the
> swap cache, as we are currently dealing with folio->swap instead of
> subpage->swap.

Yes. We can optimize only when all conditions are satisfied, just like
other optimizations.

> Incidentally, do_anonymous_page() serves as the initial location for allocating
> large folios. Given that memory conservation is a significant consideration in
> do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

Yes. We should consider that too. IIUC, that is why mTHP support is
off by default for now. After we find a way to solve the memory usage
issue, we may make the default "on".

> A large folio, by its nature, represents a high-quality resource that has the
> potential to leverage hardware characteristics for the benefit of the
> entire system.

But not at the cost of memory wastage.

> Conversely, I don't believe that a randomly determined size dictated by the
> readahead window possesses the same advantageous qualities.

There's a readahead algorithm, which is not purely random.

> SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> their needs should also be respected.

I understand that there are special requirements for SWP_SYNCHRONOUS_IO
devices. I just suggest working on the general code before specific
optimizations.

>> > You also said we should swap *out* in smallest size possible. Have I
>> > misunderstood you? I thought the case for swapping-out a whole folio without
>> > splitting was well established and non-controversial?
>>
>> That is conditional too.
>>
>> >>
>> >>> But we only kind-of keep that information around, via the swap
>> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >>> the original folio size information with the swap slot. One way to do this would
>> >>> be to store the original order for which the cluster was allocated in the
>> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >>> approaches?
>> >>
>> >> [snip]

--
Best Regards,
Huang, Ying

2024-03-20 18:39:02

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Wed, Mar 20, 2024 at 7:22 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Ryan Roberts <[email protected]> writes:
> >>
> >> > On 19/03/2024 09:20, Huang, Ying wrote:
> >> >> Ryan Roberts <[email protected]> writes:
> >> >>
> >> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >> >>>>>> day. I can only get
> >> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
> >> >>>>>> resources on it. So we may take the first step which can be applied on
> >> >>>>>> a real product
> >> >>>>>> and improve its performance, and step by step, we broaden it and make it
> >> >>>>>> widely useful to various areas in which I can't reach :-)
> >> >>>>>
> >> >>>>> We must guarantee the normal swap path runs correctly and has no
> >> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> >>>>> So we have to put some effort on the normal path test anyway.
> >> >>>>>
> >> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >> >>>>>> have a maximum
> >> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >> >>>>>>
> >> >>>>>> "
> >> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
> >> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> >>>>>> sense to copy the whole folio up to a certain size.
> >> >>>>>> "
> >> >>>
> >> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> >>> the discussion around policy:
> >> >>>
> >> >>> The decision about the size of the THP is made at first fault, with some help
> >> >>> from user space and in future we might make decisions to split based on
> >> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> >>> the THP out at some point in its lifetime should not impact on its size. It's
> >> >>> just being moved around in the system and the reason for our original decision
> >> >>> should still hold.
> >> >>>
> >> >>> So from that PoV, it would be good to swap-in to the same size that was
> >> >>> swapped-out.
> >> >>
> >> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
> >> >> smallest size if the page is only accessed seldom to avoid to waste
> >> >> memory.
> >> >
> >> > If we want to optimize only for memory consumption, I'm sure there are many
> >> > things we would do differently. We need to find a balance between memory and
> >> > performance. The benefits of folios are well documented and the kernel is
> >> > heading in the direction of managing memory in variable-sized blocks. So I don't
> >> > think it's as simple as saying we should always swap-in the smallest possible
> >> > amount of memory.
> >>
> >> It's conditional, that is,
> >>
> >> "if the page is only accessed seldom"
> >>
> >> Then, the page swapped-in will be swapped-out soon and adjacent pages in
> >> the same large folio will not be accessed during this period.
> >>
> >> So, I suggest to create an algorithm to decide swap-in order based on
> >> swap-readahead information automatically. It can detect the situation
> >> above via reduced swap readahead window size. And, if the page is
> >> accessed for quite long time, and the adjacent pages in the same large
> >> folio are accessed too, swap-readahead window will increase and large
> >> swap-in order will be used.
> >
> > The original size of do_anonymous_page() should be honored, considering it
> > embodies a decision influenced by not only sysfs settings and per-vma
> > HUGEPAGE hints but also architectural characteristics, for example
> > CONT-PTE.
> >
> > The model you're proposing may offer memory-saving benefits or reduce I/O,
> > but it entirely disassociates the size of the swap in from the size prior to the
> > swap out.
>
> Readahead isn't the only factor to determine folio order. For example,
> we must respect "never" policy to allocate order-0 folio always.
> There's no requirements to use swap-out order in swap-in too. Memory
> allocation has different performance character of storage reading.

Still quite unclear.

If users have only enabled 64KiB (order-4) large folios in sysfs, and the
readahead algorithm asks for 16KiB, what should the large folio size be?
Setting it to 16KiB doesn't align with the users' requirements, while
setting it to 64KiB would be wasteful according to your criteria.

>
> > Moreover, there's no guarantee that the large folio generated by
> > the readahead window is contiguous in the swap and can be added to the
> > swap cache, as we are currently dealing with folio->swap instead of
> > subpage->swap.
>
> Yes. We can optimize only when all conditions are satisfied. Just like
> other optimization.
>
> > Incidentally, do_anonymous_page() serves as the initial location for allocating
> > large folios. Given that memory conservation is a significant consideration in
> > do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?
>
> Yes. We should consider that too. IIUC, that is why mTHP support is
> off by default for now. After we find a way to solve the memory usage
> issue. We may make default "on".

It's challenging to establish a universal solution because various
systems exhibit diverse hardware characteristics, and VMAs may require
different alignments. The current sysfs and per-vma hints allow users the
opportunity to customize settings according to their specific
requirements.

>
> > A large folio, by its nature, represents a high-quality resource that has the
> > potential to leverage hardware characteristics for the benefit of the
> > entire system.
>
> But not at the cost of memory wastage.
>
> > Conversely, I don't believe that a randomly determined size dictated by the
> > readahead window possesses the same advantageous qualities.
>
> There's a readahead algorithm which is not pure random.
>
> > SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> > their needs should also be respected.
>
> I understand that there are special requirements for SWP_SYNCHRONOUS_IO
> devices. I just suggest to work on general code before specific
> optimization.

I disagree with your definition of "special" and "general". According to
your logic, non-SWP_SYNCHRONOUS_IO devices could also be classified as
"special". Furthermore, the number of systems running SWP_SYNCHRONOUS_IO
is significantly greater than the number running non-SWP_SYNCHRONOUS_IO,
contradicting your assertion.

SWP_SYNCHRONOUS_IO devices have only a minor chance of being involved in
readahead. However, in OPPO's code, which hasn't been sent to the LKML
yet, we use exactly the same size as do_anonymous_page() for readahead.
Without a clear description of how you want the new readahead algorithm
to balance memory waste against users' hints from sysfs and per-vma
flags, it appears to be an ambiguous area to address.

Please provide a clear description of how you would like the new
readahead algorithm to function. I believe this clarity will help others
attempt to implement it.

>
> >> > You also said we should swap *out* in smallest size possible. Have I
> >> > misunderstood you? I thought the case for swapping-out a whole folio without
> >> > splitting was well established and non-controversial?
> >>
> >> That is conditional too.
> >>
> >> >>
> >> >>> But we only kind-of keep that information around, via the swap
> >> >>> entry contiguity and alignment. With that scheme it is possible that multiple
> >> >>> virtually adjacent but not physically contiguous folios get swapped-out to
> >> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> >>> the original folio size information with the swap slot. One way to do this would
> >> >>> be to store the original order for which the cluster was allocated in the
> >> >>> cluster. Then we at least know that a given swap slot is either for a folio of
> >> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> >>> steal a bit from swap_map to determine which case it is? Or are there better
> >> >>> approaches?
> >> >>
> >> >> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-03-21 04:25:52

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

Barry Song <[email protected]> writes:

> On Wed, Mar 20, 2024 at 7:22 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Ryan Roberts <[email protected]> writes:
>> >>
>> >> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> >> Ryan Roberts <[email protected]> writes:
>> >> >>
>> >> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >> >>>>>> day. I can only get
>> >> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
>> >> >>>>>> resources on it. So we may take the first step which can be applied on
>> >> >>>>>> a real product
>> >> >>>>>> and improve its performance, and step by step, we broaden it and make it
>> >> >>>>>> widely useful to various areas in which I can't reach :-)
>> >> >>>>>
>> >> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >> >>>>> So we have to put some effort on the normal path test anyway.
>> >> >>>>>
>> >> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>> >> >>>>>> have a maximum
>> >> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>> >> >>>>>>
>> >> >>>>>> "
>> >> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >> >>>>>> sense to copy the whole folio up to a certain size.
>> >> >>>>>> "
>> >> >>>
>> >> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >> >>> the discussion around policy:
>> >> >>>
>> >> >>> The decision about the size of the THP is made at first fault, with some help
>> >> >>> from user space and in future we might make decisions to split based on
>> >> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >> >>> just being moved around in the system and the reason for our original decision
>> >> >>> should still hold.
>> >> >>>
>> >> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >> >>> swapped-out.
>> >> >>
>> >> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
>> >> >> smallest size if the page is only accessed seldom to avoid to waste
>> >> >> memory.
>> >> >
>> >> > If we want to optimize only for memory consumption, I'm sure there are many
>> >> > things we would do differently. We need to find a balance between memory and
>> >> > performance. The benefits of folios are well documented and the kernel is
>> >> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> >> > think it's as simple as saying we should always swap-in the smallest possible
>> >> > amount of memory.
>> >>
>> >> It's conditional, that is,
>> >>
>> >> "if the page is only accessed seldom"
>> >>
>> >> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> >> the same large folio will not be accessed during this period.
>> >>
>> >> So, I suggest to create an algorithm to decide swap-in order based on
>> >> swap-readahead information automatically. It can detect the situation
>> >> above via reduced swap readahead window size. And, if the page is
>> >> accessed for quite long time, and the adjacent pages in the same large
>> >> folio are accessed too, swap-readahead window will increase and large
>> >> swap-in order will be used.
>> >
>> > The original size of do_anonymous_page() should be honored, considering it
>> > embodies a decision influenced by not only sysfs settings and per-vma
>> > HUGEPAGE hints but also architectural characteristics, for example
>> > CONT-PTE.
>> >
>> > The model you're proposing may offer memory-saving benefits or reduce I/O,
>> > but it entirely disassociates the size of the swap in from the size prior to the
>> > swap out.
>>
>> Readahead isn't the only factor to determine folio order. For example,
>> we must respect "never" policy to allocate order-0 folio always.
>> There's no requirements to use swap-out order in swap-in too. Memory
>> allocation has different performance character of storage reading.
>
> Still quite unclear.
>
> If users have only enabled 64KiB (4-ORDER) large folios in sysfs, and the
> readahead algorithm requires 16KiB, what should be set as the large folio size?
> Setting it to 16KiB doesn't align with users' requirements, while
> setting it to 64KiB
> would be wasteful according to your criteria.

IIUC, enabling 64KB means you can use 64KB mTHP if appropriate; it
doesn't mean that you must use 64KB mTHP. If so, we should use 16KB mTHP
in that situation.

>> > Moreover, there's no guarantee that the large folio generated by
>> > the readahead window is contiguous in the swap and can be added to the
>> > swap cache, as we are currently dealing with folio->swap instead of
>> > subpage->swap.
>>
>> Yes. We can optimize only when all conditions are satisfied. Just like
>> other optimization.
>>
>> > Incidentally, do_anonymous_page() serves as the initial location for allocating
>> > large folios. Given that memory conservation is a significant consideration in
>> > do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?
>>
>> Yes. We should consider that too. IIUC, that is why mTHP support is
>> off by default for now. After we find a way to solve the memory usage
>> issue. We may make default "on".
>
> It's challenging to establish a universal solution because various systems
> exhibit diverse hardware characteristics, and VMAs may require different
> alignments. The current sysfs and per-vma hints allow users the opportunity
> o customize settings according to their specific requirements.

IIUC, the Linux kernel tries to provide reasonable default behavior in
all situations. We should try to optimize the default behavior in the
first place and only introduce customization if we fail to do that. I
don't think it's a good idea to introduce too much customization if we
haven't tried to optimize the default behavior.

>>
>> > A large folio, by its nature, represents a high-quality resource that has the
>> > potential to leverage hardware characteristics for the benefit of the
>> > entire system.
>>
>> But not at the cost of memory wastage.
>>
>> > Conversely, I don't believe that a randomly determined size dictated by the
>> > readahead window possesses the same advantageous qualities.
>>
>> There's a readahead algorithm which is not pure random.
>>
>> > SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
>> > their needs should also be respected.
>>
>> I understand that there are special requirements for SWP_SYNCHRONOUS_IO
>> devices. I just suggest to work on general code before specific
>> optimization.
>
> I disagree with your definition of "special" and "general". According
> to your logic,
> non-SWP_SYNCHRONOUS_IO devices could also be classified as "special".

SWP_SYNCHRONOUS_IO devices also use the general code path. They just
apply a special optimization in one special situation
(__swap_count(entry) == 1). Optimizing the general path benefits
everyone.
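
For reference, that situation is checked by the swapcache-bypass gate in
do_swap_page(); condensed into a helper with an illustrative name (not
the exact upstream code), it looks roughly like:

static bool can_skip_swapcache(struct swap_info_struct *si,
                               swp_entry_t entry)
{
        /*
         * Bypass the swap cache only for synchronous devices, and only
         * when the entry has a single user, so nobody else can race
         * against the cache-less swap-in.
         */
        return data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
               __swap_count(entry) == 1;
}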

> Furthermore, the number of systems running SWP_SYNCHRONOUS_IO is
> significantly greater than those running non-SWP_SYNCHRONOUS_IO,
> contradicting your assertion.
>
> SWP_SYNCHRONOUS_IO devices have a minor chance of being involved
> in readahead.

Then it loses an opportunity to determine the appropriate folio order.
We can consider how to balance the overhead and benefit of readahead.
IIUC, compared with the original SWP_SYNCHRONOUS_IO swap-in, mTHP is a
kind of readahead too.

BTW, because we have added more and more swap-cache-related operations
(swapcache_prepare(), clear_shadow_from_swap_cache(), swapcache_clear(),
etc.) to the SWP_SYNCHRONOUS_IO code path, I wonder whether the benefit
of SWP_SYNCHRONOUS_IO is still large enough. We may need to re-evaluate
it.

> However, in OPPO's code, which hasn't been sent in the
> LKML yet, we use the exact same size as do_anonymous_page for readahead.
> Without a clear description of how you want the new readahead
> algorithm to balance memory waste and users' hints from sysfs and
> per-vma flags, it appears to be an ambiguous area to address.
>
> Please provide a clear description of how you would like the new readahead
> algorithm to function. I believe this clarity will facilitate others
> in attempting to
> implement it.

For example, if __swapin_nr_pages() > 4, we can try to allocate an
order-2 mTHP if other conditions are satisfied.
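
A minimal sketch of that kind of mapping, with an invented helper name
and assuming the readahead window size has already been computed, might
look like:

static int swapin_order_from_ra_window(unsigned int ra_pages,
                                       int max_enabled_order)
{
        int order;

        /* a shrunken window suggests seldom-accessed pages: stay at order-0 */
        if (ra_pages <= 1)
                return 0;

        /* e.g. a window of 5..8 pages maps to order 2..3 */
        order = ilog2(ra_pages);

        /* never exceed the largest order enabled via sysfs */
        return min(order, max_enabled_order);
}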

>>
>> >> > You also said we should swap *out* in smallest size possible. Have I
>> >> > misunderstood you? I thought the case for swapping-out a whole folio without
>> >> > splitting was well established and non-controversial?
>> >>
>> >> That is conditional too.
>> >>
>> >> >>
>> >> >>> But we only kind-of keep that information around, via the swap
>> >> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >> >>> the original folio size information with the swap slot. One way to do this would
>> >> >>> be to store the original order for which the cluster was allocated in the
>> >> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >> >>> approaches?
>> >> >>
>> >> >> [snip]
>>

--
Best Regards,
Huang, Ying

2024-03-21 05:12:51

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Thu, Mar 21, 2024 at 5:25 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Wed, Mar 20, 2024 at 7:22 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >> Ryan Roberts <[email protected]> writes:
> >> >>
> >> >> > On 19/03/2024 09:20, Huang, Ying wrote:
> >> >> >> Ryan Roberts <[email protected]> writes:
> >> >> >>
> >> >> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >> >> >>>>>> day. I can only get
> >> >> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
> >> >> >>>>>> resources on it. So we may take the first step which can be applied on
> >> >> >>>>>> a real product
> >> >> >>>>>> and improve its performance, and step by step, we broaden it and make it
> >> >> >>>>>> widely useful to various areas in which I can't reach :-)
> >> >> >>>>>
> >> >> >>>>> We must guarantee the normal swap path runs correctly and has no
> >> >> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> >> >>>>> So we have to put some effort on the normal path test anyway.
> >> >> >>>>>
> >> >> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >> >> >>>>>> have a maximum
> >> >> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >> >> >>>>>>
> >> >> >>>>>> "
> >> >> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
> >> >> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> >> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> >> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> >> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >> >> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> >> >>>>>> sense to copy the whole folio up to a certain size.
> >> >> >>>>>> "
> >> >> >>>
> >> >> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> >> >>> the discussion around policy:
> >> >> >>>
> >> >> >>> The decision about the size of the THP is made at first fault, with some help
> >> >> >>> from user space and in future we might make decisions to split based on
> >> >> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> >> >>> the THP out at some point in its lifetime should not impact on its size. It's
> >> >> >>> just being moved around in the system and the reason for our original decision
> >> >> >>> should still hold.
> >> >> >>>
> >> >> >>> So from that PoV, it would be good to swap-in to the same size that was
> >> >> >>> swapped-out.
> >> >> >>
> >> >> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
> >> >> >> smallest size if the page is only accessed seldom to avoid to waste
> >> >> >> memory.
> >> >> >
> >> >> > If we want to optimize only for memory consumption, I'm sure there are many
> >> >> > things we would do differently. We need to find a balance between memory and
> >> >> > performance. The benefits of folios are well documented and the kernel is
> >> >> > heading in the direction of managing memory in variable-sized blocks. So I don't
> >> >> > think it's as simple as saying we should always swap-in the smallest possible
> >> >> > amount of memory.
> >> >>
> >> >> It's conditional, that is,
> >> >>
> >> >> "if the page is only accessed seldom"
> >> >>
> >> >> Then, the page swapped-in will be swapped-out soon and adjacent pages in
> >> >> the same large folio will not be accessed during this period.
> >> >>
> >> >> So, I suggest to create an algorithm to decide swap-in order based on
> >> >> swap-readahead information automatically. It can detect the situation
> >> >> above via reduced swap readahead window size. And, if the page is
> >> >> accessed for quite long time, and the adjacent pages in the same large
> >> >> folio are accessed too, swap-readahead window will increase and large
> >> >> swap-in order will be used.
> >> >
> >> > The original size of do_anonymous_page() should be honored, considering it
> >> > embodies a decision influenced by not only sysfs settings and per-vma
> >> > HUGEPAGE hints but also architectural characteristics, for example
> >> > CONT-PTE.
> >> >
> >> > The model you're proposing may offer memory-saving benefits or reduce I/O,
> >> > but it entirely disassociates the size of the swap in from the size prior to the
> >> > swap out.
> >>
> >> Readahead isn't the only factor to determine folio order. For example,
> >> we must respect "never" policy to allocate order-0 folio always.
> >> There's no requirements to use swap-out order in swap-in too. Memory
> >> allocation has different performance character of storage reading.
> >
> > Still quite unclear.
> >
> > If users have only enabled 64KiB (4-ORDER) large folios in sysfs, and the
> > readahead algorithm requires 16KiB, what should be set as the large folio size?
> > Setting it to 16KiB doesn't align with users' requirements, while
> > setting it to 64KiB
> > would be wasteful according to your criteria.
>
> IIUC, enabling 64KB means you can use 64KB mTHP if appropriate, doesn't
> mean that you must use 64KB mTHP. If so, we should use 16KB mTHP in
> that situation.

A specific large folio size inherently denotes a high-quality resource.
For example, a 64KiB folio needs only one TLB entry on ARM64 with
CONT-PTE, just as a 2MB large folio needs only one TLB entry. I am
skeptical whether a size determined by readahead offers any tangible
advantages over simply using small folios.

>
> >> > Moreover, there's no guarantee that the large folio generated by
> >> > the readahead window is contiguous in the swap and can be added to the
> >> > swap cache, as we are currently dealing with folio->swap instead of
> >> > subpage->swap.
> >>
> >> Yes. We can optimize only when all conditions are satisfied. Just like
> >> other optimization.
> >>
> >> > Incidentally, do_anonymous_page() serves as the initial location for allocating
> >> > large folios. Given that memory conservation is a significant consideration in
> >> > do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?
> >>
> >> Yes. We should consider that too. IIUC, that is why mTHP support is
> >> off by default for now. After we find a way to solve the memory usage
> >> issue. We may make default "on".
> >
> > It's challenging to establish a universal solution because various systems
> > exhibit diverse hardware characteristics, and VMAs may require different
> > alignments. The current sysfs and per-vma hints allow users the opportunity
> > o customize settings according to their specific requirements.
>
> IIUC, Linux kernel is trying to provide a reasonable default behavior in
> all situations. We are trying to optimize default behavior in the first
> place, only introduce customization if we fail to do that. I don't
> think that it's a good idea to introduce too much customization if we
> haven't tried to optimize the default behavior.

I've never been opposed to the readahead case, but I feel it's a second step.

My point is to begin with the simplest and most practical approaches
that can generate
genuine value and contribution. The SWP_SYNCHRONOUS_IO case has been
implemented on millions of OPPO's phones and has demonstrated product success.

>
> >>
> >> > A large folio, by its nature, represents a high-quality resource that has the
> >> > potential to leverage hardware characteristics for the benefit of the
> >> > entire system.
> >>
> >> But not at the cost of memory wastage.
> >>
> >> > Conversely, I don't believe that a randomly determined size dictated by the
> >> > readahead window possesses the same advantageous qualities.
> >>
> >> There's a readahead algorithm which is not pure random.
> >>
> >> > SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> >> > their needs should also be respected.
> >>
> >> I understand that there are special requirements for SWP_SYNCHRONOUS_IO
> >> devices. I just suggest to work on general code before specific
> >> optimization.
> >
> > I disagree with your definition of "special" and "general". According
> > to your logic,
> > non-SWP_SYNCHRONOUS_IO devices could also be classified as "special".
>
> SWP_SYNCHRONOUS_IO devices also use general code path. They just use
> some special optimization in some special situation (__swap_count(entry)
> == 1). Optimization in general path benefits everyone.
>
> > Furthermore, the number of systems running SWP_SYNCHRONOUS_IO is
> > significantly greater than those running non-SWP_SYNCHRONOUS_IO,
> > contradicting your assertion.
> >
> > SWP_SYNCHRONOUS_IO devices have a minor chance of being involved
> > in readahead.
>
> Then it loses an opportunity to determine the appropriate folio order.
> We can consider how to balance between the overhead and benefit of
> readahead. IIUC, compared with original SWP_SYNCHRONOUS_IO swap-in,
> mTHP is a kind of readahead too.
>
> BTW, because we have added more and more swap cache related operations
> (swapcache_prepare(), clear_shadow_from_swap_cache(), swapcache_clear(),
> etc.) in SWP_SYNCHRONOUS_IO code path, I suspect whether the benefit of
> SWP_SYNCHRONOUS_IO is still large enough. We may need to re-evaluate
> it.

Obviously SWP_SYNCHRONOUS_IO is still quite valuable, as Kairui showed
with the data in his commit 13ddaf26be324a ("mm/swap: fix race when
skipping swapcache"):

"Performance overhead is minimal, microbenchmark swapin 10G from 32G zram:
Before: 10934698 us
After: 11157121 us
Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) "

>
> > However, in OPPO's code, which hasn't been sent in the
> > LKML yet, we use the exact same size as do_anonymous_page for readahead.
> > Without a clear description of how you want the new readahead
> > algorithm to balance memory waste and users' hints from sysfs and
> > per-vma flags, it appears to be an ambiguous area to address.
> >
> > Please provide a clear description of how you would like the new readahead
> > algorithm to function. I believe this clarity will facilitate others
> > in attempting to
> > implement it.
>
> For example, if __swapin_nr_pages() > 4, we can try to allocate an
> order-2 mTHP if other conditions are satisfied.

There is no evidence suggesting that an order-2 folio, or any other
order determined by readahead, is superior to having four small folios.

>
> >>
> >> >> > You also said we should swap *out* in smallest size possible. Have I
> >> >> > misunderstood you? I thought the case for swapping-out a whole folio without
> >> >> > splitting was well established and non-controversial?
> >> >>
> >> >> That is conditional too.
> >> >>
> >> >> >>
> >> >> >>> But we only kind-of keep that information around, via the swap
> >> >> >>> entry contiguity and alignment. With that scheme it is possible that multiple
> >> >> >>> virtually adjacent but not physically contiguous folios get swapped-out to
> >> >> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> >> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> >> >>> the original folio size information with the swap slot. One way to do this would
> >> >> >>> be to store the original order for which the cluster was allocated in the
> >> >> >>> cluster. Then we at least know that a given swap slot is either for a folio of
> >> >> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> >> >>> steal a bit from swap_map to determine which case it is? Or are there better
> >> >> >>> approaches?
> >> >> >>
> >> >> >> [snip]
> >>
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-03-21 08:42:58

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

Hi Ryan,
Sorry for the late reply.

On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
>
> On 04/03/2024 08:13, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> > THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> > MTE as the MTE code works with the assumption tags save/restore is
> > always handling a folio with only one page.
> >
> > The limitation should be removed as more and more ARM64 SoCs have
> > this feature. Co-existence of MTE and THP_SWAP becomes more and
> > more important.
> >
> > This patch makes MTE tags saving support large folios, then we don't
> > need to split large folios into base pages for swapping out on ARM64
> > SoCs with MTE any more.
> >
> > arch_prepare_to_swap() should take folio rather than page as parameter
> > because we support THP swap-out as a whole. It saves tags for all
> > pages in a large folio.
> >
> > As now we are restoring tags based-on folio, in arch_swap_restore(),
> > we may increase some extra loops and early-exitings while refaulting
> > a large folio which is still in swapcache in do_swap_page(). In case
> > a large folio has nr pages, do_swap_page() will only set the PTE of
> > the particular page which is causing the page fault.
> > Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> > will loop nr times for those subpages in the folio. So right now the
> > algorithmic complexity becomes O(nr^2).
> >
> > Once we support mapping large folios in do_swap_page(), extra loops
> > and early-exitings will decrease while not being completely removed
> > as a large folio might get partially tagged in corner cases such as,
> > 1. a large folio in swapcache can be partially unmapped, thus, MTE
> > tags for the unmapped pages will be invalidated;
> > 2. users might use mprotect() to set MTEs on a part of a large folio.
> >
> > arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> > who needed it.
> >
> > Cc: Catalin Marinas <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Ryan Roberts <[email protected]>
> > Cc: Mark Rutland <[email protected]>
> > Cc: David Hildenbrand <[email protected]>
> > Cc: Kemeng Shi <[email protected]>
> > Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> > Cc: Anshuman Khandual <[email protected]>
> > Cc: Peter Collingbourne <[email protected]>
> > Cc: Steven Price <[email protected]>
> > Cc: Yosry Ahmed <[email protected]>
> > Cc: Peter Xu <[email protected]>
> > Cc: Lorenzo Stoakes <[email protected]>
> > Cc: "Mike Rapoport (IBM)" <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > CC: "Aneesh Kumar K.V" <[email protected]>
> > Cc: Rick Edgecombe <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > Reviewed-by: Steven Price <[email protected]>
> > Acked-by: Chris Li <[email protected]>
> > ---
> > arch/arm64/include/asm/pgtable.h | 19 ++------------
> > arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
> > include/linux/huge_mm.h | 12 ---------
> > include/linux/pgtable.h | 2 +-
> > mm/page_io.c | 2 +-
> > mm/swap_slots.c | 2 +-
> > 6 files changed, 48 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 401087e8a43d..7a54750770b8 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > - return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> > /*
> > * Outside of a few very special situations (e.g. hibernation), we always
> > * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > #ifdef CONFIG_ARM64_MTE
> >
> > #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > - if (system_supports_mte())
> > - return mte_save_tags(page);
> > - return 0;
> > -}
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> > #define __HAVE_ARCH_SWAP_INVALIDATE
> > static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
> > }
> >
> > #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > - if (system_supports_mte())
> > - mte_restore_tags(entry, &folio->page);
> > -}
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> > #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..295836fef620 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> > mte_free_tag_storage(tags);
> > }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > + swp_entry_t entry = page_swap_entry(page);
> > +
> > + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> > void mte_invalidate_tags_area(int type)
> > {
> > swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
> > }
> > xa_unlock(&mte_pages);
> > }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > + long i, nr;
> > + int err;
> > +
> > + if (!system_supports_mte())
> > + return 0;
> > +
> > + nr = folio_nr_pages(folio);
> > +
> > + for (i = 0; i < nr; i++) {
> > + err = mte_save_tags(folio_page(folio, i));
> > + if (err)
> > + goto out;
> > + }
> > + return 0;
> > +
> > +out:
> > + while (i--)
> > + __mte_invalidate_tags(folio_page(folio, i));
> > + return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>
> I'm still not a fan of the fact that entry could be anywhere within folio.
>
> > +{
> > + if (system_supports_mte()) {
>
> nit: if you do:
>
> if (!system_supports_mte())
> return;

Acked

>
> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
> the main body.
>
> > + long i, nr = folio_nr_pages(folio);
> > +
> > + entry.val -= swp_offset(entry) & (nr - 1);
>
> This assumes that folios are always stored in swap with natural alignment. Is
> that definitely a safe assumption? My swap-out series is currently ensuring that
> folios are swapped-out naturally aligned, but that is an implementation detail.
>

I concur that this is an implementation detail. However, we should be bold
enough to state that swap slots will be contiguous, considering we are
currently utilizing folio->swap instead of subpage->swap?

> Your cover note for swap-in says that you could technically swap in a large
> folio without it having been swapped-out large. If you chose to do that in
> future, this would break, right? I don't think it's good to couple the swap

Right, technically I agree. Given that we still have plenty of work left even
for swapping in contiguous swap slots, it's unlikely that swapping in large
folios for non-contiguous entries will occur in the foreseeable future :-)

> storage layout to the folio order that you want to swap into. Perhaps that's an
> argument for passing each *page* to this function with its exact, corresponding
> swap entry?

I recall Matthew Wilcox strongly objected to using "page" as the parameter,
so I've discarded that approach. Alternatively, it appears I can consistently
pass folio->swap to this function and ensure the function always retrieves the
first entry?
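
For illustration, a minimal sketch of what arch_swap_restore() could look like
under that convention (entry is assumed to always be folio->swap, i.e. the
swap entry of the folio's first page, so no alignment arithmetic is needed;
the early-return style suggested above is folded in):

void arch_swap_restore(swp_entry_t entry, struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	if (!system_supports_mte())
		return;

	/* entry is assumed to be folio->swap, the folio's first swap entry */
	for (i = 0; i < nr; i++) {
		mte_restore_tags(entry, folio_page(folio, i));
		entry.val++;
	}
}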

>
> > + for (i = 0; i < nr; i++) {
> > + mte_restore_tags(entry, folio_page(folio, i));
> > + entry.val++;
> > + }
> > + }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index de0c89105076..e04b93c43965 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
> > #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> > #define split_folio(f) split_folio_to_order(f, 0)
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > - return true;
> > -}
> > -#endif
> > -
> > #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index e1b22903f709..bfcfe3386934 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> > * prototypes must be defined in the arch-specific asm/pgtable.h file.
> > */
> > #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> > {
> > return 0;
> > }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index ae2b49055e43..a9a7c236aecc 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > * Arch code may have to preserve more data than just the page
> > * contents, e.g. memory tags.
> > */
> > - ret = arch_prepare_to_swap(&folio->page);
> > + ret = arch_prepare_to_swap(folio);
> > if (ret) {
> > folio_mark_dirty(folio);
> > folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 90973ce7881d..53abeaf1371d 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> > entry.val = 0;
> >
> > if (folio_test_large(folio)) {
> > - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > + if (IS_ENABLED(CONFIG_THP_SWAP))
> > get_swap_pages(1, &entry, folio_nr_pages(folio));
> > goto out;
> > }
>

Thanks
Barry

2024-03-21 09:22:32

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Tue, Mar 19, 2024 at 10:05 PM Ryan Roberts <[email protected]> wrote:
>
> On 19/03/2024 06:27, Barry Song wrote:
> > On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <[email protected]> wrote:
> >>
> >>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >>>>> day. I can only get
> >>>>> started on a hardware which I can easily reach and have enough hardware/test
> >>>>> resources on it. So we may take the first step which can be applied on
> >>>>> a real product
> >>>>> and improve its performance, and step by step, we broaden it and make it
> >>>>> widely useful to various areas in which I can't reach :-)
> >>>>
> >>>> We must guarantee the normal swap path runs correctly and has no
> >>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >>>> So we have to put some effort on the normal path test anyway.
> >>>>
> >>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >>>>> have a maximum
> >>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>>>
> >>>>> "
> >>>>> So in the common case, swap-in will pull in the same size of folio as was
> >>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>>>> sense to copy the whole folio up to a certain size.
> >>>>> "
> >>
> >> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> the discussion around policy:
> >>
> >> The decision about the size of the THP is made at first fault, with some help
> >> from user space and in future we might make decisions to split based on
> >> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> the THP out at some point in its lifetime should not impact on its size. It's
> >> just being moved around in the system and the reason for our original decision
> >> should still hold.
> >
> > Indeed, this is an ideal framework for smartphones and likely for
> > widely embedded
> > Linux systems utilizing zRAM. We set the mTHP size to 64KiB to
> > leverage CONT-PTE,
> > given that more than half of the memory on phones may frequently swap out and
> > swap in (for instance, when opening and switching between apps). The
> > ideal approach
> > would involve adhering to the decision made in do_anonymous_page().
> >
> >>
> >> So from that PoV, it would be good to swap-in to the same size that was
> >> swapped-out. But we only kind-of keep that information around, via the swap
> >> entry contiguity and alignment. With that scheme it is possible that multiple
> >> virtually adjacent but not physically contiguous folios get swapped-out to
> >> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> the original folio size information with the swap slot. One way to do this would
> >> be to store the original order for which the cluster was allocated in the
> >> cluster. Then we at least know that a given swap slot is either for a folio of
> >> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> steal a bit from swap_map to determine which case it is? Or are there better
> >> approaches?
> >
> > In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
> > swap_readahead()
> > even when __swap_count(entry) equals 1. This leads to two scenarios:
> > swap_vma_readahead
> > and swap_cluster_readahead.
> >
> > In swap_vma_readahead, when blk_queue_nonrot, physical contiguity
> > doesn't appear to be a
> > critical concern. However, for swap_cluster_readahead, the focus
> > shifts towards the potential
> > impact of physical discontiguity.
>
> When you talk about "physical [dis]contiguity" I think you are talking about
> contiguity of the swap entries in the swap device? Both paths currently allocate
> order-0 folios to swap into, so neither have a concept of physical contiguity in
> memory at the moment.
>
> As I understand it, roughly the aim is to readahead by cluster for rotating
> disks to reduce seek time, and readahead by virtual address for non-rotating
> devices since there is no seek time cost. Correct?

From the code comment, I agree with this.

* It's a main entry function for swap readahead. By the configuration,
* it will read ahead blocks by cluster-based(ie, physical disk based)
* or vma-based(ie, virtual address based on faulty address) readahead.

>
> Note that today, swap-out only supports (2M) THP if the swap device is
> non-rotating. If it is rotating, the THP is first split. My swap-out series
> maintains this policy for mTHP. So I think we only really care about
> swap_vma_readahead() here; we want to teach it to figure out the order of the
> swap entries and swap them into folios of the same order (with a fallback to
> order-0 if allocation fails).

I agree we don't need to care about devices which rotate.

>
> >
> > struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > struct vm_fault *vmf)
> > {
> > struct mempolicy *mpol;
> > pgoff_t ilx;
> > struct folio *folio;
> >
> > mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > folio = swap_use_vma_readahead() ?
> > swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> > swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > mpol_cond_put(mpol);
> >
> > if (!folio)
> > return NULL;
> > return folio_file_page(folio, swp_offset(entry));
> > }
> >
> > In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
> > rendering physical contiguity less of a concern. Moreover, instances where
> > swap_readahead() is accessed are rare, typically occurring only in scenarios
> > involving forked but non-CoWed memory.
>
> Yes understood. What I'm hearing is that for Android at least, stealing a bit
> from swap_map to remember if a swap entry is the order marked in the cluster or
> order-0 won't be noticed because almost all entries have swap count == 1. From
> memory, I think swap_map is 8 bits, and 2 bits are currently stolen, leaving 6
> bits (count = 64) before having to move to the swap map continuation stuff. Does
> anyone know what workloads provoke this overflow? What are the consequences of
> reducing that count to 32?

I'm not entirely clear on why you need bits to record this information.
Could you provide more details?

>
> >
> > So I think large folios swap-in will at least need three steps
> >
> > 1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), this has a very
> > clear model and has no complex I/O issue.
> > 2. on nonrot block device(bdev_nonrot == true), it cares less about
> > I/O contiguity.
> > 3. on rot block devices which care about I/O contiguity.
>
> I don't think we care about (3); if the device rotates, we will have split the
> folio at swap-out, so we are only concerned with swapping-in order-0 folios.
>
> >
> > This patchset primarily addresses the systems utilizing
> > SWP_SYNCHRONOUS_IO(type1),
> > such as Android and embedded Linux, a straightforward model is established,
> > with minimal complexity regarding I/O issues.
>
> Understood. But your implication is that making swap_vma_readahead() large folio
> swap-in aware will be complex. I think we can remember the original order in the
> swap device, then it shouldn't be too difficult - conceptually at least.

Currently, I can scan PTE entries and determine the number of contiguous swap
offsets. The swap_vma_readahead code to support large folios already exists in
OPPO's repository. I'm confident that it can be cleaned up and submitted to
LKML. However, the issue lies with the readahead policy. We typically prefer
using the same 64KiB size as in do_anonymous_page(), but clearly, this isn't
the preference for Ying :-)
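
For reference, a rough sketch of that PTE scan (a hypothetical helper, not the
exact code in OPPO's tree; the caller is assumed to hold the PTL and to have
already verified that the first PTE holds the faulting swap entry):

/* count how many PTEs, starting at @pte, hold swap entries contiguous with @entry */
static int count_contig_swap_ptes(pte_t *pte, swp_entry_t entry, int max_nr)
{
	int i;

	for (i = 0; i < max_nr; i++) {
		pte_t ptent = ptep_get(pte + i);

		/* stop at the first non-swap PTE or non-contiguous swap offset */
		if (!is_swap_pte(ptent) ||
		    pte_to_swp_entry(ptent).val != entry.val + i)
			break;
	}
	return i;
}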

>
> >
> >>
> >> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
> >> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
> >> say yes but I have no data. But on the other hand, memory is aged and
> >> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
> >> original allocation size policy is good (it currently isn't) then a folio should
> >> be sized to cover temporally close memory and if we need to access some of it,
> >> chances are we need all of it.
> >>
> >> If we think the IO concern is legitimate then we could define a threshold size
> >> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
> >> those chunks be - one page, or the threshold size itself? Probably the latter?
> >> And perhaps that threshold could also be used by zRAM to decide its upper limit
> >> for compression chunk.
> >
> >
> > Agreed. What about introducing a parameter like
> > /sys/kernel/mm/transparent_hugepage/max_swapin_order
> > giving users the opportunity to fine-tune it according to their needs. For type1
> > users specifically, setting it to any value above 4 would be
> > beneficial. If there's
> > still a lack of tuning for desktop and server environments (type 2 and type 3),
> > the default value could be set to 0.
>
> This sort of thing sounds sensible to me. But I have a history of proposing
> crappy sysfs interfaces :) So I'd like to hear from others - I suspect it will
> take a fair bit of discussion before we converge. Having data to show that this
> threshold is needed would also help (i.e. demonstration that the intuition that
> swapping in a 2M folio is often counter-productive to performance).
>

I understand. The ideal swap-in size is obviously a contentious topic :-)
However, for my real use case, simplicity reigns: we consistently adhere
to a single size - 64KiB.

> >
> >>
> >> Perhaps we can learn from khugepaged here? I think it has programmable
> >> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
> >> THP? I guess that exists for the same concerns about increased IO pressure?
> >>
> >>
> >> If we think we will ever be swapping-in folios in chunks less than their
> >> original size, then we need a separate mechanism to re-foliate them. We have
> >> discussed a khugepaged-like approach for doing this asynchronously in the
> >> background. I know that scares the Android folks, but David has suggested that
> >> this could well be very cheap compared with khugepaged, because it would be
> >> entirely limited to a single pgtable, so we only need the PTL. If we need this
> >> mechanism anyway, perhaps we should develop it and see how it performs if
> >> swap-in remains order-0? Although I guess that would imply not being able to
> >> benefit from compressing THPs for the zRAM case.
> >
> > The effectiveness of collapse operation relies on the stability of
> > forming large folios
> > to ensure optimal performance. In embedded systems, where more than half of the
> > memory may be allocated to zRAM, folios might undergo swapping out before
> > collapsing or immediately after the collapse operation. It seems a
> > TAO-like optimization
> > to decrease fallback and latency is more effective.
>
> Sorry, I'm not sure I've understood what you are saying here.

I'm not entirely clear on the specifics of the khugepaged-like approach.
However, a major distinction for Android is that its folios may not remain in
memory for extended periods. If we incur the cost of compaction and page
migration to form a large folio, it might soon be swapped out. Therefore, a
potentially more efficient approach could involve a TAO-like pool, where we
obtain large folios at a low cost.

>
> >
> >>
> >> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
> >> think the latter just implies that you might want to do some readahead to try to
> >> cover up the latency? If swap is moving towards being folio-orientated, then
> >> readahead also surely needs to be folio-orientated, but I think that should be
> >> the only major difference.
> >>
> >> Anyway, just some thoughts!
> >
> > Thank you very much for your valuable and insightful deliberations.
> >
> >>
> >> Thanks,
> >> Ryan
> >>
> >

Thanks
Barry

2024-03-21 10:21:42

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Wed, Mar 20, 2024 at 1:19 AM Ryan Roberts <[email protected]> wrote:
>
> On 19/03/2024 09:20, Huang, Ying wrote:
> > Ryan Roberts <[email protected]> writes:
> >
> >>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >>>>> day. I can only get
> >>>>> started on a hardware which I can easily reach and have enough hardware/test
> >>>>> resources on it. So we may take the first step which can be applied on
> >>>>> a real product
> >>>>> and improve its performance, and step by step, we broaden it and make it
> >>>>> widely useful to various areas in which I can't reach :-)
> >>>>
> >>>> We must guarantee the normal swap path runs correctly and has no
> >>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >>>> So we have to put some effort on the normal path test anyway.
> >>>>
> >>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >>>>> have a maximum
> >>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>>>
> >>>>> "
> >>>>> So in the common case, swap-in will pull in the same size of folio as was
> >>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>>>> sense to copy the whole folio up to a certain size.
> >>>>> "
> >>
> >> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> the discussion around policy:
> >>
> >> The decision about the size of the THP is made at first fault, with some help
> >> from user space and in future we might make decisions to split based on
> >> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> the THP out at some point in its lifetime should not impact on its size. It's
> >> just being moved around in the system and the reason for our original decision
> >> should still hold.
> >>
> >> So from that PoV, it would be good to swap-in to the same size that was
> >> swapped-out.
> >
> > Sorry, I don't agree with this. It's better to swap-in and swap-out in
> > smallest size if the page is only accessed seldom to avoid to waste
> > memory.
>
> If we want to optimize only for memory consumption, I'm sure there are many
> things we would do differently. We need to find a balance between memory and
> performance. The benefits of folios are well documented and the kernel is
> heading in the direction of managing memory in variable-sized blocks. So I don't
> think it's as simple as saying we should always swap-in the smallest possible
> amount of memory.

Absolutely agreed. With 64KiB large folios implemented, there may have been
a slight uptick in memory usage due to fragmentation. Nevertheless, through the
optimization of zRAM and zsmalloc to compress entire large folios, we found that
the compressed data could be up to 1GiB smaller compared to compressing them
in 4KiB increments on a typical phone with 12~16GiB memory. Consequently, we
not only reclaimed our memory loss entirely but also gained the benefits of
CONT-PTE, reduced TLB misses, etc.

>
> You also said we should swap *out* in smallest size possible. Have I
> misunderstood you? I thought the case for swapping-out a whole folio without
> splitting was well established and non-controversial?
>
> >
> >> But we only kind-of keep that information around, via the swap
> >> entry contiguity and alignment. With that scheme it is possible that multiple
> >> virtually adjacent but not physically contiguous folios get swapped-out to
> >> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> the original folio size information with the swap slot. One way to do this would
> >> be to store the original order for which the cluster was allocated in the
> >> cluster. Then we at least know that a given swap slot is either for a folio of
> >> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> steal a bit from swap_map to determine which case it is? Or are there better
> >> approaches?
> >
> > [snip]
> >
> > --
> > Best Regards,
> > Huang, Ying

Thanks
Barry

2024-03-21 10:31:47

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On 21/03/2024 08:42, Barry Song wrote:
> Hi Ryan,
> Sorry for the late reply.

No problem!

>
> On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 04/03/2024 08:13, Barry Song wrote:
>>> From: Barry Song <[email protected]>
>>>
>>> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
>>> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
>>> MTE as the MTE code works with the assumption tags save/restore is
>>> always handling a folio with only one page.
>>>
>>> The limitation should be removed as more and more ARM64 SoCs have
>>> this feature. Co-existence of MTE and THP_SWAP becomes more and
>>> more important.
>>>
>>> This patch makes MTE tags saving support large folios, then we don't
>>> need to split large folios into base pages for swapping out on ARM64
>>> SoCs with MTE any more.
>>>
>>> arch_prepare_to_swap() should take folio rather than page as parameter
>>> because we support THP swap-out as a whole. It saves tags for all
>>> pages in a large folio.
>>>
>>> As now we are restoring tags based-on folio, in arch_swap_restore(),
>>> we may increase some extra loops and early-exitings while refaulting
>>> a large folio which is still in swapcache in do_swap_page(). In case
>>> a large folio has nr pages, do_swap_page() will only set the PTE of
>>> the particular page which is causing the page fault.
>>> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
>>> will loop nr times for those subpages in the folio. So right now the
>>> algorithmic complexity becomes O(nr^2).
>>>
>>> Once we support mapping large folios in do_swap_page(), extra loops
>>> and early-exitings will decrease while not being completely removed
>>> as a large folio might get partially tagged in corner cases such as,
>>> 1. a large folio in swapcache can be partially unmapped, thus, MTE
>>> tags for the unmapped pages will be invalidated;
>>> 2. users might use mprotect() to set MTEs on a part of a large folio.
>>>
>>> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
>>> who needed it.

I think we should decouple this patch from your swap-in series. I suspect this
one could be ready and go in sooner than the swap-in series based on the current
discussions :)

>>>
>>> Cc: Catalin Marinas <[email protected]>
>>> Cc: Will Deacon <[email protected]>
>>> Cc: Ryan Roberts <[email protected]>
>>> Cc: Mark Rutland <[email protected]>
>>> Cc: David Hildenbrand <[email protected]>
>>> Cc: Kemeng Shi <[email protected]>
>>> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
>>> Cc: Anshuman Khandual <[email protected]>
>>> Cc: Peter Collingbourne <[email protected]>
>>> Cc: Steven Price <[email protected]>
>>> Cc: Yosry Ahmed <[email protected]>
>>> Cc: Peter Xu <[email protected]>
>>> Cc: Lorenzo Stoakes <[email protected]>
>>> Cc: "Mike Rapoport (IBM)" <[email protected]>
>>> Cc: Hugh Dickins <[email protected]>
>>> CC: "Aneesh Kumar K.V" <[email protected]>
>>> Cc: Rick Edgecombe <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> Reviewed-by: Steven Price <[email protected]>
>>> Acked-by: Chris Li <[email protected]>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 19 ++------------
>>> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
>>> include/linux/huge_mm.h | 12 ---------
>>> include/linux/pgtable.h | 2 +-
>>> mm/page_io.c | 2 +-
>>> mm/swap_slots.c | 2 +-
>>> 6 files changed, 48 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 401087e8a43d..7a54750770b8 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -45,12 +45,6 @@
>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>
>>> -static inline bool arch_thp_swp_supported(void)
>>> -{
>>> - return !system_supports_mte();
>>> -}
>>> -#define arch_thp_swp_supported arch_thp_swp_supported
>>> -
>>> /*
>>> * Outside of a few very special situations (e.g. hibernation), we always
>>> * use broadcast TLB invalidation instructions, therefore a spurious page
>>> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>> #ifdef CONFIG_ARM64_MTE
>>>
>>> #define __HAVE_ARCH_PREPARE_TO_SWAP
>>> -static inline int arch_prepare_to_swap(struct page *page)
>>> -{
>>> - if (system_supports_mte())
>>> - return mte_save_tags(page);
>>> - return 0;
>>> -}
>>> +extern int arch_prepare_to_swap(struct folio *folio);
>>>
>>> #define __HAVE_ARCH_SWAP_INVALIDATE
>>> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>>> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
>>> }
>>>
>>> #define __HAVE_ARCH_SWAP_RESTORE
>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>> -{
>>> - if (system_supports_mte())
>>> - mte_restore_tags(entry, &folio->page);
>>> -}
>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>>
>>> #endif /* CONFIG_ARM64_MTE */
>>>
>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>>> index a31833e3ddc5..295836fef620 100644
>>> --- a/arch/arm64/mm/mteswap.c
>>> +++ b/arch/arm64/mm/mteswap.c
>>> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>>> mte_free_tag_storage(tags);
>>> }
>>>
>>> +static inline void __mte_invalidate_tags(struct page *page)
>>> +{
>>> + swp_entry_t entry = page_swap_entry(page);
>>> +
>>> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
>>> +}
>>> +
>>> void mte_invalidate_tags_area(int type)
>>> {
>>> swp_entry_t entry = swp_entry(type, 0);
>>> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
>>> }
>>> xa_unlock(&mte_pages);
>>> }
>>> +
>>> +int arch_prepare_to_swap(struct folio *folio)
>>> +{
>>> + long i, nr;
>>> + int err;
>>> +
>>> + if (!system_supports_mte())
>>> + return 0;
>>> +
>>> + nr = folio_nr_pages(folio);
>>> +
>>> + for (i = 0; i < nr; i++) {
>>> + err = mte_save_tags(folio_page(folio, i));
>>> + if (err)
>>> + goto out;
>>> + }
>>> + return 0;
>>> +
>>> +out:
>>> + while (i--)
>>> + __mte_invalidate_tags(folio_page(folio, i));
>>> + return err;
>>> +}
>>> +
>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>
>> I'm still not a fan of the fact that entry could be anywhere within folio.
>>
>>> +{
>>> + if (system_supports_mte()) {
>>
>> nit: if you do:
>>
>> if (!system_supports_mte())
>> return;
>
> Acked
>
>>
>> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
>> the main body.
>>
>>> + long i, nr = folio_nr_pages(folio);
>>> +
>>> + entry.val -= swp_offset(entry) & (nr - 1);
>>
>> This assumes that folios are always stored in swap with natural alignment. Is
>> that definitely a safe assumption? My swap-out series is currently ensuring that
>> folios are swapped-out naturally aligned, but that is an implementation detail.
>>
>
> I concur that this is an implementation detail. However, we should be
> bold enough
> to state that swap slots will be contiguous, considering we are
> currently utilizing
> folio->swap instead of subpage->swap ?

Yes, I agree about contiguity. My objection is about assuming natural alignment
though. It can still be contiguous while not naturally aligned in swap.

>
>> Your cover note for swap-in says that you could technically swap in a large
>> folio without it having been swapped-out large. If you chose to do that in
>> future, this would break, right? I don't think it's good to couple the swap
>
> Right. technically I agree. Given that we still have many tasks involving even
> swapping in contiguous swap slots, it's unlikely that swapping in large folios
> for non-contiguous entries will occur in the foreseeable future :-)
>
>> storage layout to the folio order that you want to swap into. Perhaps that's an
>> argument for passing each *page* to this function with its exact, corresponding
>> swap entry?
>
> I recall Matthew Wilcox strongly objected to using "page" as the
> parameter, so I've
> discarded that approach. Alternatively, it appears I can consistently pass
> folio->swap to this function and ensure the function always retrieves
> the first entry?

Yes, if we must pass a folio here, I'd prefer that entry always corresponds to
the first entry for the folio. That will remove the need for this function to do
the alignment above too. So win-win.

>
>>
>>> + for (i = 0; i < nr; i++) {
>>> + mte_restore_tags(entry, folio_page(folio, i));
>>> + entry.val++;
>>> + }
>>> + }
>>> +}
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index de0c89105076..e04b93c43965 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
>>> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
>>> #define split_folio(f) split_folio_to_order(f, 0)
>>>
>>> -/*
>>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>>> - * limitations in the implementation like arm64 MTE can override this to
>>> - * false
>>> - */
>>> -#ifndef arch_thp_swp_supported
>>> -static inline bool arch_thp_swp_supported(void)
>>> -{
>>> - return true;
>>> -}
>>> -#endif
>>> -
>>> #endif /* _LINUX_HUGE_MM_H */
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index e1b22903f709..bfcfe3386934 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>>> * prototypes must be defined in the arch-specific asm/pgtable.h file.
>>> */
>>> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
>>> -static inline int arch_prepare_to_swap(struct page *page)
>>> +static inline int arch_prepare_to_swap(struct folio *folio)
>>> {
>>> return 0;
>>> }
>>> diff --git a/mm/page_io.c b/mm/page_io.c
>>> index ae2b49055e43..a9a7c236aecc 100644
>>> --- a/mm/page_io.c
>>> +++ b/mm/page_io.c
>>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>>> * Arch code may have to preserve more data than just the page
>>> * contents, e.g. memory tags.
>>> */
>>> - ret = arch_prepare_to_swap(&folio->page);
>>> + ret = arch_prepare_to_swap(folio);
>>> if (ret) {
>>> folio_mark_dirty(folio);
>>> folio_unlock(folio);
>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>>> index 90973ce7881d..53abeaf1371d 100644
>>> --- a/mm/swap_slots.c
>>> +++ b/mm/swap_slots.c
>>> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>> entry.val = 0;
>>>
>>> if (folio_test_large(folio)) {
>>> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
>>> + if (IS_ENABLED(CONFIG_THP_SWAP))
>>> get_swap_pages(1, &entry, folio_nr_pages(folio));
>>> goto out;
>>> }
>>
>
> Thanks
> Barry


2024-03-21 10:44:19

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On Thu, Mar 21, 2024 at 6:31 PM Ryan Roberts <[email protected]> wrote:
>
> On 21/03/2024 08:42, Barry Song wrote:
> > Hi Ryan,
> > Sorry for the late reply.
>
> No problem!
>
> >
> > On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Barry Song <[email protected]>
> >>>
> >>> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> >>> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> >>> MTE as the MTE code works with the assumption tags save/restore is
> >>> always handling a folio with only one page.
> >>>
> >>> The limitation should be removed as more and more ARM64 SoCs have
> >>> this feature. Co-existence of MTE and THP_SWAP becomes more and
> >>> more important.
> >>>
> >>> This patch makes MTE tags saving support large folios, then we don't
> >>> need to split large folios into base pages for swapping out on ARM64
> >>> SoCs with MTE any more.
> >>>
> >>> arch_prepare_to_swap() should take folio rather than page as parameter
> >>> because we support THP swap-out as a whole. It saves tags for all
> >>> pages in a large folio.
> >>>
> >>> As now we are restoring tags based-on folio, in arch_swap_restore(),
> >>> we may increase some extra loops and early-exitings while refaulting
> >>> a large folio which is still in swapcache in do_swap_page(). In case
> >>> a large folio has nr pages, do_swap_page() will only set the PTE of
> >>> the particular page which is causing the page fault.
> >>> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> >>> will loop nr times for those subpages in the folio. So right now the
> >>> algorithmic complexity becomes O(nr^2).
> >>>
> >>> Once we support mapping large folios in do_swap_page(), extra loops
> >>> and early-exitings will decrease while not being completely removed
> >>> as a large folio might get partially tagged in corner cases such as,
> >>> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> >>> tags for the unmapped pages will be invalidated;
> >>> 2. users might use mprotect() to set MTEs on a part of a large folio.
> >>>
> >>> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> >>> who needed it.
>
> I think we should decouple this patch from your swap-in series. I suspect this
> one could be ready and go in sooner than the swap-in series based on the current
> discussions :)

I concur, particularly given that nowadays, most modern and popular ARM64
SoCs are equipped with MTE. The absence of this patch also hinders the
effective functioning of mTHP swap-out.

>
> >>>
> >>> Cc: Catalin Marinas <[email protected]>
> >>> Cc: Will Deacon <[email protected]>
> >>> Cc: Ryan Roberts <[email protected]>
> >>> Cc: Mark Rutland <[email protected]>
> >>> Cc: David Hildenbrand <[email protected]>
> >>> Cc: Kemeng Shi <[email protected]>
> >>> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> >>> Cc: Anshuman Khandual <[email protected]>
> >>> Cc: Peter Collingbourne <[email protected]>
> >>> Cc: Steven Price <[email protected]>
> >>> Cc: Yosry Ahmed <[email protected]>
> >>> Cc: Peter Xu <[email protected]>
> >>> Cc: Lorenzo Stoakes <[email protected]>
> >>> Cc: "Mike Rapoport (IBM)" <[email protected]>
> >>> Cc: Hugh Dickins <[email protected]>
> >>> CC: "Aneesh Kumar K.V" <[email protected]>
> >>> Cc: Rick Edgecombe <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> Reviewed-by: Steven Price <[email protected]>
> >>> Acked-by: Chris Li <[email protected]>
> >>> ---
> >>> arch/arm64/include/asm/pgtable.h | 19 ++------------
> >>> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
> >>> include/linux/huge_mm.h | 12 ---------
> >>> include/linux/pgtable.h | 2 +-
> >>> mm/page_io.c | 2 +-
> >>> mm/swap_slots.c | 2 +-
> >>> 6 files changed, 48 insertions(+), 32 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >>> index 401087e8a43d..7a54750770b8 100644
> >>> --- a/arch/arm64/include/asm/pgtable.h
> >>> +++ b/arch/arm64/include/asm/pgtable.h
> >>> @@ -45,12 +45,6 @@
> >>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>>
> >>> -static inline bool arch_thp_swp_supported(void)
> >>> -{
> >>> - return !system_supports_mte();
> >>> -}
> >>> -#define arch_thp_swp_supported arch_thp_swp_supported
> >>> -
> >>> /*
> >>> * Outside of a few very special situations (e.g. hibernation), we always
> >>> * use broadcast TLB invalidation instructions, therefore a spurious page
> >>> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >>> #ifdef CONFIG_ARM64_MTE
> >>>
> >>> #define __HAVE_ARCH_PREPARE_TO_SWAP
> >>> -static inline int arch_prepare_to_swap(struct page *page)
> >>> -{
> >>> - if (system_supports_mte())
> >>> - return mte_save_tags(page);
> >>> - return 0;
> >>> -}
> >>> +extern int arch_prepare_to_swap(struct folio *folio);
> >>>
> >>> #define __HAVE_ARCH_SWAP_INVALIDATE
> >>> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> >>> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
> >>> }
> >>>
> >>> #define __HAVE_ARCH_SWAP_RESTORE
> >>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>> -{
> >>> - if (system_supports_mte())
> >>> - mte_restore_tags(entry, &folio->page);
> >>> -}
> >>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >>>
> >>> #endif /* CONFIG_ARM64_MTE */
> >>>
> >>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> >>> index a31833e3ddc5..295836fef620 100644
> >>> --- a/arch/arm64/mm/mteswap.c
> >>> +++ b/arch/arm64/mm/mteswap.c
> >>> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >>> mte_free_tag_storage(tags);
> >>> }
> >>>
> >>> +static inline void __mte_invalidate_tags(struct page *page)
> >>> +{
> >>> + swp_entry_t entry = page_swap_entry(page);
> >>> +
> >>> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> >>> +}
> >>> +
> >>> void mte_invalidate_tags_area(int type)
> >>> {
> >>> swp_entry_t entry = swp_entry(type, 0);
> >>> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
> >>> }
> >>> xa_unlock(&mte_pages);
> >>> }
> >>> +
> >>> +int arch_prepare_to_swap(struct folio *folio)
> >>> +{
> >>> + long i, nr;
> >>> + int err;
> >>> +
> >>> + if (!system_supports_mte())
> >>> + return 0;
> >>> +
> >>> + nr = folio_nr_pages(folio);
> >>> +
> >>> + for (i = 0; i < nr; i++) {
> >>> + err = mte_save_tags(folio_page(folio, i));
> >>> + if (err)
> >>> + goto out;
> >>> + }
> >>> + return 0;
> >>> +
> >>> +out:
> >>> + while (i--)
> >>> + __mte_invalidate_tags(folio_page(folio, i));
> >>> + return err;
> >>> +}
> >>> +
> >>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>
> >> I'm still not a fan of the fact that entry could be anywhere within folio.
> >>
> >>> +{
> >>> + if (system_supports_mte()) {
> >>
> >> nit: if you do:
> >>
> >> if (!system_supports_mte())
> >> return;
> >
> > Acked
> >
> >>
> >> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
> >> the main body.
> >>
> >>> + long i, nr = folio_nr_pages(folio);
> >>> +
> >>> + entry.val -= swp_offset(entry) & (nr - 1);
> >>
> >> This assumes that folios are always stored in swap with natural alignment. Is
> >> that definitely a safe assumption? My swap-out series is currently ensuring that
> >> folios are swapped-out naturally aligned, but that is an implementation detail.
> >>
> >
> > I concur that this is an implementation detail. However, we should be
> > bold enough
> > to state that swap slots will be contiguous, considering we are
> > currently utilizing
> > folio->swap instead of subpage->swap ?
>
> Yes, I agree about contiguity. My objection is about assuming natural alignment
> though. It can still be contiguous while not naturally aligned in swap.

right.

>
> >
> >> Your cover note for swap-in says that you could technically swap in a large
> >> folio without it having been swapped-out large. If you chose to do that in
> >> future, this would break, right? I don't think it's good to couple the swap
> >
> > Right. technically I agree. Given that we still have many tasks involving even
> > swapping in contiguous swap slots, it's unlikely that swapping in large folios
> > for non-contiguous entries will occur in the foreseeable future :-)
> >
> >> storage layout to the folio order that you want to swap into. Perhaps that's an
> >> argument for passing each *page* to this function with its exact, corresponding
> >> swap entry?
> >
> > I recall Matthew Wilcox strongly objected to using "page" as the
> > parameter, so I've
> > discarded that approach. Alternatively, it appears I can consistently pass
> > folio->swap to this function and ensure the function always retrieves
> > the first entry?
>
> Yes, if we must pass a folio here, I'd prefer that entry always corresponds to
> the first entry for the folio. That will remove the need for this function to do
> the alignment above too. So win-win.

right.

>
> >
> >>
> >>> + for (i = 0; i < nr; i++) {
> >>> + mte_restore_tags(entry, folio_page(folio, i));
> >>> + entry.val++;
> >>> + }
> >>> + }
> >>> +}
> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>> index de0c89105076..e04b93c43965 100644
> >>> --- a/include/linux/huge_mm.h
> >>> +++ b/include/linux/huge_mm.h
> >>> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
> >>> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> >>> #define split_folio(f) split_folio_to_order(f, 0)
> >>>
> >>> -/*
> >>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> >>> - * limitations in the implementation like arm64 MTE can override this to
> >>> - * false
> >>> - */
> >>> -#ifndef arch_thp_swp_supported
> >>> -static inline bool arch_thp_swp_supported(void)
> >>> -{
> >>> - return true;
> >>> -}
> >>> -#endif
> >>> -
> >>> #endif /* _LINUX_HUGE_MM_H */
> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>> index e1b22903f709..bfcfe3386934 100644
> >>> --- a/include/linux/pgtable.h
> >>> +++ b/include/linux/pgtable.h
> >>> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >>> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >>> */
> >>> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> >>> -static inline int arch_prepare_to_swap(struct page *page)
> >>> +static inline int arch_prepare_to_swap(struct folio *folio)
> >>> {
> >>> return 0;
> >>> }
> >>> diff --git a/mm/page_io.c b/mm/page_io.c
> >>> index ae2b49055e43..a9a7c236aecc 100644
> >>> --- a/mm/page_io.c
> >>> +++ b/mm/page_io.c
> >>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >>> * Arch code may have to preserve more data than just the page
> >>> * contents, e.g. memory tags.
> >>> */
> >>> - ret = arch_prepare_to_swap(&folio->page);
> >>> + ret = arch_prepare_to_swap(folio);
> >>> if (ret) {
> >>> folio_mark_dirty(folio);
> >>> folio_unlock(folio);
> >>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> >>> index 90973ce7881d..53abeaf1371d 100644
> >>> --- a/mm/swap_slots.c
> >>> +++ b/mm/swap_slots.c
> >>> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >>> entry.val = 0;
> >>>
> >>> if (folio_test_large(folio)) {
> >>> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> >>> + if (IS_ENABLED(CONFIG_THP_SWAP))
> >>> get_swap_pages(1, &entry, folio_nr_pages(folio));
> >>> goto out;
> >>> }
> >>
> >

Thanks
Barry

2024-03-21 11:14:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On 21/03/2024 09:22, Barry Song wrote:
> On Tue, Mar 19, 2024 at 10:05 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 19/03/2024 06:27, Barry Song wrote:
>>> On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>>>> day. I can only get
>>>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>>>> resources on it. So we may take the first step which can be applied on
>>>>>>> a real product
>>>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>>>> widely useful to various areas in which I can't reach :-)
>>>>>>
>>>>>> We must guarantee the normal swap path runs correctly and has no
>>>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>>>> So we have to put some effort on the normal path test anyway.
>>>>>>
>>>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>>>> have a maximum
>>>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>>>
>>>>>>> "
>>>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>>>> sense to copy the whole folio up to a certain size.
>>>>>>> "
>>>>
>>>> I thought about this a bit more. No clear conclusions, but hoped this might help
>>>> the discussion around policy:
>>>>
>>>> The decision about the size of the THP is made at first fault, with some help
>>>> from user space and in future we might make decisions to split based on
>>>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>>>> the THP out at some point in its lifetime should not impact on its size. It's
>>>> just being moved around in the system and the reason for our original decision
>>>> should still hold.
>>>
>>> Indeed, this is an ideal framework for smartphones and likely for
>>> widely embedded
>>> Linux systems utilizing zRAM. We set the mTHP size to 64KiB to
>>> leverage CONT-PTE,
>>> given that more than half of the memory on phones may frequently swap out and
>>> swap in (for instance, when opening and switching between apps). The
>>> ideal approach
>>> would involve adhering to the decision made in do_anonymous_page().
>>>
>>>>
>>>> So from that PoV, it would be good to swap-in to the same size that was
>>>> swapped-out. But we only kind-of keep that information around, via the swap
>>>> entry contiguity and alignment. With that scheme it is possible that multiple
>>>> virtually adjacent but not physically contiguous folios get swapped-out to
>>>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>>>> folio. This is not ideal, and I think it would be valuable to try to maintain
>>>> the original folio size information with the swap slot. One way to do this would
>>>> be to store the original order for which the cluster was allocated in the
>>>> cluster. Then we at least know that a given swap slot is either for a folio of
>>>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>>>> steal a bit from swap_map to determine which case it is? Or are there better
>>>> approaches?
>>>
>>> In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
>>> swap_readahead()
>>> even when __swap_count(entry) equals 1. This leads to two scenarios:
>>> swap_vma_readahead
>>> and swap_cluster_readahead.
>>>
>>> In swap_vma_readahead, when blk_queue_nonrot, physical contiguity
>>> doesn't appear to be a
>>> critical concern. However, for swap_cluster_readahead, the focus
>>> shifts towards the potential
>>> impact of physical discontiguity.
>>
>> When you talk about "physical [dis]contiguity" I think you are talking about
>> contiguity of the swap entries in the swap device? Both paths currently allocate
>> order-0 folios to swap into, so neither have a concept of physical contiguity in
>> memory at the moment.
>>
>> As I understand it, roughly the aim is to readahead by cluster for rotating
>> disks to reduce seek time, and readahead by virtual address for non-rotating
>> devices since there is no seek time cost. Correct?
>
> From the code comment, I agree with this.
>
> * It's a main entry function for swap readahead. By the configuration,
> * it will read ahead blocks by cluster-based(ie, physical disk based)
> * or vma-based(ie, virtual address based on faulty address) readahead.
>
>>
> >> Note that today, swap-out only supports (2M) THP if the swap device is
>> non-rotating. If it is rotating, the THP is first split. My swap-out series
>> maintains this policy for mTHP. So I think we only really care about
>> swap_vma_readahead() here; we want to teach it to figure out the order of the
>> swap entries and swap them into folios of the same order (with a fallback to
>> order-0 if allocation fails).
>
> I agree we don't need to care about devices which rotate.
>
>>
>>>
>>> struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>> struct vm_fault *vmf)
>>> {
>>> struct mempolicy *mpol;
>>> pgoff_t ilx;
>>> struct folio *folio;
>>>
>>> mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>>> folio = swap_use_vma_readahead() ?
>>> swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
>>> swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>>> mpol_cond_put(mpol);
>>>
>>> if (!folio)
>>> return NULL;
>>> return folio_file_page(folio, swp_offset(entry));
>>> }
>>>
>>> In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
>>> rendering physical contiguity less of a concern. Moreover, instances where
>>> swap_readahead() is accessed are rare, typically occurring only in scenarios
>>> involving forked but non-CoWed memory.
>>
>> Yes understood. What I'm hearing is that for Android at least, stealing a bit
>> from swap_map to remember if a swap entry is the order marked in the cluster or
>> order-0 won't be noticed because almost all entries have swap count == 1. From
>> memory, I think swap_map is 8 bits, and 2 bits are currently stolen, leaving 6
>> bits (count = 64) before having to move to the swap map continuation stuff. Does
>> anyone know what workloads provoke this overflow? What are the consequences of
>> reducing that count to 32?
>
> I'm not entirely clear on why you need bits to record this
> information. Could you
> provide more details?

For nonrot media, the swap device is carved up into "clusters", each the same
size as a PMD. Empty clusters are kept on a free list. Each CPU maintains
a "current cluster" that it fills sequentially with order-0 entries. To swap out
a THP, a whole cluster is allocated from the free list and used. (My swap-out
series expands this so that each CPU maintains a current cluster per-order to
fill sequentially with swap entries of that order). Once a cluster has been
filled, the CPU allocates a new current cluster. If no clusters are available,
only order-0 swap out will succeed, and in that case it will scan through the
entire swap device looking for a free slot. Once swapped-out, there is no
maintained information to tell you, for a given swap slot, what the originating
folio size was. We always swap-in in order-0 units, so it doesn't matter.

A side effect of this is that a page from what was a PMD-sized THP could be
swapped back in, meaning there is a free order-0 swap slot in that cluster. That
could then be allocated by the scanner, so the cluster now holds part of what
was a THP and an order-0 page. With my swap-out series, that mixing of orders
within a cluster becomes even more likely; e.g. we could have a current
cluster (2M) for order-4 (64K) and have swapped out 2 folios (using up the
first 128K). Then we try to swap out an order-0 and have to fall back to the scanner
because swap is fragmented. The scanner then puts the order-0 in the swap entry
after the second order-4 folio; the cluster now has a mix of order-4 and order-0
folios.

If we want to store the original folio size in the swap device, I was suggesting
that one approach would be to store the "primary" order of the cluster in the
cluster info struct. But that is not sufficient on its own, because there can
also be order-0 entries in the same cluster (due to either of the 2 mechanisms I
described above - either a page from a large order entry was freed, meaning all
the other pages in the large swap entry become order-0, or an order-0 was added
to the cluster by the scanner). So to discriminate these 2 possibilities on a
per-swap entry basis (either the swap entry is large and of the order indicated
by the cluster, or it is small), we need a per-swap entry bit.

swap_map is the per-swap entry state that is currently maintained, and it is
(in the common case) 8 bits. 1 bit is used to remember if the swap entry has a
folio in the swap cache, and another bit is used for something relating to
shmem, IIRC. The rest of the bits are used to count references to the swap
entry (it's a bit more complicated but you get the idea). I was wondering if we
could steal a bit from the reference count to use for discrimination between
"swap entry is the size indicated by the cluster's primary order" and "swap
entry is order-0".
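
Purely as an illustrative sketch (SWAP_LARGE and the cluster's "order" field
are invented names, not existing swap_map flags or cluster fields), the
discrimination could look roughly like:

/* hypothetical flag: stealing this bit would roughly halve the max swap count
 * (SWAP_MAP_MAX 0x3f -> 0x1f) */
#define SWAP_LARGE	0x20

/* sketch only: report the folio order a swap slot was allocated for */
static int swp_slot_order(struct swap_info_struct *si, unsigned long offset)
{
	struct swap_cluster_info *ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];

	if (si->swap_map[offset] & SWAP_LARGE)
		return ci->order;	/* assumes the cluster records its primary order */
	return 0;
}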


>
>>
>>>
>>> So I think large folios swap-in will at least need three steps
>>>
>>> 1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), this has a very
>>> clear model and has no complex I/O issue.
>>> 2. on nonrot block device(bdev_nonrot == true), it cares less about
>>> I/O contiguity.
>>> 3. on rot block devices which care about I/O contiguity.
>>
>> I don't think we care about (3); if the device rotates, we will have split the
>> folio at swap-out, so we are only concerned with swapping-in order-0 folios.
>>
>>>
>>> This patchset primarily addresses the systems utilizing
>>> SWP_SYNCHRONOUS_IO(type1),
>>> such as Android and embedded Linux, a straightforward model is established,
>>> with minimal complexity regarding I/O issues.
>>
>> Understood. But your implication is that making swap_vma_readahead() large folio
>> swap-in aware will be complex. I think we can remember the original order in the
>> swap device, then it shouldn't be too difficult - conceptually at least.
>
> Currently, I can scan PTE entries and determine the number of
> contiguous swap offsets.

That's an approximation; the folio might have been mapped into multiple
processes and be only partially mapped in the one you are swapping in for. But perhaps that is
sufficiently uncommon that it doesn't matter? It certainly removes the need for
storing the precise information in the swap device as I described above. To be
honest, I hadn't considered the PTEs; I was thinking only about the swap slot
contiguity.

> The swap_vma_readahead code to support large folios already exists in
> OPPO's repository.
> I'm confident that it can be cleaned up and submitted to LKML.
> However, the issue lies with
> the readahead policy. We typically prefer using the same 64KiB size as in
> do_anonymous_page(), but clearly, this isn't the preference for Ying :-)

I haven't caught up on all the latest discussion (although I see a bunch of
unread emails in my inbox :) ). But my view at the moment is roughly:

- continue to calculate readahead as before
- always swap-in folios in their original size
- always swap-in the full faulting folio
- calculate which other folios to readahead by rounding the calculated
readahead endpoint to the nearest folio boundary.
- Or if the "round to nearest" policy is shown to be problematic, introduce a
"max swap-in folio size" tunable and round to that nearest boundary instead.

If we need the tunable, a default of 4K is the same as current behaviour. 64K
would likely be a good setting for a system using contpte mappings. And 2M would
provide the behaviour described above without the tunable.
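A minimal sketch of that rounding step, just to illustrate the idea; the helper
name is invented and 'order' stands for either the original folio order
recorded in the swap device or the hypothetical max swap-in size:

/*
 * Sketch only: take the endpoint chosen by the existing readahead
 * heuristic and round it up to the next folio-size boundary, so that
 * every folio touched by readahead is swapped in whole.
 */
static unsigned long ra_end_rounded(unsigned long ra_end, unsigned int order)
{
	/* folio sizes are powers of two, so round_up() is safe here */
	return round_up(ra_end, 1UL << order);
}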

>
>>
>>>
>>>>
>>>> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
>>>> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
>>>> say yes but I have no data. But on the other hand, memory is aged and
>>>> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
>>>> original allocation size policy is good (it currently isn't) then a folio should
>>>> be sized to cover temporally close memory and if we need to access some of it,
>>>> chances are we need all of it.
>>>>
>>>> If we think the IO concern is legitimate then we could define a threshold size
>>>> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
>>>> those chunks be - one page, or the threshold size itself? Probably the latter?
>>>> And perhaps that threshold could also be used by zRAM to decide its upper limit
>>>> for compression chunk.
>>>
>>>
>>> Agreed. What about introducing a parameter like
>>> /sys/kernel/mm/transparent_hugepage/max_swapin_order
>>> giving users the opportunity to fine-tune it according to their needs. For type1
>>> users specifically, setting it to any value above 4 would be
>>> beneficial. If there's
>>> still a lack of tuning for desktop and server environments (type 2 and type 3),
>>> the default value could be set to 0.
>>
>> This sort of thing sounds sensible to me. But I have a history of proposing
>> crappy sysfs interfaces :) So I'd like to hear from others - I suspect it will
>> take a fair bit of discussion before we converge. Having data to show that this
>> threshold is needed would also help (i.e. demonstration that the intuition that
>> swapping in a 2M folio is often counter-productive to performance).
>>
>
> I understand. The ideal swap-in size is obviously a contentious topic :-)
> However, for my real use case, simplicity reigns: we consistently adhere
> to a single size - 64KiB.
>
>>>
>>>>
>>>> Perhaps we can learn from khugepaged here? I think it has programmable
>>>> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
>>>> THP? I guess that exists for the same concerns about increased IO pressure?
>>>>
>>>>
>>>> If we think we will ever be swapping-in folios in chunks less than their
>>>> original size, then we need a separate mechanism to re-foliate them. We have
>>>> discussed a khugepaged-like approach for doing this asynchronously in the
>>>> background. I know that scares the Android folks, but David has suggested that
>>>> this could well be very cheap compared with khugepaged, because it would be
>>>> entirely limited to a single pgtable, so we only need the PTL. If we need this
>>>> mechanism anyway, perhaps we should develop it and see how it performs if
>>>> swap-in remains order-0? Although I guess that would imply not being able to
>>>> benefit from compressing THPs for the zRAM case.
>>>
>>> The effectiveness of collapse operation relies on the stability of
>>> forming large folios
>>> to ensure optimal performance. In embedded systems, where more than half of the
>>> memory may be allocated to zRAM, folios might undergo swapping out before
>>> collapsing or immediately after the collapse operation. It seems a
>>> TAO-like optimization
>>> to decrease fallback and latency is more effective.
>>
>> Sorry, I'm not sure I've understood what you are saying here.
>
> I'm not entirely clear on the specifics of the khugepaged-like
> approach. However,a major
> distinction for Android is that its folios may not remain in memory
> for extended periods.
> If we incur the cost of compaction and page migration to form a large
> folio, it might soon
> be swapped out. Therefore, a potentially more efficient approach could
> involve a TAO-like
> pool, where we obtain large folios at a low cost.
>
>>
>>>
>>>>
>>>> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
>>>> think the latter just implies that you might want to do some readahead to try to
>>>> cover up the latency? If swap is moving towards being folio-orientated, then
>>>> readahead also surely needs to be folio-orientated, but I think that should be
>>>> the only major difference.
>>>>
>>>> Anyway, just some thoughts!
>>>
>>> Thank you very much for your valuable and insightful deliberations.
>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>
>
> Thanks
> Barry


2024-03-22 02:51:23

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On Thu, Mar 21, 2024 at 11:31 PM Ryan Roberts <[email protected]> wrote:
>
> On 21/03/2024 08:42, Barry Song wrote:
> > Hi Ryan,
> > Sorry for the late reply.
>
> No problem!
>
> >
> > On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Barry Song <[email protected]>
> >>>
> >>> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> >>> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> >>> MTE as the MTE code works with the assumption tags save/restore is
> >>> always handling a folio with only one page.
> >>>
> >>> The limitation should be removed as more and more ARM64 SoCs have
> >>> this feature. Co-existence of MTE and THP_SWAP becomes more and
> >>> more important.
> >>>
> >>> This patch makes MTE tags saving support large folios, then we don't
> >>> need to split large folios into base pages for swapping out on ARM64
> >>> SoCs with MTE any more.
> >>>
> >>> arch_prepare_to_swap() should take folio rather than page as parameter
> >>> because we support THP swap-out as a whole. It saves tags for all
> >>> pages in a large folio.
> >>>
> >>> As now we are restoring tags based-on folio, in arch_swap_restore(),
> >>> we may increase some extra loops and early-exitings while refaulting
> >>> a large folio which is still in swapcache in do_swap_page(). In case
> >>> a large folio has nr pages, do_swap_page() will only set the PTE of
> >>> the particular page which is causing the page fault.
> >>> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> >>> will loop nr times for those subpages in the folio. So right now the
> >>> algorithmic complexity becomes O(nr^2).
> >>>
> >>> Once we support mapping large folios in do_swap_page(), extra loops
> >>> and early-exitings will decrease while not being completely removed
> >>> as a large folio might get partially tagged in corner cases such as,
> >>> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> >>> tags for the unmapped pages will be invalidated;
> >>> 2. users might use mprotect() to set MTEs on a part of a large folio.
> >>>
> >>> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> >>> who needed it.
>
> I think we should decouple this patch from your swap-in series. I suspect this
> one could be ready and go in sooner than the swap-in series based on the current
> discussions :)
>
> >>>
> >>> Cc: Catalin Marinas <[email protected]>
> >>> Cc: Will Deacon <[email protected]>
> >>> Cc: Ryan Roberts <[email protected]>
> >>> Cc: Mark Rutland <[email protected]>
> >>> Cc: David Hildenbrand <[email protected]>
> >>> Cc: Kemeng Shi <[email protected]>
> >>> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> >>> Cc: Anshuman Khandual <[email protected]>
> >>> Cc: Peter Collingbourne <[email protected]>
> >>> Cc: Steven Price <[email protected]>
> >>> Cc: Yosry Ahmed <[email protected]>
> >>> Cc: Peter Xu <[email protected]>
> >>> Cc: Lorenzo Stoakes <[email protected]>
> >>> Cc: "Mike Rapoport (IBM)" <[email protected]>
> >>> Cc: Hugh Dickins <[email protected]>
> >>> CC: "Aneesh Kumar K.V" <[email protected]>
> >>> Cc: Rick Edgecombe <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> Reviewed-by: Steven Price <[email protected]>
> >>> Acked-by: Chris Li <[email protected]>
> >>> ---
> >>> arch/arm64/include/asm/pgtable.h | 19 ++------------
> >>> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
> >>> include/linux/huge_mm.h | 12 ---------
> >>> include/linux/pgtable.h | 2 +-
> >>> mm/page_io.c | 2 +-
> >>> mm/swap_slots.c | 2 +-
> >>> 6 files changed, 48 insertions(+), 32 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >>> index 401087e8a43d..7a54750770b8 100644
> >>> --- a/arch/arm64/include/asm/pgtable.h
> >>> +++ b/arch/arm64/include/asm/pgtable.h
> >>> @@ -45,12 +45,6 @@
> >>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>>
> >>> -static inline bool arch_thp_swp_supported(void)
> >>> -{
> >>> - return !system_supports_mte();
> >>> -}
> >>> -#define arch_thp_swp_supported arch_thp_swp_supported
> >>> -
> >>> /*
> >>> * Outside of a few very special situations (e.g. hibernation), we always
> >>> * use broadcast TLB invalidation instructions, therefore a spurious page
> >>> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >>> #ifdef CONFIG_ARM64_MTE
> >>>
> >>> #define __HAVE_ARCH_PREPARE_TO_SWAP
> >>> -static inline int arch_prepare_to_swap(struct page *page)
> >>> -{
> >>> - if (system_supports_mte())
> >>> - return mte_save_tags(page);
> >>> - return 0;
> >>> -}
> >>> +extern int arch_prepare_to_swap(struct folio *folio);
> >>>
> >>> #define __HAVE_ARCH_SWAP_INVALIDATE
> >>> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> >>> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
> >>> }
> >>>
> >>> #define __HAVE_ARCH_SWAP_RESTORE
> >>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>> -{
> >>> - if (system_supports_mte())
> >>> - mte_restore_tags(entry, &folio->page);
> >>> -}
> >>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >>>
> >>> #endif /* CONFIG_ARM64_MTE */
> >>>
> >>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> >>> index a31833e3ddc5..295836fef620 100644
> >>> --- a/arch/arm64/mm/mteswap.c
> >>> +++ b/arch/arm64/mm/mteswap.c
> >>> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >>> mte_free_tag_storage(tags);
> >>> }
> >>>
> >>> +static inline void __mte_invalidate_tags(struct page *page)
> >>> +{
> >>> + swp_entry_t entry = page_swap_entry(page);
> >>> +
> >>> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> >>> +}
> >>> +
> >>> void mte_invalidate_tags_area(int type)
> >>> {
> >>> swp_entry_t entry = swp_entry(type, 0);
> >>> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
> >>> }
> >>> xa_unlock(&mte_pages);
> >>> }
> >>> +
> >>> +int arch_prepare_to_swap(struct folio *folio)
> >>> +{
> >>> + long i, nr;
> >>> + int err;
> >>> +
> >>> + if (!system_supports_mte())
> >>> + return 0;
> >>> +
> >>> + nr = folio_nr_pages(folio);
> >>> +
> >>> + for (i = 0; i < nr; i++) {
> >>> + err = mte_save_tags(folio_page(folio, i));
> >>> + if (err)
> >>> + goto out;
> >>> + }
> >>> + return 0;
> >>> +
> >>> +out:
> >>> + while (i--)
> >>> + __mte_invalidate_tags(folio_page(folio, i));
> >>> + return err;
> >>> +}
> >>> +
> >>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>
> >> I'm still not a fan of the fact that entry could be anywhere within folio.
> >>
> >>> +{
> >>> + if (system_supports_mte()) {
> >>
> >> nit: if you do:
> >>
> >> if (!system_supports_mte())
> >> return;
> >
> > Acked
> >
> >>
> >> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
> >> the main body.
> >>
> >>> + long i, nr = folio_nr_pages(folio);
> >>> +
> >>> + entry.val -= swp_offset(entry) & (nr - 1);
> >>
> >> This assumes that folios are always stored in swap with natural alignment. Is
> >> that definitely a safe assumption? My swap-out series is currently ensuring that
> >> folios are swapped-out naturally aligned, but that is an implementation detail.
> >>
> >
> > I concur that this is an implementation detail. However, we should be
> > bold enough
> > to state that swap slots will be contiguous, considering we are
> > currently utilizing
> > folio->swap instead of subpage->swap ?
>
> Yes, I agree about contiguity. My objection is about assuming natural alignment
> though. It can still be contiguous while not naturally aligned in swap.

Hi Ryan,

While working on the new version of this patch, I've come to recognize that,
for the time being, it's imperative to maintain a natural alignment. The
following code operates on the basis of this assumption.

/**
* folio_file_page - The page for a particular index.
* @folio: The folio which contains this index.
* @index: The index we want to look up.
*
* Sometimes after looking up a folio in the page cache, we need to
* obtain the specific page for an index (eg a page fault).
*
* Return: The page containing the file data for this index.
*/
static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
{
return folio_page(folio, index & (folio_nr_pages(folio) - 1));
}


It's invoked everywhere, particularly within do_swap_page(). Nonetheless,
I remain confident that I can consistently pass the first entry to
arch_swap_restore().
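For illustration, under the natural-alignment assumption the first entry can
always be recovered from any entry inside the folio. This is only a sketch and
the helper name is invented; it is the same alignment trick that
folio_file_page() relies on, applied to the swap entry:

/*
 * Sketch only: given any swap entry belonging to a naturally aligned
 * large folio, derive the entry of the folio's first page. This breaks
 * if folios are ever swapped out without natural alignment.
 */
static inline swp_entry_t first_swap_entry(swp_entry_t entry,
					   struct folio *folio)
{
	entry.val = ALIGN_DOWN(entry.val, folio_nr_pages(folio));
	return entry;
}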

>
> >
> >> Your cover note for swap-in says that you could technically swap in a large
> >> folio without it having been swapped-out large. If you chose to do that in
> >> future, this would break, right? I don't think it's good to couple the swap
> >
> > Right. technically I agree. Given that we still have many tasks involving even
> > swapping in contiguous swap slots, it's unlikely that swapping in large folios
> > for non-contiguous entries will occur in the foreseeable future :-)
> >
> >> storage layout to the folio order that you want to swap into. Perhaps that's an
> >> argument for passing each *page* to this function with its exact, corresponding
> >> swap entry?
> >
> > I recall Matthew Wilcox strongly objected to using "page" as the
> > parameter, so I've
> > discarded that approach. Alternatively, it appears I can consistently pass
> > folio->swap to this function and ensure the function always retrieves
> > the first entry?
>
> Yes, if we must pass a folio here, I'd prefer that entry always corresponds to
> the first entry for the folio. That will remove the need for this function to do
> the alignment above too. So win-win.
>
> >
> >>
> >>> + for (i = 0; i < nr; i++) {
> >>> + mte_restore_tags(entry, folio_page(folio, i));
> >>> + entry.val++;
> >>> + }
> >>> + }
> >>> +}
> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>> index de0c89105076..e04b93c43965 100644
> >>> --- a/include/linux/huge_mm.h
> >>> +++ b/include/linux/huge_mm.h
> >>> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
> >>> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> >>> #define split_folio(f) split_folio_to_order(f, 0)
> >>>
> >>> -/*
> >>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> >>> - * limitations in the implementation like arm64 MTE can override this to
> >>> - * false
> >>> - */
> >>> -#ifndef arch_thp_swp_supported
> >>> -static inline bool arch_thp_swp_supported(void)
> >>> -{
> >>> - return true;
> >>> -}
> >>> -#endif
> >>> -
> >>> #endif /* _LINUX_HUGE_MM_H */
> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>> index e1b22903f709..bfcfe3386934 100644
> >>> --- a/include/linux/pgtable.h
> >>> +++ b/include/linux/pgtable.h
> >>> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >>> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >>> */
> >>> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> >>> -static inline int arch_prepare_to_swap(struct page *page)
> >>> +static inline int arch_prepare_to_swap(struct folio *folio)
> >>> {
> >>> return 0;
> >>> }
> >>> diff --git a/mm/page_io.c b/mm/page_io.c
> >>> index ae2b49055e43..a9a7c236aecc 100644
> >>> --- a/mm/page_io.c
> >>> +++ b/mm/page_io.c
> >>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >>> * Arch code may have to preserve more data than just the page
> >>> * contents, e.g. memory tags.
> >>> */
> >>> - ret = arch_prepare_to_swap(&folio->page);
> >>> + ret = arch_prepare_to_swap(folio);
> >>> if (ret) {
> >>> folio_mark_dirty(folio);
> >>> folio_unlock(folio);
> >>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> >>> index 90973ce7881d..53abeaf1371d 100644
> >>> --- a/mm/swap_slots.c
> >>> +++ b/mm/swap_slots.c
> >>> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >>> entry.val = 0;
> >>>
> >>> if (folio_test_large(folio)) {
> >>> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> >>> + if (IS_ENABLED(CONFIG_THP_SWAP))
> >>> get_swap_pages(1, &entry, folio_nr_pages(folio));
> >>> goto out;
> >>> }
> >>
> >
> > Thanks
> > Barry
>

2024-03-22 07:42:29

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On Fri, Mar 22, 2024 at 3:51 PM Barry Song <[email protected]> wrote:
>
> On Thu, Mar 21, 2024 at 11:31 PM Ryan Roberts <[email protected]> wrote:
> >
> > On 21/03/2024 08:42, Barry Song wrote:
> > > Hi Ryan,
> > > Sorry for the late reply.
> >
> > No problem!
> >
> > >
> > > On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
> > >>
> > >> On 04/03/2024 08:13, Barry Song wrote:
> > >>> From: Barry Song <[email protected]>
> > >>>
> > >>> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> > >>> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> > >>> MTE as the MTE code works with the assumption tags save/restore is
> > >>> always handling a folio with only one page.
> > >>>
> > >>> The limitation should be removed as more and more ARM64 SoCs have
> > >>> this feature. Co-existence of MTE and THP_SWAP becomes more and
> > >>> more important.
> > >>>
> > >>> This patch makes MTE tags saving support large folios, then we don't
> > >>> need to split large folios into base pages for swapping out on ARM64
> > >>> SoCs with MTE any more.
> > >>>
> > >>> arch_prepare_to_swap() should take folio rather than page as parameter
> > >>> because we support THP swap-out as a whole. It saves tags for all
> > >>> pages in a large folio.
> > >>>
> > >>> As now we are restoring tags based-on folio, in arch_swap_restore(),
> > >>> we may increase some extra loops and early-exitings while refaulting
> > >>> a large folio which is still in swapcache in do_swap_page(). In case
> > >>> a large folio has nr pages, do_swap_page() will only set the PTE of
> > >>> the particular page which is causing the page fault.
> > >>> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> > >>> will loop nr times for those subpages in the folio. So right now the
> > >>> algorithmic complexity becomes O(nr^2).
> > >>>
> > >>> Once we support mapping large folios in do_swap_page(), extra loops
> > >>> and early-exitings will decrease while not being completely removed
> > >>> as a large folio might get partially tagged in corner cases such as,
> > >>> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> > >>> tags for the unmapped pages will be invalidated;
> > >>> 2. users might use mprotect() to set MTEs on a part of a large folio.
> > >>>
> > >>> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> > >>> who needed it.
> >
> > I think we should decouple this patch from your swap-in series. I suspect this
> > one could be ready and go in sooner than the swap-in series based on the current
> > discussions :)
> >
> > >>>
> > >>> Cc: Catalin Marinas <[email protected]>
> > >>> Cc: Will Deacon <[email protected]>
> > >>> Cc: Ryan Roberts <[email protected]>
> > >>> Cc: Mark Rutland <[email protected]>
> > >>> Cc: David Hildenbrand <[email protected]>
> > >>> Cc: Kemeng Shi <[email protected]>
> > >>> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> > >>> Cc: Anshuman Khandual <[email protected]>
> > >>> Cc: Peter Collingbourne <[email protected]>
> > >>> Cc: Steven Price <[email protected]>
> > >>> Cc: Yosry Ahmed <[email protected]>
> > >>> Cc: Peter Xu <[email protected]>
> > >>> Cc: Lorenzo Stoakes <[email protected]>
> > >>> Cc: "Mike Rapoport (IBM)" <[email protected]>
> > >>> Cc: Hugh Dickins <[email protected]>
> > >>> CC: "Aneesh Kumar K.V" <[email protected]>
> > >>> Cc: Rick Edgecombe <[email protected]>
> > >>> Signed-off-by: Barry Song <[email protected]>
> > >>> Reviewed-by: Steven Price <[email protected]>
> > >>> Acked-by: Chris Li <[email protected]>
> > >>> ---
> > >>> arch/arm64/include/asm/pgtable.h | 19 ++------------
> > >>> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
> > >>> include/linux/huge_mm.h | 12 ---------
> > >>> include/linux/pgtable.h | 2 +-
> > >>> mm/page_io.c | 2 +-
> > >>> mm/swap_slots.c | 2 +-
> > >>> 6 files changed, 48 insertions(+), 32 deletions(-)
> > >>>
> > >>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > >>> index 401087e8a43d..7a54750770b8 100644
> > >>> --- a/arch/arm64/include/asm/pgtable.h
> > >>> +++ b/arch/arm64/include/asm/pgtable.h
> > >>> @@ -45,12 +45,6 @@
> > >>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > >>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > >>>
> > >>> -static inline bool arch_thp_swp_supported(void)
> > >>> -{
> > >>> - return !system_supports_mte();
> > >>> -}
> > >>> -#define arch_thp_swp_supported arch_thp_swp_supported
> > >>> -
> > >>> /*
> > >>> * Outside of a few very special situations (e.g. hibernation), we always
> > >>> * use broadcast TLB invalidation instructions, therefore a spurious page
> > >>> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > >>> #ifdef CONFIG_ARM64_MTE
> > >>>
> > >>> #define __HAVE_ARCH_PREPARE_TO_SWAP
> > >>> -static inline int arch_prepare_to_swap(struct page *page)
> > >>> -{
> > >>> - if (system_supports_mte())
> > >>> - return mte_save_tags(page);
> > >>> - return 0;
> > >>> -}
> > >>> +extern int arch_prepare_to_swap(struct folio *folio);
> > >>>
> > >>> #define __HAVE_ARCH_SWAP_INVALIDATE
> > >>> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > >>> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
> > >>> }
> > >>>
> > >>> #define __HAVE_ARCH_SWAP_RESTORE
> > >>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > >>> -{
> > >>> - if (system_supports_mte())
> > >>> - mte_restore_tags(entry, &folio->page);
> > >>> -}
> > >>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> > >>>
> > >>> #endif /* CONFIG_ARM64_MTE */
> > >>>
> > >>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > >>> index a31833e3ddc5..295836fef620 100644
> > >>> --- a/arch/arm64/mm/mteswap.c
> > >>> +++ b/arch/arm64/mm/mteswap.c
> > >>> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> > >>> mte_free_tag_storage(tags);
> > >>> }
> > >>>
> > >>> +static inline void __mte_invalidate_tags(struct page *page)
> > >>> +{
> > >>> + swp_entry_t entry = page_swap_entry(page);
> > >>> +
> > >>> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > >>> +}
> > >>> +
> > >>> void mte_invalidate_tags_area(int type)
> > >>> {
> > >>> swp_entry_t entry = swp_entry(type, 0);
> > >>> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
> > >>> }
> > >>> xa_unlock(&mte_pages);
> > >>> }
> > >>> +
> > >>> +int arch_prepare_to_swap(struct folio *folio)
> > >>> +{
> > >>> + long i, nr;
> > >>> + int err;
> > >>> +
> > >>> + if (!system_supports_mte())
> > >>> + return 0;
> > >>> +
> > >>> + nr = folio_nr_pages(folio);
> > >>> +
> > >>> + for (i = 0; i < nr; i++) {
> > >>> + err = mte_save_tags(folio_page(folio, i));
> > >>> + if (err)
> > >>> + goto out;
> > >>> + }
> > >>> + return 0;
> > >>> +
> > >>> +out:
> > >>> + while (i--)
> > >>> + __mte_invalidate_tags(folio_page(folio, i));
> > >>> + return err;
> > >>> +}
> > >>> +
> > >>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > >>
> > >> I'm still not a fan of the fact that entry could be anywhere within folio.
> > >>
> > >>> +{
> > >>> + if (system_supports_mte()) {
> > >>
> > >> nit: if you do:
> > >>
> > >> if (!system_supports_mte())
> > >> return;
> > >
> > > Acked
> > >
> > >>
> > >> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
> > >> the main body.
> > >>
> > >>> + long i, nr = folio_nr_pages(folio);
> > >>> +
> > >>> + entry.val -= swp_offset(entry) & (nr - 1);
> > >>
> > >> This assumes that folios are always stored in swap with natural alignment. Is
> > >> that definitely a safe assumption? My swap-out series is currently ensuring that
> > >> folios are swapped-out naturally aligned, but that is an implementation detail.
> > >>
> > >
> > > I concur that this is an implementation detail. However, we should be
> > > bold enough
> > > to state that swap slots will be contiguous, considering we are
> > > currently utilizing
> > > folio->swap instead of subpage->swap ?
> >
> > Yes, I agree about contiguity. My objection is about assuming natural alignment
> > though. It can still be contiguous while not naturally aligned in swap.
>
> Hi Ryan,
>
> While working on the new version of this patch, I've come to recognize
> that, for the time being, it's
> imperative to maintain a natural alignment. The following code
> operates on the basis of this
> assumption.
>
> /**
> * folio_file_page - The page for a particular index.
> * @folio: The folio which contains this index.
> * @index: The index we want to look up.
> *
> * Sometimes after looking up a folio in the page cache, we need to
> * obtain the specific page for an index (eg a page fault).
> *
> * Return: The page containing the file data for this index.
> */
> static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
> {
> return folio_page(folio, index & (folio_nr_pages(folio) - 1));
> }
>
>
> It's invoked everywhere, particularly within do_swap_page(). Nonetheless,
> I remain confident that I can consistently pass the first entry to
> arch_swap_restore().

After grappling with this for a couple of hours, I've realized that the only
viable approach is to shift the task of obtaining the first entry from the
callee to the callers (which looks silly). This is necessary because the
various scenarios - swap cache, non-swap cache, and KSM - each present
different cases. Since there's no assurance that folio->swap has been
initialized, forcibly setting folio->swap could pose risks (there might not
even be any risk involved, but the task of getting the first entry still
cannot be overlooked by callers).

diff --git a/mm/internal.h b/mm/internal.h
index 7e486f2c502c..94d5b4b5a5da 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -76,6 +76,20 @@ static inline int folio_nr_pages_mapped(struct folio *folio)
return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
}

+/*
+ * Retrieve the first entry of a folio based on a provided entry within the
+ * folio. We cannot rely on folio->swap as there is no guarantee that it has
+ * been initialized. Used by arch_swap_restore()
+ */
+static inline swp_entry_t folio_swap(swp_entry_t entry, struct folio *folio)
+{
+ swp_entry_t swap = {
+ .val = ALIGN_DOWN(entry.val, folio_nr_pages(folio)),
+ };
+
+ return swap;
+}
+
static inline void *folio_raw_mapping(struct folio *folio)
{
unsigned long mapping = (unsigned long)folio->mapping;
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..b7cab8be8632 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4188,7 +4188,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* when reading from swap. This metadata may be indexed by swap entry
* so this must be called before swap_free().
*/
- arch_swap_restore(entry, folio);
+ arch_swap_restore(folio_swap(entry, folio), folio);

/*
* Remove the swap entry and conditionally try to free up the swapcache.
diff --git a/mm/shmem.c b/mm/shmem.c
index 0aad0d9a621b..82c9df4628f2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1913,7 +1913,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
* Some architectures may have to restore extra metadata to the
* folio after reading from swap.
*/
- arch_swap_restore(swap, folio);
+ arch_swap_restore(folio_swap(swap, folio), folio);

if (shmem_should_replace_folio(folio, gfp)) {
error = shmem_replace_folio(&folio, gfp, info, index);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4919423cce76..5e6d2304a2a4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1806,7 +1806,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
* when reading from swap. This metadata may be indexed by swap entry
* so this must be called before swap_free().
*/
- arch_swap_restore(entry, folio);
+ arch_swap_restore(folio_swap(entry, folio), folio);

dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
inc_mm_counter(vma->vm_mm, MM_ANONPAGES);


Meanwhile, natural alignment is essential even during add_to_swap(): if the
swap slots allocated to the folio are not naturally aligned, the VM_BUG_ON
condition below will trigger.

int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
unsigned long i, nr = folio_nr_pages(folio);
...
folio_set_swapcache(folio);
folio->swap = entry;

do {
xas_lock_irq(&xas);
xas_create_range(&xas);
if (xas_error(&xas))
goto unlock;
for (i = 0; i < nr; i++) {
VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
if (shadowp) {
old = xas_load(&xas);
if (xa_is_value(old))
*shadowp = old;
}
xas_store(&xas, folio);
xas_next(&xas);
}
}


Based on the information provided, Ryan, would it be feasible to retain the
task of obtaining the first entry within the callee? Or are you in favor of
utilizing the new folio_swap() helper?

>
> >
> > >
> > >> Your cover note for swap-in says that you could technically swap in a large
> > >> folio without it having been swapped-out large. If you chose to do that in
> > >> future, this would break, right? I don't think it's good to couple the swap
> > >
> > > Right. technically I agree. Given that we still have many tasks involving even
> > > swapping in contiguous swap slots, it's unlikely that swapping in large folios
> > > for non-contiguous entries will occur in the foreseeable future :-)
> > >
> > >> storage layout to the folio order that you want to swap into. Perhaps that's an
> > >> argument for passing each *page* to this function with its exact, corresponding
> > >> swap entry?
> > >
> > > I recall Matthew Wilcox strongly objected to using "page" as the
> > > parameter, so I've
> > > discarded that approach. Alternatively, it appears I can consistently pass
> > > folio->swap to this function and ensure the function always retrieves
> > > the first entry?
> >
> > Yes, if we must pass a folio here, I'd prefer that entry always corresponds to
> > the first entry for the folio. That will remove the need for this function to do
> > the alignment above too. So win-win.
> >
> > >
> > >>
> > >>> + for (i = 0; i < nr; i++) {
> > >>> + mte_restore_tags(entry, folio_page(folio, i));
> > >>> + entry.val++;
> > >>> + }
> > >>> + }
> > >>> +}
> > >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > >>> index de0c89105076..e04b93c43965 100644
> > >>> --- a/include/linux/huge_mm.h
> > >>> +++ b/include/linux/huge_mm.h
> > >>> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
> > >>> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> > >>> #define split_folio(f) split_folio_to_order(f, 0)
> > >>>
> > >>> -/*
> > >>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > >>> - * limitations in the implementation like arm64 MTE can override this to
> > >>> - * false
> > >>> - */
> > >>> -#ifndef arch_thp_swp_supported
> > >>> -static inline bool arch_thp_swp_supported(void)
> > >>> -{
> > >>> - return true;
> > >>> -}
> > >>> -#endif
> > >>> -
> > >>> #endif /* _LINUX_HUGE_MM_H */
> > >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > >>> index e1b22903f709..bfcfe3386934 100644
> > >>> --- a/include/linux/pgtable.h
> > >>> +++ b/include/linux/pgtable.h
> > >>> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> > >>> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> > >>> */
> > >>> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > >>> -static inline int arch_prepare_to_swap(struct page *page)
> > >>> +static inline int arch_prepare_to_swap(struct folio *folio)
> > >>> {
> > >>> return 0;
> > >>> }
> > >>> diff --git a/mm/page_io.c b/mm/page_io.c
> > >>> index ae2b49055e43..a9a7c236aecc 100644
> > >>> --- a/mm/page_io.c
> > >>> +++ b/mm/page_io.c
> > >>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > >>> * Arch code may have to preserve more data than just the page
> > >>> * contents, e.g. memory tags.
> > >>> */
> > >>> - ret = arch_prepare_to_swap(&folio->page);
> > >>> + ret = arch_prepare_to_swap(folio);
> > >>> if (ret) {
> > >>> folio_mark_dirty(folio);
> > >>> folio_unlock(folio);
> > >>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > >>> index 90973ce7881d..53abeaf1371d 100644
> > >>> --- a/mm/swap_slots.c
> > >>> +++ b/mm/swap_slots.c
> > >>> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> > >>> entry.val = 0;
> > >>>
> > >>> if (folio_test_large(folio)) {
> > >>> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > >>> + if (IS_ENABLED(CONFIG_THP_SWAP))
> > >>> get_swap_pages(1, &entry, folio_nr_pages(folio));
> > >>> goto out;
> > >>> }
> > >>
> > >
> > > Thanks
> > > Barry
> >

2024-03-22 10:20:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On 22/03/2024 07:41, Barry Song wrote:
> On Fri, Mar 22, 2024 at 3:51 PM Barry Song <[email protected]> wrote:
>>
>> On Thu, Mar 21, 2024 at 11:31 PM Ryan Roberts <[email protected]> wrote:
>>>
>>> On 21/03/2024 08:42, Barry Song wrote:
>>>> Hi Ryan,
>>>> Sorry for the late reply.
>>>
>>> No problem!
>>>
>>>>
>>>> On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> On 04/03/2024 08:13, Barry Song wrote:
>>>>>> From: Barry Song <[email protected]>
>>>>>>
>>>>>> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
>>>>>> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
>>>>>> MTE as the MTE code works with the assumption tags save/restore is
>>>>>> always handling a folio with only one page.
>>>>>>
>>>>>> The limitation should be removed as more and more ARM64 SoCs have
>>>>>> this feature. Co-existence of MTE and THP_SWAP becomes more and
>>>>>> more important.
>>>>>>
>>>>>> This patch makes MTE tags saving support large folios, then we don't
>>>>>> need to split large folios into base pages for swapping out on ARM64
>>>>>> SoCs with MTE any more.
>>>>>>
>>>>>> arch_prepare_to_swap() should take folio rather than page as parameter
>>>>>> because we support THP swap-out as a whole. It saves tags for all
>>>>>> pages in a large folio.
>>>>>>
>>>>>> As now we are restoring tags based-on folio, in arch_swap_restore(),
>>>>>> we may increase some extra loops and early-exitings while refaulting
>>>>>> a large folio which is still in swapcache in do_swap_page(). In case
>>>>>> a large folio has nr pages, do_swap_page() will only set the PTE of
>>>>>> the particular page which is causing the page fault.
>>>>>> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
>>>>>> will loop nr times for those subpages in the folio. So right now the
>>>>>> algorithmic complexity becomes O(nr^2).
>>>>>>
>>>>>> Once we support mapping large folios in do_swap_page(), extra loops
>>>>>> and early-exitings will decrease while not being completely removed
>>>>>> as a large folio might get partially tagged in corner cases such as,
>>>>>> 1. a large folio in swapcache can be partially unmapped, thus, MTE
>>>>>> tags for the unmapped pages will be invalidated;
>>>>>> 2. users might use mprotect() to set MTEs on a part of a large folio.
>>>>>>
>>>>>> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
>>>>>> who needed it.
>>>
>>> I think we should decouple this patch from your swap-in series. I suspect this
>>> one could be ready and go in sooner than the swap-in series based on the current
>>> discussions :)
>>>
>>>>>>
>>>>>> Cc: Catalin Marinas <[email protected]>
>>>>>> Cc: Will Deacon <[email protected]>
>>>>>> Cc: Ryan Roberts <[email protected]>
>>>>>> Cc: Mark Rutland <[email protected]>
>>>>>> Cc: David Hildenbrand <[email protected]>
>>>>>> Cc: Kemeng Shi <[email protected]>
>>>>>> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
>>>>>> Cc: Anshuman Khandual <[email protected]>
>>>>>> Cc: Peter Collingbourne <[email protected]>
>>>>>> Cc: Steven Price <[email protected]>
>>>>>> Cc: Yosry Ahmed <[email protected]>
>>>>>> Cc: Peter Xu <[email protected]>
>>>>>> Cc: Lorenzo Stoakes <[email protected]>
>>>>>> Cc: "Mike Rapoport (IBM)" <[email protected]>
>>>>>> Cc: Hugh Dickins <[email protected]>
>>>>>> CC: "Aneesh Kumar K.V" <[email protected]>
>>>>>> Cc: Rick Edgecombe <[email protected]>
>>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>>> Reviewed-by: Steven Price <[email protected]>
>>>>>> Acked-by: Chris Li <[email protected]>
>>>>>> ---
>>>>>> arch/arm64/include/asm/pgtable.h | 19 ++------------
>>>>>> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
>>>>>> include/linux/huge_mm.h | 12 ---------
>>>>>> include/linux/pgtable.h | 2 +-
>>>>>> mm/page_io.c | 2 +-
>>>>>> mm/swap_slots.c | 2 +-
>>>>>> 6 files changed, 48 insertions(+), 32 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>> index 401087e8a43d..7a54750770b8 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -45,12 +45,6 @@
>>>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>>>>
>>>>>> -static inline bool arch_thp_swp_supported(void)
>>>>>> -{
>>>>>> - return !system_supports_mte();
>>>>>> -}
>>>>>> -#define arch_thp_swp_supported arch_thp_swp_supported
>>>>>> -
>>>>>> /*
>>>>>> * Outside of a few very special situations (e.g. hibernation), we always
>>>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
>>>>>> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>>>>> #ifdef CONFIG_ARM64_MTE
>>>>>>
>>>>>> #define __HAVE_ARCH_PREPARE_TO_SWAP
>>>>>> -static inline int arch_prepare_to_swap(struct page *page)
>>>>>> -{
>>>>>> - if (system_supports_mte())
>>>>>> - return mte_save_tags(page);
>>>>>> - return 0;
>>>>>> -}
>>>>>> +extern int arch_prepare_to_swap(struct folio *folio);
>>>>>>
>>>>>> #define __HAVE_ARCH_SWAP_INVALIDATE
>>>>>> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>>>>>> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
>>>>>> }
>>>>>>
>>>>>> #define __HAVE_ARCH_SWAP_RESTORE
>>>>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>>>> -{
>>>>>> - if (system_supports_mte())
>>>>>> - mte_restore_tags(entry, &folio->page);
>>>>>> -}
>>>>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>>>>>
>>>>>> #endif /* CONFIG_ARM64_MTE */
>>>>>>
>>>>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>>>>>> index a31833e3ddc5..295836fef620 100644
>>>>>> --- a/arch/arm64/mm/mteswap.c
>>>>>> +++ b/arch/arm64/mm/mteswap.c
>>>>>> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>>>>>> mte_free_tag_storage(tags);
>>>>>> }
>>>>>>
>>>>>> +static inline void __mte_invalidate_tags(struct page *page)
>>>>>> +{
>>>>>> + swp_entry_t entry = page_swap_entry(page);
>>>>>> +
>>>>>> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
>>>>>> +}
>>>>>> +
>>>>>> void mte_invalidate_tags_area(int type)
>>>>>> {
>>>>>> swp_entry_t entry = swp_entry(type, 0);
>>>>>> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
>>>>>> }
>>>>>> xa_unlock(&mte_pages);
>>>>>> }
>>>>>> +
>>>>>> +int arch_prepare_to_swap(struct folio *folio)
>>>>>> +{
>>>>>> + long i, nr;
>>>>>> + int err;
>>>>>> +
>>>>>> + if (!system_supports_mte())
>>>>>> + return 0;
>>>>>> +
>>>>>> + nr = folio_nr_pages(folio);
>>>>>> +
>>>>>> + for (i = 0; i < nr; i++) {
>>>>>> + err = mte_save_tags(folio_page(folio, i));
>>>>>> + if (err)
>>>>>> + goto out;
>>>>>> + }
>>>>>> + return 0;
>>>>>> +
>>>>>> +out:
>>>>>> + while (i--)
>>>>>> + __mte_invalidate_tags(folio_page(folio, i));
>>>>>> + return err;
>>>>>> +}
>>>>>> +
>>>>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>>>
>>>>> I'm still not a fan of the fact that entry could be anywhere within folio.
>>>>>
>>>>>> +{
>>>>>> + if (system_supports_mte()) {
>>>>>
>>>>> nit: if you do:
>>>>>
>>>>> if (!system_supports_mte())
>>>>> return;
>>>>
>>>> Acked
>>>>
>>>>>
>>>>> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
>>>>> the main body.
>>>>>
>>>>>> + long i, nr = folio_nr_pages(folio);
>>>>>> +
>>>>>> + entry.val -= swp_offset(entry) & (nr - 1);
>>>>>
>>>>> This assumes that folios are always stored in swap with natural alignment. Is
>>>>> that definitely a safe assumption? My swap-out series is currently ensuring that
>>>>> folios are swapped-out naturally aligned, but that is an implementation detail.
>>>>>
>>>>
>>>> I concur that this is an implementation detail. However, we should be
>>>> bold enough
>>>> to state that swap slots will be contiguous, considering we are
>>>> currently utilizing
>>>> folio->swap instead of subpage->swap ?
>>>
>>> Yes, I agree about contiguity. My objection is about assuming natural alignment
>>> though. It can still be contiguous while not naturally aligned in swap.
>>
>> Hi Ryan,
>>
>> While working on the new version of this patch, I've come to recognize
>> that, for the time being, it's
>> imperative to maintain a natural alignment. The following code
>> operates on the basis of this
>> assumption.
>>
>> /**
>> * folio_file_page - The page for a particular index.
>> * @folio: The folio which contains this index.
>> * @index: The index we want to look up.
>> *
>> * Sometimes after looking up a folio in the page cache, we need to
>> * obtain the specific page for an index (eg a page fault).
>> *
>> * Return: The page containing the file data for this index.
>> */
>> static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
>> {
>> return folio_page(folio, index & (folio_nr_pages(folio) - 1));
>> }
>>
>>
>> It's invoked everywhere, particularly within do_swap_page(). Nonetheless,
>> I remain confident that I can consistently pass the first entry to
>> arch_swap_restore().
>
> After grappling for a couple of hours, I've realized that the only
> viable approach
> is as follows: shifting the task of obtaining the first entry from the
> callee to the
> callers( looks silly). This is necessary due to various scenarios like
> swap cache,
> non-swap cache, and KSM, each presenting different cases. Since there's no
> assurance of folio->swap being present, forcibly setting folio->swap could pose
> risks (There might not even be any risk involved, but the associated
> task getting
> the first entry still cannot be overlooked by callers).
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 7e486f2c502c..94d5b4b5a5da 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -76,6 +76,20 @@ static inline int folio_nr_pages_mapped(struct folio *folio)
> return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> }
>
> +/*
> + * Retrieve the first entry of a folio based on a provided entry within the
> + * folio. We cannot rely on folio->swap as there is no guarantee that it has
> + * been initialized. Used by arch_swap_restore()
> + */
> +static inline swp_entry_t folio_swap(swp_entry_t entry, struct folio *folio)
> +{
> + swp_entry_t swap = {
> + .val = ALIGN_DOWN(entry.val, folio_nr_pages(folio)),
> + };
> +
> + return swap;
> +}
> +
> static inline void *folio_raw_mapping(struct folio *folio)
> {
> unsigned long mapping = (unsigned long)folio->mapping;
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..b7cab8be8632 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4188,7 +4188,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * when reading from swap. This metadata may be indexed by swap entry
> * so this must be called before swap_free().
> */
> - arch_swap_restore(entry, folio);
> + arch_swap_restore(folio_swap(entry, folio), folio);
>
> /*
> * Remove the swap entry and conditionally try to free up the swapcache.
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0aad0d9a621b..82c9df4628f2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1913,7 +1913,7 @@ static int shmem_swapin_folio(struct inode
> *inode, pgoff_t index,
> * Some architectures may have to restore extra metadata to the
> * folio after reading from swap.
> */
> - arch_swap_restore(swap, folio);
> + arch_swap_restore(folio_swap(swap, folio), folio);
>
> if (shmem_should_replace_folio(folio, gfp)) {
> error = shmem_replace_folio(&folio, gfp, info, index);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4919423cce76..5e6d2304a2a4 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1806,7 +1806,7 @@ static int unuse_pte(struct vm_area_struct *vma,
> pmd_t *pmd,
> * when reading from swap. This metadata may be indexed by swap entry
> * so this must be called before swap_free().
> */
> - arch_swap_restore(entry, folio);
> + arch_swap_restore(folio_swap(entry, folio), folio);
>
> dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>
>
> Meanwhile, natural alignment is essential even during the execution of
> add_to_swap(), as failure to
> do so will trigger the VM_BUG_ON condition below.
>
> int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> gfp_t gfp, void **shadowp)
> {
> struct address_space *address_space = swap_address_space(entry);
> pgoff_t idx = swp_offset(entry);
> XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> unsigned long i, nr = folio_nr_pages(folio);
> ...
> folio_set_swapcache(folio);
> folio->swap = entry;
>
> do {
> xas_lock_irq(&xas);
> xas_create_range(&xas);
> if (xas_error(&xas))
> goto unlock;
> for (i = 0; i < nr; i++) {
> VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
> if (shadowp) {
> old = xas_load(&xas);
> if (xa_is_value(old))
> *shadowp = old;
> }
> xas_store(&xas, folio);
> xas_next(&xas);
> }
> }
>
>
> Based on the information provided, Ryan, would it be feasible to retain the task
> of obtaining the first entry within the callee? Or, are you in favor
> of utilizing the
> new folio_swap() helper?

My opinion still remains that either:

- This should be a per-page interface - i.e. call it for each page to restore
tags. If we don't want to pass `struct page *` then perhaps we can pass a folio
and the index of the page we want to restore? In this case, entry refers to the
precise page we are operating on.

OR

- Make it a per-folio interface - i.e. it restores tags for all pages in the
folio. But in this case, entry must refer to the first page in the folio.
Anything else is confusing.

So if going for the latter approach, then I vote for fixing it up in the callee.
But I'm just one guy with one opinion!
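To spell out the two shapes as code (hypothetical signatures, for illustration
only; neither is what is currently merged):

/*
 * Option 1 (per-page): the caller passes the folio plus the index of
 * the single page to restore, and entry is that page's exact swap
 * entry. Hypothetical arm64 side, sketched for illustration:
 */
void arch_swap_restore_page(swp_entry_t entry, struct folio *folio, long idx)
{
	if (system_supports_mte())
		mte_restore_tags(entry, folio_page(folio, idx));
}

/*
 * Option 2 (per-folio): entry must always be the swap entry of the
 * folio's first page, and tags are restored for the whole folio - i.e.
 * the shape of the patch above, minus the alignment fix-up in the
 * callee.
 */
void arch_swap_restore(swp_entry_t entry, struct folio *folio);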


>
>>
>>>
>>>>
>>>>> Your cover note for swap-in says that you could technically swap in a large
>>>>> folio without it having been swapped-out large. If you chose to do that in
>>>>> future, this would break, right? I don't think it's good to couple the swap
>>>>
>>>> Right. technically I agree. Given that we still have many tasks involving even
>>>> swapping in contiguous swap slots, it's unlikely that swapping in large folios
>>>> for non-contiguous entries will occur in the foreseeable future :-)
>>>>
>>>>> storage layout to the folio order that you want to swap into. Perhaps that's an
>>>>> argument for passing each *page* to this function with its exact, corresponding
>>>>> swap entry?
>>>>
>>>> I recall Matthew Wilcox strongly objected to using "page" as the
>>>> parameter, so I've
>>>> discarded that approach. Alternatively, it appears I can consistently pass
>>>> folio->swap to this function and ensure the function always retrieves
>>>> the first entry?
>>>
>>> Yes, if we must pass a folio here, I'd prefer that entry always corresponds to
>>> the first entry for the folio. That will remove the need for this function to do
>>> the alignment above too. So win-win.
>>>
>>>>
>>>>>
>>>>>> + for (i = 0; i < nr; i++) {
>>>>>> + mte_restore_tags(entry, folio_page(folio, i));
>>>>>> + entry.val++;
>>>>>> + }
>>>>>> + }
>>>>>> +}
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index de0c89105076..e04b93c43965 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
>>>>>> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
>>>>>> #define split_folio(f) split_folio_to_order(f, 0)
>>>>>>
>>>>>> -/*
>>>>>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>>>>>> - * limitations in the implementation like arm64 MTE can override this to
>>>>>> - * false
>>>>>> - */
>>>>>> -#ifndef arch_thp_swp_supported
>>>>>> -static inline bool arch_thp_swp_supported(void)
>>>>>> -{
>>>>>> - return true;
>>>>>> -}
>>>>>> -#endif
>>>>>> -
>>>>>> #endif /* _LINUX_HUGE_MM_H */
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index e1b22903f709..bfcfe3386934 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>>>>>> * prototypes must be defined in the arch-specific asm/pgtable.h file.
>>>>>> */
>>>>>> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
>>>>>> -static inline int arch_prepare_to_swap(struct page *page)
>>>>>> +static inline int arch_prepare_to_swap(struct folio *folio)
>>>>>> {
>>>>>> return 0;
>>>>>> }
>>>>>> diff --git a/mm/page_io.c b/mm/page_io.c
>>>>>> index ae2b49055e43..a9a7c236aecc 100644
>>>>>> --- a/mm/page_io.c
>>>>>> +++ b/mm/page_io.c
>>>>>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>>>>>> * Arch code may have to preserve more data than just the page
>>>>>> * contents, e.g. memory tags.
>>>>>> */
>>>>>> - ret = arch_prepare_to_swap(&folio->page);
>>>>>> + ret = arch_prepare_to_swap(folio);
>>>>>> if (ret) {
>>>>>> folio_mark_dirty(folio);
>>>>>> folio_unlock(folio);
>>>>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>>>>>> index 90973ce7881d..53abeaf1371d 100644
>>>>>> --- a/mm/swap_slots.c
>>>>>> +++ b/mm/swap_slots.c
>>>>>> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>>>>> entry.val = 0;
>>>>>>
>>>>>> if (folio_test_large(folio)) {
>>>>>> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
>>>>>> + if (IS_ENABLED(CONFIG_THP_SWAP))
>>>>>> get_swap_pages(1, &entry, folio_nr_pages(folio));
>>>>>> goto out;
>>>>>> }
>>>>>
>>>>
>>>> Thanks
>>>> Barry
>>>


2024-03-23 02:16:03

by Chris Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On Fri, Mar 22, 2024 at 3:19 AM Ryan Roberts <[email protected]> wrote:
>
> On 22/03/2024 07:41, Barry Song wrote:
> > On Fri, Mar 22, 2024 at 3:51 PM Barry Song <[email protected]> wrote:
> >>
> >> On Thu, Mar 21, 2024 at 11:31 PM Ryan Roberts <[email protected]> wrote:
> >>>
> >>> On 21/03/2024 08:42, Barry Song wrote:
> >>>> Hi Ryan,
> >>>> Sorry for the late reply.
> >>>
> >>> No problem!
> >>>
> >>>>
> >>>> On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
> >>>>>
> >>>>> On 04/03/2024 08:13, Barry Song wrote:
> >>>>>> From: Barry Song <[email protected]>
> >>>>>>
> >>>>>> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> >>>>>> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> >>>>>> MTE as the MTE code works with the assumption tags save/restore is
> >>>>>> always handling a folio with only one page.
> >>>>>>
> >>>>>> The limitation should be removed as more and more ARM64 SoCs have
> >>>>>> this feature. Co-existence of MTE and THP_SWAP becomes more and
> >>>>>> more important.
> >>>>>>
> >>>>>> This patch makes MTE tags saving support large folios, then we don't
> >>>>>> need to split large folios into base pages for swapping out on ARM64
> >>>>>> SoCs with MTE any more.
> >>>>>>
> >>>>>> arch_prepare_to_swap() should take folio rather than page as parameter
> >>>>>> because we support THP swap-out as a whole. It saves tags for all
> >>>>>> pages in a large folio.
> >>>>>>
> >>>>>> As now we are restoring tags based-on folio, in arch_swap_restore(),
> >>>>>> we may increase some extra loops and early-exitings while refaulting
> >>>>>> a large folio which is still in swapcache in do_swap_page(). In case
> >>>>>> a large folio has nr pages, do_swap_page() will only set the PTE of
> >>>>>> the particular page which is causing the page fault.
> >>>>>> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> >>>>>> will loop nr times for those subpages in the folio. So right now the
> >>>>>> algorithmic complexity becomes O(nr^2).
> >>>>>>
> >>>>>> Once we support mapping large folios in do_swap_page(), extra loops
> >>>>>> and early-exitings will decrease while not being completely removed
> >>>>>> as a large folio might get partially tagged in corner cases such as,
> >>>>>> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> >>>>>> tags for the unmapped pages will be invalidated;
> >>>>>> 2. users might use mprotect() to set MTEs on a part of a large folio.
> >>>>>>
> >>>>>> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> >>>>>> who needed it.
> >>>
> >>> I think we should decouple this patch from your swap-in series. I suspect this
> >>> one could be ready and go in sooner than the swap-in series based on the current
> >>> discussions :)
> >>>
> >>>>>>
> >>>>>> Cc: Catalin Marinas <[email protected]>
> >>>>>> Cc: Will Deacon <[email protected]>
> >>>>>> Cc: Ryan Roberts <[email protected]>
> >>>>>> Cc: Mark Rutland <[email protected]>
> >>>>>> Cc: David Hildenbrand <[email protected]>
> >>>>>> Cc: Kemeng Shi <[email protected]>
> >>>>>> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> >>>>>> Cc: Anshuman Khandual <[email protected]>
> >>>>>> Cc: Peter Collingbourne <[email protected]>
> >>>>>> Cc: Steven Price <[email protected]>
> >>>>>> Cc: Yosry Ahmed <[email protected]>
> >>>>>> Cc: Peter Xu <[email protected]>
> >>>>>> Cc: Lorenzo Stoakes <[email protected]>
> >>>>>> Cc: "Mike Rapoport (IBM)" <[email protected]>
> >>>>>> Cc: Hugh Dickins <[email protected]>
> >>>>>> CC: "Aneesh Kumar K.V" <[email protected]>
> >>>>>> Cc: Rick Edgecombe <[email protected]>
> >>>>>> Signed-off-by: Barry Song <[email protected]>
> >>>>>> Reviewed-by: Steven Price <[email protected]>
> >>>>>> Acked-by: Chris Li <[email protected]>
> >>>>>> ---
> >>>>>> arch/arm64/include/asm/pgtable.h | 19 ++------------
> >>>>>> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
> >>>>>> include/linux/huge_mm.h | 12 ---------
> >>>>>> include/linux/pgtable.h | 2 +-
> >>>>>> mm/page_io.c | 2 +-
> >>>>>> mm/swap_slots.c | 2 +-
> >>>>>> 6 files changed, 48 insertions(+), 32 deletions(-)
> >>>>>>
> >>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >>>>>> index 401087e8a43d..7a54750770b8 100644
> >>>>>> --- a/arch/arm64/include/asm/pgtable.h
> >>>>>> +++ b/arch/arm64/include/asm/pgtable.h
> >>>>>> @@ -45,12 +45,6 @@
> >>>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >>>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>>>>>
> >>>>>> -static inline bool arch_thp_swp_supported(void)
> >>>>>> -{
> >>>>>> - return !system_supports_mte();
> >>>>>> -}
> >>>>>> -#define arch_thp_swp_supported arch_thp_swp_supported
> >>>>>> -
> >>>>>> /*
> >>>>>> * Outside of a few very special situations (e.g. hibernation), we always
> >>>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
> >>>>>> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >>>>>> #ifdef CONFIG_ARM64_MTE
> >>>>>>
> >>>>>> #define __HAVE_ARCH_PREPARE_TO_SWAP
> >>>>>> -static inline int arch_prepare_to_swap(struct page *page)
> >>>>>> -{
> >>>>>> - if (system_supports_mte())
> >>>>>> - return mte_save_tags(page);
> >>>>>> - return 0;
> >>>>>> -}
> >>>>>> +extern int arch_prepare_to_swap(struct folio *folio);
> >>>>>>
> >>>>>> #define __HAVE_ARCH_SWAP_INVALIDATE
> >>>>>> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> >>>>>> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
> >>>>>> }
> >>>>>>
> >>>>>> #define __HAVE_ARCH_SWAP_RESTORE
> >>>>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>>>> -{
> >>>>>> - if (system_supports_mte())
> >>>>>> - mte_restore_tags(entry, &folio->page);
> >>>>>> -}
> >>>>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >>>>>>
> >>>>>> #endif /* CONFIG_ARM64_MTE */
> >>>>>>
> >>>>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> >>>>>> index a31833e3ddc5..295836fef620 100644
> >>>>>> --- a/arch/arm64/mm/mteswap.c
> >>>>>> +++ b/arch/arm64/mm/mteswap.c
> >>>>>> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >>>>>> mte_free_tag_storage(tags);
> >>>>>> }
> >>>>>>
> >>>>>> +static inline void __mte_invalidate_tags(struct page *page)
> >>>>>> +{
> >>>>>> + swp_entry_t entry = page_swap_entry(page);
> >>>>>> +
> >>>>>> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> >>>>>> +}
> >>>>>> +
> >>>>>> void mte_invalidate_tags_area(int type)
> >>>>>> {
> >>>>>> swp_entry_t entry = swp_entry(type, 0);
> >>>>>> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
> >>>>>> }
> >>>>>> xa_unlock(&mte_pages);
> >>>>>> }
> >>>>>> +
> >>>>>> +int arch_prepare_to_swap(struct folio *folio)
> >>>>>> +{
> >>>>>> + long i, nr;
> >>>>>> + int err;
> >>>>>> +
> >>>>>> + if (!system_supports_mte())
> >>>>>> + return 0;
> >>>>>> +
> >>>>>> + nr = folio_nr_pages(folio);
> >>>>>> +
> >>>>>> + for (i = 0; i < nr; i++) {
> >>>>>> + err = mte_save_tags(folio_page(folio, i));
> >>>>>> + if (err)
> >>>>>> + goto out;
> >>>>>> + }
> >>>>>> + return 0;
> >>>>>> +
> >>>>>> +out:
> >>>>>> + while (i--)
> >>>>>> + __mte_invalidate_tags(folio_page(folio, i));
> >>>>>> + return err;
> >>>>>> +}
> >>>>>> +
> >>>>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>>>
> >>>>> I'm still not a fan of the fact that entry could be anywhere within folio.
> >>>>>
> >>>>>> +{
> >>>>>> + if (system_supports_mte()) {
> >>>>>
> >>>>> nit: if you do:
> >>>>>
> >>>>> if (!system_supports_mte())
> >>>>> return;
> >>>>
> >>>> Acked
> >>>>
> >>>>>
> >>>>> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
> >>>>> the main body.
> >>>>>
> >>>>>> + long i, nr = folio_nr_pages(folio);
> >>>>>> +
> >>>>>> + entry.val -= swp_offset(entry) & (nr - 1);
> >>>>>
> >>>>> This assumes that folios are always stored in swap with natural alignment. Is
> >>>>> that definitely a safe assumption? My swap-out series is currently ensuring that
> >>>>> folios are swapped-out naturally aligned, but that is an implementation detail.
> >>>>>
> >>>>
> >>>> I concur that this is an implementation detail. However, we should be
> >>>> bold enough
> >>>> to state that swap slots will be contiguous, considering we are
> >>>> currently utilizing
> >>>> folio->swap instead of subpage->swap ?
> >>>
> >>> Yes, I agree about contiguity. My objection is about assuming natural alignment
> >>> though. It can still be contiguous while not naturally aligned in swap.
> >>
> >> Hi Ryan,
> >>
> >> While working on the new version of this patch, I've come to recognize
> >> that, for the time being, it's
> >> imperative to maintain a natural alignment. The following code
> >> operates on the basis of this
> >> assumption.
> >>
> >> /**
> >> * folio_file_page - The page for a particular index.
> >> * @folio: The folio which contains this index.
> >> * @index: The index we want to look up.
> >> *
> >> * Sometimes after looking up a folio in the page cache, we need to
> >> * obtain the specific page for an index (eg a page fault).
> >> *
> >> * Return: The page containing the file data for this index.
> >> */
> >> static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
> >> {
> >> return folio_page(folio, index & (folio_nr_pages(folio) - 1));
> >> }
> >>
> >>
> >> It's invoked everywhere, particularly within do_swap_page(). Nonetheless,
> >> I remain confident that I can consistently pass the first entry to
> >> arch_swap_restore().
> >
> > After grappling for a couple of hours, I've realized that the only
> > viable approach
> > is as follows: shifting the task of obtaining the first entry from the
> > callee to the
> > callers( looks silly). This is necessary due to various scenarios like
> > swap cache,
> > non-swap cache, and KSM, each presenting different cases. Since there's no
> > assurance of folio->swap being present, forcibly setting folio->swap could pose
> > risks (There might not even be any risk involved, but the associated
> > task getting
> > the first entry still cannot be overlooked by callers).
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 7e486f2c502c..94d5b4b5a5da 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -76,6 +76,20 @@ static inline int folio_nr_pages_mapped(struct folio *folio)
> > return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> > }
> >
> > +/*
> > + * Retrieve the first entry of a folio based on a provided entry within the
> > + * folio. We cannot rely on folio->swap as there is no guarantee that it has
> > + * been initialized. Used by arch_swap_restore()
> > + */
> > +static inline swp_entry_t folio_swap(swp_entry_t entry, struct folio *folio)
> > +{
> > + swp_entry_t swap = {
> > + .val = entry.val - (swp_offset(entry) & (folio_nr_pages(folio) - 1)),
> > + };
> > +
> > + return swap;
> > +}
> > +
> > static inline void *folio_raw_mapping(struct folio *folio)
> > {
> > unsigned long mapping = (unsigned long)folio->mapping;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f2bc6dd15eb8..b7cab8be8632 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4188,7 +4188,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * when reading from swap. This metadata may be indexed by swap entry
> > * so this must be called before swap_free().
> > */
> > - arch_swap_restore(entry, folio);
> > + arch_swap_restore(folio_swap(entry, folio), folio);
> >
> > /*
> > * Remove the swap entry and conditionally try to free up the swapcache.
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 0aad0d9a621b..82c9df4628f2 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1913,7 +1913,7 @@ static int shmem_swapin_folio(struct inode
> > *inode, pgoff_t index,
> > * Some architectures may have to restore extra metadata to the
> > * folio after reading from swap.
> > */
> > - arch_swap_restore(swap, folio);
> > + arch_swap_restore(folio_swap(swap, folio), folio);
> >
> > if (shmem_should_replace_folio(folio, gfp)) {
> > error = shmem_replace_folio(&folio, gfp, info, index);
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4919423cce76..5e6d2304a2a4 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1806,7 +1806,7 @@ static int unuse_pte(struct vm_area_struct *vma,
> > pmd_t *pmd,
> > * when reading from swap. This metadata may be indexed by swap entry
> > * so this must be called before swap_free().
> > */
> > - arch_swap_restore(entry, folio);
> > + arch_swap_restore(folio_swap(entry, folio), folio);
> >
> > dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >
> >
> > Meanwhile, natural alignment is essential even during the execution of
> > add_to_swap(), as failure to
> > do so will trigger the VM_BUG_ON condition below.
> >
> > int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > gfp_t gfp, void **shadowp)
> > {
> > struct address_space *address_space = swap_address_space(entry);
> > pgoff_t idx = swp_offset(entry);
> > XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> > unsigned long i, nr = folio_nr_pages(folio);
> > ...
> > folio_set_swapcache(folio);
> > folio->swap = entry;
> >
> > do {
> > xas_lock_irq(&xas);
> > xas_create_range(&xas);
> > if (xas_error(&xas))
> > goto unlock;
> > for (i = 0; i < nr; i++) {
> > VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);

Here the swap cache assumes swap entry + i matches folio subpage i. The
swap entries of a folio must be contiguous. If we want to allow a folio
to be written out to discontiguous offsets of the swap device, this
aspect of the swap cache will need to change as well. Do you see a
problem with having all pte entries of a folio point to the same large
swap entry? Of course, the large swap entry would internally track the
offset of subpage i. The swap cache will only have one index for the
large swap entry (the head entry).


> > if (shadowp) {
> > old = xas_load(&xas);
> > if (xa_is_value(old))
> > *shadowp = old;
> > }
> > xas_store(&xas, folio);
> > xas_next(&xas);
> > }
> > }
> >
> >
> > Based on the information provided, Ryan, would it be feasible to retain the task
> > of obtaining the first entry within the callee? Or, are you in favor
> > of utilizing the
> > new folio_swap() helper?
>
> My opinion still remains that either:
>
> - This should be a per-page interface - i.e. call it for each page to restore
> tags. If we don't want to pass `struct page *` then perhaps we can pass a folio

Can you clarify that by "tag" you mean the MTE tags, not the swap cache
xarray tags? From the email context I assume it is the MTE tag; please
let me know if that assumption is incorrect.

> and the index of the page we want to restore? In this case, entry refers the the
> precise page we are operating on.
>
> OR
>
> - Make it a per-folio interface - i.e. it restores tags for all pages in the
> folio. But in this case, entry must refer to the first page in the folio.
> Anything else is confusing.

As long as you refer to the subpage as folio + i, restoring a subset
of the folio should be permitted?

On the swap entry side, I would like to avoid assuming the swap entries
are contiguous. The swap entry should have an API to fetch the swap
offset of the head entry + i. For the simple contiguous case, this
mapping is just linear. For non-contiguous swap offsets, it would need
to go through some lookup table to find the offset for i.
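
For illustration only, here is a minimal sketch of such a helper under
the current contiguous, naturally aligned layout (swap_entry_offset()
is a hypothetical name, not something proposed in this series); a
non-contiguous backend would replace the linear math with a per-entry
lookup table:

static inline pgoff_t swap_entry_offset(swp_entry_t head, long i)
{
	/* contiguous slots: the i-th subpage follows the head linearly */
	return swp_offset(head) + i;
}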

Chris

>
> So if going for the latter approach, then I vote for fixing it up in the callee.
> But I'm just one guy with one opinion!
>
>
> >
> >>
> >>>
> >>>>
> >>>>> Your cover note for swap-in says that you could technically swap in a large
> >>>>> folio without it having been swapped-out large. If you chose to do that in
> >>>>> future, this would break, right? I don't think it's good to couple the swap
> >>>>
> >>>> Right. technically I agree. Given that we still have many tasks involving even
> >>>> swapping in contiguous swap slots, it's unlikely that swapping in large folios
> >>>> for non-contiguous entries will occur in the foreseeable future :-)
> >>>>
> >>>>> storage layout to the folio order that you want to swap into. Perhaps that's an
> >>>>> argument for passing each *page* to this function with its exact, corresponding
> >>>>> swap entry?
> >>>>
> >>>> I recall Matthew Wilcox strongly objected to using "page" as the
> >>>> parameter, so I've
> >>>> discarded that approach. Alternatively, it appears I can consistently pass
> >>>> folio->swap to this function and ensure the function always retrieves
> >>>> the first entry?
> >>>
> >>> Yes, if we must pass a folio here, I'd prefer that entry always corresponds to
> >>> the first entry for the folio. That will remove the need for this function to do
> >>> the alignment above too. So win-win.
> >>>
> >>>>
> >>>>>
> >>>>>> + for (i = 0; i < nr; i++) {
> >>>>>> + mte_restore_tags(entry, folio_page(folio, i));
> >>>>>> + entry.val++;
> >>>>>> + }
> >>>>>> + }
> >>>>>> +}
> >>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>>>> index de0c89105076..e04b93c43965 100644
> >>>>>> --- a/include/linux/huge_mm.h
> >>>>>> +++ b/include/linux/huge_mm.h
> >>>>>> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
> >>>>>> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> >>>>>> #define split_folio(f) split_folio_to_order(f, 0)
> >>>>>>
> >>>>>> -/*
> >>>>>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> >>>>>> - * limitations in the implementation like arm64 MTE can override this to
> >>>>>> - * false
> >>>>>> - */
> >>>>>> -#ifndef arch_thp_swp_supported
> >>>>>> -static inline bool arch_thp_swp_supported(void)
> >>>>>> -{
> >>>>>> - return true;
> >>>>>> -}
> >>>>>> -#endif
> >>>>>> -
> >>>>>> #endif /* _LINUX_HUGE_MM_H */
> >>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>>>> index e1b22903f709..bfcfe3386934 100644
> >>>>>> --- a/include/linux/pgtable.h
> >>>>>> +++ b/include/linux/pgtable.h
> >>>>>> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >>>>>> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >>>>>> */
> >>>>>> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> >>>>>> -static inline int arch_prepare_to_swap(struct page *page)
> >>>>>> +static inline int arch_prepare_to_swap(struct folio *folio)
> >>>>>> {
> >>>>>> return 0;
> >>>>>> }
> >>>>>> diff --git a/mm/page_io.c b/mm/page_io.c
> >>>>>> index ae2b49055e43..a9a7c236aecc 100644
> >>>>>> --- a/mm/page_io.c
> >>>>>> +++ b/mm/page_io.c
> >>>>>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >>>>>> * Arch code may have to preserve more data than just the page
> >>>>>> * contents, e.g. memory tags.
> >>>>>> */
> >>>>>> - ret = arch_prepare_to_swap(&folio->page);
> >>>>>> + ret = arch_prepare_to_swap(folio);
> >>>>>> if (ret) {
> >>>>>> folio_mark_dirty(folio);
> >>>>>> folio_unlock(folio);
> >>>>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> >>>>>> index 90973ce7881d..53abeaf1371d 100644
> >>>>>> --- a/mm/swap_slots.c
> >>>>>> +++ b/mm/swap_slots.c
> >>>>>> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >>>>>> entry.val = 0;
> >>>>>>
> >>>>>> if (folio_test_large(folio)) {
> >>>>>> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> >>>>>> + if (IS_ENABLED(CONFIG_THP_SWAP))
> >>>>>> get_swap_pages(1, &entry, folio_nr_pages(folio));
> >>>>>> goto out;
> >>>>>> }
> >>>>>
> >>>>
> >>>> Thanks
> >>>> Barry
> >>>
>
>

2024-03-23 03:51:15

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/5] arm64: mm: swap: support THP_SWAP on hardware with MTE

On Sat, Mar 23, 2024 at 3:15 PM Chris Li <[email protected]> wrote:
>
> On Fri, Mar 22, 2024 at 3:19 AM Ryan Roberts <[email protected]> wrote:
> >
> > On 22/03/2024 07:41, Barry Song wrote:
> > > On Fri, Mar 22, 2024 at 3:51 PM Barry Song <[email protected]> wrote:
> > >>
> > >> On Thu, Mar 21, 2024 at 11:31 PM Ryan Roberts <[email protected]> wrote:
> > >>>
> > >>> On 21/03/2024 08:42, Barry Song wrote:
> > >>>> Hi Ryan,
> > >>>> Sorry for the late reply.
> > >>>
> > >>> No problem!
> > >>>
> > >>>>
> > >>>> On Tue, Mar 12, 2024 at 5:56 AM Ryan Roberts <[email protected]> wrote:
> > >>>>>
> > >>>>> On 04/03/2024 08:13, Barry Song wrote:
> > >>>>>> From: Barry Song <[email protected]>
> > >>>>>>
> > >>>>>> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> > >>>>>> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> > >>>>>> MTE as the MTE code works with the assumption tags save/restore is
> > >>>>>> always handling a folio with only one page.
> > >>>>>>
> > >>>>>> The limitation should be removed as more and more ARM64 SoCs have
> > >>>>>> this feature. Co-existence of MTE and THP_SWAP becomes more and
> > >>>>>> more important.
> > >>>>>>
> > >>>>>> This patch makes MTE tags saving support large folios, then we don't
> > >>>>>> need to split large folios into base pages for swapping out on ARM64
> > >>>>>> SoCs with MTE any more.
> > >>>>>>
> > >>>>>> arch_prepare_to_swap() should take folio rather than page as parameter
> > >>>>>> because we support THP swap-out as a whole. It saves tags for all
> > >>>>>> pages in a large folio.
> > >>>>>>
> > >>>>>> As now we are restoring tags based-on folio, in arch_swap_restore(),
> > >>>>>> we may increase some extra loops and early-exitings while refaulting
> > >>>>>> a large folio which is still in swapcache in do_swap_page(). In case
> > >>>>>> a large folio has nr pages, do_swap_page() will only set the PTE of
> > >>>>>> the particular page which is causing the page fault.
> > >>>>>> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> > >>>>>> will loop nr times for those subpages in the folio. So right now the
> > >>>>>> algorithmic complexity becomes O(nr^2).
> > >>>>>>
> > >>>>>> Once we support mapping large folios in do_swap_page(), extra loops
> > >>>>>> and early-exitings will decrease while not being completely removed
> > >>>>>> as a large folio might get partially tagged in corner cases such as,
> > >>>>>> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> > >>>>>> tags for the unmapped pages will be invalidated;
> > >>>>>> 2. users might use mprotect() to set MTEs on a part of a large folio.
> > >>>>>>
> > >>>>>> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> > >>>>>> who needed it.
> > >>>
> > >>> I think we should decouple this patch from your swap-in series. I suspect this
> > >>> one could be ready and go in sooner than the swap-in series based on the current
> > >>> discussions :)
> > >>>
> > >>>>>>
> > >>>>>> Cc: Catalin Marinas <[email protected]>
> > >>>>>> Cc: Will Deacon <[email protected]>
> > >>>>>> Cc: Ryan Roberts <[email protected]>
> > >>>>>> Cc: Mark Rutland <[email protected]>
> > >>>>>> Cc: David Hildenbrand <[email protected]>
> > >>>>>> Cc: Kemeng Shi <[email protected]>
> > >>>>>> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> > >>>>>> Cc: Anshuman Khandual <[email protected]>
> > >>>>>> Cc: Peter Collingbourne <[email protected]>
> > >>>>>> Cc: Steven Price <[email protected]>
> > >>>>>> Cc: Yosry Ahmed <[email protected]>
> > >>>>>> Cc: Peter Xu <[email protected]>
> > >>>>>> Cc: Lorenzo Stoakes <[email protected]>
> > >>>>>> Cc: "Mike Rapoport (IBM)" <[email protected]>
> > >>>>>> Cc: Hugh Dickins <[email protected]>
> > >>>>>> CC: "Aneesh Kumar K.V" <[email protected]>
> > >>>>>> Cc: Rick Edgecombe <[email protected]>
> > >>>>>> Signed-off-by: Barry Song <[email protected]>
> > >>>>>> Reviewed-by: Steven Price <[email protected]>
> > >>>>>> Acked-by: Chris Li <[email protected]>
> > >>>>>> ---
> > >>>>>> arch/arm64/include/asm/pgtable.h | 19 ++------------
> > >>>>>> arch/arm64/mm/mteswap.c | 43 ++++++++++++++++++++++++++++++++
> > >>>>>> include/linux/huge_mm.h | 12 ---------
> > >>>>>> include/linux/pgtable.h | 2 +-
> > >>>>>> mm/page_io.c | 2 +-
> > >>>>>> mm/swap_slots.c | 2 +-
> > >>>>>> 6 files changed, 48 insertions(+), 32 deletions(-)
> > >>>>>>
> > >>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > >>>>>> index 401087e8a43d..7a54750770b8 100644
> > >>>>>> --- a/arch/arm64/include/asm/pgtable.h
> > >>>>>> +++ b/arch/arm64/include/asm/pgtable.h
> > >>>>>> @@ -45,12 +45,6 @@
> > >>>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > >>>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > >>>>>>
> > >>>>>> -static inline bool arch_thp_swp_supported(void)
> > >>>>>> -{
> > >>>>>> - return !system_supports_mte();
> > >>>>>> -}
> > >>>>>> -#define arch_thp_swp_supported arch_thp_swp_supported
> > >>>>>> -
> > >>>>>> /*
> > >>>>>> * Outside of a few very special situations (e.g. hibernation), we always
> > >>>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
> > >>>>>> @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > >>>>>> #ifdef CONFIG_ARM64_MTE
> > >>>>>>
> > >>>>>> #define __HAVE_ARCH_PREPARE_TO_SWAP
> > >>>>>> -static inline int arch_prepare_to_swap(struct page *page)
> > >>>>>> -{
> > >>>>>> - if (system_supports_mte())
> > >>>>>> - return mte_save_tags(page);
> > >>>>>> - return 0;
> > >>>>>> -}
> > >>>>>> +extern int arch_prepare_to_swap(struct folio *folio);
> > >>>>>>
> > >>>>>> #define __HAVE_ARCH_SWAP_INVALIDATE
> > >>>>>> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > >>>>>> @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type)
> > >>>>>> }
> > >>>>>>
> > >>>>>> #define __HAVE_ARCH_SWAP_RESTORE
> > >>>>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > >>>>>> -{
> > >>>>>> - if (system_supports_mte())
> > >>>>>> - mte_restore_tags(entry, &folio->page);
> > >>>>>> -}
> > >>>>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> > >>>>>>
> > >>>>>> #endif /* CONFIG_ARM64_MTE */
> > >>>>>>
> > >>>>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > >>>>>> index a31833e3ddc5..295836fef620 100644
> > >>>>>> --- a/arch/arm64/mm/mteswap.c
> > >>>>>> +++ b/arch/arm64/mm/mteswap.c
> > >>>>>> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> > >>>>>> mte_free_tag_storage(tags);
> > >>>>>> }
> > >>>>>>
> > >>>>>> +static inline void __mte_invalidate_tags(struct page *page)
> > >>>>>> +{
> > >>>>>> + swp_entry_t entry = page_swap_entry(page);
> > >>>>>> +
> > >>>>>> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > >>>>>> +}
> > >>>>>> +
> > >>>>>> void mte_invalidate_tags_area(int type)
> > >>>>>> {
> > >>>>>> swp_entry_t entry = swp_entry(type, 0);
> > >>>>>> @@ -83,3 +90,39 @@ void mte_invalidate_tags_area(int type)
> > >>>>>> }
> > >>>>>> xa_unlock(&mte_pages);
> > >>>>>> }
> > >>>>>> +
> > >>>>>> +int arch_prepare_to_swap(struct folio *folio)
> > >>>>>> +{
> > >>>>>> + long i, nr;
> > >>>>>> + int err;
> > >>>>>> +
> > >>>>>> + if (!system_supports_mte())
> > >>>>>> + return 0;
> > >>>>>> +
> > >>>>>> + nr = folio_nr_pages(folio);
> > >>>>>> +
> > >>>>>> + for (i = 0; i < nr; i++) {
> > >>>>>> + err = mte_save_tags(folio_page(folio, i));
> > >>>>>> + if (err)
> > >>>>>> + goto out;
> > >>>>>> + }
> > >>>>>> + return 0;
> > >>>>>> +
> > >>>>>> +out:
> > >>>>>> + while (i--)
> > >>>>>> + __mte_invalidate_tags(folio_page(folio, i));
> > >>>>>> + return err;
> > >>>>>> +}
> > >>>>>> +
> > >>>>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > >>>>>
> > >>>>> I'm still not a fan of the fact that entry could be anywhere within folio.
> > >>>>>
> > >>>>>> +{
> > >>>>>> + if (system_supports_mte()) {
> > >>>>>
> > >>>>> nit: if you do:
> > >>>>>
> > >>>>> if (!system_supports_mte())
> > >>>>> return;
> > >>>>
> > >>>> Acked
> > >>>>
> > >>>>>
> > >>>>> It will be consistent with arch_prepare_to_swap() and reduce the indentation of
> > >>>>> the main body.
> > >>>>>
> > >>>>>> + long i, nr = folio_nr_pages(folio);
> > >>>>>> +
> > >>>>>> + entry.val -= swp_offset(entry) & (nr - 1);
> > >>>>>
> > >>>>> This assumes that folios are always stored in swap with natural alignment. Is
> > >>>>> that definitely a safe assumption? My swap-out series is currently ensuring that
> > >>>>> folios are swapped-out naturally aligned, but that is an implementation detail.
> > >>>>>
> > >>>>
> > >>>> I concur that this is an implementation detail. However, we should be
> > >>>> bold enough
> > >>>> to state that swap slots will be contiguous, considering we are
> > >>>> currently utilizing
> > >>>> folio->swap instead of subpage->swap ?
> > >>>
> > >>> Yes, I agree about contiguity. My objection is about assuming natural alignment
> > >>> though. It can still be contiguous while not naturally aligned in swap.
> > >>
> > >> Hi Ryan,
> > >>
> > >> While working on the new version of this patch, I've come to recognize
> > >> that, for the time being, it's
> > >> imperative to maintain a natural alignment. The following code
> > >> operates on the basis of this
> > >> assumption.
> > >>
> > >> /**
> > >> * folio_file_page - The page for a particular index.
> > >> * @folio: The folio which contains this index.
> > >> * @index: The index we want to look up.
> > >> *
> > >> * Sometimes after looking up a folio in the page cache, we need to
> > >> * obtain the specific page for an index (eg a page fault).
> > >> *
> > >> * Return: The page containing the file data for this index.
> > >> */
> > >> static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
> > >> {
> > >> return folio_page(folio, index & (folio_nr_pages(folio) - 1));
> > >> }
> > >>
> > >>
> > >> It's invoked everywhere, particularly within do_swap_page(). Nonetheless,
> > >> I remain confident that I can consistently pass the first entry to
> > >> arch_swap_restore().
> > >
> > > After grappling for a couple of hours, I've realized that the only
> > > viable approach
> > > is as follows: shifting the task of obtaining the first entry from the
> > > callee to the
> > > callers( looks silly). This is necessary due to various scenarios like
> > > swap cache,
> > > non-swap cache, and KSM, each presenting different cases. Since there's no
> > > assurance of folio->swap being present, forcibly setting folio->swap could pose
> > > risks (There might not even be any risk involved, but the associated
> > > task getting
> > > the first entry still cannot be overlooked by callers).
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 7e486f2c502c..94d5b4b5a5da 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -76,6 +76,20 @@ static inline int folio_nr_pages_mapped(struct folio *folio)
> > > return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> > > }
> > >
> > > +/*
> > > + * Retrieve the first entry of a folio based on a provided entry within the
> > > + * folio. We cannot rely on folio->swap as there is no guarantee that it has
> > > + * been initialized. Used by arch_swap_restore()
> > > + */
> > > +static inline swp_entry_t folio_swap(swp_entry_t entry, struct folio *folio)
> > > +{
> > > + swp_entry_t swap = {
> > > + .val = entry.val - (swp_offset(entry) & (folio_nr_pages(folio) - 1)),
> > > + };
> > > +
> > > + return swap;
> > > +}
> > > +
> > > static inline void *folio_raw_mapping(struct folio *folio)
> > > {
> > > unsigned long mapping = (unsigned long)folio->mapping;
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index f2bc6dd15eb8..b7cab8be8632 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4188,7 +4188,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > * when reading from swap. This metadata may be indexed by swap entry
> > > * so this must be called before swap_free().
> > > */
> > > - arch_swap_restore(entry, folio);
> > > + arch_swap_restore(folio_swap(entry, folio), folio);
> > >
> > > /*
> > > * Remove the swap entry and conditionally try to free up the swapcache.
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 0aad0d9a621b..82c9df4628f2 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1913,7 +1913,7 @@ static int shmem_swapin_folio(struct inode
> > > *inode, pgoff_t index,
> > > * Some architectures may have to restore extra metadata to the
> > > * folio after reading from swap.
> > > */
> > > - arch_swap_restore(swap, folio);
> > > + arch_swap_restore(folio_swap(swap, folio), folio);
> > >
> > > if (shmem_should_replace_folio(folio, gfp)) {
> > > error = shmem_replace_folio(&folio, gfp, info, index);
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 4919423cce76..5e6d2304a2a4 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1806,7 +1806,7 @@ static int unuse_pte(struct vm_area_struct *vma,
> > > pmd_t *pmd,
> > > * when reading from swap. This metadata may be indexed by swap entry
> > > * so this must be called before swap_free().
> > > */
> > > - arch_swap_restore(entry, folio);
> > > + arch_swap_restore(folio_swap(entry, folio), folio);
> > >
> > > dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > > inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > >
> > >
> > > Meanwhile, natural alignment is essential even during the execution of
> > > add_to_swap(), as failure to
> > > do so will trigger the VM_BUG_ON condition below.
> > >
> > > int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > > gfp_t gfp, void **shadowp)
> > > {
> > > struct address_space *address_space = swap_address_space(entry);
> > > pgoff_t idx = swp_offset(entry);
> > > XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> > > unsigned long i, nr = folio_nr_pages(folio);
> > > ...
> > > folio_set_swapcache(folio);
> > > folio->swap = entry;
> > >
> > > do {
> > > xas_lock_irq(&xas);
> > > xas_create_range(&xas);
> > > if (xas_error(&xas))
> > > goto unlock;
> > > for (i = 0; i < nr; i++) {
> > > VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
>
> Here swap_cache assue swap entry + i match folio + i subpage. The swap
> entry of a folio must be continuous. If we want to allow folio write

It is even stricter than that: XA_STATE_ORDER ensures that
xas.xa_index is already naturally aligned, by computing
(index >> order) << order.

#define XA_STATE_ORDER(name, array, index, order) \
struct xa_state name = __XA_STATE(array, \
(index >> order) << order, \
order - (order % XA_CHUNK_SHIFT), \
(1U << (order % XA_CHUNK_SHIFT)) - 1)
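
For example, with order = 4 (a 64KiB folio of 4KiB base pages), an
index of 0x1233 becomes (0x1233 >> 4) << 4 = 0x1230, i.e. the xarray
index always starts at the folio's naturally aligned head.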


> out to the discontiguous offset of the swap device, this aspect of the
> swap cache will need to change as well. Do you see a problem having
> all pte entries of a folio point to the same large swap entry? Of
> course, the large swap entry internally will track the offset of sub
> page + i. The swap cache will only have one index for the large swap
> entry (the head entry).

I do see two problems (or difficulties).

1. A specific page table entry (PTE) may not always be aware of its
position. For instance, the second PTE of a large folio might not
identify itself as such, particularly if the virtual address of the
large folio is not aligned to the size of the large folio due to
operations like mremap.

2. We also need to consider the complexity arising from partial
unmapping or MADV_DONTNEED operations if we allow all PTEs to
reference a 'large' swap entry. Given that userspace typically
operates at 4KiB granularity, numerous partial unmappings may be
expected for a single large swap entry.

>
>
> > > if (shadowp) {
> > > old = xas_load(&xas);
> > > if (xa_is_value(old))
> > > *shadowp = old;
> > > }
> > > xas_store(&xas, folio);
> > > xas_next(&xas);
> > > }
> > > }
> > >
> > >
> > > Based on the information provided, Ryan, would it be feasible to retain the task
> > > of obtaining the first entry within the callee? Or, are you in favor
> > > of utilizing the
> > > new folio_swap() helper?
> >
> > My opinion still remains that either:
> >
> > - This should be a per-page interface - i.e. call it for each page to restore
> > tags. If we don't want to pass `struct page *` then perhaps we can pass a folio
>
> Can you clarify that by "tag" you mean the MTE tags, not swap cache
> xarray tags, right? From the email context I assume that is the MTE
> tag. Please let me know if I assume incorrectly.
>
> > and the index of the page we want to restore? In this case, entry refers the the
> > precise page we are operating on.
> >
> > OR
> >
> > - Make it a per-folio interface - i.e. it restores tags for all pages in the
> > folio. But in this case, entry must refer to the first page in the folio.
> > Anything else is confusing.
>
> As long as you refer to the subpage as folilo + i, restoring a subset
> of the folio should be permitted?

That was my approach in those older versions - passing subpage
rather than folio.

In recent versions, we've transitioned to always restoring a folio during
swap operations. While this approach is acceptable, it can lead to some
redundant idle loops. For instance, if we swap out a large folio with
nr_pages and subsequently encounter a refault while the folio is still
in the swapcache, the first do_swap_page() call will restore all tags.
Subsequent do_swap_page() calls for the remaining nr_pages-1 PTEs
will perform the same checks, realizing that the restoration has already
been completed and thus skipping the process. However, they still
redundantly execute checks.
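For example, for a 64KiB folio (nr = 16) refaulted entirely from the
swapcache, that amounts to 16 * 16 = 256 per-page checks even though
only the first 16 restore anything.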

I propose extracting a separate patch from "[RFC PATCH v3 5/5] mm:
support large folios swapin as a whole" specifically to handle refaults.
This patch would essentially remove these redundant loops. The
swap-in patch currently addresses both refaults and newly allocated
large folios. If we prioritize addressing refaults sooner, I believe
this extraction would be beneficial.


>
> On the swap entry side, I would like to avoid assuming the swap entry
> is contingues. The swap entry should have an API to fetch the swap
> offset of the head entry + i. For the simple continuous swap entry,
> this mapping is just linear. For non continuous swap offset, it would
> need to go through some lookup table to find the offset for i.

The implementation of this approach necessitates a significant overhaul
of existing infrastructure. Currently, the entire codebase operates under
the assumption of contiguous and naturally aligned swap entries. As such,
adapting the system to support non-contiguous or non-naturally aligned
swap entries will require substantial modifications across various
components.

>
> Chris
>
> >
> > So if going for the latter approach, then I vote for fixing it up in the callee.
> > But I'm just one guy with one opinion!
> >
> >
> > >
> > >>
> > >>>
> > >>>>
> > >>>>> Your cover note for swap-in says that you could technically swap in a large
> > >>>>> folio without it having been swapped-out large. If you chose to do that in
> > >>>>> future, this would break, right? I don't think it's good to couple the swap
> > >>>>
> > >>>> Right. technically I agree. Given that we still have many tasks involving even
> > >>>> swapping in contiguous swap slots, it's unlikely that swapping in large folios
> > >>>> for non-contiguous entries will occur in the foreseeable future :-)
> > >>>>
> > >>>>> storage layout to the folio order that you want to swap into. Perhaps that's an
> > >>>>> argument for passing each *page* to this function with its exact, corresponding
> > >>>>> swap entry?
> > >>>>
> > >>>> I recall Matthew Wilcox strongly objected to using "page" as the
> > >>>> parameter, so I've
> > >>>> discarded that approach. Alternatively, it appears I can consistently pass
> > >>>> folio->swap to this function and ensure the function always retrieves
> > >>>> the first entry?
> > >>>
> > >>> Yes, if we must pass a folio here, I'd prefer that entry always corresponds to
> > >>> the first entry for the folio. That will remove the need for this function to do
> > >>> the alignment above too. So win-win.
> > >>>
> > >>>>
> > >>>>>
> > >>>>>> + for (i = 0; i < nr; i++) {
> > >>>>>> + mte_restore_tags(entry, folio_page(folio, i));
> > >>>>>> + entry.val++;
> > >>>>>> + }
> > >>>>>> + }
> > >>>>>> +}
> > >>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > >>>>>> index de0c89105076..e04b93c43965 100644
> > >>>>>> --- a/include/linux/huge_mm.h
> > >>>>>> +++ b/include/linux/huge_mm.h
> > >>>>>> @@ -535,16 +535,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
> > >>>>>> #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> > >>>>>> #define split_folio(f) split_folio_to_order(f, 0)
> > >>>>>>
> > >>>>>> -/*
> > >>>>>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > >>>>>> - * limitations in the implementation like arm64 MTE can override this to
> > >>>>>> - * false
> > >>>>>> - */
> > >>>>>> -#ifndef arch_thp_swp_supported
> > >>>>>> -static inline bool arch_thp_swp_supported(void)
> > >>>>>> -{
> > >>>>>> - return true;
> > >>>>>> -}
> > >>>>>> -#endif
> > >>>>>> -
> > >>>>>> #endif /* _LINUX_HUGE_MM_H */
> > >>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > >>>>>> index e1b22903f709..bfcfe3386934 100644
> > >>>>>> --- a/include/linux/pgtable.h
> > >>>>>> +++ b/include/linux/pgtable.h
> > >>>>>> @@ -1106,7 +1106,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> > >>>>>> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> > >>>>>> */
> > >>>>>> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > >>>>>> -static inline int arch_prepare_to_swap(struct page *page)
> > >>>>>> +static inline int arch_prepare_to_swap(struct folio *folio)
> > >>>>>> {
> > >>>>>> return 0;
> > >>>>>> }
> > >>>>>> diff --git a/mm/page_io.c b/mm/page_io.c
> > >>>>>> index ae2b49055e43..a9a7c236aecc 100644
> > >>>>>> --- a/mm/page_io.c
> > >>>>>> +++ b/mm/page_io.c
> > >>>>>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > >>>>>> * Arch code may have to preserve more data than just the page
> > >>>>>> * contents, e.g. memory tags.
> > >>>>>> */
> > >>>>>> - ret = arch_prepare_to_swap(&folio->page);
> > >>>>>> + ret = arch_prepare_to_swap(folio);
> > >>>>>> if (ret) {
> > >>>>>> folio_mark_dirty(folio);
> > >>>>>> folio_unlock(folio);
> > >>>>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > >>>>>> index 90973ce7881d..53abeaf1371d 100644
> > >>>>>> --- a/mm/swap_slots.c
> > >>>>>> +++ b/mm/swap_slots.c
> > >>>>>> @@ -310,7 +310,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> > >>>>>> entry.val = 0;
> > >>>>>>
> > >>>>>> if (folio_test_large(folio)) {
> > >>>>>> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > >>>>>> + if (IS_ENABLED(CONFIG_THP_SWAP))
> > >>>>>> get_swap_pages(1, &entry, folio_nr_pages(folio));
> > >>>>>> goto out;
> > >>>>>> }
> > >>>>>
> > >>>>
> > >>>> Thanks
> > >>>> Barry

Thanks
Barry

2024-06-10 20:44:23

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
[...]
> >
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> For 2M THP, IO overhead may not necessarily be large? :)
> 1.If 2M THP are continuously stored in the swap device, the IO
> overhead may not be very large (such as submitting bio with one
> bio_vec at a time).
> 2.If the process really needs this 2M data, one page-fault may perform
> much better than multiple.
> 3.For swap devices like zram,using 2M THP might also improve
> decompression efficiency.
>

Sorry for the late response; do we have any performance data backing
the above claims, particularly for the zswap/swap-on-zram cases?


2024-06-11 00:26:19

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Tue, Jun 11, 2024 at 8:43 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
> [...]
> > >
> > > So in the common case, swap-in will pull in the same size of folio as was
> > > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > > it makes sense for 2M THP; As the size increases the chances of actually needing
> > > all of the folio reduces so chances are we are wasting IO. There are similar
> > > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > > sense to copy the whole folio up to a certain size.
> > For 2M THP, IO overhead may not necessarily be large? :)
> > 1.If 2M THP are continuously stored in the swap device, the IO
> > overhead may not be very large (such as submitting bio with one
> > bio_vec at a time).
> > 2.If the process really needs this 2M data, one page-fault may perform
> > much better than multiple.
> > 3.For swap devices like zram,using 2M THP might also improve
> > decompression efficiency.
> >
>
> Sorry for late response, do we have any performance data backing the
> above claims particularly for zswap/swap-on-zram cases?

No need to say sorry. You are always welcome to give comments.

This, combined with the zram modification, not only improves the
compression ratio but also reduces CPU time significantly. You may
find some data here[1].

granularity   orig_data_size   compr_data_size   time(us)
4KiB-zstd     1048576000       246876055         50259962
64KiB-zstd    1048576000       199763892         18330605

On mobile devices, we tested the performance of swap-in by running
100 iterations of swapping in 100MB of data, and the results were
as follows. The swap-in speed increased by about 45%.

time consumption of swapin(ms)
lz4    4k     45274
lz4    64k    22942

zstdn  4k     85035
zstdn  64k    46558

[1] https://lore.kernel.org/linux-mm/[email protected]/

Thanks
Barry

2024-06-11 17:24:35

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Tue, Jun 11, 2024 at 12:23:41PM GMT, Barry Song wrote:
> On Tue, Jun 11, 2024 at 8:43 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
> > [...]
> > > >
> > > > So in the common case, swap-in will pull in the same size of folio as was
> > > > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > > > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > > > it makes sense for 2M THP; As the size increases the chances of actually needing
> > > > all of the folio reduces so chances are we are wasting IO. There are similar
> > > > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > > > sense to copy the whole folio up to a certain size.
> > > For 2M THP, IO overhead may not necessarily be large? :)
> > > 1.If 2M THP are continuously stored in the swap device, the IO
> > > overhead may not be very large (such as submitting bio with one
> > > bio_vec at a time).
> > > 2.If the process really needs this 2M data, one page-fault may perform
> > > much better than multiple.
> > > 3.For swap devices like zram,using 2M THP might also improve
> > > decompression efficiency.
> > >
> >
> > Sorry for late response, do we have any performance data backing the
> > above claims particularly for zswap/swap-on-zram cases?
>
> no need to say sorry. You are always welcome to give comments.
>
> this, combining with zram modification, not only improves compression
> ratio but also reduces CPU time significantly. you may find some data
> here[1].
>
> granularity orig_data_size compr_data_size time(us)
> 4KiB-zstd 1048576000 246876055 50259962
> 64KiB-zstd 1048576000 199763892 18330605
>
> On mobile devices, We tested the performance of swapin by running
> 100 iterations of swapping in 100MB of data ,and the results were
> as follows.the swapin speed increased by about 45%.
>
> time consumption of swapin(ms)
> lz4 4k 45274
> lz4 64k 22942
>
> zstdn 4k 85035
> zstdn 64k 46558

Thanks for the response. The above numbers are actually very fascinating
and counterintuitive (at least to me). Do you also have numbers for 2MiB
THP? I am assuming 64k is the right balance between too small and too
large. Did you experiment on server machines as well?

>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Thanks
> Barry

2024-06-11 22:13:58

by Barry Song

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

On Wed, Jun 12, 2024 at 5:24 AM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 12:23:41PM GMT, Barry Song wrote:
> > On Tue, Jun 11, 2024 at 8:43 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
> > > [...]
> > > > >
> > > > > So in the common case, swap-in will pull in the same size of folio as was
> > > > > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > > > > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > > > > it makes sense for 2M THP; As the size increases the chances of actually needing
> > > > > all of the folio reduces so chances are we are wasting IO. There are similar
> > > > > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > > > > sense to copy the whole folio up to a certain size.
> > > > For 2M THP, IO overhead may not necessarily be large? :)
> > > > 1.If 2M THP are continuously stored in the swap device, the IO
> > > > overhead may not be very large (such as submitting bio with one
> > > > bio_vec at a time).
> > > > 2.If the process really needs this 2M data, one page-fault may perform
> > > > much better than multiple.
> > > > 3.For swap devices like zram,using 2M THP might also improve
> > > > decompression efficiency.
> > > >
> > >
> > > Sorry for late response, do we have any performance data backing the
> > > above claims particularly for zswap/swap-on-zram cases?
> >
> > no need to say sorry. You are always welcome to give comments.
> >
> > this, combining with zram modification, not only improves compression
> > ratio but also reduces CPU time significantly. you may find some data
> > here[1].
> >
> > granularity orig_data_size compr_data_size time(us)
> > 4KiB-zstd 1048576000 246876055 50259962
> > 64KiB-zstd 1048576000 199763892 18330605
> >
> > On mobile devices, We tested the performance of swapin by running
> > 100 iterations of swapping in 100MB of data ,and the results were
> > as follows.the swapin speed increased by about 45%.
> >
> > time consumption of swapin(ms)
> > lz4 4k 45274
> > lz4 64k 22942
> >
> > zstdn 4k 85035
> > zstdn 64k 46558
>
> Thanks for the response. Above numbers are actually very fascinating and
> counter intuitive (at least to me). Do you also have numbers for 2MiB
> THP? I am assuming 64k is the right balance between too small or too
> large. Did you experiment on server machines as well?

I don't have data on 2MiB, and regrettably, I lack a server machine
for testing. However, I believe that this kind of higher compression
ratio and lower CPU consumption generally holds true for generic
anonymous memory.

64KB is the right balance. But nothing stops a THP from using 64KB as
the granularity for swap-in, compression and decompression. As you can
see from the zram/zsmalloc series, we actually have a configuration
option, CONFIG_ZSMALLOC_MULTI_PAGES_ORDER.

The default value is 4.
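With order 4, each chunk covers 2^4 base pages, i.e. 16 * 4KiB = 64KiB.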

That means a 2MB THP can be compressed/decompressed as 32 * 64KB chunks.
If we use 64KB as the swap-in granularity, we still keep that balance
and all the benefits, even if 2MB would be too large a swap-in
granularity and might cause memory waste.

>
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> >

Thanks
Barry