2024-03-11 15:01:22

by Ryan Roberts

Subject: [PATCH v4 0/6] Swap-out mTHP without splitting

Hi All,

This series adds support for swapping out multi-size THP (mTHP) without needing
to first split the large folio via split_huge_page_to_list_to_order(). It
closely follows the approach already used to swap-out PMD-sized THP.

There are a couple of reasons for swapping out mTHP without splitting:

- Performance: It is expensive to split a large folio, and under extreme memory
pressure some workloads regressed when using 64K mTHP vs 4K small folios
because of this extra cost in the swap-out path. This series not only
eliminates the regression but makes it faster to swap out 64K mTHP than 4K
small folios.

- Memory fragmentation avoidance: If we can avoid splitting a large folio,
memory is less likely to become fragmented, making it easier to re-allocate
a large folio in the future.

- Performance: Enables a separate series [4] to swap-in whole mTHPs, which
means we won't lose the TLB-efficiency benefits of mTHP once the memory has
been through a swap cycle.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is sufficient.


Performance Testing
===================

I've run some swap performance tests on an Ampere Altra VM (arm64) with 8 CPUs.
The VM is set up with a 35G block ram device as the swap device and the test is
run from inside a memcg limited to 40G memory. I've then run `usemem` from
vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
repeated everything 6 times and taken the mean performance improvement relative
to the 4K page baseline:

| alloc size |            baseline | + this series |
|            |  v6.6-rc4+anonfolio |               |
|:-----------|--------------------:|--------------:|
| 4K Page    |                0.0% |          1.4% |
| 64K THP    |              -14.6% |         44.2% |
| 2M THP     |               87.4% |         97.7% |

So with this change, the 64K swap performance goes from a 15% regression to a
44% improvement. 4K and 2M swap performance improves slightly too.

This test also acts as a good stress test for swap and, more generally, mm. A
couple of existing bugs were found as a result [5] [6].


---
The series applies against mm-unstable (d7182786dd0a), although I've
additionally been running with a couple of extra fixes to avoid the issues at
[6].


Changes since v3 [3]
====================

- Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
- Simplified max offset calculation (per Huang, Ying)
- Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
offset (per Huang, Ying)
- Removed swap_alloc_large() and merged its functionality into
scan_swap_map_slots() (per Huang, Ying)
- Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
by freeing swap entries in batches (see patch 2) (per DavidH)
- vmscan splits the folio if it is partially mapped (per Barry Song, DavidH)
- Avoid splitting in MADV_PAGEOUT path (per Barry Song)
- Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
since it's not actually a problem for THP as I first thought.


Changes since v2 [2]
====================

- Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
allocation. This required some refactoring to make everything work nicely
(new patches 2 and 3).
- Fix bug where nr_swap_pages would say there are pages available but the
scanner would not be able to allocate them because they were reserved for the
per-cpu allocator. We now allow stealing of order-0 entries from the high
order per-cpu clusters (in addition to existing stealing from order-0
per-cpu clusters).


Changes since v1 [1]
====================

- patch 1:
- Use cluster_set_count() instead of cluster_set_count_flag() in
swap_alloc_cluster() since we no longer have any flag to set. I was unable
to kill cluster_set_count_flag() as proposed against v1 as other call
sites depend on explicitly setting flags to 0.
- patch 2:
- Moved large_next[] array into percpu_cluster to make it per-cpu
(recommended by Huang, Ying).
- large_next[] array is dynamically allocated because PMD_ORDER is not a
compile-time constant for powerpc (fixes a build error).


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
[6] https://lore.kernel.org/linux-mm/[email protected]/

Thanks,
Ryan


Ryan Roberts (6):
mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
mm: swap: Simplify struct percpu_cluster
mm: swap: Allow storage of all mTHP orders
mm: vmscan: Avoid split during shrink_folio_list()
mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

include/linux/pgtable.h | 28 ++++
include/linux/swap.h | 33 +++--
mm/huge_memory.c | 3 -
mm/internal.h | 48 +++++++
mm/madvise.c | 101 ++++++++------
mm/memory.c | 13 +-
mm/swapfile.c | 298 ++++++++++++++++++++++------------------
mm/vmscan.c | 9 +-
8 files changed, 332 insertions(+), 201 deletions(-)

--
2.25.1



2024-03-11 15:02:00

by Ryan Roberts

Subject: [PATCH v4 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

Now that we no longer have a convenient flag in the cluster to determine
if a folio is large, free_swap_and_cache() will take a reference and
lock a large folio much more often, which could lead to contention and
(e.g.) failure to split large folios, etc.

Let's solve that problem by batch freeing swap and cache with a new
function, free_swap_and_cache_nr(), to free a contiguous range of swap
entries together. This allows us to first drop a reference to each swap
slot before we try to release the cache folio. This means we only try to
release the folio once, only taking the reference and lock once - much
better than the previous 512 times for the 2M THP case.

Contiguous swap entries are gathered in zap_pte_range() and
madvise_free_pte_range() in a similar way to how present ptes are
already gathered in zap_pte_range().
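
For illustration, here is a minimal sketch of the caller-side pattern
(simplified from the madvise_free_pte_range() hunk below; the wrapper name
free_swap_range() is invented for this sketch, and locking, lazy-mmu handling
and non-swap entries are omitted):

static void free_swap_range(struct mmu_gather *tlb, struct mm_struct *mm,
                            pte_t *pte, unsigned long addr, unsigned long end)
{
        int nr;

        for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
                pte_t ptent = ptep_get(pte);
                swp_entry_t entry;
                int max_nr;

                nr = 1;
                if (pte_none(ptent) || pte_present(ptent))
                        continue;

                entry = pte_to_swp_entry(ptent);
                if (non_swap_entry(entry))
                        continue;

                /* How many ptes hold contiguous entries of the same type? */
                max_nr = (end - addr) / PAGE_SIZE;
                nr = swap_pte_batch(pte, max_nr, entry);

                /* Drop all swap references, then reclaim the folio once. */
                free_swap_and_cache_nr(entry, nr);
                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
        }
}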

While we are at it, let's simplify by converting the return type of both
functions to void. The return value was used only by zap_pte_range() to
print a bad pte, and was ignored by everyone else, so the extra
reporting wasn't exactly guaranteed. We will still get the warning with
most of the information from get_swap_device(). With the batch version,
we wouldn't know which pte was bad anyway, so we could print the wrong one.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 28 +++++++++++++++
include/linux/swap.h | 12 +++++--
mm/internal.h | 48 +++++++++++++++++++++++++
mm/madvise.c | 12 ++++---
mm/memory.c | 13 +++----
mm/swapfile.c | 78 ++++++++++++++++++++++++++++++-----------
6 files changed, 157 insertions(+), 34 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..8cf1f2fe2c25 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
}
#endif

+#ifndef clear_not_present_full_ptes
+/**
+ * clear_not_present_full_ptes - Clear consecutive not present PTEs.
+ * @mm: Address space the ptes represent.
+ * @addr: Address of the first pte.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over pte_clear_not_present_full().
+ *
+ * Context: The caller holds the page table lock. The PTEs are all not present.
+ * The PTEs are all in the same PMD.
+ */
+static inline void clear_not_present_full_ptes(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+ for (;;) {
+ pte_clear_not_present_full(mm, addr, ptep, full);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
unsigned long address,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4a8b6c60793a..f2b7f204b968 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,7 +481,7 @@ extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -530,8 +530,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
#define free_pages_and_swap_cache(pages, nr) \
release_pages((pages), (nr));

-/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
-#define free_swap_and_cache(e) is_pfn_swap_entry(e)
+static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+{
+}

static inline void free_swap_cache(struct folio *folio)
{
@@ -599,6 +600,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
}
#endif /* CONFIG_SWAP */

+static inline void free_swap_and_cache(swp_entry_t entry)
+{
+ free_swap_and_cache_nr(entry, 1);
+}
+
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
diff --git a/mm/internal.h b/mm/internal.h
index a3e19194079f..8dbb1335df88 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,8 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
#include <linux/tracepoint-defs.h>

struct folio_batch;
@@ -174,6 +176,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,

return min(ptep - start_ptep, max_nr);
}
+
+/**
+ * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
+ * @start_ptep: Page table pointer for the first entry.
+ * @max_nr: The maximum number of table entries to consider.
+ * @entry: Swap entry recovered from the first table entry.
+ *
+ * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
+ * containing swap entries all with consecutive offsets and targeting the same
+ * swap type.
+ *
+ * max_nr must be at least one and must be limited by the caller so scanning
+ * cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
+ swp_entry_t entry)
+{
+ const pte_t *end_ptep = start_ptep + max_nr;
+ unsigned long expected_offset = swp_offset(entry) + 1;
+ unsigned int expected_type = swp_type(entry);
+ pte_t *ptep = start_ptep + 1;
+
+ VM_WARN_ON(max_nr < 1);
+ VM_WARN_ON(non_swap_entry(entry));
+
+ while (ptep < end_ptep) {
+ pte_t pte = ptep_get(ptep);
+
+ if (pte_none(pte) || pte_present(pte))
+ break;
+
+ entry = pte_to_swp_entry(pte);
+
+ if (non_swap_entry(entry) ||
+ swp_type(entry) != expected_type ||
+ swp_offset(entry) != expected_offset)
+ break;
+
+ expected_offset++;
+ ptep++;
+ }
+
+ return ptep - start_ptep;
+}
#endif /* CONFIG_MMU */

void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158..547dcd1f7a39 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
int nr_swap = 0;
unsigned long next;
+ int nr, max_nr;

next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd))
@@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
- for (; addr != end; pte++, addr += PAGE_SIZE) {
+ for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
+ nr = 1;
ptent = ptep_get(pte);

if (pte_none(ptent))
@@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,

entry = pte_to_swp_entry(ptent);
if (!non_swap_entry(entry)) {
- nr_swap--;
- free_swap_and_cache(entry);
- pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = swap_pte_batch(pte, max_nr, entry);
+ nr_swap -= nr;
+ free_swap_and_cache_nr(entry, nr);
+ clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
} else if (is_hwpoison_entry(entry) ||
is_poisoned_swp_entry(entry)) {
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..25c0ef1c7ff3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1629,12 +1629,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
folio_remove_rmap_pte(folio, page, vma);
folio_put(folio);
} else if (!non_swap_entry(entry)) {
- /* Genuine swap entry, hence a private anon page */
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = swap_pte_batch(pte, max_nr, entry);
+ /* Genuine swap entries, hence private anon pages */
if (!should_zap_cows(details))
continue;
- rss[MM_SWAPENTS]--;
- if (unlikely(!free_swap_and_cache(entry)))
- print_bad_pte(vma, addr, ptent, NULL);
+ rss[MM_SWAPENTS] -= nr;
+ free_swap_and_cache_nr(entry, nr);
} else if (is_migration_entry(entry)) {
folio = pfn_swap_entry_folio(entry);
if (!should_zap_folio(details, folio))
@@ -1657,8 +1658,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
WARN_ON_ONCE(1);
}
- pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+ clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);

add_mm_rss_vec(mm, rss);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index df1de034f6d8..ee7e44cb40c5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
/* Reclaim the swap entry if swap is getting full*/
#define TTRS_FULL 0x4

-/* returns 1 if swap entry is freed */
+/*
+ * returns number of pages in the folio that backs the swap entry. If positive,
+ * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
+ * folio was associated with the swap entry.
+ */
static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
@@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
ret = folio_free_swap(folio);
folio_unlock(folio);
}
+ ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
folio_put(folio);
return ret;
}
@@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
spin_lock(&si->lock);
/* entry was freed successfully, try to use this again */
- if (swap_was_freed)
+ if (swap_was_freed > 0)
goto checks;
goto scan; /* check next one */
}
@@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
return true;
}

-/*
- * Free the swap entry like above, but also try to
- * free the page cache entry if it is the last user.
- */
-int free_swap_and_cache(swp_entry_t entry)
+void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
- struct swap_info_struct *p;
- unsigned char count;
+ unsigned long end = swp_offset(entry) + nr;
+ unsigned int type = swp_type(entry);
+ struct swap_info_struct *si;
+ unsigned long offset;

if (non_swap_entry(entry))
- return 1;
+ return;

- p = get_swap_device(entry);
- if (p) {
- if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
- put_swap_device(p);
- return 0;
- }
+ si = get_swap_device(entry);
+ if (!si)
+ return;

- count = __swap_entry_free(p, entry);
- if (count == SWAP_HAS_CACHE)
- __try_to_reclaim_swap(p, swp_offset(entry),
+ if (WARN_ON(end > si->max))
+ goto out;
+
+ /*
+ * First free all entries in the range.
+ */
+ for (offset = swp_offset(entry); offset < end; offset++) {
+ if (!WARN_ON(data_race(!si->swap_map[offset])))
+ __swap_entry_free(si, swp_entry(type, offset));
+ }
+
+ /*
+ * Now go back over the range trying to reclaim the swap cache. This is
+ * more efficient for large folios because we will only try to reclaim
+ * the swap once per folio in the common case. If we do
+ * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
+ * latter will get a reference and lock the folio for every individual
+ * page but will only succeed once the swap slot for every subpage is
+ * zero.
+ */
+ for (offset = swp_offset(entry); offset < end; offset += nr) {
+ nr = 1;
+ if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+ /*
+ * Folios are always naturally aligned in swap so
+ * advance forward to the next boundary. Zero means no
+ * folio was found for the swap entry, so advance by 1
+ * in this case. Negative value means folio was found
+ * but could not be reclaimed. Here we can still advance
+ * to the next boundary.
+ */
+ nr = __try_to_reclaim_swap(si, offset,
TTRS_UNMAPPED | TTRS_FULL);
- put_swap_device(p);
+ if (nr == 0)
+ nr = 1;
+ else if (nr < 0)
+ nr = -nr;
+ nr = ALIGN(offset + 1, nr) - offset;
+ }
}
- return p != NULL;
+
+out:
+ put_swap_device(si);
}

#ifdef CONFIG_HIBERNATION
--
2.25.1


2024-03-11 15:02:30

by Ryan Roberts

Subject: [PATCH v4 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags

As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, remove CLUSTER_FLAG_HUGE,
which, when present, always implies PMD-sized THP, which is the same
size as the cluster.

The only use of the flag was to determine whether a swap entry refers to
a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.
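
As a rough sketch of the check that nr_pages now drives (the helper name is
invented for this sketch; the real logic lives in the
swap_page_trans_huge_swapped() hunk below, and swap_count() is the existing
helper in mm/swapfile.c):

static bool any_swap_count_in_folio_range(unsigned char *swap_map,
                                          unsigned long offset,
                                          unsigned int nr_pages)
{
        /* Folios are naturally aligned in swap, so round to the folio start. */
        unsigned long start = round_down(offset, nr_pages);
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
                if (swap_count(swap_map[start + i]))
                        return true;
        }
        return false;
}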

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
shortcut a call to __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be functionally equivalent.

That said, removing the fast path means that we will take a reference
and trylock a large folio much more often, which we would like to avoid.
The next patch will solve this.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/swap.h | 10 ----------
mm/huge_memory.c | 3 ---
mm/swapfile.c | 47 ++++++++------------------------------------
3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2955f7a78d8d..4a8b6c60793a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */

/*
* We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -600,15 +599,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
}
#endif /* CONFIG_SWAP */

-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
- return 0;
-}
-#endif
-
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 04fb994a7b0b..5298ba882d49 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2965,9 +2965,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
shmem_uncharge(folio->mapping->host, nr_dropped);
remap_page(folio, nr);

- if (folio_test_swapcache(folio))
- split_swap_cluster(folio->swap);
-
/*
* set page to its compound_head when split to non order-0 pages, so
* we can skip unlocking it below, since PG_locked is transferred to
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1155a6304119..df1de034f6d8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -343,18 +343,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
info->data = 0;
}

-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
- if (IS_ENABLED(CONFIG_THP_SWAP))
- return info->flags & CLUSTER_FLAG_HUGE;
- return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
- info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
unsigned long offset)
{
@@ -1027,7 +1015,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
offset = idx * SWAPFILE_CLUSTER;
ci = lock_cluster(si, offset);
alloc_cluster(si, idx);
- cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+ cluster_set_count(ci, SWAPFILE_CLUSTER);

memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
unlock_cluster(ci);
@@ -1365,7 +1353,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)

ci = lock_cluster_or_swap_info(si, offset);
if (size == SWAPFILE_CLUSTER) {
- VM_BUG_ON(!cluster_is_huge(ci));
map = si->swap_map + offset;
for (i = 0; i < SWAPFILE_CLUSTER; i++) {
val = map[i];
@@ -1373,7 +1360,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
if (val == SWAP_HAS_CACHE)
free_entries++;
}
- cluster_clear_huge(ci);
if (free_entries == SWAPFILE_CLUSTER) {
unlock_cluster_or_swap_info(si, ci);
spin_lock(&si->lock);
@@ -1395,23 +1381,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
unlock_cluster_or_swap_info(si, ci);
}

-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset = swp_offset(entry);
-
- si = _swap_info_get(entry);
- if (!si)
- return -EBUSY;
- ci = lock_cluster(si, offset);
- cluster_clear_huge(ci);
- unlock_cluster(ci);
- return 0;
-}
-#endif
-
static int swp_entry_cmp(const void *ent1, const void *ent2)
{
const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1519,22 +1488,23 @@ int swp_swapcount(swp_entry_t entry)
}

static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
- swp_entry_t entry)
+ swp_entry_t entry,
+ unsigned int nr_pages)
{
struct swap_cluster_info *ci;
unsigned char *map = si->swap_map;
unsigned long roffset = swp_offset(entry);
- unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+ unsigned long offset = round_down(roffset, nr_pages);
int i;
bool ret = false;

ci = lock_cluster_or_swap_info(si, offset);
- if (!ci || !cluster_is_huge(ci)) {
+ if (!ci || nr_pages == 1) {
if (swap_count(map[roffset]))
ret = true;
goto unlock_out;
}
- for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+ for (i = 0; i < nr_pages; i++) {
if (swap_count(map[offset + i])) {
ret = true;
break;
@@ -1556,7 +1526,7 @@ static bool folio_swapped(struct folio *folio)
if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
return swap_swapcount(si, entry) != 0;

- return swap_page_trans_huge_swapped(si, entry);
+ return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
}

/**
@@ -1622,8 +1592,7 @@ int free_swap_and_cache(swp_entry_t entry)
}

count = __swap_entry_free(p, entry);
- if (count == SWAP_HAS_CACHE &&
- !swap_page_trans_huge_swapped(p, entry))
+ if (count == SWAP_HAS_CACHE)
__try_to_reclaim_swap(p, swp_offset(entry),
TTRS_UNMAPPED | TTRS_FULL);
put_swap_device(p);
--
2.25.1


2024-03-11 15:03:17

by Ryan Roberts

Subject: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

Multi-size THP enables performance improvements by allocating large,
pte-mapped folios for anonymous memory. However, I've observed that on an
arm64 system running a parallel workload (e.g. kernel compilation)
across many cores, under high memory pressure, performance regresses. This
is due to bottlenecking on the increased number of TLBIs caused by all the
extra folio splitting when the large folios are swapped out.

Therefore, solve this regression by adding support for swapping out mTHP
without needing to split the folio, just like is already done for
PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
and when the swap backing store is a non-rotating block device. These
are the same constraints as for the existing PMD-sized THP swap-out
support.

Note that no attempt is made to swap-in (m)THP here - this is still done
page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
prerequisite for swapping-in mTHP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [1, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller can fall back to splitting the folio
and allocating individual entries (as per the existing PMD-sized THP
fallback).

The per-order current clusters are maintained per-cpu using the existing
infrastructure. This is done to avoid interleaving pages from different
tasks, which would prevent IO from being batched. This is already done for
the order-0 allocations so we follow the same pattern.

As is done for order-0 per-cpu clusters, the scanner now can steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.
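
A minimal sketch of the per-order scan within a single cluster (a
simplification of the scan_swap_map_try_ssd_cluster() changes below; the
helper name is invented for this sketch, and cluster locking and free-list
handling are omitted):

static unsigned int scan_cluster_for_order(unsigned char *swap_map,
                                           unsigned int tmp, unsigned int max,
                                           unsigned int nr_pages)
{
        /*
         * tmp starts on a cluster boundary and is only ever advanced in
         * nr_pages strides, so every candidate slot is naturally aligned.
         */
        while (tmp < max) {
                unsigned int i;

                for (i = 0; i < nr_pages; i++) {
                        if (swap_map[tmp + i])
                                break;
                }
                if (i == nr_pages)
                        return tmp;     /* nr_pages free, aligned slots */
                tmp += nr_pages;
        }
        return max;                     /* cluster exhausted for this order */
}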

This change only modifies swap to be able to accept any order mTHP. It
doesn't change the callers to elide doing the actual split. That will be
done in separate changes.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/swap.h | 8 ++-
mm/swapfile.c | 167 +++++++++++++++++++++++++------------------
2 files changed, 103 insertions(+), 72 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0cb082bee717..39b5c18ccc6a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,13 +268,19 @@ struct swap_cluster_info {
*/
#define SWAP_NEXT_INVALID 0

+#ifdef CONFIG_THP_SWAP
+#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#else
+#define SWAP_NR_ORDERS 1
+#endif
+
/*
* We assign a cluster to each CPU, so each CPU can allocate swap entry from
* its own cluster and swapout sequentially. The purpose is to optimize swapout
* throughput.
*/
struct percpu_cluster {
- unsigned int next; /* Likely next allocation offset */
+ unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};

struct swap_cluster_list {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3828d81aa6b8..61118a090796 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)

/*
* The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased.
+ * removed from free cluster list and its usage counter will be increased by
+ * count.
*/
-static void inc_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void add_cluster_info_page(struct swap_info_struct *p,
+ struct swap_cluster_info *cluster_info, unsigned long page_nr,
+ unsigned long count)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;

@@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
if (cluster_is_free(&cluster_info[idx]))
alloc_cluster(p, idx);

- VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
+ VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) + 1);
+ cluster_count(&cluster_info[idx]) + count);
+}
+
+/*
+ * The cluster corresponding to page_nr will be used. The cluster will be
+ * removed from free cluster list and its usage counter will be increased by 1.
+ */
+static void inc_cluster_info_page(struct swap_info_struct *p,
+ struct swap_cluster_info *cluster_info, unsigned long page_nr)
+{
+ add_cluster_info_page(p, cluster_info, page_nr, 1);
}

/*
@@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
*/
static bool
scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
- unsigned long offset)
+ unsigned long offset, int order)
{
struct percpu_cluster *percpu_cluster;
bool conflict;
@@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
return false;

percpu_cluster = this_cpu_ptr(si->percpu_cluster);
- percpu_cluster->next = SWAP_NEXT_INVALID;
+ percpu_cluster->next[order] = SWAP_NEXT_INVALID;
+ return true;
+}
+
+static inline bool swap_range_empty(char *swap_map, unsigned int start,
+ unsigned int nr_pages)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr_pages; i++) {
+ if (swap_map[start + i])
+ return false;
+ }
+
return true;
}

/*
- * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
- * might involve allocating a new cluster for current CPU too.
+ * Try to get a swap entry (or size indicated by order) from current cpu's swap
+ * entry pool (a cluster). This might involve allocating a new cluster for
+ * current CPU too.
*/
static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
- unsigned long *offset, unsigned long *scan_base)
+ unsigned long *offset, unsigned long *scan_base, int order)
{
+ unsigned int nr_pages = 1 << order;
struct percpu_cluster *cluster;
struct swap_cluster_info *ci;
unsigned int tmp, max;

new_cluster:
cluster = this_cpu_ptr(si->percpu_cluster);
- tmp = cluster->next;
+ tmp = cluster->next[order];
if (tmp == SWAP_NEXT_INVALID) {
if (!cluster_list_empty(&si->free_clusters)) {
tmp = cluster_next(&si->free_clusters.head) *
@@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,

/*
* Other CPUs can use our cluster if they can't find a free cluster,
- * check if there is still free entry in the cluster
+ * check if there is still free entry in the cluster, maintaining
+ * natural alignment.
*/
max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
if (tmp < max) {
ci = lock_cluster(si, tmp);
while (tmp < max) {
- if (!si->swap_map[tmp])
+ if (swap_range_empty(si->swap_map, tmp, nr_pages))
break;
- tmp++;
+ tmp += nr_pages;
}
unlock_cluster(ci);
}
if (tmp >= max) {
- cluster->next = SWAP_NEXT_INVALID;
+ cluster->next[order] = SWAP_NEXT_INVALID;
goto new_cluster;
}
*offset = tmp;
*scan_base = tmp;
- tmp += 1;
- cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
+ tmp += nr_pages;
+ cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
return true;
}

@@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,

static int scan_swap_map_slots(struct swap_info_struct *si,
unsigned char usage, int nr,
- swp_entry_t slots[])
+ swp_entry_t slots[], unsigned int nr_pages)
{
struct swap_cluster_info *ci;
unsigned long offset;
unsigned long scan_base;
unsigned long last_in_cluster = 0;
int latency_ration = LATENCY_LIMIT;
+ int order = ilog2(nr_pages);
int n_ret = 0;
bool scanned_many = false;

@@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
* And we let swap pages go all over an SSD partition. Hugh
*/

+ if (nr_pages > 1) {
+ /*
+ * Should not even be attempting large allocations when huge
+ * page swap is disabled. Warn and fail the allocation.
+ */
+ if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+ nr_pages > SWAPFILE_CLUSTER ||
+ !is_power_of_2(nr_pages)) {
+ VM_WARN_ON_ONCE(1);
+ return 0;
+ }
+
+ /*
+ * Swapfile is not block device or not using clusters so unable
+ * to allocate large entries.
+ */
+ if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+ return 0;
+ }
+
si->flags += SWP_SCANNING;
/*
* Use percpu scan base for SSD to reduce lock contention on
@@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,

/* SSD algorithm */
if (si->cluster_info) {
- if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+ if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
+ if (order > 0)
+ goto no_page;
goto scan;
+ }
} else if (unlikely(!si->cluster_nr--)) {
if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
si->cluster_nr = SWAPFILE_CLUSTER - 1;
@@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,

checks:
if (si->cluster_info) {
- while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+ while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
/* take a break if we already got some slots */
if (n_ret)
goto done;
if (!scan_swap_map_try_ssd_cluster(si, &offset,
- &scan_base))
+ &scan_base, order)) {
+ if (order > 0)
+ goto no_page;
goto scan;
+ }
}
}
if (!(si->flags & SWP_WRITEOK))
goto no_page;
if (!si->highest_bit)
goto no_page;
- if (offset > si->highest_bit)
+ if (order == 0 && offset > si->highest_bit)
scan_base = offset = si->lowest_bit;

ci = lock_cluster(si, offset);
/* reuse swap entry of cache-only swap if not busy. */
if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
int swap_was_freed;
+ VM_WARN_ON(order > 0);
unlock_cluster(ci);
spin_unlock(&si->lock);
swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
@@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}

if (si->swap_map[offset]) {
+ VM_WARN_ON(order > 0);
unlock_cluster(ci);
if (!n_ret)
goto scan;
else
goto done;
}
- WRITE_ONCE(si->swap_map[offset], usage);
- inc_cluster_info_page(si, si->cluster_info, offset);
+ memset(si->swap_map + offset, usage, nr_pages);
+ add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
unlock_cluster(ci);

- swap_range_alloc(si, offset, 1);
+ swap_range_alloc(si, offset, nr_pages);
slots[n_ret++] = swp_entry(si->type, offset);

/* got enough slots or reach max slots? */
@@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,

/* try to get more slots in cluster */
if (si->cluster_info) {
- if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+ if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
goto checks;
+ if (order > 0)
+ goto done;
} else if (si->cluster_nr && !si->swap_map[++offset]) {
/* non-ssd case, still more slots in cluster? */
--si->cluster_nr;
@@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}

done:
- set_cluster_next(si, offset + 1);
+ if (order == 0)
+ set_cluster_next(si, offset + 1);
si->flags -= SWP_SCANNING;
return n_ret;

@@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
return n_ret;
}

-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
-{
- unsigned long idx;
- struct swap_cluster_info *ci;
- unsigned long offset;
-
- /*
- * Should not even be attempting cluster allocations when huge
- * page swap is disabled. Warn and fail the allocation.
- */
- if (!IS_ENABLED(CONFIG_THP_SWAP)) {
- VM_WARN_ON_ONCE(1);
- return 0;
- }
-
- if (cluster_list_empty(&si->free_clusters))
- return 0;
-
- idx = cluster_list_first(&si->free_clusters);
- offset = idx * SWAPFILE_CLUSTER;
- ci = lock_cluster(si, offset);
- alloc_cluster(si, idx);
- cluster_set_count(ci, SWAPFILE_CLUSTER);
-
- memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
- unlock_cluster(ci);
- swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
- *slot = swp_entry(si->type, offset);
-
- return 1;
-}
-
static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
{
unsigned long offset = idx * SWAPFILE_CLUSTER;
@@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
int n_ret = 0;
int node;

- /* Only single cluster request supported */
- WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+ /* Only single THP request supported */
+ WARN_ON_ONCE(n_goal > 1 && size > 1);

spin_lock(&swap_avail_lock);

@@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
spin_unlock(&si->lock);
goto nextsi;
}
- if (size == SWAPFILE_CLUSTER) {
- if (si->flags & SWP_BLKDEV)
- n_ret = swap_alloc_cluster(si, swp_entries);
- } else
- n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
- n_goal, swp_entries);
+ n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+ n_goal, swp_entries, size);
spin_unlock(&si->lock);
- if (n_ret || size == SWAPFILE_CLUSTER)
+ if (n_ret || size > 1)
goto check_out;
cond_resched();

@@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)

/* This is called for allocating swap entry, not cache */
spin_lock(&si->lock);
- if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
+ if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
atomic_long_dec(&nr_swap_pages);
spin_unlock(&si->lock);
fail:
@@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
p->flags |= SWP_SYNCHRONOUS_IO;

if (p->bdev && bdev_nonrot(p->bdev)) {
- int cpu;
+ int cpu, i;
unsigned long ci, nr_cluster;

p->flags |= SWP_SOLIDSTATE;
@@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
struct percpu_cluster *cluster;

cluster = per_cpu_ptr(p->percpu_cluster, cpu);
- cluster->next = SWAP_NEXT_INVALID;
+ for (i = 0; i < SWAP_NR_ORDERS; i++)
+ cluster->next[i] = SWAP_NEXT_INVALID;
}
} else {
atomic_inc(&nr_rotate_swap);
--
2.25.1


2024-03-11 15:03:40

by Ryan Roberts

Subject: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
folio that is fully and contiguously mapped in the pageout/cold vm
range. This change means that large folios will be maintained all the
way to swap storage. This both improves performance during swap-out, by
eliding the cost of splitting the folio, and sets us up nicely for
maintaining the large folio when it is swapped back in (to be covered in
a separate series).

Folios that are not fully mapped in the target range are still split,
but note that the behavior is changed so that if the split fails for any
reason (folio locked, shared, etc.) we now leave it as is and move to the
next pte in the range and continue work on the subsequent folios.
Previously, any failure of this sort would cause the entire operation to
give up, and no folios mapped at higher addresses were paged out or made
cold. Given that large folios are becoming more common, this old behavior
would likely have led to wasted opportunities.

While we are at it, change the code that clears young from the ptes to
use ptep_test_and_clear_young(), which is more efficient than
get_and_clear/modify/set, especially for contpte mappings on arm64,
where the old approach would require unfolding/refolding and the new
approach can be done in place.
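
For comparison, a sketch of the old and new forms for a single pte (the two
wrapper names are invented for this sketch; in the hunk below the new form
runs across the whole batch):

/* Old: read-modify-write, which forces a contpte unfold/refold on arm64. */
static inline void clear_young_rmw(struct mm_struct *mm, unsigned long addr,
                                   pte_t *pte, int fullmm)
{
        pte_t ptent = ptep_get_and_clear_full(mm, addr, pte, fullmm);

        ptent = pte_mkold(ptent);
        set_pte_at(mm, addr, pte, ptent);
}

/* New: clear only the access bit, in place. */
static inline int clear_young_in_place(struct vm_area_struct *vma,
                                       unsigned long addr, pte_t *pte)
{
        return ptep_test_and_clear_young(vma, addr, pte);
}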

Signed-off-by: Ryan Roberts <[email protected]>
---
mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
1 file changed, 51 insertions(+), 38 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 547dcd1f7a39..56c7ba7bd558 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
LIST_HEAD(folio_list);
bool pageout_anon_only_filter;
unsigned int batch_count = 0;
+ int nr;

if (fatal_signal_pending(current))
return -EINTR;
@@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
- for (; addr < end; pte++, addr += PAGE_SIZE) {
+ for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
+ nr = 1;
ptent = ptep_get(pte);

if (++batch_count == SWAP_CLUSTER_MAX) {
@@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
continue;

/*
- * Creating a THP page is expensive so split it only if we
- * are sure it's worth. Split it if we are only owner.
+ * If we encounter a large folio, only split it if it is not
+ * fully mapped within the range we are operating on. Otherwise
+ * leave it as is so that it can be swapped out whole. If we
+ * fail to split a folio, leave it in place and advance to the
+ * next pte in the range.
*/
if (folio_test_large(folio)) {
- int err;
-
- if (folio_estimated_sharers(folio) > 1)
- break;
- if (pageout_anon_only_filter && !folio_test_anon(folio))
- break;
- if (!folio_trylock(folio))
- break;
- folio_get(folio);
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(start_pte, ptl);
- start_pte = NULL;
- err = split_folio(folio);
- folio_unlock(folio);
- folio_put(folio);
- if (err)
- break;
- start_pte = pte =
- pte_offset_map_lock(mm, pmd, addr, &ptl);
- if (!start_pte)
- break;
- arch_enter_lazy_mmu_mode();
- pte--;
- addr -= PAGE_SIZE;
- continue;
+ const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+ FPB_IGNORE_SOFT_DIRTY;
+ int max_nr = (end - addr) / PAGE_SIZE;
+
+ nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
+ fpb_flags, NULL);
+
+ if (nr < folio_nr_pages(folio)) {
+ int err;
+
+ if (folio_estimated_sharers(folio) > 1)
+ continue;
+ if (pageout_anon_only_filter && !folio_test_anon(folio))
+ continue;
+ if (!folio_trylock(folio))
+ continue;
+ folio_get(folio);
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ start_pte = NULL;
+ err = split_folio(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ if (err)
+ continue;
+ start_pte = pte =
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ break;
+ arch_enter_lazy_mmu_mode();
+ nr = 0;
+ continue;
+ }
}

/*
* Do not interfere with other mappings of this folio and
- * non-LRU folio.
+ * non-LRU folio. If we have a large folio at this point, we
+ * know it is fully mapped so if its mapcount is the same as its
+ * number of pages, it must be exclusive.
*/
- if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
+ if (!folio_test_lru(folio) ||
+ folio_mapcount(folio) != folio_nr_pages(folio))
continue;

if (pageout_anon_only_filter && !folio_test_anon(folio))
continue;

- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
- if (!pageout && pte_young(ptent)) {
- ptent = ptep_get_and_clear_full(mm, addr, pte,
- tlb->fullmm);
- ptent = pte_mkold(ptent);
- set_pte_at(mm, addr, pte, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
+ if (!pageout) {
+ for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
+ if (ptep_test_and_clear_young(vma, addr, pte))
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
}

/*
--
2.25.1


2024-03-11 15:05:04

by Ryan Roberts

Subject: [PATCH v4 3/6] mm: swap: Simplify struct percpu_cluster

struct percpu_cluster stores the index of the cpu's current cluster and the
offset of the next entry that will be allocated for the cpu. These two
pieces of information are redundant because the cluster index is just
(offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
cluster index is that the structure used for it also has a flag to
indicate "no cluster". However, this data structure also contains a spin
lock, which is never used in this context; as a side effect the code
copies the spinlock_t structure, which is questionable coding practice
in my view.

So let's clean this up and store only the next offset, and use a
sentinel value (SWAP_NEXT_INVALID) to indicate "no cluster".
SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen
legitimately; the first page in the swap file is the swap header, which
is always marked bad to prevent it from being allocated as an entry.
This also prevents the cluster to which it belongs being marked free, so
it will never appear on the free list.

This change saves 16 bytes per cpu. And given we are shortly going to
extend this mechanism to be per-cpu-AND-per-order, we will end up saving
16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
system.
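
A minimal sketch of the relationships this relies on (the two helper names
are invented for this sketch; SWAPFILE_CLUSTER and SWAP_NEXT_INVALID are as
used in this patch):

struct percpu_cluster {
        unsigned int next;      /* Likely next allocation offset */
};

/* The cluster index is implied by the offset, so it need not be stored. */
static inline unsigned int cluster_idx(unsigned int next)
{
        return next / SWAPFILE_CLUSTER;
}

/*
 * Offset 0 is the swap header page, which is always marked bad, so 0 can
 * never be a legitimate next offset and is safe to use as the sentinel.
 */
static inline bool percpu_cluster_valid(struct percpu_cluster *cluster)
{
        return cluster->next != SWAP_NEXT_INVALID;
}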

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/swap.h | 9 ++++++++-
mm/swapfile.c | 22 +++++++++++-----------
2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f2b7f204b968..0cb082bee717 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,13 +260,20 @@ struct swap_cluster_info {
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */

+/*
+ * The first page in the swap file is the swap header, which is always marked
+ * bad to prevent it from being allocated as an entry. This also prevents the
+ * cluster to which it belongs being marked free. Therefore 0 is safe to use as
+ * a sentinel to indicate next is not valid in percpu_cluster.
+ */
+#define SWAP_NEXT_INVALID 0
+
/*
* We assign a cluster to each CPU, so each CPU can allocate swap entry from
* its own cluster and swapout sequentially. The purpose is to optimize swapout
* throughput.
*/
struct percpu_cluster {
- struct swap_cluster_info index; /* Current cluster index */
unsigned int next; /* Likely next allocation offset */
};

diff --git a/mm/swapfile.c b/mm/swapfile.c
index ee7e44cb40c5..3828d81aa6b8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -609,7 +609,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
return false;

percpu_cluster = this_cpu_ptr(si->percpu_cluster);
- cluster_set_null(&percpu_cluster->index);
+ percpu_cluster->next = SWAP_NEXT_INVALID;
return true;
}

@@ -622,14 +622,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
{
struct percpu_cluster *cluster;
struct swap_cluster_info *ci;
- unsigned long tmp, max;
+ unsigned int tmp, max;

new_cluster:
cluster = this_cpu_ptr(si->percpu_cluster);
- if (cluster_is_null(&cluster->index)) {
+ tmp = cluster->next;
+ if (tmp == SWAP_NEXT_INVALID) {
if (!cluster_list_empty(&si->free_clusters)) {
- cluster->index = si->free_clusters.head;
- cluster->next = cluster_next(&cluster->index) *
+ tmp = cluster_next(&si->free_clusters.head) *
SWAPFILE_CLUSTER;
} else if (!cluster_list_empty(&si->discard_clusters)) {
/*
@@ -649,9 +649,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
* Other CPUs can use our cluster if they can't find a free cluster,
* check if there is still free entry in the cluster
*/
- tmp = cluster->next;
- max = min_t(unsigned long, si->max,
- (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+ max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
if (tmp < max) {
ci = lock_cluster(si, tmp);
while (tmp < max) {
@@ -662,12 +660,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
unlock_cluster(ci);
}
if (tmp >= max) {
- cluster_set_null(&cluster->index);
+ cluster->next = SWAP_NEXT_INVALID;
goto new_cluster;
}
- cluster->next = tmp + 1;
*offset = tmp;
*scan_base = tmp;
+ tmp += 1;
+ cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
return true;
}

@@ -3138,8 +3137,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
}
for_each_possible_cpu(cpu) {
struct percpu_cluster *cluster;
+
cluster = per_cpu_ptr(p->percpu_cluster, cpu);
- cluster_set_null(&cluster->index);
+ cluster->next = SWAP_NEXT_INVALID;
}
} else {
atomic_inc(&nr_rotate_swap);
--
2.25.1


2024-03-11 15:05:19

by Ryan Roberts

Subject: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

Now that swap supports storing all mTHP sizes, avoid splitting large
folios before swap-out. This benefits performance of the swap-out path
by eliding split_folio_to_list(), which is expensive, and also sets us
up for swapping in large folios in a future series.

If the folio is partially mapped, we continue to split it since we want
to avoid the extra IO overhead and storage of writing out pages
unnecessarily.

Signed-off-by: Ryan Roberts <[email protected]>
---
mm/vmscan.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cf7d4cf47f1a..0ebec99e04c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (!can_split_folio(folio, NULL))
goto activate_locked;
/*
- * Split folios without a PMD map right
- * away. Chances are some or all of the
- * tail pages can be freed without IO.
+ * Split partially mapped folios map
+ * right away. Chances are some or all
+ * of the tail pages can be freed
+ * without IO.
*/
- if (!folio_entire_mapcount(folio) &&
+ if (!list_empty(&folio->_deferred_list) &&
split_folio_to_list(folio,
folio_list))
goto activate_locked;
--
2.25.1


2024-03-11 22:32:26

by Barry Song

Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On Mon, Mar 11, 2024 at 11:01 PM Ryan Roberts <[email protected]> wrote:
>
> Now that swap supports storing all mTHP sizes, avoid splitting large
> folios before swap-out. This benefits performance of the swap-out path
> by eliding split_folio_to_list(), which is expensive, and also sets us
> up for swapping in large folios in a future series.
>
> If the folio is partially mapped, we continue to split it since we want
> to avoid the extra IO overhead and storage of writing out pages
> uneccessarily.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> mm/vmscan.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cf7d4cf47f1a..0ebec99e04c6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (!can_split_folio(folio, NULL))
> goto activate_locked;
> /*
> - * Split folios without a PMD map right
> - * away. Chances are some or all of the
> - * tail pages can be freed without IO.
> + * Split partially mapped folios map
> + * right away. Chances are some or all
> + * of the tail pages can be freed
> + * without IO.
> */
> - if (!folio_entire_mapcount(folio) &&
> + if (!list_empty(&folio->_deferred_list) &&

Hi Ryan,
After reconsidering our previous discussion about PMD-mapped large
folios, I've pondered
the possibility of PMD-mapped Transparent Huge Pages (THPs) being
mapped by multiple
processes. In such a scenario, if one process decides to unmap a
portion of the folio while
others retain the entire mapping, it raises questions about how the
system should handle
this situation. Would the large folio be placed in a deferred list? If
so, splitting it might not
yield benefits, as neither I/O nor swap slot usage would increase in this
case by not splitting it.

Regarding PTE-mapped large folios, the absence of an indicator like
"entire_map" makes it
challenging to identify cases where the entire folio is mapped. Thus,
splitting seems to be
the only viable solution in such circumstances.

> split_folio_to_list(folio,
> folio_list))
> goto activate_locked;
> --
> 2.25.1

Thanks
Barry

2024-03-12 07:53:20

by Huang, Ying

Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

Ryan Roberts <[email protected]> writes:

> Multi-size THP enables performance improvements by allocating large,
> pte-mapped folios for anonymous memory. However I've observed that on an
> arm64 system running a parallel workload (e.g. kernel compilation)
> across many cores, under high memory pressure, the speed regresses. This
> is due to bottlenecking on the increased number of TLBIs added due to
> all the extra folio splitting when the large folios are swapped out.
>
> Therefore, solve this regression by adding support for swapping out mTHP
> without needing to split the folio, just like is already done for
> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
> and when the swap backing store is a non-rotating block device. These
> are the same constraints as for the existing PMD-sized THP swap-out
> support.
>
> Note that no attempt is made to swap-in (m)THP here - this is still done
> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
> prerequisite for swapping-in mTHP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller can fall back to splitting the folio
> and allocates individual entries (as per existing PMD-sized THP
> fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleving pages from different
> tasks, which would prevent IO being batched. This is already done for
> the order-0 allocations so we follow the same pattern.
>
> As is done for order-0 per-cpu clusters, the scanner now can steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> This change only modifies swap to be able to accept any order mTHP. It
> doesn't change the callers to elide doing the actual split. That will be
> done in separate changes.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/swap.h | 8 ++-
> mm/swapfile.c | 167 +++++++++++++++++++++++++------------------
> 2 files changed, 103 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0cb082bee717..39b5c18ccc6a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -268,13 +268,19 @@ struct swap_cluster_info {
> */
> #define SWAP_NEXT_INVALID 0
>
> +#ifdef CONFIG_THP_SWAP
> +#define SWAP_NR_ORDERS (PMD_ORDER + 1)
> +#else
> +#define SWAP_NR_ORDERS 1
> +#endif
> +
> /*
> * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> * its own cluster and swapout sequentially. The purpose is to optimize swapout
> * throughput.
> */
> struct percpu_cluster {
> - unsigned int next; /* Likely next allocation offset */
> + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> };
>
> struct swap_cluster_list {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3828d81aa6b8..61118a090796 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>
> /*
> * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
> */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> + struct swap_cluster_info *cluster_info, unsigned long page_nr,
> + unsigned long count)
> {
> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>
> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> if (cluster_is_free(&cluster_info[idx]))
> alloc_cluster(p, idx);
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) + 1);
> + cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased by 1.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> + struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> + add_cluster_info_page(p, cluster_info, page_nr, 1);
> }
>
> /*
> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
> */
> static bool
> scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> - unsigned long offset)
> + unsigned long offset, int order)
> {
> struct percpu_cluster *percpu_cluster;
> bool conflict;
> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> return false;
>
> percpu_cluster = this_cpu_ptr(si->percpu_cluster);
> - percpu_cluster->next = SWAP_NEXT_INVALID;
> + percpu_cluster->next[order] = SWAP_NEXT_INVALID;
> + return true;
> +}
> +
> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
> + unsigned int nr_pages)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr_pages; i++) {
> + if (swap_map[start + i])
> + return false;
> + }
> +
> return true;
> }
>
> /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * Try to get a swap entry (or size indicated by order) from current cpu's swap

IMO, it's not necessary to make mTHP a special case other than base
page. So, this can be changed to

* Try to get swap entries with specified order from current cpu's swap

> + * entry pool (a cluster). This might involve allocating a new cluster for
> + * current CPU too.
> */
> static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> - unsigned long *offset, unsigned long *scan_base)
> + unsigned long *offset, unsigned long *scan_base, int order)
> {
> + unsigned int nr_pages = 1 << order;
> struct percpu_cluster *cluster;
> struct swap_cluster_info *ci;
> unsigned int tmp, max;
>
> new_cluster:
> cluster = this_cpu_ptr(si->percpu_cluster);
> - tmp = cluster->next;
> + tmp = cluster->next[order];
> if (tmp == SWAP_NEXT_INVALID) {
> if (!cluster_list_empty(&si->free_clusters)) {
> tmp = cluster_next(&si->free_clusters.head) *
> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>
> /*
> * Other CPUs can use our cluster if they can't find a free cluster,
> - * check if there is still free entry in the cluster
> + * check if there is still free entry in the cluster, maintaining
> + * natural alignment.
> */
> max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
> if (tmp < max) {
> ci = lock_cluster(si, tmp);
> while (tmp < max) {
> - if (!si->swap_map[tmp])
> + if (swap_range_empty(si->swap_map, tmp, nr_pages))
> break;
> - tmp++;
> + tmp += nr_pages;
> }
> unlock_cluster(ci);
> }
> if (tmp >= max) {
> - cluster->next = SWAP_NEXT_INVALID;
> + cluster->next[order] = SWAP_NEXT_INVALID;
> goto new_cluster;
> }
> *offset = tmp;
> *scan_base = tmp;
> - tmp += 1;
> - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
> + tmp += nr_pages;
> + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
> return true;
> }
>
> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>
> static int scan_swap_map_slots(struct swap_info_struct *si,
> unsigned char usage, int nr,
> - swp_entry_t slots[])
> + swp_entry_t slots[], unsigned int nr_pages)

IMHO, it's better to use order as parameter directly. We can change the
parameter of get_swap_pages() too.
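
For example, just as a rough sketch (the exact parameter name is illustrative,
not taken from the patch):

        static int scan_swap_map_slots(struct swap_info_struct *si,
                                       unsigned char usage, int nr,
                                       swp_entry_t slots[], int order);

        /* inside, the page count is then derived once: */
        unsigned int nr_pages = 1 << order;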

> {
> struct swap_cluster_info *ci;
> unsigned long offset;
> unsigned long scan_base;
> unsigned long last_in_cluster = 0;
> int latency_ration = LATENCY_LIMIT;
> + int order = ilog2(nr_pages);
> int n_ret = 0;
> bool scanned_many = false;
>
> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> * And we let swap pages go all over an SSD partition. Hugh
> */
>
> + if (nr_pages > 1) {
> + /*
> + * Should not even be attempting large allocations when huge
> + * page swap is disabled. Warn and fail the allocation.
> + */
> + if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> + nr_pages > SWAPFILE_CLUSTER ||
> + !is_power_of_2(nr_pages)) {
> + VM_WARN_ON_ONCE(1);
> + return 0;
> + }
> +
> + /*
> + * Swapfile is not block device or not using clusters so unable
> + * to allocate large entries.
> + */
> + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
> + return 0;
> + }
> +
> si->flags += SWP_SCANNING;
> /*
> * Use percpu scan base for SSD to reduce lock contention on
> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>
> /* SSD algorithm */
> if (si->cluster_info) {
> - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
> + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
> + if (order > 0)
> + goto no_page;
> goto scan;
> + }
> } else if (unlikely(!si->cluster_nr--)) {
> if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
> si->cluster_nr = SWAPFILE_CLUSTER - 1;
> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>
> checks:
> if (si->cluster_info) {
> - while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
> + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
> /* take a break if we already got some slots */
> if (n_ret)
> goto done;
> if (!scan_swap_map_try_ssd_cluster(si, &offset,
> - &scan_base))
> + &scan_base, order)) {
> + if (order > 0)
> + goto no_page;
> goto scan;
> + }
> }
> }
> if (!(si->flags & SWP_WRITEOK))
> goto no_page;
> if (!si->highest_bit)
> goto no_page;
> - if (offset > si->highest_bit)
> + if (order == 0 && offset > si->highest_bit)

I don't think that we need to check "order == 0" here. The original
condition will always be false for "order != 0".

> scan_base = offset = si->lowest_bit;
>
> ci = lock_cluster(si, offset);
> /* reuse swap entry of cache-only swap if not busy. */
> if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> int swap_was_freed;
> + VM_WARN_ON(order > 0);

Instead of adding the WARN here, I think that it's better to add a WARN at the
beginning of "scan" label. We should never scan if "order > 0", it can
capture even more abnormal status.
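
Purely as a sketch of the suggested placement (surrounding scan code elided):

        scan:
                /* Large (order > 0) allocations must never reach the scanning slow path. */
                VM_WARN_ON(order > 0);
                /* ... existing slow-path scan loop ... */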

> unlock_cluster(ci);
> spin_unlock(&si->lock);
> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> }
>
> if (si->swap_map[offset]) {
> + VM_WARN_ON(order > 0);
> unlock_cluster(ci);
> if (!n_ret)
> goto scan;
> else
> goto done;
> }
> - WRITE_ONCE(si->swap_map[offset], usage);
> - inc_cluster_info_page(si, si->cluster_info, offset);
> + memset(si->swap_map + offset, usage, nr_pages);

Add barrier() here to correspond to the original WRITE_ONCE()?
unlock_cluster(ci) may be NOP for some swap devices.
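
i.e. something along these lines (sketch only; the comment wording is just a
suggestion):

        memset(si->swap_map + offset, usage, nr_pages);
        /*
         * memset() is a plain store, unlike the WRITE_ONCE() it replaces, and
         * unlock_cluster(ci) may be a NOP for some swap devices, so keep an
         * explicit compiler barrier to order the swap_map update.
         */
        barrier();
        add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
        unlock_cluster(ci);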

> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
> unlock_cluster(ci);
>
> - swap_range_alloc(si, offset, 1);
> + swap_range_alloc(si, offset, nr_pages);
> slots[n_ret++] = swp_entry(si->type, offset);
>
> /* got enough slots or reach max slots? */

If "order > 0", "nr" must be 1. So, we will "goto done" in the
following code.

/* got enough slots or reach max slots? */
if ((n_ret == nr) || (offset >= si->highest_bit))
goto done;

We can add VM_WARN_ON() here to capture some abnormal status.
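
i.e. roughly (sketch of the suggested check only):

        /* got enough slots or reach max slots? */
        if ((n_ret == nr) || (offset >= si->highest_bit))
                goto done;
        /* an order > 0 request uses nr == 1, so it should always hit "done" above */
        VM_WARN_ON(order > 0);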

> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>
> /* try to get more slots in cluster */
> if (si->cluster_info) {
> - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
> + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
> goto checks;
> + if (order > 0)
> + goto done;

Don't need to add this, if "order > 0", we will never go here.

> } else if (si->cluster_nr && !si->swap_map[++offset]) {
> /* non-ssd case, still more slots in cluster? */
> --si->cluster_nr;
> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> }
>
> done:
> - set_cluster_next(si, offset + 1);
> + if (order == 0)
> + set_cluster_next(si, offset + 1);
> si->flags -= SWP_SCANNING;
> return n_ret;
>
> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> return n_ret;
> }
>
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> -{
> - unsigned long idx;
> - struct swap_cluster_info *ci;
> - unsigned long offset;
> -
> - /*
> - * Should not even be attempting cluster allocations when huge
> - * page swap is disabled. Warn and fail the allocation.
> - */
> - if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> - VM_WARN_ON_ONCE(1);
> - return 0;
> - }
> -
> - if (cluster_list_empty(&si->free_clusters))
> - return 0;
> -
> - idx = cluster_list_first(&si->free_clusters);
> - offset = idx * SWAPFILE_CLUSTER;
> - ci = lock_cluster(si, offset);
> - alloc_cluster(si, idx);
> - cluster_set_count(ci, SWAPFILE_CLUSTER);
> -
> - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> - unlock_cluster(ci);
> - swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> - *slot = swp_entry(si->type, offset);
> -
> - return 1;
> -}
> -
> static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> {
> unsigned long offset = idx * SWAPFILE_CLUSTER;
> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> int n_ret = 0;
> int node;
>
> - /* Only single cluster request supported */
> - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> + /* Only single THP request supported */
> + WARN_ON_ONCE(n_goal > 1 && size > 1);
>
> spin_lock(&swap_avail_lock);
>
> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> spin_unlock(&si->lock);
> goto nextsi;
> }
> - if (size == SWAPFILE_CLUSTER) {
> - if (si->flags & SWP_BLKDEV)
> - n_ret = swap_alloc_cluster(si, swp_entries);
> - } else
> - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> - n_goal, swp_entries);
> + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> + n_goal, swp_entries, size);
> spin_unlock(&si->lock);
> - if (n_ret || size == SWAPFILE_CLUSTER)
> + if (n_ret || size > 1)
> goto check_out;
> cond_resched();
>
> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)
>
> /* This is called for allocating swap entry, not cache */
> spin_lock(&si->lock);
> - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
> + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
> atomic_long_dec(&nr_swap_pages);
> spin_unlock(&si->lock);
> fail:
> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> p->flags |= SWP_SYNCHRONOUS_IO;
>
> if (p->bdev && bdev_nonrot(p->bdev)) {
> - int cpu;
> + int cpu, i;
> unsigned long ci, nr_cluster;
>
> p->flags |= SWP_SOLIDSTATE;
> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> struct percpu_cluster *cluster;
>
> cluster = per_cpu_ptr(p->percpu_cluster, cpu);
> - cluster->next = SWAP_NEXT_INVALID;
> + for (i = 0; i < SWAP_NR_ORDERS; i++)
> + cluster->next[i] = SWAP_NEXT_INVALID;
> }
> } else {
> atomic_inc(&nr_rotate_swap);

You also need to check whether we should add swap_entry_size() for some
functions to optimize for small systems. We may need to add swap_order()
too.
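
For illustration, a swap_entry_order() could mirror the existing
swap_entry_size() pattern (sketch only, name assumed):

        #ifdef CONFIG_THP_SWAP
        #define swap_entry_order(order)        (order)
        #else
        #define swap_entry_order(order)        0
        #endif

so that on !CONFIG_THP_SWAP builds the order is a compile-time constant 0 and
the large-folio paths can be optimized out, as swap_entry_size() allows today.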

--
Best Regards,
Huang, Ying

2024-03-12 07:54:22

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 3/6] mm: swap: Simplify struct percpu_cluster

Ryan Roberts <[email protected]> writes:

> struct percpu_cluster stores the index of cpu's current cluster and the
> offset of the next entry that will be allocated for the cpu. These two
> pieces of information are redundant because the cluster index is just
> (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
> cluster index is because the structure used for it also has a flag to
> indicate "no cluster". However this data structure also contains a spin
> lock, which is never used in this context, as a side effect the code
> copies the spinlock_t structure, which is questionable coding practice
> in my view.
>
> So let's clean this up and store only the next offset, and use a
> sentinel value (SWAP_NEXT_INVALID) to indicate "no cluster".
> SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen
> legitimately; the first page in the swap file is the swap header, which
> is always marked bad to prevent it from being allocated as an entry.
> This also prevents the cluster to which it belongs being marked free, so
> it will never appear on the free list.
>
> This change saves 16 bytes per cpu. And given we are shortly going to
> extend this mechanism to be per-cpu-AND-per-order, we will end up saving
> 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
> system.
>
> Signed-off-by: Ryan Roberts <[email protected]>

LGTM, Thanks!

--
Best Regards,
Huang, Ying


2024-03-12 08:03:22

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting

Ryan Roberts <[email protected]> writes:

> Hi All,
>
> This series adds support for swapping out multi-size THP (mTHP) without needing
> to first split the large folio via split_huge_page_to_list_to_order(). It
> closely follows the approach already used to swap-out PMD-sized THP.
>
> There are a couple of reasons for swapping out mTHP without splitting:
>
> - Performance: It is expensive to split a large folio and under extreme memory
> pressure some workloads regressed performance when using 64K mTHP vs 4K
> small folios because of this extra cost in the swap-out path. This series
> not only eliminates the regression but makes it faster to swap out 64K mTHP
> vs 4K small folios.
>
> - Memory fragmentation avoidance: If we can avoid splitting a large folio
> memory is less likely to become fragmented, making it easier to re-allocate
> a large folio in future.
>
> - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
> means we won't lose the TLB-efficiency benefits of mTHP once the memory has
> been through a swap cycle.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> that this is sufficient.
>
>
> Performance Testing
> ===================
>
> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
> VM is set up with a 35G block ram device as the swap device and the test is run
> from inside a memcg limited to 40G memory. I've then run `usemem` from
> vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
> repeated everything 6 times and taken the mean performance improvement relative
> to 4K page baseline:
>
> | alloc size | baseline | + this series |
> | | v6.6-rc4+anonfolio | |
> |:-----------|--------------------:|--------------------:|
> | 4K Page | 0.0% | 1.4% |
> | 64K THP | -14.6% | 44.2% |
> | 2M THP | 87.4% | 97.7% |
>
> So with this change, the 64K swap performance goes from a 15% regression to a
> 44% improvement. 4K and 2M swap improves slightly too.

I don't understand why the performance of 2M THP improves. The swap
entry allocation becomes a little slower. Can you provide some
perf-profile to root cause it?

--
Best Regards,
Huang, Ying

> This test also acts as a good stress test for swap and, more generally mm. A
> couple of existing bugs were found as a result [5] [6].
>
>
> ---
> The series applies against mm-unstable (d7182786dd0a). Although I've
> additionally been running with a couple of extra fixes to avoid the issues at
> [6].
>
>
> Changes since v3 [3]
> ====================
>
> - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
> - Simplified max offset calculation (per Huang, Ying)
> - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
> offset (per Huang, Ying)
> - Removed swap_alloc_large() and merged its functionality into
> scan_swap_map_slots() (per Huang, Ying)
> - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
> by freeing swap entries in batches (see patch 2) (per DavidH)
> - vmscan splits folio if it's partially mapped (per Barry Song, DavidH)
> - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
> - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
> since it's not actually a problem for THP as I first thought.
>
>
> Changes since v2 [2]
> ====================
>
> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> allocation. This required some refactoring to make everything work nicely
> (new patches 2 and 3).
> - Fix bug where nr_swap_pages would say there are pages available but the
> scanner would not be able to allocate them because they were reserved for the
> per-cpu allocator. We now allow stealing of order-0 entries from the high
> order per-cpu clusters (in addition to existing stealing from order-0
> per-cpu clusters).
>
>
> Changes since v1 [1]
> ====================
>
> - patch 1:
> - Use cluster_set_count() instead of cluster_set_count_flag() in
> swap_alloc_cluster() since we no longer have any flag to set. I was unable
> to kill cluster_set_count_flag() as proposed against v1 as other call
> sites depend explicitly setting flags to 0.
> - patch 2:
> - Moved large_next[] array into percpu_cluster to make it per-cpu
> (recommended by Huang, Ying).
> - large_next[] array is dynamically allocated because PMD_ORDER is not
> compile-time constant for powerpc (fixes build error).
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
> [4] https://lore.kernel.org/linux-mm/[email protected]/
> [5] https://lore.kernel.org/linux-mm/[email protected]/
> [6] https://lore.kernel.org/linux-mm/[email protected]/
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (6):
> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
> mm: swap: Simplify struct percpu_cluster
> mm: swap: Allow storage of all mTHP orders
> mm: vmscan: Avoid split during shrink_folio_list()
> mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
>
> include/linux/pgtable.h | 28 ++++
> include/linux/swap.h | 33 +++--
> mm/huge_memory.c | 3 -
> mm/internal.h | 48 +++++++
> mm/madvise.c | 101 ++++++++------
> mm/memory.c | 13 +-
> mm/swapfile.c | 298 ++++++++++++++++++++++------------------
> mm/vmscan.c | 9 +-
> 8 files changed, 332 insertions(+), 201 deletions(-)
>
> --
> 2.25.1

2024-03-12 08:12:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 11/03/2024 22:30, Barry Song wrote:
> On Mon, Mar 11, 2024 at 11:01 PM Ryan Roberts <[email protected]> wrote:
>>
>> Now that swap supports storing all mTHP sizes, avoid splitting large
>> folios before swap-out. This benefits performance of the swap-out path
>> by eliding split_folio_to_list(), which is expensive, and also sets us
>> up for swapping in large folios in a future series.
>>
>> If the folio is partially mapped, we continue to split it since we want
>> to avoid the extra IO overhead and storage of writing out pages
>> unnecessarily.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> mm/vmscan.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index cf7d4cf47f1a..0ebec99e04c6 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> if (!can_split_folio(folio, NULL))
>> goto activate_locked;
>> /*
>> - * Split folios without a PMD map right
>> - * away. Chances are some or all of the
>> - * tail pages can be freed without IO.
>> + * Split partially mapped folios map
>> + * right away. Chances are some or all
>> + * of the tail pages can be freed
>> + * without IO.
>> */
>> - if (!folio_entire_mapcount(folio) &&
>> + if (!list_empty(&folio->_deferred_list) &&
>
> Hi Ryan,
> After reconsidering our previous discussion about PMD-mapped large
> folios, I've pondered
> the possibility of PMD-mapped Transparent Huge Pages (THPs) being
> mapped by multiple
> processes. In such a scenario, if one process decides to unmap a
> portion of the folio while
> others retain the entire mapping, it raises questions about how the
> system should handle
> this situation. Would the large folio be placed in a deferred list?

No - if the large folio is entirely mapped (via PMD), then the folio will not be
put on the deferred split list in the first place. See __folio_remove_rmap():

last = (last < ENTIRELY_MAPPED);

means that nr will never be incremented above 0. (_nr_pages_mapped is
incremented by ENTIRELY_MAPPED for every PMD map).
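
To spell that out with a simplified view of the PTE case (not the literal
source):

        /* per removed PTE map of a large folio */
        last = atomic_dec_return_relaxed(mapped);  /* mapped is _nr_pages_mapped,   */
        last = (last < ENTIRELY_MAPPED);           /* biased by ENTIRELY_MAPPED for */
        if (last)                                  /* each remaining PMD map, so    */
                nr++;                              /* this stays false and nr == 0  */

With nr == 0, the deferred-split handling further down is never reached.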

> If
> so, splitting it might not
> yield benefits, as neither I/O nor swap slots would increase in this
> case by not splitting it.
>
> Regarding PTE-mapped large folios, the absence of an indicator like
> "entire_map" makes it
> challenging to identify cases where the entire folio is mapped. Thus,
> splitting seems to be
> the only viable solution in such circumstances.
>
>> split_folio_to_list(folio,
>> folio_list))
>> goto activate_locked;
>> --
>> 2.25.1
>
> Thanks
> Barry


2024-03-12 08:40:53

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On Tue, Mar 12, 2024 at 9:12 PM Ryan Roberts <[email protected]> wrote:
>
> On 11/03/2024 22:30, Barry Song wrote:
> > On Mon, Mar 11, 2024 at 11:01 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> Now that swap supports storing all mTHP sizes, avoid splitting large
> >> folios before swap-out. This benefits performance of the swap-out path
> >> by eliding split_folio_to_list(), which is expensive, and also sets us
> >> up for swapping in large folios in a future series.
> >>
> >> If the folio is partially mapped, we continue to split it since we want
> >> to avoid the extra IO overhead and storage of writing out pages
> >> unnecessarily.
> >>
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >> ---
> >> mm/vmscan.c | 9 +++++----
> >> 1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index cf7d4cf47f1a..0ebec99e04c6 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >> if (!can_split_folio(folio, NULL))
> >> goto activate_locked;
> >> /*
> >> - * Split folios without a PMD map right
> >> - * away. Chances are some or all of the
> >> - * tail pages can be freed without IO.
> >> + * Split partially mapped folios map
> >> + * right away. Chances are some or all
> >> + * of the tail pages can be freed
> >> + * without IO.
> >> */
> >> - if (!folio_entire_mapcount(folio) &&
> >> + if (!list_empty(&folio->_deferred_list) &&
> >
> > Hi Ryan,
> > After reconsidering our previous discussion about PMD-mapped large
> > folios, I've pondered
> > the possibility of PMD-mapped Transparent Huge Pages (THPs) being
> > mapped by multiple
> > processes. In such a scenario, if one process decides to unmap a
> > portion of the folio while
> > others retain the entire mapping, it raises questions about how the
> > system should handle
> > this situation. Would the large folio be placed in a deferred list?
>
> No - if the large folio is entirely mapped (via PMD), then the folio will not be
> put on the deferred split list in the first place. See __folio_remove_rmap():
>
> last = (last < ENTIRELY_MAPPED);
>
> means that nr will never be incremented above 0. (_nr_pages_mapped is
> incremented by ENTIRELY_MAPPED for every PMD map).

You are right, I missed this part; we break early in RMAP_LEVEL_PTE,
so we won't get to if (nr). Thanks for your clarification. Now we get
unified code for both PMD-mapped and PTE-mapped large folios. Feel free
to add,

Reviewed-by: Barry Song <[email protected]>

>
> > If
> > so, splitting it might not
> > yield benefits, as neither I/O nor swap slots would increase in this
> > case by not splitting it.
> >
> > Regarding PTE-mapped large folios, the absence of an indicator like
> > "entire_map" makes it
> > challenging to identify cases where the entire folio is mapped. Thus,
> > splitting seems to be
> > the only viable solution in such circumstances.
> >
> >> split_folio_to_list(folio,
> >> folio_list))
> >> goto activate_locked;
> >> --
> >> 2.25.1
>

Thanks
Barry

2024-03-12 08:45:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting

On 11/03/2024 15:00, Ryan Roberts wrote:
> Hi All,
>
> This series adds support for swapping out multi-size THP (mTHP) without needing
> to first split the large folio via split_huge_page_to_list_to_order(). It
> closely follows the approach already used to swap-out PMD-sized THP.
>
> There are a couple of reasons for swapping out mTHP without splitting:
>
> - Performance: It is expensive to split a large folio and under extreme memory
> pressure some workloads regressed performance when using 64K mTHP vs 4K
> small folios because of this extra cost in the swap-out path. This series
> not only eliminates the regression but makes it faster to swap out 64K mTHP
> vs 4K small folios.
>
> - Memory fragmentation avoidance: If we can avoid splitting a large folio
> memory is less likely to become fragmented, making it easier to re-allocate
> a large folio in future.
>
> - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
> means we won't lose the TLB-efficiency benefits of mTHP once the memory has
> been through a swap cycle.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> that this is sufficient.
>
>
> Performance Testing
> ===================
>
> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
> VM is set up with a 35G block ram device as the swap device and the test is run
> from inside a memcg limited to 40G memory. I've then run `usemem` from
> vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
> repeated everything 6 times and taken the mean performance improvement relative
> to 4K page baseline:
>
> | alloc size | baseline | + this series |
> | | v6.6-rc4+anonfolio | |

Oops, just noticed I failed to update these column headers. The baseline is
actually mm-unstable (d7182786dd0a) which is based on v6.8-rc5 and already
contains "anonfolio" - now called mTHP.


> |:-----------|--------------------:|--------------------:|
> | 4K Page | 0.0% | 1.4% |
> | 64K THP | -14.6% | 44.2% |
> | 2M THP | 87.4% | 97.7% |
>
> So with this change, the 64K swap performance goes from a 15% regression to a
> 44% improvement. 4K and 2M swap improves slightly too.
>
> This test also acts as a good stress test for swap and, more generally mm. A
> couple of existing bugs were found as a result [5] [6].
>
>
> ---
> The series applies against mm-unstable (d7182786dd0a). Although I've
> additionally been running with a couple of extra fixes to avoid the issues at
> [6].
>
>
> Changes since v3 [3]
> ====================
>
> - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
> - Simplified max offset calculation (per Huang, Ying)
> - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
> offset (per Huang, Ying)
> - Removed swap_alloc_large() and merged its functionality into
> scan_swap_map_slots() (per Huang, Ying)
> - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
> by freeing swap entries in batches (see patch 2) (per DavidH)
> - vmscan splits folio if it's partially mapped (per Barry Song, DavidH)
> - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
> - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
> since it's not actually a problem for THP as I first thought.
>
>
> Changes since v2 [2]
> ====================
>
> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> allocation. This required some refactoring to make everything work nicely
> (new patches 2 and 3).
> - Fix bug where nr_swap_pages would say there are pages available but the
> scanner would not be able to allocate them because they were reserved for the
> per-cpu allocator. We now allow stealing of order-0 entries from the high
> order per-cpu clusters (in addition to existing stealing from order-0
> per-cpu clusters).
>
>
> Changes since v1 [1]
> ====================
>
> - patch 1:
> - Use cluster_set_count() instead of cluster_set_count_flag() in
> swap_alloc_cluster() since we no longer have any flag to set. I was unable
> to kill cluster_set_count_flag() as proposed against v1 as other call
> sites depend explicitly setting flags to 0.
> - patch 2:
> - Moved large_next[] array into percpu_cluster to make it per-cpu
> (recommended by Huang, Ying).
> - large_next[] array is dynamically allocated because PMD_ORDER is not
> compile-time constant for powerpc (fixes build error).
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
> [4] https://lore.kernel.org/linux-mm/[email protected]/
> [5] https://lore.kernel.org/linux-mm/[email protected]/
> [6] https://lore.kernel.org/linux-mm/[email protected]/
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (6):
> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
> mm: swap: Simplify struct percpu_cluster
> mm: swap: Allow storage of all mTHP orders
> mm: vmscan: Avoid split during shrink_folio_list()
> mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
>
> include/linux/pgtable.h | 28 ++++
> include/linux/swap.h | 33 +++--
> mm/huge_memory.c | 3 -
> mm/internal.h | 48 +++++++
> mm/madvise.c | 101 ++++++++------
> mm/memory.c | 13 +-
> mm/swapfile.c | 298 ++++++++++++++++++++++------------------
> mm/vmscan.c | 9 +-
> 8 files changed, 332 insertions(+), 201 deletions(-)
>
> --
> 2.25.1
>


2024-03-12 08:50:06

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting

On 12/03/2024 08:01, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> Hi All,
>>
>> This series adds support for swapping out multi-size THP (mTHP) without needing
>> to first split the large folio via split_huge_page_to_list_to_order(). It
>> closely follows the approach already used to swap-out PMD-sized THP.
>>
>> There are a couple of reasons for swapping out mTHP without splitting:
>>
>> - Performance: It is expensive to split a large folio and under extreme memory
>> pressure some workloads regressed performance when using 64K mTHP vs 4K
>> small folios because of this extra cost in the swap-out path. This series
>> not only eliminates the regression but makes it faster to swap out 64K mTHP
>> vs 4K small folios.
>>
>> - Memory fragmentation avoidance: If we can avoid splitting a large folio
>> memory is less likely to become fragmented, making it easier to re-allocate
>> a large folio in future.
>>
>> - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
>> means we won't lose the TLB-efficiency benefits of mTHP once the memory has
>> been through a swap cycle.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>> that this is sufficient.
>>
>>
>> Performance Testing
>> ===================
>>
>> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
>> VM is set up with a 35G block ram device as the swap device and the test is run
>> from inside a memcg limited to 40G memory. I've then run `usemem` from
>> vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
>> repeated everything 6 times and taken the mean performance improvement relative
>> to 4K page baseline:
>>
>> | alloc size | baseline | + this series |
>> | | v6.6-rc4+anonfolio | |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page | 0.0% | 1.4% |
>> | 64K THP | -14.6% | 44.2% |
>> | 2M THP | 87.4% | 97.7% |
>>
>> So with this change, the 64K swap performance goes from a 15% regression to a
>> 44% improvement. 4K and 2M swap improves slightly too.
>
> I don't understand why the performance of 2M THP improves. The swap
> entry allocation becomes a little slower. Can you provide some
> perf-profile to root cause it?

I didn't post the stdev, which is quite large (~10%), so that may explain some
of it:

| kernel | mean_rel | std_rel |
|:---------|-----------:|----------:|
| base-4K | 0.0% | 5.5% |
| base-64K | -14.6% | 3.8% |
| base-2M | 87.4% | 10.6% |
| v4-4K | 1.4% | 3.7% |
| v4-64K | 44.2% | 11.8% |
| v4-2M | 97.7% | 13.3% |

Regardless, I'll do some perf profiling and post results shortly.

>
> --
> Best Regards,
> Huang, Ying
>
>> This test also acts as a good stress test for swap and, more generally mm. A
>> couple of existing bugs were found as a result [5] [6].
>>
>>
>> ---
>> The series applies against mm-unstable (d7182786dd0a). Although I've
>> additionally been running with a couple of extra fixes to avoid the issues at
>> [6].
>>
>>
>> Changes since v3 [3]
>> ====================
>>
>> - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
>> - Simplified max offset calculation (per Huang, Ying)
>> - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
>> offset (per Huang, Ying)
>> - Removed swap_alloc_large() and merged its functionality into
>> scan_swap_map_slots() (per Huang, Ying)
>> - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
>> by freeing swap entries in batches (see patch 2) (per DavidH)
>> - vmscan splits folio if it's partially mapped (per Barry Song, DavidH)
>> - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
>> - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
>> since it's not actually a problem for THP as I first thought.
>>
>>
>> Changes since v2 [2]
>> ====================
>>
>> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>> allocation. This required some refactoring to make everything work nicely
>> (new patches 2 and 3).
>> - Fix bug where nr_swap_pages would say there are pages available but the
>> scanner would not be able to allocate them because they were reserved for the
>> per-cpu allocator. We now allow stealing of order-0 entries from the high
>> order per-cpu clusters (in addition to existing stealing from order-0
>> per-cpu clusters).
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>> - patch 1:
>> - Use cluster_set_count() instead of cluster_set_count_flag() in
>> swap_alloc_cluster() since we no longer have any flag to set. I was unable
>> to kill cluster_set_count_flag() as proposed against v1 as other call
>> sites depend explicitly setting flags to 0.
>> - patch 2:
>> - Moved large_next[] array into percpu_cluster to make it per-cpu
>> (recommended by Huang, Ying).
>> - large_next[] array is dynamically allocated because PMD_ORDER is not
>> compile-time constant for powerpc (fixes build error).
>>
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/[email protected]/
>> [3] https://lore.kernel.org/linux-mm/[email protected]/
>> [4] https://lore.kernel.org/linux-mm/[email protected]/
>> [5] https://lore.kernel.org/linux-mm/[email protected]/
>> [6] https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (6):
>> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>> mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
>> mm: swap: Simplify struct percpu_cluster
>> mm: swap: Allow storage of all mTHP orders
>> mm: vmscan: Avoid split during shrink_folio_list()
>> mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
>>
>> include/linux/pgtable.h | 28 ++++
>> include/linux/swap.h | 33 +++--
>> mm/huge_memory.c | 3 -
>> mm/internal.h | 48 +++++++
>> mm/madvise.c | 101 ++++++++------
>> mm/memory.c | 13 +-
>> mm/swapfile.c | 298 ++++++++++++++++++++++------------------
>> mm/vmscan.c | 9 +-
>> 8 files changed, 332 insertions(+), 201 deletions(-)
>>
>> --
>> 2.25.1


2024-03-12 08:51:39

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 3/6] mm: swap: Simplify struct percpu_cluster

On 12/03/2024 07:52, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> struct percpu_cluster stores the index of cpu's current cluster and the
>> offset of the next entry that will be allocated for the cpu. These two
>> pieces of information are redundant because the cluster index is just
>> (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
>> cluster index is because the structure used for it also has a flag to
>> indicate "no cluster". However this data structure also contains a spin
>> lock, which is never used in this context, as a side effect the code
>> copies the spinlock_t structure, which is questionable coding practice
>> in my view.
>>
>> So let's clean this up and store only the next offset, and use a
>> sentinel value (SWAP_NEXT_INVALID) to indicate "no cluster".
>> SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen
>> legitimately; the first page in the swap file is the swap header, which
>> is always marked bad to prevent it from being allocated as an entry.
>> This also prevents the cluster to which it belongs being marked free, so
>> it will never appear on the free list.
>>
>> This change saves 16 bytes per cpu. And given we are shortly going to
>> extend this mechanism to be per-cpu-AND-per-order, we will end up saving
>> 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
>> system.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>
> LGTM, Thanks!

Thanks! What's a guy got to do to get Rb or Ack? :)

>
> --
> Best Regards,
> Huang, Ying
>


2024-03-12 09:41:46

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

On 12/03/2024 07:51, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> Multi-size THP enables performance improvements by allocating large,
>> pte-mapped folios for anonymous memory. However I've observed that on an
>> arm64 system running a parallel workload (e.g. kernel compilation)
>> across many cores, under high memory pressure, the speed regresses. This
>> is due to bottlenecking on the increased number of TLBIs added due to
>> all the extra folio splitting when the large folios are swapped out.
>>
>> Therefore, solve this regression by adding support for swapping out mTHP
>> without needing to split the folio, just like is already done for
>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>> and when the swap backing store is a non-rotating block device. These
>> are the same constraints as for the existing PMD-sized THP swap-out
>> support.
>>
>> Note that no attempt is made to swap-in (m)THP here - this is still done
>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>> prerequisite for swapping-in mTHP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller can fall back to splitting the folio
>> and allocating individual entries (as per existing PMD-sized THP
>> fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleaving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> This change only modifies swap to be able to accept any order mTHP. It
>> doesn't change the callers to elide doing the actual split. That will be
>> done in separate changes.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/swap.h | 8 ++-
>> mm/swapfile.c | 167 +++++++++++++++++++++++++------------------
>> 2 files changed, 103 insertions(+), 72 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0cb082bee717..39b5c18ccc6a 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>> */
>> #define SWAP_NEXT_INVALID 0
>>
>> +#ifdef CONFIG_THP_SWAP
>> +#define SWAP_NR_ORDERS (PMD_ORDER + 1)
>> +#else
>> +#define SWAP_NR_ORDERS 1
>> +#endif
>> +
>> /*
>> * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>> * its own cluster and swapout sequentially. The purpose is to optimize swapout
>> * throughput.
>> */
>> struct percpu_cluster {
>> - unsigned int next; /* Likely next allocation offset */
>> + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>> };
>>
>> struct swap_cluster_list {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 3828d81aa6b8..61118a090796 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>
>> /*
>> * The cluster corresponding to page_nr will be used. The cluster will be
>> - * removed from free cluster list and its usage counter will be increased.
>> + * removed from free cluster list and its usage counter will be increased by
>> + * count.
>> */
>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>> - struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +static void add_cluster_info_page(struct swap_info_struct *p,
>> + struct swap_cluster_info *cluster_info, unsigned long page_nr,
>> + unsigned long count)
>> {
>> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>
>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>> if (cluster_is_free(&cluster_info[idx]))
>> alloc_cluster(p, idx);
>>
>> - VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>> + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>> cluster_set_count(&cluster_info[idx],
>> - cluster_count(&cluster_info[idx]) + 1);
>> + cluster_count(&cluster_info[idx]) + count);
>> +}
>> +
>> +/*
>> + * The cluster corresponding to page_nr will be used. The cluster will be
>> + * removed from free cluster list and its usage counter will be increased by 1.
>> + */
>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>> + struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +{
>> + add_cluster_info_page(p, cluster_info, page_nr, 1);
>> }
>>
>> /*
>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>> */
>> static bool
>> scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> - unsigned long offset)
>> + unsigned long offset, int order)
>> {
>> struct percpu_cluster *percpu_cluster;
>> bool conflict;
>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> return false;
>>
>> percpu_cluster = this_cpu_ptr(si->percpu_cluster);
>> - percpu_cluster->next = SWAP_NEXT_INVALID;
>> + percpu_cluster->next[order] = SWAP_NEXT_INVALID;
>> + return true;
>> +}
>> +
>> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
>> + unsigned int nr_pages)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < nr_pages; i++) {
>> + if (swap_map[start + i])
>> + return false;
>> + }
>> +
>> return true;
>> }
>>
>> /*
>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> - * might involve allocating a new cluster for current CPU too.
>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>
> IMO, it's not necessary to make mTHP a special case other than base
> page. So, this can be changed to
>
> * Try to get swap entries with specified order from current cpu's swap

Sure, will fix in next version.

>
>> + * entry pool (a cluster). This might involve allocating a new cluster for
>> + * current CPU too.
>> */
>> static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> - unsigned long *offset, unsigned long *scan_base)
>> + unsigned long *offset, unsigned long *scan_base, int order)
>> {
>> + unsigned int nr_pages = 1 << order;
>> struct percpu_cluster *cluster;
>> struct swap_cluster_info *ci;
>> unsigned int tmp, max;
>>
>> new_cluster:
>> cluster = this_cpu_ptr(si->percpu_cluster);
>> - tmp = cluster->next;
>> + tmp = cluster->next[order];
>> if (tmp == SWAP_NEXT_INVALID) {
>> if (!cluster_list_empty(&si->free_clusters)) {
>> tmp = cluster_next(&si->free_clusters.head) *
>> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>
>> /*
>> * Other CPUs can use our cluster if they can't find a free cluster,
>> - * check if there is still free entry in the cluster
>> + * check if there is still free entry in the cluster, maintaining
>> + * natural alignment.
>> */
>> max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>> if (tmp < max) {
>> ci = lock_cluster(si, tmp);
>> while (tmp < max) {
>> - if (!si->swap_map[tmp])
>> + if (swap_range_empty(si->swap_map, tmp, nr_pages))
>> break;
>> - tmp++;
>> + tmp += nr_pages;
>> }
>> unlock_cluster(ci);
>> }
>> if (tmp >= max) {
>> - cluster->next = SWAP_NEXT_INVALID;
>> + cluster->next[order] = SWAP_NEXT_INVALID;
>> goto new_cluster;
>> }
>> *offset = tmp;
>> *scan_base = tmp;
>> - tmp += 1;
>> - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
>> + tmp += nr_pages;
>> + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>> return true;
>> }
>>
>> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>>
>> static int scan_swap_map_slots(struct swap_info_struct *si,
>> unsigned char usage, int nr,
>> - swp_entry_t slots[])
>> + swp_entry_t slots[], unsigned int nr_pages)
>
> IMHO, it's better to use order as parameter directly. We can change the
> parameter of get_swap_pages() too.

I agree that this will make the interface clearer/self documenting. I'll do it
in the next version.
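
For get_swap_pages() that would mean something like (sketch, not the final
patch):

        int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order);

with callers passing an order rather than a page count, and the page count
derived internally as 1 << order where still needed.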

>
>> {
>> struct swap_cluster_info *ci;
>> unsigned long offset;
>> unsigned long scan_base;
>> unsigned long last_in_cluster = 0;
>> int latency_ration = LATENCY_LIMIT;
>> + int order = ilog2(nr_pages);
>> int n_ret = 0;
>> bool scanned_many = false;
>>
>> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> * And we let swap pages go all over an SSD partition. Hugh
>> */
>>
>> + if (nr_pages > 1) {
>> + /*
>> + * Should not even be attempting large allocations when huge
>> + * page swap is disabled. Warn and fail the allocation.
>> + */
>> + if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> + nr_pages > SWAPFILE_CLUSTER ||
>> + !is_power_of_2(nr_pages)) {
>> + VM_WARN_ON_ONCE(1);
>> + return 0;
>> + }
>> +
>> + /*
>> + * Swapfile is not block device or not using clusters so unable
>> + * to allocate large entries.
>> + */
>> + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>> + return 0;
>> + }
>> +
>> si->flags += SWP_SCANNING;
>> /*
>> * Use percpu scan base for SSD to reduce lock contention on
>> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>
>> /* SSD algorithm */
>> if (si->cluster_info) {
>> - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>> + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
>> + if (order > 0)
>> + goto no_page;
>> goto scan;
>> + }
>> } else if (unlikely(!si->cluster_nr--)) {
>> if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
>> si->cluster_nr = SWAPFILE_CLUSTER - 1;
>> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>
>> checks:
>> if (si->cluster_info) {
>> - while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
>> + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
>> /* take a break if we already got some slots */
>> if (n_ret)
>> goto done;
>> if (!scan_swap_map_try_ssd_cluster(si, &offset,
>> - &scan_base))
>> + &scan_base, order)) {
>> + if (order > 0)
>> + goto no_page;
>> goto scan;
>> + }
>> }
>> }
>> if (!(si->flags & SWP_WRITEOK))
>> goto no_page;
>> if (!si->highest_bit)
>> goto no_page;
>> - if (offset > si->highest_bit)
>> + if (order == 0 && offset > si->highest_bit)
>
> I don't think that we need to check "order == 0" here. The original
> condition will always be false for "order != 0".

I spent ages looking at this and couldn't quite convince myself that this is
definitely safe. Certainly it would be catastrophic if we modified the returned
offset for a non-order-0 case (the code below assumes order-0 when checking). So
I decided in the end to be safe and add this condition. Looking again, I agree
with you. Will fix in next version.

>
>> scan_base = offset = si->lowest_bit;
>>
>> ci = lock_cluster(si, offset);
>> /* reuse swap entry of cache-only swap if not busy. */
>> if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> int swap_was_freed;
>> + VM_WARN_ON(order > 0);
>
> Instead of adding the WARN here, I think that it's better to add a WARN at the
> beginning of "scan" label. We should never scan if "order > 0", it can
> capture even more abnormal status.

OK, will do.

>
>> unlock_cluster(ci);
>> spin_unlock(&si->lock);
>> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> }
>>
>> if (si->swap_map[offset]) {
>> + VM_WARN_ON(order > 0);

And remove this one too? (relying on the one in scan instead)

>> unlock_cluster(ci);
>> if (!n_ret)
>> goto scan;
>> else
>> goto done;
>> }
>> - WRITE_ONCE(si->swap_map[offset], usage);
>> - inc_cluster_info_page(si, si->cluster_info, offset);
>> + memset(si->swap_map + offset, usage, nr_pages);
>
> Add barrier() here to correspond to the original WRITE_ONCE()?
> unlock_cluster(ci) may be NOP for some swap devices.

Yep, good spot!

>
>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>> unlock_cluster(ci);
>>
>> - swap_range_alloc(si, offset, 1);
>> + swap_range_alloc(si, offset, nr_pages);
>> slots[n_ret++] = swp_entry(si->type, offset);
>>
>> /* got enough slots or reach max slots? */
>
> If "order > 0", "nr" must be 1. So, we will "goto done" in the
> following code.

I've deliberately implemented scan_swap_map_slots() so that it allows nr > 1 for
order > 0, leaving it to the higher layers to decide on policy.

>
> /* got enough slots or reach max slots? */
> if ((n_ret == nr) || (offset >= si->highest_bit))
> goto done;
>
> We can add VM_WARN_ON() here to capture some abnormal status.

That was actually how I implemented it initially. But I decided that it doesn't
cost anything to allow nr > 1 for order > 0, and IMHO it makes the function
easier to understand because we remove this unnecessary constraint.

>
>> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>
>> /* try to get more slots in cluster */
>> if (si->cluster_info) {
>> - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>> + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
>> goto checks;
>> + if (order > 0)
>> + goto done;
>
> Don't need to add this, if "order > 0", we will never go here.

As per above.

>
>> } else if (si->cluster_nr && !si->swap_map[++offset]) {
>> /* non-ssd case, still more slots in cluster? */
>> --si->cluster_nr;
>> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> }
>>
>> done:
>> - set_cluster_next(si, offset + 1);
>> + if (order == 0)
>> + set_cluster_next(si, offset + 1);
>> si->flags -= SWP_SCANNING;
>> return n_ret;
>>
>> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> return n_ret;
>> }
>>
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> -{
>> - unsigned long idx;
>> - struct swap_cluster_info *ci;
>> - unsigned long offset;
>> -
>> - /*
>> - * Should not even be attempting cluster allocations when huge
>> - * page swap is disabled. Warn and fail the allocation.
>> - */
>> - if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> - VM_WARN_ON_ONCE(1);
>> - return 0;
>> - }
>> -
>> - if (cluster_list_empty(&si->free_clusters))
>> - return 0;
>> -
>> - idx = cluster_list_first(&si->free_clusters);
>> - offset = idx * SWAPFILE_CLUSTER;
>> - ci = lock_cluster(si, offset);
>> - alloc_cluster(si, idx);
>> - cluster_set_count(ci, SWAPFILE_CLUSTER);
>> -
>> - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> - unlock_cluster(ci);
>> - swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> - *slot = swp_entry(si->type, offset);
>> -
>> - return 1;
>> -}
>> -
>> static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> {
>> unsigned long offset = idx * SWAPFILE_CLUSTER;
>> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>> int n_ret = 0;
>> int node;
>>
>> - /* Only single cluster request supported */
>> - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>> + /* Only single THP request supported */
>> + WARN_ON_ONCE(n_goal > 1 && size > 1);
>>
>> spin_lock(&swap_avail_lock);
>>
>> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>> spin_unlock(&si->lock);
>> goto nextsi;
>> }
>> - if (size == SWAPFILE_CLUSTER) {
>> - if (si->flags & SWP_BLKDEV)
>> - n_ret = swap_alloc_cluster(si, swp_entries);
>> - } else
>> - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> - n_goal, swp_entries);
>> + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> + n_goal, swp_entries, size);
>> spin_unlock(&si->lock);
>> - if (n_ret || size == SWAPFILE_CLUSTER)
>> + if (n_ret || size > 1)
>> goto check_out;
>> cond_resched();
>>
>> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)
>>
>> /* This is called for allocating swap entry, not cache */
>> spin_lock(&si->lock);
>> - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
>> + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
>> atomic_long_dec(&nr_swap_pages);
>> spin_unlock(&si->lock);
>> fail:
>> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> p->flags |= SWP_SYNCHRONOUS_IO;
>>
>> if (p->bdev && bdev_nonrot(p->bdev)) {
>> - int cpu;
>> + int cpu, i;
>> unsigned long ci, nr_cluster;
>>
>> p->flags |= SWP_SOLIDSTATE;
>> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> struct percpu_cluster *cluster;
>>
>> cluster = per_cpu_ptr(p->percpu_cluster, cpu);
>> - cluster->next = SWAP_NEXT_INVALID;
>> + for (i = 0; i < SWAP_NR_ORDERS; i++)
>> + cluster->next[i] = SWAP_NEXT_INVALID;
>> }
>> } else {
>> atomic_inc(&nr_rotate_swap);
>
> You also need to check whether we should add swap_entry_size() for some
> functions to optimize for small systems. We may need to add swap_order()
> too.

I was planning to convert swap_entry_size() to swap_entry_order() as part of
switching to pass order instead of nr_pages. There is one other site that uses
swap_entry_size() and needs a size, so I was going to just change it to 1 <<
swap_entry_order(). Does that work for you?

I'll do an audit for places to use swap_entry_order(), but a quick scan just now
suggests that the constant should propagate to all the static functions from
get_swap_pages().

Thanks,
Ryan

>
> --
> Best Regards,
> Huang, Ying


2024-03-12 13:57:32

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting

On 12/03/2024 08:49, Ryan Roberts wrote:
> On 12/03/2024 08:01, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>> Hi All,
>>>
>>> This series adds support for swapping out multi-size THP (mTHP) without needing
>>> to first split the large folio via split_huge_page_to_list_to_order(). It
>>> closely follows the approach already used to swap-out PMD-sized THP.
>>>
>>> There are a couple of reasons for swapping out mTHP without splitting:
>>>
>>> - Performance: It is expensive to split a large folio and under extreme memory
>>> pressure some workloads regressed performance when using 64K mTHP vs 4K
>>> small folios because of this extra cost in the swap-out path. This series
>>> not only eliminates the regression but makes it faster to swap out 64K mTHP
>>> vs 4K small folios.
>>>
>>> - Memory fragmentation avoidance: If we can avoid splitting a large folio
>>> memory is less likely to become fragmented, making it easier to re-allocate
>>> a large folio in future.
>>>
>>> - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
>>> means we won't lose the TLB-efficiency benefits of mTHP once the memory has
>>> been through a swap cycle.
>>>
>>> I've done what I thought was the smallest change possible, and as a result, this
>>> approach is only employed when the swap is backed by a non-rotating block device
>>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>>> that this is sufficient.
>>>
>>>
>>> Performance Testing
>>> ===================
>>>
>>> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
>>> VM is set up with a 35G block ram device as the swap device and the test is run
>>> from inside a memcg limited to 40G memory. I've then run `usemem` from
>>> vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
>>> repeated everything 6 times and taken the mean performance improvement relative
>>> to 4K page baseline:
>>>
>>> | alloc size | baseline | + this series |
>>> | | v6.6-rc4+anonfolio | |
>>> |:-----------|--------------------:|--------------------:|
>>> | 4K Page | 0.0% | 1.4% |
>>> | 64K THP | -14.6% | 44.2% |
>>> | 2M THP | 87.4% | 97.7% |
>>>
>>> So with this change, the 64K swap performance goes from a 15% regression to a
>>> 44% improvement. 4K and 2M swap improves slightly too.
>>
>> I don't understand why the performance of 2M THP improves. The swap
>> entry allocation becomes a little slower. Can you provide some
>> perf-profile to root cause it?
>
> I didn't post the stdev, which is quite large (~10%), so that may explain some
> of it:
>
> | kernel | mean_rel | std_rel |
> |:---------|-----------:|----------:|
> | base-4K | 0.0% | 5.5% |
> | base-64K | -14.6% | 3.8% |
> | base-2M | 87.4% | 10.6% |
> | v4-4K | 1.4% | 3.7% |
> | v4-64K | 44.2% | 11.8% |
> | v4-2M | 97.7% | 13.3% |
>
> Regardless, I'll do some perf profiling and post results shortly.

I did a lot more runs (24 for each config) and averaged them to try to remove the
noise in the measurements. It's now only showing a 4% improvement for 2M. So I
don't think the 2M improvement is real:

| kernel | mean_rel | std_rel |
|:---------|-----------:|----------:|
| base-4K | 0.0% | 3.2% |
| base-64K | -9.1% | 10.1% |
| base-2M | 88.9% | 6.8% |
| v4-4K | 0.5% | 3.1% |
| v4-64K | 44.7% | 8.3% |
| v4-2M | 93.3% | 7.8% |

Looking at the perf data, the only thing that sticks out is that a big chunk of
time is spent in contpte_convert(), called as a result of
try_to_unmap_one(). This is present in both the before and after configs.

This is an arm64 function to "unfold" contpte mappings. Essentially, the PMD is
being split during shrink_folio_list() with TTU_SPLIT_HUGE_PMD, meaning the
THPs are PTE-mapped in contpte blocks. Then we are unmapping each pte one-by-one
which means the contpte block needs to be unfolded. I think try_to_unmap_one()
could potentially be optimized to batch unmap a contiguously mapped folio and
avoid this unfold. But that would be an independent and separate piece of work.
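
As a purely illustrative sketch of that batching idea (not a patch; the batched
helper is assumed to exist in roughly this form, and a real change would also
need to handle TLB-flush batching, dirty/young propagation and the per-page
swap entry setup):

    /* inside try_to_unmap_one(), for an exclusive anon folio whose PTEs are
     * mapped contiguously in this page table */
    nr = folio_pte_batch(folio, address, pvmw.pte, ptep_get(pvmw.pte),
                         folio_nr_pages(folio), 0, NULL);
    pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr, 0);
    flush_tlb_range(vma, address, address + nr * PAGE_SIZE);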

>
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>>> This test also acts as a good stress test for swap and, more generally mm. A
>>> couple of existing bugs were found as a result [5] [6].
>>>
>>>
>>> ---
>>> The series applies against mm-unstable (d7182786dd0a). Although I've
>>> additionally been running with a couple of extra fixes to avoid the issues at
>>> [6].
>>>
>>>
>>> Changes since v3 [3]
>>> ====================
>>>
>>> - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
>>> - Simplified max offset calculation (per Huang, Ying)
>>> - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
>>> offset (per Huang, Ying)
>>> - Removed swap_alloc_large() and merged its functionality into
>>> scan_swap_map_slots() (per Huang, Ying)
>>> - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
>>> by freeing swap entries in batches (see patch 2) (per DavidH)
>>> - vmscan splits folio if its partially mapped (per Barry Song, DavidH)
>>> - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
>>> - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
>>> since it's not actually a problem for THP as I first thought.
>>>
>>>
>>> Changes since v2 [2]
>>> ====================
>>>
>>> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>>> allocation. This required some refactoring to make everything work nicely
>>> (new patches 2 and 3).
>>> - Fix bug where nr_swap_pages would say there are pages available but the
>>> scanner would not be able to allocate them because they were reserved for the
>>> per-cpu allocator. We now allow stealing of order-0 entries from the high
>>> order per-cpu clusters (in addition to existing stealing from order-0
>>> per-cpu clusters).
>>>
>>>
>>> Changes since v1 [1]
>>> ====================
>>>
>>> - patch 1:
>>> - Use cluster_set_count() instead of cluster_set_count_flag() in
>>> swap_alloc_cluster() since we no longer have any flag to set. I was unable
>>> to kill cluster_set_count_flag() as proposed against v1 as other call
>>> sites depend on explicitly setting flags to 0.
>>> - patch 2:
>>> - Moved large_next[] array into percpu_cluster to make it per-cpu
>>> (recommended by Huang, Ying).
>>> - large_next[] array is dynamically allocated because PMD_ORDER is not a
>>> compile-time constant for powerpc (fixes build error).
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>> [2] https://lore.kernel.org/linux-mm/[email protected]/
>>> [3] https://lore.kernel.org/linux-mm/[email protected]/
>>> [4] https://lore.kernel.org/linux-mm/[email protected]/
>>> [5] https://lore.kernel.org/linux-mm/[email protected]/
>>> [6] https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>> Ryan Roberts (6):
>>> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>>> mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
>>> mm: swap: Simplify struct percpu_cluster
>>> mm: swap: Allow storage of all mTHP orders
>>> mm: vmscan: Avoid split during shrink_folio_list()
>>> mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
>>>
>>> include/linux/pgtable.h | 28 ++++
>>> include/linux/swap.h | 33 +++--
>>> mm/huge_memory.c | 3 -
>>> mm/internal.h | 48 +++++++
>>> mm/madvise.c | 101 ++++++++------
>>> mm/memory.c | 13 +-
>>> mm/swapfile.c | 298 ++++++++++++++++++++++------------------
>>> mm/vmscan.c | 9 +-
>>> 8 files changed, 332 insertions(+), 201 deletions(-)
>>>
>>> --
>>> 2.25.1
>


2024-03-13 01:17:34

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting

Ryan Roberts <[email protected]> writes:

> On 12/03/2024 08:49, Ryan Roberts wrote:
>> On 12/03/2024 08:01, Huang, Ying wrote:
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> Hi All,
>>>>
>>>> This series adds support for swapping out multi-size THP (mTHP) without needing
>>>> to first split the large folio via split_huge_page_to_list_to_order(). It
>>>> closely follows the approach already used to swap-out PMD-sized THP.
>>>>
>>>> There are a couple of reasons for swapping out mTHP without splitting:
>>>>
>>>> - Performance: It is expensive to split a large folio and under extreme memory
>>>> pressure some workloads regressed performance when using 64K mTHP vs 4K
>>>> small folios because of this extra cost in the swap-out path. This series
>>>> not only eliminates the regression but makes it faster to swap out 64K mTHP
>>>> vs 4K small folios.
>>>>
>>>> - Memory fragmentation avoidance: If we can avoid splitting a large folio
>>>> memory is less likely to become fragmented, making it easier to re-allocate
>>>> a large folio in future.
>>>>
>>>> - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
>>>> means we won't lose the TLB-efficiency benefits of mTHP once the memory has
>>>> been through a swap cycle.
>>>>
>>>> I've done what I thought was the smallest change possible, and as a result, this
>>>> approach is only employed when the swap is backed by a non-rotating block device
>>>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>>>> that this is sufficient.
>>>>
>>>>
>>>> Performance Testing
>>>> ===================
>>>>
>>>> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
>>>> VM is set up with a 35G block ram device as the swap device and the test is run
>>>> from inside a memcg limited to 40G memory. I've then run `usemem` from
>>>> vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
>>>> repeated everything 6 times and taken the mean performance improvement relative
>>>> to 4K page baseline:
>>>>
>>>> | alloc size | baseline | + this series |
>>>> | | v6.6-rc4+anonfolio | |
>>>> |:-----------|--------------------:|--------------------:|
>>>> | 4K Page | 0.0% | 1.4% |
>>>> | 64K THP | -14.6% | 44.2% |
>>>> | 2M THP | 87.4% | 97.7% |
>>>>
>>>> So with this change, the 64K swap performance goes from a 15% regression to a
>>>> 44% improvement. 4K and 2M swap improves slightly too.
>>>
>>> I don't understand why the performance of 2M THP improves. The swap
>>> entry allocation becomes a little slower. Can you provide some
>>> perf-profile to root cause it?
>>
>> I didn't post the stdev, which is quite large (~10%), so that may explain some
>> of it:
>>
>> | kernel | mean_rel | std_rel |
>> |:---------|-----------:|----------:|
>> | base-4K | 0.0% | 5.5% |
>> | base-64K | -14.6% | 3.8% |
>> | base-2M | 87.4% | 10.6% |
>> | v4-4K | 1.4% | 3.7% |
>> | v4-64K | 44.2% | 11.8% |
>> | v4-2M | 97.7% | 13.3% |
>>
>> Regardless, I'll do some perf profiling and post results shortly.
>
> I did a lot more runs (24 for each config) and meaned them to try to remove the
> noise in the measurements. It's now only showing a 4% improvement for 2M. So I
> don't think the 2M improvement is real:
>
> | kernel | mean_rel | std_rel |
> |:---------|-----------:|----------:|
> | base-4K | 0.0% | 3.2% |
> | base-64K | -9.1% | 10.1% |
> | base-2M | 88.9% | 6.8% |
> | v4-4K | 0.5% | 3.1% |
> | v4-64K | 44.7% | 8.3% |
> | v4-2M | 93.3% | 7.8% |
>
> Looking at the perf data, the only thing that sticks out is that a big chunk of
> time is spent in during contpte_convert(), called as a result of
> try_to_unmap_one(). This is present in both the before and after configs.
>
> This is an arm64 function to "unfold" contpte mappings. Essentially, the PMD is
> being split during shrink_folio_list() with TTU_SPLIT_HUGE_PMD, meaning the
> THPs are PTE-mapped in contpte blocks. Then we are unmapping each pte one-by-one
> which means the contpte block needs to be unfolded. I think try_to_unmap_one()
> could potentially be optimized to batch unmap a contiguously mapped folio and
> avoid this unfold. But that would be an independent and separate piece of work.

Thanks for more data and detailed explanation.

--
Best Regards,
Huang, Ying

2024-03-13 01:35:23

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

Ryan Roberts <[email protected]> writes:

> On 12/03/2024 07:51, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>> Multi-size THP enables performance improvements by allocating large,
>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>> across many cores, under high memory pressure, the speed regresses. This
>>> is due to bottlenecking on the increased number of TLBIs added due to
>>> all the extra folio splitting when the large folios are swapped out.
>>>
>>> Therefore, solve this regression by adding support for swapping out mTHP
>>> without needing to split the folio, just like is already done for
>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>> and when the swap backing store is a non-rotating block device. These
>>> are the same constraints as for the existing PMD-sized THP swap-out
>>> support.
>>>
>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>> prerequisite for swapping-in mTHP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller can fall back to splitting the folio
>>> and allocates individual entries (as per existing PMD-sized THP
>>> fallback).
>>>
>>> The per-order current clusters are maintained per-cpu using the existing
>>> infrastructure. This is done to avoid interleaving pages from different
>>> tasks, which would prevent IO being batched. This is already done for
>>> the order-0 allocations so we follow the same pattern.
>>>
>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>> ensures that when the swap file is getting full, space doesn't get tied
>>> up in the per-cpu reserves.
>>>
>>> This change only modifies swap to be able to accept any order mTHP. It
>>> doesn't change the callers to elide doing the actual split. That will be
>>> done in separate changes.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/swap.h | 8 ++-
>>> mm/swapfile.c | 167 +++++++++++++++++++++++++------------------
>>> 2 files changed, 103 insertions(+), 72 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0cb082bee717..39b5c18ccc6a 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>>> */
>>> #define SWAP_NEXT_INVALID 0
>>>
>>> +#ifdef CONFIG_THP_SWAP
>>> +#define SWAP_NR_ORDERS (PMD_ORDER + 1)
>>> +#else
>>> +#define SWAP_NR_ORDERS 1
>>> +#endif
>>> +
>>> /*
>>> * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>>> * its own cluster and swapout sequentially. The purpose is to optimize swapout
>>> * throughput.
>>> */
>>> struct percpu_cluster {
>>> - unsigned int next; /* Likely next allocation offset */
>>> + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>>> };
>>>
>>> struct swap_cluster_list {
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 3828d81aa6b8..61118a090796 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>
>>> /*
>>> * The cluster corresponding to page_nr will be used. The cluster will be
>>> - * removed from free cluster list and its usage counter will be increased.
>>> + * removed from free cluster list and its usage counter will be increased by
>>> + * count.
>>> */
>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>> - struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>> + struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>> + unsigned long count)
>>> {
>>> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>
>>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>> if (cluster_is_free(&cluster_info[idx]))
>>> alloc_cluster(p, idx);
>>>
>>> - VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>> + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>> cluster_set_count(&cluster_info[idx],
>>> - cluster_count(&cluster_info[idx]) + 1);
>>> + cluster_count(&cluster_info[idx]) + count);
>>> +}
>>> +
>>> +/*
>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>> + * removed from free cluster list and its usage counter will be increased by 1.
>>> + */
>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>> + struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +{
>>> + add_cluster_info_page(p, cluster_info, page_nr, 1);
>>> }
>>>
>>> /*
>>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>> */
>>> static bool
>>> scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> - unsigned long offset)
>>> + unsigned long offset, int order)
>>> {
>>> struct percpu_cluster *percpu_cluster;
>>> bool conflict;
>>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> return false;
>>>
>>> percpu_cluster = this_cpu_ptr(si->percpu_cluster);
>>> - percpu_cluster->next = SWAP_NEXT_INVALID;
>>> + percpu_cluster->next[order] = SWAP_NEXT_INVALID;
>>> + return true;
>>> +}
>>> +
>>> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
>>> + unsigned int nr_pages)
>>> +{
>>> + unsigned int i;
>>> +
>>> + for (i = 0; i < nr_pages; i++) {
>>> + if (swap_map[start + i])
>>> + return false;
>>> + }
>>> +
>>> return true;
>>> }
>>>
>>> /*
>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>> - * might involve allocating a new cluster for current CPU too.
>>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>>
>> IMO, it's not necessary to make mTHP a special case other than base
>> page. So, this can be changed to
>>
>> * Try to get swap entries with specified order from current cpu's swap
>
> Sure, will fix in next version.
>
>>
>>> + * entry pool (a cluster). This might involve allocating a new cluster for
>>> + * current CPU too.
>>> */
>>> static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> - unsigned long *offset, unsigned long *scan_base)
>>> + unsigned long *offset, unsigned long *scan_base, int order)
>>> {
>>> + unsigned int nr_pages = 1 << order;
>>> struct percpu_cluster *cluster;
>>> struct swap_cluster_info *ci;
>>> unsigned int tmp, max;
>>>
>>> new_cluster:
>>> cluster = this_cpu_ptr(si->percpu_cluster);
>>> - tmp = cluster->next;
>>> + tmp = cluster->next[order];
>>> if (tmp == SWAP_NEXT_INVALID) {
>>> if (!cluster_list_empty(&si->free_clusters)) {
>>> tmp = cluster_next(&si->free_clusters.head) *
>>> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>
>>> /*
>>> * Other CPUs can use our cluster if they can't find a free cluster,
>>> - * check if there is still free entry in the cluster
>>> + * check if there is still free entry in the cluster, maintaining
>>> + * natural alignment.
>>> */
>>> max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>>> if (tmp < max) {
>>> ci = lock_cluster(si, tmp);
>>> while (tmp < max) {
>>> - if (!si->swap_map[tmp])
>>> + if (swap_range_empty(si->swap_map, tmp, nr_pages))
>>> break;
>>> - tmp++;
>>> + tmp += nr_pages;
>>> }
>>> unlock_cluster(ci);
>>> }
>>> if (tmp >= max) {
>>> - cluster->next = SWAP_NEXT_INVALID;
>>> + cluster->next[order] = SWAP_NEXT_INVALID;
>>> goto new_cluster;
>>> }
>>> *offset = tmp;
>>> *scan_base = tmp;
>>> - tmp += 1;
>>> - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
>>> + tmp += nr_pages;
>>> + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>>> return true;
>>> }
>>>
>>> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>>>
>>> static int scan_swap_map_slots(struct swap_info_struct *si,
>>> unsigned char usage, int nr,
>>> - swp_entry_t slots[])
>>> + swp_entry_t slots[], unsigned int nr_pages)
>>
>> IMHO, it's better to use order as parameter directly. We can change the
>> parameter of get_swap_pages() too.
>
> I agree that this will make the interface clearer/self documenting. I'll do it
> in the next version.
>
>>
>>> {
>>> struct swap_cluster_info *ci;
>>> unsigned long offset;
>>> unsigned long scan_base;
>>> unsigned long last_in_cluster = 0;
>>> int latency_ration = LATENCY_LIMIT;
>>> + int order = ilog2(nr_pages);
>>> int n_ret = 0;
>>> bool scanned_many = false;
>>>
>>> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>> * And we let swap pages go all over an SSD partition. Hugh
>>> */
>>>
>>> + if (nr_pages > 1) {
>>> + /*
>>> + * Should not even be attempting large allocations when huge
>>> + * page swap is disabled. Warn and fail the allocation.
>>> + */
>>> + if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>>> + nr_pages > SWAPFILE_CLUSTER ||
>>> + !is_power_of_2(nr_pages)) {
>>> + VM_WARN_ON_ONCE(1);
>>> + return 0;
>>> + }
>>> +
>>> + /*
>>> + * Swapfile is not block device or not using clusters so unable
>>> + * to allocate large entries.
>>> + */
>>> + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>>> + return 0;
>>> + }
>>> +
>>> si->flags += SWP_SCANNING;
>>> /*
>>> * Use percpu scan base for SSD to reduce lock contention on
>>> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>
>>> /* SSD algorithm */
>>> if (si->cluster_info) {
>>> - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>>> + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
>>> + if (order > 0)
>>> + goto no_page;
>>> goto scan;
>>> + }
>>> } else if (unlikely(!si->cluster_nr--)) {
>>> if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
>>> si->cluster_nr = SWAPFILE_CLUSTER - 1;
>>> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>
>>> checks:
>>> if (si->cluster_info) {
>>> - while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
>>> + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
>>> /* take a break if we already got some slots */
>>> if (n_ret)
>>> goto done;
>>> if (!scan_swap_map_try_ssd_cluster(si, &offset,
>>> - &scan_base))
>>> + &scan_base, order)) {
>>> + if (order > 0)
>>> + goto no_page;
>>> goto scan;
>>> + }
>>> }
>>> }
>>> if (!(si->flags & SWP_WRITEOK))
>>> goto no_page;
>>> if (!si->highest_bit)
>>> goto no_page;
>>> - if (offset > si->highest_bit)
>>> + if (order == 0 && offset > si->highest_bit)
>>
>> I don't think that we need to check "order == 0" here. The original
>> condition will always be false for "order != 0".
>
> I spent ages looking at this and couldn't quite convince myself that this is
> definitely safe. Certainly it would be catastrophic if we modified the returned
> offset for a non-order-0 case (the code below assumes order-0 when checking). So
> I decided in the end to be safe and add this condition. Looking again, I agree
> with you. Will fix in next version.
>
>>
>>> scan_base = offset = si->lowest_bit;
>>>
>>> ci = lock_cluster(si, offset);
>>> /* reuse swap entry of cache-only swap if not busy. */
>>> if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>>> int swap_was_freed;
>>> + VM_WARN_ON(order > 0);
>>
>> Instead of add WARN here, I think that it's better to add WARN at the
>> beginning of "scan" label. We should never scan if "order > 0", it can
>> capture even more abnormal status.
>
> OK, will do.
>
>>
>>> unlock_cluster(ci);
>>> spin_unlock(&si->lock);
>>> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>> }
>>>
>>> if (si->swap_map[offset]) {
>>> + VM_WARN_ON(order > 0);
>
> And remove this one too? (relying on the one in scan instead)

Yes. I think so.

>>> unlock_cluster(ci);
>>> if (!n_ret)
>>> goto scan;
>>> else
>>> goto done;
>>> }
>>> - WRITE_ONCE(si->swap_map[offset], usage);
>>> - inc_cluster_info_page(si, si->cluster_info, offset);
>>> + memset(si->swap_map + offset, usage, nr_pages);
>>
>> Add barrier() here corresponds to original WRITE_ONCE()?
>> unlock_cluster(ci) may be NOP for some swap devices.
>
> Yep, good spot!
>
>>
>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>> unlock_cluster(ci);
>>>
>>> - swap_range_alloc(si, offset, 1);
>>> + swap_range_alloc(si, offset, nr_pages);
>>> slots[n_ret++] = swp_entry(si->type, offset);
>>>
>>> /* got enough slots or reach max slots? */
>>
>> If "order > 0", "nr" must be 1. So, we will "goto done" in the
>> following code.
>
> I've deliberately implemented scan_swap_map_slots() so that it allows nr > 1 for
> order > 0, leaving it to the higher layers to decide on policy.
>
>>
>> /* got enough slots or reach max slots? */
>> if ((n_ret == nr) || (offset >= si->highest_bit))
>> goto done;
>>
>> We can add VM_WARN_ON() here to capture some abnormal status.
>
> That was actually how I implemented it initially, but I decided that it doesn't
> cost anything to allow nr > 1 for order > 0, and IMHO it makes the function
> easier to understand because we remove this unnecessary constraint.

This sounds reasonable to me.

>>
>>> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>
>>> /* try to get more slots in cluster */
>>> if (si->cluster_info) {
>>> - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>>> + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
>>> goto checks;
>>> + if (order > 0)
>>> + goto done;
>>
>> Don't need to add this, if "order > 0", we will never go here.
>
> As per above.
>
>>
>>> } else if (si->cluster_nr && !si->swap_map[++offset]) {
>>> /* non-ssd case, still more slots in cluster? */
>>> --si->cluster_nr;
>>> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>> }
>>>
>>> done:
>>> - set_cluster_next(si, offset + 1);
>>> + if (order == 0)
>>> + set_cluster_next(si, offset + 1);
>>> si->flags -= SWP_SCANNING;
>>> return n_ret;
>>>
>>> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>> return n_ret;
>>> }
>>>
>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>> -{
>>> - unsigned long idx;
>>> - struct swap_cluster_info *ci;
>>> - unsigned long offset;
>>> -
>>> - /*
>>> - * Should not even be attempting cluster allocations when huge
>>> - * page swap is disabled. Warn and fail the allocation.
>>> - */
>>> - if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>>> - VM_WARN_ON_ONCE(1);
>>> - return 0;
>>> - }
>>> -
>>> - if (cluster_list_empty(&si->free_clusters))
>>> - return 0;
>>> -
>>> - idx = cluster_list_first(&si->free_clusters);
>>> - offset = idx * SWAPFILE_CLUSTER;
>>> - ci = lock_cluster(si, offset);
>>> - alloc_cluster(si, idx);
>>> - cluster_set_count(ci, SWAPFILE_CLUSTER);
>>> -
>>> - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>>> - unlock_cluster(ci);
>>> - swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>>> - *slot = swp_entry(si->type, offset);
>>> -
>>> - return 1;
>>> -}
>>> -
>>> static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>> {
>>> unsigned long offset = idx * SWAPFILE_CLUSTER;
>>> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>> int n_ret = 0;
>>> int node;
>>>
>>> - /* Only single cluster request supported */
>>> - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>>> + /* Only single THP request supported */
>>> + WARN_ON_ONCE(n_goal > 1 && size > 1);
>>>
>>> spin_lock(&swap_avail_lock);
>>>
>>> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>> spin_unlock(&si->lock);
>>> goto nextsi;
>>> }
>>> - if (size == SWAPFILE_CLUSTER) {
>>> - if (si->flags & SWP_BLKDEV)
>>> - n_ret = swap_alloc_cluster(si, swp_entries);
>>> - } else
>>> - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>> - n_goal, swp_entries);
>>> + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>> + n_goal, swp_entries, size);
>>> spin_unlock(&si->lock);
>>> - if (n_ret || size == SWAPFILE_CLUSTER)
>>> + if (n_ret || size > 1)
>>> goto check_out;
>>> cond_resched();
>>>
>>> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)
>>>
>>> /* This is called for allocating swap entry, not cache */
>>> spin_lock(&si->lock);
>>> - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
>>> + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
>>> atomic_long_dec(&nr_swap_pages);
>>> spin_unlock(&si->lock);
>>> fail:
>>> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>> p->flags |= SWP_SYNCHRONOUS_IO;
>>>
>>> if (p->bdev && bdev_nonrot(p->bdev)) {
>>> - int cpu;
>>> + int cpu, i;
>>> unsigned long ci, nr_cluster;
>>>
>>> p->flags |= SWP_SOLIDSTATE;
>>> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>> struct percpu_cluster *cluster;
>>>
>>> cluster = per_cpu_ptr(p->percpu_cluster, cpu);
>>> - cluster->next = SWAP_NEXT_INVALID;
>>> + for (i = 0; i < SWAP_NR_ORDERS; i++)
>>> + cluster->next[i] = SWAP_NEXT_INVALID;
>>> }
>>> } else {
>>> atomic_inc(&nr_rotate_swap);
>>
>> You also need to check whether we should add swap_entry_size() for some
>> functions to optimize for small system. We may need to add swap_order()
>> too.
>
> I was planning to convert swap_entry_size() to swap_entry_order() as part of
> switching to pass order instead of nr_pages. There is one other site that uses
> swap_entry_size() and needs a size, so was going to just change it to 1 <<
> swap_entry_order(). Does that work for you?

Yes.

> I'll do an audit for places to use swap_entry_order() but quick scan just now
> suggests that the constant should propagate to all the static functions from
> get_swap_pages().
>

--
Best Regards,
Huang, Ying

2024-03-13 01:42:28

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 3/6] mm: swap: Simplify struct percpu_cluster

Ryan Roberts <[email protected]> writes:

> On 12/03/2024 07:52, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>> struct percpu_cluster stores the index of cpu's current cluster and the
>>> offset of the next entry that will be allocated for the cpu. These two
>>> pieces of information are redundant because the cluster index is just
>>> (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
>>> cluster index is because the structure used for it also has a flag to
>>> indicate "no cluster". However this data structure also contains a spin
>>> lock, which is never used in this context, as a side effect the code
>>> copies the spinlock_t structure, which is questionable coding practice
>>> in my view.
>>>
>>> So let's clean this up and store only the next offset, and use a
>>> sentinel value (SWAP_NEXT_INVALID) to indicate "no cluster".
>>> SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen
>>> legitimately; The first page in the swap file is the swap header, which
>>> is always marked bad to prevent it from being allocated as an entry.
>>> This also prevents the cluster to which it belongs being marked free, so
>>> it will never appear on the free list.
>>>
>>> This change saves 16 bytes per cpu. And given we are shortly going to
>>> extend this mechanism to be per-cpu-AND-per-order, we will end up saving
>>> 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
>>> system.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>
>> LGTM, Thanks!
>
> Thanks! What's a guy got to do to get Rb or Ack? :)

Feel free to add

Reviewed-by: "Huang, Ying" <[email protected]>

in the future version.

--
Best Regards,
Huang, Ying

2024-03-13 07:20:07

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
>
> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> folio that is fully and contiguously mapped in the pageout/cold vm
> range. This change means that large folios will be maintained all the
> way to swap storage. This both improves performance during swap-out, by
> eliding the cost of splitting the folio, and sets us up nicely for
> maintaining the large folio when it is swapped back in (to be covered in
> a separate series).
>
> Folios that are not fully mapped in the target range are still split,
> but note that behavior is changed so that if the split fails for any
> reason (folio locked, shared, etc) we now leave it as is and move to the
> next pte in the range and continue work on the subsequent folios.
> Previously any failure of this sort would cause the entire operation to
> give up and no folios mapped at higher addresses were paged out or made
> cold. Given large folios are becoming more common, this old behavior
> would have likely led to wasted opportunities.
>
> While we are at it, change the code that clears young from the ptes to
> use ptep_test_and_clear_young(), which is more efficient than
> get_and_clear/modify/set, especially for contpte mappings on arm64,
> where the old approach would require unfolding/refolding and the new
> approach can be done in place.
>
> Signed-off-by: Ryan Roberts <[email protected]>

This looks so much better than our initial RFC.
Thank you for your excellent work!

> ---
> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> 1 file changed, 51 insertions(+), 38 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 547dcd1f7a39..56c7ba7bd558 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> LIST_HEAD(folio_list);
> bool pageout_anon_only_filter;
> unsigned int batch_count = 0;
> + int nr;
>
> if (fatal_signal_pending(current))
> return -EINTR;
> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> return 0;
> flush_tlb_batched_pending(mm);
> arch_enter_lazy_mmu_mode();
> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> + nr = 1;
> ptent = ptep_get(pte);
>
> if (++batch_count == SWAP_CLUSTER_MAX) {
> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> continue;
>
> /*
> - * Creating a THP page is expensive so split it only if we
> - * are sure it's worth. Split it if we are only owner.
> + * If we encounter a large folio, only split it if it is not
> + * fully mapped within the range we are operating on. Otherwise
> + * leave it as is so that it can be swapped out whole. If we
> + * fail to split a folio, leave it in place and advance to the
> + * next pte in the range.
> */
> if (folio_test_large(folio)) {
> - int err;
> -
> - if (folio_estimated_sharers(folio) > 1)
> - break;
> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> - break;
> - if (!folio_trylock(folio))
> - break;
> - folio_get(folio);
> - arch_leave_lazy_mmu_mode();
> - pte_unmap_unlock(start_pte, ptl);
> - start_pte = NULL;
> - err = split_folio(folio);
> - folio_unlock(folio);
> - folio_put(folio);
> - if (err)
> - break;
> - start_pte = pte =
> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> - if (!start_pte)
> - break;
> - arch_enter_lazy_mmu_mode();
> - pte--;
> - addr -= PAGE_SIZE;
> - continue;
> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> + FPB_IGNORE_SOFT_DIRTY;
> + int max_nr = (end - addr) / PAGE_SIZE;
> +
> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> + fpb_flags, NULL);

I wonder if we have a quick way to avoid folio_pte_batch() if users
are doing madvise() on a portion of a large folio.

> +
> + if (nr < folio_nr_pages(folio)) {
> + int err;
> +
> + if (folio_estimated_sharers(folio) > 1)
> + continue;
> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> + continue;
> + if (!folio_trylock(folio))
> + continue;
> + folio_get(folio);
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(start_pte, ptl);
> + start_pte = NULL;
> + err = split_folio(folio);
> + folio_unlock(folio);
> + folio_put(folio);
> + if (err)
> + continue;
> + start_pte = pte =
> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!start_pte)
> + break;
> + arch_enter_lazy_mmu_mode();
> + nr = 0;
> + continue;
> + }
> }
>
> /*
> * Do not interfere with other mappings of this folio and
> - * non-LRU folio.
> + * non-LRU folio. If we have a large folio at this point, we
> + * know it is fully mapped so if its mapcount is the same as its
> + * number of pages, it must be exclusive.
> */
> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> + if (!folio_test_lru(folio) ||
> + folio_mapcount(folio) != folio_nr_pages(folio))
> continue;

This looks so perfect and is exactly what I wanted to achieve.

>
> if (pageout_anon_only_filter && !folio_test_anon(folio))
> continue;
>
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> - if (!pageout && pte_young(ptent)) {
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> - ptent = pte_mkold(ptent);
> - set_pte_at(mm, addr, pte, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> + if (!pageout) {
> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> + if (ptep_test_and_clear_young(vma, addr, pte))
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + }

This looks so smart. If it is not pageout, the inner loop has already advanced
pte and addr, so nr is 0 and we don't need to advance again in
for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)

Otherwise, nr won't be 0, so we will advance addr and
pte by nr.
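
Spelling that out as a sketch (a paraphrase of the patch, not new code):

    for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
            nr = 1;                 /* default: step a single pte */
            ...
            if (!pageout) {
                    /* inner loop advances pte/addr itself and leaves nr == 0,
                     * so the outer "pte += nr" update is a no-op */
                    for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
                            if (ptep_test_and_clear_young(vma, addr, pte))
                                    tlb_remove_tlb_entry(tlb, pte, addr);
                    }
            }
            /* pageout path: nr still holds the batch size, so the outer
             * update steps over the whole batch in one go */
    }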


> }
>
> /*
> --
> 2.25.1
>

Overall, LGTM,

Reviewed-by: Barry Song <[email protected]>

2024-03-13 08:51:13

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting

On 13/03/2024 01:15, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> On 12/03/2024 08:49, Ryan Roberts wrote:
>>> On 12/03/2024 08:01, Huang, Ying wrote:
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> Hi All,
>>>>>
>>>>> This series adds support for swapping out multi-size THP (mTHP) without needing
>>>>> to first split the large folio via split_huge_page_to_list_to_order(). It
>>>>> closely follows the approach already used to swap-out PMD-sized THP.
>>>>>
>>>>> There are a couple of reasons for swapping out mTHP without splitting:
>>>>>
>>>>> - Performance: It is expensive to split a large folio and under extreme memory
>>>>> pressure some workloads regressed performance when using 64K mTHP vs 4K
>>>>> small folios because of this extra cost in the swap-out path. This series
>>>>> not only eliminates the regression but makes it faster to swap out 64K mTHP
>>>>> vs 4K small folios.
>>>>>
>>>>> - Memory fragmentation avoidance: If we can avoid splitting a large folio
>>>>> memory is less likely to become fragmented, making it easier to re-allocate
>>>>> a large folio in future.
>>>>>
>>>>> - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
>>>>> means we won't lose the TLB-efficiency benefits of mTHP once the memory has
>>>>> been through a swap cycle.
>>>>>
>>>>> I've done what I thought was the smallest change possible, and as a result, this
>>>>> approach is only employed when the swap is backed by a non-rotating block device
>>>>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>>>>> that this is sufficient.
>>>>>
>>>>>
>>>>> Performance Testing
>>>>> ===================
>>>>>
>>>>> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
>>>>> VM is set up with a 35G block ram device as the swap device and the test is run
>>>>> from inside a memcg limited to 40G memory. I've then run `usemem` from
>>>>> vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
>>>>> repeated everything 6 times and taken the mean performance improvement relative
>>>>> to 4K page baseline:
>>>>>
>>>>> | alloc size | baseline | + this series |
>>>>> | | v6.6-rc4+anonfolio | |
>>>>> |:-----------|--------------------:|--------------------:|
>>>>> | 4K Page | 0.0% | 1.4% |
>>>>> | 64K THP | -14.6% | 44.2% |
>>>>> | 2M THP | 87.4% | 97.7% |
>>>>>
>>>>> So with this change, the 64K swap performance goes from a 15% regression to a
>>>>> 44% improvement. 4K and 2M swap improves slightly too.
>>>>
>>>> I don't understand why the performance of 2M THP improves. The swap
>>>> entry allocation becomes a little slower. Can you provide some
>>>> perf-profile to root cause it?
>>>
>>> I didn't post the stdev, which is quite large (~10%), so that may explain some
>>> of it:
>>>
>>> | kernel | mean_rel | std_rel |
>>> |:---------|-----------:|----------:|
>>> | base-4K | 0.0% | 5.5% |
>>> | base-64K | -14.6% | 3.8% |
>>> | base-2M | 87.4% | 10.6% |
>>> | v4-4K | 1.4% | 3.7% |
>>> | v4-64K | 44.2% | 11.8% |
>>> | v4-2M | 97.7% | 13.3% |
>>>
>>> Regardless, I'll do some perf profiling and post results shortly.
>>
>> I did a lot more runs (24 for each config) and meaned them to try to remove the
>> noise in the measurements. It's now only showing a 4% improvement for 2M. So I
>> don't think the 2M improvement is real:
>>
>> | kernel | mean_rel | std_rel |
>> |:---------|-----------:|----------:|
>> | base-4K | 0.0% | 3.2% |
>> | base-64K | -9.1% | 10.1% |
>> | base-2M | 88.9% | 6.8% |
>> | v4-4K | 0.5% | 3.1% |
>> | v4-64K | 44.7% | 8.3% |
>> | v4-2M | 93.3% | 7.8% |
>>
>> Looking at the perf data, the only thing that sticks out is that a big chunk of
>> time is spent in during contpte_convert(), called as a result of
>> try_to_unmap_one(). This is present in both the before and after configs.
>>
>> This is an arm64 function to "unfold" contpte mappings. Essentially, the PMD is
>> being split during shrink_folio_list() with TTU_SPLIT_HUGE_PMD, meaning the
>> THPs are PTE-mapped in contpte blocks. Then we are unmapping each pte one-by-one
>> which means the contpte block needs to be unfolded. I think try_to_unmap_one()
>> could potentially be optimized to batch unmap a contiguously mapped folio and
>> avoid this unfold. But that would be an independent and separate piece of work.
>
> Thanks for more data and detailed explanation.

And thanks for your review! I'll address all your comments (and any others that
I get in the meantime) and repost after the merge window. It would be great if
we can get this in for v6.10.

>
> --
> Best Regards,
> Huang, Ying


2024-03-13 09:03:18

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 13/03/2024 07:19, Barry Song wrote:
> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
>>
>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>> folio that is fully and contiguously mapped in the pageout/cold vm
>> range. This change means that large folios will be maintained all the
>> way to swap storage. This both improves performance during swap-out, by
>> eliding the cost of splitting the folio, and sets us up nicely for
>> maintaining the large folio when it is swapped back in (to be covered in
>> a separate series).
>>
>> Folios that are not fully mapped in the target range are still split,
>> but note that behavior is changed so that if the split fails for any
>> reason (folio locked, shared, etc) we now leave it as is and move to the
>> next pte in the range and continue work on the proceeding folios.
>> Previously any failure of this sort would cause the entire operation to
>> give up and no folios mapped at higher addresses were paged out or made
>> cold. Given large folios are becoming more common, this old behavior
>> would have likely lead to wasted opportunities.
>>
>> While we are at it, change the code that clears young from the ptes to
>> use ptep_test_and_clear_young(), which is more efficent than
>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>> where the old approach would require unfolding/refolding and the new
>> approach can be done in place.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>
> This looks so much better than our initial RFC.
> Thank you for your excellent work!

Thanks - it's a team effort - I had your PoC and David's previous batching work
to use as a template.

>
>> ---
>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>> 1 file changed, 51 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 547dcd1f7a39..56c7ba7bd558 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> LIST_HEAD(folio_list);
>> bool pageout_anon_only_filter;
>> unsigned int batch_count = 0;
>> + int nr;
>>
>> if (fatal_signal_pending(current))
>> return -EINTR;
>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> return 0;
>> flush_tlb_batched_pending(mm);
>> arch_enter_lazy_mmu_mode();
>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>> + nr = 1;
>> ptent = ptep_get(pte);
>>
>> if (++batch_count == SWAP_CLUSTER_MAX) {
>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> continue;
>>
>> /*
>> - * Creating a THP page is expensive so split it only if we
>> - * are sure it's worth. Split it if we are only owner.
>> + * If we encounter a large folio, only split it if it is not
>> + * fully mapped within the range we are operating on. Otherwise
>> + * leave it as is so that it can be swapped out whole. If we
>> + * fail to split a folio, leave it in place and advance to the
>> + * next pte in the range.
>> */
>> if (folio_test_large(folio)) {
>> - int err;
>> -
>> - if (folio_estimated_sharers(folio) > 1)
>> - break;
>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>> - break;
>> - if (!folio_trylock(folio))
>> - break;
>> - folio_get(folio);
>> - arch_leave_lazy_mmu_mode();
>> - pte_unmap_unlock(start_pte, ptl);
>> - start_pte = NULL;
>> - err = split_folio(folio);
>> - folio_unlock(folio);
>> - folio_put(folio);
>> - if (err)
>> - break;
>> - start_pte = pte =
>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
>> - if (!start_pte)
>> - break;
>> - arch_enter_lazy_mmu_mode();
>> - pte--;
>> - addr -= PAGE_SIZE;
>> - continue;
>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>> + FPB_IGNORE_SOFT_DIRTY;
>> + int max_nr = (end - addr) / PAGE_SIZE;
>> +
>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>> + fpb_flags, NULL);
>
> I wonder if we have a quick way to avoid folio_pte_batch() if users
> are doing madvise() on a portion of a large folio.

Good idea. Something like this?

	if (pte_pfn(ptent) == folio_pfn(folio))
		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
				     fpb_flags, NULL);

If we are not mapping the first page of the folio, then it can't be a full
mapping, so no need to call folio_pte_batch(). Just split it.

>
>> +
>> + if (nr < folio_nr_pages(folio)) {
>> + int err;
>> +
>> + if (folio_estimated_sharers(folio) > 1)
>> + continue;
>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
>> + continue;
>> + if (!folio_trylock(folio))
>> + continue;
>> + folio_get(folio);
>> + arch_leave_lazy_mmu_mode();
>> + pte_unmap_unlock(start_pte, ptl);
>> + start_pte = NULL;
>> + err = split_folio(folio);
>> + folio_unlock(folio);
>> + folio_put(folio);
>> + if (err)
>> + continue;
>> + start_pte = pte =
>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
>> + if (!start_pte)
>> + break;
>> + arch_enter_lazy_mmu_mode();
>> + nr = 0;
>> + continue;
>> + }
>> }
>>
>> /*
>> * Do not interfere with other mappings of this folio and
>> - * non-LRU folio.
>> + * non-LRU folio. If we have a large folio at this point, we
>> + * know it is fully mapped so if its mapcount is the same as its
>> + * number of pages, it must be exclusive.
>> */
>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>> + if (!folio_test_lru(folio) ||
>> + folio_mapcount(folio) != folio_nr_pages(folio))
>> continue;
>
> This looks so perfect and is exactly what I wanted to achieve.
>
>>
>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>> continue;
>>
>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>> - if (!pageout && pte_young(ptent)) {
>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>> - tlb->fullmm);
>> - ptent = pte_mkold(ptent);
>> - set_pte_at(mm, addr, pte, ptent);
>> - tlb_remove_tlb_entry(tlb, pte, addr);
>> + if (!pageout) {
>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>> + if (ptep_test_and_clear_young(vma, addr, pte))
>> + tlb_remove_tlb_entry(tlb, pte, addr);
>> + }
>
> This looks so smart. if it is not pageout, we have increased pte
> and addr here; so nr is 0 and we don't need to increase again in
> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>
> otherwise, nr won't be 0. so we will increase addr and
> pte by nr.

Indeed. I'm hoping that Lance is able to follow a similar pattern for
madvise_free_pte_range().


>
>
>> }
>>
>> /*
>> --
>> 2.25.1
>>
>
> Overall, LGTM,
>
> Reviewed-by: Barry Song <[email protected]>

Thanks!



2024-03-13 09:17:03

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <[email protected]> wrote:
>
> On 13/03/2024 07:19, Barry Song wrote:
> > On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >> folio that is fully and contiguously mapped in the pageout/cold vm
> >> range. This change means that large folios will be maintained all the
> >> way to swap storage. This both improves performance during swap-out, by
> >> eliding the cost of splitting the folio, and sets us up nicely for
> >> maintaining the large folio when it is swapped back in (to be covered in
> >> a separate series).
> >>
> >> Folios that are not fully mapped in the target range are still split,
> >> but note that behavior is changed so that if the split fails for any
> >> reason (folio locked, shared, etc) we now leave it as is and move to the
> >> next pte in the range and continue work on the proceeding folios.
> >> Previously any failure of this sort would cause the entire operation to
> >> give up and no folios mapped at higher addresses were paged out or made
> >> cold. Given large folios are becoming more common, this old behavior
> >> would have likely lead to wasted opportunities.
> >>
> >> While we are at it, change the code that clears young from the ptes to
> >> use ptep_test_and_clear_young(), which is more efficent than
> >> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >> where the old approach would require unfolding/refolding and the new
> >> approach can be done in place.
> >>
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >
> > This looks so much better than our initial RFC.
> > Thank you for your excellent work!
>
> Thanks - its a team effort - I had your PoC and David's previous batching work
> to use as a template.
>
> >
> >> ---
> >> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >> 1 file changed, 51 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 547dcd1f7a39..56c7ba7bd558 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> LIST_HEAD(folio_list);
> >> bool pageout_anon_only_filter;
> >> unsigned int batch_count = 0;
> >> + int nr;
> >>
> >> if (fatal_signal_pending(current))
> >> return -EINTR;
> >> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> return 0;
> >> flush_tlb_batched_pending(mm);
> >> arch_enter_lazy_mmu_mode();
> >> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> >> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >> + nr = 1;
> >> ptent = ptep_get(pte);
> >>
> >> if (++batch_count == SWAP_CLUSTER_MAX) {
> >> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> continue;
> >>
> >> /*
> >> - * Creating a THP page is expensive so split it only if we
> >> - * are sure it's worth. Split it if we are only owner.
> >> + * If we encounter a large folio, only split it if it is not
> >> + * fully mapped within the range we are operating on. Otherwise
> >> + * leave it as is so that it can be swapped out whole. If we
> >> + * fail to split a folio, leave it in place and advance to the
> >> + * next pte in the range.
> >> */
> >> if (folio_test_large(folio)) {
> >> - int err;
> >> -
> >> - if (folio_estimated_sharers(folio) > 1)
> >> - break;
> >> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> - break;
> >> - if (!folio_trylock(folio))
> >> - break;
> >> - folio_get(folio);
> >> - arch_leave_lazy_mmu_mode();
> >> - pte_unmap_unlock(start_pte, ptl);
> >> - start_pte = NULL;
> >> - err = split_folio(folio);
> >> - folio_unlock(folio);
> >> - folio_put(folio);
> >> - if (err)
> >> - break;
> >> - start_pte = pte =
> >> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> - if (!start_pte)
> >> - break;
> >> - arch_enter_lazy_mmu_mode();
> >> - pte--;
> >> - addr -= PAGE_SIZE;
> >> - continue;
> >> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >> + FPB_IGNORE_SOFT_DIRTY;
> >> + int max_nr = (end - addr) / PAGE_SIZE;
> >> +
> >> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> + fpb_flags, NULL);
> >
> > I wonder if we have a quick way to avoid folio_pte_batch() if users
> > are doing madvise() on a portion of a large folio.
>
> Good idea. Something like this?:
>
> if (pte_pfn(pte) == folio_pfn(folio)

what about

"If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"

just to account for cases where the user's end address falls within
the middle of a large folio?
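
Putting the two suggestions together, the guard might end up looking something
like this (sketch only, using folio_nr_pages(folio) for the "nr_pages" above):

    int max_nr = (end - addr) / PAGE_SIZE;

    if (pte_pfn(ptent) == folio_pfn(folio) &&
        max_nr >= folio_nr_pages(folio))
            nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                 fpb_flags, NULL);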


BTW, another minor issue is here:

	if (++batch_count == SWAP_CLUSTER_MAX) {
		batch_count = 0;
		if (need_resched()) {
			arch_leave_lazy_mmu_mode();
			pte_unmap_unlock(start_pte, ptl);
			cond_resched();
			goto restart;
		}
	}

We are increasing batch_count by 1 for a batch of nr PTEs, so we may be holding
the PTL longer than in the small-folio case; we used to increase it by 1 for
each PTE. Does it matter?
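
If it did turn out to matter, one possible accounting change (purely a sketch,
not something posted in this thread) would be to charge the whole batch that
was just processed rather than a fixed 1; nr would need initialising to 0 so
the first pass charges nothing:

    int nr = 0;
    ...
    for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
            batch_count += nr;      /* charge the batch just processed */
            nr = 1;
            ptent = ptep_get(pte);

            if (batch_count >= SWAP_CLUSTER_MAX) {
                    batch_count = 0;
                    if (need_resched()) {
                            arch_leave_lazy_mmu_mode();
                            pte_unmap_unlock(start_pte, ptl);
                            cond_resched();
                            goto restart;
                    }
            }
            ...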

> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> fpb_flags, NULL);
>
> If we are not mapping the first page of the folio, then it can't be a full
> mapping, so no need to call folio_pte_batch(). Just split it.
>
> >
> >> +
> >> + if (nr < folio_nr_pages(folio)) {
> >> + int err;
> >> +
> >> + if (folio_estimated_sharers(folio) > 1)
> >> + continue;
> >> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> + continue;
> >> + if (!folio_trylock(folio))
> >> + continue;
> >> + folio_get(folio);
> >> + arch_leave_lazy_mmu_mode();
> >> + pte_unmap_unlock(start_pte, ptl);
> >> + start_pte = NULL;
> >> + err = split_folio(folio);
> >> + folio_unlock(folio);
> >> + folio_put(folio);
> >> + if (err)
> >> + continue;
> >> + start_pte = pte =
> >> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> + if (!start_pte)
> >> + break;
> >> + arch_enter_lazy_mmu_mode();
> >> + nr = 0;
> >> + continue;
> >> + }
> >> }
> >>
> >> /*
> >> * Do not interfere with other mappings of this folio and
> >> - * non-LRU folio.
> >> + * non-LRU folio. If we have a large folio at this point, we
> >> + * know it is fully mapped so if its mapcount is the same as its
> >> + * number of pages, it must be exclusive.
> >> */
> >> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >> + if (!folio_test_lru(folio) ||
> >> + folio_mapcount(folio) != folio_nr_pages(folio))
> >> continue;
> >
> > This looks so perfect and is exactly what I wanted to achieve.
> >
> >>
> >> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> continue;
> >>
> >> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >> - if (!pageout && pte_young(ptent)) {
> >> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> - tlb->fullmm);
> >> - ptent = pte_mkold(ptent);
> >> - set_pte_at(mm, addr, pte, ptent);
> >> - tlb_remove_tlb_entry(tlb, pte, addr);
> >> + if (!pageout) {
> >> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >> + if (ptep_test_and_clear_young(vma, addr, pte))
> >> + tlb_remove_tlb_entry(tlb, pte, addr);
> >> + }
> >
> > This looks so smart. if it is not pageout, we have increased pte
> > and addr here; so nr is 0 and we don't need to increase again in
> > for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >
> > otherwise, nr won't be 0. so we will increase addr and
> > pte by nr.
>
> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> madvise_free_pte_range().
>
>
> >
> >
> >> }
> >>
> >> /*
> >> --
> >> 2.25.1
> >>
> >
> > Overall, LGTM,
> >
> > Reviewed-by: Barry Song <[email protected]>
>
> Thanks!
>
>

2024-03-13 09:19:53

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <[email protected]> wrote:
>
> On 13/03/2024 07:19, Barry Song wrote:
> > On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >> folio that is fully and contiguously mapped in the pageout/cold vm
> >> range. This change means that large folios will be maintained all the
> >> way to swap storage. This both improves performance during swap-out, by
> >> eliding the cost of splitting the folio, and sets us up nicely for
> >> maintaining the large folio when it is swapped back in (to be covered in
> >> a separate series).
> >>
> >> Folios that are not fully mapped in the target range are still split,
> >> but note that behavior is changed so that if the split fails for any
> >> reason (folio locked, shared, etc) we now leave it as is and move to the
> >> next pte in the range and continue work on the proceeding folios.
> >> Previously any failure of this sort would cause the entire operation to
> >> give up and no folios mapped at higher addresses were paged out or made
> >> cold. Given large folios are becoming more common, this old behavior
> >> would have likely lead to wasted opportunities.
> >>
> >> While we are at it, change the code that clears young from the ptes to
> >> use ptep_test_and_clear_young(), which is more efficent than
> >> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >> where the old approach would require unfolding/refolding and the new
> >> approach can be done in place.
> >>
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >
> > This looks so much better than our initial RFC.
> > Thank you for your excellent work!
>
> Thanks - its a team effort - I had your PoC and David's previous batching work
> to use as a template.
>
> >
> >> ---
> >> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >> 1 file changed, 51 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 547dcd1f7a39..56c7ba7bd558 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> LIST_HEAD(folio_list);
> >> bool pageout_anon_only_filter;
> >> unsigned int batch_count = 0;
> >> + int nr;
> >>
> >> if (fatal_signal_pending(current))
> >> return -EINTR;
> >> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> return 0;
> >> flush_tlb_batched_pending(mm);
> >> arch_enter_lazy_mmu_mode();
> >> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> >> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >> + nr = 1;
> >> ptent = ptep_get(pte);
> >>
> >> if (++batch_count == SWAP_CLUSTER_MAX) {
> >> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> continue;
> >>
> >> /*
> >> - * Creating a THP page is expensive so split it only if we
> >> - * are sure it's worth. Split it if we are only owner.
> >> + * If we encounter a large folio, only split it if it is not
> >> + * fully mapped within the range we are operating on. Otherwise
> >> + * leave it as is so that it can be swapped out whole. If we
> >> + * fail to split a folio, leave it in place and advance to the
> >> + * next pte in the range.
> >> */
> >> if (folio_test_large(folio)) {
> >> - int err;
> >> -
> >> - if (folio_estimated_sharers(folio) > 1)
> >> - break;
> >> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> - break;
> >> - if (!folio_trylock(folio))
> >> - break;
> >> - folio_get(folio);
> >> - arch_leave_lazy_mmu_mode();
> >> - pte_unmap_unlock(start_pte, ptl);
> >> - start_pte = NULL;
> >> - err = split_folio(folio);
> >> - folio_unlock(folio);
> >> - folio_put(folio);
> >> - if (err)
> >> - break;
> >> - start_pte = pte =
> >> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> - if (!start_pte)
> >> - break;
> >> - arch_enter_lazy_mmu_mode();
> >> - pte--;
> >> - addr -= PAGE_SIZE;
> >> - continue;
> >> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >> + FPB_IGNORE_SOFT_DIRTY;
> >> + int max_nr = (end - addr) / PAGE_SIZE;
> >> +
> >> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> + fpb_flags, NULL);
> >
> > I wonder if we have a quick way to avoid folio_pte_batch() if users
> > are doing madvise() on a portion of a large folio.
>
> Good idea. Something like this?:
>
> if (pte_pfn(pte) == folio_pfn(folio)
> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> fpb_flags, NULL);
>
> If we are not mapping the first page of the folio, then it can't be a full
> mapping, so no need to call folio_pte_batch(). Just split it.
>
> >
> >> +
> >> + if (nr < folio_nr_pages(folio)) {
> >> + int err;
> >> +
> >> + if (folio_estimated_sharers(folio) > 1)
> >> + continue;
> >> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> + continue;
> >> + if (!folio_trylock(folio))
> >> + continue;
> >> + folio_get(folio);
> >> + arch_leave_lazy_mmu_mode();
> >> + pte_unmap_unlock(start_pte, ptl);
> >> + start_pte = NULL;
> >> + err = split_folio(folio);
> >> + folio_unlock(folio);
> >> + folio_put(folio);
> >> + if (err)
> >> + continue;
> >> + start_pte = pte =
> >> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> + if (!start_pte)
> >> + break;
> >> + arch_enter_lazy_mmu_mode();
> >> + nr = 0;
> >> + continue;
> >> + }
> >> }
> >>
> >> /*
> >> * Do not interfere with other mappings of this folio and
> >> - * non-LRU folio.
> >> + * non-LRU folio. If we have a large folio at this point, we
> >> + * know it is fully mapped so if its mapcount is the same as its
> >> + * number of pages, it must be exclusive.
> >> */
> >> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >> + if (!folio_test_lru(folio) ||
> >> + folio_mapcount(folio) != folio_nr_pages(folio))
> >> continue;
> >
> > This looks so perfect and is exactly what I wanted to achieve.
> >
> >>
> >> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> continue;
> >>
> >> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >> - if (!pageout && pte_young(ptent)) {
> >> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> - tlb->fullmm);
> >> - ptent = pte_mkold(ptent);
> >> - set_pte_at(mm, addr, pte, ptent);
> >> - tlb_remove_tlb_entry(tlb, pte, addr);
> >> + if (!pageout) {
> >> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >> + if (ptep_test_and_clear_young(vma, addr, pte))
> >> + tlb_remove_tlb_entry(tlb, pte, addr);
> >> + }
> >
> > This looks so smart. if it is not pageout, we have increased pte
> > and addr here; so nr is 0 and we don't need to increase again in
> > for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >
> > otherwise, nr won't be 0. so we will increase addr and
> > pte by nr.
>
> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> madvise_free_pte_range().

Thanks a lot, Ryan!
I'll make sure to follow a similar pattern for madvise_free_pte_range().
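
Something along these lines, presumably (a very rough skeleton just to
capture the shape; the MADV_FREE-specific handling is elided and
vm_normal_folio() is used purely for brevity):

	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
		nr = 1;
		ptent = ptep_get(pte);

		/* ... pte_none()/swap entry handling elided ... */

		folio = vm_normal_folio(vma, addr, ptent);
		if (!folio)
			continue;

		if (folio_test_large(folio)) {
			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
						FPB_IGNORE_SOFT_DIRTY;
			int max_nr = (end - addr) / PAGE_SIZE;

			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
					     fpb_flags, NULL);
			if (nr < folio_nr_pages(folio)) {
				/* not fully mapped here: try to split, else skip */
				continue;
			}
		}

		/* ... clear young/dirty on all nr ptes, mark folio lazyfree ... */
	}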

Best,
Lance

>
>
> >
> >
> >> }
> >>
> >> /*
> >> --
> >> 2.25.1
> >>
> >
> > Overall, LGTM,
> >
> > Reviewed-by: Barry Song <[email protected]>
>
> Thanks!
>
>

2024-03-13 09:37:12

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 13/03/2024 09:16, Barry Song wrote:
> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the proceeding folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would have likely lead to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - its a team effort - I had your PoC and David's previous batching work
>> to use as a template.
>>
>>>
>>>> ---
>>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>> 1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> LIST_HEAD(folio_list);
>>>> bool pageout_anon_only_filter;
>>>> unsigned int batch_count = 0;
>>>> + int nr;
>>>>
>>>> if (fatal_signal_pending(current))
>>>> return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> return 0;
>>>> flush_tlb_batched_pending(mm);
>>>> arch_enter_lazy_mmu_mode();
>>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> + nr = 1;
>>>> ptent = ptep_get(pte);
>>>>
>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> continue;
>>>>
>>>> /*
>>>> - * Creating a THP page is expensive so split it only if we
>>>> - * are sure it's worth. Split it if we are only owner.
>>>> + * If we encounter a large folio, only split it if it is not
>>>> + * fully mapped within the range we are operating on. Otherwise
>>>> + * leave it as is so that it can be swapped out whole. If we
>>>> + * fail to split a folio, leave it in place and advance to the
>>>> + * next pte in the range.
>>>> */
>>>> if (folio_test_large(folio)) {
>>>> - int err;
>>>> -
>>>> - if (folio_estimated_sharers(folio) > 1)
>>>> - break;
>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> - break;
>>>> - if (!folio_trylock(folio))
>>>> - break;
>>>> - folio_get(folio);
>>>> - arch_leave_lazy_mmu_mode();
>>>> - pte_unmap_unlock(start_pte, ptl);
>>>> - start_pte = NULL;
>>>> - err = split_folio(folio);
>>>> - folio_unlock(folio);
>>>> - folio_put(folio);
>>>> - if (err)
>>>> - break;
>>>> - start_pte = pte =
>>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> - if (!start_pte)
>>>> - break;
>>>> - arch_enter_lazy_mmu_mode();
>>>> - pte--;
>>>> - addr -= PAGE_SIZE;
>>>> - continue;
>>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> + FPB_IGNORE_SOFT_DIRTY;
>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> + fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>> if (pte_pfn(pte) == folio_pfn(folio)
>
> what about
>
> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
>
> just to account for cases where the user's end address falls within
> the middle of a large folio?

yes, even better. I'll add this for the next version.

>
>
> BTW, another minor issue is here:
>
> if (++batch_count == SWAP_CLUSTER_MAX) {
> batch_count = 0;
> if (need_resched()) {
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(start_pte, ptl);
> cond_resched();
> goto restart;
> }
> }
>
> We are increasing 1 for nr ptes, thus, we are holding PTL longer
> than small folios case? we used to increase 1 for each PTE.
> Does it matter?

I thought about that, but the vast majority of the work is per-folio, not
per-pte. So I concluded it would be best to continue to increment per-folio.
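
For a sense of scale (assuming SWAP_CLUSTER_MAX is still 32, 4K base
pages and a fully populated 512-entry PTE table):

	small folios:       32 iterations x  1 PTE  =  32 PTEs between checks
	64K folios (nr=16): 32 iterations x 16 PTEs = 512 PTEs, a whole PTE table

So for 64K folios the need_resched() check effectively fires about once
per page table. But since each iteration still does the heavyweight
per-folio work (folio_isolate_lru() etc.) exactly once, the wall-clock
time between checks shouldn't grow by anything like 16x.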

>
>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>> fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
>>
>>>
>>>> +
>>>> + if (nr < folio_nr_pages(folio)) {
>>>> + int err;
>>>> +
>>>> + if (folio_estimated_sharers(folio) > 1)
>>>> + continue;
>>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> + continue;
>>>> + if (!folio_trylock(folio))
>>>> + continue;
>>>> + folio_get(folio);
>>>> + arch_leave_lazy_mmu_mode();
>>>> + pte_unmap_unlock(start_pte, ptl);
>>>> + start_pte = NULL;
>>>> + err = split_folio(folio);
>>>> + folio_unlock(folio);
>>>> + folio_put(folio);
>>>> + if (err)
>>>> + continue;
>>>> + start_pte = pte =
>>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> + if (!start_pte)
>>>> + break;
>>>> + arch_enter_lazy_mmu_mode();
>>>> + nr = 0;
>>>> + continue;
>>>> + }
>>>> }
>>>>
>>>> /*
>>>> * Do not interfere with other mappings of this folio and
>>>> - * non-LRU folio.
>>>> + * non-LRU folio. If we have a large folio at this point, we
>>>> + * know it is fully mapped so if its mapcount is the same as its
>>>> + * number of pages, it must be exclusive.
>>>> */
>>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> + if (!folio_test_lru(folio) ||
>>>> + folio_mapcount(folio) != folio_nr_pages(folio))
>>>> continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>
>>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> continue;
>>>>
>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> - if (!pageout && pte_young(ptent)) {
>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> - tlb->fullmm);
>>>> - ptent = pte_mkold(ptent);
>>>> - set_pte_at(mm, addr, pte, ptent);
>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>> + if (!pageout) {
>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>> + }
>>>
>>> This looks so smart. if it is not pageout, we have increased pte
>>> and addr here; so nr is 0 and we don't need to increase again in
>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> otherwise, nr won't be 0. so we will increase addr and
>>> pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
>>
>>
>>>
>>>
>>>> }
>>>>
>>>> /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song <[email protected]>
>>
>> Thanks!
>>
>>


2024-03-13 10:37:41

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <[email protected]> wrote:
>
> On 13/03/2024 09:16, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage. This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the proceeding folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would have likely lead to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficent than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <[email protected]>
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - its a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
> >>
> >>>
> >>>> ---
> >>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>> 1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> LIST_HEAD(folio_list);
> >>>> bool pageout_anon_only_filter;
> >>>> unsigned int batch_count = 0;
> >>>> + int nr;
> >>>>
> >>>> if (fatal_signal_pending(current))
> >>>> return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> return 0;
> >>>> flush_tlb_batched_pending(mm);
> >>>> arch_enter_lazy_mmu_mode();
> >>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> + nr = 1;
> >>>> ptent = ptep_get(pte);
> >>>>
> >>>> if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> continue;
> >>>>
> >>>> /*
> >>>> - * Creating a THP page is expensive so split it only if we
> >>>> - * are sure it's worth. Split it if we are only owner.
> >>>> + * If we encounter a large folio, only split it if it is not
> >>>> + * fully mapped within the range we are operating on. Otherwise
> >>>> + * leave it as is so that it can be swapped out whole. If we
> >>>> + * fail to split a folio, leave it in place and advance to the
> >>>> + * next pte in the range.
> >>>> */
> >>>> if (folio_test_large(folio)) {
> >>>> - int err;
> >>>> -
> >>>> - if (folio_estimated_sharers(folio) > 1)
> >>>> - break;
> >>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> - break;
> >>>> - if (!folio_trylock(folio))
> >>>> - break;
> >>>> - folio_get(folio);
> >>>> - arch_leave_lazy_mmu_mode();
> >>>> - pte_unmap_unlock(start_pte, ptl);
> >>>> - start_pte = NULL;
> >>>> - err = split_folio(folio);
> >>>> - folio_unlock(folio);
> >>>> - folio_put(folio);
> >>>> - if (err)
> >>>> - break;
> >>>> - start_pte = pte =
> >>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> - if (!start_pte)
> >>>> - break;
> >>>> - arch_enter_lazy_mmu_mode();
> >>>> - pte--;
> >>>> - addr -= PAGE_SIZE;
> >>>> - continue;
> >>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> + FPB_IGNORE_SOFT_DIRTY;
> >>>> + int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> + fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >> if (pte_pfn(pte) == folio_pfn(folio)
> >
> > what about
> >
> > "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >
> > just to account for cases where the user's end address falls within
> > the middle of a large folio?
>
> yes, even better. I'll add this for the next version.
>
> >
> >
> > BTW, another minor issue is here:
> >
> > if (++batch_count == SWAP_CLUSTER_MAX) {
> > batch_count = 0;
> > if (need_resched()) {
> > arch_leave_lazy_mmu_mode();
> > pte_unmap_unlock(start_pte, ptl);
> > cond_resched();
> > goto restart;
> > }
> > }
> >
> > We are increasing 1 for nr ptes, thus, we are holding PTL longer
> > than small folios case? we used to increase 1 for each PTE.
> > Does it matter?
>
> I thought about that, but the vast majority of the work is per-folio, not
> per-pte. So I concluded it would be best to continue to increment per-folio.

Okay. The original patch, commit b2f557a21bc8 ("mm/madvise: add
cond_resched() in madvise_cold_or_pageout_pte_range()"), primarily
addressed a real-time wake-up latency issue. MADV_PAGEOUT and MADV_COLD
are much less critical than paths such as do_anonymous_page() or
do_swap_page(), which genuinely need the PTL to make progress. So
releasing the PTL relatively aggressively here seems to neither harm
MADV_PAGEOUT/COLD nor disadvantage those other paths.

We are slightly increasing how long the PTL is held, because the
iteration inside folio_pte_batch() can take longer than the small-folio
case, which doesn't need it at all. However, compared to operations
like folio_isolate_lru() and folio_deactivate(), that increase seems
negligible. We have also recently removed ptep_test_and_clear_young()
for MADV_PAGEOUT, which should further benefit real-time scenarios.
Nonetheless, there is a small risk with large folios, such as 1 MiB
mTHP, where folio_pte_batch() may need to loop 256 times.

I would vote for increasing batch_count by 'nr' (or maybe by
max(log2(nr), 1)) rather than by 1, roughly as sketched below, for two
reasons:

1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are improving
them by reducing the time taken to put the same number of pages onto
the reclaim list.

2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
genuinely require the PTL to make progress. Moreover, the majority of
the time spent on PAGEOUT is actually spent in reclaim_pages().
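
For concreteness, something like the below (only a sketch of the idea,
not part of the patch; nr would need initialising, e.g. to 1, before
the loop, and the check becomes >= because a large batch can jump past
SWAP_CLUSTER_MAX):

	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
		/* charge every PTE advanced over by the previous iteration */
		batch_count += nr;
		nr = 1;
		ptent = ptep_get(pte);

		if (batch_count >= SWAP_CLUSTER_MAX) {
			batch_count = 0;
			if (need_resched()) {
				arch_leave_lazy_mmu_mode();
				pte_unmap_unlock(start_pte, ptl);
				cond_resched();
				goto restart;
			}
		}
		...
	}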

> >
> >> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >>
> >>>
> >>>> +
> >>>> + if (nr < folio_nr_pages(folio)) {
> >>>> + int err;
> >>>> +
> >>>> + if (folio_estimated_sharers(folio) > 1)
> >>>> + continue;
> >>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> + continue;
> >>>> + if (!folio_trylock(folio))
> >>>> + continue;
> >>>> + folio_get(folio);
> >>>> + arch_leave_lazy_mmu_mode();
> >>>> + pte_unmap_unlock(start_pte, ptl);
> >>>> + start_pte = NULL;
> >>>> + err = split_folio(folio);
> >>>> + folio_unlock(folio);
> >>>> + folio_put(folio);
> >>>> + if (err)
> >>>> + continue;
> >>>> + start_pte = pte =
> >>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> + if (!start_pte)
> >>>> + break;
> >>>> + arch_enter_lazy_mmu_mode();
> >>>> + nr = 0;
> >>>> + continue;
> >>>> + }
> >>>> }
> >>>>
> >>>> /*
> >>>> * Do not interfere with other mappings of this folio and
> >>>> - * non-LRU folio.
> >>>> + * non-LRU folio. If we have a large folio at this point, we
> >>>> + * know it is fully mapped so if its mapcount is the same as its
> >>>> + * number of pages, it must be exclusive.
> >>>> */
> >>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> + if (!folio_test_lru(folio) ||
> >>>> + folio_mapcount(folio) != folio_nr_pages(folio))
> >>>> continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> continue;
> >>>>
> >>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> - if (!pageout && pte_young(ptent)) {
> >>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> - tlb->fullmm);
> >>>> - ptent = pte_mkold(ptent);
> >>>> - set_pte_at(mm, addr, pte, ptent);
> >>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> + if (!pageout) {
> >>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> + if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> + tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> + }
> >>>
> >>> This looks so smart. if it is not pageout, we have increased pte
> >>> and addr here; so nr is 0 and we don't need to increase again in
> >>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>
> >>> otherwise, nr won't be 0. so we will increase addr and
> >>> pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>
> >>>
> >>>
> >>>> }
> >>>>
> >>>> /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song <[email protected]>
> >>
> >> Thanks!
> >>

Thanks
Barry

2024-03-13 11:10:37

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 13/03/2024 10:37, Barry Song wrote:
> On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 13/03/2024 09:16, Barry Song wrote:
>>> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 13/03/2024 07:19, Barry Song wrote:
>>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>>>> range. This change means that large folios will be maintained all the
>>>>>> way to swap storage. This both improves performance during swap-out, by
>>>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>>>> a separate series).
>>>>>>
>>>>>> Folios that are not fully mapped in the target range are still split,
>>>>>> but note that behavior is changed so that if the split fails for any
>>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>>>> next pte in the range and continue work on the proceeding folios.
>>>>>> Previously any failure of this sort would cause the entire operation to
>>>>>> give up and no folios mapped at higher addresses were paged out or made
>>>>>> cold. Given large folios are becoming more common, this old behavior
>>>>>> would have likely lead to wasted opportunities.
>>>>>>
>>>>>> While we are at it, change the code that clears young from the ptes to
>>>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>>>> where the old approach would require unfolding/refolding and the new
>>>>>> approach can be done in place.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>
>>>>> This looks so much better than our initial RFC.
>>>>> Thank you for your excellent work!
>>>>
>>>> Thanks - its a team effort - I had your PoC and David's previous batching work
>>>> to use as a template.
>>>>
>>>>>
>>>>>> ---
>>>>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>>> 1 file changed, 51 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>>>> --- a/mm/madvise.c
>>>>>> +++ b/mm/madvise.c
>>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> LIST_HEAD(folio_list);
>>>>>> bool pageout_anon_only_filter;
>>>>>> unsigned int batch_count = 0;
>>>>>> + int nr;
>>>>>>
>>>>>> if (fatal_signal_pending(current))
>>>>>> return -EINTR;
>>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> return 0;
>>>>>> flush_tlb_batched_pending(mm);
>>>>>> arch_enter_lazy_mmu_mode();
>>>>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>>>> + nr = 1;
>>>>>> ptent = ptep_get(pte);
>>>>>>
>>>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> continue;
>>>>>>
>>>>>> /*
>>>>>> - * Creating a THP page is expensive so split it only if we
>>>>>> - * are sure it's worth. Split it if we are only owner.
>>>>>> + * If we encounter a large folio, only split it if it is not
>>>>>> + * fully mapped within the range we are operating on. Otherwise
>>>>>> + * leave it as is so that it can be swapped out whole. If we
>>>>>> + * fail to split a folio, leave it in place and advance to the
>>>>>> + * next pte in the range.
>>>>>> */
>>>>>> if (folio_test_large(folio)) {
>>>>>> - int err;
>>>>>> -
>>>>>> - if (folio_estimated_sharers(folio) > 1)
>>>>>> - break;
>>>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> - break;
>>>>>> - if (!folio_trylock(folio))
>>>>>> - break;
>>>>>> - folio_get(folio);
>>>>>> - arch_leave_lazy_mmu_mode();
>>>>>> - pte_unmap_unlock(start_pte, ptl);
>>>>>> - start_pte = NULL;
>>>>>> - err = split_folio(folio);
>>>>>> - folio_unlock(folio);
>>>>>> - folio_put(folio);
>>>>>> - if (err)
>>>>>> - break;
>>>>>> - start_pte = pte =
>>>>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> - if (!start_pte)
>>>>>> - break;
>>>>>> - arch_enter_lazy_mmu_mode();
>>>>>> - pte--;
>>>>>> - addr -= PAGE_SIZE;
>>>>>> - continue;
>>>>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>> + FPB_IGNORE_SOFT_DIRTY;
>>>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>>>> +
>>>>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>> + fpb_flags, NULL);
>>>>>
>>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>>>> are doing madvise() on a portion of a large folio.
>>>>
>>>> Good idea. Something like this?:
>>>>
>>>> if (pte_pfn(pte) == folio_pfn(folio)
>>>
>>> what about
>>>
>>> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
>>>
>>> just to account for cases where the user's end address falls within
>>> the middle of a large folio?
>>
>> yes, even better. I'll add this for the next version.
>>
>>>
>>>
>>> BTW, another minor issue is here:
>>>
>>> if (++batch_count == SWAP_CLUSTER_MAX) {
>>> batch_count = 0;
>>> if (need_resched()) {
>>> arch_leave_lazy_mmu_mode();
>>> pte_unmap_unlock(start_pte, ptl);
>>> cond_resched();
>>> goto restart;
>>> }
>>> }
>>>
>>> We are increasing 1 for nr ptes, thus, we are holding PTL longer
>>> than small folios case? we used to increase 1 for each PTE.
>>> Does it matter?
>>
>> I thought about that, but the vast majority of the work is per-folio, not
>> per-pte. So I concluded it would be best to continue to increment per-folio.
>
> Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
> cond_resched() in madvise_cold_or_pageout_pte_range()")
> primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
> and MADV_COLD are much less critical compared
> to other scenarios where operations like do_anon_page or do_swap_page
> necessarily need PTL to progress. Therefore, adopting
> an approach that relatively aggressively releases the PTL seems to
> neither harm MADV_PAGEOUT/COLD nor disadvantage
> others.
>
> We are slightly increasing the duration of holding the PTL due to the
> iteration of folio_pte_batch() potentially taking longer than
> the case of small folios, which do not require it.

If we can't scan all the PTEs in a page table without dropping the PTL
intermittently, we have bigger problems. This all works perfectly fine
in all the other PTE iterators; see zap_pte_range(), for example.

> However, compared
> to operations like folio_isolate_lru() and folio_deactivate(),
> this increase seems negligible. Recently, we have actually removed
> ptep_test_and_clear_young() for MADV_PAGEOUT,
> which should also benefit real-time scenarios. Nonetheless, there is a
> small risk with large folios, such as 1 MiB mTHP, where
> we may need to loop 256 times in folio_pte_batch().

As I understand it, RT and THP are mutually exclusive. RT can't handle
the extra latencies THPs can cause in the allocation path, etc. So I
don't think you will see a problem here.

>
> I would vote for increasing 'nr' or maybe max(log2(nr), 1) rather than
> 1 for two reasons:
>
> 1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
> improving them by reducing the time taken to put the same
> number of pages into the reclaim list.
>
> 2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
> genuinely require the PTL to progress. Moreover,
> the majority of time spent on PAGEOUT is actually reclaim_pages().

I understand your logic. But I'd rather optimize for fewer lock acquisitions for
the !RT+THP case, since RT+THP is not supported.

>
>>>
>>>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> fpb_flags, NULL);
>>>>
>>>> If we are not mapping the first page of the folio, then it can't be a full
>>>> mapping, so no need to call folio_pte_batch(). Just split it.
>>>>
>>>>>
>>>>>> +
>>>>>> + if (nr < folio_nr_pages(folio)) {
>>>>>> + int err;
>>>>>> +
>>>>>> + if (folio_estimated_sharers(folio) > 1)
>>>>>> + continue;
>>>>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> + continue;
>>>>>> + if (!folio_trylock(folio))
>>>>>> + continue;
>>>>>> + folio_get(folio);
>>>>>> + arch_leave_lazy_mmu_mode();
>>>>>> + pte_unmap_unlock(start_pte, ptl);
>>>>>> + start_pte = NULL;
>>>>>> + err = split_folio(folio);
>>>>>> + folio_unlock(folio);
>>>>>> + folio_put(folio);
>>>>>> + if (err)
>>>>>> + continue;
>>>>>> + start_pte = pte =
>>>>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> + if (!start_pte)
>>>>>> + break;
>>>>>> + arch_enter_lazy_mmu_mode();
>>>>>> + nr = 0;
>>>>>> + continue;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> * Do not interfere with other mappings of this folio and
>>>>>> - * non-LRU folio.
>>>>>> + * non-LRU folio. If we have a large folio at this point, we
>>>>>> + * know it is fully mapped so if its mapcount is the same as its
>>>>>> + * number of pages, it must be exclusive.
>>>>>> */
>>>>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>>>> + if (!folio_test_lru(folio) ||
>>>>>> + folio_mapcount(folio) != folio_nr_pages(folio))
>>>>>> continue;
>>>>>
>>>>> This looks so perfect and is exactly what I wanted to achieve.
>>>>>
>>>>>>
>>>>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> continue;
>>>>>>
>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>> -
>>>>>> - if (!pageout && pte_young(ptent)) {
>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>> - tlb->fullmm);
>>>>>> - ptent = pte_mkold(ptent);
>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> + if (!pageout) {
>>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> + }
>>>>>
>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>
>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>> pte by nr.
>>>>
>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>> madvise_free_pte_range().
>>>>
>>>>
>>>>>
>>>>>
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>>
>>>>> Overall, LGTM,
>>>>>
>>>>> Reviewed-by: Barry Song <[email protected]>
>>>>
>>>> Thanks!
>>>>
>
> Thanks
> Barry


2024-03-13 11:37:58

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Wed, Mar 13, 2024 at 7:08 PM Ryan Roberts <[email protected]> wrote:
>
> On 13/03/2024 10:37, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 13/03/2024 09:16, Barry Song wrote:
> >>> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 13/03/2024 07:19, Barry Song wrote:
> >>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>>>> range. This change means that large folios will be maintained all the
> >>>>>> way to swap storage. This both improves performance during swap-out, by
> >>>>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>>>> a separate series).
> >>>>>>
> >>>>>> Folios that are not fully mapped in the target range are still split,
> >>>>>> but note that behavior is changed so that if the split fails for any
> >>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>>>> next pte in the range and continue work on the proceeding folios.
> >>>>>> Previously any failure of this sort would cause the entire operation to
> >>>>>> give up and no folios mapped at higher addresses were paged out or made
> >>>>>> cold. Given large folios are becoming more common, this old behavior
> >>>>>> would have likely lead to wasted opportunities.
> >>>>>>
> >>>>>> While we are at it, change the code that clears young from the ptes to
> >>>>>> use ptep_test_and_clear_young(), which is more efficent than
> >>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>>>> where the old approach would require unfolding/refolding and the new
> >>>>>> approach can be done in place.
> >>>>>>
> >>>>>> Signed-off-by: Ryan Roberts <[email protected]>
> >>>>>
> >>>>> This looks so much better than our initial RFC.
> >>>>> Thank you for your excellent work!
> >>>>
> >>>> Thanks - its a team effort - I had your PoC and David's previous batching work
> >>>> to use as a template.
> >>>>
> >>>>>
> >>>>>> ---
> >>>>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>>> 1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>> LIST_HEAD(folio_list);
> >>>>>> bool pageout_anon_only_filter;
> >>>>>> unsigned int batch_count = 0;
> >>>>>> + int nr;
> >>>>>>
> >>>>>> if (fatal_signal_pending(current))
> >>>>>> return -EINTR;
> >>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>> return 0;
> >>>>>> flush_tlb_batched_pending(mm);
> >>>>>> arch_enter_lazy_mmu_mode();
> >>>>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>>>> + nr = 1;
> >>>>>> ptent = ptep_get(pte);
> >>>>>>
> >>>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>> continue;
> >>>>>>
> >>>>>> /*
> >>>>>> - * Creating a THP page is expensive so split it only if we
> >>>>>> - * are sure it's worth. Split it if we are only owner.
> >>>>>> + * If we encounter a large folio, only split it if it is not
> >>>>>> + * fully mapped within the range we are operating on. Otherwise
> >>>>>> + * leave it as is so that it can be swapped out whole. If we
> >>>>>> + * fail to split a folio, leave it in place and advance to the
> >>>>>> + * next pte in the range.
> >>>>>> */
> >>>>>> if (folio_test_large(folio)) {
> >>>>>> - int err;
> >>>>>> -
> >>>>>> - if (folio_estimated_sharers(folio) > 1)
> >>>>>> - break;
> >>>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> - break;
> >>>>>> - if (!folio_trylock(folio))
> >>>>>> - break;
> >>>>>> - folio_get(folio);
> >>>>>> - arch_leave_lazy_mmu_mode();
> >>>>>> - pte_unmap_unlock(start_pte, ptl);
> >>>>>> - start_pte = NULL;
> >>>>>> - err = split_folio(folio);
> >>>>>> - folio_unlock(folio);
> >>>>>> - folio_put(folio);
> >>>>>> - if (err)
> >>>>>> - break;
> >>>>>> - start_pte = pte =
> >>>>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> - if (!start_pte)
> >>>>>> - break;
> >>>>>> - arch_enter_lazy_mmu_mode();
> >>>>>> - pte--;
> >>>>>> - addr -= PAGE_SIZE;
> >>>>>> - continue;
> >>>>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>>>> + FPB_IGNORE_SOFT_DIRTY;
> >>>>>> + int max_nr = (end - addr) / PAGE_SIZE;
> >>>>>> +
> >>>>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>>> + fpb_flags, NULL);
> >>>>>
> >>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>>>> are doing madvise() on a portion of a large folio.
> >>>>
> >>>> Good idea. Something like this?:
> >>>>
> >>>> if (pte_pfn(pte) == folio_pfn(folio)
> >>>
> >>> what about
> >>>
> >>> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >>>
> >>> just to account for cases where the user's end address falls within
> >>> the middle of a large folio?
> >>
> >> yes, even better. I'll add this for the next version.
> >>
> >>>
> >>>
> >>> BTW, another minor issue is here:
> >>>
> >>> if (++batch_count == SWAP_CLUSTER_MAX) {
> >>> batch_count = 0;
> >>> if (need_resched()) {
> >>> arch_leave_lazy_mmu_mode();
> >>> pte_unmap_unlock(start_pte, ptl);
> >>> cond_resched();
> >>> goto restart;
> >>> }
> >>> }
> >>>
> >>> We are increasing 1 for nr ptes, thus, we are holding PTL longer
> >>> than small folios case? we used to increase 1 for each PTE.
> >>> Does it matter?
> >>
> >> I thought about that, but the vast majority of the work is per-folio, not
> >> per-pte. So I concluded it would be best to continue to increment per-folio.
> >
> > Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
> > cond_resched() in madvise_cold_or_pageout_pte_range()")
> > primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
> > and MADV_COLD are much less critical compared
> > to other scenarios where operations like do_anon_page or do_swap_page
> > necessarily need PTL to progress. Therefore, adopting
> > an approach that relatively aggressively releases the PTL seems to
> > neither harm MADV_PAGEOUT/COLD nor disadvantage
> > others.
> >
> > We are slightly increasing the duration of holding the PTL due to the
> > iteration of folio_pte_batch() potentially taking longer than
> > the case of small folios, which do not require it.
>
> If we can't scan all the PTEs in a page table without dropping the PTL
> intermittently we have bigger problems. This all works perfectly fine in all the
> other PTE iterators; see zap_pte_range() for example.

I've no doubt about folio_pte_batch(); I was just talking about the
original RT issue it might affect.

>
> > However, compared
> > to operations like folio_isolate_lru() and folio_deactivate(),
> > this increase seems negligible. Recently, we have actually removed
> > ptep_test_and_clear_young() for MADV_PAGEOUT,
> > which should also benefit real-time scenarios. Nonetheless, there is a
> > small risk with large folios, such as 1 MiB mTHP, where
> > we may need to loop 256 times in folio_pte_batch().
>
> As I understand it, RT and THP are mutually exclusive. RT can't handle the extra
> latencies THPs can cause in allocation path, etc. So I don't think you will see
> a problem here.

I was actually taking a different approach on phones, since we have
some UX (user-experience)/UI/audio-related tasks which cannot tolerate
allocation latency. With a TAO-like optimization (we did it with an
ext_migratetype for some pageblocks), we don't push the buddy allocator
into compaction or reclamation to form a 64KiB folio; we immediately
fall back to small folios if a zero-latency 64KiB allocation can't be
obtained from those pools of ext_migratetype pageblocks.

>
> >
> > I would vote for increasing 'nr' or maybe max(log2(nr), 1) rather than
> > 1 for two reasons:
> >
> > 1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
> > improving them by reducing the time taken to put the same
> > number of pages into the reclaim list.
> >
> > 2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
> > genuinely require the PTL to progress. Moreover,
> > the majority of time spent on PAGEOUT is actually reclaim_pages().
>
> I understand your logic. But I'd rather optimize for fewer lock acquisitions for
> the !RT+THP case, since RT+THP is not supported.

Fair enough. I agree we can postpone this until RT and THP become a
workable combination. For now, keeping this patch simpler seems better.

>
> >
> >>>
> >>>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> fpb_flags, NULL);
> >>>>
> >>>> If we are not mapping the first page of the folio, then it can't be a full
> >>>> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> + if (nr < folio_nr_pages(folio)) {
> >>>>>> + int err;
> >>>>>> +
> >>>>>> + if (folio_estimated_sharers(folio) > 1)
> >>>>>> + continue;
> >>>>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> + continue;
> >>>>>> + if (!folio_trylock(folio))
> >>>>>> + continue;
> >>>>>> + folio_get(folio);
> >>>>>> + arch_leave_lazy_mmu_mode();
> >>>>>> + pte_unmap_unlock(start_pte, ptl);
> >>>>>> + start_pte = NULL;
> >>>>>> + err = split_folio(folio);
> >>>>>> + folio_unlock(folio);
> >>>>>> + folio_put(folio);
> >>>>>> + if (err)
> >>>>>> + continue;
> >>>>>> + start_pte = pte =
> >>>>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> + if (!start_pte)
> >>>>>> + break;
> >>>>>> + arch_enter_lazy_mmu_mode();
> >>>>>> + nr = 0;
> >>>>>> + continue;
> >>>>>> + }
> >>>>>> }
> >>>>>>
> >>>>>> /*
> >>>>>> * Do not interfere with other mappings of this folio and
> >>>>>> - * non-LRU folio.
> >>>>>> + * non-LRU folio. If we have a large folio at this point, we
> >>>>>> + * know it is fully mapped so if its mapcount is the same as its
> >>>>>> + * number of pages, it must be exclusive.
> >>>>>> */
> >>>>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>>>> + if (!folio_test_lru(folio) ||
> >>>>>> + folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>>> continue;
> >>>>>
> >>>>> This looks so perfect and is exactly what I wanted to achieve.
> >>>>>
> >>>>>>
> >>>>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> continue;
> >>>>>>
> >>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>> -
> >>>>>> - if (!pageout && pte_young(ptent)) {
> >>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>> - tlb->fullmm);
> >>>>>> - ptent = pte_mkold(ptent);
> >>>>>> - set_pte_at(mm, addr, pte, ptent);
> >>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> + if (!pageout) {
> >>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> + }
> >>>>>
> >>>>> This looks so smart. if it is not pageout, we have increased pte
> >>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>
> >>>>> otherwise, nr won't be 0. so we will increase addr and
> >>>>> pte by nr.
> >>>>
> >>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>> madvise_free_pte_range().
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>> }
> >>>>>>
> >>>>>> /*
> >>>>>> --
> >>>>>> 2.25.1
> >>>>>>
> >>>>>
> >>>>> Overall, LGTM,
> >>>>>
> >>>>> Reviewed-by: Barry Song <[email protected]>
> >>>>
> >>>> Thanks!
> >>>>
> >
> > Thanks
> > Barry
>

2024-03-13 12:04:41

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 13/03/2024 11:37, Barry Song wrote:
> On Wed, Mar 13, 2024 at 7:08 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 13/03/2024 10:37, Barry Song wrote:
>>> On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 13/03/2024 09:16, Barry Song wrote:
>>>>> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 13/03/2024 07:19, Barry Song wrote:
>>>>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>>>>>> range. This change means that large folios will be maintained all the
>>>>>>>> way to swap storage. This both improves performance during swap-out, by
>>>>>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>>>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>>>>>> a separate series).
>>>>>>>>
>>>>>>>> Folios that are not fully mapped in the target range are still split,
>>>>>>>> but note that behavior is changed so that if the split fails for any
>>>>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>>>>>> next pte in the range and continue work on the proceeding folios.
>>>>>>>> Previously any failure of this sort would cause the entire operation to
>>>>>>>> give up and no folios mapped at higher addresses were paged out or made
>>>>>>>> cold. Given large folios are becoming more common, this old behavior
>>>>>>>> would have likely lead to wasted opportunities.
>>>>>>>>
>>>>>>>> While we are at it, change the code that clears young from the ptes to
>>>>>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>>>>>> where the old approach would require unfolding/refolding and the new
>>>>>>>> approach can be done in place.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>
>>>>>>> This looks so much better than our initial RFC.
>>>>>>> Thank you for your excellent work!
>>>>>>
>>>>>> Thanks - its a team effort - I had your PoC and David's previous batching work
>>>>>> to use as a template.
>>>>>>
>>>>>>>
>>>>>>>> ---
>>>>>>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>>>>> 1 file changed, 51 insertions(+), 38 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>>>>>> --- a/mm/madvise.c
>>>>>>>> +++ b/mm/madvise.c
>>>>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>> LIST_HEAD(folio_list);
>>>>>>>> bool pageout_anon_only_filter;
>>>>>>>> unsigned int batch_count = 0;
>>>>>>>> + int nr;
>>>>>>>>
>>>>>>>> if (fatal_signal_pending(current))
>>>>>>>> return -EINTR;
>>>>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>> return 0;
>>>>>>>> flush_tlb_batched_pending(mm);
>>>>>>>> arch_enter_lazy_mmu_mode();
>>>>>>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>>>>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>>>>>> + nr = 1;
>>>>>>>> ptent = ptep_get(pte);
>>>>>>>>
>>>>>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>> continue;
>>>>>>>>
>>>>>>>> /*
>>>>>>>> - * Creating a THP page is expensive so split it only if we
>>>>>>>> - * are sure it's worth. Split it if we are only owner.
>>>>>>>> + * If we encounter a large folio, only split it if it is not
>>>>>>>> + * fully mapped within the range we are operating on. Otherwise
>>>>>>>> + * leave it as is so that it can be swapped out whole. If we
>>>>>>>> + * fail to split a folio, leave it in place and advance to the
>>>>>>>> + * next pte in the range.
>>>>>>>> */
>>>>>>>> if (folio_test_large(folio)) {
>>>>>>>> - int err;
>>>>>>>> -
>>>>>>>> - if (folio_estimated_sharers(folio) > 1)
>>>>>>>> - break;
>>>>>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>>> - break;
>>>>>>>> - if (!folio_trylock(folio))
>>>>>>>> - break;
>>>>>>>> - folio_get(folio);
>>>>>>>> - arch_leave_lazy_mmu_mode();
>>>>>>>> - pte_unmap_unlock(start_pte, ptl);
>>>>>>>> - start_pte = NULL;
>>>>>>>> - err = split_folio(folio);
>>>>>>>> - folio_unlock(folio);
>>>>>>>> - folio_put(folio);
>>>>>>>> - if (err)
>>>>>>>> - break;
>>>>>>>> - start_pte = pte =
>>>>>>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>>>> - if (!start_pte)
>>>>>>>> - break;
>>>>>>>> - arch_enter_lazy_mmu_mode();
>>>>>>>> - pte--;
>>>>>>>> - addr -= PAGE_SIZE;
>>>>>>>> - continue;
>>>>>>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>>>> + FPB_IGNORE_SOFT_DIRTY;
>>>>>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>>>>>> +
>>>>>>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>>>> + fpb_flags, NULL);
>>>>>>>
>>>>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>>>>>> are doing madvise() on a portion of a large folio.
>>>>>>
>>>>>> Good idea. Something like this?:
>>>>>>
>>>>>> if (pte_pfn(pte) == folio_pfn(folio)
>>>>>
>>>>> what about
>>>>>
>>>>> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
>>>>>
>>>>> just to account for cases where the user's end address falls within
>>>>> the middle of a large folio?
>>>>
>>>> yes, even better. I'll add this for the next version.
>>>>
>>>>>
>>>>>
>>>>> BTW, another minor issue is here:
>>>>>
>>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>> batch_count = 0;
>>>>> if (need_resched()) {
>>>>> arch_leave_lazy_mmu_mode();
>>>>> pte_unmap_unlock(start_pte, ptl);
>>>>> cond_resched();
>>>>> goto restart;
>>>>> }
>>>>> }
>>>>>
>>>>> We are increasing 1 for nr ptes, thus, we are holding PTL longer
>>>>> than small folios case? we used to increase 1 for each PTE.
>>>>> Does it matter?
>>>>
>>>> I thought about that, but the vast majority of the work is per-folio, not
>>>> per-pte. So I concluded it would be best to continue to increment per-folio.
>>>
>>> Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
>>> cond_resched() in madvise_cold_or_pageout_pte_range()")
>>> primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
>>> and MADV_COLD are much less critical compared
>>> to other scenarios where operations like do_anon_page or do_swap_page
>>> necessarily need PTL to progress. Therefore, adopting
>>> an approach that relatively aggressively releases the PTL seems to
>>> neither harm MADV_PAGEOUT/COLD nor disadvantage
>>> others.
>>>
>>> We are slightly increasing the duration of holding the PTL due to the
>>> iteration of folio_pte_batch() potentially taking longer than
>>> the case of small folios, which do not require it.
>>
>> If we can't scan all the PTEs in a page table without dropping the PTL
>> intermittently we have bigger problems. This all works perfectly fine in all the
>> other PTE iterators; see zap_pte_range() for example.
>
> I've no doubt about folio_pte_batch(). was just talking about the
> original rt issue
> it might affect.
>
>>
>>> However, compared
>>> to operations like folio_isolate_lru() and folio_deactivate(),
>>> this increase seems negligible. Recently, we have actually removed
>>> ptep_test_and_clear_young() for MADV_PAGEOUT,
>>> which should also benefit real-time scenarios. Nonetheless, there is a
>>> small risk with large folios, such as 1 MiB mTHP, where
>>> we may need to loop 256 times in folio_pte_batch().
>>
>> As I understand it, RT and THP are mutually exclusive. RT can't handle the extra
>> latencies THPs can cause in allocation path, etc. So I don't think you will see
>> a problem here.
>
> I was actually taking a different approach on the phones as obviously
> we have some
> UX(user-experience)/UI/audio related tasks which cannot tolerate
> allocation latency. with
> a TAO-similar optimization(we did it by ext_migratetype for some pageblocks), we
> actually don't push buddy to do compaction or reclamation for forming
> 64KiB folio.
> We immediately fallback to small folios if a zero-latency 64KiB
> allocation can't be
> obtained from some kinds of pools - ext_migratetype pageblocks.

You can opt in to avoiding the latency of compaction, etc. with the
various settings in /sys/kernel/mm/transparent_hugepage/defrag; that
applies to mTHP as well. See Documentation/admin-guide/mm/transhuge.rst.
Obviously this is not as useful as the TAO approach because it does
nothing to avoid fragmentation in the first place.

The other source of latency for THP allocation, which I believe RT doesn't like,
is the cost of zeroing the huge page, IIRC.

>
>>
>>>
>>> I would vote for increasing batch_count by 'nr' (or maybe by
>>> max(log2(nr), 1)) rather than by 1, for two reasons:
>>>
>>> 1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
>>> improving them by reducing the time taken to put the same
>>> number of pages into the reclaim list.
>>>
>>> 2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
>>> genuinely require the PTL to progress. Moreover,
>>> the majority of time spent on PAGEOUT is actually reclaim_pages().
>>
>> I understand your logic. But I'd rather optimize for fewer lock acquisitions for
>> the !RT+THP case, since RT+THP is not supported.
>
> Fair enough. I agree we can postpone this until RT and THP become an
> available option.
> For now, keeping this patch simpler seems to be better.

OK thanks.

>
>>
>>>
>>>>>
>>>>>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>> fpb_flags, NULL);
>>>>>>
>>>>>> If we are not mapping the first page of the folio, then it can't be a full
>>>>>> mapping, so no need to call folio_pte_batch(). Just split it.
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> + if (nr < folio_nr_pages(folio)) {
>>>>>>>> + int err;
>>>>>>>> +
>>>>>>>> + if (folio_estimated_sharers(folio) > 1)
>>>>>>>> + continue;
>>>>>>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>>> + continue;
>>>>>>>> + if (!folio_trylock(folio))
>>>>>>>> + continue;
>>>>>>>> + folio_get(folio);
>>>>>>>> + arch_leave_lazy_mmu_mode();
>>>>>>>> + pte_unmap_unlock(start_pte, ptl);
>>>>>>>> + start_pte = NULL;
>>>>>>>> + err = split_folio(folio);
>>>>>>>> + folio_unlock(folio);
>>>>>>>> + folio_put(folio);
>>>>>>>> + if (err)
>>>>>>>> + continue;
>>>>>>>> + start_pte = pte =
>>>>>>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>>>> + if (!start_pte)
>>>>>>>> + break;
>>>>>>>> + arch_enter_lazy_mmu_mode();
>>>>>>>> + nr = 0;
>>>>>>>> + continue;
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> /*
>>>>>>>> * Do not interfere with other mappings of this folio and
>>>>>>>> - * non-LRU folio.
>>>>>>>> + * non-LRU folio. If we have a large folio at this point, we
>>>>>>>> + * know it is fully mapped so if its mapcount is the same as its
>>>>>>>> + * number of pages, it must be exclusive.
>>>>>>>> */
>>>>>>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>>>>>> + if (!folio_test_lru(folio) ||
>>>>>>>> + folio_mapcount(folio) != folio_nr_pages(folio))
>>>>>>>> continue;
>>>>>>>
>>>>>>> This looks so perfect and is exactly what I wanted to achieve.
>>>>>>>
>>>>>>>>
>>>>>>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>>> continue;
>>>>>>>>
>>>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>> -
>>>>>>>> - if (!pageout && pte_young(ptent)) {
>>>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>> - tlb->fullmm);
>>>>>>>> - ptent = pte_mkold(ptent);
>>>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> + if (!pageout) {
>>>>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> + }
>>>>>>>
>>>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>>>
>>>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>>>> pte by nr.
>>>>>>
>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>>>> madvise_free_pte_range().
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> /*
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> Overall, LGTM,
>>>>>>>
>>>>>>> Reviewed-by: Barry Song <[email protected]>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>
>>> Thanks
>>> Barry
>>


2024-03-13 14:03:15

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <[email protected]> wrote:
>
> On 13/03/2024 07:19, Barry Song wrote:
> > On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >> folio that is fully and contiguously mapped in the pageout/cold vm
> >> range. This change means that large folios will be maintained all the
> >> way to swap storage. This both improves performance during swap-out, by
> >> eliding the cost of splitting the folio, and sets us up nicely for
> >> maintaining the large folio when it is swapped back in (to be covered in
> >> a separate series).
> >>
> >> Folios that are not fully mapped in the target range are still split,
> >> but note that behavior is changed so that if the split fails for any
> >> reason (folio locked, shared, etc) we now leave it as is and move to the
> >> next pte in the range and continue work on the proceeding folios.
> >> Previously any failure of this sort would cause the entire operation to
> >> give up and no folios mapped at higher addresses were paged out or made
> >> cold. Given large folios are becoming more common, this old behavior
> >> would have likely lead to wasted opportunities.
> >>
> >> While we are at it, change the code that clears young from the ptes to
> >> use ptep_test_and_clear_young(), which is more efficent than
> >> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >> where the old approach would require unfolding/refolding and the new
> >> approach can be done in place.
> >>
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >
> > This looks so much better than our initial RFC.
> > Thank you for your excellent work!
>
> Thanks - its a team effort - I had your PoC and David's previous batching work
> to use as a template.
>
> >
> >> ---
> >> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >> 1 file changed, 51 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 547dcd1f7a39..56c7ba7bd558 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> LIST_HEAD(folio_list);
> >> bool pageout_anon_only_filter;
> >> unsigned int batch_count = 0;
> >> + int nr;
> >>
> >> if (fatal_signal_pending(current))
> >> return -EINTR;
> >> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> return 0;
> >> flush_tlb_batched_pending(mm);
> >> arch_enter_lazy_mmu_mode();
> >> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> >> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >> + nr = 1;
> >> ptent = ptep_get(pte);
> >>
> >> if (++batch_count == SWAP_CLUSTER_MAX) {
> >> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> continue;
> >>
> >> /*
> >> - * Creating a THP page is expensive so split it only if we
> >> - * are sure it's worth. Split it if we are only owner.
> >> + * If we encounter a large folio, only split it if it is not
> >> + * fully mapped within the range we are operating on. Otherwise
> >> + * leave it as is so that it can be swapped out whole. If we
> >> + * fail to split a folio, leave it in place and advance to the
> >> + * next pte in the range.
> >> */
> >> if (folio_test_large(folio)) {
> >> - int err;
> >> -
> >> - if (folio_estimated_sharers(folio) > 1)
> >> - break;
> >> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> - break;
> >> - if (!folio_trylock(folio))
> >> - break;
> >> - folio_get(folio);
> >> - arch_leave_lazy_mmu_mode();
> >> - pte_unmap_unlock(start_pte, ptl);
> >> - start_pte = NULL;
> >> - err = split_folio(folio);
> >> - folio_unlock(folio);
> >> - folio_put(folio);
> >> - if (err)
> >> - break;
> >> - start_pte = pte =
> >> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> - if (!start_pte)
> >> - break;
> >> - arch_enter_lazy_mmu_mode();
> >> - pte--;
> >> - addr -= PAGE_SIZE;
> >> - continue;
> >> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >> + FPB_IGNORE_SOFT_DIRTY;
> >> + int max_nr = (end - addr) / PAGE_SIZE;
> >> +
> >> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> + fpb_flags, NULL);
> >
> > I wonder if we have a quick way to avoid folio_pte_batch() if users
> > are doing madvise() on a portion of a large folio.
>
> Good idea. Something like this?:
>
> if (pte_pfn(pte) == folio_pfn(folio)
> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> fpb_flags, NULL);
>
> If we are not mapping the first page of the folio, then it can't be a full
> mapping, so no need to call folio_pte_batch(). Just split it.

if (folio_test_large(folio)) {
[...]
nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
fpb_flags, NULL);
+ if (folio_estimated_sharers(folio) > 1)
+ continue;

Could we use folio_estimated_sharers as an early exit point here?

if (nr < folio_nr_pages(folio)) {
int err;

- if (folio_estimated_sharers(folio) > 1)
- continue;
[...]

>
> >
> >> +
> >> + if (nr < folio_nr_pages(folio)) {
> >> + int err;
> >> +
> >> + if (folio_estimated_sharers(folio) > 1)
> >> + continue;
> >> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> + continue;
> >> + if (!folio_trylock(folio))
> >> + continue;
> >> + folio_get(folio);
> >> + arch_leave_lazy_mmu_mode();
> >> + pte_unmap_unlock(start_pte, ptl);
> >> + start_pte = NULL;
> >> + err = split_folio(folio);
> >> + folio_unlock(folio);
> >> + folio_put(folio);
> >> + if (err)
> >> + continue;
> >> + start_pte = pte =
> >> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> + if (!start_pte)
> >> + break;
> >> + arch_enter_lazy_mmu_mode();
> >> + nr = 0;
> >> + continue;
> >> + }
> >> }
> >>
> >> /*
> >> * Do not interfere with other mappings of this folio and
> >> - * non-LRU folio.
> >> + * non-LRU folio. If we have a large folio at this point, we
> >> + * know it is fully mapped so if its mapcount is the same as its
> >> + * number of pages, it must be exclusive.
> >> */
> >> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >> + if (!folio_test_lru(folio) ||
> >> + folio_mapcount(folio) != folio_nr_pages(folio))
> >> continue;
> >
> > This looks so perfect and is exactly what I wanted to achieve.
> >
> >>
> >> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> continue;
> >>
> >> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >> - if (!pageout && pte_young(ptent)) {
> >> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> - tlb->fullmm);
> >> - ptent = pte_mkold(ptent);
> >> - set_pte_at(mm, addr, pte, ptent);
> >> - tlb_remove_tlb_entry(tlb, pte, addr);
> >> + if (!pageout) {
> >> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >> + if (ptep_test_and_clear_young(vma, addr, pte))
> >> + tlb_remove_tlb_entry(tlb, pte, addr);

IIRC, some architectures (e.g. PPC) don't update the TLB via set_pte_at() and
tlb_remove_tlb_entry(). So did we consider remapping the PTE as old after
clearing it?

Thanks,
Lance



> >> + }
> >
> > This looks so smart. if it is not pageout, we have increased pte
> > and addr here; so nr is 0 and we don't need to increase again in
> > for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >
> > otherwise, nr won't be 0. so we will increase addr and
> > pte by nr.
>
> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> madvise_free_pte_range().
>
>
> >
> >
> >> }
> >>
> >> /*
> >> --
> >> 2.25.1
> >>
> >
> > Overall, LGTM,
> >
> > Reviewed-by: Barry Song <[email protected]>
>
> Thanks!
>
>

2024-03-15 10:35:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

> - if (!pageout && pte_young(ptent)) {
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> - ptent = pte_mkold(ptent);
> - set_pte_at(mm, addr, pte, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> + if (!pageout) {
> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> + if (ptep_test_and_clear_young(vma, addr, pte))
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + }
> }


The following might turn out a bit nicer: Make folio_pte_batch() collect
"any_young", then doing something like we do with "any_writable" in the
fork() case:

..
nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
fpb_flags, NULL, any_young);
if (any_young)
pte_mkyoung(ptent)
..

if (!pageout && pte_young(ptent)) {
mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);
tlb_remove_tlb_entries(tlb, pte, nr, addr);
}
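
For reference, a rough sketch of what a generic fallback for the proposed
mkold_full_ptes() could look like, built only from the per-PTE primitives the
current patch already uses (purely illustrative; the helper does not exist
yet, and the name/signature simply follow the suggestion above):

static inline void mkold_full_ptes(struct mm_struct *mm, unsigned long addr,
				   pte_t *ptep, unsigned int nr, int full)
{
	pte_t pte;

	for (; nr != 0; nr--, ptep++, addr += PAGE_SIZE) {
		/* Clear young while preserving all other PTE bits. */
		pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		set_pte_at(mm, addr, ptep, pte_mkold(pte));
	}
}

An arch like arm64 could then override it to clear the access bit in place
across a whole contpte block, avoiding the unfold/refold cost mentioned in the
commit message.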

--
Cheers,

David / dhildenb


2024-03-15 10:44:13

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 11.03.24 16:00, Ryan Roberts wrote:
> Now that swap supports storing all mTHP sizes, avoid splitting large
> folios before swap-out. This benefits performance of the swap-out path
> by eliding split_folio_to_list(), which is expensive, and also sets us
> up for swapping in large folios in a future series.
>
> If the folio is partially mapped, we continue to split it since we want
> to avoid the extra IO overhead and storage of writing out pages
> uneccessarily.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> mm/vmscan.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cf7d4cf47f1a..0ebec99e04c6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (!can_split_folio(folio, NULL))
> goto activate_locked;
> /*
> - * Split folios without a PMD map right
> - * away. Chances are some or all of the
> - * tail pages can be freed without IO.
> + * Split partially mapped folios map
> + * right away. Chances are some or all
> + * of the tail pages can be freed
> + * without IO.
> */
> - if (!folio_entire_mapcount(folio) &&
> + if (!list_empty(&folio->_deferred_list) &&
> split_folio_to_list(folio,
> folio_list))
> goto activate_locked;

Not sure if we might have to annotate that with data_race().

Reviewed-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2024-03-15 10:49:36

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 15/03/2024 10:43, David Hildenbrand wrote:
> On 11.03.24 16:00, Ryan Roberts wrote:
>> Now that swap supports storing all mTHP sizes, avoid splitting large
>> folios before swap-out. This benefits performance of the swap-out path
>> by eliding split_folio_to_list(), which is expensive, and also sets us
>> up for swapping in large folios in a future series.
>>
>> If the folio is partially mapped, we continue to split it since we want
>> to avoid the extra IO overhead and storage of writing out pages
>> uneccessarily.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   mm/vmscan.c | 9 +++++----
>>   1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index cf7d4cf47f1a..0ebec99e04c6 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>> *folio_list,
>>                       if (!can_split_folio(folio, NULL))
>>                           goto activate_locked;
>>                       /*
>> -                     * Split folios without a PMD map right
>> -                     * away. Chances are some or all of the
>> -                     * tail pages can be freed without IO.
>> +                     * Split partially mapped folios map
>> +                     * right away. Chances are some or all
>> +                     * of the tail pages can be freed
>> +                     * without IO.
>>                        */
>> -                    if (!folio_entire_mapcount(folio) &&
>> +                    if (!list_empty(&folio->_deferred_list) &&
>>                           split_folio_to_list(folio,
>>                                   folio_list))
>>                           goto activate_locked;
>
> Not sure if we might have to annotate that with data_race().

I asked that exact question to Matthew in another context but didn't get a
response. There are examples of checking if the deferred list is empty with and
without data_race() in the code base. But list_empty() is implemented like this:

static inline int list_empty(const struct list_head *head)
{
return READ_ONCE(head->next) == head;
}

So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
sufficient for KCSAN?


>
> Reviewed-by: David Hildenbrand <[email protected]>
>

Thanks!

2024-03-15 10:55:37

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 15/03/2024 10:35, David Hildenbrand wrote:
>> -        if (!pageout && pte_young(ptent)) {
>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                            tlb->fullmm);
>> -            ptent = pte_mkold(ptent);
>> -            set_pte_at(mm, addr, pte, ptent);
>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>> +        if (!pageout) {
>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>> +            }
>>           }
>
>
> The following might turn out a bit nicer: Make folio_pte_batch() collect
> "any_young", then doing something like we do with "any_writable" in the fork()
> case:
>
> ...
>     nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                  fpb_flags, NULL, any_young);
>     if (any_young)
>         pte_mkyoung(ptent)
> ...
>
> if (!pageout && pte_young(ptent)) {
>     mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>     tlb_remove_tlb_entries(tlb, pte, nr, addr);
> }
>

I thought about that but decided that it would be better to only TLBI the actual
entries that were young. Although looking at tlb_remove_tlb_entry() I see that
it just maintains a range between the lowest and highest address, so this won't
actually make any difference.

So, yes, this will be a nice improvement, and also prevent the O(n^2) pte reads
for the contpte case. I'll change in the next version.

2024-03-15 11:13:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 15.03.24 11:49, Ryan Roberts wrote:
> On 15/03/2024 10:43, David Hildenbrand wrote:
>> On 11.03.24 16:00, Ryan Roberts wrote:
>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>> folios before swap-out. This benefits performance of the swap-out path
>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>> up for swapping in large folios in a future series.
>>>
>>> If the folio is partially mapped, we continue to split it since we want
>>> to avoid the extra IO overhead and storage of writing out pages
>>> uneccessarily.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   mm/vmscan.c | 9 +++++----
>>>   1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>> *folio_list,
>>>                       if (!can_split_folio(folio, NULL))
>>>                           goto activate_locked;
>>>                       /*
>>> -                     * Split folios without a PMD map right
>>> -                     * away. Chances are some or all of the
>>> -                     * tail pages can be freed without IO.
>>> +                     * Split partially mapped folios map
>>> +                     * right away. Chances are some or all
>>> +                     * of the tail pages can be freed
>>> +                     * without IO.
>>>                        */
>>> -                    if (!folio_entire_mapcount(folio) &&
>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>                           split_folio_to_list(folio,
>>>                                   folio_list))
>>>                           goto activate_locked;
>>
>> Not sure if we might have to annotate that with data_race().
>
> I asked that exact question to Matthew in another context bt didn't get a
> response. There are examples of checking if the deferred list is empty with and
> without data_race() in the code base. But list_empty() is implemented like this:
>
> static inline int list_empty(const struct list_head *head)
> {
> return READ_ONCE(head->next) == head;
> }
>
> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
> sufficient for KCSAN?

Yeah, there is only one use of data_race with that list.

It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is
in deferred list").

Looks like that was added right in v1 of that change [1], so my best
guess is that it is not actually required.

If not required, likely we should just cleanup the single user.

[1]
https://lore.kernel.org/linux-mm/[email protected]/

--
Cheers,

David / dhildenb


2024-03-15 11:14:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 15.03.24 11:55, Ryan Roberts wrote:
> On 15/03/2024 10:35, David Hildenbrand wrote:
>>> -        if (!pageout && pte_young(ptent)) {
>>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>>> -                            tlb->fullmm);
>>> -            ptent = pte_mkold(ptent);
>>> -            set_pte_at(mm, addr, pte, ptent);
>>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>>> +        if (!pageout) {
>>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>>> +            }
>>>           }
>>
>>
>> The following might turn out a bit nicer: Make folio_pte_batch() collect
>> "any_young", then doing something like we do with "any_writable" in the fork()
>> case:
>>
>> ...
>>     nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                  fpb_flags, NULL, any_young);
>>     if (any_young)
>>         pte_mkyoung(ptent)
>> ...
>>
>> if (!pageout && pte_young(ptent)) {
>>     mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>     tlb_remove_tlb_entries(tlb, pte, nr, addr);
>> }
>>
>
> I thought about that but decided that it would be better to only TLBI the actual
> entries that were young. Although looking at tlb_remove_tlb_entry() I see that
> it just maintains a range between the lowest and highest address, so this won't
> actually make any difference.

Yes, tlb_remove_tlb_entries() looks scarier than it actually is :)

--
Cheers,

David / dhildenb


2024-03-15 11:38:53

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

Hi Yin Fengwei,

On 15/03/2024 11:12, David Hildenbrand wrote:
> On 15.03.24 11:49, Ryan Roberts wrote:
>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>> folios before swap-out. This benefits performance of the swap-out path
>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>> up for swapping in large folios in a future series.
>>>>
>>>> If the folio is partially mapped, we continue to split it since we want
>>>> to avoid the extra IO overhead and storage of writing out pages
>>>> uneccessarily.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>>    mm/vmscan.c | 9 +++++----
>>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>> *folio_list,
>>>>                        if (!can_split_folio(folio, NULL))
>>>>                            goto activate_locked;
>>>>                        /*
>>>> -                     * Split folios without a PMD map right
>>>> -                     * away. Chances are some or all of the
>>>> -                     * tail pages can be freed without IO.
>>>> +                     * Split partially mapped folios map
>>>> +                     * right away. Chances are some or all
>>>> +                     * of the tail pages can be freed
>>>> +                     * without IO.
>>>>                         */
>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>                            split_folio_to_list(folio,
>>>>                                    folio_list))
>>>>                            goto activate_locked;
>>>
>>> Not sure if we might have to annotate that with data_race().
>>
>> I asked that exact question to Matthew in another context bt didn't get a
>> response. There are examples of checking if the deferred list is empty with and
>> without data_race() in the code base. But list_empty() is implemented like this:
>>
>> static inline int list_empty(const struct list_head *head)
>> {
>>     return READ_ONCE(head->next) == head;
>> }
>>
>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>> sufficient for KCSAN?
>
> Yeah, there is only one use of data_race with that list.
>
> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
> deferred list").
>
> Looks like that was added right in v1 of that change [1], so my best guess is
> that it is not actually required.
>
> If not required, likely we should just cleanup the single user.
>
> [1]
> https://lore.kernel.org/linux-mm/[email protected]/

Do you have any recollection of why you added the data_race() markup?

Thanks,
Ryan

>


2024-03-18 02:18:31

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

Ryan Roberts <[email protected]> writes:

> Hi Yin Fengwei,
>
> On 15/03/2024 11:12, David Hildenbrand wrote:
>> On 15.03.24 11:49, Ryan Roberts wrote:
>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>> up for swapping in large folios in a future series.
>>>>>
>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>> uneccessarily.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>>    mm/vmscan.c | 9 +++++----
>>>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>> *folio_list,
>>>>>                        if (!can_split_folio(folio, NULL))
>>>>>                            goto activate_locked;
>>>>>                        /*
>>>>> -                     * Split folios without a PMD map right
>>>>> -                     * away. Chances are some or all of the
>>>>> -                     * tail pages can be freed without IO.
>>>>> +                     * Split partially mapped folios map
>>>>> +                     * right away. Chances are some or all
>>>>> +                     * of the tail pages can be freed
>>>>> +                     * without IO.
>>>>>                         */
>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>                            split_folio_to_list(folio,
>>>>>                                    folio_list))
>>>>>                            goto activate_locked;
>>>>
>>>> Not sure if we might have to annotate that with data_race().
>>>
>>> I asked that exact question to Matthew in another context bt didn't get a
>>> response. There are examples of checking if the deferred list is empty with and
>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>
>>> static inline int list_empty(const struct list_head *head)
>>> {
>>>     return READ_ONCE(head->next) == head;
>>> }
>>>
>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>> sufficient for KCSAN?
>>
>> Yeah, there is only one use of data_race with that list.
>>
>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>> deferred list").
>>
>> Looks like that was added right in v1 of that change [1], so my best guess is
>> that it is not actually required.
>>
>> If not required, likely we should just cleanup the single user.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/[email protected]/
>
> Do you have any recollection of why you added the data_race() markup?

Per my understanding, this is used to mark that the code intentionally
accesses folio->_deferred_list without the lock, while
folio->_deferred_list may be changed in parallel. IIUC, this is what
data_race() is used for. Or is my understanding wrong?

--
Best Regards,
Huang, Ying

2024-03-18 10:00:41

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()



On 3/18/2024 10:16 AM, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> Hi Yin Fengwei,
>>
>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>> up for swapping in large folios in a future series.
>>>>>>
>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>> uneccessarily.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>> ---
>>>>>>    mm/vmscan.c | 9 +++++----
>>>>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>>> *folio_list,
>>>>>>                        if (!can_split_folio(folio, NULL))
>>>>>>                            goto activate_locked;
>>>>>>                        /*
>>>>>> -                     * Split folios without a PMD map right
>>>>>> -                     * away. Chances are some or all of the
>>>>>> -                     * tail pages can be freed without IO.
>>>>>> +                     * Split partially mapped folios map
>>>>>> +                     * right away. Chances are some or all
>>>>>> +                     * of the tail pages can be freed
>>>>>> +                     * without IO.
>>>>>>                         */
>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>                            split_folio_to_list(folio,
>>>>>>                                    folio_list))
>>>>>>                            goto activate_locked;
>>>>>
>>>>> Not sure if we might have to annotate that with data_race().
>>>>
>>>> I asked that exact question to Matthew in another context bt didn't get a
>>>> response. There are examples of checking if the deferred list is empty with and
>>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>>
>>>> static inline int list_empty(const struct list_head *head)
>>>> {
>>>>     return READ_ONCE(head->next) == head;
>>>> }
>>>>
>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>>> sufficient for KCSAN?
I don't think READ_ONCE() can replace the lock.

>>>
>>> Yeah, there is only one use of data_race with that list.
>>>
>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>> deferred list").
>>>
>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>> that it is not actually required.
>>>
>>> If not required, likely we should just cleanup the single user.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Do you have any recollection of why you added the data_race() markup?
>
> Per my understanding, this is used to mark that the code accesses
> folio->_deferred_list without lock intentionally, while
> folio->_deferred_list may be changed in parallel. IIUC, this is what
> data_race() is used for. Or, my understanding is wrong?
Yes. This is my understanding also.


Regards
Yin, Fengwei

>
> --
> Best Regards,
> Huang, Ying

2024-03-18 10:05:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 18.03.24 11:00, Yin, Fengwei wrote:
>
>
> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>> Hi Yin Fengwei,
>>>
>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>> up for swapping in large folios in a future series.
>>>>>>>
>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>> uneccessarily.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>>    mm/vmscan.c | 9 +++++----
>>>>>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>>>> *folio_list,
>>>>>>>                        if (!can_split_folio(folio, NULL))
>>>>>>>                            goto activate_locked;
>>>>>>>                        /*
>>>>>>> -                     * Split folios without a PMD map right
>>>>>>> -                     * away. Chances are some or all of the
>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>> +                     * Split partially mapped folios map
>>>>>>> +                     * right away. Chances are some or all
>>>>>>> +                     * of the tail pages can be freed
>>>>>>> +                     * without IO.
>>>>>>>                         */
>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>                            split_folio_to_list(folio,
>>>>>>>                                    folio_list))
>>>>>>>                            goto activate_locked;
>>>>>>
>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>
>>>>> I asked that exact question to Matthew in another context bt didn't get a
>>>>> response. There are examples of checking if the deferred list is empty with and
>>>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>>>
>>>>> static inline int list_empty(const struct list_head *head)
>>>>> {
>>>>>     return READ_ONCE(head->next) == head;
>>>>> }
>>>>>
>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>>>> sufficient for KCSAN?
> I don't think READ_ONCE() can replace the lock.
>
>>>>
>>>> Yeah, there is only one use of data_race with that list.
>>>>
>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>> deferred list").
>>>>
>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>> that it is not actually required.
>>>>
>>>> If not required, likely we should just cleanup the single user.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> Do you have any recollection of why you added the data_race() markup?
>>
>> Per my understanding, this is used to mark that the code accesses
>> folio->_deferred_list without lock intentionally, while
>> folio->_deferred_list may be changed in parallel. IIUC, this is what
>> data_race() is used for. Or, my understanding is wrong?
> Yes. This is my understanding also.

Why don't we have a data_race() in deferred_split_folio() then, before
taking the lock?

It's used a bit inconsistently here.

--
Cheers,

David / dhildenb


2024-03-18 15:35:57

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 18/03/2024 10:05, David Hildenbrand wrote:
> On 18.03.24 11:00, Yin, Fengwei wrote:
>>
>>
>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> Hi Yin Fengwei,
>>>>
>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>
>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>> uneccessarily.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>> ---
>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>> list_head
>>>>>>>> *folio_list,
>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>                              goto activate_locked;
>>>>>>>>                          /*
>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>> +                     * Split partially mapped folios map
>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>> +                     * without IO.
>>>>>>>>                           */
>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>                                      folio_list))
>>>>>>>>                              goto activate_locked;
>>>>>>>
>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>
>>>>>> I asked that exact question to Matthew in another context bt didn't get a
>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>> with and
>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>> this:
>>>>>>
>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>> {
>>>>>>       return READ_ONCE(head->next) == head;
>>>>>> }
>>>>>>
>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>> not
>>>>>> sufficient for KCSAN?
>> I don't think READ_ONCE() can replace the lock.

But it doesn't ensure we get a consistent value and that the compiler orders the
load correctly. There are lots of patterns in the kernel that use READ_ONCE()
without a lock and they don't use data_race() - e.g. ptep_get_lockless().

It sounds like none of us really understand what data_race() is for, so I guess
I'll just do a KCSAN build and invoke the code path to see if it complains.


>>
>>>>>
>>>>> Yeah, there is only one use of data_race with that list.
>>>>>
>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>> deferred list").
>>>>>
>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>> that it is not actually required.
>>>>>
>>>>> If not required, likely we should just cleanup the single user.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> Do you have any recollection of why you added the data_race() markup?
>>>
>>> Per my understanding, this is used to mark that the code accesses
>>> folio->_deferred_list without lock intentionally, while
>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>> data_race() is used for.  Or, my understanding is wrong?
>> Yes. This is my understanding also.
>
> Why don't we have a data_race() in deferred_split_folio() then, before taking
> the lock?
>
> It's used a bit inconsistently here.
>


2024-03-18 15:37:02

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 18/03/2024 15:35, Ryan Roberts wrote:
> On 18/03/2024 10:05, David Hildenbrand wrote:
>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> Hi Yin Fengwei,
>>>>>
>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>
>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>> uneccessarily.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>> ---
>>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>>> list_head
>>>>>>>>> *folio_list,
>>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>>                          /*
>>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>>> +                     * Split partially mapped folios map
>>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>>> +                     * without IO.
>>>>>>>>>                           */
>>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>>                                      folio_list))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>
>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>
>>>>>>> I asked that exact question to Matthew in another context bt didn't get a
>>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>>> with and
>>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>>> this:
>>>>>>>
>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>> {
>>>>>>>       return READ_ONCE(head->next) == head;
>>>>>>> }
>>>>>>>
>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>> not
>>>>>>> sufficient for KCSAN?
>>> I don't think READ_ONCE() can replace the lock.
>
> But it doesn't ensure we get a consistent value and that the compiler orders the

Sorry - fat fingers... I meant it *does* ensure we get a consistent value (i.e.
untorn)

> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
>
> It sounds like none of us really understand what data_race() is for, so I guess
> I'll just do a KCSAN build and invoke the code path to see if it complains.
>
>
>>>
>>>>>>
>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>
>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>> deferred list").
>>>>>>
>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>> that it is not actually required.
>>>>>>
>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>
>>>> Per my understanding, this is used to mark that the code accesses
>>>> folio->_deferred_list without lock intentionally, while
>>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>>> data_race() is used for.  Or, my understanding is wrong?
>>> Yes. This is my understanding also.
>>
>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>> the lock?
>>
>> It's used a bit inconsistently here.
>>
>


2024-03-19 02:21:10

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()



On 3/18/24 23:35, Ryan Roberts wrote:
> On 18/03/2024 10:05, David Hildenbrand wrote:
>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> Hi Yin Fengwei,
>>>>>
>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>
>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>> uneccessarily.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>> ---
>>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>>> list_head
>>>>>>>>> *folio_list,
>>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>>                          /*
>>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>>> +                     * Split partially mapped folios map
>>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>>> +                     * without IO.
>>>>>>>>>                           */
>>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>>                                      folio_list))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>
>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>
>>>>>>> I asked that exact question to Matthew in another context bt didn't get a
>>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>>> with and
>>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>>> this:
>>>>>>>
>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>> {
>>>>>>>       return READ_ONCE(head->next) == head;
>>>>>>> }
>>>>>>>
>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>> not
>>>>>>> sufficient for KCSAN?
>>> I don't think READ_ONCE() can replace the lock.
>
> But it doesn't ensure we get a consistent value and that the compiler orders the
> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
They (ptep_get_lockless() and the deferred_list check) have different access
patterns (or race patterns) here. I don't think they are comparable.

>
> It sounds like none of us really understand what data_race() is for, so I guess
> I'll just do a KCSAN build and invoke the code path to see if it complains.
The READ_ONCE() in list_empty() will also keep KCSAN quiet here.

>
>
>>>
>>>>>>
>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>
>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>> deferred list").
>>>>>>
>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>> that it is not actually required.
>>>>>>
>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>
>>>> Per my understanding, this is used to mark that the code accesses
>>>> folio->_deferred_list without lock intentionally, while
>>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>>> data_race() is used for.  Or, my understanding is wrong?
>>> Yes. This is my understanding also.
>>
>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>> the lock?
>>
>> It's used a bit inconsistently here.
>>
>

2024-03-19 02:31:07

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()



On 3/18/24 18:05, David Hildenbrand wrote:
> On 18.03.24 11:00, Yin, Fengwei wrote:
>>
>>
>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> Hi Yin Fengwei,
>>>>
>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>
>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>> uneccessarily.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>> ---
>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>>>>> *folio_list,
>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>                              goto activate_locked;
>>>>>>>>                          /*
>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>> +                     * Split partially mapped folios map
>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>> +                     * without IO.
>>>>>>>>                           */
>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>                                      folio_list))
>>>>>>>>                              goto activate_locked;
>>>>>>>
>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>
>>>>>> I asked that exact question to Matthew in another context bt didn't get a
>>>>>> response. There are examples of checking if the deferred list is empty with and
>>>>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>>>>
>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>> {
>>>>>>       return READ_ONCE(head->next) == head;
>>>>>> }
>>>>>>
>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>>>>> sufficient for KCSAN?
>> I don't think READ_ONCE() can replace the lock.
>>
>>>>>
>>>>> Yeah, there is only one use of data_race with that list.
>>>>>
>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>> deferred list").
>>>>>
>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>> that it is not actually required.
>>>>>
>>>>> If not required, likely we should just cleanup the single user.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> Do you have any recollection of why you added the data_race() markup?
>>>
>>> Per my understanding, this is used to mark that the code accesses
>>> folio->_deferred_list without lock intentionally, while
>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>> data_race() is used for.  Or, my understanding is wrong?
>> Yes. This is my understanding also.
>
> Why don't we have a data_race() in deferred_split_folio() then, before taking the lock?
No idea why there is no data_race() added. But I think we should add data_race().

Regards
Yin, Fengwei

>
> It's used a bit inconsistently here.
>

2024-03-19 14:41:58

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()

On 19/03/2024 02:20, Yin Fengwei wrote:
>
>
> On 3/18/24 23:35, Ryan Roberts wrote:
>> On 18/03/2024 10:05, David Hildenbrand wrote:
>>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>>> Ryan Roberts <[email protected]> writes:
>>>>>
>>>>>> Hi Yin Fengwei,
>>>>>>
>>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>>
>>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>>> unnecessarily.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>>>> list_head
>>>>>>>>>> *folio_list,
>>>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>>>                              goto activate_locked;
>>>>>>>>>>                          /*
>>>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>>>> +                     * Split partially mapped folios right
>>>>>>>>>> +                     * away. Chances are some or all of
>>>>>>>>>> +                     * the tail pages can be freed
>>>>>>>>>> +                     * without IO.
>>>>>>>>>>                           */
>>>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>>>                                      folio_list))
>>>>>>>>>>                              goto activate_locked;
>>>>>>>>>
>>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>>
>>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>>>> with and
>>>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>>>> this:
>>>>>>>>
>>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>>> {
>>>>>>>>       return READ_ONCE(head->next) == head;
>>>>>>>> }
>>>>>>>>
>>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>>> not
>>>>>>>> sufficient for KCSAN?
>>>> I don't think READ_ONCE() can replace the lock.
>>
>> But it does ensure we get a consistent value and that the compiler orders the
>> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
>> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
> They (ptep_get_lockless() and deferred_list) have different access pattern
> (or race pattern) here. I don't think they are comparable.
>
>>
>> It sounds like none of us really understand what data_race() is for, so I guess
>> I'll just do a KCSAN build and invoke the code path to see if it complains.
> READ_ONCE() in list_empty() will shut down the KCSAN warning as well.

OK, I found some time to run the test with KCSAN; nothing fires.

But then I read the docs and looked at the code a bit.
Documentation/dev-tools/kcsan.rst states:

In an execution, two memory accesses form a *data race* if they *conflict*,
they happen concurrently in different threads, and at least one of them is a
*plain access*; they *conflict* if both access the same memory location, and
at least one is a write.

It also clarifies that READ_ONCE() is a "marked access". So we would have a data
race if there was a concurrent, *plain* write to folio->_deferred_list.next.
This can occur in a couple of places I believe, for example:

deferred_split_folio()
  list_add_tail()
    __list_add()
      new->next = next;

deferred_split_scan()
  list_move()
    list_add()
      __list_add()
        new->next = next;

So if either deferred_split_folio() or deferred_split_scan() can run
concurrently with shrink_folio_list() for the same folio (I believe both
can), then we have a race, and this list_empty() check needs to be protected
with data_race(). The race is safe/by design, but it does need to be marked.

I'll fix this in my next version.
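
For reference, the annotated check would then look something like this (a
sketch of the intended fix for the next version, not the posted patch):

	/* mm/vmscan.c, shrink_folio_list() - sketch only */
	if (data_race(!list_empty(&folio->_deferred_list)) &&
	    split_folio_to_list(folio, folio_list))
		goto activate_locked;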

Thanks,
Ryan


>
>>
>>
>>>>
>>>>>>>
>>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>>
>>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>>> deferred list").
>>>>>>>
>>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>>> that it is not actually required.
>>>>>>>
>>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>
>>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>>
>>>>> Per my understanding, this is used to mark that the code accesses
>>>>> folio->_deferred_list without lock intentionally, while
>>>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>>>> data_race() is used for.  Or, my understanding is wrong?
>>>> Yes. This is my understanding also.
>>>
>>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>>> the lock?
>>>
>>> It's used a bit inconsistently here.
>>>
>>


2024-03-20 11:10:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

Hi David,

I hate to chase, but since you provided feedback on a couple of the other
patches in the series, I wondered if you missed this one? It's the one that does
the batching of free_swap_and_cache(), which you suggested in order to prevent
needlessly taking folio locks and refs.

If you have any feedback, it would be appreciated, otherwise I'm planning to
repost as-is next week (nobody else has posted comments against this patch
either) as part of the updated series.

Thanks,
Ryan

On 11/03/2024 15:00, Ryan Roberts wrote:
> Now that we no longer have a convenient flag in the cluster to determine
> if a folio is large, free_swap_and_cache() will take a reference and
> lock a large folio much more often, which could lead to contention and
> (e.g.) failure to split large folios, etc.
>
> Let's solve that problem by batch freeing swap and cache with a new
> function, free_swap_and_cache_nr(), to free a contiguous range of swap
> entries together. This allows us to first drop a reference to each swap
> slot before we try to release the cache folio. This means we only try to
> release the folio once, only taking the reference and lock once - much
> better than the previous 512 times for the 2M THP case.
>
> Contiguous swap entries are gathered in zap_pte_range() and
> madvise_free_pte_range() in a similar way to how present ptes are
> already gathered in zap_pte_range().
>
> While we are at it, let's simplify by converting the return type of both
> functions to void. The return value was used only by zap_pte_range() to
> print a bad pte, and was ignored by everyone else, so the extra
> reporting wasn't exactly guaranteed. We will still get the warning with
> most of the information from get_swap_device(). With the batch version,
> we wouldn't know which pte was bad anyway, so we could print the wrong one.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 28 +++++++++++++++
> include/linux/swap.h | 12 +++++--
> mm/internal.h | 48 +++++++++++++++++++++++++
> mm/madvise.c | 12 ++++---
> mm/memory.c | 13 +++----
> mm/swapfile.c | 78 ++++++++++++++++++++++++++++++-----------
> 6 files changed, 157 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 85fc7554cd52..8cf1f2fe2c25 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
> }
> #endif
>
> +#ifndef clear_not_present_full_ptes
> +/**
> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
> + * @mm: Address space the ptes represent.
> + * @addr: Address of the first pte.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over pte_clear_not_present_full().
> + *
> + * Context: The caller holds the page table lock. The PTEs are all not present.
> + * The PTEs are all in the same PMD.
> + */
> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> + for (;;) {
> + pte_clear_not_present_full(mm, addr, ptep, full);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
> extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
> unsigned long address,
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4a8b6c60793a..f2b7f204b968 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -481,7 +481,7 @@ extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern void swap_free(swp_entry_t);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> -extern int free_swap_and_cache(swp_entry_t);
> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> int swap_type_of(dev_t device, sector_t offset);
> int find_first_swap(dev_t *device);
> extern unsigned int count_swap_pages(int, int);
> @@ -530,8 +530,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
> #define free_pages_and_swap_cache(pages, nr) \
> release_pages((pages), (nr));
>
> -/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
> -#define free_swap_and_cache(e) is_pfn_swap_entry(e)
> +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> +{
> +}
>
> static inline void free_swap_cache(struct folio *folio)
> {
> @@ -599,6 +600,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
> }
> #endif /* CONFIG_SWAP */
>
> +static inline void free_swap_and_cache(swp_entry_t entry)
> +{
> + free_swap_and_cache_nr(entry, 1);
> +}
> +
> #ifdef CONFIG_MEMCG
> static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> {
> diff --git a/mm/internal.h b/mm/internal.h
> index a3e19194079f..8dbb1335df88 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -11,6 +11,8 @@
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> #include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> #include <linux/tracepoint-defs.h>
>
> struct folio_batch;
> @@ -174,6 +176,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>
> return min(ptep - start_ptep, max_nr);
> }
> +
> +/**
> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> + * @start_ptep: Page table pointer for the first entry.
> + * @max_nr: The maximum number of table entries to consider.
> + * @entry: Swap entry recovered from the first table entry.
> + *
> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
> + * containing swap entries all with consecutive offsets and targeting the same
> + * swap type.
> + *
> + * max_nr must be at least one and must be limited by the caller so scanning
> + * cannot exceed a single page table.
> + *
> + * Return: the number of table entries in the batch.
> + */
> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
> + swp_entry_t entry)
> +{
> + const pte_t *end_ptep = start_ptep + max_nr;
> + unsigned long expected_offset = swp_offset(entry) + 1;
> + unsigned int expected_type = swp_type(entry);
> + pte_t *ptep = start_ptep + 1;
> +
> + VM_WARN_ON(max_nr < 1);
> + VM_WARN_ON(non_swap_entry(entry));
> +
> + while (ptep < end_ptep) {
> + pte_t pte = ptep_get(ptep);
> +
> + if (pte_none(pte) || pte_present(pte))
> + break;
> +
> + entry = pte_to_swp_entry(pte);
> +
> + if (non_swap_entry(entry) ||
> + swp_type(entry) != expected_type ||
> + swp_offset(entry) != expected_offset)
> + break;
> +
> + expected_offset++;
> + ptep++;
> + }
> +
> + return ptep - start_ptep;
> +}
> #endif /* CONFIG_MMU */
>
> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 44a498c94158..547dcd1f7a39 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> struct folio *folio;
> int nr_swap = 0;
> unsigned long next;
> + int nr, max_nr;
>
> next = pmd_addr_end(addr, end);
> if (pmd_trans_huge(*pmd))
> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> return 0;
> flush_tlb_batched_pending(mm);
> arch_enter_lazy_mmu_mode();
> - for (; addr != end; pte++, addr += PAGE_SIZE) {
> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
> + nr = 1;
> ptent = ptep_get(pte);
>
> if (pte_none(ptent))
> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>
> entry = pte_to_swp_entry(ptent);
> if (!non_swap_entry(entry)) {
> - nr_swap--;
> - free_swap_and_cache(entry);
> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> + max_nr = (end - addr) / PAGE_SIZE;
> + nr = swap_pte_batch(pte, max_nr, entry);
> + nr_swap -= nr;
> + free_swap_and_cache_nr(entry, nr);
> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> } else if (is_hwpoison_entry(entry) ||
> is_poisoned_swp_entry(entry)) {
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..25c0ef1c7ff3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1629,12 +1629,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> folio_remove_rmap_pte(folio, page, vma);
> folio_put(folio);
> } else if (!non_swap_entry(entry)) {
> - /* Genuine swap entry, hence a private anon page */
> + max_nr = (end - addr) / PAGE_SIZE;
> + nr = swap_pte_batch(pte, max_nr, entry);
> + /* Genuine swap entries, hence private anon pages */
> if (!should_zap_cows(details))
> continue;
> - rss[MM_SWAPENTS]--;
> - if (unlikely(!free_swap_and_cache(entry)))
> - print_bad_pte(vma, addr, ptent, NULL);
> + rss[MM_SWAPENTS] -= nr;
> + free_swap_and_cache_nr(entry, nr);
> } else if (is_migration_entry(entry)) {
> folio = pfn_swap_entry_folio(entry);
> if (!should_zap_folio(details, folio))
> @@ -1657,8 +1658,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
> WARN_ON_ONCE(1);
> }
> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> add_mm_rss_vec(mm, rss);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index df1de034f6d8..ee7e44cb40c5 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
> /* Reclaim the swap entry if swap is getting full*/
> #define TTRS_FULL 0x4
>
> -/* returns 1 if swap entry is freed */
> +/*
> + * returns number of pages in the folio that backs the swap entry. If positive,
> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
> + * folio was associated with the swap entry.
> + */
> static int __try_to_reclaim_swap(struct swap_info_struct *si,
> unsigned long offset, unsigned long flags)
> {
> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> ret = folio_free_swap(folio);
> folio_unlock(folio);
> }
> + ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
> folio_put(folio);
> return ret;
> }
> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> spin_lock(&si->lock);
> /* entry was freed successfully, try to use this again */
> - if (swap_was_freed)
> + if (swap_was_freed > 0)
> goto checks;
> goto scan; /* check next one */
> }
> @@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
> return true;
> }
>
> -/*
> - * Free the swap entry like above, but also try to
> - * free the page cache entry if it is the last user.
> - */
> -int free_swap_and_cache(swp_entry_t entry)
> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> {
> - struct swap_info_struct *p;
> - unsigned char count;
> + unsigned long end = swp_offset(entry) + nr;
> + unsigned int type = swp_type(entry);
> + struct swap_info_struct *si;
> + unsigned long offset;
>
> if (non_swap_entry(entry))
> - return 1;
> + return;
>
> - p = get_swap_device(entry);
> - if (p) {
> - if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
> - put_swap_device(p);
> - return 0;
> - }
> + si = get_swap_device(entry);
> + if (!si)
> + return;
>
> - count = __swap_entry_free(p, entry);
> - if (count == SWAP_HAS_CACHE)
> - __try_to_reclaim_swap(p, swp_offset(entry),
> + if (WARN_ON(end > si->max))
> + goto out;
> +
> + /*
> + * First free all entries in the range.
> + */
> + for (offset = swp_offset(entry); offset < end; offset++) {
> + if (!WARN_ON(data_race(!si->swap_map[offset])))
> + __swap_entry_free(si, swp_entry(type, offset));
> + }
> +
> + /*
> + * Now go back over the range trying to reclaim the swap cache. This is
> + * more efficient for large folios because we will only try to reclaim
> + * the swap once per folio in the common case. If we do
> + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
> + * latter will get a reference and lock the folio for every individual
> + * page but will only succeed once the swap slot for every subpage is
> + * zero.
> + */
> + for (offset = swp_offset(entry); offset < end; offset += nr) {
> + nr = 1;
> + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
> + /*
> + * Folios are always naturally aligned in swap so
> + * advance forward to the next boundary. Zero means no
> + * folio was found for the swap entry, so advance by 1
> + * in this case. Negative value means folio was found
> + * but could not be reclaimed. Here we can still advance
> + * to the next boundary.
> + */
> + nr = __try_to_reclaim_swap(si, offset,
> TTRS_UNMAPPED | TTRS_FULL);
> - put_swap_device(p);
> + if (nr == 0)
> + nr = 1;
> + else if (nr < 0)
> + nr = -nr;
> + nr = ALIGN(offset + 1, nr) - offset;
> + }
> }
> - return p != NULL;
> +
> +out:
> + put_swap_device(si);
> }
>
> #ifdef CONFIG_HIBERNATION


2024-03-20 12:22:32

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

Hi Huang, Ying,


On 12/03/2024 07:51, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> Multi-size THP enables performance improvements by allocating large,
>> pte-mapped folios for anonymous memory. However I've observed that on an
>> arm64 system running a parallel workload (e.g. kernel compilation)
>> across many cores, under high memory pressure, the speed regresses. This
>> is due to bottlenecking on the increased number of TLBIs added due to
>> all the extra folio splitting when the large folios are swapped out.
>>
>> Therefore, solve this regression by adding support for swapping out mTHP
>> without needing to split the folio, just like is already done for
>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>> and when the swap backing store is a non-rotating block device. These
>> are the same constraints as for the existing PMD-sized THP swap-out
>> support.
>>
>> Note that no attempt is made to swap-in (m)THP here - this is still done
>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>> prerequisite for swapping-in mTHP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller can fall back to splitting the folio
>> and allocating individual entries (as per existing PMD-sized THP
>> fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleaving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> This change only modifies swap to be able to accept any order mTHP. It
>> doesn't change the callers to elide doing the actual split. That will be
>> done in separate changes.

[...]

>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> }
>>
>> if (si->swap_map[offset]) {
>> + VM_WARN_ON(order > 0);
>> unlock_cluster(ci);
>> if (!n_ret)
>> goto scan;
>> else
>> goto done;
>> }
>> - WRITE_ONCE(si->swap_map[offset], usage);
>> - inc_cluster_info_page(si, si->cluster_info, offset);
>> + memset(si->swap_map + offset, usage, nr_pages);
>
> Add barrier() here corresponds to original WRITE_ONCE()?
> unlock_cluster(ci) may be NOP for some swap devices.

Looking at this a bit more closely, I'm not sure this is needed. Even if there
is no cluster, the swap_info is still locked, so unlocking that will act as a
barrier. There are a number of other callsites that memset(si->swap_map) without
an explicit barrier and with the swap_info locked.

Looking at the original commit that added the WRITE_ONCE() it was worried about
a race with reading swap_map in _swap_info_get(). But that site is now annotated
with a data_race(), which will suppress the warning. And I don't believe there
are any places that read swap_map locklessly and depend upon observing ordering
between it and other state? So I think the si unlock is sufficient?

I'm not planning to add barrier() here. Let me know if you disagree.
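
To make the argument concrete, here is a simplified sketch of the ordering
being relied upon (not the literal scan_swap_map_slots() flow, where the lock
is taken and dropped around the scan):

	spin_lock(&si->lock);
	/* ... */
	memset(si->swap_map + offset, usage, nr_pages);	/* plain writes */
	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
	unlock_cluster(ci);		/* may be a no-op without cluster_info */
	/* ... */
	spin_unlock(&si->lock);	/* release: publishes the swap_map updates */

Any reader that subsequently takes si->lock (or the cluster lock) is ordered
against those writes; the remaining lockless read in _swap_info_get() is
data_race()-annotated, as noted above.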

Thanks,
Ryan

>
>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>> unlock_cluster(ci);



2024-03-20 13:49:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

Hi Lance, Barry,

Sorry - I totally missed this when you originally sent it!


On 13/03/2024 14:02, Lance Yang wrote:
> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the proceeding folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would have likely lead to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficient than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - its a team effort - I had your PoC and David's previous batching work
>> to use as a template.
>>
>>>
>>>> ---
>>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>> 1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> LIST_HEAD(folio_list);
>>>> bool pageout_anon_only_filter;
>>>> unsigned int batch_count = 0;
>>>> + int nr;
>>>>
>>>> if (fatal_signal_pending(current))
>>>> return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> return 0;
>>>> flush_tlb_batched_pending(mm);
>>>> arch_enter_lazy_mmu_mode();
>>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> + nr = 1;
>>>> ptent = ptep_get(pte);
>>>>
>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> continue;
>>>>
>>>> /*
>>>> - * Creating a THP page is expensive so split it only if we
>>>> - * are sure it's worth. Split it if we are only owner.
>>>> + * If we encounter a large folio, only split it if it is not
>>>> + * fully mapped within the range we are operating on. Otherwise
>>>> + * leave it as is so that it can be swapped out whole. If we
>>>> + * fail to split a folio, leave it in place and advance to the
>>>> + * next pte in the range.
>>>> */
>>>> if (folio_test_large(folio)) {
>>>> - int err;
>>>> -
>>>> - if (folio_estimated_sharers(folio) > 1)
>>>> - break;
>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> - break;
>>>> - if (!folio_trylock(folio))
>>>> - break;
>>>> - folio_get(folio);
>>>> - arch_leave_lazy_mmu_mode();
>>>> - pte_unmap_unlock(start_pte, ptl);
>>>> - start_pte = NULL;
>>>> - err = split_folio(folio);
>>>> - folio_unlock(folio);
>>>> - folio_put(folio);
>>>> - if (err)
>>>> - break;
>>>> - start_pte = pte =
>>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> - if (!start_pte)
>>>> - break;
>>>> - arch_enter_lazy_mmu_mode();
>>>> - pte--;
>>>> - addr -= PAGE_SIZE;
>>>> - continue;
>>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> + FPB_IGNORE_SOFT_DIRTY;
>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> + fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>> if (pte_pfn(pte) == folio_pfn(folio))
>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>> fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
>
> if (folio_test_large(folio)) {
> [...]
> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> fpb_flags, NULL);
> + if (folio_estimated_sharers(folio) > 1)
> + continue;
>
> Could we use folio_estimated_sharers as an early exit point here?

I'm not sure what this is saving where you have it? Did you mean to put it
before folio_pte_batch()? Currently it is just saving a single conditional.

But now that I think about it a bit more, I remember why I was originally
unconditionally calling folio_pte_batch(). Given it's a large folio, if the split
fails, we can move the cursor to the pte where the next folio begins, so we don't
have to iterate through one pte at a time, which would cause us to keep calling
folio_estimated_sharers(), folio_test_anon(), etc. on the same folio until we get
to the next boundary.

Of course the common case at this point will be for the split to succeed, but
then we are going to iterate over every single PTE anyway - one way or another
they are all fetched into cache. So I feel like it's neater not to add the
conditionals for calling folio_pte_batch(), and just leave this as I have it here.
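
In other words, the value returned by folio_pte_batch() doubles as the skip
distance when the split fails (a condensed sketch of the hunk quoted above,
not new code):

	nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags, NULL);
	if (nr < folio_nr_pages(folio)) {
		/* ... sharing/anon/trylock checks ... */
		if (split_folio(folio))
			continue;	/* split failed: loop advances by nr, past this folio */
		/* ... re-take the pte lock ... */
		nr = 0;			/* split succeeded: re-read the pte at this address */
		continue;
	}
	/* nr == folio_nr_pages(folio): fully mapped, consider the whole folio */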

>
> if (nr < folio_nr_pages(folio)) {
> int err;
>
> - if (folio_estimated_sharers(folio) > 1)
> - continue;
> [...]
>
>>
>>>
>>>> +
>>>> + if (nr < folio_nr_pages(folio)) {
>>>> + int err;
>>>> +
>>>> + if (folio_estimated_sharers(folio) > 1)
>>>> + continue;
>>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> + continue;
>>>> + if (!folio_trylock(folio))
>>>> + continue;
>>>> + folio_get(folio);
>>>> + arch_leave_lazy_mmu_mode();
>>>> + pte_unmap_unlock(start_pte, ptl);
>>>> + start_pte = NULL;
>>>> + err = split_folio(folio);
>>>> + folio_unlock(folio);
>>>> + folio_put(folio);
>>>> + if (err)
>>>> + continue;
>>>> + start_pte = pte =
>>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> + if (!start_pte)
>>>> + break;
>>>> + arch_enter_lazy_mmu_mode();
>>>> + nr = 0;
>>>> + continue;
>>>> + }
>>>> }
>>>>
>>>> /*
>>>> * Do not interfere with other mappings of this folio and
>>>> - * non-LRU folio.
>>>> + * non-LRU folio. If we have a large folio at this point, we
>>>> + * know it is fully mapped so if its mapcount is the same as its
>>>> + * number of pages, it must be exclusive.
>>>> */
>>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> + if (!folio_test_lru(folio) ||
>>>> + folio_mapcount(folio) != folio_nr_pages(folio))
>>>> continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>
>>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> continue;
>>>>
>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> - if (!pageout && pte_young(ptent)) {
>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> - tlb->fullmm);
>>>> - ptent = pte_mkold(ptent);
>>>> - set_pte_at(mm, addr, pte, ptent);
>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>> + if (!pageout) {
>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>
> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> pte clearing?

Sorry Lance, I don't understand this question, can you rephrase? Are you saying
there is a good reason to do the original clear-mkold-set for some arches?

>
> Thanks,
> Lance
>
>
>
>>>> + }
>>>
>>> This looks so smart. if it is not pageout, we have increased pte
>>> and addr here; so nr is 0 and we don't need to increase again in
>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> otherwise, nr won't be 0. so we will increase addr and
>>> pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
>>
>>
>>>
>>>
>>>> }
>>>>
>>>> /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song <[email protected]>
>>
>> Thanks!
>>
>>


2024-03-20 13:57:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 15/03/2024 10:35, David Hildenbrand wrote:
>> -        if (!pageout && pte_young(ptent)) {
>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                            tlb->fullmm);
>> -            ptent = pte_mkold(ptent);
>> -            set_pte_at(mm, addr, pte, ptent);
>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>> +        if (!pageout) {
>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>> +            }
>>           }
>
>
> The following might turn out a bit nicer: Make folio_pte_batch() collect
> "any_young", then doing something like we do with "any_writable" in the fork()
> case:
>
> ...
>     nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                  fpb_flags, NULL, any_young);
>     if (any_young)
>         pte_mkyoung(ptent)
> ...
>
> if (!pageout && pte_young(ptent)) {
>     mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);

I don't think tlb->fullmm makes sense here because we are not clearing the pte,
so there is no chance of optimization? So I'm planning to call this mkold_ptes() and
remove that param. Have I missed something?

>     tlb_remove_tlb_entries(tlb, pte, nr, addr);
> }
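
For concreteness, a minimal sketch of what such an mkold_ptes() helper could
look like, assuming it simply mirrors the other batched PTE loops in
include/linux/pgtable.h (the shape being discussed, not a merged API):

	#ifndef mkold_ptes
	/*
	 * Sketch: clear the accessed bit on @nr consecutive present PTEs.
	 * Architectures can override this; ptep_test_and_clear_young()
	 * already avoids the contpte unfold/refold that a clear/modify/set
	 * sequence would cause on arm64.
	 */
	static inline void mkold_ptes(struct vm_area_struct *vma,
			unsigned long addr, pte_t *ptep, unsigned int nr)
	{
		for (;;) {
			ptep_test_and_clear_young(vma, addr, ptep);
			if (--nr == 0)
				break;
			ptep++;
			addr += PAGE_SIZE;
		}
	}
	#endif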
>


2024-03-20 14:09:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 20.03.24 14:57, Ryan Roberts wrote:
> On 15/03/2024 10:35, David Hildenbrand wrote:
>>> -        if (!pageout && pte_young(ptent)) {
>>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>>> -                            tlb->fullmm);
>>> -            ptent = pte_mkold(ptent);
>>> -            set_pte_at(mm, addr, pte, ptent);
>>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>>> +        if (!pageout) {
>>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>>> +            }
>>>           }
>>
>>
>> The following might turn out a bit nicer: Make folio_pte_batch() collect
>> "any_young", then doing something like we do with "any_writable" in the fork()
>> case:
>>
>> ...
>>     nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                  fpb_flags, NULL, any_young);
>>     if (any_young)
>>         pte_mkyoung(ptent)
>> ...
>>
>> if (!pageout && pte_young(ptent)) {
>>     mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>
> I don't think tlb->fullmm makes sense here because we are not clearing the pte,
> so there is no chance of optimization? So I'm planning to call this mkold_ptes() and
> remove that param. Have I missed something?

Agreed.

--
Cheers,

David / dhildenb


2024-03-20 14:21:45

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

On 20/03/2024 14:13, David Hildenbrand wrote:
> On 20.03.24 12:10, Ryan Roberts wrote:
>> Hi David,
>
> I'm usually lazy with review during the merge window :P

Ahh no worries! I'll send out the latest version next week, and can go from there.

>
>>
>> I hate to chase, but since you provided feedback on a couple of the other
>> patches in the series, I wondered if you missed this one? It's the one that does
>> the batching of free_swap_and_cache(), which you suggested in order to prevent
>> needlessly taking folio locks and refs.
>>
>> If you have any feedback, it would be appreciated, otherwise I'm planning to
>> repost as-is next week (nobody else has posted comments against this patch
>> either) as part of the updated series.
>
> On my TODO list!
>


2024-03-20 14:21:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

On 20.03.24 12:10, Ryan Roberts wrote:
> Hi David,

I'm usually lazy with review during the merge window :P

>
> I hate to chase, but since you provided feedback on a couple of the other
> patches in the series, I wondered if you missed this one? It's the one that does
> the batching of free_swap_and_cache(), which you suggested in order to prevent
> needlessly taking folio locks and refs.
>
> If you have any feedback, it would be appreciated, otherwise I'm planning to
> repost as-is next week (nobody else has posted comments against this patch
> either) as part of the updated series.

On my TODO list!

--
Cheers,

David / dhildenb


2024-03-20 14:35:33

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <[email protected]> wrote:
>
> Hi Lance, Barry,
>
> Sorry - I totally missed this when you originally sent it!

No worries at all :)

>
>
> On 13/03/2024 14:02, Lance Yang wrote:
> > On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage. This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the proceeding folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would have likely lead to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficient than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <[email protected]>
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - its a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
> >>
> >>>
> >>>> ---
> >>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>> 1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> LIST_HEAD(folio_list);
> >>>> bool pageout_anon_only_filter;
> >>>> unsigned int batch_count = 0;
> >>>> + int nr;
> >>>>
> >>>> if (fatal_signal_pending(current))
> >>>> return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> return 0;
> >>>> flush_tlb_batched_pending(mm);
> >>>> arch_enter_lazy_mmu_mode();
> >>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> + nr = 1;
> >>>> ptent = ptep_get(pte);
> >>>>
> >>>> if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> continue;
> >>>>
> >>>> /*
> >>>> - * Creating a THP page is expensive so split it only if we
> >>>> - * are sure it's worth. Split it if we are only owner.
> >>>> + * If we encounter a large folio, only split it if it is not
> >>>> + * fully mapped within the range we are operating on Otherwise
> >>>> + * leave it as is so that it can be swapped out whole. If we
> >>>> + * fail to split a folio, leave it in place and advance to the
> >>>> + * next pte in the range.
> >>>> */
> >>>> if (folio_test_large(folio)) {
> >>>> - int err;
> >>>> -
> >>>> - if (folio_estimated_sharers(folio) > 1)
> >>>> - break;
> >>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> - break;
> >>>> - if (!folio_trylock(folio))
> >>>> - break;
> >>>> - folio_get(folio);
> >>>> - arch_leave_lazy_mmu_mode();
> >>>> - pte_unmap_unlock(start_pte, ptl);
> >>>> - start_pte = NULL;
> >>>> - err = split_folio(folio);
> >>>> - folio_unlock(folio);
> >>>> - folio_put(folio);
> >>>> - if (err)
> >>>> - break;
> >>>> - start_pte = pte =
> >>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> - if (!start_pte)
> >>>> - break;
> >>>> - arch_enter_lazy_mmu_mode();
> >>>> - pte--;
> >>>> - addr -= PAGE_SIZE;
> >>>> - continue;
> >>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> + FPB_IGNORE_SOFT_DIRTY;
> >>>> + int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> + fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >> if (pte_pfn(pte) == folio_pfn(folio))
> >> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >
> > if (folio_test_large(folio)) {
> > [...]
> > nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> > fpb_flags, NULL);
> > + if (folio_estimated_sharers(folio) > 1)
> > + continue;
> >
> > Could we use folio_estimated_sharers as an early exit point here?
>
> I'm not sure what this is saving where you have it? Did you mean to put it
> before folio_pte_batch()? Currently it is just saving a single conditional.

Apologies for the confusion. I made a diff to provide clarity.

diff --git a/mm/madvise.c b/mm/madvise.c
index 56c7ba7bd558..c3458fdea82a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,

nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
fpb_flags, NULL);
-
// Could we use folio_estimated_sharers as an early exit point here?
+ if (folio_estimated_sharers(folio) > 1)
+ continue;
if (nr < folio_nr_pages(folio)) {
int err;

- if (folio_estimated_sharers(folio) > 1)
- continue;
if (pageout_anon_only_filter &&
!folio_test_anon(folio))
continue;
if (!folio_trylock(folio))

>
> But now that I think about it a bit more, I remember why I was originally
> unconditionally calling folio_pte_batch(). Given its a large folio, if the split
> fails, we can move the cursor to the pte where the next folio begins so we don't
> have to iterate through one pte at a time which would cause us to keep calling
> folio_estimated_sharers(), folio_test_anon(), etc on the same folio until we get
> to the next boundary.
>
> Of course the common case at this point will be for the split to succeed, but
> then we are going to iterate over ever single PTE anyway - one way or another
> they are all fetched into cache. So I feel like its neater not to add the
> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
>
> >
> > if (nr < folio_nr_pages(folio)) {
> > int err;
> >
> > - if (folio_estimated_sharers(folio) > 1)
> > - continue;
> > [...]
> >
> >>
> >>>
> >>>> +
> >>>> + if (nr < folio_nr_pages(folio)) {
> >>>> + int err;
> >>>> +
> >>>> + if (folio_estimated_sharers(folio) > 1)
> >>>> + continue;
> >>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> + continue;
> >>>> + if (!folio_trylock(folio))
> >>>> + continue;
> >>>> + folio_get(folio);
> >>>> + arch_leave_lazy_mmu_mode();
> >>>> + pte_unmap_unlock(start_pte, ptl);
> >>>> + start_pte = NULL;
> >>>> + err = split_folio(folio);
> >>>> + folio_unlock(folio);
> >>>> + folio_put(folio);
> >>>> + if (err)
> >>>> + continue;
> >>>> + start_pte = pte =
> >>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> + if (!start_pte)
> >>>> + break;
> >>>> + arch_enter_lazy_mmu_mode();
> >>>> + nr = 0;
> >>>> + continue;
> >>>> + }
> >>>> }
> >>>>
> >>>> /*
> >>>> * Do not interfere with other mappings of this folio and
> >>>> - * non-LRU folio.
> >>>> + * non-LRU folio. If we have a large folio at this point, we
> >>>> + * know it is fully mapped so if its mapcount is the same as its
> >>>> + * number of pages, it must be exclusive.
> >>>> */
> >>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> + if (!folio_test_lru(folio) ||
> >>>> + folio_mapcount(folio) != folio_nr_pages(folio))
> >>>> continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> continue;
> >>>>
> >>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> - if (!pageout && pte_young(ptent)) {
> >>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> - tlb->fullmm);
> >>>> - ptent = pte_mkold(ptent);
> >>>> - set_pte_at(mm, addr, pte, ptent);
> >>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> + if (!pageout) {
> >>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> + if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> + tlb_remove_tlb_entry(tlb, pte, addr);
> >
> > IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> > tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> > pte clearing?
>
> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> there is a good reason to do the original clear-mkold-set for some arches?

IIRC, some architectures (e.g. PPC) don't update the TLB with
ptep_test_and_clear_young()
and tlb_remove_tlb_entry().

In my new patch[1], I use refresh_full_ptes() and
tlb_remove_tlb_entries() to batch-update the
access and dirty bits.

[1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com

Thanks,
Lance

>
> >
> > Thanks,
> > Lance
> >
> >
> >
> >>>> + }
> >>>
> >>> This looks so smart. if it is not pageout, we have increased pte
> >>> and addr here; so nr is 0 and we don't need to increase again in
> >>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>
> >>> otherwise, nr won't be 0. so we will increase addr and
> >>> pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>
> >>>
> >>>
> >>>> }
> >>>>
> >>>> /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song <[email protected]>
> >>
> >> Thanks!
> >>
> >>
>

2024-03-20 17:39:50

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 20/03/2024 14:35, Lance Yang wrote:
> On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <[email protected]> wrote:
>>
>> Hi Lance, Barry,
>>
>> Sorry - I totally missed this when you originally sent it!
>
> No worries at all :)
>
>>
>>
>> On 13/03/2024 14:02, Lance Yang wrote:
>>> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 13/03/2024 07:19, Barry Song wrote:
>>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>>>> range. This change means that large folios will be maintained all the
>>>>>> way to swap storage. This both improves performance during swap-out, by
>>>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>>>> a separate series).
>>>>>>
>>>>>> Folios that are not fully mapped in the target range are still split,
>>>>>> but note that behavior is changed so that if the split fails for any
>>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>>>> next pte in the range and continue work on the proceeding folios.
>>>>>> Previously any failure of this sort would cause the entire operation to
>>>>>> give up and no folios mapped at higher addresses were paged out or made
>>>>>> cold. Given large folios are becoming more common, this old behavior
>>>>>> would have likely lead to wasted opportunities.
>>>>>>
>>>>>> While we are at it, change the code that clears young from the ptes to
>>>>>> use ptep_test_and_clear_young(), which is more efficient than
>>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>>>> where the old approach would require unfolding/refolding and the new
>>>>>> approach can be done in place.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>
>>>>> This looks so much better than our initial RFC.
>>>>> Thank you for your excellent work!
>>>>
>>>> Thanks - its a team effort - I had your PoC and David's previous batching work
>>>> to use as a template.
>>>>
>>>>>
>>>>>> ---
>>>>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>>> 1 file changed, 51 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>>>> --- a/mm/madvise.c
>>>>>> +++ b/mm/madvise.c
>>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> LIST_HEAD(folio_list);
>>>>>> bool pageout_anon_only_filter;
>>>>>> unsigned int batch_count = 0;
>>>>>> + int nr;
>>>>>>
>>>>>> if (fatal_signal_pending(current))
>>>>>> return -EINTR;
>>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> return 0;
>>>>>> flush_tlb_batched_pending(mm);
>>>>>> arch_enter_lazy_mmu_mode();
>>>>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>>>> + nr = 1;
>>>>>> ptent = ptep_get(pte);
>>>>>>
>>>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> continue;
>>>>>>
>>>>>> /*
>>>>>> - * Creating a THP page is expensive so split it only if we
>>>>>> - * are sure it's worth. Split it if we are only owner.
>>>>>> + * If we encounter a large folio, only split it if it is not
>>>>>> + * fully mapped within the range we are operating on. Otherwise
>>>>>> + * leave it as is so that it can be swapped out whole. If we
>>>>>> + * fail to split a folio, leave it in place and advance to the
>>>>>> + * next pte in the range.
>>>>>> */
>>>>>> if (folio_test_large(folio)) {
>>>>>> - int err;
>>>>>> -
>>>>>> - if (folio_estimated_sharers(folio) > 1)
>>>>>> - break;
>>>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> - break;
>>>>>> - if (!folio_trylock(folio))
>>>>>> - break;
>>>>>> - folio_get(folio);
>>>>>> - arch_leave_lazy_mmu_mode();
>>>>>> - pte_unmap_unlock(start_pte, ptl);
>>>>>> - start_pte = NULL;
>>>>>> - err = split_folio(folio);
>>>>>> - folio_unlock(folio);
>>>>>> - folio_put(folio);
>>>>>> - if (err)
>>>>>> - break;
>>>>>> - start_pte = pte =
>>>>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> - if (!start_pte)
>>>>>> - break;
>>>>>> - arch_enter_lazy_mmu_mode();
>>>>>> - pte--;
>>>>>> - addr -= PAGE_SIZE;
>>>>>> - continue;
>>>>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>> + FPB_IGNORE_SOFT_DIRTY;
>>>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>>>> +
>>>>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>> + fpb_flags, NULL);
>>>>>
>>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>>>> are doing madvise() on a portion of a large folio.
>>>>
>>>> Good idea. Something like this?:
>>>>
>>>> if (pte_pfn(pte) == folio_pfn(folio))
>>>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> fpb_flags, NULL);
>>>>
>>>> If we are not mapping the first page of the folio, then it can't be a full
>>>> mapping, so no need to call folio_pte_batch(). Just split it.
>>>
>>> if (folio_test_large(folio)) {
>>> [...]
>>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>> fpb_flags, NULL);
>>> + if (folio_estimated_sharers(folio) > 1)
>>> + continue;
>>>
>>> Could we use folio_estimated_sharers as an early exit point here?
>>
>> I'm not sure what this is saving where you have it? Did you mean to put it
>> before folio_pte_batch()? Currently it is just saving a single conditional.
>
> Apologies for the confusion. I made a diff to provide clarity.
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 56c7ba7bd558..c3458fdea82a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>
> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> fpb_flags, NULL);
> -
> // Could we use folio_estimated_sharers as an early exit point here?
> + if (folio_estimated_sharers(folio) > 1)
> + continue;
> if (nr < folio_nr_pages(folio)) {
> int err;
>
> - if (folio_estimated_sharers(folio) > 1)
> - continue;
> if (pageout_anon_only_filter &&
> !folio_test_anon(folio))
> continue;
> if (!folio_trylock(folio))


I'm still not really getting it; with my code, if nr < the folio size, we will
try to split and if we estimate that the folio is not exclusive we will avoid
locking the folio, etc. If nr == folio size, we will proceed to the precise
exclusivity check (which is cheap once we know the folio is fully mapped by this
process).

With your change, we will always do the estimated exclusivity check and then
proceed to the precise check; that seems like duplication to me?

>
>>
>> But now that I think about it a bit more, I remember why I was originally
>> unconditionally calling folio_pte_batch(). Given its a large folio, if the split
>> fails, we can move the cursor to the pte where the next folio begins so we don't
>> have to iterate through one pte at a time which would cause us to keep calling
>> folio_estimated_sharers(), folio_test_anon(), etc on the same folio until we get
>> to the next boundary.
>>
>> Of course the common case at this point will be for the split to succeed, but
>> then we are going to iterate over ever single PTE anyway - one way or another
>> they are all fetched into cache. So I feel like its neater not to add the
>> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
>>
>>>
>>> if (nr < folio_nr_pages(folio)) {
>>> int err;
>>>
>>> - if (folio_estimated_sharers(folio) > 1)
>>> - continue;
>>> [...]
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> + if (nr < folio_nr_pages(folio)) {
>>>>>> + int err;
>>>>>> +
>>>>>> + if (folio_estimated_sharers(folio) > 1)
>>>>>> + continue;
>>>>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> + continue;
>>>>>> + if (!folio_trylock(folio))
>>>>>> + continue;
>>>>>> + folio_get(folio);
>>>>>> + arch_leave_lazy_mmu_mode();
>>>>>> + pte_unmap_unlock(start_pte, ptl);
>>>>>> + start_pte = NULL;
>>>>>> + err = split_folio(folio);
>>>>>> + folio_unlock(folio);
>>>>>> + folio_put(folio);
>>>>>> + if (err)
>>>>>> + continue;
>>>>>> + start_pte = pte =
>>>>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> + if (!start_pte)
>>>>>> + break;
>>>>>> + arch_enter_lazy_mmu_mode();
>>>>>> + nr = 0;
>>>>>> + continue;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> * Do not interfere with other mappings of this folio and
>>>>>> - * non-LRU folio.
>>>>>> + * non-LRU folio. If we have a large folio at this point, we
>>>>>> + * know it is fully mapped so if its mapcount is the same as its
>>>>>> + * number of pages, it must be exclusive.
>>>>>> */
>>>>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>>>> + if (!folio_test_lru(folio) ||
>>>>>> + folio_mapcount(folio) != folio_nr_pages(folio))
>>>>>> continue;
>>>>>
>>>>> This looks so perfect and is exactly what I wanted to achieve.
>>>>>
>>>>>>
>>>>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> continue;
>>>>>>
>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>> -
>>>>>> - if (!pageout && pte_young(ptent)) {
>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>> - tlb->fullmm);
>>>>>> - ptent = pte_mkold(ptent);
>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> + if (!pageout) {
>>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>
>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
>>> pte clearing?
>>
>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
>> there is a good reason to do the original clear-mkold-set for some arches?
>
> IIRC, some of the architecture(ex, PPC) don't update TLB with
> ptep_test_and_clear_young()
> and tlb_remove_tlb_entry().

Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
address please" - albeit it's deferred and batched. I'll look into this.

>
> In my new patch[1], I use refresh_full_ptes() and
> tlb_remove_tlb_entries() to batch-update the
> access and dirty bits.

I want to avoid the per-pte clear-modify-set approach, because this doesn't
perform well on arm64 when using contpte mappings; it will cause the contpte
mapping to be unfolded by the first clear that touches the contpte block, then
refolded by the last set to touch the block. That's expensive.
ptep_test_and_clear_young() doesn't suffer that problem.
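
As a rough illustration (assuming a folio mapped by a contpte block on arm64;
simplified from the hunks above):

        /* Old per-pte sequence: the first clear in the block unfolds the
         * whole contpte mapping and the last set refolds it. */
        ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
        ptent = pte_mkold(ptent);
        set_pte_at(mm, addr, pte, ptent);
        tlb_remove_tlb_entry(tlb, pte, addr);

        /* New sequence: clears the access flag in place, so the contpte
         * block stays folded. */
        if (ptep_test_and_clear_young(vma, addr, pte))
                tlb_remove_tlb_entry(tlb, pte, addr);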

>
> [1] https://lore.kernel.org/linux-mm/[email protected]
>
> Thanks,
> Lance
>
>>
>>>
>>> Thanks,
>>> Lance
>>>
>>>
>>>
>>>>>> + }
>>>>>
>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>
>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>> pte by nr.
>>>>
>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>> madvise_free_pte_range().
>>>>
>>>>
>>>>>
>>>>>
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>>
>>>>> Overall, LGTM,
>>>>>
>>>>> Reviewed-by: Barry Song <[email protected]>
>>>>
>>>> Thanks!
>>>>
>>>>
>>


2024-03-21 01:39:13

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Thu, Mar 21, 2024 at 1:38 AM Ryan Roberts <[email protected]> wrote:
>
> On 20/03/2024 14:35, Lance Yang wrote:
> > On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> Hi Lance, Barry,
> >>
> >> Sorry - I totally missed this when you originally sent it!
> >
> > No worries at all :)
> >
> >>
> >>
> >> On 13/03/2024 14:02, Lance Yang wrote:
> >>> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 13/03/2024 07:19, Barry Song wrote:
> >>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>>>> range. This change means that large folios will be maintained all the
> >>>>>> way to swap storage. This both improves performance during swap-out, by
> >>>>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>>>> a separate series).
> >>>>>>
> >>>>>> Folios that are not fully mapped in the target range are still split,
> >>>>>> but note that behavior is changed so that if the split fails for any
> >>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>>>> next pte in the range and continue work on the proceeding folios.
> >>>>>> Previously any failure of this sort would cause the entire operation to
> >>>>>> give up and no folios mapped at higher addresses were paged out or made
> >>>>>> cold. Given large folios are becoming more common, this old behavior
> >>>>>> would have likely lead to wasted opportunities.
> >>>>>>
> >>>>>> While we are at it, change the code that clears young from the ptes to
> >>>>>> use ptep_test_and_clear_young(), which is more efficent than
> >>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>>>> where the old approach would require unfolding/refolding and the new
> >>>>>> approach can be done in place.
> >>>>>>
> >>>>>> Signed-off-by: Ryan Roberts <[email protected]>
> >>>>>
> >>>>> This looks so much better than our initial RFC.
> >>>>> Thank you for your excellent work!
> >>>>
> >>>> Thanks - its a team effort - I had your PoC and David's previous batching work
> >>>> to use as a template.
> >>>>
> >>>>>
> >>>>>> ---
> >>>>>> mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>>> 1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>> LIST_HEAD(folio_list);
> >>>>>> bool pageout_anon_only_filter;
> >>>>>> unsigned int batch_count = 0;
> >>>>>> + int nr;
> >>>>>>
> >>>>>> if (fatal_signal_pending(current))
> >>>>>> return -EINTR;
> >>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>> return 0;
> >>>>>> flush_tlb_batched_pending(mm);
> >>>>>> arch_enter_lazy_mmu_mode();
> >>>>>> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>>>> + for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>>>> + nr = 1;
> >>>>>> ptent = ptep_get(pte);
> >>>>>>
> >>>>>> if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>> continue;
> >>>>>>
> >>>>>> /*
> >>>>>> - * Creating a THP page is expensive so split it only if we
> >>>>>> - * are sure it's worth. Split it if we are only owner.
> >>>>>> + * If we encounter a large folio, only split it if it is not
> >>>>>> + * fully mapped within the range we are operating on. Otherwise
> >>>>>> + * leave it as is so that it can be swapped out whole. If we
> >>>>>> + * fail to split a folio, leave it in place and advance to the
> >>>>>> + * next pte in the range.
> >>>>>> */
> >>>>>> if (folio_test_large(folio)) {
> >>>>>> - int err;
> >>>>>> -
> >>>>>> - if (folio_estimated_sharers(folio) > 1)
> >>>>>> - break;
> >>>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> - break;
> >>>>>> - if (!folio_trylock(folio))
> >>>>>> - break;
> >>>>>> - folio_get(folio);
> >>>>>> - arch_leave_lazy_mmu_mode();
> >>>>>> - pte_unmap_unlock(start_pte, ptl);
> >>>>>> - start_pte = NULL;
> >>>>>> - err = split_folio(folio);
> >>>>>> - folio_unlock(folio);
> >>>>>> - folio_put(folio);
> >>>>>> - if (err)
> >>>>>> - break;
> >>>>>> - start_pte = pte =
> >>>>>> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> - if (!start_pte)
> >>>>>> - break;
> >>>>>> - arch_enter_lazy_mmu_mode();
> >>>>>> - pte--;
> >>>>>> - addr -= PAGE_SIZE;
> >>>>>> - continue;
> >>>>>> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>>>> + FPB_IGNORE_SOFT_DIRTY;
> >>>>>> + int max_nr = (end - addr) / PAGE_SIZE;
> >>>>>> +
> >>>>>> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>>> + fpb_flags, NULL);
> >>>>>
> >>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>>>> are doing madvise() on a portion of a large folio.
> >>>>
> >>>> Good idea. Something like this?:
> >>>>
> >>>> if (pte_pfn(pte) == folio_pfn(folio)
> >>>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> fpb_flags, NULL);
> >>>>
> >>>> If we are not mapping the first page of the folio, then it can't be a full
> >>>> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>
> >>> if (folio_test_large(folio)) {
> >>> [...]
> >>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>> fpb_flags, NULL);
> >>> + if (folio_estimated_sharers(folio) > 1)
> >>> + continue;
> >>>
> >>> Could we use folio_estimated_sharers as an early exit point here?
> >>
> >> I'm not sure what this is saving where you have it? Did you mean to put it
> >> before folio_pte_batch()? Currently it is just saving a single conditional.
> >
> > Apologies for the confusion. I made a diff to provide clarity.
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 56c7ba7bd558..c3458fdea82a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >
> > nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> > fpb_flags, NULL);
> > -
> > // Could we use folio_estimated_sharers as an early exit point here?
> > + if (folio_estimated_sharers(folio) > 1)
> > + continue;
> > if (nr < folio_nr_pages(folio)) {
> > int err;
> >
> > - if (folio_estimated_sharers(folio) > 1)
> > - continue;
> > if (pageout_anon_only_filter &&
> > !folio_test_anon(folio))
> > continue;
> > if (!folio_trylock(folio))
>
>
> I'm still not really getting it; with my code, if nr < the folio size, we will
> try to split and if we estimate that the folio is not exclusive we will avoid
> locking the folio, etc. If nr == folio size, we will proceed to the precise
> exclusivity check (which is cheap once we know the folio is fully mapped by this
> process).
>
> With your change, we will always do the estimated exclusive check then proceed
> to the precise check; seems like duplication to me?

Agreed. The estimated exclusive check is indeed redundant with my change.

>
> >
> >>
> >> But now that I think about it a bit more, I remember why I was originally
> >> unconditionally calling folio_pte_batch(). Given its a large folio, if the split
> >> fails, we can move the cursor to the pte where the next folio begins so we don't
> >> have to iterate through one pte at a time which would cause us to keep calling
> >> folio_estimated_sharers(), folio_test_anon(), etc on the same folio until we get
> >> to the next boundary.
> >>
> >> Of course the common case at this point will be for the split to succeed, but
> >> then we are going to iterate over ever single PTE anyway - one way or another
> >> they are all fetched into cache. So I feel like its neater not to add the
> >> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
> >>
> >>>
> >>> if (nr < folio_nr_pages(folio)) {
> >>> int err;
> >>>
> >>> - if (folio_estimated_sharers(folio) > 1)
> >>> - continue;
> >>> [...]
> >>>
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> + if (nr < folio_nr_pages(folio)) {
> >>>>>> + int err;
> >>>>>> +
> >>>>>> + if (folio_estimated_sharers(folio) > 1)
> >>>>>> + continue;
> >>>>>> + if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> + continue;
> >>>>>> + if (!folio_trylock(folio))
> >>>>>> + continue;
> >>>>>> + folio_get(folio);
> >>>>>> + arch_leave_lazy_mmu_mode();
> >>>>>> + pte_unmap_unlock(start_pte, ptl);
> >>>>>> + start_pte = NULL;
> >>>>>> + err = split_folio(folio);
> >>>>>> + folio_unlock(folio);
> >>>>>> + folio_put(folio);
> >>>>>> + if (err)
> >>>>>> + continue;
> >>>>>> + start_pte = pte =
> >>>>>> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> + if (!start_pte)
> >>>>>> + break;
> >>>>>> + arch_enter_lazy_mmu_mode();
> >>>>>> + nr = 0;
> >>>>>> + continue;
> >>>>>> + }
> >>>>>> }
> >>>>>>
> >>>>>> /*
> >>>>>> * Do not interfere with other mappings of this folio and
> >>>>>> - * non-LRU folio.
> >>>>>> + * non-LRU folio. If we have a large folio at this point, we
> >>>>>> + * know it is fully mapped so if its mapcount is the same as its
> >>>>>> + * number of pages, it must be exclusive.
> >>>>>> */
> >>>>>> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>>>> + if (!folio_test_lru(folio) ||
> >>>>>> + folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>>> continue;
> >>>>>
> >>>>> This looks so perfect and is exactly what I wanted to achieve.
> >>>>>
> >>>>>>
> >>>>>> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> continue;
> >>>>>>
> >>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>> -
> >>>>>> - if (!pageout && pte_young(ptent)) {
> >>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>> - tlb->fullmm);
> >>>>>> - ptent = pte_mkold(ptent);
> >>>>>> - set_pte_at(mm, addr, pte, ptent);
> >>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> + if (!pageout) {
> >>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
> >>>
> >>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> >>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> >>> pte clearing?
> >>
> >> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> >> there is a good reason to do the original clear-mkold-set for some arches?
> >
> > IIRC, some of the architecture(ex, PPC) don't update TLB with
> > ptep_test_and_clear_young()
> > and tlb_remove_tlb_entry().
>
> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
> address please" - albeit its deferred and batched. I'll look into this.
>
> >
> > In my new patch[1], I use refresh_full_ptes() and
> > tlb_remove_tlb_entries() to batch-update the
> > access and dirty bits.
>
> I want to avoid the per-pte clear-modify-set approach, because this doesn't
> perform well on arm64 when using contpte mappings; it will cause the contpe
> mapping to be unfolded by the first clear that touches the contpte block, then
> refolded by the last set to touch the block. That's expensive.
> ptep_test_and_clear_young() doesn't suffer that problem.

Thanks for explaining. I got it.

I think that other architectures will benefit from the per-pte clear-modify-set
approach. IMO, refresh_full_ptes() can be overridden by arm64.

Thanks,
Lance
>
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]
> >
> > Thanks,
> > Lance
> >
> >>
> >>>
> >>> Thanks,
> >>> Lance
> >>>
> >>>
> >>>
> >>>>>> + }
> >>>>>
> >>>>> This looks so smart. if it is not pageout, we have increased pte
> >>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>
> >>>>> otherwise, nr won't be 0. so we will increase addr and
> >>>>> pte by nr.
> >>>>
> >>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>> madvise_free_pte_range().
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>> }
> >>>>>>
> >>>>>> /*
> >>>>>> --
> >>>>>> 2.25.1
> >>>>>>
> >>>>>
> >>>>> Overall, LGTM,
> >>>>>
> >>>>> Reviewed-by: Barry Song <[email protected]>
> >>>>
> >>>> Thanks!
> >>>>
> >>>>
> >>
>

2024-03-21 04:41:49

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

Ryan Roberts <[email protected]> writes:

> Hi Huang, Ying,
>
>
> On 12/03/2024 07:51, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>> Multi-size THP enables performance improvements by allocating large,
>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>> across many cores, under high memory pressure, the speed regresses. This
>>> is due to bottlenecking on the increased number of TLBIs added due to
>>> all the extra folio splitting when the large folios are swapped out.
>>>
>>> Therefore, solve this regression by adding support for swapping out mTHP
>>> without needing to split the folio, just like is already done for
>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>> and when the swap backing store is a non-rotating block device. These
>>> are the same constraints as for the existing PMD-sized THP swap-out
>>> support.
>>>
>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>> prerequisite for swapping-in mTHP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller can fall back to splitting the folio
>>> and allocates individual entries (as per existing PMD-sized THP
>>> fallback).
>>>
>>> The per-order current clusters are maintained per-cpu using the existing
>>> infrastructure. This is done to avoid interleving pages from different
>>> tasks, which would prevent IO being batched. This is already done for
>>> the order-0 allocations so we follow the same pattern.
>>>
>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>> ensures that when the swap file is getting full, space doesn't get tied
>>> up in the per-cpu reserves.
>>>
>>> This change only modifies swap to be able to accept any order mTHP. It
>>> doesn't change the callers to elide doing the actual split. That will be
>>> done in separate changes.
>
> [...]
>
>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>> }
>>>
>>> if (si->swap_map[offset]) {
>>> + VM_WARN_ON(order > 0);
>>> unlock_cluster(ci);
>>> if (!n_ret)
>>> goto scan;
>>> else
>>> goto done;
>>> }
>>> - WRITE_ONCE(si->swap_map[offset], usage);
>>> - inc_cluster_info_page(si, si->cluster_info, offset);
>>> + memset(si->swap_map + offset, usage, nr_pages);
>>
>> Add barrier() here corresponds to original WRITE_ONCE()?
>> unlock_cluster(ci) may be NOP for some swap devices.
>
> Looking at this a bit more closely, I'm not sure this is needed. Even if there
> is no cluster, the swap_info is still locked, so unlocking that will act as a
> barrier. There are a number of other callsites that memset(si->swap_map) without
> an explicit barrier and with the swap_info locked.
>
> Looking at the original commit that added the WRITE_ONCE() it was worried about
> a race with reading swap_map in _swap_info_get(). But that site is now annotated
> with a data_race(), which will suppress the warning. And I don't believe there
> are any places that read swap_map locklessly and depend upon observing ordering
> between it and other state? So I think the si unlock is sufficient?
>
> I'm not planning to add barrier() here. Let me know if you disagree.

swap_map[] may be read locklessly in swap_offset_available_and_locked()
in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect
as early as possible there.

>
>>
>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>> unlock_cluster(ci);

--
Best Regards,
Huang, Ying

2024-03-21 12:22:19

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

On 21/03/2024 04:39, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> Hi Huang, Ying,
>>
>>
>> On 12/03/2024 07:51, Huang, Ying wrote:
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> Multi-size THP enables performance improvements by allocating large,
>>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>>> across many cores, under high memory pressure, the speed regresses. This
>>>> is due to bottlenecking on the increased number of TLBIs added due to
>>>> all the extra folio splitting when the large folios are swapped out.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out mTHP
>>>> without needing to split the folio, just like is already done for
>>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>>> and when the swap backing store is a non-rotating block device. These
>>>> are the same constraints as for the existing PMD-sized THP swap-out
>>>> support.
>>>>
>>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>>> prerequisite for swapping-in mTHP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller can fall back to splitting the folio
>>>> and allocates individual entries (as per existing PMD-sized THP
>>>> fallback).
>>>>
>>>> The per-order current clusters are maintained per-cpu using the existing
>>>> infrastructure. This is done to avoid interleving pages from different
>>>> tasks, which would prevent IO being batched. This is already done for
>>>> the order-0 allocations so we follow the same pattern.
>>>>
>>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>>> ensures that when the swap file is getting full, space doesn't get tied
>>>> up in the per-cpu reserves.
>>>>
>>>> This change only modifies swap to be able to accept any order mTHP. It
>>>> doesn't change the callers to elide doing the actual split. That will be
>>>> done in separate changes.
>>
>> [...]
>>
>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>> }
>>>>
>>>> if (si->swap_map[offset]) {
>>>> + VM_WARN_ON(order > 0);
>>>> unlock_cluster(ci);
>>>> if (!n_ret)
>>>> goto scan;
>>>> else
>>>> goto done;
>>>> }
>>>> - WRITE_ONCE(si->swap_map[offset], usage);
>>>> - inc_cluster_info_page(si, si->cluster_info, offset);
>>>> + memset(si->swap_map + offset, usage, nr_pages);
>>>
>>> Add barrier() here corresponds to original WRITE_ONCE()?
>>> unlock_cluster(ci) may be NOP for some swap devices.
>>
>> Looking at this a bit more closely, I'm not sure this is needed. Even if there
>> is no cluster, the swap_info is still locked, so unlocking that will act as a
>> barrier. There are a number of other callsites that memset(si->swap_map) without
>> an explicit barrier and with the swap_info locked.
>>
>> Looking at the original commit that added the WRITE_ONCE() it was worried about
>> a race with reading swap_map in _swap_info_get(). But that site is now annotated
>> with a data_race(), which will suppress the warning. And I don't believe there
>> are any places that read swap_map locklessly and depend upon observing ordering
>> between it and other state? So I think the si unlock is sufficient?
>>
>> I'm not planning to add barrier() here. Let me know if you disagree.
>
> swap_map[] may be read locklessly in swap_offset_available_and_locked()
> in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect
> as early as possible there.

Afraid I'm not convinced by that argument; if it's racing, it's racing - the
lockless side needs to be robust (it is). Adding the compiler barrier limits the
compiler's options, which could lead to slower code in this path. If your
argument is that you want to reduce the window where
swap_offset_available_and_locked() could observe a free swap slot but then see
that it's taken after it gets the si lock, that seems like a micro-optimization
to me, which we should avoid if we can.

By removing the WRITE_ONCE() and using memset, the lockless reader could
observe tearing though. I don't think that should cause a problem (because
everything is rechecked under the lock), but if we want to avoid it, then
perhaps we just need to loop over WRITE_ONCE() here instead of using memset?
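
Something like this is what I have in mind (just a sketch, reusing the variable
names from the patch; the loop index would obviously need declaring):

        for (i = 0; i < nr_pages; i++)
                WRITE_ONCE(si->swap_map[offset + i], usage);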


>
>>
>>>
>>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>> unlock_cluster(ci);
>
> --
> Best Regards,
> Huang, Ying


2024-03-21 13:38:26

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

>>>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>> -
>>>>>>>> - if (!pageout && pte_young(ptent)) {
>>>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>> - tlb->fullmm);
>>>>>>>> - ptent = pte_mkold(ptent);
>>>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> + if (!pageout) {
>>>>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>
>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
>>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
>>>>> pte clearing?
>>>>
>>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
>>>> there is a good reason to do the original clear-mkold-set for some arches?
>>>
>>> IIRC, some of the architecture(ex, PPC) don't update TLB with
>>> ptep_test_and_clear_young()
>>> and tlb_remove_tlb_entry().

Afraid I'm still struggling with this comment. Do you mean to say that powerpc
invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
tlb_remove_tlb_entry() would be redundant here, and likely cause performance
degradation on that architecture?

IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
entry, that's what ptep_clear_flush_young() is for.

But I do see that for some cases of the 32-bit ppc, there appears to be a flush:

#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
                                              unsigned long addr, pte_t *ptep)
{
        unsigned long old;
        old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
        if (old & _PAGE_HASHPTE)
                flush_hash_entry(mm, ptep, addr);    <<<<<<<<

        return (old & _PAGE_ACCESSED) != 0;
}
#define ptep_test_and_clear_young(__vma, __addr, __ptep) \
        __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)

Is that what you are describing? Does anyone know why flush_hash_entry() is
called? I'd say that's a bug in ppc and not a reason not to use
ptep_test_and_clear_young() in the common code!

Thanks,
Ryan


>>
>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
>> address please" - albeit its deferred and batched. I'll look into this.
>>
>>>
>>> In my new patch[1], I use refresh_full_ptes() and
>>> tlb_remove_tlb_entries() to batch-update the
>>> access and dirty bits.
>>
>> I want to avoid the per-pte clear-modify-set approach, because this doesn't
>> perform well on arm64 when using contpte mappings; it will cause the contpe
>> mapping to be unfolded by the first clear that touches the contpte block, then
>> refolded by the last set to touch the block. That's expensive.
>> ptep_test_and_clear_young() doesn't suffer that problem.
>
> Thanks for explaining. I got it.
>
> I think that other architectures will benefit from the per-pte clear-modify-set
> approach. IMO, refresh_full_ptes() can be overridden by arm64.
>
> Thanks,
> Lance
>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/[email protected]
>>>
>>> Thanks,
>>> Lance
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lance
>>>>>
>>>>>
>>>>>
>>>>>>>> + }
>>>>>>>
>>>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>>>
>>>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>>>> pte by nr.
>>>>>>
>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>>>> madvise_free_pte_range().
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> /*
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> Overall, LGTM,
>>>>>>>
>>>>>>> Reviewed-by: Barry Song <[email protected]>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>
>>


2024-03-21 14:55:47

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Thu, Mar 21, 2024 at 9:38 PM Ryan Roberts <[email protected]> wrote:
>
> >>>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>>>> -
> >>>>>>>> - if (!pageout && pte_young(ptent)) {
> >>>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>>>> - tlb->fullmm);
> >>>>>>>> - ptent = pte_mkold(ptent);
> >>>>>>>> - set_pte_at(mm, addr, pte, ptent);
> >>>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>> + if (!pageout) {
> >>>>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>
> >>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> >>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> >>>>> pte clearing?
> >>>>
> >>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> >>>> there is a good reason to do the original clear-mkold-set for some arches?
> >>>
> >>> IIRC, some of the architecture(ex, PPC) don't update TLB with
> >>> ptep_test_and_clear_young()
> >>> and tlb_remove_tlb_entry().
>
> Afraid I'm still struggling with this comment. Do you mean to say that powerpc
> invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
> tlb_remove_tlb_entry() would be redundant here, and likely cause performance
> degradation on that architecture?

I just thought that using ptep_test_and_clear_young() instead of
ptep_get_and_clear_full() + pte_mkold() might not be correct.
However, it's most likely that I was mistaken :(

I also have a question: why wasn't madvise_cold_or_pageout_pte_range() using
ptep_test_and_clear_young() previously, instead of
ptep_get_and_clear_full() + pte_mkold()?

/*
* Some of architecture(ex, PPC) don't update TLB
* with set_pte_at and tlb_remove_tlb_entry so for
* the portability, remap the pte with old|clean
* after pte clearing.
*/

According to this comment from madvise_free_pte_range(), IIUC, we need to
call ptep_get_and_clear_full() to clear the PTE, and then remap the
PTE with old|clean.

Thanks,
Lance

>
> IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
> entry, that's what ptep_clear_flush_young() is for.
>
> But I do see that for some cases of the 32-bit ppc, there appears to be a flush:
>
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> unsigned long old;
> old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
> if (old & _PAGE_HASHPTE)
> flush_hash_entry(mm, ptep, addr); <<<<<<<<
>
> return (old & _PAGE_ACCESSED) != 0;
> }
> #define ptep_test_and_clear_young(__vma, __addr, __ptep) \
> __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)
>
> Is that what you are describing? Does any anyone know why flush_hash_entry() is
> called? I'd say that's a bug in ppc and not a reason not to use
> ptep_test_and_clear_young() in the common code!
>
> Thanks,
> Ryan
>
>
> >>
> >> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
> >> address please" - albeit its deferred and batched. I'll look into this.
> >>
> >>>
> >>> In my new patch[1], I use refresh_full_ptes() and
> >>> tlb_remove_tlb_entries() to batch-update the
> >>> access and dirty bits.
> >>
> >> I want to avoid the per-pte clear-modify-set approach, because this doesn't
> >> perform well on arm64 when using contpte mappings; it will cause the contpe
> >> mapping to be unfolded by the first clear that touches the contpte block, then
> >> refolded by the last set to touch the block. That's expensive.
> >> ptep_test_and_clear_young() doesn't suffer that problem.
> >
> > Thanks for explaining. I got it.
> >
> > I think that other architectures will benefit from the per-pte clear-modify-set
> > approach. IMO, refresh_full_ptes() can be overridden by arm64.
> >
> > Thanks,
> > Lance
> >>
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/[email protected]
> >>>
> >>> Thanks,
> >>> Lance
> >>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Lance
> >>>>>
> >>>>>
> >>>>>
> >>>>>>>> + }
> >>>>>>>
> >>>>>>> This looks so smart. if it is not pageout, we have increased pte
> >>>>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>>>
> >>>>>>> otherwise, nr won't be 0. so we will increase addr and
> >>>>>>> pte by nr.
> >>>>>>
> >>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>>>> madvise_free_pte_range().
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>> --
> >>>>>>>> 2.25.1
> >>>>>>>>
> >>>>>>>
> >>>>>>> Overall, LGTM,
> >>>>>>>
> >>>>>>> Reviewed-by: Barry Song <[email protected]>
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>>
> >>>>
> >>
>

2024-03-21 15:25:10

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On 21/03/2024 14:55, Lance Yang wrote:
> On Thu, Mar 21, 2024 at 9:38 PM Ryan Roberts <[email protected]> wrote:
>>
>>>>>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>>>> -
>>>>>>>>>> - if (!pageout && pte_young(ptent)) {
>>>>>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>>>> - tlb->fullmm);
>>>>>>>>>> - ptent = pte_mkold(ptent);
>>>>>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>>>> + if (!pageout) {
>>>>>>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>
>>>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
>>>>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
>>>>>>> pte clearing?
>>>>>>
>>>>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
>>>>>> there is a good reason to do the original clear-mkold-set for some arches?
>>>>>
>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with
>>>>> ptep_test_and_clear_young()
>>>>> and tlb_remove_tlb_entry().
>>
>> Afraid I'm still struggling with this comment. Do you mean to say that powerpc
>> invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
>> tlb_remove_tlb_entry() would be redundant here, and likely cause performance
>> degradation on that architecture?
>
> I just thought that using ptep_test_and_clear_young() instead of
> ptep_get_and_clear_full() + pte_mkold() might not be correct.
> However, it's most likely that I was mistaken :(

OK, I'm pretty confident that my usage is correct.

>
> I also have a question. Why aren't we using ptep_test_and_clear_young() in
> madvise_cold_or_pageout_pte_range(), but instead
> ptep_get_and_clear_full() + pte_mkold() as we did previously.
>
> /*
> * Some of architecture(ex, PPC) don't update TLB
> * with set_pte_at and tlb_remove_tlb_entry so for
> * the portability, remap the pte with old|clean
> * after pte clearing.
> */

Ahh, I see; this is a comment from madvise_free_pte_range(). I don't quite
understand that comment. I suspect it might be out of date, or it might be
saying that doing set_pte_at(pte_mkold(ptep_get(ptent))) is not correct because
it is not atomic and the HW could set the dirty bit between the get and the set.
Doing the atomic ptep_get_and_clear_full() means you go via a pte_none() state,
so if the TLB is racing it will see that the entry isn't valid and fault.
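
To spell out the difference (a sketch of the two sequences, not the actual
patch code):

        /* Non-atomic: HW can set the dirty bit between the get and the set,
         * and that update would then be silently overwritten. */
        ptent = ptep_get(pte);
        ptent = pte_mkold(ptent);
        set_pte_at(mm, addr, pte, ptent);

        /* Atomic: the PTE passes through a cleared state, so a racing HW
         * walker faults rather than updating an entry we are about to
         * overwrite. */
        ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
        ptent = pte_mkold(ptent);
        set_pte_at(mm, addr, pte, ptent);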

Note that madvise_free_pte_range() is trying to clear both the access and dirty
bits, whereas madvise_cold_or_pageout_pte_range() is only trying to clear the
access bit. There is a special helper to clear the access bit atomically -
ptep_test_and_clear_young() - but there is no helper to clear the access *and*
dirty bit, I don't believe. There is ptep_set_access_flags(), but that sets
flags to a "more permissive setting" (i.e. allows setting the flags, not
clearing them). Perhaps this constraint can be relaxed given we will follow up
with an explicit TLBI - it would require auditing all the implementations.

>
> According to this comment from madvise_free_pte_range. IIUC, we need to
> call ptep_get_and_clear_full() to clear the PTE, and then remap the
> PTE with old|clean.
>
> Thanks,
> Lance
>
>>
>> IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
>> entry, that's what ptep_clear_flush_young() is for.
>>
>> But I do see that for some cases of the 32-bit ppc, there appears to be a flush:
>>
>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep)
>> {
>> unsigned long old;
>> old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
>> if (old & _PAGE_HASHPTE)
>> flush_hash_entry(mm, ptep, addr); <<<<<<<<
>>
>> return (old & _PAGE_ACCESSED) != 0;
>> }
>> #define ptep_test_and_clear_young(__vma, __addr, __ptep) \
>> __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)
>>
>> Is that what you are describing? Does any anyone know why flush_hash_entry() is
>> called? I'd say that's a bug in ppc and not a reason not to use
>> ptep_test_and_clear_young() in the common code!
>>
>> Thanks,
>> Ryan
>>
>>
>>>>
>>>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
>>>> address please" - albeit its deferred and batched. I'll look into this.
>>>>
>>>>>
>>>>> In my new patch[1], I use refresh_full_ptes() and
>>>>> tlb_remove_tlb_entries() to batch-update the
>>>>> access and dirty bits.
>>>>
>>>> I want to avoid the per-pte clear-modify-set approach, because this doesn't
>>>> perform well on arm64 when using contpte mappings; it will cause the contpe
>>>> mapping to be unfolded by the first clear that touches the contpte block, then
>>>> refolded by the last set to touch the block. That's expensive.
>>>> ptep_test_and_clear_young() doesn't suffer that problem.
>>>
>>> Thanks for explaining. I got it.
>>>
>>> I think that other architectures will benefit from the per-pte clear-modify-set
>>> approach. IMO, refresh_full_ptes() can be overridden by arm64.
>>>
>>> Thanks,
>>> Lance
>>>>
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/[email protected]
>>>>>
>>>>> Thanks,
>>>>> Lance
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lance
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>> + }
>>>>>>>>>
>>>>>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>>>>>
>>>>>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>>>>>> pte by nr.
>>>>>>>>
>>>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>>>>>> madvise_free_pte_range().
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>> --
>>>>>>>>>> 2.25.1
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Overall, LGTM,
>>>>>>>>>
>>>>>>>>> Reviewed-by: Barry Song <[email protected]>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>


2024-03-22 00:56:24

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

On Thu, Mar 21, 2024 at 11:24 PM Ryan Roberts <[email protected]> wrote:
>
> On 21/03/2024 14:55, Lance Yang wrote:
> > On Thu, Mar 21, 2024 at 9:38 PM Ryan Roberts <[email protected]> wrote:
> >>
> >>>>>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>>>>>> -
> >>>>>>>>>> - if (!pageout && pte_young(ptent)) {
> >>>>>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>>>>>> - tlb->fullmm);
> >>>>>>>>>> - ptent = pte_mkold(ptent);
> >>>>>>>>>> - set_pte_at(mm, addr, pte, ptent);
> >>>>>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>>>> + if (!pageout) {
> >>>>>>>>>> + for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>>>>>> + if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>
> >>>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> >>>>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> >>>>>>> pte clearing?
> >>>>>>
> >>>>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> >>>>>> there is a good reason to do the original clear-mkold-set for some arches?
> >>>>>
> >>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with
> >>>>> ptep_test_and_clear_young()
> >>>>> and tlb_remove_tlb_entry().
> >>
> >> Afraid I'm still struggling with this comment. Do you mean to say that powerpc
> >> invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
> >> tlb_remove_tlb_entry() would be redundant here, and likely cause performance
> >> degradation on that architecture?
> >
> > I just thought that using ptep_test_and_clear_young() instead of
> > ptep_get_and_clear_full() + pte_mkold() might not be correct.
> > However, it's most likely that I was mistaken :(
>
> OK, I'm pretty confident that my usage is correct.
>
> >
> > I also have a question. Why aren't we using ptep_test_and_clear_young() in
> > madvise_cold_or_pageout_pte_range(), but instead
> > ptep_get_and_clear_full() + pte_mkold() as we did previously.
> >
> > /*
> > * Some of architecture(ex, PPC) don't update TLB
> > * with set_pte_at and tlb_remove_tlb_entry so for
> > * the portability, remap the pte with old|clean
> > * after pte clearing.
> > */
>
> Ahh, I see; this is a comment from madvise_free_pte_range() I don't quite
> understand that comment. I suspect it might be out of date, or saying that doing
> set_pte_at(pte_mkold(ptep_get(ptent))) is not correct because it is not atomic
> and the HW could set the dirty bit between the get and the set. Doing the atomic
> ptep_get_and_clear_full() means you go via a pte_none() state, so if the TLB is
> racing it will see the entry isn't valid and fault.

Thanks for your analysis and explanations!

>
> Note that madvise_free_pte_range() is trying to clear both the access and dirty
> bits, whereas madvise_cold_or_pageout_pte_range() is only trying to clear the
> access bit. There is a special helper to clear the access bit atomically -
> ptep_test_and_clear_young() - but there is no helper to clear the access *and*
> dirty bit, I don't believe. There is ptep_set_access_flags(), but that sets
> flags to a "more permissive setting" (i.e. allows setting the flags, not
> clearing them). Perhaps this constraint can be relaxed given we will follow up
> with an explicit TLBI - it would require auditing all the implementations.

Thanks for bringing this! I'll take a closer look at it.

Thanks again for your time!
Lance

>
> >
> > According to this comment from madvise_free_pte_range. IIUC, we need to
> > call ptep_get_and_clear_full() to clear the PTE, and then remap the
> > PTE with old|clean.
> >
> > Thanks,
> > Lance
> >
> >>
> >> IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
> >> entry, that's what ptep_clear_flush_young() is for.
> >>
> >> But I do see that for some cases of the 32-bit ppc, there appears to be a flush:
> >>
> >> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> >> static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
> >> unsigned long addr, pte_t *ptep)
> >> {
> >> unsigned long old;
> >> old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
> >> if (old & _PAGE_HASHPTE)
> >> flush_hash_entry(mm, ptep, addr); <<<<<<<<
> >>
> >> return (old & _PAGE_ACCESSED) != 0;
> >> }
> >> #define ptep_test_and_clear_young(__vma, __addr, __ptep) \
> >> __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)
> >>
> >> Is that what you are describing? Does any anyone know why flush_hash_entry() is
> >> called? I'd say that's a bug in ppc and not a reason not to use
> >> ptep_test_and_clear_young() in the common code!
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >>>>
> >>>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
> >>>> address please" - albeit its deferred and batched. I'll look into this.
> >>>>
> >>>>>
> >>>>> In my new patch[1], I use refresh_full_ptes() and
> >>>>> tlb_remove_tlb_entries() to batch-update the
> >>>>> access and dirty bits.
> >>>>
> >>>> I want to avoid the per-pte clear-modify-set approach, because this doesn't
> >>>> perform well on arm64 when using contpte mappings; it will cause the contpe
> >>>> mapping to be unfolded by the first clear that touches the contpte block, then
> >>>> refolded by the last set to touch the block. That's expensive.
> >>>> ptep_test_and_clear_young() doesn't suffer that problem.
> >>>
> >>> Thanks for explaining. I got it.
> >>>
> >>> I think that other architectures will benefit from the per-pte clear-modify-set
> >>> approach. IMO, refresh_full_ptes() can be overridden by arm64.
> >>>
> >>> Thanks,
> >>> Lance
> >>>>
> >>>>>
> >>>>> [1] https://lore.kernel.org/linux-mm/[email protected]
> >>>>>
> >>>>> Thanks,
> >>>>> Lance
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Lance
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> + }
> >>>>>>>>>
> >>>>>>>>> This looks so smart. if it is not pageout, we have increased pte
> >>>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>>>>>
> >>>>>>>>> otherwise, nr won't be 0. so we will increase addr and
> >>>>>>>>> pte by nr.
> >>>>>>>>
> >>>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>>>>>> madvise_free_pte_range().
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> /*
> >>>>>>>>>> --
> >>>>>>>>>> 2.25.1
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Overall, LGTM,
> >>>>>>>>>
> >>>>>>>>> Reviewed-by: Barry Song <[email protected]>
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >>
>

2024-03-22 02:40:28

by Huang, Ying

[permalink] [raw]
Subject: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

Hi, Paul,

Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
For some example kernel code as follows,

"
unsigned char x[16];

void writer(void)
{
        memset(x, 1, sizeof(x));
        /* To make memset() take effect ASAP */
        barrier();
}

unsigned char reader(int n)
{
        return READ_ONCE(x[n]);
}
"

where, writer() and reader() may be called on 2 CPUs without any lock.
It's acceptable for reader() to read the written value a little later.
Our questions are,

1. Because it's impossible for accesses to "unsigned char" to tear,
WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
correctness, right?

2. We use barrier() and READ_ONCE() in writer() and reader() because we
want to make writing take effect ASAP. Is that good practice? Or is it
a micro-optimization that should be avoided?

--
Best Regards,
Huang, Ying

2024-03-22 02:41:49

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

Ryan Roberts <[email protected]> writes:

> On 21/03/2024 04:39, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>> Hi Huang, Ying,
>>>
>>>
>>> On 12/03/2024 07:51, Huang, Ying wrote:
>>>> Ryan Roberts <[email protected]> writes:
>>> [...]
>>>
>>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>> }
>>>>>
>>>>> if (si->swap_map[offset]) {
>>>>> + VM_WARN_ON(order > 0);
>>>>> unlock_cluster(ci);
>>>>> if (!n_ret)
>>>>> goto scan;
>>>>> else
>>>>> goto done;
>>>>> }
>>>>> - WRITE_ONCE(si->swap_map[offset], usage);
>>>>> - inc_cluster_info_page(si, si->cluster_info, offset);
>>>>> + memset(si->swap_map + offset, usage, nr_pages);
>>>>
>>>> Add barrier() here corresponds to original WRITE_ONCE()?
>>>> unlock_cluster(ci) may be NOP for some swap devices.
>>>
>>> Looking at this a bit more closely, I'm not sure this is needed. Even if there
>>> is no cluster, the swap_info is still locked, so unlocking that will act as a
>>> barrier. There are a number of other callsites that memset(si->swap_map) without
>>> an explicit barrier and with the swap_info locked.
>>>
>>> Looking at the original commit that added the WRITE_ONCE() it was worried about
>>> a race with reading swap_map in _swap_info_get(). But that site is now annotated
>>> with a data_race(), which will suppress the warning. And I don't believe there
>>> are any places that read swap_map locklessly and depend upon observing ordering
>>> between it and other state? So I think the si unlock is sufficient?
>>>
>>> I'm not planning to add barrier() here. Let me know if you disagree.
>>
>> swap_map[] may be read locklessly in swap_offset_available_and_locked()
>> in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect
>> as early as possible there.
>
> Afraid I'm not convinced by that argument; if it's racing, it's racing - the

It's not a race.

> lockless side needs to be robust (it is). Adding the compiler barrier limits the
> compiler's options which could lead to slower code in this path. If your
> argument is that you want to reduce the window where
> swap_offset_available_and_locked() could observe a free swap slot but then see
> that its taken after it gets the si lock, that seems like a micro-optimization
> to me, which we should avoid if we can.

Yes. I think that it is a micro-optimization too. I had thought that
it was common practice to use WRITE_ONCE()/READ_ONCE() or barrier() for
intentionally racy data accesses, to make the change available as soon as
possible. But I may be wrong here.

> By remnoving the WRITE_ONCE() and using memset, the lockless reader could
> observe tearing though. I don't think that should cause a problem (because
> everything is rechecked with under the lock), but if we want to avoid it, then
> perhaps we just need to loop over WRITE_ONCE() here instead of using memset?

IIUC, in practice that isn't necessary, because the type of si->swap_map[]
is "unsigned char", and it isn't possible to tear an "unsigned char". In
theory, it may be better to use WRITE_ONCE() because we may change the
type of si->swap_map[] at some point (who knows). I don't have a strong
opinion here.

>>>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>>> unlock_cluster(ci);

--
Best Regards,
Huang, Ying

2024-03-22 09:23:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

On 22/03/2024 02:38, Huang, Ying wrote:
> Hi, Paul,
>
> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> For some example kernel code as follows,
>
> "
> unsigned char x[16];
>
> void writer(void)
> {
> memset(x, 1, sizeof(x));
> /* To make memset() take effect ASAP */
> barrier();
> }
>
> unsigned char reader(int n)
> {
> return READ_ONCE(x[n]);
> }
> "
>
> where, writer() and reader() may be called on 2 CPUs without any lock.

For the situation we are discussing, writer() is always called with a spin lock
held. So spin_unlock() will act as the barrier in this case; that's my argument
for not needing the explicit barrier(), anyway. Happy to be told I'm wrong.
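
To be concrete, the pattern under discussion is roughly this (sketching from
the patch context, not the exact code):

        spin_lock(&si->lock);
        ...
        memset(si->swap_map + offset, usage, nr_pages);
        ...
        spin_unlock(&si->lock);  /* release; the memset can't be reordered
                                    past this point */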

> It's acceptable for reader() to read the written value a little later.
> Our questions are,
>
> 1. because it's impossible for accessing "unsigned char" to cause
> tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> correctness, right?
>
> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> want to make writing take effect ASAP. Is it a good practice? Or it's
> a micro-optimization that should be avoided?
>
> --
> Best Regards,
> Huang, Ying


2024-03-22 09:39:48

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

On 22/03/2024 02:39, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> On 21/03/2024 04:39, Huang, Ying wrote:
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> Hi Huang, Ying,
>>>>
>>>>
>>>> On 12/03/2024 07:51, Huang, Ying wrote:
>>>>> Ryan Roberts <[email protected]> writes:
>>>> [...]
>>>>
>>>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>>> }
>>>>>>
>>>>>> if (si->swap_map[offset]) {
>>>>>> + VM_WARN_ON(order > 0);
>>>>>> unlock_cluster(ci);
>>>>>> if (!n_ret)
>>>>>> goto scan;
>>>>>> else
>>>>>> goto done;
>>>>>> }
>>>>>> - WRITE_ONCE(si->swap_map[offset], usage);
>>>>>> - inc_cluster_info_page(si, si->cluster_info, offset);
>>>>>> + memset(si->swap_map + offset, usage, nr_pages);
>>>>>
>>>>> Add barrier() here corresponds to original WRITE_ONCE()?
>>>>> unlock_cluster(ci) may be NOP for some swap devices.
>>>>
>>>> Looking at this a bit more closely, I'm not sure this is needed. Even if there
>>>> is no cluster, the swap_info is still locked, so unlocking that will act as a
>>>> barrier. There are a number of other callsites that memset(si->swap_map) without
>>>> an explicit barrier and with the swap_info locked.
>>>>
>>>> Looking at the original commit that added the WRITE_ONCE() it was worried about
>>>> a race with reading swap_map in _swap_info_get(). But that site is now annotated
>>>> with a data_race(), which will suppress the warning. And I don't believe there
>>>> are any places that read swap_map locklessly and depend upon observing ordering
>>>> between it and other state? So I think the si unlock is sufficient?
>>>>
>>>> I'm not planning to add barrier() here. Let me know if you disagree.
>>>
>>> swap_map[] may be read locklessly in swap_offset_available_and_locked()
>>> in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect
>>> as early as possible there.
>>
>> Afraid I'm not convinced by that argument; if it's racing, it's racing - the
>
> It's not a race.
>
>> lockless side needs to be robust (it is). Adding the compiler barrier limits the
>> compiler's options which could lead to slower code in this path. If your
>> argument is that you want to reduce the window where
>> swap_offset_available_and_locked() could observe a free swap slot but then see
>> that its taken after it gets the si lock, that seems like a micro-optimization
>> to me, which we should avoid if we can.
>
> Yes. I think that it is a micro-optimization too. I had thought that
> it is a common practice to use WRITE_ONCE()/READ_ONCE() or barrier() in
> intentional racy data accessing to make the change available as soon as
> possible. But I may be wrong here.

My understanding is that WRITE_ONCE() forces the compiler to emit a store at
that point in the program: it can't just rely on knowing that it has previously
written the same value to that location, it can't reorder the store to later in
the program, and it must store the whole word atomically so that no tearing can
be observed. But given that swap_map is only ever written with the si lock held,
I don't believe we require the first two of those semantics. It should be enough
to know that the compiler has emitted all the stores (if it determines they are
required) prior to the spin_unlock(). I'm not sure about the anti-tearing
guarantee.

Happy to be told I'm wrong here!

>
>> By remnoving the WRITE_ONCE() and using memset, the lockless reader could
>> observe tearing though. I don't think that should cause a problem (because
>> everything is rechecked with under the lock), but if we want to avoid it, then
>> perhaps we just need to loop over WRITE_ONCE() here instead of using memset?
>
> IIUC, in practice that isn't necessary, because type of si->swap_map[]
> is "unsigned char". It isn't possible to tear "unsigned char".

In practice, perhaps. But I guess the compiler is free to update the char
bit-by-bit if it wants to, if the store is not marked WRITE_ONCE()?

> In
> theory, it may be better to use WRITE_ONCE() because we may change the
> type of si->swap_map[] at some time (who knows). I don't have a strong
> opinion here.

The way I see it, the precedent is already set; there are a number of places
that already use memset to update swap_map. They are all under the si lock, and
none use barrier(). If it's wrong here, then it's wrong in those places too, I
believe.

>
>>>>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>>>> unlock_cluster(ci);
>
> --
> Best Regards,
> Huang, Ying


2024-03-22 13:20:21

by Chris Li

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

Hi Ying,

Very interesting question.

On Thu, Mar 21, 2024 at 7:40 PM Huang, Ying <[email protected]> wrote:
>
> Hi, Paul,
>
> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> For some example kernel code as follows,
>
> "
> unsigned char x[16];
>
> void writer(void)
> {
>         memset(x, 1, sizeof(x));
>         /* To make memset() take effect ASAP */
>         barrier();
> }
>
> unsigned char reader(int n)
> {
>         return READ_ONCE(x[n]);
> }
> "
>
> where, writer() and reader() may be called on 2 CPUs without any lock.
> It's acceptable for reader() to read the written value a little later.

I am trying to see if your program can be converted into a litmus test so
that the Linux memory model tools can answer it for you.
Because you allow reader() to read the written value a little later, there
is nothing the test can verify against. The reader can observe the value
either before or after the writer's update; both are valid observations.

To make your test example more complete, you need the reader/writer to
do more actions to expose the race. For example, "if (READ_ONCE(x[n]))
y = 1;". Then you can ask whether it is possible to observe x[n] == 0
and y == 1. That might not be the exact test condition you have in
mind, but you get the idea.

We want a test example whose resulting observable state indicates that the
bad thing did happen (or shows that it is not possible).
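
As a rough illustration in the tools/memory-model C litmus format (this is
only a sketch of the question as posed, not something herd7 has been run on):
with a single lockless read and no other observation, the exists clause below
is trivially satisfiable because the reader may simply run first, so the test
on its own verifies nothing.

C lockless-byte-read

{}

P0(int *x)
{
	WRITE_ONCE(*x, 1);
}

P1(int *x)
{
	int r0;

	r0 = READ_ONCE(*x);
}

exists (1:r0=0)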

> Our questions are,
>
> 1. because it's impossible for accessing "unsigned char" to cause
> tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> correctness, right?

We need to define what the expected behavior is for the outcome to be
"correct", possibly including the actions before and after the barrier.

Chris

>
> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> want to make writing take effect ASAP. Is it a good practice? Or it's
> a micro-optimization that should be avoided?
>
> --
> Best Regards,
> Huang, Ying
>

2024-03-23 02:11:33

by Akira Yokosawa

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

[Use Paul's reachable address in CC;
trimmed CC list, keeping only those who have responded so far.]

Hello Huang,
Let me chime in.

On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
> Hi, Paul,
>
> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> For some example kernel code as follows,
>
> "
> unsigned char x[16];
>
> void writer(void)
> {
>         memset(x, 1, sizeof(x));
>         /* To make memset() take effect ASAP */
>         barrier();
> }
>
> unsigned char reader(int n)
> {
>         return READ_ONCE(x[n]);
> }
> "
>
> where, writer() and reader() may be called on 2 CPUs without any lock.
> It's acceptable for reader() to read the written value a little later.
> Our questions are,
>
> 1. because it's impossible for accessing "unsigned char" to cause
> tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> correctness, right?
>
> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> want to make writing take effect ASAP. Is it a good practice? Or it's
> a micro-optimization that should be avoided?

Why don't you consult Documentation/memory-barriers.txt, especially
the section titled "COMPILER BARRIER"?

TL;DR:

barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not
memory barriers. They just restrict compiler optimizations and don't
have any effect with regard to "make writing take effect ASAP".

If you have further questions, please don't hesitate to ask.

Regards,
Akira (an LKMM Reviewer).

>
> --
> Best Regards,
> Huang, Ying


2024-03-25 08:50:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote:
> [Use Paul's reachable address in CC;
> trimmed CC list, keeping only those who have responded so far.]
>
> Hello Huang,
> Let me chime in.
>
> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
> > Hi, Paul,
> >
> > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> > For some example kernel code as follows,
> >
> > "
> > unsigned char x[16];
> >
> > void writer(void)
> > {
> >         memset(x, 1, sizeof(x));
> >         /* To make memset() take effect ASAP */
> >         barrier();
> > }
> >
> > unsigned char reader(int n)
> > {
> >         return READ_ONCE(x[n]);
> > }
> > "
> >
> > where, writer() and reader() may be called on 2 CPUs without any lock.
> > It's acceptable for reader() to read the written value a little later.

What are your consistency requirements? For but one example, if reader(3)
gives the new value, is it OK for a later call to reader(2) to give the
old value?

Until we know what your requirements are, it is hard to say whether the
above code meets those requirements. In the meantime, I can imagine
requirements that it meets and others that it does not.

Also, Akira's points below are quite important.

Thanx, Paul

> > Our questions are,
> >
> > 1. because it's impossible for accessing "unsigned char" to cause
> > tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> > correctness, right?
> >
> > 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> > want to make writing take effect ASAP. Is it a good practice? Or it's
> > a micro-optimization that should be avoided?
>
> Why don't you consult Documentation/memory-barriers.txt, especially
> the section titled "COMPILER BARRIER"?
>
> TL;DR:
>
> barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not
> memory barriers. They just restrict compiler optimizations and don't
> have any effect with regard to "make writing take effect ASAP".
>
> If you have further questions, please don't hesitate to ask.
>
> Regards,
> Akira (an LKMM Reviewer).
>
> >
> > --
> > Best Regards,
> > Huang, Ying
>

2024-03-25 12:24:18

by Huang, Ying

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

"Paul E. McKenney" <[email protected]> writes:

> On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote:
>> [Use Paul's reachable address in CC;
>> trimmed CC list, keeping only those who have responded so far.]
>>
>> Hello Huang,
>> Let me chime in.
>>
>> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
>> > Hi, Paul,
>> >
>> > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>> > For some example kernel code as follows,
>> >
>> > "
>> > unsigned char x[16];
>> >
>> > void writer(void)
>> > {
>> >         memset(x, 1, sizeof(x));
>> >         /* To make memset() take effect ASAP */
>> >         barrier();
>> > }
>> >
>> > unsigned char reader(int n)
>> > {
>> >         return READ_ONCE(x[n]);
>> > }
>> > "
>> >
>> > where, writer() and reader() may be called on 2 CPUs without any lock.
>> > It's acceptable for reader() to read the written value a little later.
>
> What are your consistency requirements? For but one example, if reader(3)
> gives the new value, is it OK for a later call to reader(2) to give the
> old value?

writer() will be called with a lock held (sorry, my previous words
weren't correct here). After the racy check in reader(), we will
acquire the lock and check "x[n]" again to confirm. And there are no
dependencies between different "n". All in all, we can accept almost
all races between writer() and reader().

My question is: if there are some operations between writer() and the
unlock in its caller, does the barrier() in writer() make any sense?
Does it make the write instructions appear a little earlier in the
compiled code? Does it mark the memory as possibly being read racily?
Or does it make no sense at all?

> Until we know what your requirements are, it is hard to say whether the
> above code meets those requirements. In the meantime, I can imagine
> requirements that it meets and others that it does not.
>
> Also, Akira's points below are quite important.

Replied to his email.

--
Best Regards,
Huang, Ying

2024-03-25 12:24:19

by Huang, Ying

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

Ryan Roberts <[email protected]> writes:

> On 22/03/2024 02:38, Huang, Ying wrote:
>> Hi, Paul,
>>
>> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>> For some example kernel code as follows,
>>
>> "
>> unsigned char x[16];
>>
>> void writer(void)
>> {
>>         memset(x, 1, sizeof(x));
>>         /* To make memset() take effect ASAP */
>>         barrier();
>> }
>>
>> unsigned char reader(int n)
>> {
>>         return READ_ONCE(x[n]);
>> }
>> "
>>
>> where, writer() and reader() may be called on 2 CPUs without any lock.
>
> For the situation we are discussing, writer() is always called with a spin lock
> held. So spin_unlock() will act as the barrier in this case; that's my argument
> for not needing the explicit barrier(), anyway. Happy to be told I'm wrong.

Yes. spin_unlock() is a barrier too. There are some operations between
writer() and spin_unlock(), so I want to check whether it makes any sense
to add a barrier earlier.

>> It's acceptable for reader() to read the written value a little later.
>> Our questions are,
>>
>> 1. because it's impossible for accessing "unsigned char" to cause
>> tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
>> correctness, right?
>>
>> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
>> want to make writing take effect ASAP. Is it a good practice? Or it's
>> a micro-optimization that should be avoided?

--
Best Regards,
Huang, Ying

2024-03-25 13:12:59

by Huang, Ying

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

Akira Yokosawa <[email protected]> writes:

> [Use Paul's reachable address in CC;
> trimmed CC list, keeping only those who have responded so far.]

Thanks a lot!

> Hello Huang,
> Let me chime in.
>
> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
>> Hi, Paul,
>>
>> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>> For some example kernel code as follows,
>>
>> "
>> unsigned char x[16];
>>
>> void writer(void)
>> {
>>         memset(x, 1, sizeof(x));
>>         /* To make memset() take effect ASAP */
>>         barrier();
>> }
>>
>> unsigned char reader(int n)
>> {
>>         return READ_ONCE(x[n]);
>> }
>> "
>>
>> where, writer() and reader() may be called on 2 CPUs without any lock.
>> It's acceptable for reader() to read the written value a little later.
>> Our questions are,
>>
>> 1. because it's impossible for accessing "unsigned char" to cause
>> tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
>> correctness, right?
>>
>> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
>> want to make writing take effect ASAP. Is it a good practice? Or it's
>> a micro-optimization that should be avoided?
>
> Why don't you consult Documentation/memory-barriers.txt, especially
> the section titled "COMPILER BARRIER"?
>
> TL;DR:
>
> barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not
> memory barriers. They just restrict compiler optimizations and don't
> have any effect with regard to "make writing take effect ASAP".

Yes. In theory, this is absolutely correct.

My question is: in practice, will a compiler barrier make the CPU run (or
see) the memory read/write instructions a little earlier, by preventing the
compiler from reordering operations that come after the read/write to before
it?

> If you have further questions, please don't hesitate to ask.

--
Best Regards,
Huang, Ying

2024-03-26 17:10:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: Can you help us on memory barrier usage? (was Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders)

On 25/03/2024 03:16, Huang, Ying wrote:
> "Paul E. McKenney" <[email protected]> writes:
>
>> On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote:
>>> [Use Paul's reachable address in CC;
>>> trimmed CC list, keeping only those who have responded so far.]
>>>
>>> Hello Huang,
>>> Let me chime in.
>>>
>>> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
>>>> Hi, Paul,
>>>>
>>>> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>>>> For some example kernel code as follows,
>>>>
>>>> "
>>>> unsigned char x[16];
>>>>
>>>> void writer(void)
>>>> {
>>>>         memset(x, 1, sizeof(x));
>>>>         /* To make memset() take effect ASAP */
>>>>         barrier();
>>>> }
>>>>
>>>> unsigned char reader(int n)
>>>> {
>>>>         return READ_ONCE(x[n]);
>>>> }
>>>> "
>>>>
>>>> where, writer() and reader() may be called on 2 CPUs without any lock.
>>>> It's acceptable for reader() to read the written value a little later.
>>
>> What are your consistency requirements? For but one example, if reader(3)
>> gives the new value, is it OK for a later call to reader(2) to give the
>> old value?
>
> writer() will be called with a lock held (sorry, my previous words
> weren't correct here). After the racy check in reader(), we will
> acquire the lock and check "x[n]" again to confirm. And there are no
> dependencies between different "n". All in all, we can accept almost
> all races between writer() and reader().
>
> My question is: if there are some operations between writer() and the
> unlock in its caller, does the barrier() in writer() make any sense?
> Does it make the write instructions appear a little earlier in the
> compiled code? Does it mark the memory as possibly being read racily?
> Or does it make no sense at all?

A compiler barrier is necessary but not sufficient to guarantee that the
stores become visible to the reader; you would also need a memory barrier to
stop the HW from reordering, IIUC. So I really fail to see the value of adding
barrier().

As you state above, there is no correctness issue here. It's just a question of
whether the barrier() can make the store appear earlier to the reader as a
(micro!) performance optimization. You'll get both the compiler and memory
barrier from the slightly later spin_unlock(). The patch that added the original
WRITE_ONCE() was concerned with squashing KCSAN warnings, not with performance
optimization. (And the addition of the WRITE_ONCE() wasn't actually needed to
achieve that aim.)
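
To spell out the compiler-vs-memory barrier point with a tiny sketch (generic
names, not the patch code): barrier() only constrains the compiler, whereas the
spin_unlock() that follows anyway constrains both the compiler and the CPU.

#include <linux/compiler.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);
static unsigned char demo_map[16];

static void demo_writer(void)
{
        spin_lock(&demo_lock);
        demo_map[0] = 1;
        barrier();               /* the compiler may not move the stores across this... */
        demo_map[1] = 1;         /* ...but the CPU still can on a weakly ordered arch */
        spin_unlock(&demo_lock); /* release: orders the prior stores for the CPU as well */
}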

So I'm planning to repost my series (hopefully tomorrow) without the barrier()
present, unless you still want to try to convince me that it is useful.

Thanks,
Ryan

>
>> Until we know what your requirements are, it is hard to say whether the
>> above code meets those requirements. In the meantime, I can imagine
>> requirements that it meets and others that it does not.
>>
>> Also, Akira's points below are quite important.
>
> Replied to his email.
>
> --
> Best Regards,
> Huang, Ying