2024-02-14 20:45:27

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 00/10] mm/memory: optimize unmap/zap with PTE-mapped THP

This series is based on [1]. Similar to what we did with fork(), let's
implement PTE batching during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, int latest iteration [2] with a focus on arm6 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and amkes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable | New | Change
---------------------------------------------
4KiB | 0.058110 | 0.057715 | - 1%
16KiB | 0.044198 | 0.035469 | -20%
32KiB | 0.034216 | 0.023522 | -31%
64KiB | 0.029207 | 0.018434 | -37%
128KiB | 0.026579 | 0.014026 | -47%
256KiB | 0.025130 | 0.011756 | -53%
512KiB | 0.024292 | 0.010703 | -56%
1024KiB | 0.023812 | 0.010294 | -57%
2048KiB | 0.023785 | 0.009910 | -58%

CCing especially s390x folks, because they have a tlb freeing hooks that
needs adjustment. Only tested on x86-64 for now, will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negleglible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lkml.kernel.org/r/[email protected]
[5] https://lkml.kernel.org/r/[email protected]

---

Sending this out earlier than I ususally would, so we can get this into
mm-unstable for Ryan to base his cont-pte work on this ASAP.

The performance numbers are from v1. I did a quick benchmark run of v3
and nothing significantly changed, relevant code paths remained unchanged.

v2 -> v3:
* "mm/mmu_gather: add __tlb_remove_folio_pages()"
-> Slightly adjusted patch description
* "mm/mmu_gather: improve cond_resched() handling with large folios and
expensive page freeing"
-> Use new macro for magic value and avoid code duplication
-> Extend patch description
* Pick up RB's

v1 -> v2:
* "mm/memory: factor out zapping of present pte into zap_present_pte()"
-> Initialize "struct folio *folio" to NULL
* "mm/memory: handle !page case in zap_present_pte() separately"
-> Extend description regarding arch_check_zapped_pte()
* "mm/mmu_gather: add __tlb_remove_folio_pages()"
-> ENCODED_PAGE_BIT_NR_PAGES_NEXT
-> Extend patch description regarding "batching more"
* "mm/mmu_gather: improve cond_resched() handling with large folios and
expensive page freeing"
-> Handle the (so far) theoretical case of possible soft lockups when
we zero/poison memory when freeing pages. Try to keep old behavior in
that corner case to be safe.
* "mm/memory: optimize unmap/zap with PTE-mapped THP"
-> Clarify description of new ptep clearing functions regarding "present
PTEs"
-> Extend patch description regarding relaxed mapcount sanity checks
-> Improve zap_present_ptes() description
* Pick up RB's

Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

David Hildenbrand (10):
mm/memory: factor out zapping of present pte into zap_present_pte()
mm/memory: handle !page case in zap_present_pte() separately
mm/memory: further separate anon and pagecache folio handling in
zap_present_pte()
mm/memory: factor out zapping folio pte into zap_present_folio_pte()
mm/mmu_gather: pass "delay_rmap" instead of encoded page to
__tlb_remove_page_size()
mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
mm/mmu_gather: add tlb_remove_tlb_entries()
mm/mmu_gather: add __tlb_remove_folio_pages()
mm/mmu_gather: improve cond_resched() handling with large folios and
expensive page freeing
mm/memory: optimize unmap/zap with PTE-mapped THP

arch/powerpc/include/asm/tlb.h | 2 +
arch/s390/include/asm/tlb.h | 30 ++++--
include/asm-generic/tlb.h | 40 ++++++--
include/linux/mm_types.h | 37 ++++++--
include/linux/pgtable.h | 70 ++++++++++++++
mm/memory.c | 169 +++++++++++++++++++++++----------
mm/mmu_gather.c | 111 ++++++++++++++++++----
mm/swap.c | 12 ++-
mm/swap_state.c | 15 ++-
9 files changed, 393 insertions(+), 93 deletions(-)


base-commit: 7e56cf9a7f108e8129d75cea0dabc9488fb4defa
--
2.43.0



2024-02-14 20:46:32

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 04/10] mm/memory: factor out zapping folio pte into zap_present_folio_pte()

Let's prepare for further changes by factoring it out into a separate
function.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 53 ++++++++++++++++++++++++++++++++---------------------
1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7a3ebb6e5909..a3efc4da258a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
}

-static inline void zap_present_pte(struct mmu_gather *tlb,
- struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
- unsigned long addr, struct zap_details *details,
- int *rss, bool *force_flush, bool *force_break)
+static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, struct folio *folio,
+ struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
+ struct zap_details *details, int *rss, bool *force_flush,
+ bool *force_break)
{
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;
- struct folio *folio;
- struct page *page;
-
- page = vm_normal_page(vma, addr, ptent);
- if (!page) {
- /* We don't need up-to-date accessed/dirty bits. */
- ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
- arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- VM_WARN_ON_ONCE(userfaultfd_wp(vma));
- ksm_might_unmap_zero_page(mm, ptent);
- return;
- }
-
- folio = page_folio(page);
- if (unlikely(!should_zap_folio(details, folio)))
- return;

if (!folio_test_anon(folio)) {
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
@@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
}
}

+static inline void zap_present_pte(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+ unsigned long addr, struct zap_details *details,
+ int *rss, bool *force_flush, bool *force_break)
+{
+ struct mm_struct *mm = tlb->mm;
+ struct folio *folio;
+ struct page *page;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page) {
+ /* We don't need up-to-date accessed/dirty bits. */
+ ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ arch_check_zapped_pte(vma, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+ ksm_might_unmap_zero_page(mm, ptent);
+ return;
+ }
+
+ folio = page_folio(page);
+ if (unlikely(!should_zap_folio(details, folio)))
+ return;
+ zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
+ rss, force_flush, force_break);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
--
2.43.0


2024-02-14 20:46:32

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 06/10] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP

Nowadays, encoded pages are only used in mmu_gather handling. Let's
update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.

If encoded page pointers would ever be used in other context again, we'd
likely want to change the defines to reflect their context (e.g.,
ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.

This is a preparation for using the remaining spare bit to indicate that
the next item in an array of encoded pages is a "nr_pages" argument and
not an encoded page.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm_types.h | 17 +++++++++++------
mm/mmu_gather.c | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..1b89eec0d6df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -210,8 +210,8 @@ struct page {
*
* An 'encoded_page' pointer is a pointer to a regular 'struct page', but
* with the low bits of the pointer indicating extra context-dependent
- * information. Not super-common, but happens in mmu_gather and mlock
- * handling, and this acts as a type system check on that use.
+ * information. Only used in mmu_gather handling, and this acts as a type
+ * system check on that use.
*
* We only really have two guaranteed bits in general, although you could
* play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
@@ -220,21 +220,26 @@ struct page {
* Use the supplied helper functions to endcode/decode the pointer and bits.
*/
struct encoded_page;
-#define ENCODE_PAGE_BITS 3ul
+
+#define ENCODED_PAGE_BITS 3ul
+
+/* Perform rmap removal after we have flushed the TLB. */
+#define ENCODED_PAGE_BIT_DELAY_RMAP 1ul
+
static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
{
- BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
+ BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
return (struct encoded_page *)(flags | (unsigned long)page);
}

static inline unsigned long encoded_page_flags(struct encoded_page *page)
{
- return ENCODE_PAGE_BITS & (unsigned long)page;
+ return ENCODED_PAGE_BITS & (unsigned long)page;
}

static inline struct page *encoded_page_ptr(struct encoded_page *page)
{
- return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+ return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
}

/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ac733d81b112..6540c99c6758 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_
for (int i = 0; i < batch->nr; i++) {
struct encoded_page *enc = batch->encoded_pages[i];

- if (encoded_page_flags(enc)) {
+ if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
struct page *page = encoded_page_ptr(enc);
folio_remove_rmap_pte(page_folio(page), page, vma);
}
@@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size)
{
+ int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
struct mmu_gather_batch *batch;

VM_BUG_ON(!tlb->end);
@@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
+ batch->encoded_pages[batch->nr++] = encode_page(page, flags);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
--
2.43.0


2024-02-14 20:46:46

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 07/10] mm/mmu_gather: add tlb_remove_tlb_entries()

Let's add a helper that lets us batch-process multiple consecutive PTEs.

Note that the loop will get optimized out on all architectures except on
powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
macro).

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/include/asm/tlb.h | 2 ++
include/asm-generic/tlb.h | 20 ++++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b3de6102a907..1ca7d4c4b90d 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,6 +19,8 @@

#include <linux/pagemap.h>

+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+ unsigned long address);
#define __tlb_remove_tlb_entry __tlb_remove_tlb_entry

#define tlb_flush tlb_flush
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..95d60a4f468a 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -608,6 +608,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

+/**
+ * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for
+ * later tlb invalidation.
+ *
+ * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
+ * consecutive ptes instead of only a single one.
+ */
+static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
+ pte_t *ptep, unsigned int nr, unsigned long address)
+{
+ tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
+ for (;;) {
+ __tlb_remove_tlb_entry(tlb, ptep, address);
+ if (--nr == 0)
+ break;
+ ptep++;
+ address += PAGE_SIZE;
+ }
+}
+
#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
do { \
unsigned long _sz = huge_page_size(h); \
--
2.43.0


2024-02-14 20:47:36

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 10/10] mm/memory: optimize unmap/zap with PTE-mapped THP

Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large
folio.

Most infrastructure we need for batching (mmu gather, rmap) is already
there. We only have to add get_and_clear_full_ptes() and
clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.

We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process. If there is a real
problem hiding somewhere, we can trigger it simply by using small
folios, or when we zap single pages of a large folio. Ideally, we had
that check in rmap code (including for delayed rmap), but then we cannot
print the PTE. Let's keep it simple for now. If we ever have a cheap
folio_mapcount(), we might just want to check for underflows there.

To keep small folios as fast as possible force inlining of a specialized
variant using __always_inline with nr=1.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/pgtable.h | 70 +++++++++++++++++++++++++++++++
mm/memory.c | 92 +++++++++++++++++++++++++++++------------
2 files changed, 136 insertions(+), 26 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..49ab1f73b5c2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -580,6 +580,76 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
}
#endif

+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
+ * the same folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
+ * returned PTE.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+ pte_t pte, tmp_pte;
+
+ pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+ while (--nr) {
+ ptep++;
+ addr += PAGE_SIZE;
+ tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+ if (pte_dirty(tmp_pte))
+ pte = pte_mkdirty(pte);
+ if (pte_young(tmp_pte))
+ pte = pte_mkyoung(pte);
+ }
+ return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear present PTEs that map consecutive pages of the same
+ * folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr, int full)
+{
+ for (;;) {
+ ptep_get_and_clear_full(mm, addr, ptep, full);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+}
+#endif

/*
* If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index a3efc4da258a..3b8e56eb08a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
*/
static inline void
zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
- unsigned long addr, pte_t *pte,
+ unsigned long addr, pte_t *pte, int nr,
struct zap_details *details, pte_t pteval)
{
/* Zap on anonymous always means dropping everything */
@@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
if (zap_drop_file_uffd_wp(details))
return;

- pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+ for (;;) {
+ /* the PFN in the PTE is irrelevant. */
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+ if (--nr == 0)
+ break;
+ pte++;
+ addr += PAGE_SIZE;
+ }
}

-static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
struct vm_area_struct *vma, struct folio *folio,
- struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
- struct zap_details *details, int *rss, bool *force_flush,
- bool *force_break)
+ struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
+ unsigned long addr, struct zap_details *details, int *rss,
+ bool *force_flush, bool *force_break)
{
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;

if (!folio_test_anon(folio)) {
- ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1548,36 +1555,49 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
- rss[mm_counter(folio)]--;
+ rss[mm_counter(folio)] -= nr;
} else {
/* We don't need up-to-date accessed/dirty bits. */
- ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
- rss[MM_ANONPAGES]--;
+ clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+ rss[MM_ANONPAGES] -= nr;
}
+ /* Checking a single PTE in a batch is sufficient. */
arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
+ tlb_remove_tlb_entries(tlb, pte, nr, addr);
if (unlikely(userfaultfd_pte_wp(vma, ptent)))
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
+ ptent);

if (!delay_rmap) {
- folio_remove_rmap_pte(folio, page, vma);
+ folio_remove_rmap_ptes(folio, page, nr, vma);
+
+ /* Only sanity-check the first page in a batch. */
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
}
- if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+ if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
*force_flush = true;
*force_break = true;
}
}

-static inline void zap_present_pte(struct mmu_gather *tlb,
+/*
+ * Zap or skip at least one present PTE, trying to batch-process subsequent
+ * PTEs that map consecutive pages of the same folio.
+ *
+ * Returns the number of processed (skipped or zapped) PTEs (at least 1).
+ */
+static inline int zap_present_ptes(struct mmu_gather *tlb,
struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
- unsigned long addr, struct zap_details *details,
- int *rss, bool *force_flush, bool *force_break)
+ unsigned int max_nr, unsigned long addr,
+ struct zap_details *details, int *rss, bool *force_flush,
+ bool *force_break)
{
+ const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
struct mm_struct *mm = tlb->mm;
struct folio *folio;
struct page *page;
+ int nr;

page = vm_normal_page(vma, addr, ptent);
if (!page) {
@@ -1587,14 +1607,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
tlb_remove_tlb_entry(tlb, pte, addr);
VM_WARN_ON_ONCE(userfaultfd_wp(vma));
ksm_might_unmap_zero_page(mm, ptent);
- return;
+ return 1;
}

folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
- return;
- zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
- rss, force_flush, force_break);
+ return 1;
+
+ /*
+ * Make sure that the common "small folio" case is as fast as possible
+ * by keeping the batching logic separate.
+ */
+ if (unlikely(folio_test_large(folio) && max_nr != 1)) {
+ nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
+ NULL);
+
+ zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
+ addr, details, rss, force_flush,
+ force_break);
+ return nr;
+ }
+ zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
+ details, rss, force_flush, force_break);
+ return 1;
}

static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1609,6 +1644,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *start_pte;
pte_t *pte;
swp_entry_t entry;
+ int nr;

tlb_change_page_size(tlb, PAGE_SIZE);
init_rss_vec(rss);
@@ -1622,7 +1658,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t ptent = ptep_get(pte);
struct folio *folio;
struct page *page;
+ int max_nr;

+ nr = 1;
if (pte_none(ptent))
continue;

@@ -1630,10 +1668,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
break;

if (pte_present(ptent)) {
- zap_present_pte(tlb, vma, pte, ptent, addr, details,
- rss, &force_flush, &force_break);
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss, &force_flush,
+ &force_break);
if (unlikely(force_break)) {
- addr += PAGE_SIZE;
+ addr += nr * PAGE_SIZE;
break;
}
continue;
@@ -1687,8 +1727,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
WARN_ON_ONCE(1);
}
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+ } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);

add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
--
2.43.0


2024-02-14 21:11:40

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 02/10] mm/memory: handle !page case in zap_present_pte() separately

We don't need uptodate accessed/dirty bits, so in theory we could
replace ptep_get_and_clear_full() by an optimized ptep_clear_full()
function. Let's rely on the provided pte.

Further, there is no scenario where we would have to insert uffd-wp
markers when zapping something that is not a normal page (i.e., zeropage).
Add a sanity check to make sure this remains true.

should_zap_folio() no longer has to handle NULL pointers. This change
replaces 2/3 "!page/!folio" checks by a single "!page" one.

Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to
detect shadow stack entries. But for shadow stack entries, the HW dirty
bit (in combination with non-writable PTEs) is set by software. So for the
arch_check_zapped_pte() check, we don't have to sync against HW setting
the HW dirty bit concurrently, it is always set.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5b0dc33133a6..4da6923709b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details *details,
if (should_zap_cows(details))
return true;

- /* E.g. the caller passes NULL for the case of a zero folio */
- if (!folio)
- return true;
-
/* Otherwise we should only zap non-anon folios */
return !folio_test_anon(folio);
}
@@ -1538,24 +1534,28 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
int *rss, bool *force_flush, bool *force_break)
{
struct mm_struct *mm = tlb->mm;
- struct folio *folio = NULL;
bool delay_rmap = false;
+ struct folio *folio;
struct page *page;

page = vm_normal_page(vma, addr, ptent);
- if (page)
- folio = page_folio(page);
+ if (!page) {
+ /* We don't need up-to-date accessed/dirty bits. */
+ ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ arch_check_zapped_pte(vma, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+ ksm_might_unmap_zero_page(mm, ptent);
+ return;
+ }

+ folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
arch_check_zapped_pte(vma, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
- if (unlikely(!page)) {
- ksm_might_unmap_zero_page(mm, ptent);
- return;
- }

if (!folio_test_anon(folio)) {
if (pte_dirty(ptent)) {
--
2.43.0


2024-02-14 21:12:03

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 03/10] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4da6923709b2..7a3ebb6e5909 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
- ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
- arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);

if (!folio_test_anon(folio)) {
+ ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
+ rss[mm_counter(folio)]--;
+ } else {
+ /* We don't need up-to-date accessed/dirty bits. */
+ ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ rss[MM_ANONPAGES]--;
}
- rss[mm_counter(folio)]--;
+ arch_check_zapped_pte(vma, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ if (unlikely(userfaultfd_pte_wp(vma, ptent)))
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+
if (!delay_rmap) {
folio_remove_rmap_pte(folio, page, vma);
if (unlikely(page_mapcount(page) < 0))
--
2.43.0


2024-02-14 21:12:50

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 05/10] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()

We have two bits available in the encoded page pointer to store
additional information. Currently, we use one bit to request delay of the
rmap removal until after a TLB flush.

We want to make use of the remaining bit internally for batching of
multiple pages of the same folio, specifying that the next encoded page
pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
instead of an encoded page, to handle the encoding internally.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/include/asm/tlb.h | 13 ++++++-------
include/asm-generic/tlb.h | 12 ++++++------
mm/mmu_gather.c | 7 ++++---
3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index d1455a601adc..48df896d5b79 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,8 +25,7 @@
void __tlb_remove_table(void *_table);
static inline void tlb_flush(struct mmu_gather *tlb);
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size);
+ struct page *page, bool delay_rmap, int page_size);

#define tlb_flush tlb_flush
#define pte_free_tlb pte_free_tlb
@@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
* tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
* has already been freed, so just do free_page_and_swap_cache.
*
- * s390 doesn't delay rmap removal, so there is nothing encoded in
- * the page pointer.
+ * s390 doesn't delay rmap removal.
*/
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size)
+ struct page *page, bool delay_rmap, int page_size)
{
- free_page_and_swap_cache(encoded_page_ptr(page));
+ VM_WARN_ON_ONCE(delay_rmap);
+
+ free_page_and_swap_cache(page);
return false;
}

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..2eb7b0d4f5d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -260,9 +260,8 @@ struct mmu_gather_batch {
*/
#define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)

-extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size);
+extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+ bool delay_rmap, int page_size);

#ifdef CONFIG_SMP
/*
@@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
{
- if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
+ if (__tlb_remove_page_size(tlb, page, false, page_size))
tlb_flush_mmu(tlb);
}

-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
+ struct page *page, bool delay_rmap)
{
- return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
+ return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
}

/* tlb_remove_page
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 604ddf08affe..ac733d81b112 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
}

-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+ bool delay_rmap, int page_size)
{
struct mmu_gather_batch *batch;

@@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, i
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->encoded_pages[batch->nr++] = page;
+ batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
- VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
+ VM_BUG_ON_PAGE(batch->nr > batch->max, page);

return false;
}
--
2.43.0


2024-02-14 21:14:12

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 08/10] mm/mmu_gather: add __tlb_remove_folio_pages()

Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next enoced page in an array contains actually shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straight forward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.

Note that it is now possible to gather significantly more pages: In the
past, we were able to gather ~10000 pages, now we can also gather ~5000
folio fragments that span multiple pages. A folio fragment on x86-64 can
span up to 512 pages (2 MiB THP) and on arm64 with 64k in theory 8192 pages
(512 MiB THP). Gathering more memory is not considered something we should
worry about, especially because these are already corner cases.

While we can gather more total memory, we won't free more folio
fragments. As long as page freeing time primarily only depends on the
number of involved folios, there is no effective change for !preempt
configurations. However, we'll adjust tlb_batch_pages_flush() separately to
handle corner cases where page freeing time grows proportionally with the
actual memory size.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/include/asm/tlb.h | 17 +++++++++++
include/asm-generic/tlb.h | 8 +++++
include/linux/mm_types.h | 20 ++++++++++++
mm/mmu_gather.c | 61 +++++++++++++++++++++++++++++++------
mm/swap.c | 12 ++++++--
mm/swap_state.c | 15 +++++++--
6 files changed, 119 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..e95b2c8081eb 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
static inline void tlb_flush(struct mmu_gather *tlb);
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+ struct page *page, unsigned int nr_pages, bool delay_rmap);

#define tlb_flush tlb_flush
#define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
return false;
}

+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+ struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+ struct encoded_page *encoded_pages[] = {
+ encode_page(page, ENCODED_PAGE_BIT_NR_PAGES_NEXT),
+ encode_nr_pages(nr_pages),
+ };
+
+ VM_WARN_ON_ONCE(delay_rmap);
+ VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+ free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+ return false;
+}
+
static inline void tlb_flush(struct mmu_gather *tlb)
{
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 95d60a4f468a..bd00dd238b79 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
*
* - tlb_remove_page() / __tlb_remove_page()
* - tlb_remove_page_size() / __tlb_remove_page_size()
+ * - __tlb_remove_folio_pages()
*
* __tlb_remove_page_size() is the basic primitive that queues a page for
* freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
* tlb_remove_page() and tlb_remove_page_size() imply the call to
* tlb_flush_mmu() when required and has no return value.
*
+ * __tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ * instead of removing a single page, remove the given number of consecutive
+ * pages that are all part of the same (large) folio: just like calling
+ * __tlb_remove_page() on each page individually.
+ *
* - tlb_change_page_size()
*
* call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather_batch {

extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size);
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+ unsigned int nr_pages, bool delay_rmap);

#ifdef CONFIG_SMP
/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1b89eec0d6df..a7223ba3ea1e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -226,6 +226,15 @@ struct encoded_page;
/* Perform rmap removal after we have flushed the TLB. */
#define ENCODED_PAGE_BIT_DELAY_RMAP 1ul

+/*
+ * The next item in an encoded_page array is the "nr_pages" argument, specifying
+ * the number of consecutive pages starting from this page, that all belong to
+ * the same folio. For example, "nr_pages" corresponds to the number of folio
+ * references that must be dropped. If this bit is not set, "nr_pages" is
+ * implicitly 1.
+ */
+#define ENCODED_PAGE_BIT_NR_PAGES_NEXT 2ul
+
static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
{
BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
@@ -242,6 +251,17 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
}

+static __always_inline struct encoded_page *encode_nr_pages(unsigned long nr)
+{
+ VM_WARN_ON_ONCE((nr << 2) >> 2 != nr);
+ return (struct encoded_page *)(nr << 2);
+}
+
+static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page)
+{
+ return ((unsigned long)page) >> 2;
+}
+
/*
* A swap entry has to fit into a "unsigned long", as the entry is hidden
* in the "index" field of the swapper address space.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 6540c99c6758..d175c0f1e2c8 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -50,12 +50,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
#ifdef CONFIG_SMP
static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
{
+ struct encoded_page **pages = batch->encoded_pages;
+
for (int i = 0; i < batch->nr; i++) {
- struct encoded_page *enc = batch->encoded_pages[i];
+ struct encoded_page *enc = pages[i];

if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
struct page *page = encoded_page_ptr(enc);
- folio_remove_rmap_pte(page_folio(page), page, vma);
+ unsigned int nr_pages = 1;
+
+ if (unlikely(encoded_page_flags(enc) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_pages = encoded_nr_pages(pages[++i]);
+
+ folio_remove_rmap_ptes(page_folio(page), page, nr_pages,
+ vma);
}
}
}
@@ -89,18 +98,26 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
struct encoded_page **pages = batch->encoded_pages;

- do {
+ while (batch->nr) {
/*
* limit free batch count when PAGE_SIZE > 4K
*/
unsigned int nr = min(512U, batch->nr);

+ /*
+ * Make sure we cover page + nr_pages, and don't leave
+ * nr_pages behind when capping the number of entries.
+ */
+ if (unlikely(encoded_page_flags(pages[nr - 1]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr++;
+
free_pages_and_swap_cache(pages, nr);
pages += nr;
batch->nr -= nr;

cond_resched();
- } while (batch->nr);
+ }
}
tlb->active = &tlb->local;
}
@@ -116,8 +133,9 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
}

-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
- bool delay_rmap, int page_size)
+static bool __tlb_remove_folio_pages_size(struct mmu_gather *tlb,
+ struct page *page, unsigned int nr_pages, bool delay_rmap,
+ int page_size)
{
int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
struct mmu_gather_batch *batch;
@@ -126,6 +144,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,

#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
VM_WARN_ON(tlb->page_size != page_size);
+ VM_WARN_ON_ONCE(nr_pages != 1 && page_size != PAGE_SIZE);
+ VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
#endif

batch = tlb->active;
@@ -133,17 +153,40 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->encoded_pages[batch->nr++] = encode_page(page, flags);
- if (batch->nr == batch->max) {
+ if (likely(nr_pages == 1)) {
+ batch->encoded_pages[batch->nr++] = encode_page(page, flags);
+ } else {
+ flags |= ENCODED_PAGE_BIT_NR_PAGES_NEXT;
+ batch->encoded_pages[batch->nr++] = encode_page(page, flags);
+ batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages);
+ }
+ /*
+ * Make sure that we can always add another "page" + "nr_pages",
+ * requiring two entries instead of only a single one.
+ */
+ if (batch->nr >= batch->max - 1) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
- VM_BUG_ON_PAGE(batch->nr > batch->max, page);
+ VM_BUG_ON_PAGE(batch->nr > batch->max - 1, page);

return false;
}

+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+ unsigned int nr_pages, bool delay_rmap)
+{
+ return __tlb_remove_folio_pages_size(tlb, page, nr_pages, delay_rmap,
+ PAGE_SIZE);
+}
+
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+ bool delay_rmap, int page_size)
+{
+ return __tlb_remove_folio_pages_size(tlb, page, 1, delay_rmap, page_size);
+}
+
#endif /* MMU_GATHER_NO_GATHER */

#ifdef CONFIG_MMU_GATHER_TABLE_FREE
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..e5380d732c0d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -967,11 +967,17 @@ void release_pages(release_pages_arg arg, int nr)
unsigned int lock_batch;

for (i = 0; i < nr; i++) {
+ unsigned int nr_refs = 1;
struct folio *folio;

/* Turn any of the argument types into a folio */
folio = page_folio(encoded_page_ptr(encoded[i]));

+ /* Is our next entry actually "nr_pages" -> "nr_refs" ? */
+ if (unlikely(encoded_page_flags(encoded[i]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_refs = encoded_nr_pages(encoded[++i]);
+
/*
* Make sure the IRQ-safe lock-holding time does not get
* excessive with a continuous string of pages from the
@@ -990,14 +996,14 @@ void release_pages(release_pages_arg arg, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- if (put_devmap_managed_page(&folio->page))
+ if (put_devmap_managed_page_refs(&folio->page, nr_refs))
continue;
- if (folio_put_testzero(folio))
+ if (folio_ref_sub_and_test(folio, nr_refs))
free_zone_device_page(&folio->page);
continue;
}

- if (!folio_put_testzero(folio))
+ if (!folio_ref_sub_and_test(folio, nr_refs))
continue;

if (folio_test_large(folio)) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7255c01a1e4e..2f540748f7c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -311,8 +311,19 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
{
lru_add_drain();
- for (int i = 0; i < nr; i++)
- free_swap_cache(encoded_page_ptr(pages[i]));
+ for (int i = 0; i < nr; i++) {
+ struct page *page = encoded_page_ptr(pages[i]);
+
+ /*
+ * Skip over the "nr_pages" entry. It's sufficient to call
+ * free_swap_cache() only once per folio.
+ */
+ if (unlikely(encoded_page_flags(pages[i]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ i++;
+
+ free_swap_cache(page);
+ }
release_pages(pages, nr);
}

--
2.43.0


2024-02-14 21:14:27

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v3 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or
now up to 256 folio fragments that span more than one page, before we
conditionally reschedule.

It's a pain that we have to handle cond_resched() in
tlb_batch_pages_flush() manually and cannot simply handle it in
release_pages() -- release_pages() can be called from atomic context.
Well, in a perfect world we wouldn't have to make our code more
complicated at all.

With page poisoning and init_on_free, we might now run into soft lockups
when we free a lot of rather large folio fragments, because page freeing
time then depends on the actual memory size we are freeing instead of on
the number of folios that are involved.

In the absolute (unlikely) worst case, on arm64 with 64k we will be able
to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
GiB does sound like it might take a while. But instead of ignoring this
unlikely case, let's just handle it.

So, let's teach tlb_batch_pages_flush() that there are some
configurations where page freeing is horribly slow, and let's reschedule
more frequently -- similarly like we did for now before we had large folio
fragments in there. Avoid yet another loop over all encoded pages in the
common case by handling that separately.

Note that with page poisoning/zeroing, we might now end up freeing only a
single folio fragment at a time that might exceed the old 512 pages limit:
but if we cannot even free a single MAX_ORDER page on a system without
running into soft lockups, something else is already completely bogus.
Freeing a PMD-mapped THP would similarly cause trouble.

In theory, we might even free 511 order-0 pages + a single MAX_ORDER page,
effectively having to zero out 8703 pages on arm64 with 64k, translating to
~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups,
544 MiB is unlikely to result in soft lockups, so we won't care about
that for the time being.

In the future, we might want to detect if handling cond_resched() is
required at all, and just not do any of that with full preemption enabled.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/mmu_gather.c | 58 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..99b3e9408aa0 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
}
#endif

-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
-{
- struct mmu_gather_batch *batch;
+/*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+#define MAX_NR_FOLIOS_PER_FREE 512

- for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- struct encoded_page **pages = batch->encoded_pages;
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+{
+ struct encoded_page **pages = batch->encoded_pages;
+ unsigned int nr, nr_pages;

- while (batch->nr) {
- /*
- * limit free batch count when PAGE_SIZE > 4K
- */
- unsigned int nr = min(512U, batch->nr);
+ while (batch->nr) {
+ if (!page_poisoning_enabled_static() && !want_init_on_free()) {
+ nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);

/*
* Make sure we cover page + nr_pages, and don't leave
@@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
if (unlikely(encoded_page_flags(pages[nr - 1]) &
ENCODED_PAGE_BIT_NR_PAGES_NEXT))
nr++;
+ } else {
+ /*
+ * With page poisoning and init_on_free, the time it
+ * takes to free memory grows proportionally with the
+ * actual memory size. Therefore, limit based on the
+ * actual memory size and not the number of involved
+ * folios.
+ */
+ for (nr = 0, nr_pages = 0;
+ nr < batch->nr && nr_pages < MAX_NR_FOLIOS_PER_FREE;
+ nr++) {
+ if (unlikely(encoded_page_flags(pages[nr]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_pages += encoded_nr_pages(pages[++nr]);
+ else
+ nr_pages++;
+ }
+ }

- free_pages_and_swap_cache(pages, nr);
- pages += nr;
- batch->nr -= nr;
+ free_pages_and_swap_cache(pages, nr);
+ pages += nr;
+ batch->nr -= nr;

- cond_resched();
- }
+ cond_resched();
}
+}
+
+static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+{
+ struct mmu_gather_batch *batch;
+
+ for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
+ __tlb_batch_free_encoded_pages(batch);
tlb->active = &tlb->local;
}

--
2.43.0