This series is based on [1]. Similar to what we did with fork(), let's
implement PTE batching during unmap/zap when processing PTE-mapped THPs.
We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.
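In code, the fast path added by the last patch of this series boils down to
roughly the following (a condensed sketch of that patch; locking, the !page
case, uffd-wp handling, mm counters and delayed rmap are omitted):

	if (pte_present(ptent)) {
		struct page *page = vm_normal_page(vma, addr, ptent);
		struct folio *folio = page_folio(page);
		int max_nr = (end - addr) / PAGE_SIZE, nr = 1;

		if (folio_test_large(folio) && max_nr != 1)
			/* How many consecutive PTEs map consecutive pages of this folio? */
			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
					     FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY, NULL);

		/* (c) Clear all PTEs of the batch at once, collecting accessed/dirty bits. */
		ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
		/* (d) Remember all TLB entries of the batch for later invalidation. */
		tlb_remove_tlb_entries(tlb, pte, nr, addr);
		/* (b) A single rmap call for the whole batch. */
		folio_remove_rmap_ptes(folio, page, nr, vma);
		/* (a) A single (deferred) refcount adjustment for the whole batch. */
		__tlb_remove_folio_pages(tlb, page, nr, false);

		pte += nr;
		addr += nr * PAGE_SIZE;
	}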
Ryan was previously working on this in the context of cont-pte for
arm64; the latest iteration [2] focused on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-page batches as well, and makes use of our new rmap batching
function when removing the rmap.
To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.
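Concretely, the mmu_gather batch array learns a second entry type: when the
new ENCODED_PAGE_BIT_NR_PAGES_NEXT flag is set on an encoded page, the next
array slot holds a (shifted) "nr_pages" instead of another page pointer.
Roughly, as introduced by the mmu_gather patches below:

	/* Producer side: queue a folio fragment of nr_pages pages. */
	batch->encoded_pages[batch->nr++] = encode_page(page,
			ENCODED_PAGE_BIT_NR_PAGES_NEXT);
	batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages);

	/* Consumer side (freeing, delayed rmap): decode the pair again. */
	struct page *page = encoded_page_ptr(pages[i]);
	unsigned int nr_pages = 1;

	if (encoded_page_flags(pages[i]) & ENCODED_PAGE_BIT_NR_PAGES_NEXT)
		nr_pages = encoded_nr_pages(pages[++i]);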
While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build up on top of the total mapcount.
Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):
Folio Size | mm-unstable | New | Change
---------------------------------------------
4KiB | 0.058110 | 0.057715 | - 1%
16KiB | 0.044198 | 0.035469 | -20%
32KiB | 0.034216 | 0.023522 | -31%
64KiB | 0.029207 | 0.018434 | -37%
128KiB | 0.026579 | 0.014026 | -47%
256KiB | 0.025130 | 0.011756 | -53%
512KiB | 0.024292 | 0.010703 | -56%
1024KiB | 0.023812 | 0.010294 | -57%
2048KiB | 0.023785 | 0.009910 | -58%
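The benchmark itself is not included here; a minimal userspace reproducer
along the following lines (a hypothetical sketch -- the relevant mTHP sizes
must be enabled via sysfs beforehand, which is not shown) measures the same
munmap() time:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>

	#define SIZE (1024UL * 1024 * 1024)	/* 1 GiB */

	int main(void)
	{
		struct timespec start, end;

		char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Populate the VMA; with mTHP enabled, faults install PTE-mapped large folios. */
		memset(p, 1, SIZE);

		clock_gettime(CLOCK_MONOTONIC, &start);
		munmap(p, SIZE);
		clock_gettime(CLOCK_MONOTONIC, &end);

		printf("munmap: %f s\n", (end.tv_sec - start.tv_sec) +
		       (end.tv_nsec - start.tv_nsec) / 1e9);
		return 0;
	}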
CCing especially s390x folks, because they have TLB freeing hooks that
need adjustment. Only tested on x86-64 for now; I will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negligible and makes my cross-compiler happy.
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lkml.kernel.org/r/[email protected]
[5] https://lkml.kernel.org/r/[email protected]
---
The performance numbers are from v1. I did a quick benchmark run of v2
and nothing significantly changed -- because nothing in the code
significantly changed. Sending this out ASAP, so Ryan can make progress
with cont-pte.
v1 -> v2:
* "mm/memory: factor out zapping of present pte into zap_present_pte()"
-> Initialize "struct folio *folio" to NULL
* "mm/memory: handle !page case in zap_present_pte() separately"
-> Extend description regarding arch_check_zapped_pte()
* "mm/mmu_gather: add __tlb_remove_folio_pages()"
-> ENCODED_PAGE_BIT_NR_PAGES_NEXT
-> Extend patch description regarding "batching more"
* "mm/mmu_gather: improve cond_resched() handling with large folios and
expensive page freeing"
-> Handle the (so far) theoretical case of possible soft lockups when
we zero/poison memory when freeing pages. Try to keep old behavior in
that corner case to be safe.
* "mm/memory: optimize unmap/zap with PTE-mapped THP"
-> Clarify description of new ptep clearing functions regarding "present
PTEs"
-> Extend patch description regarding relaxed mapcount sanity checks
-> Improve zap_present_ptes() description
* Pick up RB's
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
David Hildenbrand (10):
mm/memory: factor out zapping of present pte into zap_present_pte()
mm/memory: handle !page case in zap_present_pte() separately
mm/memory: further separate anon and pagecache folio handling in
zap_present_pte()
mm/memory: factor out zapping folio pte into zap_present_folio_pte()
mm/mmu_gather: pass "delay_rmap" instead of encoded page to
__tlb_remove_page_size()
mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
mm/mmu_gather: add tlb_remove_tlb_entries()
mm/mmu_gather: add __tlb_remove_folio_pages()
mm/mmu_gather: improve cond_resched() handling with large folios and
expensive page freeing
mm/memory: optimize unmap/zap with PTE-mapped THP
arch/powerpc/include/asm/tlb.h | 2 +
arch/s390/include/asm/tlb.h | 30 ++++--
include/asm-generic/tlb.h | 40 ++++++--
include/linux/mm_types.h | 37 ++++++--
include/linux/pgtable.h | 70 ++++++++++++++
mm/memory.c | 169 +++++++++++++++++++++++----------
mm/mmu_gather.c | 107 ++++++++++++++++++---
mm/swap.c | 12 ++-
mm/swap_state.c | 15 ++-
9 files changed, 393 insertions(+), 89 deletions(-)
base-commit: d7f43604944787ce929efeaabd0f462414002a8f
--
2.43.0
We don't need up-to-date accessed/dirty bits, so in theory we could
replace ptep_get_and_clear_full() with an optimized ptep_clear_full()
function. Let's rely on the provided pte.
Further, there is no scenario where we would have to insert uffd-wp
markers when zapping something that is not a normal page (i.e., zeropage).
Add a sanity check to make sure this remains true.
should_zap_folio() no longer has to handle NULL pointers. This change
replaces 2/3 "!page/!folio" checks by a single "!page" one.
Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to
detect shadow stack entries. But for shadow stack entries, the HW dirty
bit (in combination with non-writable PTEs) is set by software. So for the
arch_check_zapped_pte() check, we don't have to sync against HW setting
the HW dirty bit concurrently; it is always set.
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 5b0dc33133a6..4da6923709b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details *details,
if (should_zap_cows(details))
return true;
- /* E.g. the caller passes NULL for the case of a zero folio */
- if (!folio)
- return true;
-
/* Otherwise we should only zap non-anon folios */
return !folio_test_anon(folio);
}
@@ -1538,24 +1534,28 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
int *rss, bool *force_flush, bool *force_break)
{
struct mm_struct *mm = tlb->mm;
- struct folio *folio = NULL;
bool delay_rmap = false;
+ struct folio *folio;
struct page *page;
page = vm_normal_page(vma, addr, ptent);
- if (page)
- folio = page_folio(page);
+ if (!page) {
+ /* We don't need up-to-date accessed/dirty bits. */
+ ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ arch_check_zapped_pte(vma, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+ ksm_might_unmap_zero_page(mm, ptent);
+ return;
+ }
+ folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
arch_check_zapped_pte(vma, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
- if (unlikely(!page)) {
- ksm_might_unmap_zero_page(mm, ptent);
- return;
- }
if (!folio_test_anon(folio)) {
if (pte_dirty(ptent)) {
--
2.43.0
Let's prepare for further changes by factoring it out into a separate
function.
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 53 ++++++++++++++++++++++++++++++++---------------------
1 file changed, 32 insertions(+), 21 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 7a3ebb6e5909..a3efc4da258a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
}
-static inline void zap_present_pte(struct mmu_gather *tlb,
- struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
- unsigned long addr, struct zap_details *details,
- int *rss, bool *force_flush, bool *force_break)
+static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, struct folio *folio,
+ struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
+ struct zap_details *details, int *rss, bool *force_flush,
+ bool *force_break)
{
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;
- struct folio *folio;
- struct page *page;
-
- page = vm_normal_page(vma, addr, ptent);
- if (!page) {
- /* We don't need up-to-date accessed/dirty bits. */
- ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
- arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- VM_WARN_ON_ONCE(userfaultfd_wp(vma));
- ksm_might_unmap_zero_page(mm, ptent);
- return;
- }
-
- folio = page_folio(page);
- if (unlikely(!should_zap_folio(details, folio)))
- return;
if (!folio_test_anon(folio)) {
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
@@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
}
}
+static inline void zap_present_pte(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+ unsigned long addr, struct zap_details *details,
+ int *rss, bool *force_flush, bool *force_break)
+{
+ struct mm_struct *mm = tlb->mm;
+ struct folio *folio;
+ struct page *page;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page) {
+ /* We don't need up-to-date accessed/dirty bits. */
+ ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ arch_check_zapped_pte(vma, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+ ksm_might_unmap_zero_page(mm, ptent);
+ return;
+ }
+
+ folio = page_folio(page);
+ if (unlikely(!should_zap_folio(details, folio)))
+ return;
+ zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
+ rss, force_flush, force_break);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
--
2.43.0
We have two bits available in the encoded page pointer to store
additional information. Currently, we use one bit to request delay of the
rmap removal until after a TLB flush.
We want to make use of the remaining bit internally for batching of
multiple pages of the same folio, specifying that the next encoded page
pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
instead of an encoded page, to handle the encoding internally.
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/include/asm/tlb.h | 13 ++++++-------
include/asm-generic/tlb.h | 12 ++++++------
mm/mmu_gather.c | 7 ++++---
3 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index d1455a601adc..48df896d5b79 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,8 +25,7 @@
void __tlb_remove_table(void *_table);
static inline void tlb_flush(struct mmu_gather *tlb);
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size);
+ struct page *page, bool delay_rmap, int page_size);
#define tlb_flush tlb_flush
#define pte_free_tlb pte_free_tlb
@@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
* tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
* has already been freed, so just do free_page_and_swap_cache.
*
- * s390 doesn't delay rmap removal, so there is nothing encoded in
- * the page pointer.
+ * s390 doesn't delay rmap removal.
*/
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size)
+ struct page *page, bool delay_rmap, int page_size)
{
- free_page_and_swap_cache(encoded_page_ptr(page));
+ VM_WARN_ON_ONCE(delay_rmap);
+
+ free_page_and_swap_cache(page);
return false;
}
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..2eb7b0d4f5d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -260,9 +260,8 @@ struct mmu_gather_batch {
*/
#define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)
-extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size);
+extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+ bool delay_rmap, int page_size);
#ifdef CONFIG_SMP
/*
@@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
{
- if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
+ if (__tlb_remove_page_size(tlb, page, false, page_size))
tlb_flush_mmu(tlb);
}
-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
+ struct page *page, bool delay_rmap)
{
- return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
+ return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
}
/* tlb_remove_page
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 604ddf08affe..ac733d81b112 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
}
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+ bool delay_rmap, int page_size)
{
struct mmu_gather_batch *batch;
@@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, i
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->encoded_pages[batch->nr++] = page;
+ batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
- VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
+ VM_BUG_ON_PAGE(batch->nr > batch->max, page);
return false;
}
--
2.43.0
Nowadays, encoded pages are only used in mmu_gather handling. Let's
update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.
If encoded page pointers were ever to be used in another context again, we'd
likely want to change the defines to reflect their context (e.g.,
ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.
This is a preparation for using the remaining spare bit to indicate that
the next item in an array of encoded pages is a "nr_pages" argument and
not an encoded page.
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm_types.h | 17 +++++++++++------
mm/mmu_gather.c | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..1b89eec0d6df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -210,8 +210,8 @@ struct page {
*
* An 'encoded_page' pointer is a pointer to a regular 'struct page', but
* with the low bits of the pointer indicating extra context-dependent
- * information. Not super-common, but happens in mmu_gather and mlock
- * handling, and this acts as a type system check on that use.
+ * information. Only used in mmu_gather handling, and this acts as a type
+ * system check on that use.
*
* We only really have two guaranteed bits in general, although you could
* play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
@@ -220,21 +220,26 @@ struct page {
* Use the supplied helper functions to endcode/decode the pointer and bits.
*/
struct encoded_page;
-#define ENCODE_PAGE_BITS 3ul
+
+#define ENCODED_PAGE_BITS 3ul
+
+/* Perform rmap removal after we have flushed the TLB. */
+#define ENCODED_PAGE_BIT_DELAY_RMAP 1ul
+
static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
{
- BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
+ BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
return (struct encoded_page *)(flags | (unsigned long)page);
}
static inline unsigned long encoded_page_flags(struct encoded_page *page)
{
- return ENCODE_PAGE_BITS & (unsigned long)page;
+ return ENCODED_PAGE_BITS & (unsigned long)page;
}
static inline struct page *encoded_page_ptr(struct encoded_page *page)
{
- return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+ return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
}
/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ac733d81b112..6540c99c6758 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_
for (int i = 0; i < batch->nr; i++) {
struct encoded_page *enc = batch->encoded_pages[i];
- if (encoded_page_flags(enc)) {
+ if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
struct page *page = encoded_page_ptr(enc);
folio_remove_rmap_pte(page_folio(page), page, vma);
}
@@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size)
{
+ int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
struct mmu_gather_batch *batch;
VM_BUG_ON(!tlb->end);
@@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
+ batch->encoded_pages[batch->nr++] = encode_page(page, flags);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
--
2.43.0
Let's add a helper that lets us batch-process multiple consecutive PTEs.
Note that the loop will get optimized out on all architectures except on
powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
macro).
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/include/asm/tlb.h | 2 ++
include/asm-generic/tlb.h | 20 ++++++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b3de6102a907..1ca7d4c4b90d 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,6 +19,8 @@
#include <linux/pagemap.h>
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+ unsigned long address);
#define __tlb_remove_tlb_entry __tlb_remove_tlb_entry
#define tlb_flush tlb_flush
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..95d60a4f468a 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -608,6 +608,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
+/**
+ * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for
+ * later tlb invalidation.
+ *
+ * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
+ * consecutive ptes instead of only a single one.
+ */
+static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
+ pte_t *ptep, unsigned int nr, unsigned long address)
+{
+ tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
+ for (;;) {
+ __tlb_remove_tlb_entry(tlb, ptep, address);
+ if (--nr == 0)
+ break;
+ ptep++;
+ address += PAGE_SIZE;
+ }
+}
+
#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
do { \
unsigned long _sz = huge_page_size(h); \
--
2.43.0
Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.
We're using the remaining spare bit in an encoded_page to indicate that
the next encoded page in an array actually contains a shifted "nr_pages" value.
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.
This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.
Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE; it won't be
required for simple PTE batching.
We have to provide a separate s390 implementation, but it's fairly
straightforward.
Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.
Note that it is now possible to gather significantly more pages: in the
past, we were able to gather ~10000 pages, now we can also gather
~5000 folio fragments that span multiple pages. A folio fragment
on x86-64 can be up to 512 pages (2 MiB THP) and on arm64 with 64k pages
in theory 8192 pages (512 MiB THP). Gathering more memory is not
considered something we should worry about, especially because these are
already corner cases.
While we can gather more total memory, we won't free more folio
fragments. As long as page freeing time primarily only depends on the
number of involved folios, there is no effective change for !preempt
configurations. However, we'll adjust tlb_batch_pages_flush() separately to
handle corner cases where page freeing time grows proportionally with the
actual memory size.
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/include/asm/tlb.h | 17 +++++++++++
include/asm-generic/tlb.h | 8 +++++
include/linux/mm_types.h | 20 ++++++++++++
mm/mmu_gather.c | 61 +++++++++++++++++++++++++++++++------
mm/swap.c | 12 ++++++--
mm/swap_state.c | 15 +++++++--
6 files changed, 119 insertions(+), 14 deletions(-)
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..e95b2c8081eb 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
static inline void tlb_flush(struct mmu_gather *tlb);
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+ struct page *page, unsigned int nr_pages, bool delay_rmap);
#define tlb_flush tlb_flush
#define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
return false;
}
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+ struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+ struct encoded_page *encoded_pages[] = {
+ encode_page(page, ENCODED_PAGE_BIT_NR_PAGES_NEXT),
+ encode_nr_pages(nr_pages),
+ };
+
+ VM_WARN_ON_ONCE(delay_rmap);
+ VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+ free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+ return false;
+}
+
static inline void tlb_flush(struct mmu_gather *tlb)
{
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 95d60a4f468a..bd00dd238b79 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
*
* - tlb_remove_page() / __tlb_remove_page()
* - tlb_remove_page_size() / __tlb_remove_page_size()
+ * - __tlb_remove_folio_pages()
*
* __tlb_remove_page_size() is the basic primitive that queues a page for
* freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
* tlb_remove_page() and tlb_remove_page_size() imply the call to
* tlb_flush_mmu() when required and has no return value.
*
+ * __tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ * instead of removing a single page, remove the given number of consecutive
+ * pages that are all part of the same (large) folio: just like calling
+ * __tlb_remove_page() on each page individually.
+ *
* - tlb_change_page_size()
*
* call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather_batch {
extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size);
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+ unsigned int nr_pages, bool delay_rmap);
#ifdef CONFIG_SMP
/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1b89eec0d6df..a7223ba3ea1e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -226,6 +226,15 @@ struct encoded_page;
/* Perform rmap removal after we have flushed the TLB. */
#define ENCODED_PAGE_BIT_DELAY_RMAP 1ul
+/*
+ * The next item in an encoded_page array is the "nr_pages" argument, specifying
+ * the number of consecutive pages starting from this page, that all belong to
+ * the same folio. For example, "nr_pages" corresponds to the number of folio
+ * references that must be dropped. If this bit is not set, "nr_pages" is
+ * implicitly 1.
+ */
+#define ENCODED_PAGE_BIT_NR_PAGES_NEXT 2ul
+
static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
{
BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
@@ -242,6 +251,17 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
}
+static __always_inline struct encoded_page *encode_nr_pages(unsigned long nr)
+{
+ VM_WARN_ON_ONCE((nr << 2) >> 2 != nr);
+ return (struct encoded_page *)(nr << 2);
+}
+
+static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page)
+{
+ return ((unsigned long)page) >> 2;
+}
+
/*
* A swap entry has to fit into a "unsigned long", as the entry is hidden
* in the "index" field of the swapper address space.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 6540c99c6758..d175c0f1e2c8 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -50,12 +50,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
#ifdef CONFIG_SMP
static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
{
+ struct encoded_page **pages = batch->encoded_pages;
+
for (int i = 0; i < batch->nr; i++) {
- struct encoded_page *enc = batch->encoded_pages[i];
+ struct encoded_page *enc = pages[i];
if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
struct page *page = encoded_page_ptr(enc);
- folio_remove_rmap_pte(page_folio(page), page, vma);
+ unsigned int nr_pages = 1;
+
+ if (unlikely(encoded_page_flags(enc) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_pages = encoded_nr_pages(pages[++i]);
+
+ folio_remove_rmap_ptes(page_folio(page), page, nr_pages,
+ vma);
}
}
}
@@ -89,18 +98,26 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
struct encoded_page **pages = batch->encoded_pages;
- do {
+ while (batch->nr) {
/*
* limit free batch count when PAGE_SIZE > 4K
*/
unsigned int nr = min(512U, batch->nr);
+ /*
+ * Make sure we cover page + nr_pages, and don't leave
+ * nr_pages behind when capping the number of entries.
+ */
+ if (unlikely(encoded_page_flags(pages[nr - 1]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr++;
+
free_pages_and_swap_cache(pages, nr);
pages += nr;
batch->nr -= nr;
cond_resched();
- } while (batch->nr);
+ }
}
tlb->active = &tlb->local;
}
@@ -116,8 +133,9 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
}
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
- bool delay_rmap, int page_size)
+static bool __tlb_remove_folio_pages_size(struct mmu_gather *tlb,
+ struct page *page, unsigned int nr_pages, bool delay_rmap,
+ int page_size)
{
int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
struct mmu_gather_batch *batch;
@@ -126,6 +144,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
VM_WARN_ON(tlb->page_size != page_size);
+ VM_WARN_ON_ONCE(nr_pages != 1 && page_size != PAGE_SIZE);
+ VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
#endif
batch = tlb->active;
@@ -133,17 +153,40 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->encoded_pages[batch->nr++] = encode_page(page, flags);
- if (batch->nr == batch->max) {
+ if (likely(nr_pages == 1)) {
+ batch->encoded_pages[batch->nr++] = encode_page(page, flags);
+ } else {
+ flags |= ENCODED_PAGE_BIT_NR_PAGES_NEXT;
+ batch->encoded_pages[batch->nr++] = encode_page(page, flags);
+ batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages);
+ }
+ /*
+ * Make sure that we can always add another "page" + "nr_pages",
+ * requiring two entries instead of only a single one.
+ */
+ if (batch->nr >= batch->max - 1) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
- VM_BUG_ON_PAGE(batch->nr > batch->max, page);
+ VM_BUG_ON_PAGE(batch->nr > batch->max - 1, page);
return false;
}
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+ unsigned int nr_pages, bool delay_rmap)
+{
+ return __tlb_remove_folio_pages_size(tlb, page, nr_pages, delay_rmap,
+ PAGE_SIZE);
+}
+
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+ bool delay_rmap, int page_size)
+{
+ return __tlb_remove_folio_pages_size(tlb, page, 1, delay_rmap, page_size);
+}
+
#endif /* MMU_GATHER_NO_GATHER */
#ifdef CONFIG_MMU_GATHER_TABLE_FREE
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..e5380d732c0d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -967,11 +967,17 @@ void release_pages(release_pages_arg arg, int nr)
unsigned int lock_batch;
for (i = 0; i < nr; i++) {
+ unsigned int nr_refs = 1;
struct folio *folio;
/* Turn any of the argument types into a folio */
folio = page_folio(encoded_page_ptr(encoded[i]));
+ /* Is our next entry actually "nr_pages" -> "nr_refs" ? */
+ if (unlikely(encoded_page_flags(encoded[i]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_refs = encoded_nr_pages(encoded[++i]);
+
/*
* Make sure the IRQ-safe lock-holding time does not get
* excessive with a continuous string of pages from the
@@ -990,14 +996,14 @@ void release_pages(release_pages_arg arg, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- if (put_devmap_managed_page(&folio->page))
+ if (put_devmap_managed_page_refs(&folio->page, nr_refs))
continue;
- if (folio_put_testzero(folio))
+ if (folio_ref_sub_and_test(folio, nr_refs))
free_zone_device_page(&folio->page);
continue;
}
- if (!folio_put_testzero(folio))
+ if (!folio_ref_sub_and_test(folio, nr_refs))
continue;
if (folio_test_large(folio)) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7255c01a1e4e..2f540748f7c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -311,8 +311,19 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
{
lru_add_drain();
- for (int i = 0; i < nr; i++)
- free_swap_cache(encoded_page_ptr(pages[i]));
+ for (int i = 0; i < nr; i++) {
+ struct page *page = encoded_page_ptr(pages[i]);
+
+ /*
+ * Skip over the "nr_pages" entry. It's sufficient to call
+ * free_swap_cache() only once per folio.
+ */
+ if (unlikely(encoded_page_flags(pages[i]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ i++;
+
+ free_swap_cache(page);
+ }
release_pages(pages, nr);
}
--
2.43.0
It's a pain that we have to handle cond_resched() in
tlb_batch_pages_flush() manually and cannot simply handle it in
release_pages() -- release_pages() can be called from atomic context.
Well, in a perfect world we wouldn't have to make our code more complicated at all.
With page poisoning and init_on_free, we might now run into soft lockups
when we free a lot of rather large folio fragments, because page freeing
time then depends on the actual memory size we are freeing instead of on
the number of folios that are involved.
In the absolute (unlikely) worst case, on arm64 with 64k pages we will be
able to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
GiB does sound like it might take a while. But instead of ignoring this
unlikely case, let's just handle it.
So, let's teach tlb_batch_pages_flush() that there are some
configurations where page freeing is horribly slow, and let's reschedule
more frequently -- similar to what we did before we had large folio
fragments in there. Note that we might end up freeing only a single folio
fragment at a time that might exceed the old 512 pages limit: but if we
cannot even free a single MAX_ORDER page on a system without running into
soft lockups, something else is already completely bogus.
In the future, we might want to detect if handling cond_resched() is
required at all, and just not do any of that with full preemption enabled.
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/mmu_gather.c | 50 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 41 insertions(+), 9 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..2774044b5790 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,19 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
}
#endif
-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
{
- struct mmu_gather_batch *batch;
-
- for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- struct encoded_page **pages = batch->encoded_pages;
+ struct encoded_page **pages = batch->encoded_pages;
+ unsigned int nr, nr_pages;
+ /*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+ if (!page_poisoning_enabled_static() && !want_init_on_free()) {
while (batch->nr) {
- /*
- * limit free batch count when PAGE_SIZE > 4K
- */
- unsigned int nr = min(512U, batch->nr);
+ nr = min(512, batch->nr);
/*
* Make sure we cover page + nr_pages, and don't leave
@@ -119,6 +120,37 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
cond_resched();
}
}
+
+ /*
+ * With page poisoning and init_on_free, the time it takes to free
+ * memory grows proportionally with the actual memory size. Therefore,
+ * limit based on the actual memory size and not the number of involved
+ * folios.
+ */
+ while (batch->nr) {
+ for (nr = 0, nr_pages = 0;
+ nr < batch->nr && nr_pages < 512; nr++) {
+ if (unlikely(encoded_page_flags(pages[nr]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_pages += encoded_nr_pages(pages[++nr]);
+ else
+ nr_pages++;
+ }
+
+ free_pages_and_swap_cache(pages, nr);
+ pages += nr;
+ batch->nr -= nr;
+
+ cond_resched();
+ }
+}
+
+static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+{
+ struct mmu_gather_batch *batch;
+
+ for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
+ __tlb_batch_free_encoded_pages(batch);
tlb->active = &tlb->local;
}
--
2.43.0
Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large
folio.
Most infrastructure we need for batching (mmu gather, rmap) is already
there. We only have to add get_and_clear_full_ptes() and
clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.
We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process. If there is a real
problem hiding somewhere, we can trigger it simply by using small
folios, or when we zap single pages of a large folio. Ideally, we had
that check in rmap code (including for delayed rmap), but then we cannot
print the PTE. Let's keep it simple for now. If we ever have a cheap
folio_mapcount(), we might just want to check for underflows there.
To keep small folios as fast as possible, force inlining of a specialized
variant using __always_inline with nr=1.
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/pgtable.h | 70 +++++++++++++++++++++++++++++++
mm/memory.c | 92 +++++++++++++++++++++++++++++------------
2 files changed, 136 insertions(+), 26 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..49ab1f73b5c2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -580,6 +580,76 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
}
#endif
+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
+ * the same folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
+ * returned PTE.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+ pte_t pte, tmp_pte;
+
+ pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+ while (--nr) {
+ ptep++;
+ addr += PAGE_SIZE;
+ tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+ if (pte_dirty(tmp_pte))
+ pte = pte_mkdirty(pte);
+ if (pte_young(tmp_pte))
+ pte = pte_mkyoung(pte);
+ }
+ return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear present PTEs that map consecutive pages of the same
+ * folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr, int full)
+{
+ for (;;) {
+ ptep_get_and_clear_full(mm, addr, ptep, full);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+}
+#endif
/*
* If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index a3efc4da258a..3b8e56eb08a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
*/
static inline void
zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
- unsigned long addr, pte_t *pte,
+ unsigned long addr, pte_t *pte, int nr,
struct zap_details *details, pte_t pteval)
{
/* Zap on anonymous always means dropping everything */
@@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
if (zap_drop_file_uffd_wp(details))
return;
- pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+ for (;;) {
+ /* the PFN in the PTE is irrelevant. */
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+ if (--nr == 0)
+ break;
+ pte++;
+ addr += PAGE_SIZE;
+ }
}
-static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
struct vm_area_struct *vma, struct folio *folio,
- struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
- struct zap_details *details, int *rss, bool *force_flush,
- bool *force_break)
+ struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
+ unsigned long addr, struct zap_details *details, int *rss,
+ bool *force_flush, bool *force_break)
{
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;
if (!folio_test_anon(folio)) {
- ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1548,36 +1555,49 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
- rss[mm_counter(folio)]--;
+ rss[mm_counter(folio)] -= nr;
} else {
/* We don't need up-to-date accessed/dirty bits. */
- ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
- rss[MM_ANONPAGES]--;
+ clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+ rss[MM_ANONPAGES] -= nr;
}
+ /* Checking a single PTE in a batch is sufficient. */
arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
+ tlb_remove_tlb_entries(tlb, pte, nr, addr);
if (unlikely(userfaultfd_pte_wp(vma, ptent)))
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
+ ptent);
if (!delay_rmap) {
- folio_remove_rmap_pte(folio, page, vma);
+ folio_remove_rmap_ptes(folio, page, nr, vma);
+
+ /* Only sanity-check the first page in a batch. */
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
}
- if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+ if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
*force_flush = true;
*force_break = true;
}
}
-static inline void zap_present_pte(struct mmu_gather *tlb,
+/*
+ * Zap or skip at least one present PTE, trying to batch-process subsequent
+ * PTEs that map consecutive pages of the same folio.
+ *
+ * Returns the number of processed (skipped or zapped) PTEs (at least 1).
+ */
+static inline int zap_present_ptes(struct mmu_gather *tlb,
struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
- unsigned long addr, struct zap_details *details,
- int *rss, bool *force_flush, bool *force_break)
+ unsigned int max_nr, unsigned long addr,
+ struct zap_details *details, int *rss, bool *force_flush,
+ bool *force_break)
{
+ const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
struct mm_struct *mm = tlb->mm;
struct folio *folio;
struct page *page;
+ int nr;
page = vm_normal_page(vma, addr, ptent);
if (!page) {
@@ -1587,14 +1607,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
tlb_remove_tlb_entry(tlb, pte, addr);
VM_WARN_ON_ONCE(userfaultfd_wp(vma));
ksm_might_unmap_zero_page(mm, ptent);
- return;
+ return 1;
}
folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
- return;
- zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
- rss, force_flush, force_break);
+ return 1;
+
+ /*
+ * Make sure that the common "small folio" case is as fast as possible
+ * by keeping the batching logic separate.
+ */
+ if (unlikely(folio_test_large(folio) && max_nr != 1)) {
+ nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
+ NULL);
+
+ zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
+ addr, details, rss, force_flush,
+ force_break);
+ return nr;
+ }
+ zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
+ details, rss, force_flush, force_break);
+ return 1;
}
static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1609,6 +1644,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *start_pte;
pte_t *pte;
swp_entry_t entry;
+ int nr;
tlb_change_page_size(tlb, PAGE_SIZE);
init_rss_vec(rss);
@@ -1622,7 +1658,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t ptent = ptep_get(pte);
struct folio *folio;
struct page *page;
+ int max_nr;
+ nr = 1;
if (pte_none(ptent))
continue;
@@ -1630,10 +1668,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
break;
if (pte_present(ptent)) {
- zap_present_pte(tlb, vma, pte, ptent, addr, details,
- rss, &force_flush, &force_break);
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss, &force_flush,
+ &force_break);
if (unlikely(force_break)) {
- addr += PAGE_SIZE;
+ addr += nr * PAGE_SIZE;
break;
}
continue;
@@ -1687,8 +1727,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
WARN_ON_ONCE(1);
}
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+ } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
--
2.43.0
Let's prepare for further changes by factoring out processing of present
PTEs.
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 94 ++++++++++++++++++++++++++++++-----------------------
1 file changed, 53 insertions(+), 41 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 7c3ca41a7610..5b0dc33133a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
}
+static inline void zap_present_pte(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+ unsigned long addr, struct zap_details *details,
+ int *rss, bool *force_flush, bool *force_break)
+{
+ struct mm_struct *mm = tlb->mm;
+ struct folio *folio = NULL;
+ bool delay_rmap = false;
+ struct page *page;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (page)
+ folio = page_folio(page);
+
+ if (unlikely(!should_zap_folio(details, folio)))
+ return;
+ ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ arch_check_zapped_pte(vma, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+ if (unlikely(!page)) {
+ ksm_might_unmap_zero_page(mm, ptent);
+ return;
+ }
+
+ if (!folio_test_anon(folio)) {
+ if (pte_dirty(ptent)) {
+ folio_mark_dirty(folio);
+ if (tlb_delay_rmap(tlb)) {
+ delay_rmap = true;
+ *force_flush = true;
+ }
+ }
+ if (pte_young(ptent) && likely(vma_has_recency(vma)))
+ folio_mark_accessed(folio);
+ }
+ rss[mm_counter(folio)]--;
+ if (!delay_rmap) {
+ folio_remove_rmap_pte(folio, page, vma);
+ if (unlikely(page_mapcount(page) < 0))
+ print_bad_pte(vma, addr, ptent, page);
+ }
+ if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+ *force_flush = true;
+ *force_break = true;
+ }
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
+ bool force_flush = false, force_break = false;
struct mm_struct *mm = tlb->mm;
- int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
@@ -1555,7 +1603,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
arch_enter_lazy_mmu_mode();
do {
pte_t ptent = ptep_get(pte);
- struct folio *folio = NULL;
+ struct folio *folio;
struct page *page;
if (pte_none(ptent))
@@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
break;
if (pte_present(ptent)) {
- unsigned int delay_rmap;
-
- page = vm_normal_page(vma, addr, ptent);
- if (page)
- folio = page_folio(page);
-
- if (unlikely(!should_zap_folio(details, folio)))
- continue;
- ptent = ptep_get_and_clear_full(mm, addr, pte,
- tlb->fullmm);
- arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details,
- ptent);
- if (unlikely(!page)) {
- ksm_might_unmap_zero_page(mm, ptent);
- continue;
- }
-
- delay_rmap = 0;
- if (!folio_test_anon(folio)) {
- if (pte_dirty(ptent)) {
- folio_mark_dirty(folio);
- if (tlb_delay_rmap(tlb)) {
- delay_rmap = 1;
- force_flush = 1;
- }
- }
- if (pte_young(ptent) && likely(vma_has_recency(vma)))
- folio_mark_accessed(folio);
- }
- rss[mm_counter(folio)]--;
- if (!delay_rmap) {
- folio_remove_rmap_pte(folio, page, vma);
- if (unlikely(page_mapcount(page) < 0))
- print_bad_pte(vma, addr, ptent, page);
- }
- if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
- force_flush = 1;
+ zap_present_pte(tlb, vma, pte, ptent, addr, details,
+ rss, &force_flush, &force_break);
+ if (unlikely(force_break)) {
addr += PAGE_SIZE;
break;
}
--
2.43.0
On 09/02/2024 22:15, David Hildenbrand wrote:
> Let's prepare for further changes by factoring out processing of present
> PTEs.
>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
> ---
> mm/memory.c | 94 ++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 53 insertions(+), 41 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7c3ca41a7610..5b0dc33133a6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> }
>
> +static inline void zap_present_pte(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> + unsigned long addr, struct zap_details *details,
> + int *rss, bool *force_flush, bool *force_break)
> +{
> + struct mm_struct *mm = tlb->mm;
> + struct folio *folio = NULL;
> + bool delay_rmap = false;
> + struct page *page;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (page)
> + folio = page_folio(page);
> +
> + if (unlikely(!should_zap_folio(details, folio)))
> + return;
> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + arch_check_zapped_pte(vma, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> + if (unlikely(!page)) {
> + ksm_might_unmap_zero_page(mm, ptent);
> + return;
> + }
> +
> + if (!folio_test_anon(folio)) {
> + if (pte_dirty(ptent)) {
> + folio_mark_dirty(folio);
> + if (tlb_delay_rmap(tlb)) {
> + delay_rmap = true;
> + *force_flush = true;
> + }
> + }
> + if (pte_young(ptent) && likely(vma_has_recency(vma)))
> + folio_mark_accessed(folio);
> + }
> + rss[mm_counter(folio)]--;
> + if (!delay_rmap) {
> + folio_remove_rmap_pte(folio, page, vma);
> + if (unlikely(page_mapcount(page) < 0))
> + print_bad_pte(vma, addr, ptent, page);
> + }
> + if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> + *force_flush = true;
> + *force_break = true;
> + }
> +}
> +
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> {
> + bool force_flush = false, force_break = false;
> struct mm_struct *mm = tlb->mm;
> - int force_flush = 0;
> int rss[NR_MM_COUNTERS];
> spinlock_t *ptl;
> pte_t *start_pte;
> @@ -1555,7 +1603,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> arch_enter_lazy_mmu_mode();
> do {
> pte_t ptent = ptep_get(pte);
> - struct folio *folio = NULL;
> + struct folio *folio;
> struct page *page;
>
> if (pte_none(ptent))
> @@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> break;
>
> if (pte_present(ptent)) {
> - unsigned int delay_rmap;
> -
> - page = vm_normal_page(vma, addr, ptent);
> - if (page)
> - folio = page_folio(page);
> -
> - if (unlikely(!should_zap_folio(details, folio)))
> - continue;
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> - arch_check_zapped_pte(vma, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> - zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> - ptent);
> - if (unlikely(!page)) {
> - ksm_might_unmap_zero_page(mm, ptent);
> - continue;
> - }
> -
> - delay_rmap = 0;
> - if (!folio_test_anon(folio)) {
> - if (pte_dirty(ptent)) {
> - folio_mark_dirty(folio);
> - if (tlb_delay_rmap(tlb)) {
> - delay_rmap = 1;
> - force_flush = 1;
> - }
> - }
> - if (pte_young(ptent) && likely(vma_has_recency(vma)))
> - folio_mark_accessed(folio);
> - }
> - rss[mm_counter(folio)]--;
> - if (!delay_rmap) {
> - folio_remove_rmap_pte(folio, page, vma);
> - if (unlikely(page_mapcount(page) < 0))
> - print_bad_pte(vma, addr, ptent, page);
> - }
> - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> - force_flush = 1;
> + zap_present_pte(tlb, vma, pte, ptent, addr, details,
> + rss, &force_flush, &force_break);
> + if (unlikely(force_break)) {
> addr += PAGE_SIZE;
> break;
> }
On 09/02/2024 22:15, David Hildenbrand wrote:
> Add __tlb_remove_folio_pages(), which will remove multiple consecutive
> pages that belong to the same large folio, instead of only a single
> page. We'll be using this function when optimizing unmapping/zapping of
> large folios that are mapped by PTEs.
>
> We're using the remaining spare bit in an encoded_page to indicate that
> the next encoded page in an array actually contains a shifted "nr_pages" value.
> Teach swap/freeing code about putting multiple folio references, and
> delayed rmap handling to remove page ranges of a folio.
>
> This extension allows for still gathering almost as many small folios
> as we used to (-1, because we have to prepare for a possibly bigger next
> entry), but still allows for gathering consecutive pages that belong to the
> same large folio.
>
> Note that we don't pass the folio pointer, because it is not required for
> now. Further, we don't support page_size != PAGE_SIZE; it won't be
> required for simple PTE batching.
>
> We have to provide a separate s390 implementation, but it's fairly
> straightforward.
>
> Another, more invasive and likely more expensive, approach would be to
> use folio+range or a PFN range instead of page+nr_pages. But, we should
> do that consistently for the whole mmu_gather. For now, let's keep it
> simple and add "nr_pages" only.
>
> Note that it is now possible to gather significantly more pages: in the
> past, we were able to gather ~10000 pages, now we can also gather
> ~5000 folio fragments that span multiple pages. A folio fragment
> on x86-64 can be up to 512 pages (2 MiB THP) and on arm64 with 64k pages
> in theory 8192 pages (512 MiB THP). Gathering more memory is not
> considered something we should worry about, especially because these are
> already corner cases.
>
> While we can gather more total memory, we won't free more folio
> fragments. As long as page freeing time primarily only depends on the
> number of involved folios, there is no effective change for !preempt
> configurations. However, we'll adjust tlb_batch_pages_flush() separately to
> handle corner cases where page freeing time grows proportionally with the
> actual memory size.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/s390/include/asm/tlb.h | 17 +++++++++++
> include/asm-generic/tlb.h | 8 +++++
> include/linux/mm_types.h | 20 ++++++++++++
> mm/mmu_gather.c | 61 +++++++++++++++++++++++++++++++------
> mm/swap.c | 12 ++++++--
> mm/swap_state.c | 15 +++++++--
> 6 files changed, 119 insertions(+), 14 deletions(-)
>
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index 48df896d5b79..e95b2c8081eb 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
> static inline void tlb_flush(struct mmu_gather *tlb);
> static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> struct page *page, bool delay_rmap, int page_size);
> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> + struct page *page, unsigned int nr_pages, bool delay_rmap);
>
> #define tlb_flush tlb_flush
> #define pte_free_tlb pte_free_tlb
> @@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> return false;
> }
>
> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> + struct page *page, unsigned int nr_pages, bool delay_rmap)
> +{
> + struct encoded_page *encoded_pages[] = {
> + encode_page(page, ENCODED_PAGE_BIT_NR_PAGES_NEXT),
> + encode_nr_pages(nr_pages),
> + };
> +
> + VM_WARN_ON_ONCE(delay_rmap);
> + VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
> +
> + free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
> + return false;
> +}
> +
> static inline void tlb_flush(struct mmu_gather *tlb)
> {
> __tlb_flush_mm_lazy(tlb->mm);
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 95d60a4f468a..bd00dd238b79 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -69,6 +69,7 @@
> *
> * - tlb_remove_page() / __tlb_remove_page()
> * - tlb_remove_page_size() / __tlb_remove_page_size()
> + * - __tlb_remove_folio_pages()
> *
> * __tlb_remove_page_size() is the basic primitive that queues a page for
> * freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
> @@ -78,6 +79,11 @@
> * tlb_remove_page() and tlb_remove_page_size() imply the call to
> * tlb_flush_mmu() when required and has no return value.
> *
> + * __tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
> + * instead of removing a single page, remove the given number of consecutive
> + * pages that are all part of the same (large) folio: just like calling
> + * __tlb_remove_page() on each page individually.
> + *
> * - tlb_change_page_size()
> *
> * call before __tlb_remove_page*() to set the current page-size; implies a
> @@ -262,6 +268,8 @@ struct mmu_gather_batch {
>
> extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> bool delay_rmap, int page_size);
> +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
> + unsigned int nr_pages, bool delay_rmap);
>
> #ifdef CONFIG_SMP
> /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1b89eec0d6df..a7223ba3ea1e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -226,6 +226,15 @@ struct encoded_page;
> /* Perform rmap removal after we have flushed the TLB. */
> #define ENCODED_PAGE_BIT_DELAY_RMAP 1ul
>
> +/*
> + * The next item in an encoded_page array is the "nr_pages" argument, specifying
> + * the number of consecutive pages starting from this page, that all belong to
> + * the same folio. For example, "nr_pages" corresponds to the number of folio
> + * references that must be dropped. If this bit is not set, "nr_pages" is
> + * implicitly 1.
> + */
> +#define ENCODED_PAGE_BIT_NR_PAGES_NEXT 2ul
> +
> static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> {
> BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
> @@ -242,6 +251,17 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
> return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
> }
>
> +static __always_inline struct encoded_page *encode_nr_pages(unsigned long nr)
> +{
> + VM_WARN_ON_ONCE((nr << 2) >> 2 != nr);
> + return (struct encoded_page *)(nr << 2);
> +}
> +
> +static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page)
> +{
> + return ((unsigned long)page) >> 2;
> +}
> +
> /*
> * A swap entry has to fit into a "unsigned long", as the entry is hidden
> * in the "index" field of the swapper address space.
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 6540c99c6758..d175c0f1e2c8 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -50,12 +50,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
> #ifdef CONFIG_SMP
> static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
> {
> + struct encoded_page **pages = batch->encoded_pages;
> +
> for (int i = 0; i < batch->nr; i++) {
> - struct encoded_page *enc = batch->encoded_pages[i];
> + struct encoded_page *enc = pages[i];
>
> if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
> struct page *page = encoded_page_ptr(enc);
> - folio_remove_rmap_pte(page_folio(page), page, vma);
> + unsigned int nr_pages = 1;
> +
> + if (unlikely(encoded_page_flags(enc) &
> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
> + nr_pages = encoded_nr_pages(pages[++i]);
> +
> + folio_remove_rmap_ptes(page_folio(page), page, nr_pages,
> + vma);
> }
> }
> }
> @@ -89,18 +98,26 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
> struct encoded_page **pages = batch->encoded_pages;
>
> - do {
> + while (batch->nr) {
> /*
> * limit free batch count when PAGE_SIZE > 4K
> */
> unsigned int nr = min(512U, batch->nr);
>
> + /*
> + * Make sure we cover page + nr_pages, and don't leave
> + * nr_pages behind when capping the number of entries.
> + */
> + if (unlikely(encoded_page_flags(pages[nr - 1]) &
> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
> + nr++;
> +
> free_pages_and_swap_cache(pages, nr);
> pages += nr;
> batch->nr -= nr;
>
> cond_resched();
> - } while (batch->nr);
> + }
> }
> tlb->active = &tlb->local;
> }
> @@ -116,8 +133,9 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
> tlb->local.next = NULL;
> }
>
> -bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> - bool delay_rmap, int page_size)
> +static bool __tlb_remove_folio_pages_size(struct mmu_gather *tlb,
> + struct page *page, unsigned int nr_pages, bool delay_rmap,
> + int page_size)
> {
> int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
> struct mmu_gather_batch *batch;
> @@ -126,6 +144,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>
> #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
> VM_WARN_ON(tlb->page_size != page_size);
> + VM_WARN_ON_ONCE(nr_pages != 1 && page_size != PAGE_SIZE);
> + VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
I've forgotten the rules for when it is ok to assume contiguous PFNs' struct
pages are contiguous in virtual memory? I think it's fine as long as the pages
belong to the same folio and the folio order <= MAX_ORDER? So `page + nr_pages -
1` is safe?
Assuming this is the case:
Reviewed-by: Ryan Roberts <[email protected]>
> #endif
>
> batch = tlb->active;
> @@ -133,17 +153,40 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> * Add the page and check if we are full. If so
> * force a flush.
> */
> - batch->encoded_pages[batch->nr++] = encode_page(page, flags);
> - if (batch->nr == batch->max) {
> + if (likely(nr_pages == 1)) {
> + batch->encoded_pages[batch->nr++] = encode_page(page, flags);
> + } else {
> + flags |= ENCODED_PAGE_BIT_NR_PAGES_NEXT;
> + batch->encoded_pages[batch->nr++] = encode_page(page, flags);
> + batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages);
> + }
> + /*
> + * Make sure that we can always add another "page" + "nr_pages",
> + * requiring two entries instead of only a single one.
> + */
> + if (batch->nr >= batch->max - 1) {
> if (!tlb_next_batch(tlb))
> return true;
> batch = tlb->active;
> }
> - VM_BUG_ON_PAGE(batch->nr > batch->max, page);
> + VM_BUG_ON_PAGE(batch->nr > batch->max - 1, page);
>
> return false;
> }
>
> +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
> + unsigned int nr_pages, bool delay_rmap)
> +{
> + return __tlb_remove_folio_pages_size(tlb, page, nr_pages, delay_rmap,
> + PAGE_SIZE);
> +}
> +
> +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> + bool delay_rmap, int page_size)
> +{
> + return __tlb_remove_folio_pages_size(tlb, page, 1, delay_rmap, page_size);
> +}
> +
> #endif /* MMU_GATHER_NO_GATHER */
>
> #ifdef CONFIG_MMU_GATHER_TABLE_FREE
> diff --git a/mm/swap.c b/mm/swap.c
> index cd8f0150ba3a..e5380d732c0d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -967,11 +967,17 @@ void release_pages(release_pages_arg arg, int nr)
> unsigned int lock_batch;
>
> for (i = 0; i < nr; i++) {
> + unsigned int nr_refs = 1;
> struct folio *folio;
>
> /* Turn any of the argument types into a folio */
> folio = page_folio(encoded_page_ptr(encoded[i]));
>
> + /* Is our next entry actually "nr_pages" -> "nr_refs" ? */
> + if (unlikely(encoded_page_flags(encoded[i]) &
> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
> + nr_refs = encoded_nr_pages(encoded[++i]);
> +
> /*
> * Make sure the IRQ-safe lock-holding time does not get
> * excessive with a continuous string of pages from the
> @@ -990,14 +996,14 @@ void release_pages(release_pages_arg arg, int nr)
> unlock_page_lruvec_irqrestore(lruvec, flags);
> lruvec = NULL;
> }
> - if (put_devmap_managed_page(&folio->page))
> + if (put_devmap_managed_page_refs(&folio->page, nr_refs))
> continue;
> - if (folio_put_testzero(folio))
> + if (folio_ref_sub_and_test(folio, nr_refs))
> free_zone_device_page(&folio->page);
> continue;
> }
>
> - if (!folio_put_testzero(folio))
> + if (!folio_ref_sub_and_test(folio, nr_refs))
> continue;
>
> if (folio_test_large(folio)) {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 7255c01a1e4e..2f540748f7c0 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -311,8 +311,19 @@ void free_page_and_swap_cache(struct page *page)
> void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> {
> lru_add_drain();
> - for (int i = 0; i < nr; i++)
> - free_swap_cache(encoded_page_ptr(pages[i]));
> + for (int i = 0; i < nr; i++) {
> + struct page *page = encoded_page_ptr(pages[i]);
> +
> + /*
> + * Skip over the "nr_pages" entry. It's sufficient to call
> + * free_swap_cache() only once per folio.
> + */
> + if (unlikely(encoded_page_flags(pages[i]) &
> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
> + i++;
> +
> + free_swap_cache(page);
> + }
> release_pages(pages, nr);
> }
>
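For anyone skimming the thread: the encoding convention described above boils
down to the following consumer-side pattern (a sketch only, with a hypothetical
helper name; it mirrors the loops this patch adds to mm/mmu_gather.c, mm/swap.c
and mm/swap_state.c):

/*
 * Sketch: walking an encoded_page array under the new convention. An
 * entry with ENCODED_PAGE_BIT_NR_PAGES_NEXT set is followed by an
 * encoded "nr_pages" value instead of another page pointer.
 */
static void walk_encoded_pages(struct encoded_page **pages, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++) {
		struct page *page = encoded_page_ptr(pages[i]);
		unsigned int nr_pages = 1;

		if (unlikely(encoded_page_flags(pages[i]) &
			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
			nr_pages = encoded_nr_pages(pages[++i]);

		/* process "nr_pages" consecutive pages of page_folio(page) */
	}
}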
On 12.02.24 09:51, Ryan Roberts wrote:
> On 09/02/2024 22:15, David Hildenbrand wrote:
>> [...]
>> @@ -126,6 +144,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>>
>> #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>> VM_WARN_ON(tlb->page_size != page_size);
>> + VM_WARN_ON_ONCE(nr_pages != 1 && page_size != PAGE_SIZE);
>> + VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
>
> I've forgotten the rules for when it is ok to assume contiguous PFNs' struct
> pages are contiguous in virtual memory? I think it's fine as long as the pages
> belong to the same folio and the folio order <= MAX_ORDER? So `page + nr_pages -
> 1` is safe?
>
Essentially, for anything that comes from the buddy it is safe (which we
end up punching into RMAP functions where we now have similar checks).
Note that we'll never end up here with "nr_pages != 1" for hugetlb, where
the check would not be true for some gigantic pages.
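To spell the assumption out (a sketch only, hypothetical helper name; it is
exactly what the VM_WARN_ON_ONCE above checks):

/*
 * Struct pages of a single buddy-allocated folio are virtually
 * contiguous, so "page + i" stays within the folio for
 * 0 <= i < nr_pages. Gigantic hugetlb folios can span memory sections,
 * where (without a vmemmap) that does not hold and nth_page() would be
 * needed -- but we never get here with nr_pages != 1 for hugetlb.
 */
static inline bool pages_within_same_folio(struct page *page,
					   unsigned int nr_pages)
{
	return page_folio(page) == page_folio(page + nr_pages - 1);
}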
> Assuming this is the case:
>
> Reviewed-by: Ryan Roberts <[email protected]>
Thanks!
--
Cheers,
David / dhildenb
On 09/02/2024 22:15, David Hildenbrand wrote:
> It's a pain that we have to handle cond_resched() in
> tlb_batch_pages_flush() manually and cannot simply handle it in
> release_pages() -- release_pages() can be called from atomic context.
> Well, in a perfect world we wouldn't have to make our code more complicated at all.
>
> With page poisoning and init_on_free, we might now run into soft lockups
> when we free a lot of rather large folio fragments, because page freeing
> time then depends on the actual memory size we are freeing instead of on
> the number of folios that are involved.
>
> In the absolute (unlikely) worst case, on arm64 with 64k we will be able
> to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
> GiB does sound like it might take a while. But instead of ignoring this
> unlikely case, let's just handle it.
>
> So, let's teach tlb_batch_pages_flush() that there are some
> configurations where page freeing is horribly slow, and let's reschedule
> more frequently -- similar to what we did before we had large folio
> fragments in there. Note that we might end up freeing only a single folio
> fragment at a time that might exceed the old 512 pages limit: but if we
> cannot even free a single MAX_ORDER page on a system without running into
> soft lockups, something else is already completely bogus.
>
> In the future, we might want to detect if handling cond_resched() is
> required at all, and just not do any of that with full preemption enabled.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/mmu_gather.c | 50 ++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 41 insertions(+), 9 deletions(-)
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index d175c0f1e2c8..2774044b5790 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -91,18 +91,19 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
> }
> #endif
>
> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
> {
> - struct mmu_gather_batch *batch;
> -
> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
> - struct encoded_page **pages = batch->encoded_pages;
> + struct encoded_page **pages = batch->encoded_pages;
> + unsigned int nr, nr_pages;
>
> + /*
> + * We might end up freeing a lot of pages. Reschedule on a regular
> + * basis to avoid soft lockups in configurations without full
> + * preemption enabled. The magic number of 512 folios seems to work.
> + */
> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
Is the performance win really worth 2 separate implementations keyed off this?
It seems a bit fragile, in case any other operations get added to free which are
proportional to size in future. Why not just always do the conservative version?
> while (batch->nr) {
> - /*
> - * limit free batch count when PAGE_SIZE > 4K
> - */
> - unsigned int nr = min(512U, batch->nr);
> + nr = min(512, batch->nr);
If any entries are for more than 1 page, nr_pages will also be encoded in the
batch, so effectively this could be limiting to 256 actual folios (half of 512).
Is it worth checking for ENCODED_PAGE_BIT_NR_PAGES_NEXT and limiting accordingly?
nit: You're using 512 magic number in 2 places now; perhaps make a macro?
>
> /*
> * Make sure we cover page + nr_pages, and don't leave
> @@ -119,6 +120,37 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> cond_resched();
> }
> }
> +
> + /*
> + * With page poisoning and init_on_free, the time it takes to free
> + * memory grows proportionally with the actual memory size. Therefore,
> + * limit based on the actual memory size and not the number of involved
> + * folios.
> + */
> + while (batch->nr) {
> + for (nr = 0, nr_pages = 0;
> + nr < batch->nr && nr_pages < 512; nr++) {
> + if (unlikely(encoded_page_flags(pages[nr]) &
> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
> + nr_pages += encoded_nr_pages(pages[++nr]);
> + else
> + nr_pages++;
> + }
I guess worst case here is freeing (511 + 8192) * 64K pages = ~544M. That's up
from the old limit of 512 * 64K = 32M, and 511 pages bigger than your statement
in the commit log. Are you comfortable with this? I guess the only alternative
is to start splitting a batch which would be really messy. I agree your approach
is preferable if 544M is acceptable.
> +
> + free_pages_and_swap_cache(pages, nr);
> + pages += nr;
> + batch->nr -= nr;
> +
> + cond_resched();
> + }
> +}
> +
> +static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> +{
> + struct mmu_gather_batch *batch;
> +
> + for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
> + __tlb_batch_free_encoded_pages(batch);
> tlb->active = &tlb->local;
> }
>
On 09/02/2024 22:15, David Hildenbrand wrote:
> Similar to how we optimized fork(), let's implement PTE batching when
> consecutive (present) PTEs map consecutive pages of the same large
> folio.
>
> Most infrastructure we need for batching (mmu gather, rmap) is already
> there. We only have to add get_and_clear_full_ptes() and
> clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
> process a PTE range.
>
> We won't bother sanity-checking the mapcount of all subpages, but only
> check the mapcount of the first subpage we process. If there is a real
> problem hiding somewhere, we can trigger it simply by using small
> folios, or when we zap single pages of a large folio. Ideally, we had
> that check in rmap code (including for delayed rmap), but then we cannot
> print the PTE. Let's keep it simple for now. If we ever have a cheap
> folio_mapcount(), we might just want to check for underflows there.
>
> To keep small folios as fast as possible, we force inlining of a specialized
> variant using __always_inline with nr=1.
>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 70 +++++++++++++++++++++++++++++++
> mm/memory.c | 92 +++++++++++++++++++++++++++++------------
> 2 files changed, 136 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index aab227e12493..49ab1f73b5c2 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -580,6 +580,76 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> }
> #endif
>
> +#ifndef get_and_clear_full_ptes
> +/**
> + * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
> + * the same folio, collecting dirty/accessed bits.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
> + * returned PTE.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +#endif
> +
> +#ifndef clear_full_ptes
> +/**
> + * clear_full_ptes - Clear present PTEs that map consecutive pages of the same
> + * folio.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_get_and_clear_full().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + for (;;) {
> + ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +#endif
>
> /*
> * If two threads concurrently fault at the same page, the thread that
> diff --git a/mm/memory.c b/mm/memory.c
> index a3efc4da258a..3b8e56eb08a3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
> */
> static inline void
> zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *pte,
> + unsigned long addr, pte_t *pte, int nr,
> struct zap_details *details, pte_t pteval)
> {
> /* Zap on anonymous always means dropping everything */
> @@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> if (zap_drop_file_uffd_wp(details))
> return;
>
> - pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> + for (;;) {
> + /* the PFN in the PTE is irrelevant. */
> + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> + if (--nr == 0)
> + break;
> + pte++;
> + addr += PAGE_SIZE;
> + }
> }
>
> -static inline void zap_present_folio_pte(struct mmu_gather *tlb,
> +static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> struct vm_area_struct *vma, struct folio *folio,
> - struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
> - struct zap_details *details, int *rss, bool *force_flush,
> - bool *force_break)
> + struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
> + unsigned long addr, struct zap_details *details, int *rss,
> + bool *force_flush, bool *force_break)
> {
> struct mm_struct *mm = tlb->mm;
> bool delay_rmap = false;
>
> if (!folio_test_anon(folio)) {
> - ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> if (pte_dirty(ptent)) {
> folio_mark_dirty(folio);
> if (tlb_delay_rmap(tlb)) {
> @@ -1548,36 +1555,49 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb,
> }
> if (pte_young(ptent) && likely(vma_has_recency(vma)))
> folio_mark_accessed(folio);
> - rss[mm_counter(folio)]--;
> + rss[mm_counter(folio)] -= nr;
> } else {
> /* We don't need up-to-date accessed/dirty bits. */
> - ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> - rss[MM_ANONPAGES]--;
> + clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> + rss[MM_ANONPAGES] -= nr;
> }
> + /* Checking a single PTE in a batch is sufficient. */
> arch_check_zapped_pte(vma, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> + tlb_remove_tlb_entries(tlb, pte, nr, addr);
> if (unlikely(userfaultfd_pte_wp(vma, ptent)))
> - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
> + ptent);
>
> if (!delay_rmap) {
> - folio_remove_rmap_pte(folio, page, vma);
> + folio_remove_rmap_ptes(folio, page, nr, vma);
> +
> + /* Only sanity-check the first page in a batch. */
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
> }
> - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> + if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
> *force_flush = true;
> *force_break = true;
> }
> }
>
> -static inline void zap_present_pte(struct mmu_gather *tlb,
> +/*
> + * Zap or skip at least one present PTE, trying to batch-process subsequent
> + * PTEs that map consecutive pages of the same folio.
> + *
> + * Returns the number of processed (skipped or zapped) PTEs (at least 1).
> + */
> +static inline int zap_present_ptes(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> - unsigned long addr, struct zap_details *details,
> - int *rss, bool *force_flush, bool *force_break)
> + unsigned int max_nr, unsigned long addr,
> + struct zap_details *details, int *rss, bool *force_flush,
> + bool *force_break)
> {
> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> struct mm_struct *mm = tlb->mm;
> struct folio *folio;
> struct page *page;
> + int nr;
>
> page = vm_normal_page(vma, addr, ptent);
> if (!page) {
> @@ -1587,14 +1607,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
> tlb_remove_tlb_entry(tlb, pte, addr);
> VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> ksm_might_unmap_zero_page(mm, ptent);
> - return;
> + return 1;
> }
>
> folio = page_folio(page);
> if (unlikely(!should_zap_folio(details, folio)))
> - return;
> - zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
> - rss, force_flush, force_break);
> + return 1;
> +
> + /*
> + * Make sure that the common "small folio" case is as fast as possible
> + * by keeping the batching logic separate.
> + */
> + if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> + NULL);
> +
> + zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
> + addr, details, rss, force_flush,
> + force_break);
> + return nr;
> + }
> + zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
> + details, rss, force_flush, force_break);
> + return 1;
> }
>
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> @@ -1609,6 +1644,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pte_t *start_pte;
> pte_t *pte;
> swp_entry_t entry;
> + int nr;
>
> tlb_change_page_size(tlb, PAGE_SIZE);
> init_rss_vec(rss);
> @@ -1622,7 +1658,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pte_t ptent = ptep_get(pte);
> struct folio *folio;
> struct page *page;
> + int max_nr;
>
> + nr = 1;
> if (pte_none(ptent))
> continue;
>
> @@ -1630,10 +1668,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> break;
>
> if (pte_present(ptent)) {
> - zap_present_pte(tlb, vma, pte, ptent, addr, details,
> - rss, &force_flush, &force_break);
> + max_nr = (end - addr) / PAGE_SIZE;
> + nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
> + addr, details, rss, &force_flush,
> + &force_break);
> if (unlikely(force_break)) {
> - addr += PAGE_SIZE;
> + addr += nr * PAGE_SIZE;
> break;
> }
> continue;
> @@ -1687,8 +1727,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> WARN_ON_ONCE(1);
> }
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
> + } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> add_mm_rss_vec(mm, rss);
> arch_leave_lazy_mmu_mode();
Hi Ryan,
>> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
>> {
>> - struct mmu_gather_batch *batch;
>> -
>> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>> - struct encoded_page **pages = batch->encoded_pages;
>> + struct encoded_page **pages = batch->encoded_pages;
>> + unsigned int nr, nr_pages;
>>
>> + /*
>> + * We might end up freeing a lot of pages. Reschedule on a regular
>> + * basis to avoid soft lockups in configurations without full
>> + * preemption enabled. The magic number of 512 folios seems to work.
>> + */
>> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
>
> Is the performance win really worth 2 separate implementations keyed off this?
> It seems a bit fragile, in case any other operations get added to free which are
> proportional to size in future. Why not just always do the conservative version?
I really don't want to iterate over all entries on the "sane" common
case. We already do that two times:
a) free_pages_and_swap_cache()
b) release_pages()
Only the latter really is required, and I'm planning on removing the one
in (a) to move it into (b) as well.
So I keep it separate to keep any unnecessary overhead to the setups
that are already terribly slow.
No need to iterate a page full of entries if it can be easily avoided.
Especially, no need to degrade the common order-0 case.
>
>> while (batch->nr) {
>> - /*
>> - * limit free batch count when PAGE_SIZE > 4K
>> - */
>> - unsigned int nr = min(512U, batch->nr);
>> + nr = min(512, batch->nr);
>
> If any entries are for more than 1 page, nr_pages will also be encoded in the
> batch, so effectively this could be limiting to 256 actual folios (half of 512).
Right, in the patch description I state "256 folio fragments". It's up
to 512 folios (order-0).
> Is it worth checking for ENCODED_PAGE_BIT_NR_PAGES_NEXT and limiting accordingly?
At least with 4k page size, we never have more than 510 (IIRC) entries
per batch page. So any such optimization would only matter for large
page sizes, which I don't think is worth it.
Which exact optimization do you have in mind and would it really make a
difference?
>
> nit: You're using 512 magic number in 2 places now; perhaps make a macro?
I played 3 times with macro names (including just using something
"intuitive" like MAX_ORDER_NR_PAGES) but returned to just using 512.
That cond_resched() handling is just absolutely disgusting, one way or
the other.
Do you have a good idea for a macro name?
>
>>
>> /*
>> * Make sure we cover page + nr_pages, and don't leave
>> @@ -119,6 +120,37 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>> cond_resched();
>> }
>> }
>> +
>> + /*
>> + * With page poisoning and init_on_free, the time it takes to free
>> + * memory grows proportionally with the actual memory size. Therefore,
>> + * limit based on the actual memory size and not the number of involved
>> + * folios.
>> + */
>> + while (batch->nr) {
>> + for (nr = 0, nr_pages = 0;
>> + nr < batch->nr && nr_pages < 512; nr++) {
>> + if (unlikely(encoded_page_flags(pages[nr]) &
>> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
>> + nr_pages += encoded_nr_pages(pages[++nr]);
>> + else
>> + nr_pages++;
>> + }
>
> I guess worst case here is freeing (511 + 8192) * 64K pages = ~544M. That's up
> from the old limit of 512 * 64K = 32M, and 511 pages bigger than your statement
> in the commit log. Are you comfortable with this? I guess the only alternative
> is to start splitting a batch which would be really messy. I agree your approach
> is preferable if 544M is acceptable.
Right, I have in the description:
"if we cannot even free a single MAX_ORDER page on a system without
running into soft lockups, something else is already completely bogus.".
That would be 8192 pages on arm64. Anybody freeing a PMD-mapped THP
would be in trouble already and should just reconsider life choices
running such a machine.
We could have 511 more pages, yes. If 8192 don't trigger a soft-lockup,
I am confident that 511 more pages won't make a difference.
But, if that ever is a problem, we can butcher this code as much as we
want, because performance with poisoning/zeroing is already down the drain.
As you say, splitting even further is messy, so I rather avoid that
unless really required.
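To spell out the arithmetic once more: the inner loop only stops once
nr_pages has hit at least 512, so up to 511 order-0 pages can accumulate
before a final entry that is a full max-order fragment gets added. On
arm64 with 64k pages that is

  (511 + 8192) pages * 64 KiB = 8703 * 64 KiB ~= 544 MiB

per free_pages_and_swap_cache() call in the absolute worst case, matching
your number above.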
--
Cheers,
David / dhildenb
On 12/02/2024 10:11, David Hildenbrand wrote:
> Hi Ryan,
>
>>> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
>>> {
>>> - struct mmu_gather_batch *batch;
>>> -
>>> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>>> - struct encoded_page **pages = batch->encoded_pages;
>>> + struct encoded_page **pages = batch->encoded_pages;
>>> + unsigned int nr, nr_pages;
>>> + /*
>>> + * We might end up freeing a lot of pages. Reschedule on a regular
>>> + * basis to avoid soft lockups in configurations without full
>>> + * preemption enabled. The magic number of 512 folios seems to work.
>>> + */
>>> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
>>
>> Is the performance win really worth 2 separate implementations keyed off this?
>> It seems a bit fragile, in case any other operations get added to free which are
>> proportional to size in future. Why not just always do the conservative version?
>
> I really don't want to iterate over all entries on the "sane" common case. We
> already do that two times:
>
> a) free_pages_and_swap_cache()
>
> b) release_pages()
>
> Only the latter really is required, and I'm planning on removing the one in (a)
> to move it into (b) as well.
>
> So I keep it separate to keep any unnecessary overhead to the setups that are
> already terribly slow.
>
> No need to iterate a page full of entries if it can be easily avoided.
> Especially, no need to degrade the common order-0 case.
Yeah, I understand all that. But given this is all coming from an array (so
easy to prefetch?) and will presumably all fit in the cache for the common case,
at least, so it's hot for (a) and (b), does separating this out really make a
measurable performance difference? If yes then absolutely this optimization
makes sense. But if not, I think it's a bit questionable.
You're the boss though, so if your experience tells you this is necessary, then
I'm ok with that.
By the way, Matthew had an RFC a while back that was doing some clever things
with batches further down the call chain (I think; from memory). Might be worth
taking a look at that if you are planning a follow up change to (a).
>
>>
>>> while (batch->nr) {
>>> - /*
>>> - * limit free batch count when PAGE_SIZE > 4K
>>> - */
>>> - unsigned int nr = min(512U, batch->nr);
>>> + nr = min(512, batch->nr);
>>
>> If any entries are for more than 1 page, nr_pages will also be encoded in the
>> batch, so effectively this could be limiting to 256 actual folios (half of 512).
>
> Right, in the patch description I state "256 folio fragments". It's up to 512
> folios (order-0).
>
>> Is it worth checking for ENCODED_PAGE_BIT_NR_PAGES_NEXT and limiting accordingly?
>
> At least with 4k page size, we never have more than 510 (IIRC) entries per batch
> page. So any such optimization would only matter for large page sizes, which I
> don't think is worth it.
Yep; agreed.
>
> Which exact optimization do you have in mind and would it really make a difference?
No, I don't think it would make any difference, performance-wise. I'm just
pointing out that in pathological cases you could end up with half the number of
pages being freed at a time.
>
>>
>> nit: You're using 512 magic number in 2 places now; perhaps make a macro?
>
> I played 3 times with macro names (including just using something "intuitive"
> like MAX_ORDER_NR_PAGES) but returned to just using 512.
>
> That cond_resched() handling is just absolutely disgusting, one way or the other.
>
> Do you have a good idea for a macro name?
MAX_NR_FOLIOS_PER_BATCH?
MAX_NR_FOLIOS_PER_FREE?
I don't think the name has to be perfect, because it's private to the c file; but
it ensures the 2 usages remain in sync if someone wants to change it in future.
>
>>
>>> /*
>>> * Make sure we cover page + nr_pages, and don't leave
>>> @@ -119,6 +120,37 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>> cond_resched();
>>> }
>>> }
>>> +
>>> + /*
>>> + * With page poisoning and init_on_free, the time it takes to free
>>> + * memory grows proportionally with the actual memory size. Therefore,
>>> + * limit based on the actual memory size and not the number of involved
>>> + * folios.
>>> + */
>>> + while (batch->nr) {
>>> + for (nr = 0, nr_pages = 0;
>>> + nr < batch->nr && nr_pages < 512; nr++) {
>>> + if (unlikely(encoded_page_flags(pages[nr]) &
>>> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
>>> + nr_pages += encoded_nr_pages(pages[++nr]);
>>> + else
>>> + nr_pages++;
>>> + }
>>
>> I guess worst case here is freeing (511 + 8192) * 64K pages = ~544M. That's up
>> from the old limit of 512 * 64K = 32M, and 511 pages bigger than your statement
>> in the commit log. Are you comfortable with this? I guess the only alternative
>> is to start splitting a batch which would be really messy. I agree your approach
>> is preferable if 544M is acceptable.
>
> Right, I have in the description:
>
> "if we cannot even free a single MAX_ORDER page on a system without running into
> soft lockups, something else is already completely bogus.".
>
> That would be 8192 pages on arm64. Anybody freeing a PMD-mapped THP would be in
> trouble already and should just reconsider life choices running such a machine.
>
> We could have 511 more pages, yes. If 8192 don't trigger a soft-lockup, I am
> confident that 511 more pages won't make a difference.
>
> But, if that ever is a problem, we can butcher this code as much as we want,
> because performance with poisoning/zeroing is already down the drain.
>
> As you say, splitting even further is messy, so I rather avoid that unless
> really required.
>
Yep ok, I understand the argument better now - thanks.
On 12.02.24 11:32, Ryan Roberts wrote:
> On 12/02/2024 10:11, David Hildenbrand wrote:
>> Hi Ryan,
>>
>>>> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>>> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
>>>> {
>>>> - struct mmu_gather_batch *batch;
>>>> -
>>>> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>>>> - struct encoded_page **pages = batch->encoded_pages;
>>>> + struct encoded_page **pages = batch->encoded_pages;
>>>> + unsigned int nr, nr_pages;
>>>> + /*
>>>> + * We might end up freeing a lot of pages. Reschedule on a regular
>>>> + * basis to avoid soft lockups in configurations without full
>>>> + * preemption enabled. The magic number of 512 folios seems to work.
>>>> + */
>>>> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
>>>
>>> Is the performance win really worth 2 separate implementations keyed off this?
>>> It seems a bit fragile, in case any other operations get added to free which are
>>> proportional to size in future. Why not just always do the conservative version?
>>
>> I really don't want to iterate over all entries on the "sane" common case. We
>> already do that two times:
>>
>> a) free_pages_and_swap_cache()
>>
>> b) release_pages()
>>
>> Only the latter really is required, and I'm planning on removing the one in (a)
>> to move it into (b) as well.
>>
>> So I keep it separate to keep any unnecessary overhead to the setups that are
>> already terribly slow.
>>
>> No need to iterate a page full of entries if it can be easily avoided.
>> Especially, no need to degrade the common order-0 case.
>
> Yeah, I understand all that. But given this is all coming from an array (so
> easy to prefetch?) and will presumably all fit in the cache for the common case,
> at least, so it's hot for (a) and (b), does separating this out really make a
> measurable performance difference? If yes then absolutely this optimization
> makes sense. But if not, I think it's a bit questionable.
I primarily added it because
(a) we learned that each cycle counts during mmap() just like it does
during fork().
(b) Linus was similarly concerned about optimizing out another batching
walk in c47454823bd4 ("mm: mmu_gather: allow more than one batch of
delayed rmaps"):
"it needs to walk that array of pages while still holding the page table
lock, and our mmu_gather infrastructure allows for batching quite a lot
of pages. We may have thousands of pages queued up for freeing, and we
wanted to walk only the last batch if we then added a dirty page to the
queue."
So if it matters enough for reducing the time we hold the page table
lock, it surely adds "some" overhead in general.
>
> You're the boss though, so if your experience tells you this is necessary, then
> I'm ok with that.
I did not do any measurements myself, I just did that intuitively as
above. After all, it's all pretty straightforward (keeping the existing
logic, we need a new one either way) and not that much code.
So unless there are strong opinions, I'd just leave the common case as
it was, and let the odd case be special.
>
> By the way, Matthew had an RFC a while back that was doing some clever things
> with batches further down the call chain (I think; from memory). Might be worth
> taking a look at that if you are planning a follow up change to (a).
>
Do you have a pointer?
>>
>>>
>>>> while (batch->nr) {
>>>> - /*
>>>> - * limit free batch count when PAGE_SIZE > 4K
>>>> - */
>>>> - unsigned int nr = min(512U, batch->nr);
>>>> + nr = min(512, batch->nr);
>>>
>>> If any entries are for more than 1 page, nr_pages will also be encoded in the
>>> batch, so effectively this could be limiting to 256 actual folios (half of 512).
>>
>> Right, in the patch description I state "256 folio fragments". It's up to 512
>> folios (order-0).
>>
>>> Is it worth checking for ENCODED_PAGE_BIT_NR_PAGES_NEXT and limiting accordingly?
>>
>> At least with 4k page size, we never have more than 510 (IIRC) entries per batch
>> page. So any such optimization would only matter for large page sizes, which I
>> don't think is worth it.
>
> Yep; agreed.
>
>>
>> Which exact optimization do you have in mind and would it really make a difference?
>
> No, I don't think it would make any difference, performance-wise. I'm just
> pointing out that in pathological cases you could end up with half the number of
> pages being freed at a time.
Yes, I'll extend the patch description!
>
>>
>>>
>>> nit: You're using 512 magic number in 2 places now; perhaps make a macro?
>>
>> I played 3 times with macro names (including just using something "intuitive"
>> like MAX_ORDER_NR_PAGES) but returned to just using 512.
>>
>> That cond_resched() handling is just absolutely disgusting, one way or the other.
>>
>> Do you have a good idea for a macro name?
>
> MAX_NR_FOLIOS_PER_BATCH?
> MAX_NR_FOLIOS_PER_FREE?
>
> I don't think the name has to be perfect, because it's private to the c file; but
> it ensures the 2 usages remain in sync if someone wants to change it in future.
Makes sense, I'll use something along those lines.
>
>>
>>>
>>>> /*
>>>> * Make sure we cover page + nr_pages, and don't leave
>>>> @@ -119,6 +120,37 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>>> cond_resched();
>>>> }
>>>> }
>>>> +
>>>> + /*
>>>> + * With page poisoning and init_on_free, the time it takes to free
>>>> + * memory grows proportionally with the actual memory size. Therefore,
>>>> + * limit based on the actual memory size and not the number of involved
>>>> + * folios.
>>>> + */
>>>> + while (batch->nr) {
>>>> + for (nr = 0, nr_pages = 0;
>>>> + nr < batch->nr && nr_pages < 512; nr++) {
>>>> + if (unlikely(encoded_page_flags(pages[nr]) &
>>>> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
>>>> + nr_pages += encoded_nr_pages(pages[++nr]);
>>>> + else
>>>> + nr_pages++;
>>>> + }
>>>
>>> I guess worst case here is freeing (511 + 8192) * 64K pages = ~544M. That's up
>>> from the old limit of 512 * 64K = 32M, and 511 pages bigger than your statement
>>> in the commit log. Are you comfortable with this? I guess the only alternative
>>> is to start splitting a batch which would be really messy. I agree your approach
>>> is preferable if 544M is acceptable.
>>
>> Right, I have in the description:
>>
>> "if we cannot even free a single MAX_ORDER page on a system without running into
>> soft lockups, something else is already completely bogus.".
>>
>> That would be 8192 pages on arm64. Anybody freeing a PMD-mapped THP would be in
>> trouble already and should just reconsider life choices running such a machine.
>>
>> We could have 511 more pages, yes. If 8192 don't trigger a soft-lockup, I am
>> confident that 511 more pages won't make a difference.
>>
>> But, if that ever is a problem, we can butcher this code as much as we want,
>> because performance with poisoning/zeroing is already down the drain.
>>
>> As you say, splitting even further is messy, so I rather avoid that unless
>> really required.
>>
>
> Yep ok, I understand the argument better now - thanks.
>
I'll further extend the patch description.
Thanks!
--
Cheers,
David / dhildenb
On 12/02/2024 11:05, David Hildenbrand wrote:
> On 12.02.24 11:56, David Hildenbrand wrote:
>> [...]
>
> I think we can just reduce the code duplication easily:
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index d175c0f1e2c8..99b3e9408aa0 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
> }
> #endif
>
> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> -{
> - struct mmu_gather_batch *batch;
> +/*
> + * We might end up freeing a lot of pages. Reschedule on a regular
> + * basis to avoid soft lockups in configurations without full
> + * preemption enabled. The magic number of 512 folios seems to work.
> + */
> +#define MAX_NR_FOLIOS_PER_FREE 512
>
> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
> - struct encoded_page **pages = batch->encoded_pages;
> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
> +{
> + struct encoded_page **pages = batch->encoded_pages;
> + unsigned int nr, nr_pages;
>
> - while (batch->nr) {
> - /*
> - * limit free batch count when PAGE_SIZE > 4K
> - */
> - unsigned int nr = min(512U, batch->nr);
> + while (batch->nr) {
> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
> + nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
>
> /*
> * Make sure we cover page + nr_pages, and don't leave
> @@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> if (unlikely(encoded_page_flags(pages[nr - 1]) &
> ENCODED_PAGE_BIT_NR_PAGES_NEXT))
> nr++;
> + } else {
> + /*
> + * With page poisoning and init_on_free, the time it
> + * takes to free memory grows proportionally with the
> + * actual memory size. Therefore, limit based on the
> + * actual memory size and not the number of involved
> + * folios.
> + */
> + for (nr = 0, nr_pages = 0;
> + nr < batch->nr && nr_pages < MAX_NR_FOLIOS_PER_FREE;
> + nr++) {
> + if (unlikely(encoded_page_flags(pages[nr]) &
> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
> + nr_pages += encoded_nr_pages(pages[++nr]);
> + else
> + nr_pages++;
> + }
> + }
>
> - free_pages_and_swap_cache(pages, nr);
> - pages += nr;
> - batch->nr -= nr;
> + free_pages_and_swap_cache(pages, nr);
> + pages += nr;
> + batch->nr -= nr;
>
> - cond_resched();
> - }
> + cond_resched();
> }
> +}
> +
> +static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> +{
> + struct mmu_gather_batch *batch;
> +
> + for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
> + __tlb_batch_free_encoded_pages(batch);
> tlb->active = &tlb->local;
> }
>
Yes this is much cleaner IMHO! I don't think putting the poison and init_on_free
checks inside the while loop should make a whole lot of difference - you're
only going round that loop once in the common (4K pages) case.
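(Rough arithmetic to back that up -- a throwaway userspace sketch; the header
and slot sizes are assumptions, not the real mmu_gather_batch layout:)

#include <stdio.h>

/*
 * Illustration only: the header and slot sizes below are assumptions,
 * not the kernel's actual struct mmu_gather_batch layout.
 */
#define BATCH_PAGE_SIZE		4096u
#define ASSUMED_HEADER_SIZE	32u
#define ASSUMED_SLOT_SIZE	8u	/* one encoded_page pointer */

int main(void)
{
	unsigned int slots =
		(BATCH_PAGE_SIZE - ASSUMED_HEADER_SIZE) / ASSUMED_SLOT_SIZE;

	/* ~508 slots, i.e. fewer than 512, so one pass frees the whole batch. */
	printf("encoded_page slots per 4K batch: %u\n", slots);
	return 0;
}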
Reviewed-by: Ryan Roberts <[email protected]>
On 12.02.24 11:56, David Hildenbrand wrote:
> On 12.02.24 11:32, Ryan Roberts wrote:
>> On 12/02/2024 10:11, David Hildenbrand wrote:
>>> Hi Ryan,
>>>
>>>>> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>>>> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
>>>>> {
>>>>> - struct mmu_gather_batch *batch;
>>>>> -
>>>>> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>>>>> - struct encoded_page **pages = batch->encoded_pages;
>>>>> + struct encoded_page **pages = batch->encoded_pages;
>>>>> + unsigned int nr, nr_pages;
>>>>> + /*
>>>>> + * We might end up freeing a lot of pages. Reschedule on a regular
>>>>> + * basis to avoid soft lockups in configurations without full
>>>>> + * preemption enabled. The magic number of 512 folios seems to work.
>>>>> + */
>>>>> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
>>>>
>>>> Is the performance win really worth 2 separate implementations keyed off this?
>>>> It seems a bit fragile, in case any other operations get added to free which are
>>>> proportional to size in future. Why not just always do the conservative version?
>>>
>>> I really don't want to iterate over all entries on the "sane" common case. We
>>> already do that two times:
>>>
>>> a) free_pages_and_swap_cache()
>>>
>>> b) release_pages()
>>>
>>> Only the latter really is required, and I'm planning on removing the one in (a)
>>> to move it into (b) as well.
>>>
>>> So I keep it separate to confine any unnecessary overhead to the setups that are
>>> already terribly slow.
>>>
>>> No need to iterate a page full of entries if it can be easily avoided.
>>> Especially, no need to degrade the common order-0 case.
>>
>> Yeah, I understand all that. But given this is all coming from an array (so
>> easy to prefetch?) and will presumably all fit in the cache for the common case,
>> at least, so it's hot for (a) and (b), does separating this out really make a
>> measurable performance difference? If yes then absolutely this optimization
>> makes sense. But if not, I think it's a bit questionable.
>
> I primarily added it because
>
> (a) we learned that each cycle counts during munmap() just like it does
> during fork().
>
> (b) Linus was similarly concerned about optimizing out another batching
> walk in c47454823bd4 ("mm: mmu_gather: allow more than one batch of
> delayed rmaps"):
>
> "it needs to walk that array of pages while still holding the page table
> lock, and our mmu_gather infrastructure allows for batching quite a lot
> of pages. We may have thousands of pages queued up for freeing, and we
> wanted to walk only the last batch if we then added a dirty page to the
> queue."
>
> So if it matters enough for reducing the time we hold the page table
> lock, it surely adds "some" overhead in general.
>
>
>>
>> You're the boss though, so if your experience tells you this is necessary, then
>> I'm ok with that.
>
> I did not do any measurements myself, I just did that intuitively as
> above. After all, it's all pretty straightforward (keeping the existing
> logic, we need a new one either way) and not that much code.
>
> So unless there are strong opinions, I'd just leave the common case as
> it was, and the odd case be special.
I think we can just reduce the code duplication easily:
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..99b3e9408aa0 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
}
#endif
-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
-{
- struct mmu_gather_batch *batch;
+/*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+#define MAX_NR_FOLIOS_PER_FREE 512
- for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- struct encoded_page **pages = batch->encoded_pages;
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+{
+ struct encoded_page **pages = batch->encoded_pages;
+ unsigned int nr, nr_pages;
- while (batch->nr) {
- /*
- * limit free batch count when PAGE_SIZE > 4K
- */
- unsigned int nr = min(512U, batch->nr);
+ while (batch->nr) {
+ if (!page_poisoning_enabled_static() && !want_init_on_free()) {
+ nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
/*
* Make sure we cover page + nr_pages, and don't leave
@@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
if (unlikely(encoded_page_flags(pages[nr - 1]) &
ENCODED_PAGE_BIT_NR_PAGES_NEXT))
nr++;
+ } else {
+ /*
+ * With page poisoning and init_on_free, the time it
+ * takes to free memory grows proportionally with the
+ * actual memory size. Therefore, limit based on the
+ * actual memory size and not the number of involved
+ * folios.
+ */
+ for (nr = 0, nr_pages = 0;
+ nr < batch->nr && nr_pages < MAX_NR_FOLIOS_PER_FREE;
+ nr++) {
+ if (unlikely(encoded_page_flags(pages[nr]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_pages += encoded_nr_pages(pages[++nr]);
+ else
+ nr_pages++;
+ }
+ }
- free_pages_and_swap_cache(pages, nr);
- pages += nr;
- batch->nr -= nr;
+ free_pages_and_swap_cache(pages, nr);
+ pages += nr;
+ batch->nr -= nr;
- cond_resched();
- }
+ cond_resched();
}
+}
+
+static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+{
+ struct mmu_gather_batch *batch;
+
+ for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
+ __tlb_batch_free_encoded_pages(batch);
tlb->active = &tlb->local;
}
--
2.43.0
--
Cheers,
David / dhildenb
On 12.02.24 12:21, Ryan Roberts wrote:
> On 12/02/2024 11:05, David Hildenbrand wrote:
>> On 12.02.24 11:56, David Hildenbrand wrote:
>>> On 12.02.24 11:32, Ryan Roberts wrote:
>>>> On 12/02/2024 10:11, David Hildenbrand wrote:
>>>>> Hi Ryan,
>>>>>
>>>>>>> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>>>>>> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
>>>>>>> {
>>>>>>> - struct mmu_gather_batch *batch;
>>>>>>> -
>>>>>>> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>>>>>>> - struct encoded_page **pages = batch->encoded_pages;
>>>>>>> + struct encoded_page **pages = batch->encoded_pages;
>>>>>>> + unsigned int nr, nr_pages;
>>>>>>> + /*
>>>>>>> + * We might end up freeing a lot of pages. Reschedule on a regular
>>>>>>> + * basis to avoid soft lockups in configurations without full
>>>>>>> + * preemption enabled. The magic number of 512 folios seems to work.
>>>>>>> + */
>>>>>>> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
>>>>>>
>>>>>> Is the performance win really worth 2 separate implementations keyed off this?
>>>>>> It seems a bit fragile, in case any other operations get added to free
>>>>>> which are proportional to size in future. Why not just always do the
>>>>>> conservative version?
>>>>>
>>>>> I really don't want to iterate over all entries on the "sane" common case. We
>>>>> already do that two times:
>>>>>
>>>>> a) free_pages_and_swap_cache()
>>>>>
>>>>> b) release_pages()
>>>>>
>>>>> Only the latter really is required, and I'm planning on removing the one in (a)
>>>>> to move it into (b) as well.
>>>>>
>>>>> So I keep it separate to confine any unnecessary overhead to the setups that are
>>>>> already terribly slow.
>>>>>
>>>>> No need to iterate a page full of entries if it can be easily avoided.
>>>>> Especially, no need to degrade the common order-0 case.
>>>>
>>>> Yeah, I understand all that. But given this is all coming from an array (so
>>>> easy to prefetch?) and will presumably all fit in the cache for the common case,
>>>> at least, so it's hot for (a) and (b), does separating this out really make a
>>>> measurable performance difference? If yes then absolutely this optimization
>>>> makes sense. But if not, I think it's a bit questionable.
>>>
>>> I primarily added it because
>>>
>>> (a) we learned that each cycle counts during munmap() just like it does
>>> during fork().
>>>
>>> (b) Linus was similarly concerned about optimizing out another batching
>>> walk in c47454823bd4 ("mm: mmu_gather: allow more than one batch of
>>> delayed rmaps"):
>>>
>>> "it needs to walk that array of pages while still holding the page table
>>> lock, and our mmu_gather infrastructure allows for batching quite a lot
>>> of pages. We may have thousands of pages queued up for freeing, and we
>>> wanted to walk only the last batch if we then added a dirty page to the
>>> queue."
>>>
>>> So if it matters enough for reducing the time we hold the page table
>>> lock, it surely adds "some" overhead in general.
>>>
>>>
>>>>
>>>> You're the boss though, so if your experience tells you this is necessary, then
>>>> I'm ok with that.
>>>
>>> I did not do any measurements myself, I just did that intuitively as
>>> above. After all, it's all pretty straightforward (keeping the existing
>>> logic, we need a new one either way) and not that much code.
>>>
>>> So unless there are strong opinions, I'd just leave the common case as
>>> it was, and the odd case be special.
>>
>> I think we can just reduce the code duplication easily:
>>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index d175c0f1e2c8..99b3e9408aa0 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
>> }
>> #endif
>>
>> -static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>> -{
>> - struct mmu_gather_batch *batch;
>> +/*
>> + * We might end up freeing a lot of pages. Reschedule on a regular
>> + * basis to avoid soft lockups in configurations without full
>> + * preemption enabled. The magic number of 512 folios seems to work.
>> + */
>> +#define MAX_NR_FOLIOS_PER_FREE 512
>>
>> - for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>> - struct encoded_page **pages = batch->encoded_pages;
>> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
>> +{
>> + struct encoded_page **pages = batch->encoded_pages;
>> + unsigned int nr, nr_pages;
>>
>> - while (batch->nr) {
>> - /*
>> - * limit free batch count when PAGE_SIZE > 4K
>> - */
>> - unsigned int nr = min(512U, batch->nr);
>> + while (batch->nr) {
>> + if (!page_poisoning_enabled_static() && !want_init_on_free()) {
>> + nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
>>
>> /*
>> * Make sure we cover page + nr_pages, and don't leave
>> @@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>> if (unlikely(encoded_page_flags(pages[nr - 1]) &
>> ENCODED_PAGE_BIT_NR_PAGES_NEXT))
>> nr++;
>> + } else {
>> + /*
>> + * With page poisoning and init_on_free, the time it
>> + * takes to free memory grows proportionally with the
>> + * actual memory size. Therefore, limit based on the
>> + * actual memory size and not the number of involved
>> + * folios.
>> + */
>> + for (nr = 0, nr_pages = 0;
>> + nr < batch->nr && nr_pages < MAX_NR_FOLIOS_PER_FREE;
>> + nr++) {
>> + if (unlikely(encoded_page_flags(pages[nr]) &
>> + ENCODED_PAGE_BIT_NR_PAGES_NEXT))
>> + nr_pages += encoded_nr_pages(pages[++nr]);
>> + else
>> + nr_pages++;
>> + }
>> + }
>>
>> - free_pages_and_swap_cache(pages, nr);
>> - pages += nr;
>> - batch->nr -= nr;
>> + free_pages_and_swap_cache(pages, nr);
>> + pages += nr;
>> + batch->nr -= nr;
>>
>> - cond_resched();
>> - }
>> + cond_resched();
>> }
>> +}
>> +
>> +static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>> +{
>> + struct mmu_gather_batch *batch;
>> +
>> + for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
>> + __tlb_batch_free_encoded_pages(batch);
>> tlb->active = &tlb->local;
>> }
>>
>
> Yes this is much cleaner IMHO! I don't think putting the poison and init_on_free
> checks inside the while loop should make a whole lot of difference - you're
> only going round that loop once in the common (4K pages) case.
Exactly.
>
> Reviewed-by: Ryan Roberts <[email protected]>
Thanks, this is the full patch, including the extended patch
description:
From 5518fb32b950154794380d029eef8751af8c9804 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Fri, 9 Feb 2024 18:43:11 +0100
Subject: [PATCH] mm/mmu_gather: improve cond_resched() handling with large
folios and expensive page freeing
In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or
now up to 256 folio fragments that span more than one page, before we
conditionally reschedule.
It's a pain that we have to handle cond_resched() in
tlb_batch_pages_flush() manually and cannot simply handle it in
release_pages() -- release_pages() can be called from atomic context.
Well, in a perfect world we wouldn't have to make our code more
complicated at all.
With page poisoning and init_on_free, we might now run into soft lockups
when we free a lot of rather large folio fragments, because page freeing
time then depends on the actual memory size we are freeing instead of on
the number of folios that are involved.
In the absolute (unlikely) worst case, on arm64 with 64k pages we will be able
to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
GiB does sound like it might take a while. But instead of ignoring this
unlikely case, let's just handle it.
So, let's teach tlb_batch_pages_flush() that there are some
configurations where page freeing is horribly slow, and let's reschedule
more frequently -- similar to what we already did before we had large folio
fragments in there. Avoid yet another loop over all encoded pages in the
common case by handling that separately.
Note that with page poisoning/zeroing, we might now end up freeing only a
single folio fragment at a time that might exceed the old 512-page limit:
but if we cannot even free a single MAX_ORDER page on a system without
running into soft lockups, something else is already completely bogus.
Freeing a PMD-mapped THP would similarly cause trouble.
In theory, we might even free 511 order-0 pages + a single MAX_ORDER page,
effectively having to zero out 8703 (511 + 8192) pages on arm64 with 64k,
translating to ~544 MiB of memory: however, if 512 MiB doesn't result in
soft lockups, 544 MiB is unlikely to result in soft lockups, so we won't
care about that for the time being.
In the future, we might want to detect if handling cond_resched() is
required at all, and just not do any of that with full preemption enabled.
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/mmu_gather.c | 58 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 43 insertions(+), 15 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..99b3e9408aa0 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
}
#endif
-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
-{
- struct mmu_gather_batch *batch;
+/*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+#define MAX_NR_FOLIOS_PER_FREE 512
- for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- struct encoded_page **pages = batch->encoded_pages;
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+{
+ struct encoded_page **pages = batch->encoded_pages;
+ unsigned int nr, nr_pages;
- while (batch->nr) {
- /*
- * limit free batch count when PAGE_SIZE > 4K
- */
- unsigned int nr = min(512U, batch->nr);
+ while (batch->nr) {
+ if (!page_poisoning_enabled_static() && !want_init_on_free()) {
+ nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
/*
* Make sure we cover page + nr_pages, and don't leave
@@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
if (unlikely(encoded_page_flags(pages[nr - 1]) &
ENCODED_PAGE_BIT_NR_PAGES_NEXT))
nr++;
+ } else {
+ /*
+ * With page poisoning and init_on_free, the time it
+ * takes to free memory grows proportionally with the
+ * actual memory size. Therefore, limit based on the
+ * actual memory size and not the number of involved
+ * folios.
+ */
+ for (nr = 0, nr_pages = 0;
+ nr < batch->nr && nr_pages < MAX_NR_FOLIOS_PER_FREE;
+ nr++) {
+ if (unlikely(encoded_page_flags(pages[nr]) &
+ ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+ nr_pages += encoded_nr_pages(pages[++nr]);
+ else
+ nr_pages++;
+ }
+ }
- free_pages_and_swap_cache(pages, nr);
- pages += nr;
- batch->nr -= nr;
+ free_pages_and_swap_cache(pages, nr);
+ pages += nr;
+ batch->nr -= nr;
- cond_resched();
- }
+ cond_resched();
}
+}
+
+static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+{
+ struct mmu_gather_batch *batch;
+
+ for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
+ __tlb_batch_free_encoded_pages(batch);
tlb->active = &tlb->local;
}
--
2.43.0
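In case it helps to reason about the conservative walk outside the kernel,
here's a rough standalone model of just that counting loop -- mocked types
and a made-up two-slot encoding, not the real encoded_page bits:

#include <stdbool.h>
#include <stdio.h>

/*
 * Standalone model for illustration only. The kernel packs the flag and
 * the page count into encoded_page pointer bits; here a flagged slot is
 * simply assumed to always be followed by a slot carrying the count,
 * which is what the kernel guarantees as well.
 */
struct mock_entry {
	bool nr_pages_next;	/* models ENCODED_PAGE_BIT_NR_PAGES_NEXT */
	unsigned int nr_pages;	/* only valid in the slot after a flagged one */
};

#define MAX_NR_FOLIOS_PER_FREE	512

/* How many slots to consume so that roughly MAX_NR_FOLIOS_PER_FREE actual
 * pages get freed in this round; may overshoot by one folio fragment. */
static unsigned int count_slots(const struct mock_entry *e, unsigned int total)
{
	unsigned int nr, nr_pages;

	for (nr = 0, nr_pages = 0;
	     nr < total && nr_pages < MAX_NR_FOLIOS_PER_FREE; nr++) {
		if (e[nr].nr_pages_next)
			nr_pages += e[++nr].nr_pages;	/* large folio fragment */
		else
			nr_pages++;			/* plain order-0 page */
	}
	return nr;
}

int main(void)
{
	/* one order-0 page, then a 512-page fragment occupying two slots */
	struct mock_entry batch[] = {
		{ .nr_pages_next = false },
		{ .nr_pages_next = true }, { .nr_pages = 512 },
	};

	printf("slots consumed: %u\n",
	       count_slots(batch, sizeof(batch) / sizeof(batch[0])));
	return 0;
}

The fast path in the patch avoids this per-entry walk entirely and only checks
the flag on the last selected slot, so it never cuts a page+count pair in half.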
--
Cheers,
David / dhildenb