2024-01-29 14:47:04

by David Hildenbrand

Subject: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio page batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable | New | Change
---------------------------------------------
4KiB | 0.058110 | 0.057715 | - 1%
16KiB | 0.044198 | 0.035469 | -20%
32KiB | 0.034216 | 0.023522 | -31%
64KiB | 0.029207 | 0.018434 | -37%
128KiB | 0.026579 | 0.014026 | -47%
256KiB | 0.025130 | 0.011756 | -53%
512KiB | 0.024292 | 0.010703 | -56%
1024KiB | 0.023812 | 0.010294 | -57%
2048KiB | 0.023785 | 0.009910 | -58%
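
For reference, a minimal user-space sketch of the kind of microbenchmark
described above (this is an illustrative approximation, not the actual test
harness used for these numbers; folio sizes would be controlled separately,
e.g., via the mTHP interface mentioned in [5]):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE    (1024UL * 1024 * 1024)  /* 1 GiB */

int main(void)
{
        struct timespec start, end;
        char *mem;

        mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Fault in all pages so there is something to unmap/zap. */
        memset(mem, 1, SIZE);

        clock_gettime(CLOCK_MONOTONIC, &start);
        if (munmap(mem, SIZE)) {
                perror("munmap");
                return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("munmap: %f s\n", (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9);
        return 0;
}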

CCing especially s390x folks, because they have TLB freeing hooks that
need adjustment. Only tested on x86-64 for now; I will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negligible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lkml.kernel.org/r/[email protected]
[5] https://lkml.kernel.org/r/[email protected]

Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

David Hildenbrand (9):
mm/memory: factor out zapping of present pte into zap_present_pte()
mm/memory: handle !page case in zap_present_pte() separately
mm/memory: further separate anon and pagecache folio handling in
zap_present_pte()
mm/memory: factor out zapping folio pte into zap_present_folio_pte()
mm/mmu_gather: pass "delay_rmap" instead of encoded page to
__tlb_remove_page_size()
mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
mm/mmu_gather: add __tlb_remove_folio_pages()
mm/mmu_gather: add tlb_remove_tlb_entries()
mm/memory: optimize unmap/zap with PTE-mapped THP

arch/powerpc/include/asm/tlb.h | 2 +
arch/s390/include/asm/tlb.h | 30 ++++--
include/asm-generic/tlb.h | 40 ++++++--
include/linux/mm_types.h | 37 ++++++--
include/linux/pgtable.h | 66 +++++++++++++
mm/memory.c | 167 +++++++++++++++++++++++----------
mm/mmu_gather.c | 63 +++++++++++--
mm/swap.c | 12 ++-
mm/swap_state.c | 12 ++-
9 files changed, 347 insertions(+), 82 deletions(-)

--
2.43.0



2024-01-29 14:51:56

by David Hildenbrand

Subject: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large
folio.

Most infrastructure we need for batching (mmu gather, rmap) is already
there. We only have to add get_and_clear_full_ptes() and
clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.

We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process.

To keep small folios as fast as possible, force inlining of a specialized
variant using __always_inline with nr=1.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/pgtable.h | 66 +++++++++++++++++++++++++++++
mm/memory.c | 92 +++++++++++++++++++++++++++++------------
2 files changed, 132 insertions(+), 26 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..f0feae7f89fb 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -580,6 +580,72 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
}
#endif

+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
+ * folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
+ * returned PTE.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+ pte_t pte, tmp_pte;
+
+ pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+ while (--nr) {
+ ptep++;
+ addr += PAGE_SIZE;
+ tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+ if (pte_dirty(tmp_pte))
+ pte = pte_mkdirty(pte);
+ if (pte_young(tmp_pte))
+ pte = pte_mkyoung(pte);
+ }
+ return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr, int full)
+{
+ for (;;) {
+ ptep_get_and_clear_full(mm, addr, ptep, full);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+}
+#endif

/*
* If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index a2190d7cfa74..38a010c4d04d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
*/
static inline void
zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
- unsigned long addr, pte_t *pte,
+ unsigned long addr, pte_t *pte, int nr,
struct zap_details *details, pte_t pteval)
{
/* Zap on anonymous always means dropping everything */
@@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
if (zap_drop_file_uffd_wp(details))
return;

- pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+ for (;;) {
+ /* the PFN in the PTE is irrelevant. */
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+ if (--nr == 0)
+ break;
+ pte++;
+ addr += PAGE_SIZE;
+ }
}

-static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
struct vm_area_struct *vma, struct folio *folio,
- struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
- struct zap_details *details, int *rss, bool *force_flush,
- bool *force_break)
+ struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
+ unsigned long addr, struct zap_details *details, int *rss,
+ bool *force_flush, bool *force_break)
{
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;

if (!folio_test_anon(folio)) {
- ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1548,36 +1555,49 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
- rss[mm_counter(folio)]--;
+ rss[mm_counter(folio)] -= nr;
} else {
/* We don't need up-to-date accessed/dirty bits. */
- ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
- rss[MM_ANONPAGES]--;
+ clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+ rss[MM_ANONPAGES] -= nr;
}
+ /* Checking a single PTE in a batch is sufficient. */
arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
+ tlb_remove_tlb_entries(tlb, pte, nr, addr);
if (unlikely(userfaultfd_pte_wp(vma, ptent)))
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
+ ptent);

if (!delay_rmap) {
- folio_remove_rmap_pte(folio, page, vma);
+ folio_remove_rmap_ptes(folio, page, nr, vma);
+
+ /* Only sanity-check the first page in a batch. */
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
}
- if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+ if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
*force_flush = true;
*force_break = true;
}
}

-static inline void zap_present_pte(struct mmu_gather *tlb,
+/*
+ * Zap or skip one present PTE, trying to batch-process subsequent PTEs that map
+ * consecutive pages of the same folio.
+ *
+ * Returns the number of processed (skipped or zapped) PTEs (at least 1).
+ */
+static inline int zap_present_ptes(struct mmu_gather *tlb,
struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
- unsigned long addr, struct zap_details *details,
- int *rss, bool *force_flush, bool *force_break)
+ unsigned int max_nr, unsigned long addr,
+ struct zap_details *details, int *rss, bool *force_flush,
+ bool *force_break)
{
+ const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
struct mm_struct *mm = tlb->mm;
struct folio *folio;
struct page *page;
+ int nr;

page = vm_normal_page(vma, addr, ptent);
if (!page) {
@@ -1587,14 +1607,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
tlb_remove_tlb_entry(tlb, pte, addr);
VM_WARN_ON_ONCE(userfaultfd_wp(vma));
ksm_might_unmap_zero_page(mm, ptent);
- return;
+ return 1;
}

folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
- return;
- zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
- rss, force_flush, force_break);
+ return 1;
+
+ /*
+ * Make sure that the common "small folio" case is as fast as possible
+ * by keeping the batching logic separate.
+ */
+ if (unlikely(folio_test_large(folio) && max_nr != 1)) {
+ nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
+ NULL);
+
+ zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
+ addr, details, rss, force_flush,
+ force_break);
+ return nr;
+ }
+ zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
+ details, rss, force_flush, force_break);
+ return 1;
}

static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1609,6 +1644,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *start_pte;
pte_t *pte;
swp_entry_t entry;
+ int nr;

tlb_change_page_size(tlb, PAGE_SIZE);
init_rss_vec(rss);
@@ -1622,7 +1658,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t ptent = ptep_get(pte);
struct folio *folio = NULL;
struct page *page;
+ int max_nr;

+ nr = 1;
if (pte_none(ptent))
continue;

@@ -1630,10 +1668,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
break;

if (pte_present(ptent)) {
- zap_present_pte(tlb, vma, pte, ptent, addr, details,
- rss, &force_flush, &force_break);
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss, &force_flush,
+ &force_break);
if (unlikely(force_break)) {
- addr += PAGE_SIZE;
+ addr += nr * PAGE_SIZE;
break;
}
continue;
@@ -1687,8 +1727,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
WARN_ON_ONCE(1);
}
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+ } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);

add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
--
2.43.0


2024-01-30 09:14:00

by David Hildenbrand

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

Re-reading the docs myself:

> +#ifndef get_and_clear_full_ptes
> +/**
> + * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
> + * folio, collecting dirty/accessed bits.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
> + * returned PTE.

"into the"

> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +#endif
> +
> +#ifndef clear_full_ptes
> +/**
> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.

Something went missing:

May be overridden by the architecture; otherwise, implemented as a
simple loop over ptep_get_and_clear_full().


--
Cheers,

David / dhildenb


2024-01-30 09:50:03

by Ryan Roberts

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 29/01/2024 14:32, David Hildenbrand wrote:
> Similar to how we optimized fork(), let's implement PTE batching when
> consecutive (present) PTEs map consecutive pages of the same large
> folio.
>
> Most infrastructure we need for batching (mmu gather, rmap) is already
> there. We only have to add get_and_clear_full_ptes() and
> clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
> process a PTE range.
>
> We won't bother sanity-checking the mapcount of all subpages, but only
> check the mapcount of the first subpage we process.
>
> To keep small folios as fast as possible force inlining of a specialized
> variant using __always_inline with nr=1.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/pgtable.h | 66 +++++++++++++++++++++++++++++
> mm/memory.c | 92 +++++++++++++++++++++++++++++------------
> 2 files changed, 132 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index aab227e12493..f0feae7f89fb 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -580,6 +580,72 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> }
> #endif
>
> +#ifndef get_and_clear_full_ptes
> +/**
> + * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
> + * folio, collecting dirty/accessed bits.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
> + * returned PTE.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +#endif
> +
> +#ifndef clear_full_ptes
> +/**
> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.

I know it's implied from "pages of the same folio" (and even more so for the
above variant due to mention of access/dirty), but I wonder if it's useful to
explicitly state that "all ptes being cleared are present at the time of the call"?

> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + for (;;) {
> + ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +#endif
>
> /*
> * If two threads concurrently fault at the same page, the thread that
> diff --git a/mm/memory.c b/mm/memory.c
> index a2190d7cfa74..38a010c4d04d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
> */
> static inline void
> zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *pte,
> + unsigned long addr, pte_t *pte, int nr,
> struct zap_details *details, pte_t pteval)
> {
> /* Zap on anonymous always means dropping everything */
> @@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> if (zap_drop_file_uffd_wp(details))
> return;
>
> - pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> + for (;;) {
> + /* the PFN in the PTE is irrelevant. */
> + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> + if (--nr == 0)
> + break;
> + pte++;
> + addr += PAGE_SIZE;
> + }
> }
>
> -static inline void zap_present_folio_pte(struct mmu_gather *tlb,
> +static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> struct vm_area_struct *vma, struct folio *folio,
> - struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
> - struct zap_details *details, int *rss, bool *force_flush,
> - bool *force_break)
> + struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
> + unsigned long addr, struct zap_details *details, int *rss,
> + bool *force_flush, bool *force_break)
> {
> struct mm_struct *mm = tlb->mm;
> bool delay_rmap = false;
>
> if (!folio_test_anon(folio)) {
> - ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> if (pte_dirty(ptent)) {
> folio_mark_dirty(folio);
> if (tlb_delay_rmap(tlb)) {
> @@ -1548,36 +1555,49 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb,
> }
> if (pte_young(ptent) && likely(vma_has_recency(vma)))
> folio_mark_accessed(folio);
> - rss[mm_counter(folio)]--;
> + rss[mm_counter(folio)] -= nr;
> } else {
> /* We don't need up-to-date accessed/dirty bits. */
> - ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> - rss[MM_ANONPAGES]--;
> + clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> + rss[MM_ANONPAGES] -= nr;
> }
> + /* Checking a single PTE in a batch is sufficient. */
> arch_check_zapped_pte(vma, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> + tlb_remove_tlb_entries(tlb, pte, nr, addr);
> if (unlikely(userfaultfd_pte_wp(vma, ptent)))
> - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
> + ptent);
>
> if (!delay_rmap) {
> - folio_remove_rmap_pte(folio, page, vma);
> + folio_remove_rmap_ptes(folio, page, nr, vma);
> +
> + /* Only sanity-check the first page in a batch. */
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);

Is there a case for either removing this all together or moving it into
folio_remove_rmap_ptes()? It seems odd to only check some pages.


> }
> - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> + if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
> *force_flush = true;
> *force_break = true;
> }
> }
>
> -static inline void zap_present_pte(struct mmu_gather *tlb,
> +/*
> + * Zap or skip one present PTE, trying to batch-process subsequent PTEs that map

Zap or skip *at least* one... ?

> + * consecutive pages of the same folio.
> + *
> + * Returns the number of processed (skipped or zapped) PTEs (at least 1).
> + */
> +static inline int zap_present_ptes(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> - unsigned long addr, struct zap_details *details,
> - int *rss, bool *force_flush, bool *force_break)
> + unsigned int max_nr, unsigned long addr,
> + struct zap_details *details, int *rss, bool *force_flush,
> + bool *force_break)
> {
> + const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> struct mm_struct *mm = tlb->mm;
> struct folio *folio;
> struct page *page;
> + int nr;
>
> page = vm_normal_page(vma, addr, ptent);
> if (!page) {
> @@ -1587,14 +1607,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
> tlb_remove_tlb_entry(tlb, pte, addr);
> VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> ksm_might_unmap_zero_page(mm, ptent);
> - return;
> + return 1;
> }
>
> folio = page_folio(page);
> if (unlikely(!should_zap_folio(details, folio)))
> - return;
> - zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
> - rss, force_flush, force_break);
> + return 1;
> +
> + /*
> + * Make sure that the common "small folio" case is as fast as possible
> + * by keeping the batching logic separate.
> + */
> + if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> + NULL);
> +
> + zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
> + addr, details, rss, force_flush,
> + force_break);
> + return nr;
> + }
> + zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
> + details, rss, force_flush, force_break);
> + return 1;
> }
>
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> @@ -1609,6 +1644,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pte_t *start_pte;
> pte_t *pte;
> swp_entry_t entry;
> + int nr;
>
> tlb_change_page_size(tlb, PAGE_SIZE);
> init_rss_vec(rss);
> @@ -1622,7 +1658,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pte_t ptent = ptep_get(pte);
> struct folio *folio = NULL;
> struct page *page;
> + int max_nr;
>
> + nr = 1;
> if (pte_none(ptent))
> continue;
>
> @@ -1630,10 +1668,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> break;
>
> if (pte_present(ptent)) {
> - zap_present_pte(tlb, vma, pte, ptent, addr, details,
> - rss, &force_flush, &force_break);
> + max_nr = (end - addr) / PAGE_SIZE;
> + nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
> + addr, details, rss, &force_flush,
> + &force_break);
> if (unlikely(force_break)) {
> - addr += PAGE_SIZE;
> + addr += nr * PAGE_SIZE;
> break;
> }
> continue;
> @@ -1687,8 +1727,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> WARN_ON_ONCE(1);
> }
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
> + } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> add_mm_rss_vec(mm, rss);
> arch_leave_lazy_mmu_mode();


2024-01-31 02:25:58

by Yin, Fengwei

Subject: Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 1/29/24 22:32, David Hildenbrand wrote:
> This series is based on [1] and must be applied on top of it.
> Similar to what we did with fork(), let's implement PTE batching
> during unmap/zap when processing PTE-mapped THPs.
>
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
> entry removal once per batch.
>
> Ryan was previously working on this in the context of cont-pte for
> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
> This series implements the optimization for all architectures, independent
> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
> large-folio page batches as well, and makes use of our new rmap batching
> function when removing the rmap.
>
> To achieve that, we have to enlighten MMU gather / page freeing code
> (i.e., everything that consumes encoded_page) to process unmapping
> of consecutive pages that all belong to the same large folio. I'm being
> very careful to not degrade order-0 performance, and it looks like I
> managed to achieve that.

One possible scenario:
If all the folios are 2M folios, then one full batch could hold 510M of memory.
Is that too much, given that before this change one full batch could only hold
(2M - 4096 * 2) of memory?


Regards
Yin, Fengwei


2024-01-31 02:40:10

by Yin, Fengwei

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP



On 1/29/24 22:32, David Hildenbrand wrote:
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
I am wondering whether it's worthwhile to move the pte_mkdirty() and pte_mkyoung()
out of the loop and just do them once if needed. The worst case is that they
are called nr - 1 times. Or is it just too micro?


Regards
Yin, Fengwei

> + }
> + return pte;
> +}
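
For illustration, the suggestion above would amount to something like the
following sketch (illustrative only, not proposed code; it reuses the helpers
from the patch and merely restructures the loop):

static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
                unsigned long addr, pte_t *ptep, unsigned int nr, int full)
{
        bool any_dirty = false, any_young = false;
        pte_t pte, tmp_pte;

        pte = ptep_get_and_clear_full(mm, addr, ptep, full);
        while (--nr) {
                ptep++;
                addr += PAGE_SIZE;
                tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
                /* Only collect booleans inside the loop ... */
                any_dirty |= pte_dirty(tmp_pte);
                any_young |= pte_young(tmp_pte);
        }
        /* ... and modify the returned PTE at most once afterwards. */
        if (any_dirty)
                pte = pte_mkdirty(pte);
        if (any_young)
                pte = pte_mkyoung(pte);
        return pte;
}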

2024-01-31 10:25:43

by David Hildenbrand

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP


>> +
>> +#ifndef clear_full_ptes
>> +/**
>> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
>
> I know its implied from "pages of the same folio" (and even more so for the
> above variant due to mention of access/dirty), but I wonder if its useful to
> explicitly state that "all ptes being cleared are present at the time of the call"?

"Clear PTEs" -> "Clear present PTEs" ?

That should make it clearer.

[...]

>> if (!delay_rmap) {
>> - folio_remove_rmap_pte(folio, page, vma);
>> + folio_remove_rmap_ptes(folio, page, nr, vma);
>> +
>> + /* Only sanity-check the first page in a batch. */
>> if (unlikely(page_mapcount(page) < 0))
>> print_bad_pte(vma, addr, ptent, page);
>
> Is there a case for either removing this all together or moving it into
> folio_remove_rmap_ptes()? It seems odd to only check some pages.
>

I really wanted to avoid another nasty loop here.

In my thinking, for 4k folios, or when zapping subpages of large folios,
we still perform the exact same checks. Only when batching we don't. So
if there is some problem, there are ways to get it triggered. And these
problems are barely ever seen.

folio_remove_rmap_ptes() feels like the better place -- especially
because the delayed-rmap handling is effectively unchecked. But in
there, we cannot "print_bad_pte()".

[background: if we had a total mapcount -- iow cheap folio_mapcount(),
I'd check here that the total mapcount does not underflow, instead of
checking per-subpage]

>
>> }
>> - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
>> + if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>> *force_flush = true;
>> *force_break = true;
>> }
>> }
>>
>> -static inline void zap_present_pte(struct mmu_gather *tlb,
>> +/*
>> + * Zap or skip one present PTE, trying to batch-process subsequent PTEs that map
>
> Zap or skip *at least* one... ?

Ack

--
Cheers,

David / dhildenb


2024-01-31 10:27:52

by Ryan Roberts

Subject: Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 31/01/2024 10:16, David Hildenbrand wrote:
> On 31.01.24 03:20, Yin Fengwei wrote:
>> On 1/29/24 22:32, David Hildenbrand wrote:
>>> This series is based on [1] and must be applied on top of it.
>>> Similar to what we did with fork(), let's implement PTE batching
>>> during unmap/zap when processing PTE-mapped THPs.
>>>
>>> We collect consecutive PTEs that map consecutive pages of the same large
>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>> the refcount only once per batch, (b) call rmap handling functions only
>>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>>> entry removal once per batch.
>>>
>>> Ryan was previously working on this in the context of cont-pte for
>>> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
>>> This series implements the optimization for all architectures, independent
>>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>>> large-folio page batches as well, and makes use of our new rmap batching
>>> function when removing the rmap.
>>>
>>> To achieve that, we have to enlighten MMU gather / page freeing code
>>> (i.e., everything that consumes encoded_page) to process unmapping
>>> of consecutive pages that all belong to the same large folio. I'm being
>>> very careful to not degrade order-0 performance, and it looks like I
>>> managed to achieve that.
>>
>
> Let's CC Linus and Michal to make sure I'm not daydreaming.
>
> Relevant patch:
>   https://lkml.kernel.org/r/[email protected]
>
> Context: I'm adjusting MMU gather code to support batching of consecutive pages
> that belong to the same large folio, when unmapping/zapping PTEs.
>
> For small folios, there is no (relevant) change.
>
> Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512 PTEs:
> Instead of adding 512 individual encoded_page entries, we add a combined entry
> that expresses "page+nr_pages". That allows for "easily" adding various other
> per-folio batching (refcount, rmap, swap freeing).
>
> The implication is that we can now batch effectively more pages with large
> folios, exceeding the old 10000 limit. The number of involved *folios* does not
> increase, though.
>
>> One possible scenario:
>> If all the folios are 2M folios, then one full batch could hold 510M of memory.
>> Is that too much, given that before this change one full batch could only hold
>> (2M - 4096 * 2) of memory?
>
> Excellent point, I think there are three parts to it:
>
> (1) Batch pages / folio fragments per batch page
>
> Before this change (and with 4k folios) we have exactly one page (4k) per
> encoded_page entry in the batch. Now, we can have (with 2M folios), 512 pages
> for every two encoded_page entries (page+nr_pages) in a batch page. So an
> average ~256 pages per encoded_page entry.
>
> So one batch page can now store in the worst case ~256 times the number of
> pages, but the number of folio fragments ("pages+nr_pages") would not increase.
>
> The time it takes to perform the actual page freeing of a batch will not be 256
> times higher -- the time is expected to be much closer to the old time (i.e.,
> not freeing more folios).

IIRC there is an option to zero memory when it is freed back to the buddy? So
that could be a place where time is proportional to size rather than
proportional to folio count? But I think that option is intended for debug only?
So perhaps not a problem in practice?


>
> (2) Delayed rmap handling
>
> We limit batching early (see tlb_next_batch()) when we have delayed rmap
> pending. Reason being, that we don't want to check for many entries if they
> require delayed rmap handling, while still holding the page table lock (see
> tlb_flush_rmaps()), because we have to remove the rmap before dropping the PTL.
>
> Note that we perform the check whether we need delayed rmap handling per
> page+nr_pages entry, not per page. So we won't perform more such checks.
>
> Once we set tlb->delayed_rmap (because we add one entry that requires it), we
> already force a flush before dropping the PT lock. So once we get a single
> delayed rmap entry in there, we will not batch more than we could have in the
> same page table: so not more than 512 entries (x86-64) in the worst case. So it
> will still be bounded, and not significantly more than what we had before.
>
> So regarding delayed rmap handling I think this should be fine.
>
> (3) Total batched pages
>
> MAX_GATHER_BATCH_COUNT effectively limits the number of pages we allocate (full
> batches), and thereby limits the number of pages we were able to batch.
>
> The old limit was ~10000 pages; now we could batch ~5000 folio fragments
> (page+nr_pages), resulting in the "times 256" increase in the worst case on
> x86-64 as you point out.
>
> This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.
>
> As the number of effective folios we are freeing does not increase, I *think*
> this should be fine.
>
>
> If any of that is a problem, we would have to keep track of the total number of
> pages in our batch, and stop as soon as we hit our 10000 limit -- independent of
> page vs. folio fragment. Something I would like to avoid if possible.
>


2024-01-31 10:39:05

by David Hildenbrand

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 31.01.24 03:30, Yin Fengwei wrote:
>
>
> On 1/29/24 22:32, David Hildenbrand wrote:
>> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> + pte_t pte, tmp_pte;
>> +
>> + pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>> + while (--nr) {
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>> + if (pte_dirty(tmp_pte))
>> + pte = pte_mkdirty(pte);
>> + if (pte_young(tmp_pte))
>> + pte = pte_mkyoung(pte);
> I am wondering whether it's worthwhile to move the pte_mkdirty() and pte_mkyoung()
> out of the loop and just do them once if needed. The worst case is that they
> are called nr - 1 times. Or is it just too micro?

I also thought about just indicating "any_accessed" or "any_dirty" using
flags to the caller, to avoid the PTE modifications completely. Felt a
bit micro-optimized.
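
For illustration, such a variant might look roughly like the following sketch
(hypothetical, not part of this series; the any_young/any_dirty out-parameters
are made up for the example):

static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
                unsigned long addr, pte_t *ptep, unsigned int nr, int full,
                bool *any_young, bool *any_dirty)
{
        pte_t pte, tmp_pte;

        pte = ptep_get_and_clear_full(mm, addr, ptep, full);
        *any_young = pte_young(pte);
        *any_dirty = pte_dirty(pte);
        while (--nr) {
                ptep++;
                addr += PAGE_SIZE;
                tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
                /*
                 * Report accessed/dirty to the caller instead of folding
                 * them into the returned PTE.
                 */
                *any_young |= pte_young(tmp_pte);
                *any_dirty |= pte_dirty(tmp_pte);
        }
        return pte;
}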

Regarding your proposal: I thought about that as well, but my assumption
was that dirty+young are "cheap" to be set.

On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has
to handle the saveddirty handling, using some bit trickery.

So at least for pte_mkyoung() there would be no real benefit as far as I
can see (might be even worse). For pte_mkdirty() there might be a small
benefit.

Is it going to be measurable? Likely not.

Am I missing something?

Thanks!

--
Cheers,

David / dhildenb


2024-01-31 10:43:24

by David Hildenbrand

Subject: Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 31.01.24 03:20, Yin Fengwei wrote:
> On 1/29/24 22:32, David Hildenbrand wrote:
>> This series is based on [1] and must be applied on top of it.
>> Similar to what we did with fork(), let's implement PTE batching
>> during unmap/zap when processing PTE-mapped THPs.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>> entry removal once per batch.
>>
>> Ryan was previously working on this in the context of cont-pte for
>> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
>> This series implements the optimization for all architectures, independent
>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>> large-folio page batches as well, and makes use of our new rmap batching
>> function when removing the rmap.
>>
>> To achieve that, we have to enlighten MMU gather / page freeing code
>> (i.e., everything that consumes encoded_page) to process unmapping
>> of consecutive pages that all belong to the same large folio. I'm being
>> very careful to not degrade order-0 performance, and it looks like I
>> managed to achieve that.
>
> One possible scenario:
> If all the folios are 2M folios, then one full batch could hold 510M of memory.
> Is that too much, given that before this change one full batch could only hold
> (2M - 4096 * 2) of memory?

Good point, we do have CONFIG_INIT_ON_FREE_DEFAULT_ON. I don't remember
if init_on_free or init_on_alloc was used in production systems. In
tlb_batch_pages_flush(), there is a cond_resched() to limit the number
of entries we process.

So if that is actually problematic, we'd run into a soft-lockup and need
another cond_resched() [I have some faint recollection that people are
working on removing cond_resched() completely].

One could do some counting in free_pages_and_swap_cache() (where we
iterate all entries already) and insert cond_resched+release_pages() for
every (e.g., 512) pages.

--
Cheers,

David / dhildenb


2024-01-31 10:44:20

by Yin, Fengwei

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP



On 1/31/2024 6:30 PM, David Hildenbrand wrote:
> On 31.01.24 03:30, Yin Fengwei wrote:
>>
>>
>> On 1/29/24 22:32, David Hildenbrand wrote:
>>> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>>> +        unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>>> +{
>>> +    pte_t pte, tmp_pte;
>>> +
>>> +    pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>>> +    while (--nr) {
>>> +        ptep++;
>>> +        addr += PAGE_SIZE;
>>> +        tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>>> +        if (pte_dirty(tmp_pte))
>>> +            pte = pte_mkdirty(pte);
>>> +        if (pte_young(tmp_pte))
>>> +            pte = pte_mkyoung(pte);
>> I am wondering whether it's worthwhile to move the pte_mkdirty() and
>> pte_mkyoung() out of the loop and just do them once if needed. The worst
>> case is that they are called nr - 1 times. Or is it just too micro?
>
> I also thought about just indicating "any_accessed" or "any_dirty" using
> flags to the caller, to avoid the PTE modifications completely. Felt a
> bit micro-optimized.
>
> Regarding your proposal: I thought about that as well, but my assumption
> was that dirty+young are "cheap" to be set.
>
> On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
> pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has
> to handle the saveddirty handling, using some bit trickery.
>
> So at least for pte_mkyoung() there would be no real benefit as far as I
> can see (might be even worse). For pte_mkdirty() there might be a small
> benefit.
>
> Is it going to be measurable? Likely not.
Yeah. We can do more investigation when performance profiling calls this
out.


Regards
Yin, Fengwei

>
> Am I missing something?
>
> Thanks!
>

2024-01-31 11:03:28

by David Hildenbrand

Subject: Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 31.01.24 03:20, Yin Fengwei wrote:
> On 1/29/24 22:32, David Hildenbrand wrote:
>> This series is based on [1] and must be applied on top of it.
>> Similar to what we did with fork(), let's implement PTE batching
>> during unmap/zap when processing PTE-mapped THPs.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>> entry removal once per batch.
>>
>> Ryan was previously working on this in the context of cont-pte for
>> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
>> This series implements the optimization for all architectures, independent
>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>> large-folio page batches as well, and makes use of our new rmap batching
>> function when removing the rmap.
>>
>> To achieve that, we have to enlighten MMU gather / page freeing code
>> (i.e., everything that consumes encoded_page) to process unmapping
>> of consecutive pages that all belong to the same large folio. I'm being
>> very careful to not degrade order-0 performance, and it looks like I
>> managed to achieve that.
>

Let's CC Linus and Michal to make sure I'm not daydreaming.

Relevant patch:
https://lkml.kernel.org/r/[email protected]

Context: I'm adjusting MMU gather code to support batching of
consecutive pages that belong to the same large folio, when
unmapping/zapping PTEs.

For small folios, there is no (relevant) change.

Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512
PTEs: Instead of adding 512 individual encoded_page entries, we add a
combined entry that expresses "page+nr_pages". That allows for "easily"
adding various other per-folio batching (refcount, rmap, swap freeing).

The implication is that we can now batch effectively more pages with
large folios, exceeding the old 10000 limit. The number of involved
*folios* does not increase, though.

> One possible scenario:
> If all the folios are 2M folios, then one full batch could hold 510M of memory.
> Is that too much, given that before this change one full batch could only hold
> (2M - 4096 * 2) of memory?

Excellent point, I think there are three parts to it:

(1) Batch pages / folio fragments per batch page

Before this change (and with 4k folios) we have exactly one page (4k)
per encoded_page entry in the batch. Now, we can have (with 2M folios),
512 pages for every two encoded_page entries (page+nr_pages) in a batch
page. So an average ~256 pages per encoded_page entry.

So one batch page can now store in the worst case ~256 times the number
of pages, but the number of folio fragments ("pages+nr_pages") would not
increase.

The time it takes to perform the actual page freeing of a batch will not
be 256 times higher -- the time is expected to be much closer to the old
time (i.e., not freeing more folios).

(2) Delayed rmap handling

We limit batching early (see tlb_next_batch()) when we have delayed rmap
pending. Reason being, that we don't want to check for many entries if
they require delayed rmap handling, while still holding the page table
lock (see tlb_flush_rmaps()), because we have to remove the rmap before
dropping the PTL.

Note that we perform the check whether we need delayed rmap handling per
page+nr_pages entry, not per page. So we won't perform more such checks.

Once we set tlb->delayed_rmap (because we add one entry that requires
it), we already force a flush before dropping the PT lock. So once we
get a single delayed rmap entry in there, we will not batch more than we
could have in the same page table: so not more than 512 entries (x86-64)
in the worst case. So it will still be bounded, and not significantly
more than what we had before.

So regarding delayed rmap handling I think this should be fine.

(3) Total batched pages

MAX_GATHER_BATCH_COUNT effectively limits the number of pages we
allocate (full batches), and thereby limits the number of pages we were
able to batch.

The old limit was ~10000 pages; now we could batch ~5000 folio fragments
(page+nr_pages), resulting in the "times 256" increase in the worst
case on x86-64 as you point out.

This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit
mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT") where we
wanted to handle soft-lockups.

As the number of effective folios we are freeing does not increase, I
*think* this should be fine.


If any of that is a problem, we would have to keep track of the total
number of pages in our batch, and stop as soon as we hit our 10000 limit
-- independent of page vs. folio fragment. Something I would like to
avoid if possible.

--
Cheers,

David / dhildenb


2024-01-31 11:14:21

by Ryan Roberts

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 31/01/2024 10:21, David Hildenbrand wrote:
>
>>> +
>>> +#ifndef clear_full_ptes
>>> +/**
>>> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
>>
>> I know its implied from "pages of the same folio" (and even more so for the
>> above variant due to mention of access/dirty), but I wonder if its useful to
>> explicitly state that "all ptes being cleared are present at the time of the
>> call"?
>
> "Clear PTEs" -> "Clear present PTEs" ?
>
> That should make it clearer.

Works for me.

>
> [...]
>
>>>       if (!delay_rmap) {
>>> -        folio_remove_rmap_pte(folio, page, vma);
>>> +        folio_remove_rmap_ptes(folio, page, nr, vma);
>>> +
>>> +        /* Only sanity-check the first page in a batch. */
>>>           if (unlikely(page_mapcount(page) < 0))
>>>               print_bad_pte(vma, addr, ptent, page);
>>
>> Is there a case for either removing this all together or moving it into
>> folio_remove_rmap_ptes()? It seems odd to only check some pages.
>>
>
> I really wanted to avoid another nasty loop here.
>
> In my thinking, for 4k folios, or when zapping subpages of large folios, we
> still perform the exact same checks. Only when batching we don't. So if there is
> some problem, there are ways to get it triggered. And these problems are barely
> ever seen.
>
> folio_remove_rmap_ptes() feels like the better place -- especially because the
> delayed-rmap handling is effectively unchecked. But in there, we cannot
> "print_bad_pte()".
>
> [background: if we had a total mapcount -- iow cheap folio_mapcount(), I'd check
> here that the total mapcount does not underflow, instead of checking per-subpage]

All good points... perhaps extend the comment to describe how this could be
solved in future with cheap total_mapcount()? Or in the commit log if you prefer?

>
>>
>>>       }
>>> -    if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
>>> +    if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>>>           *force_flush = true;
>>>           *force_break = true;
>>>       }
>>>   }
>>>   -static inline void zap_present_pte(struct mmu_gather *tlb,
>>> +/*
>>> + * Zap or skip one present PTE, trying to batch-process subsequent PTEs that
>>> map
>>
>> Zap or skip *at least* one... ?
>
> Ack
>


2024-01-31 11:27:05

by David Hildenbrand

Subject: Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

>>>> -        folio_remove_rmap_pte(folio, page, vma);
>>>> +        folio_remove_rmap_ptes(folio, page, nr, vma);
>>>> +
>>>> +        /* Only sanity-check the first page in a batch. */
>>>>           if (unlikely(page_mapcount(page) < 0))
>>>>               print_bad_pte(vma, addr, ptent, page);
>>>
>>> Is there a case for either removing this all together or moving it into
>>> folio_remove_rmap_ptes()? It seems odd to only check some pages.
>>>
>>
>> I really wanted to avoid another nasty loop here.
>>
>> In my thinking, for 4k folios, or when zapping subpages of large folios, we
>> still perform the exact same checks. Only when batching we don't. So if there is
>> some problem, there are ways to get it triggered. And these problems are barely
>> ever seen.
>>
>> folio_remove_rmap_ptes() feels like the better place -- especially because the
>> delayed-rmap handling is effectively unchecked. But in there, we cannot
>> "print_bad_pte()".
>>
>> [background: if we had a total mapcount -- iow cheap folio_mapcount(), I'd check
>> here that the total mapcount does not underflow, instead of checking per-subpage]
>
> All good points... perhaps extend the comment to describe how this could be
> solved in future with cheap total_mapcount()? Or in the commit log if you prefer?

I'll add more meat to the cover letter, thanks!

--
Cheers,

David / dhildenb


2024-01-31 14:09:21

by Michal Hocko

Subject: Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
> IIRC there is an option to zero memory when it is freed back to the buddy? So
> that could be a place where time is proportional to size rather than
> proportional to folio count? But I think that option is intended for debug only?
> So perhaps not a problem in practice?

init_on_free is considered a security/hardening feature more than a
debugging one. It will surely add an overhead and I guess this is
something people who use it know about. The batch size limit is a latency
reduction feature for !PREEMPT kernels but by no means should it be
considered a low-latency guarantee feature. A lot has changed since
the limit was introduced, and the current latency numbers will surely be
different than back then. As long as soft lockups do not trigger again
this should be acceptable IMHO.

--
Michal Hocko
SUSE Labs

2024-01-31 14:12:46

by Michal Hocko

Subject: Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On Wed 31-01-24 11:16:01, David Hildenbrand wrote:
[...]
> This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.

AFAIR at the time of this patch this was mostly just to put some cap on
the number of batches to collect and free at once. If there is a lot of
free memory and a large process exiting, this could grow really high. Now
that those pages^Wfolios can represent larger memory chunks, it could
mean more physical memory being freed, which might make the operation
take longer but still far from triggering a soft lockup.

Now latency might suck on !PREEMPT kernels with too many pages to
free in a single batch, but I guess this is somewhat expected for this
preemption model. The soft lockup has to be avoided because this can
panic the machine in some configurations.

--
Michal Hocko
SUSE Labs

2024-01-31 14:21:35

by David Hildenbrand

Subject: Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

On 31.01.24 15:08, Michal Hocko wrote:
> On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
>> IIRC there is an option to zero memory when it is freed back to the buddy? So
>> that could be a place where time is proportional to size rather than
>> proportional to folio count? But I think that option is intended for debug only?
>> So perhaps not a problem in practice?
>
> init_on_free is considered a security/hardening feature more than a
> debugging one. It will surely add an overhead and I guess this is
> something people who use it know about. The batch size limit is a latency
> reduction feature for !PREEMPT kernels but by no means should it be
> considered a low-latency guarantee feature. A lot has changed since
> the limit was introduced, and the current latency numbers will surely be
> different than back then. As long as soft lockups do not trigger again
> this should be acceptable IMHO.

It could now be zeroing out ~512 MiB. That shouldn't take double-digit
seconds unless we are running in a very problematic environment
(over-committed VM). But then, we might have different problems already.

I'll do some sanity checks with an extremely large process (as much as
I can fit on my machines), with a !CONFIG_PREEMPT kernel and
init_on_free, to see if anything pops up.

Thanks Michal!

--
Cheers,

David / dhildenb