2023-11-15 16:31:28

by Ryan Roberts

Subject: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

Hi All,

This is v2 of a series to opportunistically and transparently use contpte
mappings (set the contiguous bit in ptes) for user memory when those mappings
meet the requirements. It is part of a wider effort to improve performance by
allocating and mapping variable-sized blocks of memory (folios). One aim is for
the 4K kernel to approach the performance of the 16K kernel, but without
breaking compatibility and without the associated increase in memory use. Another
aim is to benefit the 16K and 64K kernels by enabling 2M THP, since this is the
contpte size for those kernels. We have good performance data that demonstrates
both aims are being met (see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And the other half of my work, to enable "small-sized THP" (large folios) for
anonymous memory, makes contpte-sized folios prevalent for anonymous memory too
[2].
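
For reference, the ARMv8 contiguous-bit geometry behind those numbers (16/128/32
contiguous level-3 entries for the 4K/16K/64K granules respectively):

     4K granule:  CONT_PTES = 16   ->  16 * 4K  = 64K contpte block
    16K granule:  CONT_PTES = 128  -> 128 * 16K =  2M contpte block
    64K granule:  CONT_PTES = 32   ->  32 * 64K =  2M contpte block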

Optimistically, I would really like to get this series merged for v6.8; there is
a chance that the small-sized THP series will also get merged for that version.
But even if it doesn't, this series still benefits file-backed memory for the
filesystems that support large folios, so it shouldn't be held up. Additionally,
I have data showing that this series introduces no regression when the system
has no appropriate large folios.

All dependencies listed against v1 are now resolved; this series applies cleanly
against v6.7-rc1.

Note that the first patch is for core-mm and provides the refactoring to make a
crucial optimization possible - which is then implemented in patch 13. The
remaining patches are arm64-specific.

Testing
=======

I've tested this series together with small-sized THP [2] on both Ampere Altra
(bare metal) and Apple M2 (VM):
- mm selftests (inc new tests written for small-sized THP); no regressions
- Speedometer JavaScript benchmark in the Chromium web browser; no issues
- Kernel compilation; no issues
- Various tests under high memory pressure with swap enabled; no issues


Performance
===========

John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
some workloads at [3], when using a 64K base page kernel.

You can also see the original performance results I posted against v1 [1] which
are still valid.

I've additionally run the kernel compilation and Speedometer benchmarks on a
system with small-sized THP disabled and large folio support for file-backed
memory intentionally disabled; I see no change in performance in this case (i.e.
no regression when this change is "present but not useful").


Changes since v1
================

- Export contpte_* symbols so that modules can continue to call inline
functions (e.g. ptep_get) which may now call the contpte_* functions (thanks
to JohnH)
- Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
- Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
(thanks to Catalin)
- Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed (thanks
to Catalin)
- Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
- Simplified contpte_ptep_get_and_clear_full()
- Improved various code comments


[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/


Thanks,
Ryan


Ryan Roberts (14):
mm: Batch-copy PTE ranges during fork()
arm64/mm: set_pte(): New layer to manage contig bit
arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
arm64/mm: pte_clear(): New layer to manage contig bit
arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
arm64/mm: ptep_get(): New layer to manage contig bit
arm64/mm: Split __flush_tlb_range() to elide trailing DSB
arm64/mm: Wire up PTE_CONT for user mappings
arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

arch/arm64/Kconfig | 10 +-
arch/arm64/include/asm/pgtable.h | 325 +++++++++++++++++++---
arch/arm64/include/asm/tlbflush.h | 13 +-
arch/arm64/kernel/efi.c | 4 +-
arch/arm64/kernel/mte.c | 2 +-
arch/arm64/kvm/guest.c | 2 +-
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/contpte.c | 447 ++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 12 +-
arch/arm64/mm/fixmap.c | 4 +-
arch/arm64/mm/hugetlbpage.c | 40 +--
arch/arm64/mm/kasan_init.c | 6 +-
arch/arm64/mm/mmu.c | 16 +-
arch/arm64/mm/pageattr.c | 6 +-
arch/arm64/mm/trans_pgd.c | 6 +-
include/linux/pgtable.h | 13 +
mm/memory.c | 175 +++++++++---
17 files changed, 956 insertions(+), 126 deletions(-)
create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1


2023-11-15 16:32:00

by Ryan Roberts

Subject: [PATCH v2 06/14] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
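
For context, once the contpte handling is wired up later in the series, the
public wrapper ends up taking roughly the following shape. This is an
illustrative sketch only (the exact body lands with the later contpte patch);
pte_valid_cont(), __ptep_get() and contpte_ptep_test_and_clear_young() are
helpers this series adds elsewhere:

	#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
	static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					unsigned long addr, pte_t *ptep)
	{
		pte_t orig_pte = __ptep_get(ptep);

		/* Non-contpte mappings keep using the arch-private helper. */
		if (!pte_valid_cont(orig_pte))
			return __ptep_test_and_clear_young(vma, addr, ptep);

		/* Contpte mappings defer to the contpte-aware implementation. */
		return contpte_ptep_test_and_clear_young(vma, addr, ptep);
	}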

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 994597a0bb0f..9b4a9909fd5b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -887,8 +887,9 @@ static inline bool pud_user_accessible_page(pud_t pud)
/*
* Atomic pte/pmd modifications.
*/
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int __ptep_test_and_clear_young(pte_t *ptep)
+static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *ptep)
{
pte_t old_pte, pte;

@@ -903,18 +904,11 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
return pte_young(pte);
}

-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long address,
- pte_t *ptep)
-{
- return __ptep_test_and_clear_young(ptep);
-}
-
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
- int young = ptep_test_and_clear_young(vma, address, ptep);
+ int young = __ptep_test_and_clear_young(vma, address, ptep);

if (young) {
/*
@@ -937,7 +931,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
{
- return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+ return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -1123,6 +1117,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define pte_clear __pte_clear
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear __ptep_get_and_clear
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+#define ptep_test_and_clear_young __ptep_test_and_clear_young

#endif /* !__ASSEMBLY__ */

--
2.25.1

2023-11-15 16:32:10

by Ryan Roberts

Subject: [PATCH v2 07/14] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9b4a9909fd5b..fc1005222ee4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -138,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
* so that we don't erroneously return false for pages that have been
* remapped as PROT_NONE but are yet to be flushed from the TLB.
* Note that we can't make any assumptions based on the state of the access
- * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
+ * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
* TLB.
*/
#define pte_accessible(mm, pte) \
@@ -904,8 +904,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
return pte_young(pte);
}

-#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
int young = __ptep_test_and_clear_young(vma, address, ptep);
@@ -1119,6 +1118,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_get_and_clear __ptep_get_and_clear
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define ptep_test_and_clear_young __ptep_test_and_clear_young
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+#define ptep_clear_flush_young __ptep_clear_flush_young

#endif /* !__ASSEMBLY__ */

--
2.25.1

2023-11-15 16:32:10

by Ryan Roberts

Subject: [PATCH v2 08/14] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++----
arch/arm64/mm/hugetlbpage.c | 2 +-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index fc1005222ee4..423cc32b2777 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -958,11 +958,11 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
- * ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
* dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
*/
-#define __HAVE_ARCH_PTEP_SET_WRPROTECT
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep)
{
pte_t old_pte, pte;

@@ -980,7 +980,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long address, pmd_t *pmdp)
{
- ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
+ __ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
}

#define pmdp_establish pmdp_establish
@@ -1120,6 +1120,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_test_and_clear_young __ptep_test_and_clear_young
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young __ptep_clear_flush_young
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+#define ptep_set_wrprotect __ptep_set_wrprotect

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index c2a753541d13..952462820d9d 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -493,7 +493,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
pte_t pte;

if (!pte_cont(READ_ONCE(*ptep))) {
- ptep_set_wrprotect(mm, addr, ptep);
+ __ptep_set_wrprotect(mm, addr, ptep);
return;
}

--
2.25.1

2023-11-15 16:32:20

by Ryan Roberts

Subject: [PATCH v2 03/14] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

set_pte_at() is a core macro that forwards to set_ptes() (with nr=1).
Instead of creating a __set_pte_at() internal macro, convert all arch
users to use set_ptes()/__set_ptes() directly, as appropriate. Callers
in hugetlb may benefit from calling __set_ptes() once for their whole
range rather than managing their own loop. This is left for future
improvement.
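
For reference, the core macro being relied on here is roughly the following
(include/linux/pgtable.h); set_pte_at() is simply set_ptes() with nr=1:

	#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)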

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 +++++-----
arch/arm64/kernel/mte.c | 2 +-
arch/arm64/kvm/guest.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/arm64/mm/hugetlbpage.c | 10 +++++-----
5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 650d4f4bb6dc..323ec91add60 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -342,9 +342,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
mte_sync_tags(pte, nr_pages);
}

-static inline void set_ptes(struct mm_struct *mm,
- unsigned long __always_unused addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void __set_ptes(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
{
page_table_check_ptes_set(mm, ptep, pte, nr);
__sync_cache_and_tags(pte, nr);
@@ -358,7 +358,6 @@ static inline void set_ptes(struct mm_struct *mm,
pte_val(pte) += PAGE_SIZE;
}
}
-#define set_ptes set_ptes

/*
* Huge pte definitions.
@@ -1067,7 +1066,7 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#endif /* CONFIG_ARM64_MTE */

/*
- * On AArch64, the cache coherency is handled via the set_pte_at() function.
+ * On AArch64, the cache coherency is handled via the __set_ptes() function.
*/
static inline void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
@@ -1121,6 +1120,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
pte_t old_pte, pte_t new_pte);

#define set_pte __set_pte
+#define set_ptes __set_ptes

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..dcdcccd40891 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
/*
* If the page content is identical but at least one of the pages is
* tagged, return non-zero to avoid KSM merging. If only one of the
- * pages is tagged, set_pte_at() may zero or change the tags of the
+ * pages is tagged, __set_ptes() may zero or change the tags of the
* other page via mte_sync_tags().
*/
if (page_mte_tagged(page1) || page_mte_tagged(page2))
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index aaf1d4939739..629145fd3161 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
} else {
/*
* Only locking to serialise with a concurrent
- * set_pte_at() in the VMM but still overriding the
+ * __set_ptes() in the VMM but still overriding the
* tags, hence ignoring the return value.
*/
try_page_mte_tagging(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 460d799e1296..a287c1dea871 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
*
* It needs to cope with hardware update of the accessed/dirty state by other
* agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like set_pte_at(), the PTE is never changed from no-exec to exec here.
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
*
* Returns whether or not the PTE actually changed.
*/
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f5aae342632c..741cb53672fd 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -254,12 +254,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,

if (!pte_present(pte)) {
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
- set_pte_at(mm, addr, ptep, pte);
+ __set_ptes(mm, addr, ptep, pte, 1);
return;
}

if (!pte_cont(pte)) {
- set_pte_at(mm, addr, ptep, pte);
+ __set_ptes(mm, addr, ptep, pte, 1);
return;
}

@@ -270,7 +270,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
clear_flush(mm, addr, ptep, pgsize, ncontig);

for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
}

pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -478,7 +478,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,

hugeprot = pte_pgprot(pte);
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);

return 1;
}
@@ -507,7 +507,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
pfn = pte_pfn(pte);

for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
}

pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
--
2.25.1

2023-11-15 16:32:37

by Ryan Roberts

Subject: [PATCH v2 09/14] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++----
arch/arm64/mm/fault.c | 6 +++---
arch/arm64/mm/hugetlbpage.c | 2 +-
3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 423cc32b2777..85010c2d4dfa 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,

/*
* Check for potential race with hardware updates of the pte
- * (ptep_set_access_flags safely changes valid ptes without going
+ * (__ptep_set_access_flags safely changes valid ptes without going
* through an invalid entry).
*/
VM_WARN_ONCE(!pte_young(pte),
@@ -842,8 +842,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
}

-#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
-extern int ptep_set_access_flags(struct vm_area_struct *vma,
+extern int __ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty);

@@ -853,7 +852,8 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
{
- return ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty);
+ return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
+ pmd_pte(entry), dirty);
}

static inline int pud_devmap(pud_t pud)
@@ -1122,6 +1122,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_clear_flush_young __ptep_clear_flush_young
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define ptep_set_wrprotect __ptep_set_wrprotect
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags __ptep_set_access_flags

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a287c1dea871..7cebd9847aae 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -209,9 +209,9 @@ static void show_pte(unsigned long addr)
*
* Returns whether or not the PTE actually changed.
*/
-int ptep_set_access_flags(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep,
- pte_t entry, int dirty)
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep,
+ pte_t entry, int dirty)
{
pteval_t old_pteval, pteval;
pte_t pte = READ_ONCE(*ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 952462820d9d..627a9717e98c 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -459,7 +459,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pte_t orig_pte;

if (!pte_cont(pte))
- return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
+ return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);
dpfn = pgsize >> PAGE_SHIFT;
--
2.25.1

2023-11-15 16:32:51

by Ryan Roberts

Subject: [PATCH v2 11/14] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.
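
For context, a rough sketch of how the forthcoming contpte code can use this
when clearing the young bit across a block (illustrative only; the real version
arrives with the contpte patches):

	/* young accumulated over the block via __ptep_test_and_clear_young() */
	if (young) {
		/*
		 * As for the single-pte case, the trailing DSB can be elided
		 * for access-flag invalidation; racing updates are tolerable.
		 */
		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
					 PAGE_SIZE, true, 3);
	}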

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bb2c2833a987..925ef3bdf9ed 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -399,7 +399,7 @@ do { \
#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false)

-static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
@@ -431,10 +431,19 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
else
__flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true);

- dsb(ish);
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
}

+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long stride, bool last_level,
+ int tlb_level)
+{
+ __flush_tlb_range_nosync(vma, start, end, stride,
+ last_level, tlb_level);
+ dsb(ish);
+}
+
static inline void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
--
2.25.1

2023-11-15 16:33:09

by Ryan Roberts

Subject: [PATCH v2 02/14] arm64/mm: set_pte(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 12 ++++++++----
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/kasan_init.c | 4 ++--
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/pageattr.c | 2 +-
arch/arm64/mm/trans_pgd.c | 4 ++--
7 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..650d4f4bb6dc 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

#define pte_none(pte) (!pte_val(pte))
-#define pte_clear(mm,addr,ptep) set_pte(ptep, __pte(0))
+#define pte_clear(mm, addr, ptep) \
+ __set_pte(ptep, __pte(0))
#define pte_page(pte) (pfn_to_page(pte_pfn(pte)))

/*
@@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
}

-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);

@@ -350,7 +351,7 @@ static inline void set_ptes(struct mm_struct *mm,

for (;;) {
__check_safe_pte_update(mm, ptep, pte);
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
if (--nr == 0)
break;
ptep++;
@@ -534,7 +535,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
{
__sync_cache_and_tags(pte, nr);
__check_safe_pte_update(mm, ptep, pte);
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
}

static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
@@ -1118,6 +1119,9 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);
+
+#define set_pte __set_pte
+
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..44288a12fc6c 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -111,7 +111,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
pte = set_pte_bit(pte, __pgprot(PTE_PXN));
else if (system_supports_bti_kernel() && spd->has_bti)
pte = set_pte_bit(pte, __pgprot(PTE_GP));
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
return 0;
}

diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index c0a3301203bd..51cd4501816d 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -121,7 +121,7 @@ void __set_fixmap(enum fixed_addresses idx,
ptep = fixmap_pte(addr);

if (pgprot_val(flags)) {
- set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
+ __set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 555285ebd5af..5eade712e9e5 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -112,7 +112,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
if (!early)
memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
next = addr + PAGE_SIZE;
- set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
+ __set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
}

@@ -266,7 +266,7 @@ static void __init kasan_init_shadow(void)
* so we should make sure that it maps the zero page read-only.
*/
for (i = 0; i < PTRS_PER_PTE; i++)
- set_pte(&kasan_early_shadow_pte[i],
+ __set_pte(&kasan_early_shadow_pte[i],
pfn_pte(sym_to_pfn(kasan_early_shadow_page),
PAGE_KERNEL_RO));

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 15f6347d23b6..e884279b268e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -178,7 +178,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
do {
pte_t old_pte = READ_ONCE(*ptep);

- set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+ __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

/*
* After the PTE entry has been populated once, we
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 8e2017ba5f1b..057097acf9e0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -41,7 +41,7 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
pte = clear_pte_bit(pte, cdata->clear_mask);
pte = set_pte_bit(pte, cdata->set_mask);

- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
return 0;
}

diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..230b607cf881 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -41,7 +41,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
* read only (code, rodata). Clear the RDONLY bit from
* the temporary mappings we use during restore.
*/
- set_pte(dst_ptep, pte_mkwrite_novma(pte));
+ __set_pte(dst_ptep, pte_mkwrite_novma(pte));
} else if ((debug_pagealloc_enabled() ||
is_kfence_address((void *)addr)) && !pte_none(pte)) {
/*
@@ -55,7 +55,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
*/
BUG_ON(!pfn_valid(pte_pfn(pte)));

- set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
+ __set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
}
}

--
2.25.1

2023-11-15 16:33:11

by Ryan Roberts

Subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
maps a physically contiguous block of memory, all belonging to the same
folio, with the same permissions, and for shared mappings, the same
dirty state. This will likely improve performance by a tiny amount due
to batching the folio reference count management and calling set_ptes()
rather than making individual calls to set_pte_at().

However, the primary motivation for this change is to reduce the number
of tlb maintenance operations that the arm64 backend has to perform
during fork, as it is about to add transparent support for the
"contiguous bit" in its ptes. By write-protecting the parent using the
new ptep_set_wrprotects() (note the 's' at the end) function, the
backend can avoid having to unfold contig ranges of PTEs, which is
expensive, when all ptes in the range are being write-protected.
Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
in the child, the backend does not need to fold a contiguous range once
they are all populated - they can be initially populated as a contiguous
range in the first place.

This change addresses the core-mm refactoring only, and introduces
ptep_set_wrprotects() with a default implementation that calls
ptep_set_wrprotect() for each pte in the range. A separate change will
implement ptep_set_wrprotects() in the arm64 backend to realize the
performance improvement as part of the work to enable contpte mappings.
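
As a concrete example of the resulting call pattern (assuming a 64K folio mapped
by 16 contiguous ptes on a 4K kernel; names as in copy_present_ptes() below),
fork's copy path goes from 16 individual calls per side to one batched call:

	/* parent: write-protect the whole batch in one call */
	ptep_set_wrprotects(src_mm, addr, src_pte, 16);

	/* child: populate the whole batch in one call */
	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, 16);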

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 13 +++
mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
2 files changed, 150 insertions(+), 38 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..1c50f8a0fdde 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
}
#endif

+#ifndef ptep_set_wrprotects
+struct mm_struct;
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+ ptep_set_wrprotect(mm, address, ptep);
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..b7c8228883cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
/* Uffd-wp needs to be delivered to dest pte as well */
pte = pte_mkuffd_wp(pte);
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
- return 0;
+ return 1;
+}
+
+static inline unsigned long page_cont_mapped_vaddr(struct page *page,
+ struct page *anchor, unsigned long anchor_vaddr)
+{
+ unsigned long offset;
+ unsigned long vaddr;
+
+ offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+ vaddr = anchor_vaddr + offset;
+
+ if (anchor > page) {
+ if (vaddr > anchor_vaddr)
+ return 0;
+ } else {
+ if (vaddr < anchor_vaddr)
+ return ULONG_MAX;
+ }
+
+ return vaddr;
+}
+
+static int folio_nr_pages_cont_mapped(struct folio *folio,
+ struct page *page, pte_t *pte,
+ unsigned long addr, unsigned long end,
+ pte_t ptent, bool *any_dirty)
+{
+ int floops;
+ int i;
+ unsigned long pfn;
+ pgprot_t prot;
+ struct page *folio_end;
+
+ if (!folio_test_large(folio))
+ return 1;
+
+ folio_end = &folio->page + folio_nr_pages(folio);
+ end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
+ floops = (end - addr) >> PAGE_SHIFT;
+ pfn = page_to_pfn(page);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
+
+ *any_dirty = pte_dirty(ptent);
+
+ pfn++;
+ pte++;
+
+ for (i = 1; i < floops; i++) {
+ ptent = ptep_get(pte);
+ ptent = pte_mkold(pte_mkclean(ptent));
+
+ if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
+ pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
+ break;
+
+ if (pte_dirty(ptent))
+ *any_dirty = true;
+
+ pfn++;
+ pte++;
+ }
+
+ return i;
}

/*
- * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
+ * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
+ * first pte.
*/
static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
- pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
- struct folio **prealloc)
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+ pte_t *dst_pte, pte_t *src_pte,
+ unsigned long addr, unsigned long end,
+ int *rss, struct folio **prealloc)
{
struct mm_struct *src_mm = src_vma->vm_mm;
unsigned long vm_flags = src_vma->vm_flags;
pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
+ int nr = 1;
+ bool anon;
+ bool any_dirty = pte_dirty(pte);
+ int i;

page = vm_normal_page(src_vma, addr, pte);
- if (page)
+ if (page) {
folio = page_folio(page);
- if (page && folio_test_anon(folio)) {
- /*
- * If this page may have been pinned by the parent process,
- * copy the page immediately for the child so that we'll always
- * guarantee the pinned page won't be randomly replaced in the
- * future.
- */
- folio_get(folio);
- if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
- /* Page may be pinned, we have to copy. */
- folio_put(folio);
- return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, prealloc, page);
+ anon = folio_test_anon(folio);
+ nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
+ end, pte, &any_dirty);
+
+ for (i = 0; i < nr; i++, page++) {
+ if (anon) {
+ /*
+ * If this page may have been pinned by the
+ * parent process, copy the page immediately for
+ * the child so that we'll always guarantee the
+ * pinned page won't be randomly replaced in the
+ * future.
+ */
+ if (unlikely(page_try_dup_anon_rmap(
+ page, false, src_vma))) {
+ if (i != 0)
+ break;
+ /* Page may be pinned, we have to copy. */
+ return copy_present_page(
+ dst_vma, src_vma, dst_pte,
+ src_pte, addr, rss, prealloc,
+ page);
+ }
+ rss[MM_ANONPAGES]++;
+ VM_BUG_ON(PageAnonExclusive(page));
+ } else {
+ page_dup_file_rmap(page, false);
+ rss[mm_counter_file(page)]++;
+ }
}
- rss[MM_ANONPAGES]++;
- } else if (page) {
- folio_get(folio);
- page_dup_file_rmap(page, false);
- rss[mm_counter_file(page)]++;
+
+ nr = i;
+ folio_ref_add(folio, nr);
}

/*
@@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
* in the parent and the child
*/
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
+ ptep_set_wrprotects(src_mm, addr, src_pte, nr);
pte = pte_wrprotect(pte);
}
- VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));

/*
- * If it's a shared mapping, mark it clean in
- * the child
+ * If it's a shared mapping, mark it clean in the child. If its a
+ * private mapping, mark it dirty in the child if _any_ of the parent
+ * mappings in the block were marked dirty. The contiguous block of
+ * mappings are all backed by the same folio, so if any are dirty then
+ * the whole folio is dirty. This allows us to determine the batch size
+ * without having to ever consider the dirty bit. See
+ * folio_nr_pages_cont_mapped().
*/
- if (vm_flags & VM_SHARED)
- pte = pte_mkclean(pte);
- pte = pte_mkold(pte);
+ pte = pte_mkold(pte_mkclean(pte));
+ if (!(vm_flags & VM_SHARED) && any_dirty)
+ pte = pte_mkdirty(pte);

if (!userfaultfd_wp(dst_vma))
pte = pte_clear_uffd_wp(pte);

- set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
- return 0;
+ set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+ return nr;
}

static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
@@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
*/
WARN_ON_ONCE(ret != -ENOENT);
}
- /* copy_present_pte() will clear `*prealloc' if consumed */
- ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, &prealloc);
+ /* copy_present_ptes() will clear `*prealloc' if consumed */
+ ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+ addr, end, rss, &prealloc);
+
/*
* If we need a pre-allocated page for this pte, drop the
* locks, allocate, and try again.
*/
if (unlikely(ret == -EAGAIN))
break;
+
+ /*
+ * Positive return value is the number of ptes copied.
+ */
+ VM_WARN_ON_ONCE(ret < 1);
+ progress += 8 * ret;
+ ret--;
+ dst_pte += ret;
+ src_pte += ret;
+ addr += ret << PAGE_SHIFT;
+ ret = 0;
+
if (unlikely(prealloc)) {
/*
* pre-alloc page cannot be reused by next time so as
@@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
folio_put(prealloc);
prealloc = NULL;
}
- progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

arch_leave_lazy_mmu_mode();
--
2.25.1

2023-11-15 16:33:16

by Ryan Roberts

Subject: [PATCH v2 04/14] arm64/mm: pte_clear(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 3 ++-
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/arm64/mm/mmu.c | 2 +-
4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 323ec91add60..1464e990580a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

#define pte_none(pte) (!pte_val(pte))
-#define pte_clear(mm, addr, ptep) \
+#define __pte_clear(mm, addr, ptep) \
__set_pte(ptep, __pte(0))
#define pte_page(pte) (pfn_to_page(pte_pfn(pte)))

@@ -1121,6 +1121,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,

#define set_pte __set_pte
#define set_ptes __set_ptes
+#define pte_clear __pte_clear

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index 51cd4501816d..bfc02568805a 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -123,7 +123,7 @@ void __set_fixmap(enum fixed_addresses idx,
if (pgprot_val(flags)) {
__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
- pte_clear(&init_mm, addr, ptep);
+ __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
}
}
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 741cb53672fd..510b2d4b89a9 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -400,7 +400,7 @@ void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
ncontig = num_contig_ptes(sz, &pgsize);

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- pte_clear(mm, addr, ptep);
+ __pte_clear(mm, addr, ptep);
}

pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e884279b268e..080e9b50f595 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -859,7 +859,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
continue;

WARN_ON(!pte_present(pte));
- pte_clear(&init_mm, addr, ptep);
+ __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
if (free_mapped)
free_hotplug_page_range(pte_page(pte),
--
2.25.1

2023-11-15 16:33:17

by Ryan Roberts

Subject: [PATCH v2 05/14] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 5 +++--
arch/arm64/mm/hugetlbpage.c | 6 +++---
2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1464e990580a..994597a0bb0f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -941,8 +941,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
unsigned long address, pte_t *ptep)
{
pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
@@ -1122,6 +1121,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define set_pte __set_pte
#define set_ptes __set_ptes
#define pte_clear __pte_clear
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define ptep_get_and_clear __ptep_get_and_clear

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 510b2d4b89a9..c2a753541d13 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -188,7 +188,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long i;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
- pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+ pte_t pte = __ptep_get_and_clear(mm, addr, ptep);

/*
* If HW_AFDBM is enabled, then the HW could turn on
@@ -236,7 +236,7 @@ static void clear_flush(struct mm_struct *mm,
unsigned long i, saddr = addr;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- ptep_clear(mm, addr, ptep);
+ __ptep_get_and_clear(mm, addr, ptep);

flush_tlb_range(&vma, saddr, addr);
}
@@ -411,7 +411,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
pte_t orig_pte = ptep_get(ptep);

if (!pte_cont(orig_pte))
- return ptep_get_and_clear(mm, addr, ptep);
+ return __ptep_get_and_clear(mm, addr, ptep);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);

--
2.25.1

2023-11-15 16:33:19

by Ryan Roberts

Subject: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

ptep_get_and_clear_full() adds a 'full' parameter which is not present
for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
a full address space teardown is in progress. We use this information to
optimize arm64_sys_exit_group() by avoiding the need to unfold (and therefore
tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
contiguous neighbours to keep their contig bit set, because we know we
are about to clear the rest too.

Before this optimization, the cost of arm64_sys_exit_group() exploded to
32x what it was before PTE_CONT support was wired up, when compiling the
kernel. With this optimization in place, we are back down to the
original cost.

This approach is not perfect though, as for the duration between
returning from the first call to ptep_get_and_clear_full() and making
the final call, the contpte block is in an intermediate state, where some
ptes are cleared and others are still set with the PTE_CONT bit. If any
other APIs are called for the ptes in the contpte block during that
time, we have to be very careful. The core code currently interleaves
calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
must be careful to ignore the cleared entries when accumulating the
access and dirty bits - the same goes for ptep_get_lockless(). The only
other calls we might reasonably expect are to set markers in the
previously cleared ptes. (We shouldn't see valid entries being set until
after the tlbi, at which point we are no longer in the intermediate
state). Since markers are not valid, this is safe; set_ptes() will see
the old, invalid entry and will not attempt to unfold. And the new pte
is also invalid so it won't attempt to fold. We shouldn't see this for
the 'full' case anyway.

The last remaining issue is returning the access/dirty bits. That info
could be present in any of the ptes in the contpte block. ptep_get()
will gather those bits from across the contpte block. We don't bother
doing that here, because we know that the information is used by the
core-mm to mark the underlying folio as accessed/dirty. And since the
same folio must be underpinning the whole block (that was a requirement
for folding in the first place), that information will make it to the
folio eventually once all the ptes have been cleared. This approach
means we don't have to play games with accumulating and storing the
bits. It does mean that any interleaved calls to ptep_get() may lack
correct access/dirty information if we have already cleared the pte that
happened to store it. The core code does not rely on this though.
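
For context, the intended caller is the core-mm teardown path, which passes
tlb->fullmm as 'full'; roughly (mm/memory.c, zap_pte_range()):

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);

During a full address space teardown tlb->fullmm is set, so full == 1 and the
no-unfold path described above is taken for every pte in the contpte block.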

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 18 +++++++++--
arch/arm64/mm/contpte.c | 54 ++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9bd2f57a9e11..ea58a9f4e700 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1145,6 +1145,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep);
extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1270,12 +1272,24 @@ static inline void pte_clear(struct mm_struct *mm,
__pte_clear(mm, addr, ptep);
}

+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, int full)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte) || !full) {
+ contpte_try_unfold(mm, addr, ptep, orig_pte);
+ return __ptep_get_and_clear(mm, addr, ptep);
+ } else
+ return contpte_ptep_get_and_clear_full(mm, addr, ptep);
+}
+
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
- return __ptep_get_and_clear(mm, addr, ptep);
+ return ptep_get_and_clear_full(mm, addr, ptep, 0);
}

#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 426be9cd4dea..5d1aaed82d32 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -144,6 +144,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);

+ /*
+ * Deal with the partial contpte_ptep_get_and_clear_full() case,
+ * where some of the ptes in the range may be cleared but others
+ * are still to do. See contpte_ptep_get_and_clear_full().
+ */
+ if (pte_val(pte) == 0)
+ continue;
+
if (pte_dirty(pte))
orig_pte = pte_mkdirty(orig_pte);

@@ -256,6 +264,52 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL(contpte_set_ptes);

+pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ /*
+ * When doing a full address space teardown, we can avoid unfolding the
+ * contiguous range, and therefore avoid the associated tlbi. Instead,
+ * just get and clear the pte. The caller is promising to call us for
+ * every pte, so every pte in the range will be cleared by the time the
+ * tlbi is issued.
+ *
+ * This approach is not perfect though, as for the duration between
+ * returning from the first call to ptep_get_and_clear_full() and making
+ * the final call, the contpte block is in an intermediate state, where
+ * some ptes are cleared and others are still set with the PTE_CONT bit.
+ * If any other APIs are called for the ptes in the contpte block during
+ * that time, we have to be very careful. The core code currently
+ * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
+ * ptep_get() must be careful to ignore the cleared entries when
+ * accumulating the access and dirty bits - the same goes for
+ * ptep_get_lockless(). The only other calls we might reasonably expect
+ * are to set markers in the previously cleared ptes. (We shouldn't see
+ * valid entries being set until after the tlbi, at which point we are
+ * no longer in the intermediate state). Since markers are not valid,
+ * this is safe; set_ptes() will see the old, invalid entry and will not
+ * attempt to unfold. And the new pte is also invalid so it won't
+ * attempt to fold. We shouldn't see this for the 'full' case anyway.
+ *
+ * The last remaining issue is returning the access/dirty bits. That
+ * info could be present in any of the ptes in the contpte block.
+ * ptep_get() will gather those bits from across the contpte block. We
+ * don't bother doing that here, because we know that the information is
+ * used by the core-mm to mark the underlying folio as accessed/dirty.
+ * And since the same folio must be underpinning the whole block (that
+ * was a requirement for folding in the first place), that information
+ * will make it to the folio eventually once all the ptes have been
+ * cleared. This approach means we don't have to play games with
+ * accumulating and storing the bits. It does mean that any interleaved
+ * calls to ptep_get() may lack correct access/dirty information if we
+ * have already cleared the pte that happened to store it. The core code
+ * does not rely on this though.
+ */
+
+ return __ptep_get_and_clear(mm, addr, ptep);
+}
+EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
+
int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
--
2.25.1

2023-11-15 16:33:31

by Ryan Roberts

Subject: [PATCH v2 10/14] arm64/mm: ptep_get(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

arm64 did not previously define an arch-specific ptep_get(), so override
the default version in the arch code, and also define the private
__ptep_get() version. Currently they both do the same thing that the
default version does (READ_ONCE()). Some arch users (hugetlb) were
already using ptep_get(), so convert those to the private API. Other
callsites were doing a direct READ_ONCE(), so convert those to use
the appropriate (public/private) API too.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 12 +++++++++---
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/mm/fault.c | 4 ++--
arch/arm64/mm/hugetlbpage.c | 18 +++++++++---------
arch/arm64/mm/kasan_init.c | 2 +-
arch/arm64/mm/mmu.c | 12 ++++++------
arch/arm64/mm/pageattr.c | 4 ++--
arch/arm64/mm/trans_pgd.c | 2 +-
8 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 85010c2d4dfa..6930c14f062f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -276,6 +276,11 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
}
}

+static inline pte_t __ptep_get(pte_t *ptep)
+{
+ return READ_ONCE(*ptep);
+}
+
extern void __sync_icache_dcache(pte_t pteval);
bool pgattr_change_is_safe(u64 old, u64 new);

@@ -303,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
if (!IS_ENABLED(CONFIG_DEBUG_VM))
return;

- old_pte = READ_ONCE(*ptep);
+ old_pte = __ptep_get(ptep);

if (!pte_valid(old_pte) || !pte_valid(pte))
return;
@@ -893,7 +898,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
{
pte_t old_pte, pte;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_mkold(pte);
@@ -966,7 +971,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
{
pte_t old_pte, pte;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
@@ -1111,6 +1116,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);

+#define ptep_get __ptep_get
#define set_pte __set_pte
#define set_ptes __set_ptes
#define pte_clear __pte_clear
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 44288a12fc6c..9afcc690fe73 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
{
struct set_perm_data *spd = data;
const efi_memory_desc_t *md = spd->md;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

if (md->attribute & EFI_MEMORY_RO)
pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 7cebd9847aae..d63f3a0a7251 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
if (!ptep)
break;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
pr_cont(", pte=%016llx", pte_val(pte));
pte_unmap(ptep);
} while(0);
@@ -214,7 +214,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
pte_t entry, int dirty)
{
pteval_t old_pteval, pteval;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

if (pte_same(pte, entry))
return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 627a9717e98c..52fb767607e0 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -152,14 +152,14 @@ pte_t huge_ptep_get(pte_t *ptep)
{
int ncontig, i;
size_t pgsize;
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);

if (!pte_present(orig_pte) || !pte_cont(orig_pte))
return orig_pte;

ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
for (i = 0; i < ncontig; i++, ptep++) {
- pte_t pte = ptep_get(ptep);
+ pte_t pte = __ptep_get(ptep);

if (pte_dirty(pte))
orig_pte = pte_mkdirty(orig_pte);
@@ -184,7 +184,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long pgsize,
unsigned long ncontig)
{
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);
unsigned long i;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
@@ -408,7 +408,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
{
int ncontig;
size_t pgsize;
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);

if (!pte_cont(orig_pte))
return __ptep_get_and_clear(mm, addr, ptep);
@@ -431,11 +431,11 @@ static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
{
int i;

- if (pte_write(pte) != pte_write(ptep_get(ptep)))
+ if (pte_write(pte) != pte_write(__ptep_get(ptep)))
return 1;

for (i = 0; i < ncontig; i++) {
- pte_t orig_pte = ptep_get(ptep + i);
+ pte_t orig_pte = __ptep_get(ptep + i);

if (pte_dirty(pte) != pte_dirty(orig_pte))
return 1;
@@ -492,7 +492,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
size_t pgsize;
pte_t pte;

- if (!pte_cont(READ_ONCE(*ptep))) {
+ if (!pte_cont(__ptep_get(ptep))) {
__ptep_set_wrprotect(mm, addr, ptep);
return;
}
@@ -517,7 +517,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
size_t pgsize;
int ncontig;

- if (!pte_cont(READ_ONCE(*ptep)))
+ if (!pte_cont(__ptep_get(ptep)))
return ptep_clear_flush(vma, addr, ptep);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);
@@ -550,7 +550,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr
* when the permission changes from executable to non-executable
* in cases where cpu is affected with errata #2645198.
*/
- if (pte_user_exec(READ_ONCE(*ptep)))
+ if (pte_user_exec(__ptep_get(ptep)))
return huge_ptep_clear_flush(vma, addr, ptep);
}
return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 5eade712e9e5..5274c317d775 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -113,7 +113,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
next = addr + PAGE_SIZE;
__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
- } while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
+ } while (ptep++, addr = next, addr != end && pte_none(__ptep_get(ptep)));
}

static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 080e9b50f595..784f1e312447 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -176,7 +176,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,

ptep = pte_set_fixmap_offset(pmdp, addr);
do {
- pte_t old_pte = READ_ONCE(*ptep);
+ pte_t old_pte = __ptep_get(ptep);

__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

@@ -185,7 +185,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
* only allow updates to the permission attributes.
*/
BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
- READ_ONCE(pte_val(*ptep))));
+ pte_val(__ptep_get(ptep))));

phys += PAGE_SIZE;
} while (ptep++, addr += PAGE_SIZE, addr != end);
@@ -854,7 +854,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,

do {
ptep = pte_offset_kernel(pmdp, addr);
- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
if (pte_none(pte))
continue;

@@ -987,7 +987,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,

do {
ptep = pte_offset_kernel(pmdp, addr);
- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);

/*
* This is just a sanity check here which verifies that
@@ -1006,7 +1006,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
*/
ptep = pte_offset_kernel(pmdp, 0UL);
for (i = 0; i < PTRS_PER_PTE; i++) {
- if (!pte_none(READ_ONCE(ptep[i])))
+ if (!pte_none(__ptep_get(&ptep[i])))
return;
}

@@ -1475,7 +1475,7 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
* when the permission changes from executable to non-executable
* in cases where cpu is affected with errata #2645198.
*/
- if (pte_user_exec(READ_ONCE(*ptep)))
+ if (pte_user_exec(ptep_get(ptep)))
return ptep_clear_flush(vma, addr, ptep);
}
return ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 057097acf9e0..624b0b0982e3 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -36,7 +36,7 @@ bool can_set_direct_map(void)
static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
{
struct page_change_data *cdata = data;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

pte = clear_pte_bit(pte, cdata->clear_mask);
pte = set_pte_bit(pte, cdata->set_mask);
@@ -246,5 +246,5 @@ bool kernel_page_present(struct page *page)
return true;

ptep = pte_offset_kernel(pmdp, addr);
- return pte_valid(READ_ONCE(*ptep));
+ return pte_valid(__ptep_get(ptep));
}
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 230b607cf881..5139a28130c0 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -33,7 +33,7 @@ static void *trans_alloc(struct trans_pgd_info *info)

static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
{
- pte_t pte = READ_ONCE(*src_ptep);
+ pte_t pte = __ptep_get(src_ptep);

if (pte_valid(pte)) {
/*
--
2.25.1

2023-11-15 16:33:38

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v2 13/14] arm64/mm: Implement ptep_set_wrprotects() to optimize fork()

With the core-mm changes in place to batch-copy ptes during fork, we can
take advantage of this in arm64 to greatly reduce the number of tlbis we
have to issue, and recover the fork performance that was lost when adding
support for transparent contiguous ptes.

If we are write-protecting a whole contig range, we can apply the
write-protection to the whole range and know that it won't change
whether the range should have the contiguous bit set or not. For ranges
smaller than the contig range, we will still have to unfold, apply the
write-protection, then fold if the change now means the range is
foldable.
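
Condensed, the per-block logic implemented below in contpte_set_wrprotects()
is roughly the following sketch:

	next = pte_cont_addr_end(addr, end);
	nr = (next - addr) >> PAGE_SHIFT;

	/* Partial block: must unfold before wrprotecting a subset. */
	if (nr != CONT_PTES)
		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));

	__ptep_set_wrprotects(mm, addr, ptep, nr);

	/* Partial block: the change may now have made the range foldable. */
	if (nr != CONT_PTES)
		contpte_try_fold(mm, next - PAGE_SIZE, ptep + nr - 1,
				 __ptep_get(ptep + nr - 1));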

This optimization is possible thanks to the tightening of the Arm ARM
with respect to the definition and behaviour when 'Misprogramming the
Contiguous bit'. See section D21194 at
https://developer.arm.com/documentation/102105/latest/

Performance tested with the following test written for the will-it-scale
framework:

-------

char *testcase_description = "fork and exit";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	int pid;
	char *mem;

	mem = malloc(SZ_128M);
	assert(mem);
	memset(mem, 1, SZ_128M);

	while (1) {
		pid = fork();
		assert(pid >= 0);

		if (!pid)
			exit(0);

		waitpid(pid, NULL, 0);

		(*iterations)++;
	}
}

-------

I saw a huge performance regression when PTE_CONT support was added; the
regression is mostly fixed with the addition of this change. The
following shows the regression relative to before PTE_CONT was enabled
(a bigger negative value is a bigger regression):

| cpus | before opt | after opt |
|-------:|-------------:|------------:|
| 1 | -10.4% | -5.2% |
| 8 | -15.4% | -3.5% |
| 16 | -38.7% | -3.7% |
| 24 | -57.0% | -4.4% |
| 32 | -65.8% | -5.4% |

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 30 ++++++++++++++++++++---
arch/arm64/mm/contpte.c | 42 ++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 15bc9cf1eef4..9bd2f57a9e11 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -984,6 +984,16 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
} while (pte_val(pte) != pte_val(old_pte));
}

+static inline void __ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+ __ptep_set_wrprotect(mm, address, ptep);
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_PMDP_SET_WRPROTECT
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1139,6 +1149,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
+extern void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr);
extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
@@ -1290,13 +1302,25 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
return contpte_ptep_clear_flush_young(vma, addr, ptep);
}

+#define ptep_set_wrprotects ptep_set_wrprotects
+static inline void ptep_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr)
+{
+ if (!contpte_is_enabled(mm))
+ __ptep_set_wrprotects(mm, addr, ptep, nr);
+ else if (nr == 1) {
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __ptep_set_wrprotects(mm, addr, ptep, 1);
+ contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+ } else
+ contpte_set_wrprotects(mm, addr, ptep, nr);
+}
+
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
- __ptep_set_wrprotect(mm, addr, ptep);
- contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+ ptep_set_wrprotects(mm, addr, ptep, 1);
}

#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 667bcf7c3260..426be9cd4dea 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -302,6 +302,48 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
}
EXPORT_SYMBOL(contpte_ptep_clear_flush_young);

+void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr)
+{
+ unsigned long next;
+ unsigned long end = addr + (nr << PAGE_SHIFT);
+
+ do {
+ next = pte_cont_addr_end(addr, end);
+ nr = (next - addr) >> PAGE_SHIFT;
+
+ /*
+ * If wrprotecting an entire contig range, we can avoid
+ * unfolding. Just set wrprotect and wait for the later
+ * mmu_gather flush to invalidate the tlb. Until the flush, the
+ * page may or may not be wrprotected. After the flush, it is
+ * guaranteed wrprotected. If it's a partial range though, we
+ * must unfold, because we can't have a case where CONT_PTE is
+ * set but wrprotect applies to a subset of the PTEs; this would
+ * cause it to continue to be unpredictable after the flush.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+ __ptep_set_wrprotects(mm, addr, ptep, nr);
+
+ addr = next;
+ ptep += nr;
+
+ /*
+ * If applying to a partial contig range, the change could have
+ * made the range foldable. Use the last pte in the range we
+ * just set for comparison, since contpte_try_fold() only
+ * triggers when acting on the last pte in the contig range.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_fold(mm, addr - PAGE_SIZE, ptep - 1,
+ __ptep_get(ptep - 1));
+
+ } while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_wrprotects);
+
int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty)
--
2.25.1

2023-11-15 16:33:38

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings. Whenever it detects a set of PTEs that meet the
requirements for a contiguous range, the PTEs are re-painted with the
PTE_CONT bit. Use of contpte mappings is intended to be transparent to
the core-mm, which continues to interact with individual ptes.

Since a contpte block only has a single access and dirty bit, the
semantic here changes slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information is
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.
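
In other words, for a pte within a contpte block, ptep_get() effectively does
the following (a simplified view of contpte_ptep_get(), added below):

	pte_t orig_pte = __ptep_get(ptep);
	pte_t *first = contpte_align_down(ptep);
	int i;

	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = __ptep_get(first + i);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;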

This initial change provides a baseline that can be optimized in future
commits. That said, fold/unfold operations (which imply tlb
invalidation) are avoided where possible with a few tricks for
access/dirty bit management. Write-protect modifications for contpte
mappings are currently non-optimal, and incur a regression in fork()
performance. This will be addressed in follow-up changes.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/Kconfig | 10 +-
arch/arm64/include/asm/pgtable.h | 202 ++++++++++++++++++
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/contpte.c | 351 +++++++++++++++++++++++++++++++
4 files changed, 563 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b071a00425d..de76e484ff3a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
select UNWIND_TABLES
select DYNAMIC_SCS

+config ARM64_CONTPTE
+ bool "Contiguous PTE mappings for user memory" if EXPERT
+ depends on TRANSPARENT_HUGEPAGE
+ default y
+ help
+ When enabled, user mappings are configured using the PTE contiguous
+ bit, for any mappings that meet the size and alignment requirements.
+ This reduces TLB pressure and improves performance.
+
endmenu # "Kernel Features"

menu "Boot options"
@@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
source "drivers/acpi/Kconfig"

source "arch/arm64/kvm/Kconfig"
-
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6930c14f062f..15bc9cf1eef4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
*/
#define pte_valid_not_user(pte) \
((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
+/*
+ * Returns true if the pte is valid and has the contiguous bit set.
+ */
+#define pte_valid_cont(pte) (pte_valid(pte) && pte_cont(pte))
/*
* Could the pte be present in the TLB? We must check mm_tlb_flush_pending
* so that we don't erroneously return false for pages that have been
@@ -1116,6 +1120,202 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);

+#ifdef CONFIG_ARM64_CONTPTE
+
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte);
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty);
+
+static inline pte_t *contpte_align_down(pte_t *ptep)
+{
+ return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
+}
+
+static inline bool contpte_is_enabled(struct mm_struct *mm)
+{
+ /*
+ * Don't attempt to apply the contig bit to kernel mappings, because
+ * dynamically adding/removing the contig bit can cause page faults.
+ * These racing faults are ok for user space, since they get serialized
+ * on the PTL. But kernel mappings can't tolerate faults.
+ */
+
+ return mm != &init_mm;
+}
+
+static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * Only bother trying if both the virtual and physical addresses are
+ * aligned and correspond to the last entry in a contig range. The core
+ * code mostly modifies ranges from low to high, so this is likely the
+ * last modification in the contig range, and hence a good time to fold.
+ * We can't fold special mappings, because there is no associated folio.
+ */
+
+ bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
+ bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
+
+ if (contpte_is_enabled(mm) && valign && palign &&
+ pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
+ __contpte_try_fold(mm, addr, ptep, pte);
+}
+
+static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ if (contpte_is_enabled(mm) && pte_valid_cont(pte))
+ __contpte_try_unfold(mm, addr, ptep, pte);
+}
+
+/*
+ * The below functions constitute the public API that arm64 presents to the
+ * core-mm to manipulate PTE entries within their page tables (or at least
+ * this is the subset of the API that arm64 needs to implement). These public
+ * versions will automatically and transparently apply the contiguous bit where
+ * it makes sense to do so. Therefore any users that are contig-aware (e.g.
+ * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
+ * private versions, which are prefixed with double underscore. All of these
+ * APIs except for ptep_get_lockless() are expected to be called with the PTL
+ * held.
+ */
+
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+ pte_t pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(pte))
+ return pte;
+
+ return contpte_ptep_get(ptep, pte);
+}
+
+#define ptep_get_lockless ptep_get_lockless
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+ pte_t pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(pte))
+ return pte;
+
+ return contpte_ptep_get_lockless(ptep);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+ /*
+ * We don't have the mm or vaddr so cannot unfold or fold contig entries
+ * (since it requires tlb maintenance). set_pte() is not used in core
+ * code, so this should never even be called. Regardless, do our best to
+ * service any call and emit a warning if there is any attempt to set a
+ * pte on top of an existing contig range.
+ */
+ pte_t orig_pte = __ptep_get(ptep);
+
+ WARN_ON_ONCE(pte_valid_cont(orig_pte));
+ __set_pte(ptep, pte_mknoncont(pte));
+}
+
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ pte = pte_mknoncont(pte);
+
+ if (!contpte_is_enabled(mm))
+ __set_ptes(mm, addr, ptep, pte, nr);
+ else if (nr == 1) {
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __set_ptes(mm, addr, ptep, pte, nr);
+ contpte_try_fold(mm, addr, ptep, pte);
+ } else
+ contpte_set_ptes(mm, addr, ptep, pte, nr);
+}
+
+static inline void pte_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __pte_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ return __ptep_get_and_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_test_and_clear_young(vma, addr, ptep);
+
+ return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_clear_flush_young(vma, addr, ptep);
+
+ return contpte_ptep_clear_flush_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __ptep_set_wrprotect(mm, addr, ptep);
+ contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+}
+
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ entry = pte_mknoncont(entry);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+
+ return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+}
+
+#else /* CONFIG_ARM64_CONTPTE */
+
#define ptep_get __ptep_get
#define set_pte __set_pte
#define set_ptes __set_ptes
@@ -1131,6 +1331,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags __ptep_set_access_flags

+#endif /* CONFIG_ARM64_CONTPTE */
+
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..60454256945b 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
cache.o copypage.o flush.o \
ioremap.o mmap.o pgd.o mmu.o \
context.o proc.o pageattr.o fixmap.o
+obj-$(CONFIG_ARM64_CONTPTE) += contpte.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
new file mode 100644
index 000000000000..667bcf7c3260
--- /dev/null
+++ b/arch/arm64/mm/contpte.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/mm.h>
+#include <linux/export.h>
+#include <asm/tlbflush.h>
+
+static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, int nr)
+{
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long start_addr = addr;
+ int i;
+
+ for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
+ __pte_clear(mm, addr, ptep);
+
+ __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+}
+
+static bool ptep_any_valid(pte_t *ptep, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++, ptep++) {
+ if (pte_valid(__ptep_get(ptep)))
+ return true;
+ }
+
+ return false;
+}
+
+static void contpte_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, bool fold)
+{
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long start_addr;
+ pte_t *start_ptep;
+ int i;
+
+ start_ptep = ptep = contpte_align_down(ptep);
+ start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
+ pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
+ pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
+
+ if (pte_dirty(ptent))
+ pte = pte_mkdirty(pte);
+
+ if (pte_young(ptent))
+ pte = pte_mkyoung(pte);
+ }
+
+ __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+
+ __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
+}
+
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * We have already checked that the virtual and physical addresses are
+ * correctly aligned for a contpte mapping in contpte_try_fold() so the
+ * remaining checks are to ensure that the contpte range is fully
+ * covered by a single folio, and ensure that all the ptes are valid
+ * with contiguous PFNs and matching prots. We ignore the state of the
+ * access and dirty bits for the purpose of deciding if it's a contiguous
+ * range; the folding process will generate a single contpte entry which
+ * has a single access and dirty bit. Those 2 bits are the logical OR of
+ * their respective bits in the constituent pte entries. In order to
+ * ensure the contpte range is covered by a single folio, we must
+ * recover the folio from the pfn, but special mappings don't have a
+ * folio backing them. Fortunately contpte_try_fold() already checked
+ * that the pte is not special - we never try to fold special mappings.
+ * Note we can't use vm_normal_page() for this since we don't have the
+ * vma.
+ */
+
+ struct page *page = pte_page(pte);
+ struct folio *folio = page_folio(page);
+ unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
+ unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
+ unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
+ unsigned long pfn;
+ pgprot_t prot;
+ pte_t subpte;
+ pte_t *orig_ptep;
+ int i;
+
+ if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
+ return;
+
+ pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+ orig_ptep = ptep;
+ ptep = contpte_align_down(ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+ subpte = __ptep_get(ptep);
+ subpte = pte_mkold(pte_mkclean(subpte));
+
+ if (!pte_valid(subpte) ||
+ pte_pfn(subpte) != pfn ||
+ pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
+ return;
+ }
+
+ contpte_fold(mm, addr, orig_ptep, pte, true);
+}
+EXPORT_SYMBOL(__contpte_try_fold);
+
+void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * We have already checked that the ptes are contiguous in
+ * contpte_try_unfold(), so we can unfold unconditionally here.
+ */
+
+ contpte_fold(mm, addr, ptep, pte, false);
+}
+EXPORT_SYMBOL(__contpte_try_unfold);
+
+pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
+{
+ /*
+ * Gather access/dirty bits, which may be populated in any of the ptes
+ * of the contig range. We are guaranteed to be holding the PTL, so any
+ * contiguous range cannot be unfolded or otherwise modified under our
+ * feet.
+ */
+
+ pte_t pte;
+ int i;
+
+ ptep = contpte_align_down(ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++) {
+ pte = __ptep_get(ptep);
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get);
+
+pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
+{
+ /*
+ * Gather access/dirty bits, which may be populated in any of the ptes
+ * of the contig range. We may not be holding the PTL, so any contiguous
+ * range may be unfolded/modified/refolded under our feet. Therefore we
+ * ensure we read a _consistent_ contpte range by checking that all ptes
+ * in the range are valid and have CONT_PTE set, that all pfns are
+ * contiguous and that all pgprots are the same (ignoring access/dirty).
+ * If we find a pte that is not consistent, then we must be racing with
+ * an update so start again. If the target pte does not have CONT_PTE
+ * set then that is considered consistent on its own because it is not
+ * part of a contpte range.
+ */
+
+ pte_t orig_pte;
+ pgprot_t orig_prot;
+ pte_t *ptep;
+ unsigned long pfn;
+ pte_t pte;
+ pgprot_t prot;
+ int i;
+
+retry:
+ orig_pte = __ptep_get(orig_ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return orig_pte;
+
+ orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
+ ptep = contpte_align_down(orig_ptep);
+ pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+ pte = __ptep_get(ptep);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+ if (!pte_valid_cont(pte) ||
+ pte_pfn(pte) != pfn ||
+ pgprot_val(prot) != pgprot_val(orig_prot))
+ goto retry;
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_lockless);
+
+void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ unsigned long next;
+ unsigned long end = addr + (nr << PAGE_SHIFT);
+ unsigned long pfn = pte_pfn(pte);
+ pgprot_t prot = pte_pgprot(pte);
+ pte_t orig_pte;
+
+ do {
+ next = pte_cont_addr_end(addr, end);
+ nr = (next - addr) >> PAGE_SHIFT;
+ pte = pfn_pte(pfn, prot);
+
+ if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
+ pte = pte_mkcont(pte);
+ else
+ pte = pte_mknoncont(pte);
+
+ /*
+ * If operating on a partial contiguous range then we must first
+ * unfold the contiguous range if it was previously folded.
+ * Otherwise we could end up with overlapping tlb entries.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+ /*
+ * If we are replacing ptes that were contiguous or if the new
+ * ptes are contiguous and any of the ptes being replaced are
+ * valid, we need to clear and flush the range to prevent
+ * overlapping tlb entries.
+ */
+ orig_pte = __ptep_get(ptep);
+ if (pte_valid_cont(orig_pte) ||
+ (pte_cont(pte) && ptep_any_valid(ptep, nr)))
+ ptep_clear_flush_range(mm, addr, ptep, nr);
+
+ __set_ptes(mm, addr, ptep, pte, nr);
+
+ addr = next;
+ ptep += nr;
+ pfn += nr;
+
+ } while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_ptes);
+
+int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ /*
+ * ptep_clear_flush_young() technically requires us to clear the access
+ * flag for a _single_ pte. However, the core-mm code actually tracks
+ * access/dirty per folio, not per page. And since we only create a
+ * contig range when the range is covered by a single folio, we can get
+ * away with clearing young for the whole contig range here, so we avoid
+ * having to unfold.
+ */
+
+ int i;
+ int young = 0;
+
+ ptep = contpte_align_down(ptep);
+ addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+ young |= __ptep_test_and_clear_young(vma, addr, ptep);
+
+ return young;
+}
+EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
+
+int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ int young;
+
+ young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+
+ if (young) {
+ /*
+ * See comment in __ptep_clear_flush_young(); same rationale for
+ * eliding the trailing DSB applies here.
+ */
+ addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+ PAGE_SIZE, true, 3);
+ }
+
+ return young;
+}
+EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
+
+int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty)
+{
+ pte_t orig_pte;
+ int i;
+ unsigned long start_addr;
+
+ /*
+ * Gather the access/dirty bits for the contiguous range. If nothing has
+ * changed, it's a no-op.
+ */
+ orig_pte = ptep_get(ptep);
+ if (pte_val(orig_pte) == pte_val(entry))
+ return 0;
+
+ /*
+ * We can fix up access/dirty bits without having to unfold/fold the
+ * contig range. But if the write bit is changing, we need to go through
+ * the full unfold/fold cycle.
+ */
+ if (pte_write(orig_pte) == pte_write(entry)) {
+ /*
+ * For HW access management, we technically only need to update
+ * the flag on a single pte in the range. But for SW access
+ * management, we need to update all the ptes to prevent extra
+ * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
+ * and instead flush the whole range at the end.
+ */
+ ptep = contpte_align_down(ptep);
+ start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+ __ptep_set_access_flags(vma, addr, ptep, entry, 0);
+
+ if (dirty)
+ __flush_tlb_range(vma, start_addr, addr,
+ PAGE_SIZE, true, 3);
+ } else {
+ __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
+ __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+ contpte_try_fold(vma->vm_mm, addr, ptep, entry);
+ }
+
+ return 1;
+}
+EXPORT_SYMBOL(contpte_ptep_set_access_flags);
--
2.25.1

2023-11-15 21:28:30

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc1 next-20231115]
[cannot apply to arm64/for-next/core efi/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
| ^~~~~~~~~~
| ptep_get
cc1: some warnings being treated as errors


vim +969 mm/memory.c

950
951 static int folio_nr_pages_cont_mapped(struct folio *folio,
952 struct page *page, pte_t *pte,
953 unsigned long addr, unsigned long end,
954 pte_t ptent, bool *any_dirty)
955 {
956 int floops;
957 int i;
958 unsigned long pfn;
959 pgprot_t prot;
960 struct page *folio_end;
961
962 if (!folio_test_large(folio))
963 return 1;
964
965 folio_end = &folio->page + folio_nr_pages(folio);
966 end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
967 floops = (end - addr) >> PAGE_SHIFT;
968 pfn = page_to_pfn(page);
> 969 prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
970
971 *any_dirty = pte_dirty(ptent);
972
973 pfn++;
974 pte++;
975
976 for (i = 1; i < floops; i++) {
977 ptent = ptep_get(pte);
978 ptent = pte_mkold(pte_mkclean(ptent));
979
980 if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
981 pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
982 break;
983
984 if (pte_dirty(ptent))
985 *any_dirty = true;
986
987 pfn++;
988 pte++;
989 }
990
991 return i;
992 }
993

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-11-15 21:37:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <[email protected]> wrote:

> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork

Do you have a feeling for how much performance improved due to this?

Are there other architectures which might similarly benefit? By
implementing ptep_set_wrprotects(), it appears. If so, what sort of
gains might they see?

2023-11-15 22:41:40

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc1 next-20231115]
[cannot apply to arm64/for-next/core efi/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
config: alpha-defconfig (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/memory.c: In function 'folio_nr_pages_cont_mapped':
mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
| ^~~~~~~~~~
| ptep_get
>> mm/memory.c:969:16: error: incompatible types when assigning to type 'pgprot_t' from type 'int'
In file included from include/linux/shm.h:6,
from include/linux/sched.h:16,
from include/linux/hardirq.h:9,
from include/linux/interrupt.h:11,
from include/linux/kernel_stat.h:9,
from mm/memory.c:43:
>> arch/alpha/include/asm/page.h:38:29: error: request for member 'pgprot' in something not a structure or union
38 | #define pgprot_val(x) ((x).pgprot)
| ^
mm/memory.c:981:21: note: in expansion of macro 'pgprot_val'
981 | pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
| ^~~~~~~~~~
cc1: some warnings being treated as errors


vim +969 mm/memory.c

950
951 static int folio_nr_pages_cont_mapped(struct folio *folio,
952 struct page *page, pte_t *pte,
953 unsigned long addr, unsigned long end,
954 pte_t ptent, bool *any_dirty)
955 {
956 int floops;
957 int i;
958 unsigned long pfn;
959 pgprot_t prot;
960 struct page *folio_end;
961
962 if (!folio_test_large(folio))
963 return 1;
964
965 folio_end = &folio->page + folio_nr_pages(folio);
966 end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
967 floops = (end - addr) >> PAGE_SHIFT;
968 pfn = page_to_pfn(page);
> 969 prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
970
971 *any_dirty = pte_dirty(ptent);
972
973 pfn++;
974 pte++;
975
976 for (i = 1; i < floops; i++) {
977 ptent = ptep_get(pte);
978 ptent = pte_mkold(pte_mkclean(ptent));
979
980 if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
981 pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
982 break;
983
984 if (pte_dirty(ptent))
985 *any_dirty = true;
986
987 pfn++;
988 pte++;
989 }
990
991 return i;
992 }
993

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-11-16 09:36:09

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 15/11/2023 21:37, Andrew Morton wrote:
> On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <[email protected]> wrote:
>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork
>
> Do you have a feeling for how much performance improved due to this?

The commit log for patch 13 (the one which implements ptep_set_wrprotects() for
arm64) has performance numbers for a fork() microbenchmark with/without the
optimization:

---8<---

I saw a huge performance regression when PTE_CONT support was added; the
regression is mostly fixed with the addition of this change. The
following shows the regression relative to before PTE_CONT was enabled
(a bigger negative value is a bigger regression):

| cpus | before opt | after opt |
|-------:|-------------:|------------:|
| 1 | -10.4% | -5.2% |
| 8 | -15.4% | -3.5% |
| 16 | -38.7% | -3.7% |
| 24 | -57.0% | -4.4% |
| 32 | -65.8% | -5.4% |

---8<---

Note that's running on Ampere Altra, where TLBI tends to have high cost.

>
> Are there other architectures which might similarly benefit? By
> implementing ptep_set_wrprotects(), it appears. If so, what sort of
> gains might they see?

The rationale for this is to reduce the expense for arm64 of managing
contpte mappings. If other architectures support contpte-style mappings then they
could benefit from this API for the same reasons that arm64 benefits. I have a
vague understanding that riscv has a similar concept to arm64's contiguous
bit, so perhaps they are a future candidate. But I'm not familiar with the
details of the riscv feature so couldn't say whether they would be likely to see
the same level of perf improvement as arm64.

Thanks,
Ryan


2023-11-16 10:03:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 15.11.23 17:30, Ryan Roberts wrote:
> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
>
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 +++
> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
> 2 files changed, 150 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
> }
> #endif
>
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> /* Uffd-wp needs to be delivered to dest pte as well */
> pte = pte_mkuffd_wp(pte);
> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + return 1;
> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> + struct page *anchor, unsigned long anchor_vaddr)
> +{
> + unsigned long offset;
> + unsigned long vaddr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> + vaddr = anchor_vaddr + offset;
> +
> + if (anchor > page) {
> + if (vaddr > anchor_vaddr)
> + return 0;
> + } else {
> + if (vaddr < anchor_vaddr)
> + return ULONG_MAX;
> + }
> +
> + return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + pte_t ptent, bool *any_dirty)
> +{
> + int floops;
> + int i;
> + unsigned long pfn;
> + pgprot_t prot;
> + struct page *folio_end;
> +
> + if (!folio_test_large(folio))
> + return 1;
> +
> + folio_end = &folio->page + folio_nr_pages(folio);
> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> + *any_dirty = pte_dirty(ptent);
> +
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> + ptent = pte_mkold(pte_mkclean(ptent));
> +
> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> + break;
> +
> + if (pte_dirty(ptent))
> + *any_dirty = true;
> +
> + pfn++;
> + pte++;
> + }
> +
> + return i;
> }
>
> /*
> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
> */
> static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);
> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }
> }
> - rss[MM_ANONPAGES]++;
> - } else if (page) {
> - folio_get(folio);
> - page_dup_file_rmap(page, false);
> - rss[mm_counter_file(page)]++;
> +
> + nr = i;
> + folio_ref_add(folio, nr);

You're changing the order of mapcount vs. refcount increment. Don't.
Make sure your refcount >= mapcount.

You can do that easily by doing the folio_ref_add(folio, nr) first and
then decrementing in case of error accordingly. Errors due to pinned
pages are the corner case.
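
For the anon path that would look roughly like this (sketch only; the existing
i == 0 fallback to copy_present_page() and the file-backed path are elided):

	folio_ref_add(folio, nr);	/* refcount first, so refcount >= mapcount */

	for (i = 0; i < nr; i++, page++) {
		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
			/* Maybe pinned: return the references we won't use. */
			folio_ref_sub(folio, nr - i);
			break;
		}
		rss[MM_ANONPAGES]++;
		VM_BUG_ON(PageAnonExclusive(page));
	}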

I'll note that it will make a lot of sense to have batch variants of
page_try_dup_anon_rmap() and page_dup_file_rmap().

Especially, the batch variant of page_try_dup_anon_rmap() would only
check once if the folio maybe pinned, and in that case, you can simply
drop all references again. So you either have all or no ptes to process,
which makes that code easier.

But that can be added on top, and I'll happily do that.

--
Cheers,

David / dhildenb

2023-11-16 10:08:18

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Hi All,

Hoping for some guidance below!


On 15/11/2023 21:26, kernel test robot wrote:
> Hi Ryan,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
> [cannot apply to arm64/for-next/core efi/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
> config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
> 969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> | ^~~~~~~~~~
> | ptep_get
> cc1: some warnings being treated as errors

It turns out that pte_pgprot() is not universal; it's only implemented by
architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
loongarch, mips, powerpc, s390, sh, x86).

I'm using it in core-mm to help calculate the number of "contiguously mapped"
pages within a folio (note that's not the same as arm64's notion of
contpte-mapped; I just want to know that there are N physically contiguous pages
mapped virtually contiguously with the same permissions). And I'm using
pte_pgprot() to extract the permissions for each pte to compare. It's important
that we compare the permissions, because the fact that pages belong to the
same folio doesn't imply they are mapped with the same permissions; think
mprotect()ing a sub-range.

I don't have a great idea for how to fix this - does anyone have any thoughts?

Some ideas:

- Implement folio_nr_pages_cont_mapped() conditionally on
CONFIG_HAVE_IOREMAP_PROT being set; otherwise it just returns 1 and for those
arches we always get the old, non-batching behavior (see the sketch after this
list). There is some precedent; mm/memory.c is already using pte_pgprot()
behind this ifdef.

- Implement a generic helper the same way arm64 does it. This will return all
the pte bits that are not part of the PFN. But I'm not sure this is definitely a
valid thing to do for all architectures:

static inline pgprot_t pte_pgprot(pte_t pte)
{
unsigned long pfn = pte_pfn(pte);

return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
}

- Explicitly implement pte_pgprot() for all arches that don't currently have it
(sigh).
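
For illustration, the fallback half of the first option could be as simple as
the sketch below (the batched version from the patch would sit under the
matching #ifdef):

#ifndef CONFIG_HAVE_IOREMAP_PROT
static int folio_nr_pages_cont_mapped(struct folio *folio,
				       struct page *page, pte_t *pte,
				       unsigned long addr, unsigned long end,
				       pte_t ptent, bool *any_dirty)
{
	/* Without pte_pgprot() we can't compare prots; copy one pte at a time. */
	*any_dirty = pte_dirty(ptent);
	return 1;
}
#endif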

Thanks,
Ryan


>
>
> vim +969 mm/memory.c
>
> 950
> 951 static int folio_nr_pages_cont_mapped(struct folio *folio,
> 952 struct page *page, pte_t *pte,
> 953 unsigned long addr, unsigned long end,
> 954 pte_t ptent, bool *any_dirty)
> 955 {
> 956 int floops;
> 957 int i;
> 958 unsigned long pfn;
> 959 pgprot_t prot;
> 960 struct page *folio_end;
> 961
> 962 if (!folio_test_large(folio))
> 963 return 1;
> 964
> 965 folio_end = &folio->page + folio_nr_pages(folio);
> 966 end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> 967 floops = (end - addr) >> PAGE_SHIFT;
> 968 pfn = page_to_pfn(page);
> > 969 prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> 970
> 971 *any_dirty = pte_dirty(ptent);
> 972
> 973 pfn++;
> 974 pte++;
> 975
> 976 for (i = 1; i < floops; i++) {
> 977 ptent = ptep_get(pte);
> 978 ptent = pte_mkold(pte_mkclean(ptent));
> 979
> 980 if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> 981 pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> 982 break;
> 983
> 984 if (pte_dirty(ptent))
> 985 *any_dirty = true;
> 986
> 987 pfn++;
> 988 pte++;
> 989 }
> 990
> 991 return i;
> 992 }
> 993
>

2023-11-16 10:13:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 11:07, Ryan Roberts wrote:
> Hi All,
>
> Hoping for some guidance below!
>
>
> On 15/11/2023 21:26, kernel test robot wrote:
>> Hi Ryan,
>>
>> kernel test robot noticed the following build errors:
>>
>> [auto build test ERROR on akpm-mm/mm-everything]
>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>> [cannot apply to arm64/for-next/core efi/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use '--base' as documented in
>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>
>> url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>> config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <[email protected]>
>> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>
>> All errors (new ones prefixed by >>):
>>
>> mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>> 969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> | ^~~~~~~~~~
>> | ptep_get
>> cc1: some warnings being treated as errors
>
> It turns out that pte_pgprot() is not universal; its only implemented by
> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
> loongarch, mips, powerpc, s390, sh, x86).
>
> I'm using it in core-mm to help calculate the number of "contiguously mapped"
> pages within a folio (note that's not the same as arm64's notion of
> contpte-mapped. I just want to know that there are N physically contiguous pages
> mapped virtually contiguously with the same permissions). And I'm using
> pte_pgprot() to extract the permissions for each pte to compare. It's important
> that we compare the permissions because just because the pages belongs to the
> same folio doesn't imply they are mapped with the same permissions; think
> mprotect()ing a sub-range.
>
> I don't have a great idea for how to fix this - does anyone have any thoughts?

KIS :) fork() operates on individual VMAs if I am not daydreaming.

Just check for the obvious pte_write()/dirty bits and you'll be fine.

If your code tries to optimize "between VMAs", you really shouldn't be
doing that at this point.

If someone did an mprotect(), there are separate VMAs, and you shouldn't
be looking at the PTEs belonging to a different VMA.

--
Cheers,

David / dhildenb

2023-11-16 10:27:03

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 10:03, David Hildenbrand wrote:
> On 15.11.23 17:30, Ryan Roberts wrote:
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   include/linux/pgtable.h |  13 +++
>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>> *mm, unsigned long addres
>>   }
>>   #endif
>>   +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> +                unsigned long address, pte_t *ptep,
>> +                unsigned int nr)
>> +{
>> +    unsigned int i;
>> +
>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +        ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>> struct vm_area_struct *src_vma
>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>           pte = pte_mkuffd_wp(pte);
>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -    return 0;
>> +    return 1;
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +                struct page *anchor, unsigned long anchor_vaddr)
>> +{
>> +    unsigned long offset;
>> +    unsigned long vaddr;
>> +
>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> +    vaddr = anchor_vaddr + offset;
>> +
>> +    if (anchor > page) {
>> +        if (vaddr > anchor_vaddr)
>> +            return 0;
>> +    } else {
>> +        if (vaddr < anchor_vaddr)
>> +            return ULONG_MAX;
>> +    }
>> +
>> +    return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +                      struct page *page, pte_t *pte,
>> +                      unsigned long addr, unsigned long end,
>> +                      pte_t ptent, bool *any_dirty)
>> +{
>> +    int floops;
>> +    int i;
>> +    unsigned long pfn;
>> +    pgprot_t prot;
>> +    struct page *folio_end;
>> +
>> +    if (!folio_test_large(folio))
>> +        return 1;
>> +
>> +    folio_end = &folio->page + folio_nr_pages(folio);
>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +    floops = (end - addr) >> PAGE_SHIFT;
>> +    pfn = page_to_pfn(page);
>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> +    *any_dirty = pte_dirty(ptent);
>> +
>> +    pfn++;
>> +    pte++;
>> +
>> +    for (i = 1; i < floops; i++) {
>> +        ptent = ptep_get(pte);
>> +        ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> +            break;
>> +
>> +        if (pte_dirty(ptent))
>> +            *any_dirty = true;
>> +
>> +        pfn++;
>> +        pte++;
>> +    }
>> +
>> +    return i;
>>   }
>>     /*
>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>>    */
>>   static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> -         struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>> *src_vma,
>> +          pte_t *dst_pte, pte_t *src_pte,
>> +          unsigned long addr, unsigned long end,
>> +          int *rss, struct folio **prealloc)
>>   {
>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>       unsigned long vm_flags = src_vma->vm_flags;
>>       pte_t pte = ptep_get(src_pte);
>>       struct page *page;
>>       struct folio *folio;
>> +    int nr = 1;
>> +    bool anon;
>> +    bool any_dirty = pte_dirty(pte);
>> +    int i;
>>         page = vm_normal_page(src_vma, addr, pte);
>> -    if (page)
>> +    if (page) {
>>           folio = page_folio(page);
>> -    if (page && folio_test_anon(folio)) {
>> -        /*
>> -         * If this page may have been pinned by the parent process,
>> -         * copy the page immediately for the child so that we'll always
>> -         * guarantee the pinned page won't be randomly replaced in the
>> -         * future.
>> -         */
>> -        folio_get(folio);
>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -            /* Page may be pinned, we have to copy. */
>> -            folio_put(folio);
>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -                         addr, rss, prealloc, page);
>> +        anon = folio_test_anon(folio);
>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +                        end, pte, &any_dirty);
>> +
>> +        for (i = 0; i < nr; i++, page++) {
>> +            if (anon) {
>> +                /*
>> +                 * If this page may have been pinned by the
>> +                 * parent process, copy the page immediately for
>> +                 * the child so that we'll always guarantee the
>> +                 * pinned page won't be randomly replaced in the
>> +                 * future.
>> +                 */
>> +                if (unlikely(page_try_dup_anon_rmap(
>> +                        page, false, src_vma))) {
>> +                    if (i != 0)
>> +                        break;
>> +                    /* Page may be pinned, we have to copy. */
>> +                    return copy_present_page(
>> +                        dst_vma, src_vma, dst_pte,
>> +                        src_pte, addr, rss, prealloc,
>> +                        page);
>> +                }
>> +                rss[MM_ANONPAGES]++;
>> +                VM_BUG_ON(PageAnonExclusive(page));
>> +            } else {
>> +                page_dup_file_rmap(page, false);
>> +                rss[mm_counter_file(page)]++;
>> +            }
>>           }
>> -        rss[MM_ANONPAGES]++;
>> -    } else if (page) {
>> -        folio_get(folio);
>> -        page_dup_file_rmap(page, false);
>> -        rss[mm_counter_file(page)]++;
>> +
>> +        nr = i;
>> +        folio_ref_add(folio, nr);
>
> You're changing the order of mapcount vs. refcount increment. Don't. Make sure
> your refcount >= mapcount.

Ouch - good spot.

>
> You can do that easily by doing the folio_ref_add(folio, nr) first and then
> decrementing in case of error accordingly. Errors due to pinned pages are the
> corner case.

Yep, I propose this for v3:

diff --git a/mm/memory.c b/mm/memory.c
index b7c8228883cf..98373349806e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1014,6 +1014,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
         anon = folio_test_anon(folio);
         nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
                         end, pte, &any_dirty);
+        folio_ref_add(folio, nr);

         for (i = 0; i < nr; i++, page++) {
             if (anon) {
@@ -1029,6 +1030,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
                     if (i != 0)
                         break;
                     /* Page may be pinned, we have to copy. */
+                    folio_ref_sub(folio, nr);
                     return copy_present_page(
                         dst_vma, src_vma, dst_pte,
                         src_pte, addr, rss, prealloc,
@@ -1042,8 +1044,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
             }
         }

-        nr = i;
-        folio_ref_add(folio, nr);
+        if (i < nr) {
+            folio_ref_sub(folio, nr - i);
+            nr = i;
+        }
     }

>
> I'll note that it will make a lot of sense to have batch variants of
> page_try_dup_anon_rmap() and page_dup_file_rmap().
>
> Especially, the batch variant of page_try_dup_anon_rmap() would only check once
> if the folio maybe pinned, and in that case, you can simply drop all references
> again. So you either have all or no ptes to process, which makes that code easier.
>
> But that can be added on top, and I'll happily do that.

That's very kind - thanks for the offer! I'll leave it to you then.

2023-11-16 10:37:44

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 10:12, David Hildenbrand wrote:
> On 16.11.23 11:07, Ryan Roberts wrote:
>> Hi All,
>>
>> Hoping for some guidance below!
>>
>>
>> On 15/11/2023 21:26, kernel test robot wrote:
>>> Hi Ryan,
>>>
>>> kernel test robot noticed the following build errors:
>>>
>>> [auto build test ERROR on akpm-mm/mm-everything]
>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>> [cannot apply to arm64/for-next/core efi/next]
>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>> And when submitting patch, we suggest to use '--base' as documented in
>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>
>>> url:   
>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>> mm-everything
>>> patch link:   
>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>> config: arm-randconfig-002-20231116
>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>> reproduce (this is a W=1 build):
>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <[email protected]>
>>> | Closes:
>>> https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>>
>>> All errors (new ones prefixed by >>):
>>>
>>>     mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>       969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>           |                ^~~~~~~~~~
>>>           |                ptep_get
>>>     cc1: some warnings being treated as errors
>>
>> It turns out that pte_pgprot() is not universal; its only implemented by
>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>> loongarch, mips, powerpc, s390, sh, x86).
>>
>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>> pages within a folio (note that's not the same as arm64's notion of
>> contpte-mapped. I just want to know that there are N physically contiguous pages
>> mapped virtually contiguously with the same permissions). And I'm using
>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>> that we compare the permissions because just because the pages belongs to the
>> same folio doesn't imply they are mapped with the same permissions; think
>> mprotect()ing a sub-range.
>>
>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>
> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>
> Just check for the obvious pte_write()/dirty/ and you'll be fine.

Yes, that seems much simpler! I think we might have to be careful about the
uffd-wp bit too? I think that's it - are there any other exotic bits that might
need to be considered?

>
> If your code tries to optimize "between VMAs", you really shouldn't be doing
> that at this point.

No, I'm not doing that; it's one VMA at a time.

>
> If someone did an mprotect(), there are separate VMAs, and you shouldn't be
> looking at the PTEs belonging to a different VMA.
>

Yep understood, thanks.

2023-11-16 11:02:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 11:36, Ryan Roberts wrote:
> On 16/11/2023 10:12, David Hildenbrand wrote:
>> On 16.11.23 11:07, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Hoping for some guidance below!
>>>
>>>
>>> On 15/11/2023 21:26, kernel test robot wrote:
>>>> Hi Ryan,
>>>>
>>>> kernel test robot noticed the following build errors:
>>>>
>>>> [auto build test ERROR on akpm-mm/mm-everything]
>>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>>> [cannot apply to arm64/for-next/core efi/next]
>>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>>> And when submitting patch, we suggest to use '--base' as documented in
>>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>>
>>>> url:
>>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>>> mm-everything
>>>> patch link:
>>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>>> config: arm-randconfig-002-20231116
>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>>> reproduce (this is a W=1 build):
>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>>>
>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>> the same patch/commit), kindly add following tags
>>>> | Reported-by: kernel test robot <[email protected]>
>>>> | Closes:
>>>> https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>>>
>>>> All errors (new ones prefixed by >>):
>>>>
>>>>     mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>>       969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>           |                ^~~~~~~~~~
>>>>           |                ptep_get
>>>>     cc1: some warnings being treated as errors
>>>
>>> It turns out that pte_pgprot() is not universal; its only implemented by
>>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>>> loongarch, mips, powerpc, s390, sh, x86).
>>>
>>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>>> pages within a folio (note that's not the same as arm64's notion of
>>> contpte-mapped. I just want to know that there are N physically contiguous pages
>>> mapped virtually contiguously with the same permissions). And I'm using
>>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>>> that we compare the permissions because just because the pages belongs to the
>>> same folio doesn't imply they are mapped with the same permissions; think
>>> mprotect()ing a sub-range.
>>>
>>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>>
>> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>>
>> Just check for the obvious pte_write()/dirty/ and you'll be fine.
>
> Yes, that seems much simpler! I think we might have to be careful about the uffd
> wp bit too? I think that's it - are there any other exotic bits that might need
> to be considered?

Good question. Mimicking what the current code already does should be
sufficient. uffd-wp should have the PTE R/O. You can set the contpte bit
independently of any SW bit (uffd-wp, softdirty, ...) I guess, so no need to
worry about that.

--
Cheers,

David / dhildenb

2023-11-16 11:04:01

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 15.11.23 17:30, Ryan Roberts wrote:
> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
>
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 +++
> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
> 2 files changed, 150 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
> }
> #endif
>
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> /* Uffd-wp needs to be delivered to dest pte as well */
> pte = pte_mkuffd_wp(pte);
> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + return 1;
> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> + struct page *anchor, unsigned long anchor_vaddr)
> +{
> + unsigned long offset;
> + unsigned long vaddr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> + vaddr = anchor_vaddr + offset;
> +
> + if (anchor > page) {
> + if (vaddr > anchor_vaddr)
> + return 0;
> + } else {
> + if (vaddr < anchor_vaddr)
> + return ULONG_MAX;
> + }
> +
> + return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + pte_t ptent, bool *any_dirty)
> +{
> + int floops;
> + int i;
> + unsigned long pfn;
> + pgprot_t prot;
> + struct page *folio_end;
> +
> + if (!folio_test_large(folio))
> + return 1;
> +
> + folio_end = &folio->page + folio_nr_pages(folio);
> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> + *any_dirty = pte_dirty(ptent);
> +
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> + ptent = pte_mkold(pte_mkclean(ptent));
> +
> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> + break;
> +
> + if (pte_dirty(ptent))
> + *any_dirty = true;
> +
> + pfn++;
> + pte++;
> + }
> +
> + return i;
> }
>
> /*
> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
> */
> static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);
> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }
> }
> - rss[MM_ANONPAGES]++;
> - } else if (page) {
> - folio_get(folio);
> - page_dup_file_rmap(page, false);
> - rss[mm_counter_file(page)]++;
> +
> + nr = i;
> + folio_ref_add(folio, nr);
> }
>
> /*
> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> * in the parent and the child
> */
> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
> pte = pte_wrprotect(pte);

You likely want an "any_pte_writable" check here instead, no?

Any operations that target a single individual PTE while multiple PTEs
are adjusted are suspicious :)

--
Cheers,

David / dhildenb

2023-11-16 11:13:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 11:01, David Hildenbrand wrote:
> On 16.11.23 11:36, Ryan Roberts wrote:
>> On 16/11/2023 10:12, David Hildenbrand wrote:
>>> On 16.11.23 11:07, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> Hoping for some guidance below!
>>>>
>>>>
>>>> On 15/11/2023 21:26, kernel test robot wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> kernel test robot noticed the following build errors:
>>>>>
>>>>> [auto build test ERROR on akpm-mm/mm-everything]
>>>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>>>> [cannot apply to arm64/for-next/core efi/next]
>>>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>>>> And when submitting patch, we suggest to use '--base' as documented in
>>>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>>>
>>>>> url:
>>>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>>>> mm-everything
>>>>> patch link:
>>>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>>>> config: arm-randconfig-002-20231116
>>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>>>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>>>> reproduce (this is a W=1 build):
>>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new
>>>>> version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <[email protected]>
>>>>> | Closes:
>>>>> https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>>>>
>>>>> All errors (new ones prefixed by >>):
>>>>>
>>>>>      mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>>>        969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>            |                ^~~~~~~~~~
>>>>>            |                ptep_get
>>>>>      cc1: some warnings being treated as errors
>>>>
>>>> It turns out that pte_pgprot() is not universal; its only implemented by
>>>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>>>> loongarch, mips, powerpc, s390, sh, x86).
>>>>
>>>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>>>> pages within a folio (note that's not the same as arm64's notion of
>>>> contpte-mapped. I just want to know that there are N physically contiguous
>>>> pages
>>>> mapped virtually contiguously with the same permissions). And I'm using
>>>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>>>> that we compare the permissions because just because the pages belongs to the
>>>> same folio doesn't imply they are mapped with the same permissions; think
>>>> mprotect()ing a sub-range.
>>>>
>>>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>>>
>>> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>>>
>>> Just check for the obvious pte_write()/dirty/ and you'll be fine.
>>
>> Yes, that seems much simpler! I think we might have to be careful about the uffd
>> wp bit too? I think that's it - are there any other exotic bits that might need
>> to be considered?
>
> Good question. Mimicing what the current code already does should be sufficient.
> uffd-wp should have the PTE R/O. You can set the contpte bit independent of any
> SW bit (uffd-wp, softdirty, ...) I guess, no need to worry about that.
>

OK, thanks. I'll rework to use this approach in v3.

2023-11-16 11:21:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 11:03, David Hildenbrand wrote:
> On 15.11.23 17:30, Ryan Roberts wrote:
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   include/linux/pgtable.h |  13 +++
>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>> *mm, unsigned long addres
>>   }
>>   #endif
>>   +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> +                unsigned long address, pte_t *ptep,
>> +                unsigned int nr)
>> +{
>> +    unsigned int i;
>> +
>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +        ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>> struct vm_area_struct *src_vma
>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>           pte = pte_mkuffd_wp(pte);
>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -    return 0;
>> +    return 1;
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +                struct page *anchor, unsigned long anchor_vaddr)
>> +{
>> +    unsigned long offset;
>> +    unsigned long vaddr;
>> +
>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> +    vaddr = anchor_vaddr + offset;
>> +
>> +    if (anchor > page) {
>> +        if (vaddr > anchor_vaddr)
>> +            return 0;
>> +    } else {
>> +        if (vaddr < anchor_vaddr)
>> +            return ULONG_MAX;
>> +    }
>> +
>> +    return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +                      struct page *page, pte_t *pte,
>> +                      unsigned long addr, unsigned long end,
>> +                      pte_t ptent, bool *any_dirty)
>> +{
>> +    int floops;
>> +    int i;
>> +    unsigned long pfn;
>> +    pgprot_t prot;
>> +    struct page *folio_end;
>> +
>> +    if (!folio_test_large(folio))
>> +        return 1;
>> +
>> +    folio_end = &folio->page + folio_nr_pages(folio);
>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +    floops = (end - addr) >> PAGE_SHIFT;
>> +    pfn = page_to_pfn(page);
>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> +    *any_dirty = pte_dirty(ptent);
>> +
>> +    pfn++;
>> +    pte++;
>> +
>> +    for (i = 1; i < floops; i++) {
>> +        ptent = ptep_get(pte);
>> +        ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> +            break;
>> +
>> +        if (pte_dirty(ptent))
>> +            *any_dirty = true;
>> +
>> +        pfn++;
>> +        pte++;
>> +    }
>> +
>> +    return i;
>>   }
>>     /*
>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>>    */
>>   static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> -         struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>> *src_vma,
>> +          pte_t *dst_pte, pte_t *src_pte,
>> +          unsigned long addr, unsigned long end,
>> +          int *rss, struct folio **prealloc)
>>   {
>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>       unsigned long vm_flags = src_vma->vm_flags;
>>       pte_t pte = ptep_get(src_pte);
>>       struct page *page;
>>       struct folio *folio;
>> +    int nr = 1;
>> +    bool anon;
>> +    bool any_dirty = pte_dirty(pte);
>> +    int i;
>>         page = vm_normal_page(src_vma, addr, pte);
>> -    if (page)
>> +    if (page) {
>>           folio = page_folio(page);
>> -    if (page && folio_test_anon(folio)) {
>> -        /*
>> -         * If this page may have been pinned by the parent process,
>> -         * copy the page immediately for the child so that we'll always
>> -         * guarantee the pinned page won't be randomly replaced in the
>> -         * future.
>> -         */
>> -        folio_get(folio);
>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -            /* Page may be pinned, we have to copy. */
>> -            folio_put(folio);
>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -                         addr, rss, prealloc, page);
>> +        anon = folio_test_anon(folio);
>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +                        end, pte, &any_dirty);
>> +
>> +        for (i = 0; i < nr; i++, page++) {
>> +            if (anon) {
>> +                /*
>> +                 * If this page may have been pinned by the
>> +                 * parent process, copy the page immediately for
>> +                 * the child so that we'll always guarantee the
>> +                 * pinned page won't be randomly replaced in the
>> +                 * future.
>> +                 */
>> +                if (unlikely(page_try_dup_anon_rmap(
>> +                        page, false, src_vma))) {
>> +                    if (i != 0)
>> +                        break;
>> +                    /* Page may be pinned, we have to copy. */
>> +                    return copy_present_page(
>> +                        dst_vma, src_vma, dst_pte,
>> +                        src_pte, addr, rss, prealloc,
>> +                        page);
>> +                }
>> +                rss[MM_ANONPAGES]++;
>> +                VM_BUG_ON(PageAnonExclusive(page));
>> +            } else {
>> +                page_dup_file_rmap(page, false);
>> +                rss[mm_counter_file(page)]++;
>> +            }
>>           }
>> -        rss[MM_ANONPAGES]++;
>> -    } else if (page) {
>> -        folio_get(folio);
>> -        page_dup_file_rmap(page, false);
>> -        rss[mm_counter_file(page)]++;
>> +
>> +        nr = i;
>> +        folio_ref_add(folio, nr);
>>       }
>>         /*
>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>> vm_area_struct *src_vma,
>>        * in the parent and the child
>>        */
>>       if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>           pte = pte_wrprotect(pte);
>
> You likely want an "any_pte_writable" check here instead, no?
>
> Any operations that target a single indiividual PTE while multiple PTEs are
> adjusted are suspicious :)

The idea is that I've already constrained the batch of pages such that the
permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
pte is writable, then they all are - something has gone badly wrong if some are
writable and others are not.

The dirty bit has an any_dirty special case because we (deliberately) don't
consider access/dirty when determining the batch. Given the batch is all covered
by the same folio, and the kernel maintains the access/dirty info per-folio, we
don't want to unnecessarily reduce the batch size just because one of the pages
in the folio has been written to.

>

2023-11-16 13:21:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 12:20, Ryan Roberts wrote:
> On 16/11/2023 11:03, David Hildenbrand wrote:
>> On 15.11.23 17:30, Ryan Roberts wrote:
>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>> maps a physically contiguous block of memory, all belonging to the same
>>> folio, with the same permissions, and for shared mappings, the same
>>> dirty state. This will likely improve performance by a tiny amount due
>>> to batching the folio reference count management and calling set_ptes()
>>> rather than making individual calls to set_pte_at().
>>>
>>> However, the primary motivation for this change is to reduce the number
>>> of tlb maintenance operations that the arm64 backend has to perform
>>> during fork, as it is about to add transparent support for the
>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>> expensive, when all ptes in the range are being write-protected.
>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>> in the child, the backend does not need to fold a contiguous range once
>>> they are all populated - they can be initially populated as a contiguous
>>> range in the first place.
>>>
>>> This change addresses the core-mm refactoring only, and introduces
>>> ptep_set_wrprotects() with a default implementation that calls
>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>> performance improvement as part of the work to enable contpte mappings.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   include/linux/pgtable.h |  13 +++
>>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>> *mm, unsigned long addres
>>>   }
>>>   #endif
>>>   +#ifndef ptep_set_wrprotects
>>> +struct mm_struct;
>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>> +                unsigned long address, pte_t *ptep,
>>> +                unsigned int nr)
>>> +{
>>> +    unsigned int i;
>>> +
>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>> +        ptep_set_wrprotect(mm, address, ptep);
>>> +}
>>> +#endif
>>> +
>>>   /*
>>>    * On some architectures hardware does not set page access bit when accessing
>>>    * memory page, it is responsibility of software setting this bit. It brings
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1f18ed4a5497..b7c8228883cf 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>> struct vm_area_struct *src_vma
>>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>>           pte = pte_mkuffd_wp(pte);
>>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> -    return 0;
>>> +    return 1;
>>> +}
>>> +
>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>> +{
>>> +    unsigned long offset;
>>> +    unsigned long vaddr;
>>> +
>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>> +    vaddr = anchor_vaddr + offset;
>>> +
>>> +    if (anchor > page) {
>>> +        if (vaddr > anchor_vaddr)
>>> +            return 0;
>>> +    } else {
>>> +        if (vaddr < anchor_vaddr)
>>> +            return ULONG_MAX;
>>> +    }
>>> +
>>> +    return vaddr;
>>> +}
>>> +
>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>> +                      struct page *page, pte_t *pte,
>>> +                      unsigned long addr, unsigned long end,
>>> +                      pte_t ptent, bool *any_dirty)
>>> +{
>>> +    int floops;
>>> +    int i;
>>> +    unsigned long pfn;
>>> +    pgprot_t prot;
>>> +    struct page *folio_end;
>>> +
>>> +    if (!folio_test_large(folio))
>>> +        return 1;
>>> +
>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>> +    pfn = page_to_pfn(page);
>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>> +
>>> +    *any_dirty = pte_dirty(ptent);
>>> +
>>> +    pfn++;
>>> +    pte++;
>>> +
>>> +    for (i = 1; i < floops; i++) {
>>> +        ptent = ptep_get(pte);
>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>> +
>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>> +            break;
>>> +
>>> +        if (pte_dirty(ptent))
>>> +            *any_dirty = true;
>>> +
>>> +        pfn++;
>>> +        pte++;
>>> +    }
>>> +
>>> +    return i;
>>>   }
>>>     /*
>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>> - * is required to copy this pte.
>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>> + * first pte.
>>>    */
>>>   static inline int
>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> -         struct folio **prealloc)
>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>> *src_vma,
>>> +          pte_t *dst_pte, pte_t *src_pte,
>>> +          unsigned long addr, unsigned long end,
>>> +          int *rss, struct folio **prealloc)
>>>   {
>>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>>       unsigned long vm_flags = src_vma->vm_flags;
>>>       pte_t pte = ptep_get(src_pte);
>>>       struct page *page;
>>>       struct folio *folio;
>>> +    int nr = 1;
>>> +    bool anon;
>>> +    bool any_dirty = pte_dirty(pte);
>>> +    int i;
>>>         page = vm_normal_page(src_vma, addr, pte);
>>> -    if (page)
>>> +    if (page) {
>>>           folio = page_folio(page);
>>> -    if (page && folio_test_anon(folio)) {
>>> -        /*
>>> -         * If this page may have been pinned by the parent process,
>>> -         * copy the page immediately for the child so that we'll always
>>> -         * guarantee the pinned page won't be randomly replaced in the
>>> -         * future.
>>> -         */
>>> -        folio_get(folio);
>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>> -            /* Page may be pinned, we have to copy. */
>>> -            folio_put(folio);
>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>> -                         addr, rss, prealloc, page);
>>> +        anon = folio_test_anon(folio);
>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>> +                        end, pte, &any_dirty);
>>> +
>>> +        for (i = 0; i < nr; i++, page++) {
>>> +            if (anon) {
>>> +                /*
>>> +                 * If this page may have been pinned by the
>>> +                 * parent process, copy the page immediately for
>>> +                 * the child so that we'll always guarantee the
>>> +                 * pinned page won't be randomly replaced in the
>>> +                 * future.
>>> +                 */
>>> +                if (unlikely(page_try_dup_anon_rmap(
>>> +                        page, false, src_vma))) {
>>> +                    if (i != 0)
>>> +                        break;
>>> +                    /* Page may be pinned, we have to copy. */
>>> +                    return copy_present_page(
>>> +                        dst_vma, src_vma, dst_pte,
>>> +                        src_pte, addr, rss, prealloc,
>>> +                        page);
>>> +                }
>>> +                rss[MM_ANONPAGES]++;
>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>> +            } else {
>>> +                page_dup_file_rmap(page, false);
>>> +                rss[mm_counter_file(page)]++;
>>> +            }
>>>           }
>>> -        rss[MM_ANONPAGES]++;
>>> -    } else if (page) {
>>> -        folio_get(folio);
>>> -        page_dup_file_rmap(page, false);
>>> -        rss[mm_counter_file(page)]++;
>>> +
>>> +        nr = i;
>>> +        folio_ref_add(folio, nr);
>>>       }
>>>         /*
>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>> vm_area_struct *src_vma,
>>>        * in the parent and the child
>>>        */
>>>       if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>           pte = pte_wrprotect(pte);
>>
>> You likely want an "any_pte_writable" check here instead, no?
>>
>> Any operations that target a single indiividual PTE while multiple PTEs are
>> adjusted are suspicious :)
>
> The idea is that I've already constrained the batch of pages such that the
> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
> pte is writable, then they all are - something has gone badly wrong if some are
> writable and others are not.

I wonder if it would be cleaner and easier to not do that, though.

Simply record if any pte is writable. Afterwards they will *all* be R/O
and you can set the cont bit, correct?
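
Something like the following, purely as a sketch (any_writable here is
hypothetical, accumulated in folio_nr_pages_cont_mapped() the same way as
any_dirty, i.e. set whenever pte_write() is true for any pte in the batch):

        /* sketch only: wrprotect the whole batch if any pte was writable */
        if (is_cow_mapping(vm_flags) && any_writable) {
                ptep_set_wrprotects(src_mm, addr, src_pte, nr);
                pte = pte_wrprotect(pte);
        }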

--
Cheers,

David / dhildenb

2023-11-16 13:49:28

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 13:20, David Hildenbrand wrote:
> On 16.11.23 12:20, Ryan Roberts wrote:
>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>> maps a physically contiguous block of memory, all belonging to the same
>>>> folio, with the same permissions, and for shared mappings, the same
>>>> dirty state. This will likely improve performance by a tiny amount due
>>>> to batching the folio reference count management and calling set_ptes()
>>>> rather than making individual calls to set_pte_at().
>>>>
>>>> However, the primary motivation for this change is to reduce the number
>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>> during fork, as it is about to add transparent support for the
>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>> expensive, when all ptes in the range are being write-protected.
>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>> in the child, the backend does not need to fold a contiguous range once
>>>> they are all populated - they can be initially populated as a contiguous
>>>> range in the first place.
>>>>
>>>> This change addresses the core-mm refactoring only, and introduces
>>>> ptep_set_wrprotects() with a default implementation that calls
>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>> performance improvement as part of the work to enable contpte mappings.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>>    include/linux/pgtable.h |  13 +++
>>>>    mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>    2 files changed, 150 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>> *mm, unsigned long addres
>>>>    }
>>>>    #endif
>>>>    +#ifndef ptep_set_wrprotects
>>>> +struct mm_struct;
>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>> +                unsigned long address, pte_t *ptep,
>>>> +                unsigned int nr)
>>>> +{
>>>> +    unsigned int i;
>>>> +
>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>> +}
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * On some architectures hardware does not set page access bit when
>>>> accessing
>>>>     * memory page, it is responsibility of software setting this bit. It brings
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>> struct vm_area_struct *src_vma
>>>>            /* Uffd-wp needs to be delivered to dest pte as well */
>>>>            pte = pte_mkuffd_wp(pte);
>>>>        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>> -    return 0;
>>>> +    return 1;
>>>> +}
>>>> +
>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>> +{
>>>> +    unsigned long offset;
>>>> +    unsigned long vaddr;
>>>> +
>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>> +    vaddr = anchor_vaddr + offset;
>>>> +
>>>> +    if (anchor > page) {
>>>> +        if (vaddr > anchor_vaddr)
>>>> +            return 0;
>>>> +    } else {
>>>> +        if (vaddr < anchor_vaddr)
>>>> +            return ULONG_MAX;
>>>> +    }
>>>> +
>>>> +    return vaddr;
>>>> +}
>>>> +
>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>> +                      struct page *page, pte_t *pte,
>>>> +                      unsigned long addr, unsigned long end,
>>>> +                      pte_t ptent, bool *any_dirty)
>>>> +{
>>>> +    int floops;
>>>> +    int i;
>>>> +    unsigned long pfn;
>>>> +    pgprot_t prot;
>>>> +    struct page *folio_end;
>>>> +
>>>> +    if (!folio_test_large(folio))
>>>> +        return 1;
>>>> +
>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>> +    pfn = page_to_pfn(page);
>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>> +
>>>> +    *any_dirty = pte_dirty(ptent);
>>>> +
>>>> +    pfn++;
>>>> +    pte++;
>>>> +
>>>> +    for (i = 1; i < floops; i++) {
>>>> +        ptent = ptep_get(pte);
>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>> +
>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>> +            break;
>>>> +
>>>> +        if (pte_dirty(ptent))
>>>> +            *any_dirty = true;
>>>> +
>>>> +        pfn++;
>>>> +        pte++;
>>>> +    }
>>>> +
>>>> +    return i;
>>>>    }
>>>>      /*
>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>>> - * is required to copy this pte.
>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>>> + * first pte.
>>>>     */
>>>>    static inline int
>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>> *src_vma,
>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>> -         struct folio **prealloc)
>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>> *src_vma,
>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>> +          unsigned long addr, unsigned long end,
>>>> +          int *rss, struct folio **prealloc)
>>>>    {
>>>>        struct mm_struct *src_mm = src_vma->vm_mm;
>>>>        unsigned long vm_flags = src_vma->vm_flags;
>>>>        pte_t pte = ptep_get(src_pte);
>>>>        struct page *page;
>>>>        struct folio *folio;
>>>> +    int nr = 1;
>>>> +    bool anon;
>>>> +    bool any_dirty = pte_dirty(pte);
>>>> +    int i;
>>>>          page = vm_normal_page(src_vma, addr, pte);
>>>> -    if (page)
>>>> +    if (page) {
>>>>            folio = page_folio(page);
>>>> -    if (page && folio_test_anon(folio)) {
>>>> -        /*
>>>> -         * If this page may have been pinned by the parent process,
>>>> -         * copy the page immediately for the child so that we'll always
>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>> -         * future.
>>>> -         */
>>>> -        folio_get(folio);
>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>> -            /* Page may be pinned, we have to copy. */
>>>> -            folio_put(folio);
>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>> -                         addr, rss, prealloc, page);
>>>> +        anon = folio_test_anon(folio);
>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>> +                        end, pte, &any_dirty);
>>>> +
>>>> +        for (i = 0; i < nr; i++, page++) {
>>>> +            if (anon) {
>>>> +                /*
>>>> +                 * If this page may have been pinned by the
>>>> +                 * parent process, copy the page immediately for
>>>> +                 * the child so that we'll always guarantee the
>>>> +                 * pinned page won't be randomly replaced in the
>>>> +                 * future.
>>>> +                 */
>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>> +                        page, false, src_vma))) {
>>>> +                    if (i != 0)
>>>> +                        break;
>>>> +                    /* Page may be pinned, we have to copy. */
>>>> +                    return copy_present_page(
>>>> +                        dst_vma, src_vma, dst_pte,
>>>> +                        src_pte, addr, rss, prealloc,
>>>> +                        page);
>>>> +                }
>>>> +                rss[MM_ANONPAGES]++;
>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>> +            } else {
>>>> +                page_dup_file_rmap(page, false);
>>>> +                rss[mm_counter_file(page)]++;
>>>> +            }
>>>>            }
>>>> -        rss[MM_ANONPAGES]++;
>>>> -    } else if (page) {
>>>> -        folio_get(folio);
>>>> -        page_dup_file_rmap(page, false);
>>>> -        rss[mm_counter_file(page)]++;
>>>> +
>>>> +        nr = i;
>>>> +        folio_ref_add(folio, nr);
>>>>        }
>>>>          /*
>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>> vm_area_struct *src_vma,
>>>>         * in the parent and the child
>>>>         */
>>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>            pte = pte_wrprotect(pte);
>>>
>>> You likely want an "any_pte_writable" check here instead, no?
>>>
>>> Any operations that target a single individual PTE while multiple PTEs are
>>> adjusted are suspicious :)
>>
>> The idea is that I've already constrained the batch of pages such that the
>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>> pte is writable, then they all are - something has gone badly wrong if some are
>> writable and others are not.
>
> I wonder if it would be cleaner and easier to not do that, though.
>
> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
> can set the cont bit, correct?

Oh I see what you mean - that only works for cow mappings though. If you have a
shared mapping, you won't be making it read-only at fork. So if we ignore
pte_write() state when demarking the batches, we will end up with a batch of
pages with a mix of RO and RW in the parent, but then we set_ptes() for the
child and those pages will all have the permissions of the first page of the batch.

I guess we could special case and do it the way you suggested for cow mappings;
it might be faster, but certainly not cleaner and easier IMHO.
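
For reference, that special case would look something like this inside the
pte-matching loop - just a sketch, assuming the vma flags get passed down and
using a hypothetical any_writable out-parameter:

	if (is_cow_mapping(vm_flags)) {
		if (pte_write(ptent))
			*any_writable = true;
		/* ignore the write bit when comparing pgprots */
		ptent = pte_wrprotect(ptent);
	}

with the reference prot derived the same way from the first pte. For shared
mappings the write bit would still take part in the pgprot comparison, so we
would never batch a mix of RO and RW pages there.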

>

2023-11-16 14:13:56

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 14:49, Ryan Roberts wrote:
> On 16/11/2023 13:20, David Hildenbrand wrote:
>> On 16.11.23 12:20, Ryan Roberts wrote:
>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>> to batching the folio reference count management and calling set_ptes()
>>>>> rather than making individual calls to set_pte_at().
>>>>>
>>>>> However, the primary motivation for this change is to reduce the number
>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>> during fork, as it is about to add transparent support for the
>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>> expensive, when all ptes in the range are being write-protected.
>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>> they are all populated - they can be initially populated as a contiguous
>>>>> range in the first place.
>>>>>
>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>>    include/linux/pgtable.h |  13 +++
>>>>>    mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>    2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>> *mm, unsigned long addres
>>>>>    }
>>>>>    #endif
>>>>>    +#ifndef ptep_set_wrprotects
>>>>> +struct mm_struct;
>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>> +                unsigned long address, pte_t *ptep,
>>>>> +                unsigned int nr)
>>>>> +{
>>>>> +    unsigned int i;
>>>>> +
>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>>    /*
>>>>>     * On some architectures hardware does not set page access bit when
>>>>> accessing
>>>>>     * memory page, it is responsibility of software setting this bit. It brings
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>> struct vm_area_struct *src_vma
>>>>>            /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>            pte = pte_mkuffd_wp(pte);
>>>>>        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>> -    return 0;
>>>>> +    return 1;
>>>>> +}
>>>>> +
>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>> +{
>>>>> +    unsigned long offset;
>>>>> +    unsigned long vaddr;
>>>>> +
>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>> +    vaddr = anchor_vaddr + offset;
>>>>> +
>>>>> +    if (anchor > page) {
>>>>> +        if (vaddr > anchor_vaddr)
>>>>> +            return 0;
>>>>> +    } else {
>>>>> +        if (vaddr < anchor_vaddr)
>>>>> +            return ULONG_MAX;
>>>>> +    }
>>>>> +
>>>>> +    return vaddr;
>>>>> +}
>>>>> +
>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>> +                      struct page *page, pte_t *pte,
>>>>> +                      unsigned long addr, unsigned long end,
>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>> +{
>>>>> +    int floops;
>>>>> +    int i;
>>>>> +    unsigned long pfn;
>>>>> +    pgprot_t prot;
>>>>> +    struct page *folio_end;
>>>>> +
>>>>> +    if (!folio_test_large(folio))
>>>>> +        return 1;
>>>>> +
>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>> +    pfn = page_to_pfn(page);
>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>> +
>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>> +
>>>>> +    pfn++;
>>>>> +    pte++;
>>>>> +
>>>>> +    for (i = 1; i < floops; i++) {
>>>>> +        ptent = ptep_get(pte);
>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>> +
>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>> +            break;
>>>>> +
>>>>> +        if (pte_dirty(ptent))
>>>>> +            *any_dirty = true;
>>>>> +
>>>>> +        pfn++;
>>>>> +        pte++;
>>>>> +    }
>>>>> +
>>>>> +    return i;
>>>>>    }
>>>>>      /*
>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>>>> - * is required to copy this pte.
>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>>>> + * first pte.
>>>>>     */
>>>>>    static inline int
>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>> *src_vma,
>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>> -         struct folio **prealloc)
>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>> *src_vma,
>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>> +          unsigned long addr, unsigned long end,
>>>>> +          int *rss, struct folio **prealloc)
>>>>>    {
>>>>>        struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>        unsigned long vm_flags = src_vma->vm_flags;
>>>>>        pte_t pte = ptep_get(src_pte);
>>>>>        struct page *page;
>>>>>        struct folio *folio;
>>>>> +    int nr = 1;
>>>>> +    bool anon;
>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>> +    int i;
>>>>>          page = vm_normal_page(src_vma, addr, pte);
>>>>> -    if (page)
>>>>> +    if (page) {
>>>>>            folio = page_folio(page);
>>>>> -    if (page && folio_test_anon(folio)) {
>>>>> -        /*
>>>>> -         * If this page may have been pinned by the parent process,
>>>>> -         * copy the page immediately for the child so that we'll always
>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>> -         * future.
>>>>> -         */
>>>>> -        folio_get(folio);
>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>> -            /* Page may be pinned, we have to copy. */
>>>>> -            folio_put(folio);
>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>> -                         addr, rss, prealloc, page);
>>>>> +        anon = folio_test_anon(folio);
>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>> +                        end, pte, &any_dirty);
>>>>> +
>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>> +            if (anon) {
>>>>> +                /*
>>>>> +                 * If this page may have been pinned by the
>>>>> +                 * parent process, copy the page immediately for
>>>>> +                 * the child so that we'll always guarantee the
>>>>> +                 * pinned page won't be randomly replaced in the
>>>>> +                 * future.
>>>>> +                 */
>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>> +                        page, false, src_vma))) {
>>>>> +                    if (i != 0)
>>>>> +                        break;
>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>> +                    return copy_present_page(
>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>> +                        src_pte, addr, rss, prealloc,
>>>>> +                        page);
>>>>> +                }
>>>>> +                rss[MM_ANONPAGES]++;
>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>> +            } else {
>>>>> +                page_dup_file_rmap(page, false);
>>>>> +                rss[mm_counter_file(page)]++;
>>>>> +            }
>>>>>            }
>>>>> -        rss[MM_ANONPAGES]++;
>>>>> -    } else if (page) {
>>>>> -        folio_get(folio);
>>>>> -        page_dup_file_rmap(page, false);
>>>>> -        rss[mm_counter_file(page)]++;
>>>>> +
>>>>> +        nr = i;
>>>>> +        folio_ref_add(folio, nr);
>>>>>        }
>>>>>          /*
>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>>> vm_area_struct *src_vma,
>>>>>         * in the parent and the child
>>>>>         */
>>>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>            pte = pte_wrprotect(pte);
>>>>
>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>
>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>> adjusted are suspicious :)
>>>
>>> The idea is that I've already constrained the batch of pages such that the
>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>>> pte is writable, then they all are - something has gone badly wrong if some are
>>> writable and others are not.
>>
>> I wonder if it would be cleaner and easier to not do that, though.
>>
>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>> can set the cont bit, correct?
>
> Oh I see what you mean - that only works for cow mappings though. If you have a
> shared mapping, you won't be making it read-only at fork. So if we ignore
> pte_write() state when demarking the batches, we will end up with a batch of
> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
> child and those pages will all have the permissions of the first page of the batch.

I see what you mean.

After fork(), all anon pages will be R/O in the parent and the child.
Easy. If any PTE is writable, wrprotect all in the parent and the child.

After fork(), all shared pages can be R/O or R/W in the parent. For
simplicity, I think you can simply set them all R/O in the child. So if
any PTE is writable, wrprotect all in the child.

Why? In the default case, fork() does not even care about MAP_SHARED
mappings; it does not copy the page tables/ptes. See vma_needs_copy().

Only in corner cases (e.g., uffd-wp, VM_PFNMAP, VM_MIXEDMAP), or in
MAP_PRIVATE mappings, do you even end up in that code.

In MAP_PRIVATE mappings, only anon pages can be R/W, other pages can
never be writable, so it does not matter. In VM_PFNMAP/VM_MIXEDMAP
likely all permissions match either way.

So you might just wrprotect the !anon pages for the child and nobody
should really notice it; write faults will resolve it.
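
In code, roughly - sketch only, with any_writable being a hypothetical flag
collected while building the batch:

	if (any_writable) {
		if (is_cow_mapping(vm_flags))
			/* make the whole batch R/O in the parent */
			ptep_set_wrprotects(src_mm, addr, src_pte, nr);
		/* and hand the whole batch to the child R/O */
		pte = pte_wrprotect(pte);
	}

The only cost for the shared corner cases is a few extra write faults in the
child.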

Famous last words :)

--
Cheers,

David / dhildenb

2023-11-16 14:16:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 15:13, David Hildenbrand wrote:
> On 16.11.23 14:49, Ryan Roberts wrote:
>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>> rather than making individual calls to set_pte_at().
>>>>>>
>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>> during fork, as it is about to add transparent support for the
>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>> range in the first place.
>>>>>>
>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>> ---
>>>>>>    include/linux/pgtable.h |  13 +++
>>>>>>    mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>    2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>> *mm, unsigned long addres
>>>>>>    }
>>>>>>    #endif
>>>>>>    +#ifndef ptep_set_wrprotects
>>>>>> +struct mm_struct;
>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>> +                unsigned int nr)
>>>>>> +{
>>>>>> +    unsigned int i;
>>>>>> +
>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>>    /*
>>>>>>     * On some architectures hardware does not set page access bit when
>>>>>> accessing
>>>>>>     * memory page, it is responsibility of software setting this bit. It brings
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>> struct vm_area_struct *src_vma
>>>>>>            /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>            pte = pte_mkuffd_wp(pte);
>>>>>>        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>> -    return 0;
>>>>>> +    return 1;
>>>>>> +}
>>>>>> +
>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>> +{
>>>>>> +    unsigned long offset;
>>>>>> +    unsigned long vaddr;
>>>>>> +
>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>> +
>>>>>> +    if (anchor > page) {
>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>> +            return 0;
>>>>>> +    } else {
>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>> +            return ULONG_MAX;
>>>>>> +    }
>>>>>> +
>>>>>> +    return vaddr;
>>>>>> +}
>>>>>> +
>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>> +                      struct page *page, pte_t *pte,
>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>> +{
>>>>>> +    int floops;
>>>>>> +    int i;
>>>>>> +    unsigned long pfn;
>>>>>> +    pgprot_t prot;
>>>>>> +    struct page *folio_end;
>>>>>> +
>>>>>> +    if (!folio_test_large(folio))
>>>>>> +        return 1;
>>>>>> +
>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>> +    pfn = page_to_pfn(page);
>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>> +
>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>> +
>>>>>> +    pfn++;
>>>>>> +    pte++;
>>>>>> +
>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>> +        ptent = ptep_get(pte);
>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>> +
>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>> +            break;
>>>>>> +
>>>>>> +        if (pte_dirty(ptent))
>>>>>> +            *any_dirty = true;
>>>>>> +
>>>>>> +        pfn++;
>>>>>> +        pte++;
>>>>>> +    }
>>>>>> +
>>>>>> +    return i;
>>>>>>    }
>>>>>>      /*
>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>>>>> - * is required to copy this pte.
>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>>>>> + * first pte.
>>>>>>     */
>>>>>>    static inline int
>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>> *src_vma,
>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>> -         struct folio **prealloc)
>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>> *src_vma,
>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>> +          unsigned long addr, unsigned long end,
>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>    {
>>>>>>        struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>        unsigned long vm_flags = src_vma->vm_flags;
>>>>>>        pte_t pte = ptep_get(src_pte);
>>>>>>        struct page *page;
>>>>>>        struct folio *folio;
>>>>>> +    int nr = 1;
>>>>>> +    bool anon;
>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>> +    int i;
>>>>>>          page = vm_normal_page(src_vma, addr, pte);
>>>>>> -    if (page)
>>>>>> +    if (page) {
>>>>>>            folio = page_folio(page);
>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>> -        /*
>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>> -         * future.
>>>>>> -         */
>>>>>> -        folio_get(folio);
>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>> -            folio_put(folio);
>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>> -                         addr, rss, prealloc, page);
>>>>>> +        anon = folio_test_anon(folio);
>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>> +                        end, pte, &any_dirty);
>>>>>> +
>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>> +            if (anon) {
>>>>>> +                /*
>>>>>> +                 * If this page may have been pinned by the
>>>>>> +                 * parent process, copy the page immediately for
>>>>>> +                 * the child so that we'll always guarantee the
>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>> +                 * future.
>>>>>> +                 */
>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>> +                        page, false, src_vma))) {
>>>>>> +                    if (i != 0)
>>>>>> +                        break;
>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>> +                    return copy_present_page(
>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>> +                        page);
>>>>>> +                }
>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>> +            } else {
>>>>>> +                page_dup_file_rmap(page, false);
>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>> +            }
>>>>>>            }
>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>> -    } else if (page) {
>>>>>> -        folio_get(folio);
>>>>>> -        page_dup_file_rmap(page, false);
>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>> +
>>>>>> +        nr = i;
>>>>>> +        folio_ref_add(folio, nr);
>>>>>>        }
>>>>>>          /*
>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>>>> vm_area_struct *src_vma,
>>>>>>         * in the parent and the child
>>>>>>         */
>>>>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>            pte = pte_wrprotect(pte);
>>>>>
>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>
>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>> adjusted are suspicious :)
>>>>
>>>> The idea is that I've already constrained the batch of pages such that the
>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>>>> pte is writable, then they all are - something has gone badly wrong if some are
>>>> writable and others are not.
>>>
>>> I wonder if it would be cleaner and easier to not do that, though.
>>>
>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>> can set the cont bit, correct?
>>
>> Oh I see what you mean - that only works for cow mappings though. If you have a
>> shared mapping, you won't be making it read-only at fork. So if we ignore
>> pte_write() state when demarking the batches, we will end up with a batch of
>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>> child and those pages will all have the permissions of the first page of the batch.
>
> I see what you mean.
>
> After fork(), all anon pages will be R/O in the parent and the child.
> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>
> After fork(), all shared pages can be R/O or R/W in the parent. For
> simplicity, I think you can simply set them all R/O in the child. So if
> any PTE is writable, wrprotect all in the child.

Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.

But the devil is in the detail.

--
Cheers,

David / dhildenb

2023-11-16 17:59:19

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 14:15, David Hildenbrand wrote:
> On 16.11.23 15:13, David Hildenbrand wrote:
>> On 16.11.23 14:49, Ryan Roberts wrote:
>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>
>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>> range in the first place.
>>>>>>>
>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>>      include/linux/pgtable.h |  13 +++
>>>>>>>      mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>      2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>> *mm, unsigned long addres
>>>>>>>      }
>>>>>>>      #endif
>>>>>>>      +#ifndef ptep_set_wrprotects
>>>>>>> +struct mm_struct;
>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>> +                unsigned int nr)
>>>>>>> +{
>>>>>>> +    unsigned int i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>> +}
>>>>>>> +#endif
>>>>>>> +
>>>>>>>      /*
>>>>>>>       * On some architectures hardware does not set page access bit when
>>>>>>> accessing
>>>>>>>       * memory page, it is responsibility of software setting this bit.
>>>>>>> It brings
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>> struct vm_area_struct *src_vma
>>>>>>>              /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>              pte = pte_mkuffd_wp(pte);
>>>>>>>          set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>> -    return 0;
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>> +{
>>>>>>> +    unsigned long offset;
>>>>>>> +    unsigned long vaddr;
>>>>>>> +
>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>> +
>>>>>>> +    if (anchor > page) {
>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>> +            return 0;
>>>>>>> +    } else {
>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>> +            return ULONG_MAX;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return vaddr;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>> +{
>>>>>>> +    int floops;
>>>>>>> +    int i;
>>>>>>> +    unsigned long pfn;
>>>>>>> +    pgprot_t prot;
>>>>>>> +    struct page *folio_end;
>>>>>>> +
>>>>>>> +    if (!folio_test_large(folio))
>>>>>>> +        return 1;
>>>>>>> +
>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>> +
>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>> +
>>>>>>> +    pfn++;
>>>>>>> +    pte++;
>>>>>>> +
>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>> +        ptent = ptep_get(pte);
>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>> +
>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>> +            break;
>>>>>>> +
>>>>>>> +        if (pte_dirty(ptent))
>>>>>>> +            *any_dirty = true;
>>>>>>> +
>>>>>>> +        pfn++;
>>>>>>> +        pte++;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return i;
>>>>>>>      }
>>>>>>>        /*
>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>> page
>>>>>>> - * is required to copy this pte.
>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>> copy the
>>>>>>> + * first pte.
>>>>>>>       */
>>>>>>>      static inline int
>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>> -         struct folio **prealloc)
>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>      {
>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>          struct page *page;
>>>>>>>          struct folio *folio;
>>>>>>> +    int nr = 1;
>>>>>>> +    bool anon;
>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>> +    int i;
>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>> -    if (page)
>>>>>>> +    if (page) {
>>>>>>>              folio = page_folio(page);
>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>> -        /*
>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>> -         * future.
>>>>>>> -         */
>>>>>>> -        folio_get(folio);
>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>> -            folio_put(folio);
>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>> +                        end, pte, &any_dirty);
>>>>>>> +
>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>> +            if (anon) {
>>>>>>> +                /*
>>>>>>> +                 * If this page may have been pinned by the
>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>> +                 * future.
>>>>>>> +                 */
>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +                        page, false, src_vma))) {
>>>>>>> +                    if (i != 0)
>>>>>>> +                        break;
>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>> +                    return copy_present_page(
>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>> +                        page);
>>>>>>> +                }
>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +            } else {
>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>> +            }
>>>>>>>              }
>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>> -    } else if (page) {
>>>>>>> -        folio_get(folio);
>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +        nr = i;
>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>          }
>>>>>>>            /*
>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>> struct
>>>>>>> vm_area_struct *src_vma,
>>>>>>>           * in the parent and the child
>>>>>>>           */
>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>
>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>
>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>> adjusted are suspicious :)
>>>>>
>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>> first
>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>> are
>>>>> writable and others are not.
>>>>
>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>
>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>> can set the cont bit, correct?
>>>
>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>> pte_write() state when demarking the batches, we will end up with a batch of
>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>> child and those pages will all have the permissions of the first page of the
>>> batch.
>>
>> I see what you mean.
>>
>> After fork(), all anon pages will be R/O in the parent and the child.
>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>
>> After fork(), all shared pages can be R/O or R/W in the parent. For
>> simplicity, I think you can simply set them all R/O in the child. So if
>> any PTE is writable, wrprotect all in the child.
>
> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
>
> But the devil is in the detail.

OK I think I follow. I'll implement this for v3. Thanks!
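
Concretely, the rough shape I have in mind for the batch scan (sketch only;
naming and plumbing will probably change): mask the write bit out of the
pgprot comparison and report it separately via a hypothetical any_writable
out-parameter:

	if (pte_write(ptent))
		*any_writable = true;

	/* compare prots with access, dirty and write bits masked off */
	ptent = pte_mkold(pte_mkclean(pte_wrprotect(ptent)));

	if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
	    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
		break;

with the reference prot derived from the first pte in the same way.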

2023-11-21 11:23:06

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings


Ryan Roberts <[email protected]> writes:

[...]

> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, bool fold)
> +{
> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> + unsigned long start_addr;
> + pte_t *start_ptep;
> + int i;
> +
> + start_ptep = ptep = contpte_align_down(ptep);
> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> + if (pte_dirty(ptent))
> + pte = pte_mkdirty(pte);
> +
> + if (pte_young(ptent))
> + pte = pte_mkyoung(pte);
> + }
> +
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the virtual and physical addresses are
> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
> + * remaining checks are to ensure that the contpte range is fully
> + * covered by a single folio, and ensure that all the ptes are valid
> + * with contiguous PFNs and matching prots. We ignore the state of the
> + * access and dirty bits for the purpose of deciding if it's a contiguous
> + * range; the folding process will generate a single contpte entry which
> + * has a single access and dirty bit. Those 2 bits are the logical OR of
> + * their respective bits in the constituent pte entries. In order to
> + * ensure the contpte range is covered by a single folio, we must
> + * recover the folio from the pfn, but special mappings don't have a
> + * folio backing them. Fortunately contpte_try_fold() already checked
> + * that the pte is not special - we never try to fold special mappings.
> + * Note we can't use vm_normal_page() for this since we don't have the
> + * vma.
> + */
> +
> + struct page *page = pte_page(pte);
> + struct folio *folio = page_folio(page);
> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> + unsigned long pfn;
> + pgprot_t prot;
> + pte_t subpte;
> + pte_t *orig_ptep;
> + int i;
> +
> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> + return;
> +
> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> + orig_ptep = ptep;
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + subpte = __ptep_get(ptep);
> + subpte = pte_mkold(pte_mkclean(subpte));
> +
> + if (!pte_valid(subpte) ||
> + pte_pfn(subpte) != pfn ||
> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
> + return;
> + }
> +
> + contpte_fold(mm, addr, orig_ptep, pte, true);
> +}
> +EXPORT_SYMBOL(__contpte_try_fold);
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the ptes are contiguous in
> + * contpte_try_unfold(), so we can unfold unconditionally here.
> + */
> +
> + contpte_fold(mm, addr, ptep, pte, false);

I'm still working my way through the series but calling a fold during an
unfold stood out as it seemed wrong. Obviously further reading revealed
the boolean flag that changes the function's meaning but I think it would
be better to refactor that.

We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
the pte calculation loop into a separate helper
(eg. calculate_contpte_dirty_young() or some hopefully better name)
called further up the stack. That has an added benefit of providing a
spot to add the nice comment for young/dirty rules you provided in the
patch description ;-)

In other words we'd have something like:

void __contpte_try_unfold() {
	pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
	pte = pte_mknoncont(pte);
	set_cont_ptes(mm, addr, ptep, pte);
}

Which IMHO is more immediately understandable.
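
With the helper being more or less the existing loop body from contpte_fold()
above, something like this (the name is just a suggestion, and it assumes the
caller has already aligned ptep/addr down to the contpte boundary):

static pte_t calculate_contpte_dirty_young(struct mm_struct *mm,
			unsigned long addr, pte_t *ptep, pte_t pte)
{
	int i;

	/* tear the range down, folding access/dirty into the new pte */
	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);

		if (pte_dirty(ptent))
			pte = pte_mkdirty(pte);

		if (pte_young(ptent))
			pte = pte_mkyoung(pte);
	}

	return pte;
}

leaving the tlb flush and __set_ptes() to set_cont_ptes().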

- Alistair

> +}
> +EXPORT_SYMBOL(__contpte_try_unfold);
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We are guaranteed to be holding the PTL, so any
> + * contiguous range cannot be unfolded or otherwise modified under our
> + * feet.
> + */
> +
> + pte_t pte;
> + int i;
> +
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get);
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We may not be holding the PTL, so any contiguous
> + * range may be unfolded/modified/refolded under our feet. Therefore we
> + * ensure we read a _consistent_ contpte range by checking that all ptes
> + * in the range are valid and have CONT_PTE set, that all pfns are
> + * contiguous and that all pgprots are the same (ignoring access/dirty).
> + * If we find a pte that is not consistent, then we must be racing with
> + * an update so start again. If the target pte does not have CONT_PTE
> + * set then that is considered consistent on its own because it is not
> + * part of a contpte range.
> + */
> +
> + pte_t orig_pte;
> + pgprot_t orig_prot;
> + pte_t *ptep;
> + unsigned long pfn;
> + pte_t pte;
> + pgprot_t prot;
> + int i;
> +
> +retry:
> + orig_pte = __ptep_get(orig_ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return orig_pte;
> +
> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> + ptep = contpte_align_down(orig_ptep);
> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> + if (!pte_valid_cont(pte) ||
> + pte_pfn(pte) != pfn ||
> + pgprot_val(prot) != pgprot_val(orig_prot))
> + goto retry;
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> +
> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> + unsigned long next;
> + unsigned long end = addr + (nr << PAGE_SHIFT);
> + unsigned long pfn = pte_pfn(pte);
> + pgprot_t prot = pte_pgprot(pte);
> + pte_t orig_pte;
> +
> + do {
> + next = pte_cont_addr_end(addr, end);
> + nr = (next - addr) >> PAGE_SHIFT;
> + pte = pfn_pte(pfn, prot);
> +
> + if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> + pte = pte_mkcont(pte);
> + else
> + pte = pte_mknoncont(pte);
> +
> + /*
> + * If operating on a partial contiguous range then we must first
> + * unfold the contiguous range if it was previously folded.
> + * Otherwise we could end up with overlapping tlb entries.
> + */
> + if (nr != CONT_PTES)
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> + /*
> + * If we are replacing ptes that were contiguous or if the new
> + * ptes are contiguous and any of the ptes being replaced are
> + * valid, we need to clear and flush the range to prevent
> + * overlapping tlb entries.
> + */
> + orig_pte = __ptep_get(ptep);
> + if (pte_valid_cont(orig_pte) ||
> + (pte_cont(pte) && ptep_any_valid(ptep, nr)))
> + ptep_clear_flush_range(mm, addr, ptep, nr);
> +
> + __set_ptes(mm, addr, ptep, pte, nr);
> +
> + addr = next;
> + ptep += nr;
> + pfn += nr;
> +
> + } while (addr != end);
> +}
> +EXPORT_SYMBOL(contpte_set_ptes);
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * ptep_clear_flush_young() technically requires us to clear the access
> + * flag for a _single_ pte. However, the core-mm code actually tracks
> + * access/dirty per folio, not per page. And since we only create a
> + * contig range when the range is covered by a single folio, we can get
> + * away with clearing young for the whole contig range here, so we avoid
> + * having to unfold.
> + */
> +
> + int i;
> + int young = 0;
> +
> + ptep = contpte_align_down(ptep);
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> + return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + int young;
> +
> + young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +
> + if (young) {
> + /*
> + * See comment in __ptep_clear_flush_young(); same rationale for
> + * eliding the trailing DSB applies here.
> + */
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> + PAGE_SIZE, true, 3);
> + }
> +
> + return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t entry, int dirty)
> +{
> + pte_t orig_pte;
> + int i;
> + unsigned long start_addr;
> +
> + /*
> + * Gather the access/dirty bits for the contiguous range. If nothing has
> + * changed, it's a noop.
> + */
> + orig_pte = ptep_get(ptep);
> + if (pte_val(orig_pte) == pte_val(entry))
> + return 0;
> +
> + /*
> + * We can fix up access/dirty bits without having to unfold/fold the
> + * contig range. But if the write bit is changing, we need to go through
> + * the full unfold/fold cycle.
> + */
> + if (pte_write(orig_pte) == pte_write(entry)) {
> + /*
> + * For HW access management, we technically only need to update
> + * the flag on a single pte in the range. But for SW access
> + * management, we need to update all the ptes to prevent extra
> + * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
> + * and instead flush the whole range at the end.
> + */
> + ptep = contpte_align_down(ptep);
> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + __ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +
> + if (dirty)
> + __flush_tlb_range(vma, start_addr, addr,
> + PAGE_SIZE, true, 3);
> + } else {
> + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> + __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> + contpte_try_fold(vma->vm_mm, addr, ptep, entry);
> + }
> +
> + return 1;
> +}
> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);

2023-11-21 15:16:16

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings

On 21/11/2023 11:22, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
> [...]
>
>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, pte_t pte, bool fold)
>> +{
>> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> + unsigned long start_addr;
>> + pte_t *start_ptep;
>> + int i;
>> +
>> + start_ptep = ptep = contpte_align_down(ptep);
>> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>> +
>> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> + if (pte_dirty(ptent))
>> + pte = pte_mkdirty(pte);
>> +
>> + if (pte_young(ptent))
>> + pte = pte_mkyoung(pte);
>> + }
>> +
>> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, pte_t pte)
>> +{
>> + /*
>> + * We have already checked that the virtual and physical addresses are
>> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
>> + * remaining checks are to ensure that the contpte range is fully
>> + * covered by a single folio, and ensure that all the ptes are valid
>> + * with contiguous PFNs and matching prots. We ignore the state of the
>> + * access and dirty bits for the purpose of deciding if it's a contiguous
>> + * range; the folding process will generate a single contpte entry which
>> + * has a single access and dirty bit. Those 2 bits are the logical OR of
>> + * their respective bits in the constituent pte entries. In order to
>> + * ensure the contpte range is covered by a single folio, we must
>> + * recover the folio from the pfn, but special mappings don't have a
>> + * folio backing them. Fortunately contpte_try_fold() already checked
>> + * that the pte is not special - we never try to fold special mappings.
>> + * Note we can't use vm_normal_page() for this since we don't have the
>> + * vma.
>> + */
>> +
>> + struct page *page = pte_page(pte);
>> + struct folio *folio = page_folio(page);
>> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>> + unsigned long pfn;
>> + pgprot_t prot;
>> + pte_t subpte;
>> + pte_t *orig_ptep;
>> + int i;
>> +
>> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>> + return;
>> +
>> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> + orig_ptep = ptep;
>> + ptep = contpte_align_down(ptep);
>> +
>> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> + subpte = __ptep_get(ptep);
>> + subpte = pte_mkold(pte_mkclean(subpte));
>> +
>> + if (!pte_valid(subpte) ||
>> + pte_pfn(subpte) != pfn ||
>> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>> + return;
>> + }
>> +
>> + contpte_fold(mm, addr, orig_ptep, pte, true);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_fold);
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, pte_t pte)
>> +{
>> + /*
>> + * We have already checked that the ptes are contiguous in
>> + * contpte_try_unfold(), so we can unfold unconditionally here.
>> + */
>> +
>> + contpte_fold(mm, addr, ptep, pte, false);
>
> I'm still working my way through the series but

Thanks for taking the time to review!

> calling a fold during an
> unfold stood out as it seemed wrong. Obviously further reading revealed
> the boolean flag that changes the function's meaning but I think it would
> be better to refactor that.

Yes that sounds reasonable.

>
> We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
> the pte calculation loop into a separate helper
> (eg. calculate_contpte_dirty_young() or some hopefully better name)
> called further up the stack. That has an added benefit of providing a
> spot to add the nice comment for young/dirty rules you provided in the
> patch description ;-)
>
> In other words we'd have something like:
>
> void __contpte_try_unfold() {
> 	pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
> 	pte = pte_mknoncont(pte);
> 	set_cont_ptes(mm, addr, ptep, pte);
> }

My concern with this approach is that calculate_contpte_dirty_young() has side
effects; it has to clear each PTE as it loops through, to prevent a race between
our reading access/dirty and another thread causing access/dirty to be set. So
it's not just a "calculation", it's the teardown portion of the process too. I
guess it's a matter of taste, so I'm happy for it to be argued the other way,
but I would prefer to keep it all together in one function.

How about renaming contpte_fold() to contpte_convert() or contpte_repaint()
(other suggestions welcome), and extracting the pte_mkcont()/pte_mknoncont()
part (so we can remove the bool param):

void __contpte_try_unfold() {
pte = pte_mknoncont(pte);
contpte_convert(mm, addr, ptep, pte);
}
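
For completeness, the fold side would then be symmetric; a minimal sketch,
assuming the proposed contpte_convert() name, with the existing
folio/pfn/prot checks in __contpte_try_fold() left where they are:

void __contpte_try_fold() {
/* ... existing folio/pfn/prot checks, then: */
pte = pte_mkcont(pte);
contpte_convert(mm, addr, ptep, pte);
}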

Thanks,
Ryan

>
> Which IMHO is more immediately understandable.
>
> - Alistair
>

2023-11-22 06:03:40

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings


Ryan Roberts <[email protected]> writes:

> On 21/11/2023 11:22, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>> [...]
>>
>>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>>> + pte_t *ptep, pte_t pte, bool fold)
>>> +{
>>> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>> + unsigned long start_addr;
>>> + pte_t *start_ptep;
>>> + int i;
>>> +
>>> + start_ptep = ptep = contpte_align_down(ptep);
>>> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>>> +
>>> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> + if (pte_dirty(ptent))
>>> + pte = pte_mkdirty(pte);
>>> +
>>> + if (pte_young(ptent))
>>> + pte = pte_mkyoung(pte);
>>> + }
>>> +
>>> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>> +
>>> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>> +}
>>> +
>>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>> + pte_t *ptep, pte_t pte)
>>> +{
>>> + /*
>>> + * We have already checked that the virtual and physical addresses are
>>> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
>>> + * remaining checks are to ensure that the contpte range is fully
>>> + * covered by a single folio, and ensure that all the ptes are valid
>>> + * with contiguous PFNs and matching prots. We ignore the state of the
>>> + * access and dirty bits for the purpose of deciding if it's a contiguous
>>> + * range; the folding process will generate a single contpte entry which
>>> + * has a single access and dirty bit. Those 2 bits are the logical OR of
>>> + * their respective bits in the constituent pte entries. In order to
>>> + * ensure the contpte range is covered by a single folio, we must
>>> + * recover the folio from the pfn, but special mappings don't have a
>>> + * folio backing them. Fortunately contpte_try_fold() already checked
>>> + * that the pte is not special - we never try to fold special mappings.
>>> + * Note we can't use vm_normal_page() for this since we don't have the
>>> + * vma.
>>> + */
>>> +
>>> + struct page *page = pte_page(pte);
>>> + struct folio *folio = page_folio(page);
>>> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>>> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>>> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>>> + unsigned long pfn;
>>> + pgprot_t prot;
>>> + pte_t subpte;
>>> + pte_t *orig_ptep;
>>> + int i;
>>> +
>>> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>>> + return;
>>> +
>>> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>>> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>> + orig_ptep = ptep;
>>> + ptep = contpte_align_down(ptep);
>>> +
>>> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>> + subpte = __ptep_get(ptep);
>>> + subpte = pte_mkold(pte_mkclean(subpte));
>>> +
>>> + if (!pte_valid(subpte) ||
>>> + pte_pfn(subpte) != pfn ||
>>> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>>> + return;
>>> + }
>>> +
>>> + contpte_fold(mm, addr, orig_ptep, pte, true);
>>> +}
>>> +EXPORT_SYMBOL(__contpte_try_fold);
>>> +
>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> + pte_t *ptep, pte_t pte)
>>> +{
>>> + /*
>>> + * We have already checked that the ptes are contiguous in
>>> + * contpte_try_unfold(), so we can unfold unconditionally here.
>>> + */
>>> +
>>> + contpte_fold(mm, addr, ptep, pte, false);
>>
>> I'm still working my way through the series but
>
> Thanks for taking the time to review!
>
>> calling a fold during an
>> unfold stood out as it seemed wrong. Obviously further reading revealed
>> the boolean flag that changes the function's meaning but I think it would
>> be better to refactor that.
>
> Yes that sounds reasonable.
>
>>
>> We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
>> the pte calculation loop into a separate helper
>> (eg. calculate_contpte_dirty_young() or some hopefully better name)
>> called further up the stack. That has an added benefit of providing a
>> spot to add the nice comment for young/dirty rules you provided in the
>> patch description ;-)
>>
>> In other words we'd have something like:
>>
>> void __contpte_try_unfold() {
>> pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
>> pte = pte_mknoncont(pte);
>> set_cont_ptes(mm, addr, ptep, pte);
>> }
>
> My concern with this approach is that calculate_contpte_dirty_young() has side
> effects; it has to clear each PTE as it loops through, to prevent a race between
> our reading access/dirty and another thread causing access/dirty to be set. So
> it's not just a "calculation", it's the teardown portion of the process too. I
> guess it's a taste thing, so happy for it to be argued the other way, but I would
> prefer to keep it all together in one function.
>
> How about renaming contpte_fold() to contpte_convert() or contpte_repaint()
> (other suggestions welcome), and extracting the pte_mkcont()/pte_mknoncont()
> part (so we can remove the bool param):
>
> void __contpte_try_unfold() {
> pte = pte_mknoncont(pte);
> contpte_convert(mm, addr, ptep, pte);
> }

Thanks. That works for me, although sadly I don't have any better ideas
for names atm.

- Alistair

> Thanks,
> Ryan
>
>>
>> Which IMHO is more immediately understandable.
>>
>> - Alistair
>>

2023-11-22 08:39:20

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings

On 22/11/2023 06:01, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 21/11/2023 11:22, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>> [...]
>>>
>>>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>>>> + pte_t *ptep, pte_t pte, bool fold)
>>>> +{
>>>> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> + unsigned long start_addr;
>>>> + pte_t *start_ptep;
>>>> + int i;
>>>> +
>>>> + start_ptep = ptep = contpte_align_down(ptep);
>>>> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>>>> +
>>>> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> + if (pte_dirty(ptent))
>>>> + pte = pte_mkdirty(pte);
>>>> +
>>>> + if (pte_young(ptent))
>>>> + pte = pte_mkyoung(pte);
>>>> + }
>>>> +
>>>> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>> +
>>>> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>>> +}
>>>> +
>>>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>> + pte_t *ptep, pte_t pte)
>>>> +{
>>>> + /*
>>>> + * We have already checked that the virtual and physical addresses are
>>>> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
>>>> + * remaining checks are to ensure that the contpte range is fully
>>>> + * covered by a single folio, and ensure that all the ptes are valid
>>>> + * with contiguous PFNs and matching prots. We ignore the state of the
>>>> + * access and dirty bits for the purpose of deciding if it's a contiguous
>>>> + * range; the folding process will generate a single contpte entry which
>>>> + * has a single access and dirty bit. Those 2 bits are the logical OR of
>>>> + * their respective bits in the constituent pte entries. In order to
>>>> + * ensure the contpte range is covered by a single folio, we must
>>>> + * recover the folio from the pfn, but special mappings don't have a
>>>> + * folio backing them. Fortunately contpte_try_fold() already checked
>>>> + * that the pte is not special - we never try to fold special mappings.
>>>> + * Note we can't use vm_normal_page() for this since we don't have the
>>>> + * vma.
>>>> + */
>>>> +
>>>> + struct page *page = pte_page(pte);
>>>> + struct folio *folio = page_folio(page);
>>>> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>>>> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>>>> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>>>> + unsigned long pfn;
>>>> + pgprot_t prot;
>>>> + pte_t subpte;
>>>> + pte_t *orig_ptep;
>>>> + int i;
>>>> +
>>>> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>>>> + return;
>>>> +
>>>> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>>>> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> + orig_ptep = ptep;
>>>> + ptep = contpte_align_down(ptep);
>>>> +
>>>> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> + subpte = __ptep_get(ptep);
>>>> + subpte = pte_mkold(pte_mkclean(subpte));
>>>> +
>>>> + if (!pte_valid(subpte) ||
>>>> + pte_pfn(subpte) != pfn ||
>>>> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>>>> + return;
>>>> + }
>>>> +
>>>> + contpte_fold(mm, addr, orig_ptep, pte, true);
>>>> +}
>>>> +EXPORT_SYMBOL(__contpte_try_fold);
>>>> +
>>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>> + pte_t *ptep, pte_t pte)
>>>> +{
>>>> + /*
>>>> + * We have already checked that the ptes are contiguous in
>>>> + * contpte_try_unfold(), so we can unfold unconditionally here.
>>>> + */
>>>> +
>>>> + contpte_fold(mm, addr, ptep, pte, false);
>>>
>>> I'm still working my way through the series but
>>
>> Thanks for taking the time to review!
>>
>>> calling a fold during an
>>> unfold stood out as it seemed wrong. Obviously further reading revealed
>>> the boolean flag that changes the function's meaning but I think it would
>>> be better to refactor that.
>>
>> Yes that sounds reasonable.
>>
>>>
>>> We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
>>> the pte calculation loop into a separate helper
>>> (eg. calculate_contpte_dirty_young() or some hopefully better name)
>>> called further up the stack. That has an added benefit of providing a
>>> spot to add the nice comment for young/dirty rules you provided in the
>>> patch description ;-)
>>>
>>> In other words we'd have something like:
>>>
>>> void __contpte_try_unfold() {
>>> pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
>>> pte = pte_mknoncont(pte);
>>> set_cont_ptes(mm, addr, ptep, pte);
>>> }
>>
>> My concern with this approach is that calculate_contpte_dirty_young() has side
>> effects; it has to clear each PTE as it loops through, to prevent a race between
>> our reading access/dirty and another thread causing access/dirty to be set. So
>> it's not just a "calculation", it's the teardown portion of the process too. I
>> guess it's a taste thing, so happy for it to be argued the other way, but I would
>> prefer to keep it all together in one function.
>>
>> How about renaming contpte_fold() to contpte_convert() or contpte_repaint()
>> (other suggestions welcome), and extracting the pte_mkcont()/pte_mknoncont()
>> part (so we can remove the bool param):
>>
>> void __contpte_try_unfold() {
>> pte = pte_mknoncont(pte);
>> contpte_convert(mm, addr, ptep, pte);
>> }
>
> Thanks. That works for me, although sadly I don't have any better ideas
> for names atm.

Thanks - I'll make this change for v3 and go with contpte_convert().

>
> - Alistair
>
>> Thanks,
>> Ryan
>>
>>>
>>> Which IMHO is more immediately understandable.
>>>
>>> - Alistair
>>>
>

2023-11-23 04:28:48

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()


Ryan Roberts <[email protected]> writes:

> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
>
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 +++
> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
> 2 files changed, 150 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
> }
> #endif
>
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> /* Uffd-wp needs to be delivered to dest pte as well */
> pte = pte_mkuffd_wp(pte);
> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + return 1;

We should update the function comment to indicate why we return 1 here
because it will become non-obvious in future. But perhaps it's better to
leave this as is and do the error check/return code calculation in
copy_present_ptes().

> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> + struct page *anchor, unsigned long anchor_vaddr)

It's likely I'm easily confused but the arguments here don't make much
sense to me. Something like this (noting that I've switched the argument
order) makes more sense to me at least:

static inline unsigned long page_cont_mapped_vaddr(struct page *page,
unsigned long page_vaddr, struct page *next_folio_page)

> +{
> + unsigned long offset;
> + unsigned long vaddr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;

Which IMHO makes this much more readable:

offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;

> + vaddr = anchor_vaddr + offset;
> +
> + if (anchor > page) {

And also highlights that I think this condition (page > folio_page_end)
is impossible to hit. Which is good ...

> + if (vaddr > anchor_vaddr)
> + return 0;

... because I'm not sure returning 0 is valid as we would end up setting
floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
particularly well :-)

> + } else {
> + if (vaddr < anchor_vaddr)

Same here - isn't the vaddr of the next folio always going to be larger
than the vaddr for the current page? It seems this function is really
just calculating the virtual address of the next folio, or am I deeply
confused?

> + return ULONG_MAX;
> + }
> +
> + return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + pte_t ptent, bool *any_dirty)
> +{
> + int floops;
> + int i;
> + unsigned long pfn;
> + pgprot_t prot;
> + struct page *folio_end;
> +
> + if (!folio_test_large(folio))
> + return 1;
> +
> + folio_end = &folio->page + folio_nr_pages(folio);

I think you can replace this with:

folio_end = folio_next(folio)

Although given this is only passed to page_cont_mapped_vaddr() perhaps
it's better to just pass the folio in and do the calculation there.
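
A minimal sketch of that first option, keeping the struct page * type that
page_cont_mapped_vaddr() currently takes:

	folio_end = &folio_next(folio)->page;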

> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> + *any_dirty = pte_dirty(ptent);
> +
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> + ptent = pte_mkold(pte_mkclean(ptent));
> +
> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> + break;
> +
> + if (pte_dirty(ptent))
> + *any_dirty = true;
> +
> + pfn++;
> + pte++;
> + }
> +
> + return i;
> }
>
> /*
> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
> */
> static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);
> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }
> }
> - rss[MM_ANONPAGES]++;
> - } else if (page) {
> - folio_get(folio);
> - page_dup_file_rmap(page, false);
> - rss[mm_counter_file(page)]++;
> +
> + nr = i;
> + folio_ref_add(folio, nr);
> }
>
> /*
> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> * in the parent and the child
> */
> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
> pte = pte_wrprotect(pte);
> }
> - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>
> /*
> - * If it's a shared mapping, mark it clean in
> - * the child
> + * If it's a shared mapping, mark it clean in the child. If its a
> + * private mapping, mark it dirty in the child if _any_ of the parent
> + * mappings in the block were marked dirty. The contiguous block of
> + * mappings are all backed by the same folio, so if any are dirty then
> + * the whole folio is dirty. This allows us to determine the batch size
> + * without having to ever consider the dirty bit. See
> + * folio_nr_pages_cont_mapped().
> */
> - if (vm_flags & VM_SHARED)
> - pte = pte_mkclean(pte);
> - pte = pte_mkold(pte);
> + pte = pte_mkold(pte_mkclean(pte));
> + if (!(vm_flags & VM_SHARED) && any_dirty)
> + pte = pte_mkdirty(pte);
>
> if (!userfaultfd_wp(dst_vma))
> pte = pte_clear_uffd_wp(pte);
>
> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> + return nr;
> }
>
> static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> */
> WARN_ON_ONCE(ret != -ENOENT);
> }
> - /* copy_present_pte() will clear `*prealloc' if consumed */
> - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, &prealloc);
> + /* copy_present_ptes() will clear `*prealloc' if consumed */
> + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
> + addr, end, rss, &prealloc);
> +
> /*
> * If we need a pre-allocated page for this pte, drop the
> * locks, allocate, and try again.
> */
> if (unlikely(ret == -EAGAIN))
> break;
> +
> + /*
> + * Positive return value is the number of ptes copied.
> + */
> + VM_WARN_ON_ONCE(ret < 1);
> + progress += 8 * ret;
> + ret--;

Took me a second to figure out what was going on here. I think it would
be clearer to rename ret to nr_ptes ...

> + dst_pte += ret;
> + src_pte += ret;
> + addr += ret << PAGE_SHIFT;
> + ret = 0;
> +
> if (unlikely(prealloc)) {
> /*
> * pre-alloc page cannot be reused by next time so as
> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> folio_put(prealloc);
> prealloc = NULL;
> }
> - progress += 8;
> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

... and do dst_pte += nr_ptes, etc. here instead (noting of course that
the continue clauses will need nr_ptes == 1, but perhaps reset that at
the start of the loop).

> arch_leave_lazy_mmu_mode();
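
To make that concrete, a rough sketch only (nr_ptes is the suggested rename;
as noted, the earlier continue paths would need nr_ptes set to 1):

		/* copy_present_ptes() will clear `*prealloc' if consumed */
		nr_ptes = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
					    addr, end, rss, &prealloc);
		if (unlikely(nr_ptes == -EAGAIN))
			break;
		progress += 8 * nr_ptes;
		...
	} while (dst_pte += nr_ptes, src_pte += nr_ptes,
		 addr += nr_ptes << PAGE_SHIFT, addr != end);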

2023-11-23 05:17:46

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> ptep_get_and_clear_full() adds a 'full' parameter which is not present
> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
> a full address space teardown is in progress. We use this information to
> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
> contiguous neighbours to keep their contig bit set, because we know we
> are about to clear the rest too.
>
> Before this optimization, the cost of arm64_sys_exit_group() exploded to
> 32x what it was before PTE_CONT support was wired up, when compiling the
> kernel. With this optimization in place, we are back down to the
> original cost.
>
> This approach is not perfect though, as for the duration between
> returning from the first call to ptep_get_and_clear_full() and making
> the final call, the contpte block is in an intermediate state, where some
> ptes are cleared and others are still set with the PTE_CONT bit. If any
> other APIs are called for the ptes in the contpte block during that
> time, we have to be very careful. The core code currently interleaves
> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
> must be careful to ignore the cleared entries when accumulating the
> access and dirty bits - the same goes for ptep_get_lockless(). The only
> other calls we might reasonably expect are to set markers in the
> previously cleared ptes. (We shouldn't see valid entries being set until
> after the tlbi, at which point we are no longer in the intermediate
> state). Since markers are not valid, this is safe; set_ptes() will see
> the old, invalid entry and will not attempt to unfold. And the new pte
> is also invalid so it won't attempt to fold. We shouldn't see this for
> the 'full' case anyway.
>
> The last remaining issue is returning the access/dirty bits. That info
> could be present in any of the ptes in the contpte block. ptep_get()
> will gather those bits from across the contpte block. We don't bother
> doing that here, because we know that the information is used by the
> core-mm to mark the underlying folio as accessed/dirty. And since the
> same folio must be underpinning the whole block (that was a requirement
> for folding in the first place), that information will make it to the
> folio eventually once all the ptes have been cleared. This approach
> means we don't have to play games with accumulating and storing the
> bits. It does mean that any interleaved calls to ptep_get() may lack
> correct access/dirty information if we have already cleared the pte that
> happened to store it. The core code does not rely on this though.

Does not *currently* rely on this. I can't help but think it is
potentially something that could change in the future though which would
lead to some subtle bugs.

Would there be any way of avoiding this? Half-baked thought, but could
you, for example, copy the access/dirty information to the last (or
perhaps first, most likely invalid) PTE?

- Alistair

> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 18 +++++++++--
> arch/arm64/mm/contpte.c | 54 ++++++++++++++++++++++++++++++++
> 2 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 9bd2f57a9e11..ea58a9f4e700 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1145,6 +1145,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr);
> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1270,12 +1272,24 @@ static inline void pte_clear(struct mm_struct *mm,
> __pte_clear(mm, addr, ptep);
> }
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, int full)
> +{
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(orig_pte) || !full) {
> + contpte_try_unfold(mm, addr, ptep, orig_pte);
> + return __ptep_get_and_clear(mm, addr, ptep);
> + } else
> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +
> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> - return __ptep_get_and_clear(mm, addr, ptep);
> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
> }
>
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 426be9cd4dea..5d1aaed82d32 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -144,6 +144,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> + /*
> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
> + * where some of the ptes in the range may be cleared but others
> + * are still to do. See contpte_ptep_get_and_clear_full().
> + */
> + if (pte_val(pte) == 0)
> + continue;
> +
> if (pte_dirty(pte))
> orig_pte = pte_mkdirty(orig_pte);
>
> @@ -256,6 +264,52 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL(contpte_set_ptes);
>
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * When doing a full address space teardown, we can avoid unfolding the
> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> + * just get and clear the pte. The caller is promising to call us for
> + * every pte, so every pte in the range will be cleared by the time the
> + * tlbi is issued.
> + *
> + * This approach is not perfect though, as for the duration between
> + * returning from the first call to ptep_get_and_clear_full() and making
> + * the final call, the contpte block is in an intermediate state, where
> + * some ptes are cleared and others are still set with the PTE_CONT bit.
> + * If any other APIs are called for the ptes in the contpte block during
> + * that time, we have to be very careful. The core code currently
> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> + * ptep_get() must be careful to ignore the cleared entries when
> + * accumulating the access and dirty bits - the same goes for
> + * ptep_get_lockless(). The only other calls we might reasonably expect
> + * are to set markers in the previously cleared ptes. (We shouldn't see
> + * valid entries being set until after the tlbi, at which point we are
> + * no longer in the intermediate state). Since markers are not valid,
> + * this is safe; set_ptes() will see the old, invalid entry and will not
> + * attempt to unfold. And the new pte is also invalid so it won't
> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
> + *
> + * The last remaining issue is returning the access/dirty bits. That
> + * info could be present in any of the ptes in the contpte block.
> + * ptep_get() will gather those bits from across the contpte block. We
> + * don't bother doing that here, because we know that the information is
> + * used by the core-mm to mark the underlying folio as accessed/dirty.
> + * And since the same folio must be underpinning the whole block (that
> + * was a requirement for folding in the first place), that information
> + * will make it to the folio eventually once all the ptes have been
> + * cleared. This approach means we don't have to play games with
> + * accumulating and storing the bits. It does mean that any interleaved
> + * calls to ptep_get() may lack correct access/dirty information if we
> + * have already cleared the pte that happened to store it. The core code
> + * does not rely on this though.
> + */
> +
> + return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +
> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
> {

2023-11-23 10:27:02

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 14:15, David Hildenbrand wrote:
> On 16.11.23 15:13, David Hildenbrand wrote:
>> On 16.11.23 14:49, Ryan Roberts wrote:
>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>
>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>> range in the first place.
>>>>>>>
>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>>      include/linux/pgtable.h |  13 +++
>>>>>>>      mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>      2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>> *mm, unsigned long addres
>>>>>>>      }
>>>>>>>      #endif
>>>>>>>      +#ifndef ptep_set_wrprotects
>>>>>>> +struct mm_struct;
>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>> +                unsigned int nr)
>>>>>>> +{
>>>>>>> +    unsigned int i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>> +}
>>>>>>> +#endif
>>>>>>> +
>>>>>>>      /*
>>>>>>>       * On some architectures hardware does not set page access bit when
>>>>>>> accessing
>>>>>>>       * memory page, it is responsibility of software setting this bit.
>>>>>>> It brings
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>> struct vm_area_struct *src_vma
>>>>>>>              /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>              pte = pte_mkuffd_wp(pte);
>>>>>>>          set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>> -    return 0;
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>> +{
>>>>>>> +    unsigned long offset;
>>>>>>> +    unsigned long vaddr;
>>>>>>> +
>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>> +
>>>>>>> +    if (anchor > page) {
>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>> +            return 0;
>>>>>>> +    } else {
>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>> +            return ULONG_MAX;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return vaddr;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>> +{
>>>>>>> +    int floops;
>>>>>>> +    int i;
>>>>>>> +    unsigned long pfn;
>>>>>>> +    pgprot_t prot;
>>>>>>> +    struct page *folio_end;
>>>>>>> +
>>>>>>> +    if (!folio_test_large(folio))
>>>>>>> +        return 1;
>>>>>>> +
>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>> +
>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>> +
>>>>>>> +    pfn++;
>>>>>>> +    pte++;
>>>>>>> +
>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>> +        ptent = ptep_get(pte);
>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>> +
>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>> +            break;
>>>>>>> +
>>>>>>> +        if (pte_dirty(ptent))
>>>>>>> +            *any_dirty = true;
>>>>>>> +
>>>>>>> +        pfn++;
>>>>>>> +        pte++;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return i;
>>>>>>>      }
>>>>>>>        /*
>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>> page
>>>>>>> - * is required to copy this pte.
>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>> copy the
>>>>>>> + * first pte.
>>>>>>>       */
>>>>>>>      static inline int
>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>> -         struct folio **prealloc)
>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>      {
>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>          struct page *page;
>>>>>>>          struct folio *folio;
>>>>>>> +    int nr = 1;
>>>>>>> +    bool anon;
>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>> +    int i;
>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>> -    if (page)
>>>>>>> +    if (page) {
>>>>>>>              folio = page_folio(page);
>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>> -        /*
>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>> -         * future.
>>>>>>> -         */
>>>>>>> -        folio_get(folio);
>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>> -            folio_put(folio);
>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>> +                        end, pte, &any_dirty);
>>>>>>> +
>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>> +            if (anon) {
>>>>>>> +                /*
>>>>>>> +                 * If this page may have been pinned by the
>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>> +                 * future.
>>>>>>> +                 */
>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +                        page, false, src_vma))) {
>>>>>>> +                    if (i != 0)
>>>>>>> +                        break;
>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>> +                    return copy_present_page(
>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>> +                        page);
>>>>>>> +                }
>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +            } else {
>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>> +            }
>>>>>>>              }
>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>> -    } else if (page) {
>>>>>>> -        folio_get(folio);
>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +        nr = i;
>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>          }
>>>>>>>            /*
>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>> struct
>>>>>>> vm_area_struct *src_vma,
>>>>>>>           * in the parent and the child
>>>>>>>           */
>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>
>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>
>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>> adjusted are suspicious :)
>>>>>
>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>> first
>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>> are
>>>>> writable and others are not.
>>>>
>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>
>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>> can set the cont bit, correct?
>>>
>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>> pte_write() state when demarking the batches, we will end up with a batch of
>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>> child and those pages will all have the permissions of the first page of the
>>> batch.
>>
>> I see what you mean.
>>
>> After fork(), all anon pages will be R/O in the parent and the child.
>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>
>> After fork(), all shared pages can be R/O or R/W in the parent. For
>> simplicity, I think you can simply set them all R/O in the child. So if
>> any PTE is writable, wrprotect all in the child.
>
> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.

I've just come back to this to code it up, and want to clarify this last
comment; I'm already going to have to collect any_writable for the anon case, so
I will already have that info for the shared case too. I think you are
suggesting I *additionally* collect any_readonly, then in the shared case, I
only apply wrprotect if (any_writable && any_readonly). i.e. only apply
wrprotect if there is a mix of permissions for the batch, otherwise all the
permissions are the same (either all RW or all RO) and I can elide the wrprotect.
Is that what you meant?
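
In code, I think that rule would look something like this (just a sketch to
check my understanding; any_writable/any_readonly are the names from this
discussion, not anything in the posted patch):

	if (is_cow_mapping(vm_flags)) {
		/* anon/CoW: if any pte in the batch was writable,
		 * write-protect the whole batch in parent and child */
		if (any_writable) {
			ptep_set_wrprotects(src_mm, addr, src_pte, nr);
			pte = pte_wrprotect(pte);
		}
	} else if (any_writable && any_readonly) {
		/* shared mapping with a mix of RO and RW ptes: wrprotect
		 * so the child's batch ends up uniformly RO */
		pte = pte_wrprotect(pte);
	}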


>
> But devil is in the detail.
>

2023-11-23 12:13:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 23.11.23 11:26, Ryan Roberts wrote:
> On 16/11/2023 14:15, David Hildenbrand wrote:
>> On 16.11.23 15:13, David Hildenbrand wrote:
>>> On 16.11.23 14:49, Ryan Roberts wrote:
>>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>>
>>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>>> range in the first place.
>>>>>>>>
>>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>> ---
>>>>>>>>      include/linux/pgtable.h |  13 +++
>>>>>>>>      mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>>      2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>>> *mm, unsigned long addres
>>>>>>>>      }
>>>>>>>>      #endif
>>>>>>>>      +#ifndef ptep_set_wrprotects
>>>>>>>> +struct mm_struct;
>>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>>> +                unsigned int nr)
>>>>>>>> +{
>>>>>>>> +    unsigned int i;
>>>>>>>> +
>>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>      /*
>>>>>>>>       * On some architectures hardware does not set page access bit when
>>>>>>>> accessing
>>>>>>>>       * memory page, it is responsibility of software setting this bit.
>>>>>>>> It brings
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>>> struct vm_area_struct *src_vma
>>>>>>>>              /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>>              pte = pte_mkuffd_wp(pte);
>>>>>>>>          set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>>> -    return 0;
>>>>>>>> +    return 1;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>>> +{
>>>>>>>> +    unsigned long offset;
>>>>>>>> +    unsigned long vaddr;
>>>>>>>> +
>>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>>> +
>>>>>>>> +    if (anchor > page) {
>>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>>> +            return 0;
>>>>>>>> +    } else {
>>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>>> +            return ULONG_MAX;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return vaddr;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>>> +{
>>>>>>>> +    int floops;
>>>>>>>> +    int i;
>>>>>>>> +    unsigned long pfn;
>>>>>>>> +    pgprot_t prot;
>>>>>>>> +    struct page *folio_end;
>>>>>>>> +
>>>>>>>> +    if (!folio_test_large(folio))
>>>>>>>> +        return 1;
>>>>>>>> +
>>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>>> +
>>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>>> +
>>>>>>>> +    pfn++;
>>>>>>>> +    pte++;
>>>>>>>> +
>>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>>> +        ptent = ptep_get(pte);
>>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>>> +
>>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>>> +            break;
>>>>>>>> +
>>>>>>>> +        if (pte_dirty(ptent))
>>>>>>>> +            *any_dirty = true;
>>>>>>>> +
>>>>>>>> +        pfn++;
>>>>>>>> +        pte++;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return i;
>>>>>>>>      }
>>>>>>>>        /*
>>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>>> page
>>>>>>>> - * is required to copy this pte.
>>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>>> copy the
>>>>>>>> + * first pte.
>>>>>>>>       */
>>>>>>>>      static inline int
>>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>> *src_vma,
>>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>>> -         struct folio **prealloc)
>>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>> *src_vma,
>>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>>      {
>>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>>          struct page *page;
>>>>>>>>          struct folio *folio;
>>>>>>>> +    int nr = 1;
>>>>>>>> +    bool anon;
>>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>>> +    int i;
>>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>>> -    if (page)
>>>>>>>> +    if (page) {
>>>>>>>>              folio = page_folio(page);
>>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>>> -        /*
>>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>>> -         * future.
>>>>>>>> -         */
>>>>>>>> -        folio_get(folio);
>>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>>> -            folio_put(folio);
>>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>>> +                        end, pte, &any_dirty);
>>>>>>>> +
>>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>>> +            if (anon) {
>>>>>>>> +                /*
>>>>>>>> +                 * If this page may have been pinned by the
>>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>>> +                 * future.
>>>>>>>> +                 */
>>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>>> +                        page, false, src_vma))) {
>>>>>>>> +                    if (i != 0)
>>>>>>>> +                        break;
>>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>>> +                    return copy_present_page(
>>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>>> +                        page);
>>>>>>>> +                }
>>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>> +            } else {
>>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>>> +            }
>>>>>>>>              }
>>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>>> -    } else if (page) {
>>>>>>>> -        folio_get(folio);
>>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>>> +
>>>>>>>> +        nr = i;
>>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>>          }
>>>>>>>>            /*
>>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>>> struct
>>>>>>>> vm_area_struct *src_vma,
>>>>>>>>           * in the parent and the child
>>>>>>>>           */
>>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>>
>>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>>
>>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>>> adjusted are suspicious :)
>>>>>>
>>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>>> first
>>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>>> are
>>>>>> writable and others are not.
>>>>>
>>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>>
>>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>>> can set the cont bit, correct?
>>>>
>>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>>> pte_write() state when demarking the batches, we will end up with a batch of
>>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>>> child and those pages will all have the permissions of the first page of the
>>>> batch.
>>>
>>> I see what you mean.
>>>
>>> After fork(), all anon pages will be R/O in the parent and the child.
>>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>>
>>> After fork(), all shared pages can be R/O or R/W in the parent. For
>>> simplicity, I think you can simply set them all R/O in the child. So if
>>> any PTE is writable, wrprotect all in the child.
>>
>> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
>
> I've just come back to this to code it up, and want to clarify this last
> comment; I'm already going to have to collect any_writable for the anon case, so
> I will already have that info for the shared case too. I think you are
> suggesting I *additionally* collect any_readonly, then in the shared case, I
> only apply wrprotect if (any_writable && any_readonly). i.e. only apply
> wrprotect if there is a mix of permissions for the batch, otherwise all the
> permissions are the same (either all RW or all RO) and I can elide the wrprotect.
> Is that what you meant?

Yes. I suspect you might somehow be able to derive "any_readonly = nr -
!any_writable".

Within a VMA, we really should only see:
* writable VMA: some might be R/O, some might be R/W
* VMA applicable to NUMA hinting: some might be PROT_NONE, others R/O or
R/W

One could simply skip batching for now on pte_protnone() and focus on
the "writable" vs. "not-writable".

--
Cheers,

David / dhildenb

2023-11-23 12:28:57

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 23/11/2023 12:12, David Hildenbrand wrote:
> On 23.11.23 11:26, Ryan Roberts wrote:
>> On 16/11/2023 14:15, David Hildenbrand wrote:
>>> On 16.11.23 15:13, David Hildenbrand wrote:
>>>> On 16.11.23 14:49, Ryan Roberts wrote:
>>>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>>>
>>>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>>>> range in the first place.
>>>>>>>>>
>>>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>> ---
>>>>>>>>>       include/linux/pgtable.h |  13 +++
>>>>>>>>>       mm/memory.c             | 175
>>>>>>>>> +++++++++++++++++++++++++++++++---------
>>>>>>>>>       2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct
>>>>>>>>> mm_struct
>>>>>>>>> *mm, unsigned long addres
>>>>>>>>>       }
>>>>>>>>>       #endif
>>>>>>>>>       +#ifndef ptep_set_wrprotects
>>>>>>>>> +struct mm_struct;
>>>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>>>> +                unsigned int nr)
>>>>>>>>> +{
>>>>>>>>> +    unsigned int i;
>>>>>>>>> +
>>>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>>       /*
>>>>>>>>>        * On some architectures hardware does not set page access bit when
>>>>>>>>> accessing
>>>>>>>>>        * memory page, it is responsibility of software setting this bit.
>>>>>>>>> It brings
>>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>>>> --- a/mm/memory.c
>>>>>>>>> +++ b/mm/memory.c
>>>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>>>> struct vm_area_struct *src_vma
>>>>>>>>>               /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>>>               pte = pte_mkuffd_wp(pte);
>>>>>>>>>           set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>>>> -    return 0;
>>>>>>>>> +    return 1;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>>>> +{
>>>>>>>>> +    unsigned long offset;
>>>>>>>>> +    unsigned long vaddr;
>>>>>>>>> +
>>>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>>>> +
>>>>>>>>> +    if (anchor > page) {
>>>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>>>> +            return 0;
>>>>>>>>> +    } else {
>>>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>>>> +            return ULONG_MAX;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return vaddr;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>>>> +{
>>>>>>>>> +    int floops;
>>>>>>>>> +    int i;
>>>>>>>>> +    unsigned long pfn;
>>>>>>>>> +    pgprot_t prot;
>>>>>>>>> +    struct page *folio_end;
>>>>>>>>> +
>>>>>>>>> +    if (!folio_test_large(folio))
>>>>>>>>> +        return 1;
>>>>>>>>> +
>>>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>>>> +
>>>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>>>> +
>>>>>>>>> +    pfn++;
>>>>>>>>> +    pte++;
>>>>>>>>> +
>>>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>>>> +        ptent = ptep_get(pte);
>>>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>>>> +
>>>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>>>> +            break;
>>>>>>>>> +
>>>>>>>>> +        if (pte_dirty(ptent))
>>>>>>>>> +            *any_dirty = true;
>>>>>>>>> +
>>>>>>>>> +        pfn++;
>>>>>>>>> +        pte++;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return i;
>>>>>>>>>       }
>>>>>>>>>         /*
>>>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>>>> page
>>>>>>>>> - * is required to copy this pte.
>>>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if
>>>>>>>>> succeeded
>>>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>>>> copy the
>>>>>>>>> + * first pte.
>>>>>>>>>        */
>>>>>>>>>       static inline int
>>>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>>> *src_vma,
>>>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>>>> -         struct folio **prealloc)
>>>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>>> *src_vma,
>>>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>>>       {
>>>>>>>>>           struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>>>           unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>>>           pte_t pte = ptep_get(src_pte);
>>>>>>>>>           struct page *page;
>>>>>>>>>           struct folio *folio;
>>>>>>>>> +    int nr = 1;
>>>>>>>>> +    bool anon;
>>>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>>>> +    int i;
>>>>>>>>>             page = vm_normal_page(src_vma, addr, pte);
>>>>>>>>> -    if (page)
>>>>>>>>> +    if (page) {
>>>>>>>>>               folio = page_folio(page);
>>>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>>>> -        /*
>>>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>>>> -         * future.
>>>>>>>>> -         */
>>>>>>>>> -        folio_get(folio);
>>>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>>>> -            folio_put(folio);
>>>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>>>> +                        end, pte, &any_dirty);
>>>>>>>>> +
>>>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>>>> +            if (anon) {
>>>>>>>>> +                /*
>>>>>>>>> +                 * If this page may have been pinned by the
>>>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>>>> +                 * future.
>>>>>>>>> +                 */
>>>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>> +                        page, false, src_vma))) {
>>>>>>>>> +                    if (i != 0)
>>>>>>>>> +                        break;
>>>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>>>> +                    return copy_present_page(
>>>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>>>> +                        page);
>>>>>>>>> +                }
>>>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>> +            } else {
>>>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>>>> +            }
>>>>>>>>>               }
>>>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>>>> -    } else if (page) {
>>>>>>>>> -        folio_get(folio);
>>>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>>>> +
>>>>>>>>> +        nr = i;
>>>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>>>           }
>>>>>>>>>             /*
>>>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>>>> struct
>>>>>>>>> vm_area_struct *src_vma,
>>>>>>>>>            * in the parent and the child
>>>>>>>>>            */
>>>>>>>>>           if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>>>               pte = pte_wrprotect(pte);
>>>>>>>>
>>>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>>>
>>>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>>>> adjusted are suspicious :)
>>>>>>>
>>>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>>>> first pte is writable, then they all are - something has gone badly wrong
>>>>>>> if some are writable and others are not.
>>>>>>
>>>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>>>
>>>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O
>>>>>> and you
>>>>>> can set the cont bit, correct?
>>>>>
>>>>> Oh I see what you mean - that only works for cow mappings though. If you
>>>>> have a
>>>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>>>> pte_write() state when demarcating the batches, we will end up with a batch of
>>>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>>>> child and those pages will all have the permissions of the first page of the
>>>>> batch.
>>>>
>>>> I see what you mean.
>>>>
>>>> After fork(), all anon pages will be R/O in the parent and the child.
>>>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>>>
>>>> After fork(), all shared pages can be R/O or R/W in the parent. For
>>>> simplicity, I think you can simply set them all R/O in the child. So if
>>>> any PTE is writable, wrprotect all in the child.
>>>
>>> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
>>
>> I've just come back to this to code it up, and want to clarify this last
>> comment; I'm already going to have to collect any_writable for the anon case, so
>> I will already have that info for the shared case too. I think you are
>> suggesting I *additionally* collect any_readonly, then in the shared case, I
>> only apply wrprotect if (any_writable && any_readonly). i.e. only apply
>> wrprotect if there is a mix of permissions for the batch, otherwise all the
>> permissions are the same (either all RW or all RO) and I can elide the wrprotect.
>> Is that what you meant?
>
> Yes. I suspect you might somehow be able to derive "any_readonly = nr -
> !any_writable".

Yep, nice.

>
> Within a VMA, we really should only see:
> * writable VMA: some might be R/O, some might be R/W
> * VMA applicable to NUMA hinting: some might be PROT_NONE, others R/O or
>   R/W
>
> One could simply skip batching for now on pte_protnone() and focus on the
> "writable" vs. "not-writable".

I'm not sure we can simply "skip" batching on pte_protnone() since we will need
to terminate the batch if we spot it. But if we have to look for it anyway, we
might as well just terminate the batch when the value of pte_protnone()
*changes*. I'm also proposing to take this approach for pte_uffd_wp() which also
needs to be carefully preserved per-pte.
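
Roughly, I'm thinking of a hypothetical helper along these lines, called from
the scan loop in folio_nr_pages_cont_mapped() alongside the existing pfn and
pgprot checks (sketch only; the name is made up):

    static inline bool pte_batch_same_class(pte_t expect, pte_t ptent)
    {
        /*
         * End the batch when either property changes value, so protnone
         * and uffd-wp state stay uniform within a batch and are therefore
         * preserved per-pte when set_ptes() replays the template pte.
         */
        return pte_protnone(expect) == pte_protnone(ptent) &&
               pte_uffd_wp(expect) == pte_uffd_wp(ptent);
    }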


>

2023-11-23 14:43:54

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 23/11/2023 04:26, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/pgtable.h | 13 +++
>> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
>> 2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>> }
>> #endif
>>
>> +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> + unsigned long address, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> + ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>> /* Uffd-wp needs to be delivered to dest pte as well */
>> pte = pte_mkuffd_wp(pte);
>> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> - return 0;
>> + return 1;
>
> We should update the function comment to indicate why we return 1 here
> because it will become non-obvious in future. But perhaps it's better to
> leave this as is and do the error check/return code calculation in
> copy_present_ptes().

OK, I'll return 0 for success and fix it up to 1 in copy_present_ptes().
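
i.e. leave copy_present_page() returning 0 on success and translate that at
the call site, something like this (sketch only; the exact call site may end
up looking different):

    int ret = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
                                addr, rss, prealloc, page);

    return ret == 0 ? 1 : ret;  /* 1 pte copied, or -EAGAIN */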

>
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> + struct page *anchor, unsigned long anchor_vaddr)
>
> It's likely I'm easily confused but the arguments here don't make much
> sense to me. Something like this (noting that I've switched the argument
> order) makes more sense to me at least:
>
> static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> unsigned long page_vaddr, struct page *next_folio_page)

I was originally using page_cont_mapped_vaddr() in more places than here and
needed a more generic helper than just "what is the virtual address of the end
of the folio, given a random page within the folio and its virtual address"; (I
needed "what is the virtual address of a page given a different page and its
virtual address and assuming the distance between the 2 pages is the same in
physical and virtual space"). But given I don't need that generality anymore,
yes, I agree I can simplify this significantly.

I think I can remove the function entirely and replace with this in
folio_nr_pages_cont_mapped():

    /*
     * Loop either to `end` or to end of folio if it's contiguously mapped,
     * whichever is smaller.
     */
    floops = (end - addr) >> PAGE_SHIFT;
    floops = min_t(int, floops,
                   folio_pfn(folio_next(folio)) - page_to_pfn(page));

where `end` and `addr` are the parameters as passed into the function. What do
you think?

>
>> +{
>> + unsigned long offset;
>> + unsigned long vaddr;
>> +
>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>
> Which IMHO makes this much more readable:
>
> offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;
>
>> + vaddr = anchor_vaddr + offset;
>> +
>> + if (anchor > page) {
>
> And also highlights that I think this condition (page > folio_page_end)
> is impossible to hit. Which is good ...
>
>> + if (vaddr > anchor_vaddr)
>> + return 0;
>
> ... because I'm not sure returning 0 is valid as we would end up setting
> floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
> particularly well :-)

This was covering the more general case that I no longer need.

>
>> + } else {
>> + if (vaddr < anchor_vaddr)
>
> Same here - isn't the vaddr of the next folio always going to be larger
> than the vaddr for the current page? It seems this function is really
> just calculating the virtual address of the next folio, or am I deeply
> confused?

This aims to protect against the corner case, where a page from a folio is
mremap()ed very high in address space such that the extra pages from the anchor
page to the end of the folio would actually wrap back to zero. But with the
approach proposed above, this problem goes away, I think.

>
>> + return ULONG_MAX;
>> + }
>> +
>> + return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> + struct page *page, pte_t *pte,
>> + unsigned long addr, unsigned long end,
>> + pte_t ptent, bool *any_dirty)
>> +{
>> + int floops;
>> + int i;
>> + unsigned long pfn;
>> + pgprot_t prot;
>> + struct page *folio_end;
>> +
>> + if (!folio_test_large(folio))
>> + return 1;
>> +
>> + folio_end = &folio->page + folio_nr_pages(folio);
>
> I think you can replace this with:
>
> folio_end = folio_next(folio)

yep, done - thanks.

>
> Although given this is only passed to page_cont_mapped_vaddr() perhaps
> it's better to just pass the folio in and do the calculation there.
>
>> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> + floops = (end - addr) >> PAGE_SHIFT;
>> + pfn = page_to_pfn(page);
>> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> + *any_dirty = pte_dirty(ptent);
>> +
>> + pfn++;
>> + pte++;
>> +
>> + for (i = 1; i < floops; i++) {
>> + ptent = ptep_get(pte);
>> + ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> + break;
>> +
>> + if (pte_dirty(ptent))
>> + *any_dirty = true;
>> +
>> + pfn++;
>> + pte++;
>> + }
>> +
>> + return i;
>> }
>>
>> /*
>> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>> */
>> static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> - struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> + pte_t *dst_pte, pte_t *src_pte,
>> + unsigned long addr, unsigned long end,
>> + int *rss, struct folio **prealloc)
>> {
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> unsigned long vm_flags = src_vma->vm_flags;
>> pte_t pte = ptep_get(src_pte);
>> struct page *page;
>> struct folio *folio;
>> + int nr = 1;
>> + bool anon;
>> + bool any_dirty = pte_dirty(pte);
>> + int i;
>>
>> page = vm_normal_page(src_vma, addr, pte);
>> - if (page)
>> + if (page) {
>> folio = page_folio(page);
>> - if (page && folio_test_anon(folio)) {
>> - /*
>> - * If this page may have been pinned by the parent process,
>> - * copy the page immediately for the child so that we'll always
>> - * guarantee the pinned page won't be randomly replaced in the
>> - * future.
>> - */
>> - folio_get(folio);
>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> - /* Page may be pinned, we have to copy. */
>> - folio_put(folio);
>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, prealloc, page);
>> + anon = folio_test_anon(folio);
>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> + end, pte, &any_dirty);
>> +
>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>> }
>> - rss[MM_ANONPAGES]++;
>> - } else if (page) {
>> - folio_get(folio);
>> - page_dup_file_rmap(page, false);
>> - rss[mm_counter_file(page)]++;
>> +
>> + nr = i;
>> + folio_ref_add(folio, nr);
>> }
>>
>> /*
>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> * in the parent and the child
>> */
>> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>> - ptep_set_wrprotect(src_mm, addr, src_pte);
>> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>> pte = pte_wrprotect(pte);
>> }
>> - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>>
>> /*
>> - * If it's a shared mapping, mark it clean in
>> - * the child
>> + * If it's a shared mapping, mark it clean in the child. If its a
>> + * private mapping, mark it dirty in the child if _any_ of the parent
>> + * mappings in the block were marked dirty. The contiguous block of
>> + * mappings are all backed by the same folio, so if any are dirty then
>> + * the whole folio is dirty. This allows us to determine the batch size
>> + * without having to ever consider the dirty bit. See
>> + * folio_nr_pages_cont_mapped().
>> */
>> - if (vm_flags & VM_SHARED)
>> - pte = pte_mkclean(pte);
>> - pte = pte_mkold(pte);
>> + pte = pte_mkold(pte_mkclean(pte));
>> + if (!(vm_flags & VM_SHARED) && any_dirty)
>> + pte = pte_mkdirty(pte);
>>
>> if (!userfaultfd_wp(dst_vma))
>> pte = pte_clear_uffd_wp(pte);
>>
>> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> - return 0;
>> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>> + return nr;
>> }
>>
>> static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
>> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> */
>> WARN_ON_ONCE(ret != -ENOENT);
>> }
>> - /* copy_present_pte() will clear `*prealloc' if consumed */
>> - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, &prealloc);
>> + /* copy_present_ptes() will clear `*prealloc' if consumed */
>> + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
>> + addr, end, rss, &prealloc);
>> +
>> /*
>> * If we need a pre-allocated page for this pte, drop the
>> * locks, allocate, and try again.
>> */
>> if (unlikely(ret == -EAGAIN))
>> break;
>> +
>> + /*
>> + * Positive return value is the number of ptes copied.
>> + */
>> + VM_WARN_ON_ONCE(ret < 1);
>> + progress += 8 * ret;
>> + ret--;
>
> Took me a second to figure out what was going on here. I think it would
> be clearer to rename ret to nr_ptes ...
>
>> + dst_pte += ret;
>> + src_pte += ret;
>> + addr += ret << PAGE_SHIFT;
>> + ret = 0;
>> +
>> if (unlikely(prealloc)) {
>> /*
>> * pre-alloc page cannot be reused by next time so as
>> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> folio_put(prealloc);
>> prealloc = NULL;
>> }
>> - progress += 8;
>> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>
> ... and do dst_pte += nr_ptes, etc. here instead (noting of course that
> the continue clauses will need nr_ptes == 1, but perhaps reset that at
> the start of the loop).

Yes, much cleaner! Implementing for v3...
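
For the record, I'm picturing roughly this shape (sketch only; the pte_none()
and non-present handling, plus the prealloc retry logic, are elided):

    int nr_ptes;

    do {
        nr_ptes = 1;

        /* ... non-present pte handling, prealloc retry, etc ... */

        nr_ptes = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
                                    addr, end, rss, &prealloc);
        if (unlikely(nr_ptes == -EAGAIN))
            break;

        progress += 8 * nr_ptes;
    } while (dst_pte += nr_ptes, src_pte += nr_ptes,
             addr += nr_ptes * PAGE_SIZE, addr != end);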

Thanks for the review!

Thanks,
Ryan

>
>> arch_leave_lazy_mmu_mode();
>

2023-11-23 16:01:45

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 23/11/2023 05:13, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>> a full address space teardown is in progress. We use this information to
>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>> contiguous neighbours to keep their contig bit set, because we know we
>> are about to clear the rest too.
>>
>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>> 32x what it was before PTE_CONT support was wired up, when compiling the
>> kernel. With this optimization in place, we are back down to the
>> original cost.
>>
>> This approach is not perfect though, as for the duration between
>> returning from the first call to ptep_get_and_clear_full() and making
>> the final call, the contpte block is in an intermediate state, where some
>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>> other APIs are called for the ptes in the contpte block during that
>> time, we have to be very careful. The core code currently interleaves
>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>> must be careful to ignore the cleared entries when accumulating the
>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>> other calls we might reasonably expect are to set markers in the
>> previously cleared ptes. (We shouldn't see valid entries being set until
>> after the tlbi, at which point we are no longer in the intermediate
>> state). Since markers are not valid, this is safe; set_ptes() will see
>> the old, invalid entry and will not attempt to unfold. And the new pte
>> is also invalid so it won't attempt to fold. We shouldn't see this for
>> the 'full' case anyway.
>>
>> The last remaining issue is returning the access/dirty bits. That info
>> could be present in any of the ptes in the contpte block. ptep_get()
>> will gather those bits from across the contpte block. We don't bother
>> doing that here, because we know that the information is used by the
>> core-mm to mark the underlying folio as accessed/dirty. And since the
>> same folio must be underpinning the whole block (that was a requirement
>> for folding in the first place), that information will make it to the
>> folio eventually once all the ptes have been cleared. This approach
>> means we don't have to play games with accumulating and storing the
>> bits. It does mean that any interleaved calls to ptep_get() may lack
>> correct access/dirty information if we have already cleared the pte that
>> happened to store it. The core code does not rely on this though.
>
> Does not *currently* rely on this. I can't help but think it is
> potentially something that could change in the future though which would
> lead to some subtle bugs.

Yes, there is a risk, although IMHO, it's very small.

>
> Would there be any way of avoiding this? Half-baked thought but could
> you for example copy the access/dirty information to the last (or
> perhaps first, most likely invalid) PTE?

I spent a long time thinking about this and came up with a number of
possibilities, none of them ideal. In the end, I went for the simplest one
(which works but suffers from the problem that it depends on the way it is
called not changing).

1) copy the access/dirty flags into all the remaining uncleared ptes within the
contpte block. This is how I did it in v1; although it was racy. I think this
could be implemented correctly but it's extremely complex.

2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
can clear 1 or more full contpte blocks in a single call - the ptes are never in
an intermediate state. This is difficult because ptep_get_and_clear_full()
returns the pte that was cleared so it's difficult to scale that up to multiple ptes.

3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
the other functions that ptep_get_and_clear_full() is not on-going when they are
called. So we would get a clear sign that usage patterns have changed. But there
is no easy place to store that state (other than scanning a contpte block
looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.

4) The simple approach I ended up taking; I thought it would be best to keep it
simple and see if anyone was concerned before doing something more drastic.

What do you think? If we really need to solve this, then option 1 is my
preferred route, but it would take some time to figure out and reason about a
race-free scheme.

Thanks,
Ryan


2023-11-23 23:55:30

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()


Ryan Roberts <[email protected]> writes:

> On 23/11/2023 04:26, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>> maps a physically contiguous block of memory, all belonging to the same
>>> folio, with the same permissions, and for shared mappings, the same
>>> dirty state. This will likely improve performance by a tiny amount due
>>> to batching the folio reference count management and calling set_ptes()
>>> rather than making individual calls to set_pte_at().
>>>
>>> However, the primary motivation for this change is to reduce the number
>>> of tlb maintenance operations that the arm64 backend has to perform
>>> during fork, as it is about to add transparent support for the
>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>> expensive, when all ptes in the range are being write-protected.
>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>> in the child, the backend does not need to fold a contiguous range once
>>> they are all populated - they can be initially populated as a contiguous
>>> range in the first place.
>>>
>>> This change addresses the core-mm refactoring only, and introduces
>>> ptep_set_wrprotects() with a default implementation that calls
>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>> performance improvement as part of the work to enable contpte mappings.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/pgtable.h | 13 +++
>>> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
>>> 2 files changed, 150 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>>> }
>>> #endif
>>>
>>> +#ifndef ptep_set_wrprotects
>>> +struct mm_struct;
>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>> + unsigned long address, pte_t *ptep,
>>> + unsigned int nr)
>>> +{
>>> + unsigned int i;
>>> +
>>> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>> + ptep_set_wrprotect(mm, address, ptep);
>>> +}
>>> +#endif
>>> +
>>> /*
>>> * On some architectures hardware does not set page access bit when accessing
>>> * memory page, it is responsibility of software setting this bit. It brings
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1f18ed4a5497..b7c8228883cf 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>> /* Uffd-wp needs to be delivered to dest pte as well */
>>> pte = pte_mkuffd_wp(pte);
>>> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> - return 0;
>>> + return 1;
>>
>> We should update the function comment to indicate why we return 1 here
>> because it will become non-obvious in future. But perhaps it's better to
>> leave this as is and do the error check/return code calculation in
>> copy_present_ptes().
>
> OK, I'll return 0 for success and fix it up to 1 in copy_present_ptes().
>
>>
>>> +}
>>> +
>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>> + struct page *anchor, unsigned long anchor_vaddr)
>>
>> It's likely I'm easily confused but the arguments here don't make much
>> sense to me. Something like this (noting that I've switched the argument
>> order) makes more sense to me at least:
>>
>> static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> unsigned long page_vaddr, struct page *next_folio_page)
>
> I was originally using page_cont_mapped_vaddr() in more places than here and
> needed a more generic helper than just "what is the virtual address of the end
> of the folio, given a random page within the folio and its virtual address"; (I
> needed "what is the virtual address of a page given a different page and its
> virtual address and assuming the distance between the 2 pages is the same in
> physical and virtual space"). But given I don't need that generality anymore,
> yes, I agree I can simplify this significantly.

Thanks for the explanation, that explains my head-scratching.

> I think I can remove the function entirely and replace with this in
> folio_nr_pages_cont_mapped():
>
> /*
> * Loop either to `end` or to end of folio if it's contiguously mapped,
> * whichever is smaller.
> */
> floops = (end - addr) >> PAGE_SHIFT;
> floops = min_t(int, floops,
> folio_pfn(folio_next(folio)) - page_to_pfn(page));
>
> where `end` and `addr` are the parameters as passed into the function. What do
> you think?

I'll admit that by the end of the review I was wondering why we even needed
the extra function, so this looks good to me (the comment helps too!)

>>
>>> +{
>>> + unsigned long offset;
>>> + unsigned long vaddr;
>>> +
>>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>
>> Which IMHO makes this much more readable:
>>
>> offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;
>>
>>> + vaddr = anchor_vaddr + offset;
>>> +
>>> + if (anchor > page) {
>>
>> And also highlights that I think this condition (page > folio_page_end)
>> is impossible to hit. Which is good ...
>>
>>> + if (vaddr > anchor_vaddr)
>>> + return 0;
>>
>> ... because I'm not sure returning 0 is valid as we would end up setting
>> floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
>> particularly well :-)
>
> This was covering the more general case that I no longer need.
>
>>
>>> + } else {
>>> + if (vaddr < anchor_vaddr)
>>
>> Same here - isn't the vaddr of the next folio always going to be larger
>> than the vaddr for the current page? It seems this function is really
>> just calculating the virtual address of the next folio, or am I deeply
>> confused?
>
> This aims to protect against the corner case, where a page from a folio is
> mremap()ed very high in address space such that the extra pages from the anchor
> page to the end of the folio would actually wrap back to zero. But with the
> approach proposed above, this problem goes away, I think.
>
>>
>>> + return ULONG_MAX;
>>> + }
>>> +
>>> + return vaddr;
>>> +}
>>> +
>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>> + struct page *page, pte_t *pte,
>>> + unsigned long addr, unsigned long end,
>>> + pte_t ptent, bool *any_dirty)
>>> +{
>>> + int floops;
>>> + int i;
>>> + unsigned long pfn;
>>> + pgprot_t prot;
>>> + struct page *folio_end;
>>> +
>>> + if (!folio_test_large(folio))
>>> + return 1;
>>> +
>>> + folio_end = &folio->page + folio_nr_pages(folio);
>>
>> I think you can replace this with:
>>
>> folio_end = folio_next(folio)
>
> yep, done - thanks.
>
>>
>> Although given this is only passed to page_cont_mapped_vaddr() perhaps
>> it's better to just pass the folio in and do the calculation there.
>>
>>> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>> + floops = (end - addr) >> PAGE_SHIFT;
>>> + pfn = page_to_pfn(page);
>>> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>> +
>>> + *any_dirty = pte_dirty(ptent);
>>> +
>>> + pfn++;
>>> + pte++;
>>> +
>>> + for (i = 1; i < floops; i++) {
>>> + ptent = ptep_get(pte);
>>> + ptent = pte_mkold(pte_mkclean(ptent));
>>> +
>>> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>> + break;
>>> +
>>> + if (pte_dirty(ptent))
>>> + *any_dirty = true;
>>> +
>>> + pfn++;
>>> + pte++;
>>> + }
>>> +
>>> + return i;
>>> }
>>>
>>> /*
>>> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>> - * is required to copy this pte.
>>> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>> + * first pte.
>>> */
>>> static inline int
>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> - struct folio **prealloc)
>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> + pte_t *dst_pte, pte_t *src_pte,
>>> + unsigned long addr, unsigned long end,
>>> + int *rss, struct folio **prealloc)
>>> {
>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>> unsigned long vm_flags = src_vma->vm_flags;
>>> pte_t pte = ptep_get(src_pte);
>>> struct page *page;
>>> struct folio *folio;
>>> + int nr = 1;
>>> + bool anon;
>>> + bool any_dirty = pte_dirty(pte);
>>> + int i;
>>>
>>> page = vm_normal_page(src_vma, addr, pte);
>>> - if (page)
>>> + if (page) {
>>> folio = page_folio(page);
>>> - if (page && folio_test_anon(folio)) {
>>> - /*
>>> - * If this page may have been pinned by the parent process,
>>> - * copy the page immediately for the child so that we'll always
>>> - * guarantee the pinned page won't be randomly replaced in the
>>> - * future.
>>> - */
>>> - folio_get(folio);
>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>> - /* Page may be pinned, we have to copy. */
>>> - folio_put(folio);
>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>> - addr, rss, prealloc, page);
>>> + anon = folio_test_anon(folio);
>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>> + end, pte, &any_dirty);
>>> +
>>> + for (i = 0; i < nr; i++, page++) {
>>> + if (anon) {
>>> + /*
>>> + * If this page may have been pinned by the
>>> + * parent process, copy the page immediately for
>>> + * the child so that we'll always guarantee the
>>> + * pinned page won't be randomly replaced in the
>>> + * future.
>>> + */
>>> + if (unlikely(page_try_dup_anon_rmap(
>>> + page, false, src_vma))) {
>>> + if (i != 0)
>>> + break;
>>> + /* Page may be pinned, we have to copy. */
>>> + return copy_present_page(
>>> + dst_vma, src_vma, dst_pte,
>>> + src_pte, addr, rss, prealloc,
>>> + page);
>>> + }
>>> + rss[MM_ANONPAGES]++;
>>> + VM_BUG_ON(PageAnonExclusive(page));
>>> + } else {
>>> + page_dup_file_rmap(page, false);
>>> + rss[mm_counter_file(page)]++;
>>> + }
>>> }
>>> - rss[MM_ANONPAGES]++;
>>> - } else if (page) {
>>> - folio_get(folio);
>>> - page_dup_file_rmap(page, false);
>>> - rss[mm_counter_file(page)]++;
>>> +
>>> + nr = i;
>>> + folio_ref_add(folio, nr);
>>> }
>>>
>>> /*
>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> * in the parent and the child
>>> */
>>> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>> - ptep_set_wrprotect(src_mm, addr, src_pte);
>>> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>> pte = pte_wrprotect(pte);
>>> }
>>> - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>>>
>>> /*
>>> - * If it's a shared mapping, mark it clean in
>>> - * the child
>>> + * If it's a shared mapping, mark it clean in the child. If its a
>>> + * private mapping, mark it dirty in the child if _any_ of the parent
>>> + * mappings in the block were marked dirty. The contiguous block of
>>> + * mappings are all backed by the same folio, so if any are dirty then
>>> + * the whole folio is dirty. This allows us to determine the batch size
>>> + * without having to ever consider the dirty bit. See
>>> + * folio_nr_pages_cont_mapped().
>>> */
>>> - if (vm_flags & VM_SHARED)
>>> - pte = pte_mkclean(pte);
>>> - pte = pte_mkold(pte);
>>> + pte = pte_mkold(pte_mkclean(pte));
>>> + if (!(vm_flags & VM_SHARED) && any_dirty)
>>> + pte = pte_mkdirty(pte);
>>>
>>> if (!userfaultfd_wp(dst_vma))
>>> pte = pte_clear_uffd_wp(pte);
>>>
>>> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> - return 0;
>>> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>> + return nr;
>>> }
>>>
>>> static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
>>> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> */
>>> WARN_ON_ONCE(ret != -ENOENT);
>>> }
>>> - /* copy_present_pte() will clear `*prealloc' if consumed */
>>> - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
>>> - addr, rss, &prealloc);
>>> + /* copy_present_ptes() will clear `*prealloc' if consumed */
>>> + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
>>> + addr, end, rss, &prealloc);
>>> +
>>> /*
>>> * If we need a pre-allocated page for this pte, drop the
>>> * locks, allocate, and try again.
>>> */
>>> if (unlikely(ret == -EAGAIN))
>>> break;
>>> +
>>> + /*
>>> + * Positive return value is the number of ptes copied.
>>> + */
>>> + VM_WARN_ON_ONCE(ret < 1);
>>> + progress += 8 * ret;
>>> + ret--;
>>
>> Took me a second to figure out what was going on here. I think it would
>> be clearer to rename ret to nr_ptes ...
>>
>>> + dst_pte += ret;
>>> + src_pte += ret;
>>> + addr += ret << PAGE_SHIFT;
>>> + ret = 0;
>>> +
>>> if (unlikely(prealloc)) {
>>> /*
>>> * pre-alloc page cannot be reused by next time so as
>>> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> folio_put(prealloc);
>>> prealloc = NULL;
>>> }
>>> - progress += 8;
>>> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>>
>> ... and do dst_pte += nr_ptes, etc. here instead (noting of course that
>> the continue clauses will need nr_ptes == 1, but perhaps reset that at
>> the start of the loop).
>
> Yes, much cleaner! Implementing for v3...
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
>>
>>> arch_leave_lazy_mmu_mode();
>>

2023-11-24 01:37:29

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 23/11/2023 05:13, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>> a full address space teardown is in progress. We use this information to
>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>> contiguous neighbours to keep their contig bit set, because we know we
>>> are about to clear the rest too.
>>>
>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>> kernel. With this optimization in place, we are back down to the
>>> original cost.
>>>
>>> This approach is not perfect though, as for the duration between
>>> returning from the first call to ptep_get_and_clear_full() and making
>>> the final call, the contpte block is in an intermediate state, where some
>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>> other APIs are called for the ptes in the contpte block during that
>>> time, we have to be very careful. The core code currently interleaves
>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>> must be careful to ignore the cleared entries when accumulating the
>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>> other calls we might reasonably expect are to set markers in the
>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>> after the tlbi, at which point we are no longer in the intermediate
>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>> the 'full' case anyway.
>>>
>>> The last remaining issue is returning the access/dirty bits. That info
>>> could be present in any of the ptes in the contpte block. ptep_get()
>>> will gather those bits from across the contpte block. We don't bother
>>> doing that here, because we know that the information is used by the
>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>> same folio must be underpinning the whole block (that was a requirement
>>> for folding in the first place), that information will make it to the
>>> folio eventually once all the ptes have been cleared. This approach
>>> means we don't have to play games with accumulating and storing the
>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>> correct access/dirty information if we have already cleared the pte that
>>> happened to store it. The core code does not rely on this though.
>>
>> Does not *currently* rely on this. I can't help but think it is
>> potentially something that could change in the future though which would
>> lead to some subtle bugs.
>
> Yes, there is a risk, although IMHO, it's very small.
>
>>
>> Would there be any way of avoiding this? Half-baked thought but could
>> you for example copy the access/dirty information to the last (or
>> perhaps first, most likely invalid) PTE?
>
> I spent a long time thinking about this and came up with a number of
> possibilities, none of them ideal. In the end, I went for the simplest one
> (which works but suffers from the problem that it depends on the way it is
> called not changing).

Ok, that answers my underlying question of "has someone thought about
this and are there any easy solutions". I suspected that was the case
given the excellent write up though!

> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
> contpte block. This is how I did it in v1; although it was racy. I think this
> could be implemented correctly but it's extremely complex.
>
> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
> can clear 1 or more full contpte blocks in a single call - the ptes are never in
> an intermediate state. This is difficult because ptep_get_and_clear_full()
> returns the pte that was cleared so it's difficult to scale that up to multiple ptes.
>
> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
> the other functions that ptep_get_and_clear_full() is not on-going when they are
> called. So we would get a clear sign that usage patterns have changed. But there
> is no easy place to store that state (other than scanning a contpte block
> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>
> 4) The simple approach I ended up taking; I thought it would be best to keep it
> simple and see if anyone was concerned before doing something more drastic.
>
> What do you think? If we really need to solve this, then option 1 is my
> preferred route, but it would take some time to figure out and reason about a
> race-free scheme.

Well I like simple, and I agree the risk is small. But I can't help feeling
the current situation is too subtle, mainly because it is
architecture-specific and the assumptions are not communicated in core-mm
code anywhere. But also none of the alternatives seem much better.

However there are only three callers of ptep_get_and_clear_full(), and
all of these hold the PTL. So if I'm not mistaken that should exclude
just about all users of ptep_get*(), which will take the PTL beforehand.

So really that only leaves ptep_get_lockless() that could/should
interleave, right? From a quick glance at those users, none look at the
young/dirty information anyway, so I wonder if we can just assert in the
core-mm that ptep_get_lockless() does not return young/dirty information
and clear it in the helpers? That would make things explicit and
consistent, which would address my concern (although I haven't looked too
closely at the details there).
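
Something along these lines maybe (sketch only; the helper name is made up
and I haven't audited all the callers):

    static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
    {
        pte_t pte = ptep_get_lockless(ptep);

        /* Lockless readers must not rely on young/dirty. */
        return pte_mkold(pte_mkclean(pte));
    }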

> Thanks,
> Ryan

2023-11-24 08:53:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

>> One could simply skip batching for now on pte_protnone() and focus on the
>> "writable" vs. "not-writable".
>
> I'm not sure we can simply "skip" batching on pte_protnone() since we will need
> to terminate the batch if we spot it. But if we have to look for it anyway, we
> might as well just terminate the batch when the value of pte_protnone()
> *changes*. I'm also proposing to take this approach for pte_uffd_wp() which also
> needs to be carefully preserved per-pte.

Yes, that's what I meant.

--
Cheers,

David / dhildenb

2023-11-24 08:56:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 24/11/2023 01:35, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>> a full address space teardown is in progress. We use this information to
>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>> are about to clear the rest too.
>>>>
>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>> kernel. With this optimization in place, we are back down to the
>>>> original cost.
>>>>
>>>> This approach is not perfect though, as for the duration between
>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>> the final call, the contpte block is in an intermediate state, where some
>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>> other APIs are called for the ptes in the contpte block during that
>>>> time, we have to be very careful. The core code currently interleaves
>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>> must be careful to ignore the cleared entries when accumulating the
>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>> other calls we might reasonably expect are to set markers in the
>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>> after the tlbi, at which point we are no longer in the intermediate
>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>> the 'full' case anyway.
>>>>
>>>> The last remaining issue is returning the access/dirty bits. That info
>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>> will gather those bits from across the contpte block. We don't bother
>>>> doing that here, because we know that the information is used by the
>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>> same folio must be underpinning the whole block (that was a requirement
>>>> for folding in the first place), that information will make it to the
>>>> folio eventually once all the ptes have been cleared. This approach
>>>> means we don't have to play games with accumulating and storing the
>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>> correct access/dirty information if we have already cleared the pte that
>>>> happened to store it. The core code does not rely on this though.
>>>
>>> Does not *currently* rely on this. I can't help but think it is
>>> potentially something that could change in the future though which would
>>> lead to some subtle bugs.
>>
>> Yes, there is a risk, although IMHO, its very small.
>>
>>>
>>> Would there be any may of avoiding this? Half baked thought but could
>>> you for example copy the access/dirty information to the last (or
>>> perhaps first, most likely invalid) PTE?
>>
>> I spent a long time thinking about this and came up with a number of
>> possibilities, none of them ideal. In the end, I went for the simplest one
>> (which works but suffers from the problem that it depends on the way it is
>> called not changing).
>
> Ok, that answers my underlying question of "has someone thought about
> this and are there any easy solutions". I suspected that was the case
> given the excellent write up though!
>
>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>> contpte block. This is how I did it in v1; although it was racy. I think this
>> could be implemented correctly but its extremely complex.
>>
>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>
>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>> called. So we would get a clear sign that usage patterns have changed. But there
>> is no easy place to store that state (other than scanning a contpte block
>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>
>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>> simple and see if anyone was concerned before doing something more drastic.
>>
>> What do you think? If we really need to solve this, then option 1 is my
>> preferred route, but it would take some time to figure out and reason about a
>> race-free scheme.
>
> Well I like simple, and I agree the risk is small. But I can't help feel
> the current situation is too subtle, mainly because it is architecture
> specific and the assumptions are not communicated in core-mm code
> anywhere. But also none of the aternatives seem much better.
>
> However there are only three callers of ptep_get_and_clear_full(), and
> all of these hold the PTL. So if I'm not mistaken that should exclude
> just about all users of ptep_get*() which will take the ptl before hand.

The problem isn't racing threads because as you say, the PTL is already
serializing all calls except ptep_get_lockless(). And although there are 3
callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
calls it with full=1, as I recall.

The problem is that the caller in zap_pte_range() does this:

ptl = lock_page_table()
for each pte {
        ptent = ptep_get(pte);
        if (pte_present(ptent)) {
                ptent = ptep_get_and_clear_full(mm, addr, pte, 1);
                if (pte_dirty(ptent))
                        ...
                if (pte_young(ptent))
                        ...
        }
}
unlock_page_table(ptl)

It deliberately interleaves calls to ptep_get() and ptep_get_and_clear_full()
under the ptl. So if the loop is iterating over a contpte block and the HW
happens to be storing the access/dirty info in the first pte entry, then the
first time through the loop, ptep_get() will return the correct access/dirty
info, as will ptep_get_and_clear_full(). The next time through the loop though,
the access/dirty info which was in the previous pte is now cleared so ptep_get()
and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
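
To make that concrete, here is a minimal sketch (in the spirit of what is
described above, but not the actual arm64 code; contpte_align_down() and
NR_CONT_PTES are assumed names) of how a contpte-aware ptep_get() can gather
the access/dirty bits while skipping entries that have already been cleared:

static pte_t sketch_contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
        pte_t *first = contpte_align_down(ptep);        /* assumed helper */
        int i;

        for (i = 0; i < NR_CONT_PTES; i++, first++) {
                pte_t pte = READ_ONCE(*first);          /* raw read of each entry */

                if (!pte_valid(pte))                    /* skip already-cleared ptes */
                        continue;

                if (pte_dirty(pte))
                        orig_pte = pte_mkdirty(orig_pte);
                if (pte_young(pte))
                        orig_pte = pte_mkyoung(orig_pte);
        }

        return orig_pte;
}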


>
> So really that only leaves ptep_get_lockless() that could/should
> interleave right?

Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
is very careful to ensure that the contpte block is in a consistent state and it
keeps trying until it is. So this will always return the correct consistent
information.
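
As a rough illustration only (reusing the hypothetical helpers from the sketch
above; this is not the real contpte_ptep_get_lockless()), the retry shape looks
something like:

static pte_t sketch_ptep_get_lockless(pte_t *ptep)
{
        pte_t orig_pte, pte;

        do {
                orig_pte = READ_ONCE(*ptep);
                if (!pte_valid_cont(orig_pte))
                        return orig_pte;

                pte = sketch_contpte_ptep_get(ptep, orig_pte);

                /*
                 * Retry if the entry changed underneath us while walking the
                 * block (ignoring access/dirty, which HW may set at any time).
                 */
        } while (!pte_same(pte_mkold(pte_mkclean(READ_ONCE(*ptep))),
                           pte_mkold(pte_mkclean(orig_pte))));

        return pte;
}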

> From a quick glance of those users none look at the
> young/dirty information anyway, so I wonder if we can just assert in the
> core-mm that ptep_get_lockless() does not return young/dirty information
> and clear it in the helpers? That would make things explicit and
> consistent which would address my concern (although I haven't looked too
> closely at the details there).

As per the explanation above, it's not ptep_get_lockless() that is the problem, so I
don't think this helps.

Thanks,
Ryan

>
>> Thanks,
>> Ryan
>

2023-11-27 03:18:48

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

> Ryan Roberts (14):
> mm: Batch-copy PTE ranges during fork()
> arm64/mm: set_pte(): New layer to manage contig bit
> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> arm64/mm: pte_clear(): New layer to manage contig bit
> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> arm64/mm: ptep_get(): New layer to manage contig bit
> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> arm64/mm: Wire up PTE_CONT for user mappings
> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

Hi Ryan,
Not quite sure if I missed something: are we splitting/unfolding CONTPTEs
in the cases below?

1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio

2. vma split in a large folio due to various reasons such as mprotect,
munmap, mlock etc.

3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
rather than being as a whole.

In hardware, we need to make sure CONTPTEs follow the rule - always 16
PTEs covering contiguous physical addresses with CONTPTE set. If one of them
runs away from the group of 16 PTEs and the PTEs become inconsistent, some
terrible errors/faults can happen in HW. For example:

case 0:
addr0 PTE - has no CONTPTE
addr0+4kb PTE - has CONTPTE
....
addr0+60kb PTE - has CONTPTE

case 1:
addr0 PTE - has no CONTPTE
addr0+4kb PTE - has CONTPTE
....
addr0+60kb PTE - has swap

Inconsistent 16 PTEs will lead to a crash even in the firmware, based on
our observation.

Thanks
Barry


2023-11-27 05:54:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);

in case we have a large folio with 16 CONTPTE basepages, and userspace
do madvise(addr + 4KB * 5, DONTNEED);

thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
will return 15. in this case, we should copy page0~page3 and page5~page15.

but the current code is copying page0~page14, right? unless we immediately
split_folio() to basepages in zap_pte_range(), we will have problems?

> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }

Thanks
Barry

2023-11-27 07:43:55

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 24/11/2023 01:35, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>> a full address space teardown is in progress. We use this information to
>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>> are about to clear the rest too.
>>>>>
>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>> kernel. With this optimization in place, we are back down to the
>>>>> original cost.
>>>>>
>>>>> This approach is not perfect though, as for the duration between
>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>> the final call, the contpte block in an intermediate state, where some
>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>> other APIs are called for the ptes in the contpte block during that
>>>>> time, we have to be very careful. The core code currently interleaves
>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>> other calls we might resonably expect are to set markers in the
>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>> the 'full' case anyway.
>>>>>
>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>> will gather those bits from across the contpte block. We don't bother
>>>>> doing that here, because we know that the information is used by the
>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>> for folding in the first place), that information will make it to the
>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>> means we don't have to play games with accumulating and storing the
>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>> correct access/dirty information if we have already cleared the pte that
>>>>> happened to store it. The core code does not rely on this though.
>>>>
>>>> Does not *currently* rely on this. I can't help but think it is
>>>> potentially something that could change in the future though which would
>>>> lead to some subtle bugs.
>>>
>>> Yes, there is a risk, although IMHO, its very small.
>>>
>>>>
>>>> Would there be any may of avoiding this? Half baked thought but could
>>>> you for example copy the access/dirty information to the last (or
>>>> perhaps first, most likely invalid) PTE?
>>>
>>> I spent a long time thinking about this and came up with a number of
>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>> (which works but suffers from the problem that it depends on the way it is
>>> called not changing).
>>
>> Ok, that answers my underlying question of "has someone thought about
>> this and are there any easy solutions". I suspected that was the case
>> given the excellent write up though!
>>
>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>> could be implemented correctly but its extremely complex.
>>>
>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>
>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>> called. So we would get a clear sign that usage patterns have changed. But there
>>> is no easy place to store that state (other than scanning a contpte block
>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>
>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>> simple and see if anyone was concerned before doing something more drastic.
>>>
>>> What do you think? If we really need to solve this, then option 1 is my
>>> preferred route, but it would take some time to figure out and reason about a
>>> race-free scheme.
>>
>> Well I like simple, and I agree the risk is small. But I can't help feel
>> the current situation is too subtle, mainly because it is architecture
>> specific and the assumptions are not communicated in core-mm code
>> anywhere. But also none of the aternatives seem much better.
>>
>> However there are only three callers of ptep_get_and_clear_full(), and
>> all of these hold the PTL. So if I'm not mistaken that should exclude
>> just about all users of ptep_get*() which will take the ptl before hand.
>
> The problem isn't racing threads because as you say, the PTL is already
> serializing all calls except ptep_get_lockless(). And although there are 3
> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
> calls it with full=1, as I recall.
>
> The problem is that the caller in zap_pte_range() does this:
>
> ptl = lock_page_table()
> for each pte {
> ptent = ptep_get(pte);
> if (pte_present(ptent) {
> ptent = ptep_get_and_clear_full(ptent);
> if (pte_dirty(ptent))
> ...
> if (pte_young(ptent))
> ...
> }
> }
> unlock_page_table(ptl)
>
> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
> under the ptl. So if the loop is iterating over a contpte block and the HW
> happens to be storing the access/dirty info in the first pte entry, then the
> first time through the loop, ptep_get() will return the correct access/dirty
> info, as will ptep_get_and_clear_full(). The next time through the loop though,
> the access/dirty info which was in the previous pte is now cleared so ptep_get()
> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.

So if ptep_get_lockless() isn't a concern what made the option posted in
v1 racy (your option 1 above)? Is there something else reading PTEs or
clearing PTE bits without holding the PTL that I'm missing?

>>
>> So really that only leaves ptep_get_lockless() that could/should
>> interleave right?
>
> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
> is very careful to ensure that the contpte block is in a consistent state and it
> keeps trying until it is. So this will always return the correct consistent
> information.
>
>> From a quick glance of those users none look at the
>> young/dirty information anyway, so I wonder if we can just assert in the
>> core-mm that ptep_get_lockless() does not return young/dirty information
>> and clear it in the helpers? That would make things explicit and
>> consistent which would address my concern (although I haven't looked too
>> closely at the details there).
>
> As per explanation above, its not ptep_get_lockless() that is the problem so I
> don't think this helps.
>
> Thanks,
> Ryan
>
>>
>>> Thanks,
>>> Ryan
>>

2023-11-27 08:42:56

by Barry Song

[permalink] [raw]
Subject: Re: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>> }
>> - rss[MM_ANONPAGES]++;
>> - } else if (page) {
>> - folio_get(folio);
>> - page_dup_file_rmap(page, false);
>> - rss[mm_counter_file(page)]++;
>> +
>> + nr = i;
>> + folio_ref_add(folio, nr);
>
> You're changing the order of mapcount vs. refcount increment. Don't.
> Make sure your refcount >= mapcount.
>
> You can do that easily by doing the folio_ref_add(folio, nr) first and
> then decrementing in case of error accordingly. Errors due to pinned
> pages are the corner case.
>
> I'll note that it will make a lot of sense to have batch variants of
> page_try_dup_anon_rmap() and page_dup_file_rmap().
>

I still don't understand why it is not an entire_map +1, but an increment
in each basepage.

As long as it is a CONTPTE large folio, there is not much difference from a
PMD-mapped large folio. It has every chance of becoming DoubleMapped and
needing a split.

When A and B share a CONTPTE large folio and we do madvise(DONTNEED) or
something similar on a part of the large folio in process A,

this large folio will have partially mapped subpages in A (all CONTPTE bits
in all subpages need to be removed even though we only unmap a part of the
large folio, as HW requires consistent CONTPTEs); and it is entirely mapped in
process B (all PTEs are still CONTPTEs in process B).

Isn't it more sensible for this large folio to have entire_map = 0 (for
process B), and for subpages which are still mapped in process A to have
map_count = 0? (starting from -1).

> Especially, the batch variant of page_try_dup_anon_rmap() would only
> check once if the folio maybe pinned, and in that case, you can simply
> drop all references again. So you either have all or no ptes to process,
> which makes that code easier.
>
> But that can be added on top, and I'll happily do that.
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

2023-11-27 08:53:31

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 27/11/2023 07:34, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>
>>>>> Ryan Roberts <[email protected]> writes:
>>>>>
>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>> a full address space teardown is in progress. We use this information to
>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>> are about to clear the rest too.
>>>>>>
>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>> original cost.
>>>>>>
>>>>>> This approach is not perfect though, as for the duration between
>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>> other calls we might resonably expect are to set markers in the
>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>> the 'full' case anyway.
>>>>>>
>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>> doing that here, because we know that the information is used by the
>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>> for folding in the first place), that information will make it to the
>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>> means we don't have to play games with accumulating and storing the
>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>> happened to store it. The core code does not rely on this though.
>>>>>
>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>> potentially something that could change in the future though which would
>>>>> lead to some subtle bugs.
>>>>
>>>> Yes, there is a risk, although IMHO, its very small.
>>>>
>>>>>
>>>>> Would there be any may of avoiding this? Half baked thought but could
>>>>> you for example copy the access/dirty information to the last (or
>>>>> perhaps first, most likely invalid) PTE?
>>>>
>>>> I spent a long time thinking about this and came up with a number of
>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>> (which works but suffers from the problem that it depends on the way it is
>>>> called not changing).
>>>
>>> Ok, that answers my underlying question of "has someone thought about
>>> this and are there any easy solutions". I suspected that was the case
>>> given the excellent write up though!
>>>
>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>>> could be implemented correctly but its extremely complex.
>>>>
>>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>>
>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>> is no easy place to store that state (other than scanning a contpte block
>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>
>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>
>>>> What do you think? If we really need to solve this, then option 1 is my
>>>> preferred route, but it would take some time to figure out and reason about a
>>>> race-free scheme.
>>>
>>> Well I like simple, and I agree the risk is small. But I can't help feel
>>> the current situation is too subtle, mainly because it is architecture
>>> specific and the assumptions are not communicated in core-mm code
>>> anywhere. But also none of the aternatives seem much better.
>>>
>>> However there are only three callers of ptep_get_and_clear_full(), and
>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>> just about all users of ptep_get*() which will take the ptl before hand.
>>
>> The problem isn't racing threads because as you say, the PTL is already
>> serializing all calls except ptep_get_lockless(). And although there are 3
>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>> calls it with full=1, as I recall.
>>
>> The problem is that the caller in zap_pte_range() does this:
>>
>> ptl = lock_page_table()
>> for each pte {
>> ptent = ptep_get(pte);
>> if (pte_present(ptent) {
>> ptent = ptep_get_and_clear_full(ptent);
>> if (pte_dirty(ptent))
>> ...
>> if (pte_young(ptent))
>> ...
>> }
>> }
>> unlock_page_table(ptl)
>>
>> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
>> under the ptl. So if the loop is iterating over a contpte block and the HW
>> happens to be storing the access/dirty info in the first pte entry, then the
>> first time through the loop, ptep_get() will return the correct access/dirty
>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>
> So if ptep_get_lockless() isn't a concern what made the option posted in
> v1 racy (your option 1 above)? Is there something else reading PTEs or
> clearing PTE bits without holding the PTL that I'm missing?

The HW could be racing to set access and dirty bits. Well actually, I'm not
completely sure if that's the case here; if full=1 then presumably no other
threads in the process should be running at this point, so perhaps it can be
guaranteed that nothing is causing a concurrent memory access and the HW is
therefore definitely not going to try to write the access/dirty bits
concurrently. But I didn't manage to convince myself that's definitely the case.

So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
buggy because it iterated through the PTEs, getting and accumulating. Then
iterated again, writing that final set of bits to all the PTEs. And the HW could
have modified the bits during those loops. I think it would be possible to fix
the race, but intuition says it would be expensive.
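
To illustrate the shape of that race (purely a sketch with assumed names such
as NR_CONT_PTES; this is not the v1 code), a gather-then-write scheme looks
roughly like this, and the window between the two passes is where a HW update
can be lost:

static void sketch_racy_gather_then_write(pte_t *ptep)
{
        pte_t snap[NR_CONT_PTES];
        bool dirty = false, young = false;
        int i;

        /* Pass 1: snapshot the block and accumulate access/dirty. */
        for (i = 0; i < NR_CONT_PTES; i++) {
                snap[i] = READ_ONCE(ptep[i]);
                dirty |= pte_dirty(snap[i]);
                young |= pte_young(snap[i]);
        }

        /*
         * Window: HW (via AF/DBM) may set the access or dirty bit in one of
         * the entries right here, after it was snapshotted.
         */

        /* Pass 2: rewrite every entry from the stale snapshot. */
        for (i = 0; i < NR_CONT_PTES; i++) {
                pte_t pte = snap[i];

                if (dirty)
                        pte = pte_mkdirty(pte);
                if (young)
                        pte = pte_mkyoung(pte);

                /* Any HW update since pass 1 is silently overwritten here. */
                WRITE_ONCE(ptep[i], pte);
        }
}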

>
>>>
>>> So really that only leaves ptep_get_lockless() that could/should
>>> interleave right?
>>
>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>> is very careful to ensure that the contpte block is in a consistent state and it
>> keeps trying until it is. So this will always return the correct consistent
>> information.
>>
>>> From a quick glance of those users none look at the
>>> young/dirty information anyway, so I wonder if we can just assert in the
>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>> and clear it in the helpers? That would make things explicit and
>>> consistent which would address my concern (although I haven't looked too
>>> closely at the details there).
>>
>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>> don't think this helps.
>>
>> Thanks,
>> Ryan
>>
>>>
>>>> Thanks,
>>>> Ryan
>>>
>

2023-11-27 09:20:03

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 27/11/2023 03:18, Barry Song wrote:
>> Ryan Roberts (14):
>> mm: Batch-copy PTE ranges during fork()
>> arm64/mm: set_pte(): New layer to manage contig bit
>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>> arm64/mm: pte_clear(): New layer to manage contig bit
>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>> arm64/mm: ptep_get(): New layer to manage contig bit
>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>> arm64/mm: Wire up PTE_CONT for user mappings
>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>
> Hi Ryan,
> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> in the below cases

The general idea is that the core-mm sets the individual ptes (one at a time if
it likes with set_pte_at(), or in a block with set_ptes()), modifies their
permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
(ptep_clear(), etc); this is exactly the same interface as previously.

BUT, the arm64 implementation of those interfaces will now detect when a set of
adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
base pages) are all appropriate for having the CONT_PTE bit set; in this case
the block is "folded". And it will detect when the first PTE in the block
changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
requirements for folding a contpte block is that all the pages must belong to
the *same* folio (that means it's safe to only track access/dirty for the contpte
block as a whole rather than for each individual pte).

(there are a couple of optimizations that make the reality slightly more
complicated than what I've just explained, but you get the idea).
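
As a very rough sketch of that check (contpte_try_fold() is the real entry
point; contpte_align_down() and NR_CONT_PTES are assumed names here), the core
of the fold decision is that all 16 entries must be valid and map physically
contiguous pages:

static bool sketch_can_fold(pte_t *ptep, pte_t pte)
{
        pte_t *first = contpte_align_down(ptep);        /* assumed helper */
        unsigned long pfn = pte_pfn(pte) - (ptep - first);
        int i;

        for (i = 0; i < NR_CONT_PTES; i++, first++, pfn++) {
                pte_t entry = READ_ONCE(*first);

                /* Every entry must be valid and physically contiguous. */
                if (!pte_valid(entry) || pte_pfn(entry) != pfn)
                        return false;

                /*
                 * The real check also requires matching attributes and that
                 * all pages belong to the same folio.
                 */
        }

        return true;
}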

On that basis, I believe all the specific cases you describe below are all
covered and safe - please let me know if you think there is a hole here!

>
> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio

The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
whatever). The implementation of that will cause an unfold and the CONT_PTE bit
is removed from the whole contpte block. If there is then a subsequent
set_pte_at() to set a swap entry, the implementation will see that it's not
appropriate to re-fold, so the range will remain unfolded.

>
> 2. vma split in a large folio due to various reasons such as mprotect,
> munmap, mlock etc.

I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
suspect not, so if the VMA is split in the middle of a currently folded contpte
block, it will remain folded. But this is safe and continues to work correctly.
The VMA arrangement is not important; it is just important that a single folio
is mapped contiguously across the whole block.

>
> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> rather than being as a whole.

Yes, as per 1; the arm64 implementation will notice when the first entry is
cleared and unfold the contpte block.

>
> In hardware, we need to make sure CONTPTE follow the rule - always 16
> contiguous physical address with CONTPTE set. if one of them run away
> from the 16 ptes group and PTEs become unconsistent, some terrible
> errors/faults can happen in HW. for example

Yes, the implementation obeys all these rules; see contpte_try_fold() and
contpte_try_unfold(). The fold/unfold operation is only done when all
requirements are met, and we perform it in a manner that is conformant to the
architecture requirements (see contpte_fold() - being renamed to
contpte_convert() in the next version).
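
For illustration, the conformant sequence is essentially a break-before-make
style update over the whole block - roughly the following sketch, where the
helper names are assumed and 'pte' is taken to be the value intended for the
first entry of the block (this is not the actual contpte_fold()/contpte_convert()
code):

static void sketch_contpte_convert(struct vm_area_struct *vma,
                                   unsigned long addr, pte_t *ptep, pte_t pte)
{
        struct mm_struct *mm = vma->vm_mm;
        unsigned long start = ALIGN_DOWN(addr, NR_CONT_PTES * PAGE_SIZE);
        pte_t *first = contpte_align_down(ptep);        /* assumed helper */
        int i;

        /* Break: invalidate every entry in the block... */
        for (i = 0; i < NR_CONT_PTES; i++)
                ptep_get_and_clear(mm, start + i * PAGE_SIZE, first + i);

        /* ...and flush any cached translations for the whole range. */
        flush_tlb_range(vma, start, start + NR_CONT_PTES * PAGE_SIZE);

        /* Make: rewrite the block, here with the CONT bit set on each entry. */
        for (i = 0; i < NR_CONT_PTES; i++) {
                pte_t entry = pfn_pte(pte_pfn(pte) + i, pte_pgprot(pte));

                set_pte(first + i, pte_mkcont(entry));
        }
}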

Thanks for the review!

Thanks,
Ryan

>
> case0:
> addr0 PTE - has no CONTPE
> addr0+4kb PTE - has CONTPTE
> ....
> addr0+60kb PTE - has CONTPTE
>
> case 1:
> addr0 PTE - has no CONTPE
> addr0+4kb PTE - has CONTPTE
> ....
> addr0+60kb PTE - has swap
>
> Unconsistent 16 PTEs will lead to crash even in the firmware based on
> our observation.
>
> Thanks
> Barry
>
>

2023-11-27 09:24:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 05:54, Barry Song wrote:
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> + pte_t *dst_pte, pte_t *src_pte,
>> + unsigned long addr, unsigned long end,
>> + int *rss, struct folio **prealloc)
>> {
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> unsigned long vm_flags = src_vma->vm_flags;
>> pte_t pte = ptep_get(src_pte);
>> struct page *page;
>> struct folio *folio;
>> + int nr = 1;
>> + bool anon;
>> + bool any_dirty = pte_dirty(pte);
>> + int i;
>>
>> page = vm_normal_page(src_vma, addr, pte);
>> - if (page)
>> + if (page) {
>> folio = page_folio(page);
>> - if (page && folio_test_anon(folio)) {
>> - /*
>> - * If this page may have been pinned by the parent process,
>> - * copy the page immediately for the child so that we'll always
>> - * guarantee the pinned page won't be randomly replaced in the
>> - * future.
>> - */
>> - folio_get(folio);
>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> - /* Page may be pinned, we have to copy. */
>> - folio_put(folio);
>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, prealloc, page);
>> + anon = folio_test_anon(folio);
>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> + end, pte, &any_dirty);
>
> in case we have a large folio with 16 CONTPTE basepages, and userspace
> do madvise(addr + 4KB * 5, DONTNEED);

nit: if you are offsetting by 5 pages from addr, then below I think you mean
page0~page4 and page6~15?

>
> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> will return 15. in this case, we should copy page0~page3 and page5~page15.

No, I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
not how it's intended to work. The function is scanning forwards from the current
pte until it finds the first pte that does not fit in the batch - either because
it maps a PFN that is not contiguous, or because the permissions are different
(although this is being relaxed a bit; see conversation with DavidH against this
same patch).

So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
(page0~page4) then the next time through the loop we will go through the
!present path and process the single swap marker. Then the 3rd time through the
loop folio_nr_pages_cont_mapped() will return 10.
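
For reference, the scanning behaviour is roughly the following (a simplified
sketch only, not the real folio_nr_pages_cont_mapped(); it ignores the
any_dirty accumulation and the permission checks):

static int sketch_nr_cont_mapped(struct folio *folio, struct page *page,
                                 pte_t *ptep, unsigned long addr,
                                 unsigned long end, pte_t pte)
{
        unsigned long pfn = pte_pfn(pte);
        int nr = 0;

        while (addr < end && page_folio(page) == folio) {
                pte_t entry = ptep_get(ptep);

                /* Stop at the first entry that doesn't fit the batch. */
                if (!pte_present(entry) || pte_pfn(entry) != pfn)
                        break;

                nr++;
                page++;
                ptep++;
                pfn++;
                addr += PAGE_SIZE;
        }

        return nr;
}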

Thanks,
Ryan

>
> but the current code is copying page0~page14, right? unless we are immediatly
> split_folio to basepages in zap_pte_range(), we will have problems?
>
>> +
>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>
> Thanks
> Barry
>

2023-11-27 09:36:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 08:42, Barry Song wrote:
>>> + for (i = 0; i < nr; i++, page++) {
>>> + if (anon) {
>>> + /*
>>> + * If this page may have been pinned by the
>>> + * parent process, copy the page immediately for
>>> + * the child so that we'll always guarantee the
>>> + * pinned page won't be randomly replaced in the
>>> + * future.
>>> + */
>>> + if (unlikely(page_try_dup_anon_rmap(
>>> + page, false, src_vma))) {
>>> + if (i != 0)
>>> + break;
>>> + /* Page may be pinned, we have to copy. */
>>> + return copy_present_page(
>>> + dst_vma, src_vma, dst_pte,
>>> + src_pte, addr, rss, prealloc,
>>> + page);
>>> + }
>>> + rss[MM_ANONPAGES]++;
>>> + VM_BUG_ON(PageAnonExclusive(page));
>>> + } else {
>>> + page_dup_file_rmap(page, false);
>>> + rss[mm_counter_file(page)]++;
>>> + }
>>> }
>>> - rss[MM_ANONPAGES]++;
>>> - } else if (page) {
>>> - folio_get(folio);
>>> - page_dup_file_rmap(page, false);
>>> - rss[mm_counter_file(page)]++;
>>> +
>>> + nr = i;
>>> + folio_ref_add(folio, nr);
>>
>> You're changing the order of mapcount vs. refcount increment. Don't.
>> Make sure your refcount >= mapcount.
>>
>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>> then decrementing in case of error accordingly. Errors due to pinned
>> pages are the corner case.
>>
>> I'll note that it will make a lot of sense to have batch variants of
>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>
>
> i still don't understand why it is not a entire map+1, but an increment
> in each basepage.

Because we are PTE-mapping the folio, we have to account each individual page.
If we accounted the entire folio, where would we unaccount it? Each page can be
unmapped individually (e.g. munmap() part of the folio) so we need to account each
page. When PMD mapping, the whole thing is either mapped or unmapped, and it's
atomic, so we can account the entire thing.

>
> as long as it is a CONTPTE large folio, there is no much difference with
> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> split.
>
> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> similar things on a part of the large folio in process A,
>
> this large folio will have partially mapped subpage in A (all CONTPE bits
> in all subpages need to be removed though we only unmap a part of the
> large folioas HW requires consistent CONTPTEs); and it has entire map in
> process B(all PTEs are still CONPTES in process B).
>
> isn't it more sensible for this large folios to have entire_map = 0(for
> process B), and subpages which are still mapped in process A has map_count
> =0? (start from -1).
>
>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>> check once if the folio maybe pinned, and in that case, you can simply
>> drop all references again. So you either have all or no ptes to process,
>> which makes that code easier.

I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
fundamentally you can only use entire_mapcount if it's only possible to map and
unmap the whole folio atomically.

>>
>> But that can be added on top, and I'll happily do that.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>
> Thanks
> Barry
>

2023-11-27 10:00:54

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 08:42, Barry Song wrote:
> >>> + for (i = 0; i < nr; i++, page++) {
> >>> + if (anon) {
> >>> + /*
> >>> + * If this page may have been pinned by the
> >>> + * parent process, copy the page immediately for
> >>> + * the child so that we'll always guarantee the
> >>> + * pinned page won't be randomly replaced in the
> >>> + * future.
> >>> + */
> >>> + if (unlikely(page_try_dup_anon_rmap(
> >>> + page, false, src_vma))) {
> >>> + if (i != 0)
> >>> + break;
> >>> + /* Page may be pinned, we have to copy. */
> >>> + return copy_present_page(
> >>> + dst_vma, src_vma, dst_pte,
> >>> + src_pte, addr, rss, prealloc,
> >>> + page);
> >>> + }
> >>> + rss[MM_ANONPAGES]++;
> >>> + VM_BUG_ON(PageAnonExclusive(page));
> >>> + } else {
> >>> + page_dup_file_rmap(page, false);
> >>> + rss[mm_counter_file(page)]++;
> >>> + }
> >>> }
> >>> - rss[MM_ANONPAGES]++;
> >>> - } else if (page) {
> >>> - folio_get(folio);
> >>> - page_dup_file_rmap(page, false);
> >>> - rss[mm_counter_file(page)]++;
> >>> +
> >>> + nr = i;
> >>> + folio_ref_add(folio, nr);
> >>
> >> You're changing the order of mapcount vs. refcount increment. Don't.
> >> Make sure your refcount >= mapcount.
> >>
> >> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >> then decrementing in case of error accordingly. Errors due to pinned
> >> pages are the corner case.
> >>
> >> I'll note that it will make a lot of sense to have batch variants of
> >> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>
> >
> > i still don't understand why it is not a entire map+1, but an increment
> > in each basepage.
>
> Because we are PTE-mapping the folio, we have to account each individual page.
> If we accounted the entire folio, where would we unaccount it? Each page can be
> unmapped individually (e.g. munmap() part of the folio) so need to account each
> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> atomic, so we can account the entire thing.

Hi Ryan,

There is no problem. For example, a large folio is entirely mapped in
process A with CONTPTE,
and only page2 is mapped in process B.
Then we will have

entire_map = 0
page0.map = -1
page1.map = -1
page2.map = 0
page3.map = -1
....

>
> >
> > as long as it is a CONTPTE large folio, there is no much difference with
> > PMD-mapped large folio. it has all the chance to be DoubleMap and need
> > split.
> >
> > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> > similar things on a part of the large folio in process A,
> >
> > this large folio will have partially mapped subpage in A (all CONTPE bits
> > in all subpages need to be removed though we only unmap a part of the
> > large folioas HW requires consistent CONTPTEs); and it has entire map in
> > process B(all PTEs are still CONPTES in process B).
> >
> > isn't it more sensible for this large folios to have entire_map = 0(for
> > process B), and subpages which are still mapped in process A has map_count
> > =0? (start from -1).
> >
> >> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >> check once if the folio maybe pinned, and in that case, you can simply
> >> drop all references again. So you either have all or no ptes to process,
> >> which makes that code easier.
>
> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> fundamentally you can only use entire_mapcount if its only possible to map and
> unmap the whole folio atomically.



My point is that CONTPTEs should either all be set in all 16 PTEs or all be
dropped in the 16 PTEs. If all PTEs have CONT, it is entirely mapped;
otherwise, it is partially mapped. If a large folio is mapped in one process
with all CONTPTEs and meanwhile in another process with a partial mapping
(w/o CONTPTE), it is DoubleMapped.

Since we always hold the ptl to set or drop CONTPTE bits, set/drop is still
atomic within the spinlock area.

>
> >>
> >> But that can be added on top, and I'll happily do that.
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >

Thanks
Barry

2023-11-27 10:11:24

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 09:59, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 08:42, Barry Song wrote:
>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>> + if (anon) {
>>>>> + /*
>>>>> + * If this page may have been pinned by the
>>>>> + * parent process, copy the page immediately for
>>>>> + * the child so that we'll always guarantee the
>>>>> + * pinned page won't be randomly replaced in the
>>>>> + * future.
>>>>> + */
>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>> + page, false, src_vma))) {
>>>>> + if (i != 0)
>>>>> + break;
>>>>> + /* Page may be pinned, we have to copy. */
>>>>> + return copy_present_page(
>>>>> + dst_vma, src_vma, dst_pte,
>>>>> + src_pte, addr, rss, prealloc,
>>>>> + page);
>>>>> + }
>>>>> + rss[MM_ANONPAGES]++;
>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>> + } else {
>>>>> + page_dup_file_rmap(page, false);
>>>>> + rss[mm_counter_file(page)]++;
>>>>> + }
>>>>> }
>>>>> - rss[MM_ANONPAGES]++;
>>>>> - } else if (page) {
>>>>> - folio_get(folio);
>>>>> - page_dup_file_rmap(page, false);
>>>>> - rss[mm_counter_file(page)]++;
>>>>> +
>>>>> + nr = i;
>>>>> + folio_ref_add(folio, nr);
>>>>
>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>> Make sure your refcount >= mapcount.
>>>>
>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>> pages are the corner case.
>>>>
>>>> I'll note that it will make a lot of sense to have batch variants of
>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>
>>>
>>> i still don't understand why it is not a entire map+1, but an increment
>>> in each basepage.
>>
>> Because we are PTE-mapping the folio, we have to account each individual page.
>> If we accounted the entire folio, where would we unaccount it? Each page can be
>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>> atomic, so we can account the entire thing.
>
> Hi Ryan,
>
> There is no problem. for example, a large folio is entirely mapped in
> process A with CONPTE,
> and only page2 is mapped in process B.
> then we will have
>
> entire_map = 0
> page0.map = -1
> page1.map = -1
> page2.map = 0
> page3.map = -1
> ....
>
>>
>>>
>>> as long as it is a CONTPTE large folio, there is no much difference with
>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>> split.
>>>
>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>> similar things on a part of the large folio in process A,
>>>
>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>> in all subpages need to be removed though we only unmap a part of the
>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>> process B(all PTEs are still CONPTES in process B).
>>>
>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>> process B), and subpages which are still mapped in process A has map_count
>>> =0? (start from -1).
>>>
>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>> drop all references again. So you either have all or no ptes to process,
>>>> which makes that code easier.
>>
>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>> fundamentally you can only use entire_mapcount if its only possible to map and
>> unmap the whole folio atomically.
>
>
>
> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> it is partially
> mapped. if a large folio is mapped in one processes with all CONTPTEs
> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> DoubleMapped.

There are 2 problems with your proposal, as I see it:

1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
concerned, it's just mapping a bunch of PTEs. So it has no hook to inc/dec
entire_mapcount. The arch code is opportunistically and *transparently* managing
the CONT_PTE bit.

2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
be mapped with 32 contpte blocks. So you can't say it is entirely mapped
unless/until ALL of those blocks are set up. And then of course each block could
be unmapped non-atomically.

For the PMD case there are actually 2 properties that allow using the
entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
and we know that the folio is exactly PMD sized (since it must be at least PMD
sized to be able to map it with the PMD, and we don't allocate THPs any bigger
than PMD size). So one PMD map or unmap operation corresponds to exactly one
*entire* map or unmap. That is not true when we are PTE mapping.

>
> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> still atomic in a
> spinlock area.
>
>>
>>>>
>>>> But that can be added on top, and I'll happily do that.
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>
>
> Thanks
> Barry

2023-11-27 10:29:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 09:59, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 08:42, Barry Song wrote:
> >>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>> + if (anon) {
> >>>>> + /*
> >>>>> + * If this page may have been pinned by the
> >>>>> + * parent process, copy the page immediately for
> >>>>> + * the child so that we'll always guarantee the
> >>>>> + * pinned page won't be randomly replaced in the
> >>>>> + * future.
> >>>>> + */
> >>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>> + page, false, src_vma))) {
> >>>>> + if (i != 0)
> >>>>> + break;
> >>>>> + /* Page may be pinned, we have to copy. */
> >>>>> + return copy_present_page(
> >>>>> + dst_vma, src_vma, dst_pte,
> >>>>> + src_pte, addr, rss, prealloc,
> >>>>> + page);
> >>>>> + }
> >>>>> + rss[MM_ANONPAGES]++;
> >>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>> + } else {
> >>>>> + page_dup_file_rmap(page, false);
> >>>>> + rss[mm_counter_file(page)]++;
> >>>>> + }
> >>>>> }
> >>>>> - rss[MM_ANONPAGES]++;
> >>>>> - } else if (page) {
> >>>>> - folio_get(folio);
> >>>>> - page_dup_file_rmap(page, false);
> >>>>> - rss[mm_counter_file(page)]++;
> >>>>> +
> >>>>> + nr = i;
> >>>>> + folio_ref_add(folio, nr);
> >>>>
> >>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>> Make sure your refcount >= mapcount.
> >>>>
> >>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>> pages are the corner case.
> >>>>
> >>>> I'll note that it will make a lot of sense to have batch variants of
> >>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>
> >>>
> >>> i still don't understand why it is not a entire map+1, but an increment
> >>> in each basepage.
> >>
> >> Because we are PTE-mapping the folio, we have to account each individual page.
> >> If we accounted the entire folio, where would we unaccount it? Each page can be
> >> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >> atomic, so we can account the entire thing.
> >
> > Hi Ryan,
> >
> > There is no problem. for example, a large folio is entirely mapped in
> > process A with CONPTE,
> > and only page2 is mapped in process B.
> > then we will have
> >
> > entire_map = 0
> > page0.map = -1
> > page1.map = -1
> > page2.map = 0
> > page3.map = -1
> > ....
> >
> >>
> >>>
> >>> as long as it is a CONTPTE large folio, there is no much difference with
> >>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>> split.
> >>>
> >>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>> similar things on a part of the large folio in process A,
> >>>
> >>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>> in all subpages need to be removed though we only unmap a part of the
> >>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>> process B(all PTEs are still CONPTES in process B).
> >>>
> >>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>> process B), and subpages which are still mapped in process A has map_count
> >>> =0? (start from -1).
> >>>
> >>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>> drop all references again. So you either have all or no ptes to process,
> >>>> which makes that code easier.
> >>
> >> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >> fundamentally you can only use entire_mapcount if its only possible to map and
> >> unmap the whole folio atomically.
> >
> >
> >
> > My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> > in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> > it is partially
> > mapped. if a large folio is mapped in one processes with all CONTPTEs
> > and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> > DoubleMapped.
>
> There are 2 problems with your proposal, as I see it;
>
> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> entire_mapcount. The arch code is opportunistically and *transparently* managing
> the CONT_PTE bit.
>
> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> unless/until ALL of those blocks are set up. And then of course each block could
> be unmapped unatomically.
>
> For the PMD case there are actually 2 properties that allow using the
> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> and we know that the folio is exactly PMD sized (since it must be at least PMD
> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> *entire* map or unmap. That is not true when we are PTE mapping.

well. Thanks for clarification. based on the above description, i agree the
current code might make more sense by always using mapcount in subpage.

I gave my proposals as I thought we were always CONTPTE size for small-THP
then we could drop the loop to iterate 16 times rmap. if we do it
entirely, we only
need to do dup rmap once for all 16 PTEs by increasing entire_map.

BTW, I have concerns about whether a variable small-THP size will really work,
as userspace is probably friendly to only one fixed size. For example, userspace
heap management might be tuned to one particular size for freeing memory back to
the kernel; it is very difficult for the heap to adapt to several sizes at the
same time. Frequent unmaps/frees whose size differs from, and in particular is
smaller than, the small-THP size will defeat all efforts to use small-THP.

>
> >
> > Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> > still atomic in a
> > spinlock area.
> >
> >>
> >>>>
> >>>> But that can be added on top, and I'll happily do that.
> >>>>
> >>>> --
> >>>> Cheers,
> >>>>
> >>>> David / dhildenb
> >>>
> >

Thanks
Barry

2023-11-27 10:36:11

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >> mm: Batch-copy PTE ranges during fork()
> >> arm64/mm: set_pte(): New layer to manage contig bit
> >> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >> arm64/mm: pte_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >> arm64/mm: ptep_get(): New layer to manage contig bit
> >> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >> arm64/mm: Wire up PTE_CONT for user mappings
> >> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means it's safe to only track access/dirty for the contpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.

I don't think it is safe to keep the CONTPTE block folded in a split_vma case,
as otherwise copy_ptes in your other patch might only copy part of the CONTPTEs.
For example, if the folio is split into page0-page4 and page5-page15 by
split_vma, then in fork, while copying the ptes for the first VMA, we only copy
page0-page4; this immediately leaves the CONTPTE block inconsistent, since we
have to make sure all CONTPTEs are mapped atomically under the PTL.

>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >

Thanks
Barry

2023-11-27 11:08:07

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 10:28, Barry Song wrote:
> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 09:59, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>> + if (anon) {
>>>>>>> + /*
>>>>>>> + * If this page may have been pinned by the
>>>>>>> + * parent process, copy the page immediately for
>>>>>>> + * the child so that we'll always guarantee the
>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>> + * future.
>>>>>>> + */
>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>> + page, false, src_vma))) {
>>>>>>> + if (i != 0)
>>>>>>> + break;
>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>> + return copy_present_page(
>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>> + page);
>>>>>>> + }
>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> + } else {
>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>> + }
>>>>>>> }
>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>> - } else if (page) {
>>>>>>> - folio_get(folio);
>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> + nr = i;
>>>>>>> + folio_ref_add(folio, nr);
>>>>>>
>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>> Make sure your refcount >= mapcount.
>>>>>>
>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>> pages are the corner case.
>>>>>>
>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>
>>>>>
>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>> in each basepage.
>>>>
>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>> atomic, so we can account the entire thing.
>>>
>>> Hi Ryan,
>>>
>>> There is no problem. for example, a large folio is entirely mapped in
>>> process A with CONPTE,
>>> and only page2 is mapped in process B.
>>> then we will have
>>>
>>> entire_map = 0
>>> page0.map = -1
>>> page1.map = -1
>>> page2.map = 0
>>> page3.map = -1
>>> ....
>>>
>>>>
>>>>>
>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>> split.
>>>>>
>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>> similar things on a part of the large folio in process A,
>>>>>
>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>
>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>> =0? (start from -1).
>>>>>
>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>> which makes that code easier.
>>>>
>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>> unmap the whole folio atomically.
>>>
>>>
>>>
>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>> it is partially
>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>> DoubleMapped.
>>
>> There are 2 problems with your proposal, as I see it;
>>
>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>> the CONT_PTE bit.
>>
>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>> unless/until ALL of those blocks are set up. And then of course each block could
>> be unmapped unatomically.
>>
>> For the PMD case there are actually 2 properties that allow using the
>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>> *entire* map or unmap. That is not true when we are PTE mapping.
>
> well. Thanks for clarification. based on the above description, i agree the
> current code might make more sense by always using mapcount in subpage.
>
> I gave my proposals as I thought we were always CONTPTE size for small-THP
> then we could drop the loop to iterate 16 times rmap. if we do it
> entirely, we only
> need to do dup rmap once for all 16 PTEs by increasing entire_map.

Well, it's always good to have the discussion - so thanks for the ideas. I think
there is a bigger question lurking here: should we be exposing the concept of
contpte mappings to the core-mm rather than burying it in the arm64 arch code?
I'm confident that would be a huge amount of effort and the end result would be
similar performance to what this approach gives. One potential benefit of letting
core-mm control it is that it would also give control to core-mm over the
granularity of access/dirty reporting (my approach implicitly ties it to the
folio). Having sub-folio access tracking _could_ potentially help with future
work to make THP size selection automatic, but we are not there yet, and I think
there are other (simpler) ways to achieve the same thing. So my view is that
_not_ exposing it to core-mm is the right way for now.

>
> BTW, I have concerns that a variable small-THP size will really work
> as userspace
> is probably friendly to only one fixed size. for example, userspace
> heap management
> might be optimized to a size for freeing memory to the kernel. it is
> very difficult
> for the heap to adapt to various sizes at the same time. frequent unmap/free
> size not equal with, and particularly smaller than small-THP size will
> defeat all
> efforts to use small-THP.

I'll admit to not knowing a huge amount about user space allocators. But I will
say that, as currently defined, the small-sized THP interface to user space
allows a sysadmin to specifically enable the set of sizes that they want; so a
single size can be enabled. I'm deliberately punting that decision away from the
kernel for now.

FWIW, my experience with the Speedometer/JavaScript use case is that performance
is a little bit better when enabling 64+32+16K vs just 64K THP.

Functionally, it will not matter if the allocator is not enlightened for the THP
size; it can continue to free, and if a partial folio is unmapped it is put on
the deferred split list, then under memory pressure it is split and the unused
pages are reclaimed. I guess this is the bit you are concerned about having a
performance impact?

Regardless, it would be good to move this conversation to the small-sized THP
patch series since this is all independent of contpte mappings.

>
>>
>>>
>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>> still atomic in a
>>> spinlock area.
>>>
>>>>
>>>>>>
>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>>
>>>>>> David / dhildenb
>>>>>
>>>
>
> Thanks
> Barry

2023-11-27 11:11:56

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 27/11/2023 10:35, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>> mm: Batch-copy PTE ranges during fork()
>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means it's safe to only track access/dirty for the contpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>
> I don't think it is safe to keep CONTPTE folded in a split_vma case. as
> otherwise, copy_ptes in your other patch might only copy a part
> of CONTPES.
> For example, if page0-page4 and page5-page15 are splitted in split_vma,
> in fork, while copying pte for the first VMA, we are copying page0-page4,
> this will immediately cause inconsistent CONTPTE. as we have to
> make sure all CONTPTEs are atomically mapped in a PTL.

No, that's not how it works. The CONT_PTE bit is not blindly copied from parent
to child; it is explicitly managed by the arch code and set when appropriate. In
the case above, we will end up calling set_ptes() for page0-page4 in the child.
set_ptes() will notice that there are only 5 contiguous pages, so it will map
them without the CONT_PTE bit.
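
To make that concrete, the kind of check involved is roughly as follows (a
simplified sketch, not the actual code from the series; the helper name is made
up, and the pfn-contiguity and same-folio checks across the block are omitted):

	/*
	 * Sketch: a run of ptes written by set_ptes() is only a candidate for
	 * folding if it covers exactly one naturally aligned block of
	 * CONT_PTES entries, the first pfn is suitably aligned, and the
	 * entries are valid.
	 */
	static bool contpte_block_is_foldable(unsigned long addr, pte_t pte,
					      unsigned int nr)
	{
		if (nr != CONT_PTES)
			return false;
		if (!IS_ALIGNED(addr, CONT_PTES * PAGE_SIZE))
			return false;
		if (!IS_ALIGNED(pte_pfn(pte), CONT_PTES))
			return false;

		return pte_valid(pte);
	}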

>
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>
> Thanks
> Barry

2023-11-27 20:35:36

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 10:28, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 09:59, Barry Song wrote:
> >>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>> + if (anon) {
> >>>>>>> + /*
> >>>>>>> + * If this page may have been pinned by the
> >>>>>>> + * parent process, copy the page immediately for
> >>>>>>> + * the child so that we'll always guarantee the
> >>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>> + * future.
> >>>>>>> + */
> >>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>> + page, false, src_vma))) {
> >>>>>>> + if (i != 0)
> >>>>>>> + break;
> >>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>> + return copy_present_page(
> >>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>> + page);
> >>>>>>> + }
> >>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>> + } else {
> >>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>> + }
> >>>>>>> }
> >>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>> - } else if (page) {
> >>>>>>> - folio_get(folio);
> >>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>> +
> >>>>>>> + nr = i;
> >>>>>>> + folio_ref_add(folio, nr);
> >>>>>>
> >>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>> Make sure your refcount >= mapcount.
> >>>>>>
> >>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>> pages are the corner case.
> >>>>>>
> >>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>
> >>>>>
> >>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>> in each basepage.
> >>>>
> >>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>> atomic, so we can account the entire thing.
> >>>
> >>> Hi Ryan,
> >>>
> >>> There is no problem. for example, a large folio is entirely mapped in
> >>> process A with CONPTE,
> >>> and only page2 is mapped in process B.
> >>> then we will have
> >>>
> >>> entire_map = 0
> >>> page0.map = -1
> >>> page1.map = -1
> >>> page2.map = 0
> >>> page3.map = -1
> >>> ....
> >>>
> >>>>
> >>>>>
> >>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>> split.
> >>>>>
> >>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>> similar things on a part of the large folio in process A,
> >>>>>
> >>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>
> >>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>> =0? (start from -1).
> >>>>>
> >>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>> which makes that code easier.
> >>>>
> >>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>> unmap the whole folio atomically.
> >>>
> >>>
> >>>
> >>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>> it is partially
> >>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>> DoubleMapped.
> >>
> >> There are 2 problems with your proposal, as I see it;
> >>
> >> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >> the CONT_PTE bit.
> >>
> >> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >> unless/until ALL of those blocks are set up. And then of course each block could
> >> be unmapped unatomically.
> >>
> >> For the PMD case there are actually 2 properties that allow using the
> >> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >> *entire* map or unmap. That is not true when we are PTE mapping.
> >
> > well. Thanks for clarification. based on the above description, i agree the
> > current code might make more sense by always using mapcount in subpage.
> >
> > I gave my proposals as I thought we were always CONTPTE size for small-THP
> > then we could drop the loop to iterate 16 times rmap. if we do it
> > entirely, we only
> > need to do dup rmap once for all 16 PTEs by increasing entire_map.
>
> Well its always good to have the discussion - so thanks for the ideas. I think
> there is a bigger question lurking here; should we be exposing the concept of
> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> I'm confident that would be a huge amount of effort and the end result would be
> similar performace to what this approach gives. One potential benefit of letting
> core-mm control it is that it would also give control to core-mm over the
> granularity of access/dirty reporting (my approach implicitly ties it to the
> folio). Having sub-folio access tracking _could_ potentially help with future
> work to make THP size selection automatic, but we are not there yet, and I think
> there are other (simpler) ways to achieve the same thing. So my view is that
> _not_ exposing it to core-mm is the right way for now.

Hi Ryan,

We (OPPO) started a similar project to yours even before folios were merged into
mainline. We have deployed "dynamic hugepage" (that is how we name it) on
millions of mobile phones, on real products and on kernels before 5.16, with
huge success in performance improvement. For example, you may find the
out-of-tree 5.15 source code here:

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

Our modification might not be so clean and has lots of workarounds added just
for the stability of the products.

We mainly have

1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c

some CONTPTE helpers

2. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h

some Dynamic Hugepage APIs

3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c

modified all page faults to support
(1). allocation of hugepage of 64KB in do_anon_page
(2). CoW hugepage in do_wp_page
(3). copy CONTPTEs in copy_pte_range
(4). allocate and swap-in Hugepage as a whole in do_swap_page

4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c

reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.

So we are 100% interested in your patchset and hope it can find a way to land in
mainline, removing the cost of maintaining out-of-tree code from one kernel
version to the next, which is what we have done for a couple of kernel versions
before 5.16. We are firmly 100% supportive of the large anon folios work you are
leading.

A big pain point was that we found lots of races, especially around CONTPTE
unfolding, and especially when a subset of basepages ran away from the group of
16 CONTPTEs, since userspace always works on basepages and has no idea of
small-THP. We ran our code on millions of real phones, and we have now got those
issues fixed (or maybe "can't reproduce"); there is no outstanding issue.

Particularly for the rmap issue we are discussing, our out-of-tree code uses
entire_map for CONTPTE in the way I described to you. But I guess we can learn
from you and decouple CONTPTE from core-mm.

We are doing this in mm/memory.c

copy_present_cont_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
		pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
		struct page **prealloc)
{
	struct mm_struct *src_mm = src_vma->vm_mm;
	unsigned long vm_flags = src_vma->vm_flags;
	pte_t pte = *src_pte;
	struct page *page;

	page = vm_normal_page(src_vma, addr, pte);
	...

	get_page(page);
	page_dup_rmap(page, true); /* an entire dup_rmap, as you can see */
	rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
}

and we have a split in mm/cont_pte_hugepage.c to handle partial unmap:

static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
		unsigned long haddr, bool freeze)
{
	...
	if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
		for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
			atomic_inc(&head[i]._mapcount);
		atomic_long_inc(&cont_pte_double_map_count);
	}

	if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
		...
	}

I am not selling our solution any more, just showing you some of the differences
we have :-)

>
> >
> > BTW, I have concerns that a variable small-THP size will really work
> > as userspace
> > is probably friendly to only one fixed size. for example, userspace
> > heap management
> > might be optimized to a size for freeing memory to the kernel. it is
> > very difficult
> > for the heap to adapt to various sizes at the same time. frequent unmap/free
> > size not equal with, and particularly smaller than small-THP size will
> > defeat all
> > efforts to use small-THP.
>
> I'll admit to not knowing a huge amount about user space allocators. But I will
> say that as currently defined, the small-sized THP interface to user space
> allows a sysadmin to specifically enable the set of sizes that they want; so a
> single size can be enabled. I'm diliberately punting that decision away from the
> kernel for now.

Basically, a userspace heap library has a PAGESIZE setting and allows users to
allocate/free all kinds of small objects such as 16, 32, 64, 128, 256, 512 bytes
etc. The default size is of course equal to the basepage size. Once some objects
are freed by free() and libc gets back a free "page", the heap library might
release that PAGESIZE page to the kernel via something like MADV_DONTNEED, which
then goes through zap_pte_range(). It is quite similar to a kernel slab.

So imagine we have small-THP now, but the userspace libraries have *NO* idea of
it at all; this can frequently cause unfolding.
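
For example, something like the following (a user-space sketch only; the sizes
and the use of aligned_alloc() are arbitrary) is effectively what the heap ends
up doing, and that single 4KB MADV_DONTNEED lands in the middle of what could be
a folded 64KB block:

	#define _DEFAULT_SOURCE
	#include <stdlib.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 64 * 1024;			/* one small-THP sized region */
		char *buf = aligned_alloc(len, len);	/* 64KB, 64KB aligned */

		if (!buf)
			return 1;

		/* touch everything so the kernel can back it with one 64KB folio */
		for (size_t i = 0; i < len; i += 4096)
			buf[i] = 1;

		/*
		 * the heap frees one 4KB "page" back to the kernel; zapping a
		 * single pte in the middle of the block forces the arch code
		 * to unfold the whole contpte block first
		 */
		madvise(buf + 4 * 4096, 4096, MADV_DONTNEED);

		free(buf);
		return 0;
	}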

>
> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> is a little bit better when enabling 64+32+16K vs just 64K THP.
>
> Functionally, it will not matter if the allocator is not enlightened for the THP
> size; it can continue to free, and if a partial folio is unmapped it is put on
> the deferred split list, then under memory pressure it is split and the unused
> pages are reclaimed. I guess this is the bit you are concerned about having a
> performance impact?

Right. If this happens on the majority of small-THP folios, we get no
performance improvement, and probably a regression instead. This is really true
on real workloads!!

So that is why we would really love a per-VMA hint to enable small-THP, but
obviously you have already supported that now with
mm: thp: Introduce per-size thp sysfs interface
https://lore.kernel.org/linux-mm/[email protected]/

We can use MADVISE rather than ALWAYS and set a fixed size like 64KB, so
userspace can set the VMA flag when it is quite sure this VMA is working with
64KB alignment?
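
i.e. something like this from the allocator side (a sketch only; I'm assuming
the per-size knob from your series, e.g.
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled, is set to "madvise"
- the exact path may differ in the final version):

	#include <sys/mman.h>

	/* sketch: opt an anonymous arena in for small-THP under the "madvise" policy */
	static void *alloc_small_thp_arena(size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p != MAP_FAILED)
			madvise(p, len, MADV_HUGEPAGE);

		return p;
	}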

>
> Regardless, it would be good to move this conversation to the small-sized THP
> patch series since this is all independent of contpte mappings.
>
> >
> >>
> >>>
> >>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> >>> still atomic in a
> >>> spinlock area.
> >>>
> >>>>
> >>>>>>
> >>>>>> But that can be added on top, and I'll happily do that.
> >>>>>>
> >>>>>> --
> >>>>>> Cheers,
> >>>>>>
> >>>>>> David / dhildenb
> >>>>>
> >>>
> >

Thanks
Barry

2023-11-27 22:54:19

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Tue, Nov 28, 2023 at 12:11 AM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 10:35, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 03:18, Barry Song wrote:
> >>>> Ryan Roberts (14):
> >>>> mm: Batch-copy PTE ranges during fork()
> >>>> arm64/mm: set_pte(): New layer to manage contig bit
> >>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>>> arm64/mm: pte_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get(): New layer to manage contig bit
> >>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>>> arm64/mm: Wire up PTE_CONT for user mappings
> >>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >>>
> >>> Hi Ryan,
> >>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> >>> in the below cases
> >>
> >> The general idea is that the core-mm sets the individual ptes (one at a time if
> >> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> >> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> >> (ptep_clear(), etc); This is exactly the same interface as previously.
> >>
> >> BUT, the arm64 implementation of those interfaces will now detect when a set of
> >> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> >> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> >> the block is "folded". And it will detect when the first PTE in the block
> >> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> >> requirements for folding a contpte block is that all the pages must belong to
> >> the *same* folio (that means it's safe to only track access/dirty for the contpte
> >> block as a whole rather than for each individual pte).
> >>
> >> (there are a couple of optimizations that make the reality slightly more
> >> complicated than what I've just explained, but you get the idea).
> >>
> >> On that basis, I believe all the specific cases you describe below are all
> >> covered and safe - please let me know if you think there is a hole here!
> >>
> >>>
> >>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
> >>
> >> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> >> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> >> is removed from the whole contpte block. If there is then a subsequent
> >> set_pte_at() to set a swap entry, the implementation will see that its not
> >> appropriate to re-fold, so the range will remain unfolded.
> >>
> >>>
> >>> 2. vma split in a large folio due to various reasons such as mprotect,
> >>> munmap, mlock etc.
> >>
> >> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> >> suspect not, so if the VMA is split in the middle of a currently folded contpte
> >> block, it will remain folded. But this is safe and continues to work correctly.
> >> The VMA arrangement is not important; it is just important that a single folio
> >> is mapped contiguously across the whole block.
> >
> > I don't think it is safe to keep CONTPTE folded in a split_vma case. as
> > otherwise, copy_ptes in your other patch might only copy a part
> > of CONTPES.
> > For example, if page0-page4 and page5-page15 are splitted in split_vma,
> > in fork, while copying pte for the first VMA, we are copying page0-page4,
> > this will immediately cause inconsistent CONTPTE. as we have to
> > make sure all CONTPTEs are atomically mapped in a PTL.
>
> No that's not how it works. The CONT_PTE bit is not blindly copied from parent
> to child. It is explicitly managed by the arch code and set when appropriate. In
> the case above, we will end up calling set_ptes() for page0-page4 in the child.
> set_ptes() will notice that there are only 5 contiguous pages so it will map
> without the CONT_PTE bit.

OK, cool. Alternatively, in the code I shared with you, we do an unfold
immediately when split_vma happens within a large anon folio, so we disallow a
CONTPTE block from crossing two VMAs to avoid all kinds of complexity afterwards.

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/huge_memory.c

#ifdef CONFIG_CONT_PTE_HUGEPAGE
void vma_adjust_cont_pte_trans_huge(struct vm_area_struct *vma,
				    unsigned long start,
				    unsigned long end,
				    long adjust_next)
{
	/*
	 * If the new start address isn't hpage aligned and it could
	 * previously contain an hugepage: check if we need to split
	 * an huge pmd.
	 */
	if (start & ~HPAGE_CONT_PTE_MASK &&
	    (start & HPAGE_CONT_PTE_MASK) >= vma->vm_start &&
	    (start & HPAGE_CONT_PTE_MASK) + HPAGE_CONT_PTE_SIZE <= vma->vm_end)
		split_huge_cont_pte_address(vma, start, false, NULL);

	....
}
#endif

In your approach you are still keeping a CONTPTE block that crosses two VMAs,
but it seems OK. I can't think of a case that might fail right now; only running
the code on a large amount of real hardware will tell :-)

>
> >
> >>
> >>>
> >>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> >>> rather than being as a whole.
> >>
> >> Yes, as per 1; the arm64 implementation will notice when the first entry is
> >> cleared and unfold the contpte block.
> >>
> >>>
> >>> In hardware, we need to make sure CONTPTE follow the rule - always 16
> >>> contiguous physical address with CONTPTE set. if one of them run away
> >>> from the 16 ptes group and PTEs become unconsistent, some terrible
> >>> errors/faults can happen in HW. for example
> >>
> >> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> >> contpte_try_unfold(). the fold/unfold operation is only done when all
> >> requirements are met, and we perform it in a manner that is conformant to the
> >> architecture requirements (see contpte_fold() - being renamed to
> >> contpte_convert() in the next version).
> >>
> >> Thanks for the review!
> >>
> >> Thanks,
> >> Ryan
> >>
> >>>
> >>> case0:
> >>> addr0 PTE - has no CONTPE
> >>> addr0+4kb PTE - has CONTPTE
> >>> ....
> >>> addr0+60kb PTE - has CONTPTE
> >>>
> >>> case 1:
> >>> addr0 PTE - has no CONTPE
> >>> addr0+4kb PTE - has CONTPTE
> >>> ....
> >>> addr0+60kb PTE - has swap
> >>>
> >>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
> >>> our observation.
> >>>
> >

Thanks
Barry

2023-11-28 00:11:51

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 05:54, Barry Song wrote:
> >> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >> + pte_t *dst_pte, pte_t *src_pte,
> >> + unsigned long addr, unsigned long end,
> >> + int *rss, struct folio **prealloc)
> >> {
> >> struct mm_struct *src_mm = src_vma->vm_mm;
> >> unsigned long vm_flags = src_vma->vm_flags;
> >> pte_t pte = ptep_get(src_pte);
> >> struct page *page;
> >> struct folio *folio;
> >> + int nr = 1;
> >> + bool anon;
> >> + bool any_dirty = pte_dirty(pte);
> >> + int i;
> >>
> >> page = vm_normal_page(src_vma, addr, pte);
> >> - if (page)
> >> + if (page) {
> >> folio = page_folio(page);
> >> - if (page && folio_test_anon(folio)) {
> >> - /*
> >> - * If this page may have been pinned by the parent process,
> >> - * copy the page immediately for the child so that we'll always
> >> - * guarantee the pinned page won't be randomly replaced in the
> >> - * future.
> >> - */
> >> - folio_get(folio);
> >> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >> - /* Page may be pinned, we have to copy. */
> >> - folio_put(folio);
> >> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >> - addr, rss, prealloc, page);
> >> + anon = folio_test_anon(folio);
> >> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >> + end, pte, &any_dirty);
> >
> > in case we have a large folio with 16 CONTPTE basepages, and userspace
> > do madvise(addr + 4KB * 5, DONTNEED);
>
> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> page0~page4 and page6~15?
>
> >
> > thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> > will return 15. in this case, we should copy page0~page3 and page5~page15.
>
> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> not how its intended to work. The function is scanning forwards from the current
> pte until it finds the first pte that does not fit in the batch - either because
> it maps a PFN that is not contiguous, or because the permissions are different
> (although this is being relaxed a bit; see conversation with DavidH against this
> same patch).
>
> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> (page0~page4) then the next time through the loop we will go through the
> !present path and process the single swap marker. Then the 3rd time through the
> loop folio_nr_pages_cont_mapped() will return 10.

One case we have met by running hundreds of real phones is as below:


static int
copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
	       unsigned long end)
{
	...
	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
	if (!dst_pte) {
		ret = -ENOMEM;
		goto out;
	}
	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
	if (!src_pte) {
		pte_unmap_unlock(dst_pte, dst_ptl);
		/* ret == 0 */
		goto out;
	}
	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
	orig_src_pte = src_pte;
	orig_dst_pte = dst_pte;
	arch_enter_lazy_mmu_mode();

	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}
		ptent = ptep_get(src_pte);
		if (pte_none(ptent)) {
			progress++;
			continue;
		}

The above iteration can break out when progress >= 32. For example, at the
beginning, if all PTEs are none, we break once progress >= 32, and we might be
at the 8th pte of the 16 PTEs which could become CONTPTE after we release the
PTL.

Since we are releasing the PTLs, the next time we take the PTL those pte_none()
entries might have become pte_cont(). Are you then going to copy CONTPTEs
starting from the 8th pte, immediately breaking the hardware rule that the 16
CONTPTEs must be consistent?

pte0 - pte_none
pte1 - pte_none
...
pte7 - pte_none

pte8 - pte_cont
...
pte15 - pte_cont

So we made a modification to avoid breaking in the middle of PTEs which can
potentially become CONTPTE:

	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
			/*
			 * XXX: don't release ptl at an unaligned address as
			 * cont_pte might form while ptl is released; this
			 * causes double-map
			 */
			if (!vma_is_chp_anonymous(src_vma) ||
			    (vma_is_chp_anonymous(src_vma) &&
			     IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}

We could only reproduce the above issue by running on thousands of phones.

Does your code survive this problem?

>
> Thanks,
> Ryan
>
> >
> > but the current code is copying page0~page14, right? unless we are immediatly
> > split_folio to basepages in zap_pte_range(), we will have problems?
> >
> >> +
> >> + for (i = 0; i < nr; i++, page++) {
> >> + if (anon) {
> >> + /*
> >> + * If this page may have been pinned by the
> >> + * parent process, copy the page immediately for
> >> + * the child so that we'll always guarantee the
> >> + * pinned page won't be randomly replaced in the
> >> + * future.
> >> + */
> >> + if (unlikely(page_try_dup_anon_rmap(
> >> + page, false, src_vma))) {
> >> + if (i != 0)
> >> + break;
> >> + /* Page may be pinned, we have to copy. */
> >> + return copy_present_page(
> >> + dst_vma, src_vma, dst_pte,
> >> + src_pte, addr, rss, prealloc,
> >> + page);
> >> + }
> >> + rss[MM_ANONPAGES]++;
> >> + VM_BUG_ON(PageAnonExclusive(page));
> >> + } else {
> >> + page_dup_file_rmap(page, false);
> >> + rss[mm_counter_file(page)]++;
> >> + }
> >

Thanks
Barry

2023-11-28 03:14:21

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Mon, Nov 27, 2023 at 1:15 AM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >> mm: Batch-copy PTE ranges during fork()
> >> arm64/mm: set_pte(): New layer to manage contig bit
> >> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >> arm64/mm: pte_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >> arm64/mm: ptep_get(): New layer to manage contig bit
> >> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >> arm64/mm: Wire up PTE_CONT for user mappings
> >> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means it's safe to only track access/dirty for the contpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.

Even with different permissions, for example read-only vs read-write?
mprotect() may change the permissions; that should be misprogramming
per the ARM ARM.

>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >
> > Thanks
> > Barry
> >
> >
>
>

2023-11-28 05:49:45

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >> mm: Batch-copy PTE ranges during fork()
> >> arm64/mm: set_pte(): New layer to manage contig bit
> >> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >> arm64/mm: pte_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >> arm64/mm: ptep_get(): New layer to manage contig bit
> >> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >> arm64/mm: Wire up PTE_CONT for user mappings
> >> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means it's safe to only track access/dirty for the contpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.
>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).

Hi Ryan,

Sorry for so many comments; I remembered another case:

4. mremap

A CONTPTE block might be remapped to another address which might not be aligned
to 16*basepage. Thus, in move_ptes(), we are copying CONTPTEs from src to dst.
static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
		unsigned long old_addr, unsigned long old_end,
		struct vm_area_struct *new_vma, pmd_t *new_pmd,
		unsigned long new_addr, bool need_rmap_locks)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t *old_pte, *new_pte, pte;
	...

	/*
	 * We don't have to worry about the ordering of src and dst
	 * pte locks because exclusive mmap_lock prevents deadlock.
	 */
	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
	if (!old_pte) {
		err = -EAGAIN;
		goto out;
	}
	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
	if (!new_pte) {
		pte_unmap_unlock(old_pte, old_ptl);
		err = -EAGAIN;
		goto out;
	}
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
	flush_tlb_batched_pending(vma->vm_mm);
	arch_enter_lazy_mmu_mode();

	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(ptep_get(old_pte)))
			continue;

		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		....
	}

There are two possibilities:
1. new_pte is aligned with CONT_PTES, so we can still keep CONTPTE;
2. new_pte is not aligned with CONT_PTES, so we should drop CONTPTE
while copying.

Does your code also handle this properly?
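
To make case 2 concrete, here is a rough sketch of the kind of check I mean;
the helper is made up purely for illustration and is not from your series,
which I understand does this inside its own fold/unfold helpers:

/*
 * Illustration only - a made-up helper, not code from the series. A dst
 * pte can only keep the CONT bit if its virtual address lands at the
 * same offset within a 16-entry block as its pfn does within a 16-page
 * block (and the rest of the block is moved along with it).
 */
static inline bool dst_can_keep_cont(unsigned long new_addr, pte_t pte)
{
        unsigned long va_off = (new_addr >> PAGE_SHIFT) & (CONT_PTES - 1);
        unsigned long pa_off = pte_pfn(pte) & (CONT_PTES - 1);

        /* case 1 above: offsets match; case 2: they don't, so drop CONT */
        return va_off == pa_off;
}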

>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >
> > Thanks
> > Barry

Thanks
Barry

2023-11-28 06:54:49

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 27/11/2023 07:34, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>>
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>>
>>>>>> Ryan Roberts <[email protected]> writes:
>>>>>>
>>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>>> a full address space teardown is in progress. We use this information to
>>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>>> are about to clear the rest too.
>>>>>>>
>>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>>> original cost.
>>>>>>>
>>>>>>> This approach is not perfect though, as for the duration between
>>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>>> other calls we might resonably expect are to set markers in the
>>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>>> the 'full' case anyway.
>>>>>>>
>>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>>> doing that here, because we know that the information is used by the
>>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>>> for folding in the first place), that information will make it to the
>>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>>> means we don't have to play games with accumulating and storing the
>>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>>> happened to store it. The core code does not rely on this though.
>>>>>>
>>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>>> potentially something that could change in the future though which would
>>>>>> lead to some subtle bugs.
>>>>>
>>>>> Yes, there is a risk, although IMHO, its very small.
>>>>>
>>>>>>
>>>>>> Would there be any may of avoiding this? Half baked thought but could
>>>>>> you for example copy the access/dirty information to the last (or
>>>>>> perhaps first, most likely invalid) PTE?
>>>>>
>>>>> I spent a long time thinking about this and came up with a number of
>>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>>> (which works but suffers from the problem that it depends on the way it is
>>>>> called not changing).
>>>>
>>>> Ok, that answers my underlying question of "has someone thought about
>>>> this and are there any easy solutions". I suspected that was the case
>>>> given the excellent write up though!
>>>>
>>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>>>> could be implemented correctly but its extremely complex.
>>>>>
>>>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>>>
>>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>>> is no easy place to store that state (other than scanning a contpte block
>>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>>
>>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>>
>>>>> What do you think? If we really need to solve this, then option 1 is my
>>>>> preferred route, but it would take some time to figure out and reason about a
>>>>> race-free scheme.
>>>>
>>>> Well I like simple, and I agree the risk is small. But I can't help feel
>>>> the current situation is too subtle, mainly because it is architecture
>>>> specific and the assumptions are not communicated in core-mm code
>>>> anywhere. But also none of the aternatives seem much better.
>>>>
>>>> However there are only three callers of ptep_get_and_clear_full(), and
>>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>>> just about all users of ptep_get*() which will take the ptl before hand.
>>>
>>> The problem isn't racing threads because as you say, the PTL is already
>>> serializing all calls except ptep_get_lockless(). And although there are 3
>>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>>> calls it with full=1, as I recall.
>>>
>>> The problem is that the caller in zap_pte_range() does this:
>>>
>>> ptl = lock_page_table()
>>> for each pte {
>>>         ptent = ptep_get(pte);
>>>         if (pte_present(ptent)) {
>>>                 ptent = ptep_get_and_clear_full(ptent);
>>>                 if (pte_dirty(ptent))
>>>                         ...
>>>                 if (pte_young(ptent))
>>>                         ...
>>>         }
>>> }
>>> unlock_page_table(ptl)
>>>
>>> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
>>> under the ptl. So if the loop is iterating over a contpte block and the HW
>>> happens to be storing the access/dirty info in the first pte entry, then the
>>> first time through the loop, ptep_get() will return the correct access/dirty
>>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>>
>> So if ptep_get_lockless() isn't a concern what made the option posted in
>> v1 racy (your option 1 above)? Is there something else reading PTEs or
>> clearing PTE bits without holding the PTL that I'm missing?
>
> The HW could be racing to set access and dirty bits. Well actually, I'm not
> completely sure if that's the case here; if full=1 then presumably no other
> threads in the process should be running at this point, so perhaps it can be
> guarranteed that nothing is causing a concurrent memory access and the HW is
> therefore definitely not going to try to write the access/dirty bits
> concurrently. But I didn't manage to convince myself that's definitely the case.

I suppose it's possible something attached to an SMMU or some such could
still be running and causing accesses, so I agree it probably can't be
guaranteed (although it would be an odd corner case).

> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
> buggy because it iterated through the PTEs, getting and accumulating. Then
> iterated again, writing that final set of bits to all the PTEs. And the HW could
> have modified the bits during those loops. I think it would be possible to fix
> the race, but intuition says it would be expensive.

So the issue, as I understand it, is that subsequent iterations would see a
clean PTE after the first iteration returned a dirty PTE. In
ptep_get_and_clear_full(), why couldn't you just copy the dirty/accessed
bit (if set) from the PTE being cleared to an adjacent PTE rather than to
all the PTEs?

That would fix the inconsistency as far as making subsequent iterations of
ptep_get_and_clear_full() return dirty/accessed if a previous iteration
did. Obviously HW could still race and cause a previously clean iteration
to return dirty, but that seems ok.
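
Something along these lines is what I had in mind - a rough sketch only, with
made-up names (not from the series), to show the idea rather than propose a
concrete implementation:

/*
 * Sketch only: when clearing one pte of a contpte block, forward its
 * access/dirty bits to a neighbouring pte that is still valid, so that
 * later iterations keep reporting them. Names are made up.
 */
static pte_t clear_and_forward_ad(struct mm_struct *mm, unsigned long addr,
                                  pte_t *ptep, pte_t *neighbour)
{
        pte_t orig = __ptep_get_and_clear(mm, addr, ptep);
        pte_t n = __ptep_get(neighbour);

        if (pte_valid(n)) {
                if (pte_dirty(orig))
                        n = pte_mkdirty(n);
                if (pte_young(orig))
                        n = pte_mkyoung(n);
                set_pte(neighbour, n);
        }
        return orig;
}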

However all this has just left me with more questions :-)

Won't clearing bits like this result in inconsistent programming of the
PTE_CONT bit? What happens if HW accesses a page in the contiguous region
while some of the PTEs are invalid? And the same question for programming
them, really - I don't think we can atomically set PTE_CONT on multiple
PTEs all at once, so if we assume something can be accessing them
concurrently, how do we do that without some HW observing an intermediate
state where PTE_CONT is misprogrammed?

Thanks.

>>
>>>>
>>>> So really that only leaves ptep_get_lockless() that could/should
>>>> interleave right?
>>>
>>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>>> is very careful to ensure that the contpte block is in a consistent state and it
>>> keeps trying until it is. So this will always return the correct consistent
>>> information.
>>>
>>>> From a quick glance of those users none look at the
>>>> young/dirty information anyway, so I wonder if we can just assert in the
>>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>>> and clear it in the helpers? That would make things explicit and
>>>> consistent which would address my concern (although I haven't looked too
>>>> closely at the details there).
>>>
>>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>>> don't think this helps.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>
>>

2023-11-28 07:33:46

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, int full)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_valid_cont(orig_pte) || !full) {
> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
> +		return __ptep_get_and_clear(mm, addr, ptep);
> +	} else
> +		return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +

Hi Ryan,

I find the code quite hard to understand. When !pte_valid_cont(orig_pte),
we will still call contpte_try_unfold(mm, addr, ptep, orig_pte);

but in contpte_try_unfold(), we only unfold if pte_valid_cont()
is true:
static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
                                        pte_t *ptep, pte_t pte)
{
        if (contpte_is_enabled(mm) && pte_valid_cont(pte))
                __contpte_try_unfold(mm, addr, ptep, pte);
}

So do you mean the below?

if (!pte_valid_cont(orig_pte))
        return __ptep_get_and_clear(mm, addr, ptep);

if (!full) {
        contpte_try_unfold(mm, addr, ptep, orig_pte);
        return __ptep_get_and_clear(mm, addr, ptep);
} else {
        return contpte_ptep_get_and_clear_full(mm, addr, ptep);
}

Thanks
Barry


2023-11-28 08:19:15

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * When doing a full address space teardown, we can avoid unfolding the
> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> + * just get and clear the pte. The caller is promising to call us for
> + * every pte, so every pte in the range will be cleared by the time the
> + * tlbi is issued.
> + *
> + * This approach is not perfect though, as for the duration between
> + * returning from the first call to ptep_get_and_clear_full() and making
> + * the final call, the contpte block in an intermediate state, where
> + * some ptes are cleared and others are still set with the PTE_CONT bit.
> + * If any other APIs are called for the ptes in the contpte block during
> + * that time, we have to be very careful. The core code currently
> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> + * ptep_get() must be careful to ignore the cleared entries when
> + * accumulating the access and dirty bits - the same goes for
> + * ptep_get_lockless(). The only other calls we might resonably expect
> + * are to set markers in the previously cleared ptes. (We shouldn't see
> + * valid entries being set until after the tlbi, at which point we are
> + * no longer in the intermediate state). Since markers are not valid,
> + * this is safe; set_ptes() will see the old, invalid entry and will not
> + * attempt to unfold. And the new pte is also invalid so it won't
> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
> + *
> + * The last remaining issue is returning the access/dirty bits. That
> + * info could be present in any of the ptes in the contpte block.
> + * ptep_get() will gather those bits from across the contpte block. We
> + * don't bother doing that here, because we know that the information is
> + * used by the core-mm to mark the underlying folio as accessed/dirty.
> + * And since the same folio must be underpinning the whole block (that
> + * was a requirement for folding in the first place), that information
> + * will make it to the folio eventually once all the ptes have been
> + * cleared. This approach means we don't have to play games with
> + * accumulating and storing the bits. It does mean that any interleaved
> + * calls to ptep_get() may lack correct access/dirty information if we
> + * have already cleared the pte that happened to store it. The core code
> + * does not rely on this though.

Even without any other threads running and touching those PTEs, this won't survive
on some hardware. We expose inconsistent CONTPTEs to hardware, and this can result
in crashed firmware, even in TrustZone; we have seen strange and unknown faults
reported to TrustZone on Qualcomm, though on MTK it seems fine. When you do a tlbi
on part of the PTEs with CONT dropped while some other PTEs still have CONT set,
we leave the hardware totally confused.

zap_pte_range() has a force_flush when the tlb batch is full:

if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
        force_flush = 1;
        addr += PAGE_SIZE;
        break;
}

This means a partial tlbi/flush can be exposed directly to hardware while some
other PTEs are still CONT.

On the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
on fullmm; as long as the zap range covers a large folio, we can do the tlbi for
all of those CONTPTEs together in your contpte_ptep_get_and_clear_full() rather
than clearing one PTE at a time.

Our approach in [1] is to do a flush for all CONTPTEs and go directly to the end
of the large folio:

#ifdef CONFIG_CONT_PTE_HUGEPAGE
        if (pte_cont(ptent)) {
                unsigned long next = pte_cont_addr_end(addr, end);

                if (next - addr != HPAGE_CONT_PTE_SIZE) {
                        __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
                        /*
                         * After splitting cont-pte
                         * we need to process pte again.
                         */
                        goto again_pte;
                } else {
                        cont_pte_huge_ptep_get_and_clear(mm, addr, pte);

                        tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
                        if (unlikely(!page))
                                continue;

                        if (is_huge_zero_page(page)) {
                                tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
                                goto cont_next;
                        }

                        rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
                        page_remove_rmap(page, true);
                        if (unlikely(page_mapcount(page) < 0))
                                print_bad_pte(vma, addr, ptent, page);

                        tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
                }
cont_next:
                /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
                pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
                addr = next - PAGE_SIZE;
                continue;
        }
#endif

this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
it never requires tlb->fullmm at all.

static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
                                                             unsigned long addr,
                                                             pte_t *ptep,
                                                             bool flush)
{
        pte_t orig_pte = ptep_get(ptep);

        CHP_BUG_ON(!pte_cont(orig_pte));
        CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
        CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));

        return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539

> + */
> +
> + return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +

Thanks
Barry


2023-11-28 09:14:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 20:34, Barry Song wrote:
> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 10:28, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>>>> + if (anon) {
>>>>>>>>> + /*
>>>>>>>>> + * If this page may have been pinned by the
>>>>>>>>> + * parent process, copy the page immediately for
>>>>>>>>> + * the child so that we'll always guarantee the
>>>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>>>> + * future.
>>>>>>>>> + */
>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>> + page, false, src_vma))) {
>>>>>>>>> + if (i != 0)
>>>>>>>>> + break;
>>>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>>>> + return copy_present_page(
>>>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>>>> + page);
>>>>>>>>> + }
>>>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>> + } else {
>>>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>>>> + }
>>>>>>>>> }
>>>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>>>> - } else if (page) {
>>>>>>>>> - folio_get(folio);
>>>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>>>> +
>>>>>>>>> + nr = i;
>>>>>>>>> + folio_ref_add(folio, nr);
>>>>>>>>
>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>
>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>> pages are the corner case.
>>>>>>>>
>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>
>>>>>>>
>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>> in each basepage.
>>>>>>
>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>> atomic, so we can account the entire thing.
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>> process A with CONPTE,
>>>>> and only page2 is mapped in process B.
>>>>> then we will have
>>>>>
>>>>> entire_map = 0
>>>>> page0.map = -1
>>>>> page1.map = -1
>>>>> page2.map = 0
>>>>> page3.map = -1
>>>>> ....
>>>>>
>>>>>>
>>>>>>>
>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>> split.
>>>>>>>
>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>
>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>
>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>> =0? (start from -1).
>>>>>>>
>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>> which makes that code easier.
>>>>>>
>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>> unmap the whole folio atomically.
>>>>>
>>>>>
>>>>>
>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>> it is partially
>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>> DoubleMapped.
>>>>
>>>> There are 2 problems with your proposal, as I see it;
>>>>
>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>> the CONT_PTE bit.
>>>>
>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>> be unmapped unatomically.
>>>>
>>>> For the PMD case there are actually 2 properties that allow using the
>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>
>>> well. Thanks for clarification. based on the above description, i agree the
>>> current code might make more sense by always using mapcount in subpage.
>>>
>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>> entirely, we only
>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>
>> Well its always good to have the discussion - so thanks for the ideas. I think
>> there is a bigger question lurking here; should we be exposing the concept of
>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>> I'm confident that would be a huge amount of effort and the end result would be
>> similar performace to what this approach gives. One potential benefit of letting
>> core-mm control it is that it would also give control to core-mm over the
>> granularity of access/dirty reporting (my approach implicitly ties it to the
>> folio). Having sub-folio access tracking _could_ potentially help with future
>> work to make THP size selection automatic, but we are not there yet, and I think
>> there are other (simpler) ways to achieve the same thing. So my view is that
>> _not_ exposing it to core-mm is the right way for now.
>
> Hi Ryan,
>
> We(OPPO) started a similar project like you even before folio was imported to
> mainline, we have deployed the dynamic hugepage(that is how we name it)
> on millions of mobile phones on real products and kernels before 5.16, making
> a huge success on performance improvement. for example, you may
> find the out-of-tree 5.15 source code here

Oh wow, thanks for reaching out and explaining this - I have to admit I feel
embarrassed that I clearly didn't do enough research on the prior art because I
wasn't aware of your work. So sorry about that.

I sensed that you had a different model for how this should work vs what I've
implemented and now I understand why :). I'll review your stuff and I'm sure
I'll have questions. I'm sure each solution has pros and cons.


>
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
> Our modification might not be so clean and has lots of workarounds
> just for the stability of products
>
> We mainly have
>
> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>
> some CONTPTE helpers
>
> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>
> some Dynamic Hugepage APIs
>
> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>
> modified all page faults to support
> (1). allocation of hugepage of 64KB in do_anon_page

My Small-Sized THP patch set is handling the equivalent of this.

> (2). CoW hugepage in do_wp_page

This isn't handled yet in my patch set; the original RFC implemented it but I
removed it in order to strip back to the essential complexity for the initial
submission. DavidH has been working on a precise shared vs exclusive map
tracking mechanism - if that goes in, it will make CoWing large folios simpler.
Out of interest, what workloads benefit most from this?

> (3). copy CONPTEs in copy_pte_range

As discussed, this is done as part of the contpte patch set, but it's not just a
simple copy; the arch code will notice and set the CONT_PTE bit as needed.
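
Roughly, the shape is as below - a simplified sketch for illustration only, not
the literal patch code:

/*
 * Simplified sketch (illustration only, assumes nr >= 1): the exported
 * set_ptes() writes the entries as usual, then asks the contpte layer to
 * fold the surrounding block if it now qualifies (same folio, suitable
 * VA/PA alignment, compatible attributes).
 */
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                            pte_t *ptep, pte_t pte, unsigned int nr)
{
        for (;;) {
                __set_ptes(mm, addr, ptep, pte, 1);
                if (--nr == 0)
                        break;
                ptep++;
                addr += PAGE_SIZE;
                pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
        }
        /* ptep/addr/pte now refer to the last entry that was written */
        contpte_try_fold(mm, addr, ptep, pte);
}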

> (4). allocate and swap-in Hugepage as a whole in do_swap_page

This is going to be a problem but I haven't even looked at this properly yet.
The advice so far has been to continue to swap-in small pages only, but improve
khugepaged to collapse to small-sized THP. I'll take a look at your code to
understand how you did this.

>
> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>
> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.

I think this is all naturally handled by the folio code that exists in modern
kernels?

>
> So we are 100% interested in your patchset and hope it can find a way
> to land on the
> mainline, thus decreasing all the cost we have to maintain out-of-tree
> code from a
> kernel to another kernel version which we have done on a couple of
> kernel versions
> before 5.16. Firmly, we are 100% supportive of large anon folios
> things you are leading.

That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
it closer :). If you had any ability to do any A/B performance testing, it would
be very interesting to see how this stacks up against your solution - if there
are gaps it would be good to know where and develop a plan to plug the gap.

>
> A big pain was we found lots of races especially on CONTPTE unfolding
> and especially a part
> of basepages ran away from the 16 CONPTEs group since userspace is
> always working
> on basepages, having no idea of small-THP. We ran our code on millions of
> real phones, and now we have got them fixed (or maybe "can't reproduce"),
> no outstanding issue.

I'm going to be brave and say that my solution shouldn't suffer from these
problems; but of course the proof is only in the testing. I did a lot of work
with our architecture group and micro-architects to determine exactly what is
and isn't safe; we even tightened the Arm ARM spec very subtly to allow the
optimization in patch 13 (see the commit log for details). Of course this has
all been checked with partners and we are confident that all existing
implementations conform to the modified wording.

>
> Particularly for the rmap issue we are discussing, our out-of-tree is
> using the entire_map for
> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> CONTPTE from mm-core.
>
> We are doing this in mm/memory.c
>
> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> 		pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> 		struct page **prealloc)
> {
> 	struct mm_struct *src_mm = src_vma->vm_mm;
> 	unsigned long vm_flags = src_vma->vm_flags;
> 	pte_t pte = *src_pte;
> 	struct page *page;
>
> 	page = vm_normal_page(src_vma, addr, pte);
> 	...
>
> 	get_page(page);
> 	page_dup_rmap(page, true); // an entire dup_rmap as you can see.............
> 	rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> }
>
> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>
> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> 		unsigned long haddr, bool freeze)
> {
> 	...
> 	if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> 		for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> 			atomic_inc(&head[i]._mapcount);
> 		atomic_long_inc(&cont_pte_double_map_count);
> 	}
>
>
> 	if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> 		...
> 	}
>
> I am not selling our solution any more, but just showing you some differences we
> have :-)

OK, I understand what you were saying now. I'm currently struggling to see how
this could fit into my model. Do you have any workloads and numbers on perf
improvement of using entire_mapcount?

>
>>
>>>
>>> BTW, I have concerns that a variable small-THP size will really work
>>> as userspace
>>> is probably friendly to only one fixed size. for example, userspace
>>> heap management
>>> might be optimized to a size for freeing memory to the kernel. it is
>>> very difficult
>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>> size not equal with, and particularly smaller than small-THP size will
>>> defeat all
>>> efforts to use small-THP.
>>
>> I'll admit to not knowing a huge amount about user space allocators. But I will
>> say that as currently defined, the small-sized THP interface to user space
>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>> single size can be enabled. I'm diliberately punting that decision away from the
>> kernel for now.
>
> Basically, userspace heap library has a PAGESIZE setting and allows users
> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> The default size is for sure equal to the basepage SIZE. once some objects are
> freed by free() and libc get a free "page", userspace heap libraries might free
> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> it is quite similar with kernel slab.
>
> so imagine we have small-THP now, but userspace libraries have *NO*
> idea at all, so it can frequently cause unfolding.
>
>>
>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>
>> Functionally, it will not matter if the allocator is not enlightened for the THP
>> size; it can continue to free, and if a partial folio is unmapped it is put on
>> the deferred split list, then under memory pressure it is split and the unused
>> pages are reclaimed. I guess this is the bit you are concerned about having a
>> performance impact?
>
> right. If this is happening on the majority of small-THP folios, we
> don't have performance
> improvement, and probably regression instead. This is really true on
> real workloads!!
>
> So that is why we really love a per-VMA hint to enable small-THP but
> obviously you
> have already supported it now by
> mm: thp: Introduce per-size thp sysfs interface
> https://lore.kernel.org/linux-mm/[email protected]/
>
> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> can set the VMA flag when it is quite sure this VMA is working with
> the alignment
> of 64KB?

Yes, that all exists in the series today. We have also discussed the possibility
of adding a new madvise_process() call that would take the set of THP sizes that
should be considered. Then you can set different VMAs to use different sizes;
the plan was to layer that on top if/when a workload was identified. Sounds like
you might be able to help there?

>
>>
>> Regardless, it would be good to move this conversation to the small-sized THP
>> patch series since this is all independent of contpte mappings.
>>
>>>
>>>>
>>>>>
>>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>>>> still atomic in a
>>>>> spinlock area.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> David / dhildenb
>>>>>>>
>>>>>
>>>
>
> Thanks
> Barry

2023-11-28 09:50:30

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 20:34, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 10:28, Barry Song wrote:
> >>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>>>> + if (anon) {
> >>>>>>>>> + /*
> >>>>>>>>> + * If this page may have been pinned by the
> >>>>>>>>> + * parent process, copy the page immediately for
> >>>>>>>>> + * the child so that we'll always guarantee the
> >>>>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>>>> + * future.
> >>>>>>>>> + */
> >>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>> + page, false, src_vma))) {
> >>>>>>>>> + if (i != 0)
> >>>>>>>>> + break;
> >>>>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>>>> + return copy_present_page(
> >>>>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>>>> + page);
> >>>>>>>>> + }
> >>>>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>> + } else {
> >>>>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>>>> + }
> >>>>>>>>> }
> >>>>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>>>> - } else if (page) {
> >>>>>>>>> - folio_get(folio);
> >>>>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>>>> +
> >>>>>>>>> + nr = i;
> >>>>>>>>> + folio_ref_add(folio, nr);
> >>>>>>>>
> >>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>
> >>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>> pages are the corner case.
> >>>>>>>>
> >>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>
> >>>>>>>
> >>>>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>>>> in each basepage.
> >>>>>>
> >>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>> atomic, so we can account the entire thing.
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> There is no problem. for example, a large folio is entirely mapped in
> >>>>> process A with CONPTE,
> >>>>> and only page2 is mapped in process B.
> >>>>> then we will have
> >>>>>
> >>>>> entire_map = 0
> >>>>> page0.map = -1
> >>>>> page1.map = -1
> >>>>> page2.map = 0
> >>>>> page3.map = -1
> >>>>> ....
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>> split.
> >>>>>>>
> >>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>
> >>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>
> >>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>> =0? (start from -1).
> >>>>>>>
> >>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>> which makes that code easier.
> >>>>>>
> >>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>> unmap the whole folio atomically.
> >>>>>
> >>>>>
> >>>>>
> >>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>> it is partially
> >>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>> DoubleMapped.
> >>>>
> >>>> There are 2 problems with your proposal, as I see it;
> >>>>
> >>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>> the CONT_PTE bit.
> >>>>
> >>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>> be unmapped unatomically.
> >>>>
> >>>> For the PMD case there are actually 2 properties that allow using the
> >>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>
> >>> well. Thanks for clarification. based on the above description, i agree the
> >>> current code might make more sense by always using mapcount in subpage.
> >>>
> >>> I gave my proposals as I thought we were always CONTPTE size for small-THP
> >>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>> entirely, we only
> >>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>
> >> Well its always good to have the discussion - so thanks for the ideas. I think
> >> there is a bigger question lurking here; should we be exposing the concept of
> >> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >> I'm confident that would be a huge amount of effort and the end result would be
> >> similar performace to what this approach gives. One potential benefit of letting
> >> core-mm control it is that it would also give control to core-mm over the
> >> granularity of access/dirty reporting (my approach implicitly ties it to the
> >> folio). Having sub-folio access tracking _could_ potentially help with future
> >> work to make THP size selection automatic, but we are not there yet, and I think
> >> there are other (simpler) ways to achieve the same thing. So my view is that
> >> _not_ exposing it to core-mm is the right way for now.
> >
> > Hi Ryan,
> >
> > We(OPPO) started a similar project like you even before folio was imported to
> > mainline, we have deployed the dynamic hugepage(that is how we name it)
> > on millions of mobile phones on real products and kernels before 5.16, making
> > a huge success on performance improvement. for example, you may
> > find the out-of-tree 5.15 source code here
>
> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> embarrassed that I clearly didn't do enough research on the prior art because I
> wasn't aware of your work. So sorry about that.
>
> I sensed that you had a different model for how this should work vs what I've
> implemented and now I understand why :). I'll review your stuff and I'm sure
> I'll have questions. I'm sure each solution has pros and cons.
>
>
> >
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
> > Our modification might not be so clean and has lots of workarounds
> > just for the stability of products
> >
> > We mainly have
> >
> > 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >
> > some CONTPTE helpers
> >
> > 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >
> > some Dynamic Hugepage APIs
> >
> > 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >
> > modified all page faults to support
> > (1). allocation of hugepage of 64KB in do_anon_page
>
> My Small-Sized THP patch set is handling the equivalent of this.

Right, the only difference is that we used a huge zero page for read faults
in do_anon_page, mapping the whole range with CONTPTEs to the zero page.

>
> > (2). CoW hugepage in do_wp_page
>
> This isn't handled yet in my patch set; the original RFC implemented it but I
> removed it in order to strip back to the essential complexity for the initial
> submission. DavidH has been working on a precise shared vs exclusive map
> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> Out of interest, what workloads benefit most from this?

On a phone, Android is designed so that almost all processes are forked from
zygote; thus, CoW happens quite often for all apps.

>
> > (3). copy CONPTEs in copy_pte_range
>
> As discussed this is done as part of the contpte patch set, but its not just a
> simple copy; the arch code will notice and set the CONT_PTE bit as needed.

Right, I have read all your unfold and fold code today, and now that I understand
it, your approach seems quite nice!


>
> > (4). allocate and swap-in Hugepage as a whole in do_swap_page
>
> This is going to be a problem but I haven't even looked at this properly yet.
> The advice so far has been to continue to swap-in small pages only, but improve
> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> understand how you did this.

This is also crucial for Android phones, as swap is always happening on an
embedded device. If we don't support large folios in swap-in, our large folios
will never come back after they are swapped out.

And I disliked the collapse solution from the very beginning, as there is never
a guarantee it will succeed and its overhead is unacceptable for the user UI,
so we supported hugepage allocation in do_swap_page from the start.

>
> >
> > 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >
> > reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>
> I think this is all naturally handled by the folio code that exists in modern
> kernels?

We had a CONTPTE hugepage pool; when the pool runs low, we let LRU reclaim
large folios back into the pool. As phones run lots of apps and drivers and
memory is very limited, after a couple of hours it becomes very hard to
allocate large folios from the normal buddy allocator. Thus, without the pool,
large folios disappear entirely after running the phone for some time.

>
> >
> > So we are 100% interested in your patchset and hope it can find a way
> > to land on the
> > mainline, thus decreasing all the cost we have to maintain out-of-tree
> > code from a
> > kernel to another kernel version which we have done on a couple of
> > kernel versions
> > before 5.16. Firmly, we are 100% supportive of large anon folios
> > things you are leading.
>
> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> it closer :). If you had any ability to do any A/B performance testing, it would
> be very interesting to see how this stacks up against your solution - if there
> are gaps it would be good to know where and develop a plan to plug the gap.
>

sure.

> >
> > A big pain was we found lots of races especially on CONTPTE unfolding
> > and especially a part
> > of basepages ran away from the 16 CONPTEs group since userspace is
> > always working
> > on basepages, having no idea of small-THP. We ran our code on millions of
> > real phones, and now we have got them fixed (or maybe "can't reproduce"),
> > no outstanding issue.
>
> I'm going to be brave and say that my solution shouldn't suffer from these
> problems; but of course the proof is only in the testing. I did a lot of work
> with our architecture group and micro architects to determine exactly what is
> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
> optimization in patch 13 (see the commit log for details). Of course this has
> all been checked with partners and we are confident that all existing
> implementations conform to the modified wording.

Cool. I like your try_unfold/fold code. It seems your code sets/drops CONT
automatically based on alignment, page count, etc. In contrast, our code always
has to explicitly check a set of conditions before setting and dropping CONT
everywhere.
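
For what it's worth, the kind of condition I mean looks roughly like this
(illustration only, with a made-up helper that is not from either code base):

/*
 * Illustration only: the sort of check that decides whether a 16-pte
 * range can carry the CONT bit - block-aligned VA, block-aligned PA,
 * and enough pages remaining to cover the whole block.
 */
static inline bool can_use_cont(unsigned long addr, unsigned long pfn,
                                unsigned long nr_pages)
{
        return IS_ALIGNED(addr, CONT_PTES * PAGE_SIZE) &&
               IS_ALIGNED(pfn, CONT_PTES) &&
               nr_pages >= CONT_PTES;
}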

>
> >
> > Particularly for the rmap issue we are discussing, our out-of-tree is
> > using the entire_map for
> > CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> > CONTPTE from mm-core.
> >
> > We are doing this in mm/memory.c
> >
> > copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> > vm_area_struct *src_vma,
> > pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> > struct page **prealloc)
> > {
> > struct mm_struct *src_mm = src_vma->vm_mm;
> > unsigned long vm_flags = src_vma->vm_flags;
> > pte_t pte = *src_pte;
> > struct page *page;
> >
> > page = vm_normal_page(src_vma, addr, pte);
> > ...
> >
> > get_page(page);
> > page_dup_rmap(page, true); // an entire dup_rmap as you can
> > see.............
> > rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> > }
> >
> > and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> >
> > static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> > unsigned long haddr, bool freeze)
> > {
> > ...
> > if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> > for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> > atomic_inc(&head[i]._mapcount);
> > atomic_long_inc(&cont_pte_double_map_count);
> > }
> >
> >
> > if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> > ...
> > }
> >
> > I am not selling our solution any more, but just showing you some differences we
> > have :-)
>
> OK, I understand what you were saying now. I'm currently struggling to see how
> this could fit into my model. Do you have any workloads and numbers on perf
> improvement of using entire_mapcount?

TBH, I don't have any data on this, as we were using entire_map from the very
beginning, so I have no comparison at all.

>
> >
> >>
> >>>
> >>> BTW, I have concerns that a variable small-THP size will really work
> >>> as userspace
> >>> is probably friendly to only one fixed size. for example, userspace
> >>> heap management
> >>> might be optimized to a size for freeing memory to the kernel. it is
> >>> very difficult
> >>> for the heap to adapt to various sizes at the same time. frequent unmap/free
> >>> size not equal with, and particularly smaller than small-THP size will
> >>> defeat all
> >>> efforts to use small-THP.
> >>
> >> I'll admit to not knowing a huge amount about user space allocators. But I will
> >> say that as currently defined, the small-sized THP interface to user space
> >> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >> single size can be enabled. I'm diliberately punting that decision away from the
> >> kernel for now.
> >
> > Basically, userspace heap library has a PAGESIZE setting and allows users
> > to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> > The default size is for sure equal to the basepage SIZE. once some objects are
> > freed by free() and libc get a free "page", userspace heap libraries might free
> > the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> > it is quite similar with kernel slab.
> >
> > so imagine we have small-THP now, but userspace libraries have *NO*
> > idea at all, so it can frequently cause unfolding.
> >
> >>
> >> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> >> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>
> >> Functionally, it will not matter if the allocator is not enlightened for the THP
> >> size; it can continue to free, and if a partial folio is unmapped it is put on
> >> the deferred split list, then under memory pressure it is split and the unused
> >> pages are reclaimed. I guess this is the bit you are concerned about having a
> >> performance impact?
> >
> > right. If this is happening on the majority of small-THP folios, we
> > don't have performance
> > improvement, and probably regression instead. This is really true on
> > real workloads!!
> >
> > So that is why we really love a per-VMA hint to enable small-THP but
> > obviously you
> > have already supported it now by
> > mm: thp: Introduce per-size thp sysfs interface
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> > can set the VMA flag when it is quite sure this VMA is working with
> > the alignment
> > of 64KB?
>
> Yes, that all exists in the series today. We have also discussed the possibility
> of adding a new madvise_process() call that would take the set of THP sizes that
> should be considered. Then you can set different VMAs to use different sizes;
> the plan was to layer that on top if/when a workload was identified. Sounds like
> you might be able to help there?

I'm not quite sure, as on phones we use a fixed-size CONTPTE, so we ask
for either 64KB or 4KB. If we think a VMA is a good fit for CONTPTE, we
set a flag in that VMA and try to allocate 64KB.

But I will try to understand this requirement to madvise THP sizes on a specific
VMA.

>
> >
> >>
> >> Regardless, it would be good to move this conversation to the small-sized THP
> >> patch series since this is all independent of contpte mappings.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> >>>>> still atomic in a
> >>>>> spinlock area.
> >>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> But that can be added on top, and I'll happily do that.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> David / dhildenb
> >>>>>>>
> >>>>>
> >>>

Thanks
Barry

2023-11-28 10:49:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 09:49, Barry Song wrote:
> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 20:34, Barry Song wrote:
>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 10:28, Barry Song wrote:
>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>>>>>> + if (anon) {
>>>>>>>>>>> + /*
>>>>>>>>>>> + * If this page may have been pinned by the
>>>>>>>>>>> + * parent process, copy the page immediately for
>>>>>>>>>>> + * the child so that we'll always guarantee the
>>>>>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>>>>>> + * future.
>>>>>>>>>>> + */
>>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>>>> + page, false, src_vma))) {
>>>>>>>>>>> + if (i != 0)
>>>>>>>>>>> + break;
>>>>>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>>>>>> + return copy_present_page(
>>>>>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>>>>>> + page);
>>>>>>>>>>> + }
>>>>>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>>>> + } else {
>>>>>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>>>>>> + }
>>>>>>>>>>> }
>>>>>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>>>>>> - } else if (page) {
>>>>>>>>>>> - folio_get(folio);
>>>>>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>>>>>> +
>>>>>>>>>>> + nr = i;
>>>>>>>>>>> + folio_ref_add(folio, nr);
>>>>>>>>>>
>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>>>
>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>>>> pages are the corner case.
>>>>>>>>>>
>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>>>> in each basepage.
>>>>>>>>
>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>>>> atomic, so we can account the entire thing.
>>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>>>> process A with CONPTE,
>>>>>>> and only page2 is mapped in process B.
>>>>>>> then we will have
>>>>>>>
>>>>>>> entire_map = 0
>>>>>>> page0.map = -1
>>>>>>> page1.map = -1
>>>>>>> page2.map = 0
>>>>>>> page3.map = -1
>>>>>>> ....
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>>>> split.
>>>>>>>>>
>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>>>
>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>>>
>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>>>> =0? (start from -1).
>>>>>>>>>
>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>>>> which makes that code easier.
>>>>>>>>
>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>>>> unmap the whole folio atomically.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>>>> it is partially
>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>>>> DoubleMapped.
>>>>>>
>>>>>> There are 2 problems with your proposal, as I see it;
>>>>>>
>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>>>> the CONT_PTE bit.
>>>>>>
>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>>>> be unmapped unatomically.
>>>>>>
>>>>>> For the PMD case there are actually 2 properties that allow using the
>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>>>
>>>>> well. Thanks for clarification. based on the above description, i agree the
>>>>> current code might make more sense by always using mapcount in subpage.
>>>>>
>>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>>>> entirely, we only
>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>>>
>>>> Well its always good to have the discussion - so thanks for the ideas. I think
>>>> there is a bigger question lurking here; should we be exposing the concept of
>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>>>> I'm confident that would be a huge amount of effort and the end result would be
>>>> similar performace to what this approach gives. One potential benefit of letting
>>>> core-mm control it is that it would also give control to core-mm over the
>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
>>>> folio). Having sub-folio access tracking _could_ potentially help with future
>>>> work to make THP size selection automatic, but we are not there yet, and I think
>>>> there are other (simpler) ways to achieve the same thing. So my view is that
>>>> _not_ exposing it to core-mm is the right way for now.
>>>
>>> Hi Ryan,
>>>
>>> We(OPPO) started a similar project like you even before folio was imported to
>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
>>> on millions of mobile phones on real products and kernels before 5.16, making
>>> a huge success on performance improvement. for example, you may
>>> find the out-of-tree 5.15 source code here
>>
>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
>> embarrassed that I clearly didn't do enough research on the prior art because I
>> wasn't aware of your work. So sorry about that.
>>
>> I sensed that you had a different model for how this should work vs what I've
>> implemented and now I understand why :). I'll review your stuff and I'm sure
>> I'll have questions. I'm sure each solution has pros and cons.
>>
>>
>>>
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>
>>> Our modification might not be so clean and has lots of workarounds
>>> just for the stability of products
>>>
>>> We mainly have
>>>
>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>>>
>>> some CONTPTE helpers
>>>
>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>>>
>>> some Dynamic Hugepage APIs
>>>
>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>>>
>>> modified all page faults to support
>>> (1). allocation of hugepage of 64KB in do_anon_page
>>
>> My Small-Sized THP patch set is handling the equivalent of this.
>
> right, the only difference is that we did a huge-zeropage for reading
> in do_anon_page.
> mapping all large folios to CONTPTE to zero page.

FWIW, I took a slightly different approach in my original RFC for the zero page
- although I ripped it all out to simplify for the initial series. I found that
it was pretty rare for user space to read multiple consecutive pages without
ever interleaving any writes, so I kept the zero page as a base page, but at CoW,
I would expand the allocation to an appropriately sized THP. But for the couple
of workloads that I've gone deep with, I found that it made barely any dent in
the amount of memory that ended up contpte-mapped; the vast majority was from
write allocation in do_anonymous_page().

>
>>
>>> (2). CoW hugepage in do_wp_page
>>
>> This isn't handled yet in my patch set; the original RFC implemented it but I
>> removed it in order to strip back to the essential complexity for the initial
>> submission. DavidH has been working on a precise shared vs exclusive map
>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
>> Out of interest, what workloads benefit most from this?
>
> as a phone, Android has a design almost all processes are forked from zygote.
> thus, CoW happens quite often to all apps.

Sure. But in my analysis I concluded that most of the memory mapped in zygote is
file-backed and mostly RO, so doing THP CoW doesn't help much. Perhaps there are
cases where that conclusion is wrong.

>
>>
>>> (3). copy CONPTEs in copy_pte_range
>>
>> As discussed this is done as part of the contpte patch set, but its not just a
>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
>
> right, i have read all your unfold and fold stuff today, now i understand your
> approach seems quite nice!

Great - thanks!

>
>
>>
>>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
>>
>> This is going to be a problem but I haven't even looked at this properly yet.
>> The advice so far has been to continue to swap-in small pages only, but improve
>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
>> understand how you did this.
>
> this is also crucial to android phone as swap is always happening
> on an embedded device. if we don't support large folios in swapin,
> our large folios will never come back after it is swapped-out.
>
> and i hated the collapse solution from the first beginning as there is
> never a guarantee to succeed and its overhead is unacceptable to user UI,
> so we supported hugepage allocation in do_swap_page from the first beginning.

Understood. I agree it would be nice to preserve large folios across swap. I
think this can be layered on top of the current work though.

>
>>
>>>
>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>>>
>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>>
>> I think this is all naturally handled by the folio code that exists in modern
>> kernels?
>
> We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
> reclaim large folios to the pool. as phones are running lots of apps
> and drivers, and the memory is very limited, after a couple of hours,
> it will become very hard to allocate large folios in the original buddy. thus,
> large folios totally disappeared after running the phone for some time
> if we didn't have the pool.
>
>>
>>>
>>> So we are 100% interested in your patchset and hope it can find a way
>>> to land on the
>>> mainline, thus decreasing all the cost we have to maintain out-of-tree
>>> code from a
>>> kernel to another kernel version which we have done on a couple of
>>> kernel versions
>>> before 5.16. Firmly, we are 100% supportive of large anon folios
>>> things you are leading.
>>
>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
>> it closer :). If you had any ability to do any A/B performance testing, it would
>> be very interesting to see how this stacks up against your solution - if there
>> are gaps it would be good to know where and develop a plan to plug the gap.
>>
>
> sure.
>
>>>
>>> A big pain was we found lots of races especially on CONTPTE unfolding
>>> and especially a part
>>> of basepages ran away from the 16 CONPTEs group since userspace is
>>> always working
>>> on basepages, having no idea of small-THP. We ran our code on millions of
>>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
>>> no outstanding issue.
>>
>> I'm going to be brave and say that my solution shouldn't suffer from these
>> problems; but of course the proof is only in the testing. I did a lot of work
>> with our architecture group and micro architects to determine exactly what is
>> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
>> optimization in patch 13 (see the commit log for details). Of course this has
>> all been checked with partners and we are confident that all existing
>> implementations conform to the modified wording.
>
> cool. I like your try_unfold/fold code. it seems your code is setting/dropping
> CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
> our code is always stupidly checking some conditions before setting and dropping
> CONT everywhere.
>
>>
>>>
>>> Particularly for the rmap issue we are discussing, our out-of-tree is
>>> using the entire_map for
>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
>>> CONTPTE from mm-core.
>>>
>>> We are doing this in mm/memory.c
>>>
>>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
>>> vm_area_struct *src_vma,
>>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> struct page **prealloc)
>>> {
>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>> unsigned long vm_flags = src_vma->vm_flags;
>>> pte_t pte = *src_pte;
>>> struct page *page;
>>>
>>> page = vm_normal_page(src_vma, addr, pte);
>>> ...
>>>
>>> get_page(page);
>>> page_dup_rmap(page, true); // an entire dup_rmap as you can
>>> see.............
>>> rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
>>> }
>>>
>>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>>>
>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>>> unsigned long haddr, bool freeze)
>>> {
>>> ...
>>> if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
>>> for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
>>> atomic_inc(&head[i]._mapcount);
>>> atomic_long_inc(&cont_pte_double_map_count);
>>> }
>>>
>>>
>>> if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
>>> ...
>>> }
>>>
>>> I am not selling our solution any more, but just showing you some differences we
>>> have :-)
>>
>> OK, I understand what you were saying now. I'm currently struggling to see how
>> this could fit into my model. Do you have any workloads and numbers on perf
>> improvement of using entire_mapcount?
>
> TBH, I don't have any data on this as from the first beginning, we were using
> entire_map. So I have no comparison at all.
>
>>
>>>
>>>>
>>>>>
>>>>> BTW, I have concerns that a variable small-THP size will really work
>>>>> as userspace
>>>>> is probably friendly to only one fixed size. for example, userspace
>>>>> heap management
>>>>> might be optimized to a size for freeing memory to the kernel. it is
>>>>> very difficult
>>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>>>> size not equal with, and particularly smaller than small-THP size will
>>>>> defeat all
>>>>> efforts to use small-THP.
>>>>
>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
>>>> say that as currently defined, the small-sized THP interface to user space
>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>>>> single size can be enabled. I'm diliberately punting that decision away from the
>>>> kernel for now.
>>>
>>> Basically, userspace heap library has a PAGESIZE setting and allows users
>>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
>>> The default size is for sure equal to the basepage SIZE. once some objects are
>>> freed by free() and libc get a free "page", userspace heap libraries might free
>>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
>>> it is quite similar with kernel slab.
>>>
>>> so imagine we have small-THP now, but userspace libraries have *NO*
>>> idea at all, so it can frequently cause unfolding.
>>>
>>>>
>>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>>>
>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
>>>> the deferred split list, then under memory pressure it is split and the unused
>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
>>>> performance impact?
>>>
>>> right. If this is happening on the majority of small-THP folios, we
>>> don't have performance
>>> improvement, and probably regression instead. This is really true on
>>> real workloads!!
>>>
>>> So that is why we really love a per-VMA hint to enable small-THP but
>>> obviously you
>>> have already supported it now by
>>> mm: thp: Introduce per-size thp sysfs interface
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
>>> can set the VMA flag when it is quite sure this VMA is working with
>>> the alignment
>>> of 64KB?
>>
>> Yes, that all exists in the series today. We have also discussed the possibility
>> of adding a new madvise_process() call that would take the set of THP sizes that
>> should be considered. Then you can set different VMAs to use different sizes;
>> the plan was to layer that on top if/when a workload was identified. Sounds like
>> you might be able to help there?
>
> i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
> for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
> set a flag in this VMA and try to allocate 64KB.

When you say "we set a flag" do you mean user space? Or is there some heuristic
in the kernel?

>
> But I will try to understand this requirement to madvise THP sizes on a specific
> VMA.
>
>>
>>>
>>>>
>>>> Regardless, it would be good to move this conversation to the small-sized THP
>>>> patch series since this is all independent of contpte mappings.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>>>>>> still atomic in a
>>>>>>> spinlock area.
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> David / dhildenb
>>>>>>>>>
>>>>>>>
>>>>>
>
> Thanks
> Barry

2023-11-28 11:01:45

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 00:11, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 05:54, Barry Song wrote:
>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>> + pte_t *dst_pte, pte_t *src_pte,
>>>> + unsigned long addr, unsigned long end,
>>>> + int *rss, struct folio **prealloc)
>>>> {
>>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>>> unsigned long vm_flags = src_vma->vm_flags;
>>>> pte_t pte = ptep_get(src_pte);
>>>> struct page *page;
>>>> struct folio *folio;
>>>> + int nr = 1;
>>>> + bool anon;
>>>> + bool any_dirty = pte_dirty(pte);
>>>> + int i;
>>>>
>>>> page = vm_normal_page(src_vma, addr, pte);
>>>> - if (page)
>>>> + if (page) {
>>>> folio = page_folio(page);
>>>> - if (page && folio_test_anon(folio)) {
>>>> - /*
>>>> - * If this page may have been pinned by the parent process,
>>>> - * copy the page immediately for the child so that we'll always
>>>> - * guarantee the pinned page won't be randomly replaced in the
>>>> - * future.
>>>> - */
>>>> - folio_get(folio);
>>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>> - /* Page may be pinned, we have to copy. */
>>>> - folio_put(folio);
>>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>> - addr, rss, prealloc, page);
>>>> + anon = folio_test_anon(folio);
>>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>> + end, pte, &any_dirty);
>>>
>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
>>> do madvise(addr + 4KB * 5, DONTNEED);
>>
>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
>> page0~page4 and page6~15?
>>
>>>
>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
>>
>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
>> not how its intended to work. The function is scanning forwards from the current
>> pte until it finds the first pte that does not fit in the batch - either because
>> it maps a PFN that is not contiguous, or because the permissions are different
>> (although this is being relaxed a bit; see conversation with DavidH against this
>> same patch).
>>
>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
>> (page0~page4) then the next time through the loop we will go through the
>> !present path and process the single swap marker. Then the 3rd time through the
>> loop folio_nr_pages_cont_mapped() will return 10.
>
> one case we have met by running hundreds of real phones is as below,
>
>
> static int
> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> unsigned long end)
> {
> ...
> dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> if (!dst_pte) {
> ret = -ENOMEM;
> goto out;
> }
> src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> if (!src_pte) {
> pte_unmap_unlock(dst_pte, dst_ptl);
> /* ret == 0 */
> goto out;
> }
> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> orig_src_pte = src_pte;
> orig_dst_pte = dst_pte;
> arch_enter_lazy_mmu_mode();
>
> do {
> /*
> * We are holding two locks at this point - either of them
> * could generate latencies in another task on another CPU.
> */
> if (progress >= 32) {
> progress = 0;
> if (need_resched() ||
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
> ptent = ptep_get(src_pte);
> if (pte_none(ptent)) {
> progress++;
> continue;
> }
>
> the above iteration can break when progress > =32. for example, at the
> beginning,
> if all PTEs are none, we break when progress >=32, and we break when we
> are in the 8th pte of 16PTEs which might become CONTPTE after we release
> PTL.
>
> since we are releasing PTLs, next time when we get PTL, those pte_none() might
> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> thus, immediately
> break the consistent CONPTEs rule of hardware?
>
> pte0 - pte_none
> pte1 - pte_none
> ...
> pte7 - pte_none
>
> pte8 - pte_cont
> ...
> pte15 - pte_cont
>
> so we did some modification to avoid a break in the middle of PTEs
> which can potentially
> become CONTPE.
> do {
> /*
> * We are holding two locks at this point - either of them
> * could generate latencies in another task on another CPU.
> */
> if (progress >= 32) {
> progress = 0;
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> /*
> * XXX: don't release ptl at an unligned address as
> cont_pte might form while
> * ptl is released, this causes double-map
> */
> if (!vma_is_chp_anonymous(src_vma) ||
> (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> HPAGE_CONT_PTE_SIZE)))
> #endif
> if (need_resched() ||
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
>
> We could only reproduce the above issue by running thousands of phones.
>
> Does your code survive from this problem?

Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
bit is not blindly "copied" from parent to child pte. As far as the core-mm is
concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
will see some pte_none() entries followed by some pte_present() entries. And
when calling set_ptes() on the child, the arch code will evaluate the current
state of the pgtable along with the new set_ptes() request and determine where
it should insert the CONT_PTE bit.
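
To give a rough flavour of what that evaluation looks like (this is an
illustrative sketch, not the code from the series: CONT_PTES, CONT_PTE_SIZE,
pte_valid() and pte_pfn() are existing arm64 definitions, __ptep_get() follows
the naming introduced by the series, and the helper name is mine):

/*
 * Illustrative only: the sort of check the arch layer can make before
 * deciding to set PTE_CONT on a naturally aligned block of entries.
 */
static bool contpte_block_is_foldable(unsigned long addr, pte_t *ptep, pte_t pte)
{
	unsigned long start = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	pte_t *first = ptep - ((addr - start) >> PAGE_SHIFT);
	unsigned long pfn = pte_pfn(pte) - ((addr - start) >> PAGE_SHIFT);
	int i;

	for (i = 0; i < CONT_PTES; i++, pfn++) {
		pte_t entry = __ptep_get(first + i);

		/* Every entry must be valid and map the expected pfn. */
		if (!pte_valid(entry) || pte_pfn(entry) != pfn)
			return false;
	}

	/* Plus (checked elsewhere) all pages must belong to the same folio. */
	return true;
}

In the scenario above, the child's block still contains pte_none() entries when
pte8-pte15 are copied, so a check like this fails and the CONT_PTE bit is simply
not set at that point.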

>
>>
>> Thanks,
>> Ryan
>>
>>>
>>> but the current code is copying page0~page14, right? unless we are immediatly
>>> split_folio to basepages in zap_pte_range(), we will have problems?
>>>
>>>> +
>>>> + for (i = 0; i < nr; i++, page++) {
>>>> + if (anon) {
>>>> + /*
>>>> + * If this page may have been pinned by the
>>>> + * parent process, copy the page immediately for
>>>> + * the child so that we'll always guarantee the
>>>> + * pinned page won't be randomly replaced in the
>>>> + * future.
>>>> + */
>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>> + page, false, src_vma))) {
>>>> + if (i != 0)
>>>> + break;
>>>> + /* Page may be pinned, we have to copy. */
>>>> + return copy_present_page(
>>>> + dst_vma, src_vma, dst_pte,
>>>> + src_pte, addr, rss, prealloc,
>>>> + page);
>>>> + }
>>>> + rss[MM_ANONPAGES]++;
>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>> + } else {
>>>> + page_dup_file_rmap(page, false);
>>>> + rss[mm_counter_file(page)]++;
>>>> + }
>>>
>
> Thanks
> Barry

2023-11-28 11:16:31

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 28/11/2023 07:32, Barry Song wrote:
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep, int full)
>> +{
>> + pte_t orig_pte = __ptep_get(ptep);
>> +
>> + if (!pte_valid_cont(orig_pte) || !full) {
>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> + } else
>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>> +}
>> +
>
> Hi Ryan,
>
> I feel quite hard to understand the code. when !pte_valid_cont(orig_pte),
> we will call contpte_try_unfold(mm, addr, ptep, orig_pte);
>
> but in contpte_try_unfold(), we call unfold only if pte_valid_cont()
> is true:
> static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte)
> {
> if (contpte_is_enabled(mm) && pte_valid_cont(pte))
> __contpte_try_unfold(mm, addr, ptep, pte);
> }
>
> so do you mean the below?
>
> if (!pte_valid_cont(orig_pte))
> return __ptep_get_and_clear(mm, addr, ptep);
>
> if (!full) {
> contpte_try_unfold(mm, addr, ptep, orig_pte);
> return __ptep_get_and_clear(mm, addr, ptep);
> } else {
> return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> }

Yes, this is equivalent. In general, I was trying not to spray `if
(pte_valid_cont(orig_pte))` checks everywhere to guard contpte_try_unfold() and
instead put the checks into contpte_try_unfold() (hence the 'try'). I figured
just calling it unconditionally and letting the compiler optimize as it sees fit
was the cleanest approach.

But in this instance I can see this is confusing. I'll modify as you suggest.
Thanks!
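
For reference, the reworked helper would look roughly like this (same
__ptep_*/contpte_* helpers as in the patch, just restructured along the lines
you suggest):

#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
				unsigned long addr, pte_t *ptep, int full)
{
	pte_t orig_pte = __ptep_get(ptep);

	/* Not part of a folded contpte block: nothing to unfold. */
	if (!pte_valid_cont(orig_pte))
		return __ptep_get_and_clear(mm, addr, ptep);

	/* Partial teardown: unfold the block before clearing this entry. */
	if (!full) {
		contpte_try_unfold(mm, addr, ptep, orig_pte);
		return __ptep_get_and_clear(mm, addr, ptep);
	}

	/* Full address space teardown: skip the unfold (and its tlbi). */
	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
}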

>
> Thanks
> Barry
>
>

2023-11-28 11:49:57

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 28/11/2023 08:17, Barry Song wrote:
>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep)
>> +{
>> + /*
>> + * When doing a full address space teardown, we can avoid unfolding the
>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>> + * just get and clear the pte. The caller is promising to call us for
>> + * every pte, so every pte in the range will be cleared by the time the
>> + * tlbi is issued.
>> + *
>> + * This approach is not perfect though, as for the duration between
>> + * returning from the first call to ptep_get_and_clear_full() and making
>> + * the final call, the contpte block in an intermediate state, where
>> + * some ptes are cleared and others are still set with the PTE_CONT bit.
>> + * If any other APIs are called for the ptes in the contpte block during
>> + * that time, we have to be very careful. The core code currently
>> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
>> + * ptep_get() must be careful to ignore the cleared entries when
>> + * accumulating the access and dirty bits - the same goes for
>> + * ptep_get_lockless(). The only other calls we might resonably expect
>> + * are to set markers in the previously cleared ptes. (We shouldn't see
>> + * valid entries being set until after the tlbi, at which point we are
>> + * no longer in the intermediate state). Since markers are not valid,
>> + * this is safe; set_ptes() will see the old, invalid entry and will not
>> + * attempt to unfold. And the new pte is also invalid so it won't
>> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
>> + *
>> + * The last remaining issue is returning the access/dirty bits. That
>> + * info could be present in any of the ptes in the contpte block.
>> + * ptep_get() will gather those bits from across the contpte block. We
>> + * don't bother doing that here, because we know that the information is
>> + * used by the core-mm to mark the underlying folio as accessed/dirty.
>> + * And since the same folio must be underpinning the whole block (that
>> + * was a requirement for folding in the first place), that information
>> + * will make it to the folio eventually once all the ptes have been
>> + * cleared. This approach means we don't have to play games with
>> + * accumulating and storing the bits. It does mean that any interleaved
>> + * calls to ptep_get() may lack correct access/dirty information if we
>> + * have already cleared the pte that happened to store it. The core code
>> + * does not rely on this though.
>
> even without any other threads running and touching those PTEs, this won't survive
> on some hardware. we expose inconsistent CONTPTEs to hardware, this might result

No, that's not the case; if you read the Arm ARM, the page table is only
considered "misprogrammed" when *valid* entries within the same contpte block
have different values for the contiguous bit. We are clearing the ptes to zero
here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
(either due to an explicit tlbi, as you point out below, or due to a concurrent
TLB miss which selects our entry for eviction to make space for the new incoming
entry), then when an access request arrives for an address in our partially
cleared contpte block, the address will either be:

A) an address for a pte entry we have already cleared, so it's invalid and it
will fault (and get serialized behind the PTL).

or

B) an address for a pte entry we haven't yet cleared, so it will re-form a TLB
entry for the contpte block. But that's OK, because the memory still exists;
we haven't yet finished clearing the page table and have not yet issued the
final tlbi.


> in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
> seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
> with dropped CONT but still some other PTEs have CONT, we make hardware totally
> confused.

I suspect this is because in your case you are "misprogramming" the contpte
block; there are *valid* pte entries within the block that disagree about the
contiguous bit or about various other fields. In that situation some HW TLB
designs can do weird things. That is likely resulting in accesses to bad memory
space, causing an SError, which is trapped by EL3, and the FW is probably just
panicking at that point.

>
> zap_pte_range() has a force_flush when tlbbatch is full:
>
> if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> force_flush = 1;
> addr += PAGE_SIZE;
> break;
> }
>
> this means you can expose partial tlbi/flush directly to hardware while some
> other PTEs are still CONT.

Yes, but that's also possible even if we have a tight loop that clears down the
contpte block; there could still be another core that issues a tlbi while you're
halfway through that loop, or the HW could happen to evict the entry due to TLB
pressure at any time. The point is, it's safe if you are clearing the pte to an *invalid*
entry.

>
> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> than clearing one PTE.
>
> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> of the large folio:
>
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> if (pte_cont(ptent)) {
> unsigned long next = pte_cont_addr_end(addr, end);
>
> if (next - addr != HPAGE_CONT_PTE_SIZE) {
> __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> /*
> * After splitting cont-pte
> * we need to process pte again.
> */
> goto again_pte;
> } else {
> cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
>
> tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> if (unlikely(!page))
> continue;
>
> if (is_huge_zero_page(page)) {
> tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> goto cont_next;
> }
>
> rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> page_remove_rmap(page, true);
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
>
> tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> }
> cont_next:
> /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> addr = next - PAGE_SIZE;
> continue;
> }
> #endif
>
> this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> it never requires tlb->fullmm at all.

Yes, but you are benefitting from the fact that contpte is exposed to core-mm
and that core-mm special-cases it at that level. I'm trying to avoid that.

I don't think there is any correctness issue here. But there is a problem with
fragility, as raised by Alistair. I have some ideas on how that could
potentially be solved. I'm going to try to work on it this afternoon and will
post if I gain some confidence that it is a real solution.

Thanks,
Ryan

>
> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> unsigned long addr,
> pte_t *ptep,
> bool flush)
> {
> pte_t orig_pte = ptep_get(ptep);
>
> CHP_BUG_ON(!pte_cont(orig_pte));
> CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
>
> return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> }
>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
>
>> + */
>> +
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>> +
>
> Thanks
> Barry
>
>

2023-11-28 11:53:12

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 27/11/2023 22:53, Barry Song wrote:
> On Tue, Nov 28, 2023 at 12:11 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 10:35, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 03:18, Barry Song wrote:
>>>>>> Ryan Roberts (14):
>>>>>> mm: Batch-copy PTE ranges during fork()
>>>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>>>
>>>>> Hi Ryan,
>>>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>>>> in the below cases
>>>>
>>>> The general idea is that the core-mm sets the individual ptes (one at a time if
>>>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>>>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>>>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>>>
>>>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>>>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>>>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>>>> the block is "folded". And it will detect when the first PTE in the block
>>>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>>>> requirements for folding a contpte block is that all the pages must belong to
>>>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>>>> block as a whole rather than for each individual pte).
>>>>
>>>> (there are a couple of optimizations that make the reality slightly more
>>>> complicated than what I've just explained, but you get the idea).
>>>>
>>>> On that basis, I believe all the specific cases you describe below are all
>>>> covered and safe - please let me know if you think there is a hole here!
>>>>
>>>>>
>>>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>>>
>>>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>>>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>>>> is removed from the whole contpte block. If there is then a subsequent
>>>> set_pte_at() to set a swap entry, the implementation will see that its not
>>>> appropriate to re-fold, so the range will remain unfolded.
>>>>
>>>>>
>>>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>>>> munmap, mlock etc.
>>>>
>>>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>>>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>>>> block, it will remain folded. But this is safe and continues to work correctly.
>>>> The VMA arrangement is not important; it is just important that a single folio
>>>> is mapped contiguously across the whole block.
>>>
>>> I don't think it is safe to keep CONTPTE folded in a split_vma case. as
>>> otherwise, copy_ptes in your other patch might only copy a part
>>> of CONTPES.
>>> For example, if page0-page4 and page5-page15 are splitted in split_vma,
>>> in fork, while copying pte for the first VMA, we are copying page0-page4,
>>> this will immediately cause inconsistent CONTPTE. as we have to
>>> make sure all CONTPTEs are atomically mapped in a PTL.
>>
>> No that's not how it works. The CONT_PTE bit is not blindly copied from parent
>> to child. It is explicitly managed by the arch code and set when appropriate. In
>> the case above, we will end up calling set_ptes() for page0-page4 in the child.
>> set_ptes() will notice that there are only 5 contiguous pages so it will map
>> without the CONT_PTE bit.
>
> Ok. cool. alternatively, in the code I shared to you, we are doing an unfold
> immediately when split_vma happens within a large anon folio, so we disallow
> CONTPTE to cross two VMAs to avoid all kinds of complexity afterwards.
>
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/huge_memory.c
>
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> void vma_adjust_cont_pte_trans_huge(struct vm_area_struct *vma,
> unsigned long start,
> unsigned long end,
> long adjust_next)
> {
> /*
> * If the new start address isn't hpage aligned and it could
> * previously contain an hugepage: check if we need to split
> * an huge pmd.
> */
> if (start & ~HPAGE_CONT_PTE_MASK &&
> (start & HPAGE_CONT_PTE_MASK) >= vma->vm_start &&
> (start & HPAGE_CONT_PTE_MASK) + HPAGE_CONT_PTE_SIZE <= vma->vm_end)
> split_huge_cont_pte_address(vma, start, false, NULL);
>
> ....
> }
> #endif
>
> In your approach, you are still holding CONTPTE crossing two VMAs. but it seems
> ok. I can't have a case which might fail in my brain right now. only

Yes, I'm dealing with the CONT_PTE bit at the pgtable level, not at the VMA level.


> running the code on
> a large amount of real hardware will tell :-)

Indeed - is this something you might be able to help with? :)

>
>>
>>>
>>>>
>>>>>
>>>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>>>> rather than being as a whole.
>>>>
>>>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>>>> cleared and unfold the contpte block.
>>>>
>>>>>
>>>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>>>> contiguous physical address with CONTPTE set. if one of them run away
>>>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>>>> errors/faults can happen in HW. for example
>>>>
>>>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>>>> contpte_try_unfold(). the fold/unfold operation is only done when all
>>>> requirements are met, and we perform it in a manner that is conformant to the
>>>> architecture requirements (see contpte_fold() - being renamed to
>>>> contpte_convert() in the next version).
>>>>
>>>> Thanks for the review!
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>
>>>>> case0:
>>>>> addr0 PTE - has no CONTPE
>>>>> addr0+4kb PTE - has CONTPTE
>>>>> ....
>>>>> addr0+60kb PTE - has CONTPTE
>>>>>
>>>>> case 1:
>>>>> addr0 PTE - has no CONTPE
>>>>> addr0+4kb PTE - has CONTPTE
>>>>> ....
>>>>> addr0+60kb PTE - has swap
>>>>>
>>>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>>>> our observation.
>>>>>
>>>
>
> Thanks
> Barry

2023-11-28 11:58:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 28/11/2023 03:13, Yang Shi wrote:
> On Mon, Nov 27, 2023 at 1:15 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>> mm: Batch-copy PTE ranges during fork()
>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>
> Even with different permissions, for example, read-only vs read-write?
> The mprotect() may change the permission. It should be misprogramming
> per ARM ARM.

If the permissions are changed, then mprotect() must have called the pgtable
helpers to modify the page table (e.g. ptep_set_wrprotect(),
ptep_set_access_flags() or whatever). These functions will notice that the
contpte block is currently folded and unfold it before applying the permission
change. The unfolding process is done in a way that intentionally avoids
misprogramming as defined by the Arm ARM. See contpte_fold() in contpte.c.
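
As a concrete (simplified) illustration of that layering, the wrprotect wrapper
in the series boils down to roughly the following sketch; contpte_try_unfold()
is a no-op unless the existing entry is a valid entry with PTE_CONT set:

#define __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm,
				      unsigned long addr, pte_t *ptep)
{
	/*
	 * Unfold first if needed (roughly: clear the block, tlbi, then
	 * rewrite the entries without PTE_CONT), then modify the single
	 * entry as before.
	 */
	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
	__ptep_set_wrprotect(mm, addr, ptep);
}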

>
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>>> Thanks
>>> Barry
>>>
>>>
>>
>>

2023-11-28 12:08:50

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 28/11/2023 05:49, Barry Song wrote:
> On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>> mm: Batch-copy PTE ranges during fork()
>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>
> Hi Ryan,
>
> sorry for too many comments, I remembered another case
>
> 4. mremap
>
> a CONTPTE might be remapped to another address which might not be
> aligned with 16*basepage. thus, in move_ptes(), we are copying CONPTEs
> from src to dst.
> static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> unsigned long old_addr, unsigned long old_end,
> struct vm_area_struct *new_vma, pmd_t *new_pmd,
> unsigned long new_addr, bool need_rmap_locks)
> {
> struct mm_struct *mm = vma->vm_mm;
> pte_t *old_pte, *new_pte, pte;
> ...
>
> /*
> * We don't have to worry about the ordering of src and dst
> * pte locks because exclusive mmap_lock prevents deadlock.
> */
> old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> if (!old_pte) {
> err = -EAGAIN;
> goto out;
> }
> new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
> if (!new_pte) {
> pte_unmap_unlock(old_pte, old_ptl);
> err = -EAGAIN;
> goto out;
> }
> if (new_ptl != old_ptl)
> spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> flush_tlb_batched_pending(vma->vm_mm);
> arch_enter_lazy_mmu_mode();
>
> for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> new_pte++, new_addr += PAGE_SIZE) {
> if (pte_none(ptep_get(old_pte)))
> continue;
>
> pte = ptep_get_and_clear(mm, old_addr, old_pte);
> ....
> }
>
> This has two possibilities
> 1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE;
> 2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE
> while copying.
>
> does your code also handle this properly?

Yes; same mechanism - the arm64 arch code does the CONT_PTE bit management and
folds/unfolds as necessary.

Admittedly this may be non-optimal because we are iterating a single PTE at a
time. When we clear the first pte of a contpte block in the source, the block
will be unfolded. When we set the last pte of the contpte block in the dest, the
block will be folded. If we had a batching mechanism, we could just clear the
whole source contpte block in one hit (no need to unfold first) and we could
just set the dest contpte block in one hit (no need to fold at the end).

I haven't personally seen this as a hotspot though; I don't know if you have any
data to the contrary? I've followed this type of batching technique for the fork
case (patch 13). We could do a similar thing in theory, but it's a bit more
complex because of the ptep_get_and_clear() return value; you would need to
return all ptes for the cleared range, or somehow collapse the actual info that
the caller requires (presumably access/dirty info).
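
For example, a hypothetical batched variant (not something in this series; the
name and signature are made up, only the __ptep_* layering follows the patches)
could collapse the per-pte information by accumulating access/dirty into a
single returned pte:

static inline pte_t ptep_get_and_clear_nr(struct mm_struct *mm,
					  unsigned long addr, pte_t *ptep,
					  unsigned int nr)
{
	/* Clear the first entry and use it to carry the accumulated bits. */
	pte_t orig = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t pte = __ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
						 ptep + i);

		if (pte_dirty(pte))
			orig = pte_mkdirty(orig);
		if (pte_young(pte))
			orig = pte_mkyoung(orig);
	}

	return orig;
}

Whether a single accumulated pte is actually sufficient for the mremap caller
is exactly the open question above.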

>
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>>> Thanks
>>> Barry
>
> Thanks
> Barry

2023-11-28 12:46:39

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 28/11/2023 06:54, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 27/11/2023 07:34, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>>>
>>>>> Ryan Roberts <[email protected]> writes:
>>>>>
>>>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>>>
>>>>>>> Ryan Roberts <[email protected]> writes:
>>>>>>>
>>>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>>>> a full address space teardown is in progress. We use this information to
>>>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>>>> are about to clear the rest too.
>>>>>>>>
>>>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>>>> original cost.
>>>>>>>>
>>>>>>>> This approach is not perfect though, as for the duration between
>>>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>>>> other calls we might resonably expect are to set markers in the
>>>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>>>> the 'full' case anyway.
>>>>>>>>
>>>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>>>> doing that here, because we know that the information is used by the
>>>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>>>> for folding in the first place), that information will make it to the
>>>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>>>> means we don't have to play games with accumulating and storing the
>>>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>>>> happened to store it. The core code does not rely on this though.
>>>>>>>
>>>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>>>> potentially something that could change in the future though which would
>>>>>>> lead to some subtle bugs.
>>>>>>
>>>>>> Yes, there is a risk, although IMHO, its very small.
>>>>>>
>>>>>>>
>>>>>>> Would there be any way of avoiding this? Half-baked thought, but could
>>>>>>> you for example copy the access/dirty information to the last (or
>>>>>>> perhaps first, most likely invalid) PTE?
>>>>>>
>>>>>> I spent a long time thinking about this and came up with a number of
>>>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>>>> (which works but suffers from the problem that it depends on the way it is
>>>>>> called not changing).
>>>>>
>>>>> Ok, that answers my underlying question of "has someone thought about
>>>>> this and are there any easy solutions". I suspected that was the case
>>>>> given the excellent write up though!
>>>>>
>>>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>>>> contpte block. This is how I did it in v1, although it was racy. I think this
>>>>>> could be implemented correctly but it's extremely complex.
>>>>>>
>>>>>> 2) batch calls from the core-mm (like I did for ptep_set_wrprotects()) so that we
>>>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>>>> returns the pte that was cleared, so it's difficult to scale that up to multiple ptes.
>>>>>>
>>>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>>>> is no easy place to store that state (other than scanning a contpte block
>>>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>>>
>>>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>>>
>>>>>> What do you think? If we really need to solve this, then option 1 is my
>>>>>> preferred route, but it would take some time to figure out and reason about a
>>>>>> race-free scheme.
>>>>>
>>>>> Well I like simple, and I agree the risk is small. But I can't help feeling
>>>>> the current situation is too subtle, mainly because it is architecture
>>>>> specific and the assumptions are not communicated in the core-mm code
>>>>> anywhere. But also none of the alternatives seem much better.
>>>>>
>>>>> However there are only three callers of ptep_get_and_clear_full(), and
>>>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>>>> just about all users of ptep_get*(), which will take the PTL beforehand.
>>>>
>>>> The problem isn't racing threads because as you say, the PTL is already
>>>> serializing all calls except ptep_get_lockless(). And although there are 3
>>>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>>>> calls it with full=1, as I recall.
>>>>
>>>> The problem is that the caller in zap_pte_range() does this:
>>>>
>>>> ptl = lock_page_table()
>>>> for each pte {
>>>>     ptent = ptep_get(pte);
>>>>     if (pte_present(ptent)) {
>>>>         ptent = ptep_get_and_clear_full(ptent);
>>>>         if (pte_dirty(ptent))
>>>>             ...
>>>>         if (pte_young(ptent))
>>>>             ...
>>>>     }
>>>> }
>>>> unlock_page_table(ptl)
>>>>
>>>> It deliberately interleaves calls to ptep_get() and ptep_get_and_clear_full()
>>>> under the ptl. So if the loop is iterating over a contpte block and the HW
>>>> happens to be storing the access/dirty info in the first pte entry, then the
>>>> first time through the loop, ptep_get() will return the correct access/dirty
>>>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>>>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>>>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>>>
>>> So if ptep_get_lockless() isn't a concern what made the option posted in
>>> v1 racy (your option 1 above)? Is there something else reading PTEs or
>>> clearing PTE bits without holding the PTL that I'm missing?
>>
>> The HW could be racing to set access and dirty bits. Well actually, I'm not
>> completely sure if that's the case here; if full=1 then presumably no other
>> threads in the process should be running at this point, so perhaps it can be
>> guaranteed that nothing is causing a concurrent memory access and the HW is
>> therefore definitely not going to try to write the access/dirty bits
>> concurrently. But I didn't manage to convince myself that's definitely the case.
>
> I suppose it's possible something attached to an SMMU or some such could
> still be running and causing accesses, so I agree it's probably not the
> case (although it would be an odd corner case).

Indeed.

>
>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>> buggy because it iterated through the PTEs, getting and accumulating. Then
>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>> have modified the bits during those loops. I think it would be possible to fix
>> the race, but intuition says it would be expensive.
>
> So the issue as I understand it is subsequent iterations would see a
> clean PTE after the first iteration returned a dirty PTE. In
> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
> bit (if set) from the PTE being cleared to an adjacent PTE rather than
> all the PTEs?

The raciness I'm describing is the race between reading access/dirty from one
pte and applying it to another. But yes, I like your suggestion. If we do:

pte = __ptep_get_and_clear_full(ptep)

on the target pte, then we have grabbed access/dirty from it in a race-free
manner. we can then loop from current pte up towards the top of the block until
we find a valid entry (and I guess wrap at the top to make us robust against
future callers clearing in an arbitrary order). Then atomically accumulate the
access/dirty bits we have just saved into that new entry. I guess that's just a
cmpxchg loop - there are already examples of how to do that correctly when
racing the TLB.

For most entries, we will just be copying up to the next pte. For the last pte,
we would end up reading all the ptes and determining we are the last one.
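
Roughly, the sort of thing I have in mind is the sketch below (illustrative
only - stash_access_dirty() is a made-up name and the real thing would need
auditing against the races described above; __ptep_get(), contpte_align_down(),
CONT_PTES and the pte accessors are the existing arm64 helpers):

/*
 * Sketch only (not a tested patch): after clearing the target pte, scan
 * upwards through the contpte block, wrapping at the top, for the first
 * still-valid entry, and atomically accumulate the saved access/dirty
 * bits into it with a cmpxchg loop so we don't lose bits the HW may be
 * setting concurrently.
 */
static void stash_access_dirty(pte_t *ptep, pte_t cleared)
{
	pte_t *start = contpte_align_down(ptep);
	int idx = ptep - start;
	int i;

	if (!pte_dirty(cleared) && !pte_young(cleared))
		return;

	for (i = 1; i < CONT_PTES; i++) {
		pte_t *p = start + ((idx + i) % CONT_PTES);
		pteval_t old = pte_val(__ptep_get(p));

		if (!pte_valid(__pte(old)))
			continue;	/* already cleared; try the next pte */

		for (;;) {
			pteval_t new = old;
			pteval_t prev;

			if (pte_dirty(cleared))
				new = pte_val(pte_mkdirty(__pte(new)));
			if (pte_young(cleared))
				new = pte_val(pte_mkyoung(__pte(new)));

			prev = cmpxchg_relaxed(&pte_val(*p), old, new);
			if (prev == old)
				return;	/* stashed in a yet-to-be-cleared pte */
			old = prev;	/* HW updated access/dirty; retry */
		}
	}
}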

What do you think?

>
> That would fix the inconsistency as far as subsequent iterations of
> ptep_get_and_clear_full() returning the dirty/accessed if a previous
> iteration did. Obviously HW could still race and cause a previously
> clean iteration to return dirty, but that seems ok.
>
> However all this has just left me with more questions :-)
>
> Won't clearing bits like this result in inconsistent programming of the
> PTE_CONT bit? What happens if HW access a page in the contiguous region
> while some of the PTEs are invalid? And same question for programming
> them really - I don't think we can atomically set PTE_CONT on multiple
> PTEs all at once so if we assume something can be accessing them
> concurrently how do we do that without having some HW observing an
> intermediate state where PTE_CONT is misprogrammed?

I'm fairly confident at this point that it is safe and conformant to the
architecture. See my explanation at [1]. Of course having people look for holes
is very welcome - so thanks!

There is also a clarification we made to the Arm ARM, primarily for patch 13 (see
the commit log), which also clarifies that invalid ptes are not taken into account
when considering whether a contpte block is misprogrammed. See section
D21194 at [2].


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://developer.arm.com/documentation/102105/latest/

>
> Thanks.
>
>>>
>>>>>
>>>>> So really that only leaves ptep_get_lockless() that could/should
>>>>> interleave right?
>>>>
>>>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>>>> is very careful to ensure that the contpte block is in a consistent state and it
>>>> keeps trying until it is. So this will always return the correct consistent
>>>> information.
>>>>
>>>>> From a quick glance of those users none look at the
>>>>> young/dirty information anyway, so I wonder if we can just assert in the
>>>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>>>> and clear it in the helpers? That would make things explicit and
>>>>> consistent which would address my concern (although I haven't looked too
>>>>> closely at the details there).
>>>>
>>>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>>>> don't think this helps.
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>
>>>
>

2023-11-28 16:55:26

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>> have modified the bits during those loops. I think it would be possible to fix
>>> the race, but intuition says it would be expensive.
>>
>> So the issue as I understand it is subsequent iterations would see a
>> clean PTE after the first iteration returned a dirty PTE. In
>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>> all the PTEs?
>
> The raciness I'm describing is the race between reading access/dirty from one
> pte and applying it to another. But yes, I like your suggestion. If we do:
>
> pte = __ptep_get_and_clear_full(ptep)
>
> on the target pte, then we have grabbed access/dirty from it in a race-free
> manner. We can then loop from the current pte up towards the top of the block until
> we find a valid entry (and I guess wrap at the top to make us robust against
> future callers clearing in an arbitrary order). Then atomically accumulate the
> access/dirty bits we have just saved into that new entry. I guess that's just a
> cmpxchg loop - there are already examples of how to do that correctly when
> racing the TLB.
>
> For most entries, we will just be copying up to the next pte. For the last pte,
> we would end up reading all the ptes and determining we are the last one.
>
> What do you think?

OK, here is an attempt at something which solves the fragility. I think this is
now robust and will always return the correct access/dirty state from
ptep_get_and_clear_full() and ptep_get().

But I'm not sure about performance; each call to ptep_get_and_clear_full() for
each pte in a contpte block will cause a ptep_get() to gather the access/dirty
bits from across the contpte block - which requires reading each pte in the
contpte block. So it's O(n^2) in that sense. I'll benchmark it and report back.

Was this the type of thing you were thinking of, Alistair?


--8<--
arch/arm64/include/asm/pgtable.h | 23 ++++++++-
arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 38 +++++++++------
3 files changed, 125 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9bd2f57a9e11..6c295d277784 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
 }
 
+extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
 extern int __ptep_set_access_flags(struct vm_area_struct *vma,
				    unsigned long address, pte_t *ptep,
				    pte_t entry, int dirty);
@@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, int full)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_valid_cont(orig_pte))
+		return __ptep_get_and_clear(mm, addr, ptep);
+
+	if (!full) {
+		contpte_try_unfold(mm, addr, ptep, orig_pte);
+		return __ptep_get_and_clear(mm, addr, ptep);
+	}
+
+	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	return __ptep_get_and_clear(mm, addr, ptep);
+	return ptep_get_and_clear_full(mm, addr, ptep, 0);
 }
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 2a57df16bf58..99b211118d93 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
+		/*
+		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
+		 * where some of the ptes in the range may be cleared but others
+		 * are still to do. See contpte_ptep_get_and_clear_full().
+		 */
+		if (!pte_valid(pte))
+			continue;
+
 		if (pte_dirty(pte))
 			orig_pte = pte_mkdirty(orig_pte);
 
@@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(contpte_set_ptes);
 
+pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * When doing a full address space teardown, we can avoid unfolding the
+	 * contiguous range, and therefore avoid the associated tlbi. Instead,
+	 * just get and clear the pte. The caller is promising to call us for
+	 * every pte, so every pte in the range will be cleared by the time the
+	 * final tlbi is issued.
+	 *
+	 * This approach requires some complex hoop jumping though, as for the
+	 * duration between returning from the first call to
+	 * ptep_get_and_clear_full() and making the final call, the contpte
+	 * block is in an intermediate state, where some ptes are cleared and
+	 * others are still set with the PTE_CONT bit. If any other APIs are
+	 * called for the ptes in the contpte block during that time, we have to
+	 * be very careful. The core code currently interleaves calls to
+	 * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
+	 * careful to ignore the cleared entries when accumulating the access
+	 * and dirty bits - the same goes for ptep_get_lockless(). The only
+	 * other calls we might reasonably expect are to set markers in the
+	 * previously cleared ptes. (We shouldn't see valid entries being set
+	 * until after the tlbi, at which point we are no longer in the
+	 * intermediate state). Since markers are not valid, this is safe;
+	 * set_ptes() will see the old, invalid entry and will not attempt to
+	 * unfold. And the new pte is also invalid so it won't attempt to fold.
+	 * We shouldn't see pte markers being set for the 'full' case anyway
+	 * since the address space is being torn down.
+	 *
+	 * The last remaining issue is returning the access/dirty bits. That
+	 * info could be present in any of the ptes in the contpte block.
+	 * ptep_get() will gather those bits from across the contpte block (for
+	 * the remaining valid entries). So below, if the pte we are clearing
+	 * has dirty or young set, we need to stash it into a pte that we are
+	 * yet to clear. This allows future calls to return the correct state
+	 * even when the info was stored in a different pte. Since the core-mm
+	 * calls from low to high address, we prefer to stash in the last pte of
+	 * the contpte block - this means we are not "dragging" the bits up
+	 * through all ptes and increases the chances that we can exit early
+	 * because a given pte will have neither dirty or young set.
+	 */
+
+	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
+	bool dirty = pte_dirty(orig_pte);
+	bool young = pte_young(orig_pte);
+	pte_t *start;
+
+	if (!dirty && !young)
+		return contpte_ptep_get(ptep, orig_pte);
+
+	start = contpte_align_down(ptep);
+	ptep = start + CONT_PTES - 1;
+
+	for (; ptep >= start; ptep--) {
+		pte_t pte = __ptep_get(ptep);
+
+		if (!pte_valid(pte))
+			continue;
+
+		if (dirty)
+			pte = pte_mkdirty(pte);
+
+		if (young)
+			pte = pte_mkyoung(pte);
+
+		__ptep_set_access_flags_notlbi(ptep, pte);
+		return contpte_ptep_get(ptep, orig_pte);
+	}
+
+	return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
+
 int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
				unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index d63f3a0a7251..b22216a8153c 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
 	pr_cont("\n");
 }
 
-/*
- * This function sets the access flags (dirty, accessed), as well as write
- * permission, and only to a more permissive setting.
- *
- * It needs to cope with hardware update of the accessed/dirty state by other
- * agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like __set_ptes(), the PTE is never changed from no-exec to exec here.
- *
- * Returns whether or not the PTE actually changed.
- */
-int __ptep_set_access_flags(struct vm_area_struct *vma,
-			    unsigned long address, pte_t *ptep,
-			    pte_t entry, int dirty)
+int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
 {
 	pteval_t old_pteval, pteval;
 	pte_t pte = __ptep_get(ptep);
@@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
 	} while (pteval != old_pteval);
 
+	return 1;
+}
+
+/*
+ * This function sets the access flags (dirty, accessed), as well as write
+ * permission, and only to a more permissive setting.
+ *
+ * It needs to cope with hardware update of the accessed/dirty state by other
+ * agents in the system and can safely skip the __sync_icache_dcache() call as,
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
+ *
+ * Returns whether or not the PTE actually changed.
+ */
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+			    unsigned long address, pte_t *ptep,
+			    pte_t entry, int dirty)
+{
+	int changed = __ptep_set_access_flags_notlbi(ptep, entry);
+
 	/* Invalidate a stale read-only entry */
-	if (dirty)
+	if (changed && dirty)
 		flush_tlb_page(vma, address);
-	return 1;
+
+	return changed;
 }
 
 static bool is_el1_instruction_abort(unsigned long esr)
--8<--

2023-11-28 19:01:29

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 00:11, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 05:54, Barry Song wrote:
> >>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>> + pte_t *dst_pte, pte_t *src_pte,
> >>>> + unsigned long addr, unsigned long end,
> >>>> + int *rss, struct folio **prealloc)
> >>>> {
> >>>> struct mm_struct *src_mm = src_vma->vm_mm;
> >>>> unsigned long vm_flags = src_vma->vm_flags;
> >>>> pte_t pte = ptep_get(src_pte);
> >>>> struct page *page;
> >>>> struct folio *folio;
> >>>> + int nr = 1;
> >>>> + bool anon;
> >>>> + bool any_dirty = pte_dirty(pte);
> >>>> + int i;
> >>>>
> >>>> page = vm_normal_page(src_vma, addr, pte);
> >>>> - if (page)
> >>>> + if (page) {
> >>>> folio = page_folio(page);
> >>>> - if (page && folio_test_anon(folio)) {
> >>>> - /*
> >>>> - * If this page may have been pinned by the parent process,
> >>>> - * copy the page immediately for the child so that we'll always
> >>>> - * guarantee the pinned page won't be randomly replaced in the
> >>>> - * future.
> >>>> - */
> >>>> - folio_get(folio);
> >>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >>>> - /* Page may be pinned, we have to copy. */
> >>>> - folio_put(folio);
> >>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >>>> - addr, rss, prealloc, page);
> >>>> + anon = folio_test_anon(folio);
> >>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >>>> + end, pte, &any_dirty);
> >>>
> >>> in case we have a large folio with 16 CONTPTE basepages, and userspace
> >>> do madvise(addr + 4KB * 5, DONTNEED);
> >>
> >> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> >> page0~page4 and page6~15?
> >>
> >>>
> >>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> >>> will return 15. in this case, we should copy page0~page3 and page5~page15.
> >>
> >> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> >> not how its intended to work. The function is scanning forwards from the current
> >> pte until it finds the first pte that does not fit in the batch - either because
> >> it maps a PFN that is not contiguous, or because the permissions are different
> >> (although this is being relaxed a bit; see conversation with DavidH against this
> >> same patch).
> >>
> >> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> >> (page0~page4) then the next time through the loop we will go through the
> >> !present path and process the single swap marker. Then the 3rd time through the
> >> loop folio_nr_pages_cont_mapped() will return 10.
> >
> > one case we have met by running hundreds of real phones is as below,
> >
> >
> > static int
> > copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> > pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> > unsigned long end)
> > {
> > ...
> > dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> > if (!dst_pte) {
> > ret = -ENOMEM;
> > goto out;
> > }
> > src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> > if (!src_pte) {
> > pte_unmap_unlock(dst_pte, dst_ptl);
> > /* ret == 0 */
> > goto out;
> > }
> > spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > orig_src_pte = src_pte;
> > orig_dst_pte = dst_pte;
> > arch_enter_lazy_mmu_mode();
> >
> > do {
> > /*
> > * We are holding two locks at this point - either of them
> > * could generate latencies in another task on another CPU.
> > */
> > if (progress >= 32) {
> > progress = 0;
> > if (need_resched() ||
> > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > break;
> > }
> > ptent = ptep_get(src_pte);
> > if (pte_none(ptent)) {
> > progress++;
> > continue;
> > }
> >
> > the above iteration can break when progress >= 32. For example, at the
> > beginning,
> > if all PTEs are none, we break when progress >= 32, and we may break when we
> > are in the 8th pte of 16 PTEs which might become CONTPTE after we release
> > the PTL.
> >
> > since we are releasing PTLs, next time when we get PTL, those pte_none() might
> > become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> > thus, immediately
> > break the consistent CONTPTEs rule of hardware?
> >
> > pte0 - pte_none
> > pte1 - pte_none
> > ...
> > pte7 - pte_none
> >
> > pte8 - pte_cont
> > ...
> > pte15 - pte_cont
> >
> > so we did some modification to avoid a break in the middle of PTEs
> > which can potentially
> > become CONTPTE.
> > do {
> > /*
> > * We are holding two locks at this point - either of them
> > * could generate latencies in another task on another CPU.
> > */
> > if (progress >= 32) {
> > progress = 0;
> > #ifdef CONFIG_CONT_PTE_HUGEPAGE
> > /*
> > * XXX: don't release ptl at an unaligned address as
> > cont_pte might form while
> > * ptl is released, this causes double-map
> > */
> > if (!vma_is_chp_anonymous(src_vma) ||
> > (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> > HPAGE_CONT_PTE_SIZE)))
> > #endif
> > if (need_resched() ||
> > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > break;
> > }
> >
> > We could only reproduce the above issue by running thousands of phones.
> >
> > Does your code survive from this problem?
>
> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
> will see some pte_none() entries followed by some pte_present() entries. And
> when calling set_ptes() on the child, the arch code will evaluate the current
> state of the pgtable along with the new set_ptes() request and determine where
> it should insert the CONT_PTE bit.

Yep, I have read very carefully and think your code is safe here. The
only problem is that the code can randomly unfold the parent process's CONTPTE
while setting wrprotect in the middle of a large folio, while it actually should
keep the CONT bit, as all PTEs can still be consistent if we set the protection
from the 1st PTE.

While A forks B, progress >= 32 might interrupt in the middle of a
new CONTPTE folio which is forming; as we have to set wrprotect on parent A,
this parent immediately loses the CONT bit. This is sad, but I can't find a
good way to resolve it unless CONT is exposed to mm-core. Any idea on
this?

Our code [1] resolves this by only breaking at an aligned address:

		if (progress >= 32) {
			progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
			/*
			 * XXX: don't release ptl at an unaligned address as
			 * cont_pte might form while ptl is released; this
			 * causes double-map
			 */
			if (!vma_is_chp_anonymous(src_vma) ||
			    (vma_is_chp_anonymous(src_vma) &&
			     IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180


Thanks
Barry

2023-11-28 19:38:26

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Wed, Nov 29, 2023 at 1:08 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 05:49, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 03:18, Barry Song wrote:
> >>>> Ryan Roberts (14):
> >>>> mm: Batch-copy PTE ranges during fork()
> >>>> arm64/mm: set_pte(): New layer to manage contig bit
> >>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>>> arm64/mm: pte_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get(): New layer to manage contig bit
> >>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>>> arm64/mm: Wire up PTE_CONT for user mappings
> >>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >>>
> >>> Hi Ryan,
> >>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> >>> in the below cases
> >>
> >> The general idea is that the core-mm sets the individual ptes (one at a time if
> >> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> >> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> >> (ptep_clear(), etc); This is exactly the same interface as previously.
> >>
> >> BUT, the arm64 implementation of those interfaces will now detect when a set of
> >> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> >> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> >> the block is "folded". And it will detect when the first PTE in the block
> >> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> >> requirements for folding a contpte block is that all the pages must belong to
> >> the *same* folio (that means its safe to only track access/dirty for thecontpte
> >> block as a whole rather than for each individual pte).
> >>
> >> (there are a couple of optimizations that make the reality slightly more
> >> complicated than what I've just explained, but you get the idea).
> >>
> >> On that basis, I believe all the specific cases you describe below are all
> >> covered and safe - please let me know if you think there is a hole here!
> >>
> >>>
> >>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
> >>
> >> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> >> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> >> is removed from the whole contpte block. If there is then a subsequent
> >> set_pte_at() to set a swap entry, the implementation will see that its not
> >> appropriate to re-fold, so the range will remain unfolded.
> >>
> >>>
> >>> 2. vma split in a large folio due to various reasons such as mprotect,
> >>> munmap, mlock etc.
> >>
> >> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> >> suspect not, so if the VMA is split in the middle of a currently folded contpte
> >> block, it will remain folded. But this is safe and continues to work correctly.
> >> The VMA arrangement is not important; it is just important that a single folio
> >> is mapped contiguously across the whole block.
> >>
> >>>
> >>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> >>> rather than being as a whole.
> >>
> >> Yes, as per 1; the arm64 implementation will notice when the first entry is
> >> cleared and unfold the contpte block.
> >>
> >>>
> >>> In hardware, we need to make sure CONTPTEs follow the rule - always 16
> >>> contiguous physical addresses with CONTPTE set. If one of them runs away
> >>> from the 16 ptes group and the PTEs become inconsistent, some terrible
> >>> errors/faults can happen in HW. For example
> >>
> >> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> >> contpte_try_unfold(). the fold/unfold operation is only done when all
> >> requirements are met, and we perform it in a manner that is conformant to the
> >> architecture requirements (see contpte_fold() - being renamed to
> >> contpte_convert() in the next version).
> >
> > Hi Ryan,
> >
> > sorry for too many comments, I remembered another case
> >
> > 4. mremap
> >
> > a CONTPTE might be remapped to another address which might not be
> > aligned with 16*basepage. Thus, in move_ptes(), we are copying CONTPTEs
> > from src to dst.
> > static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > unsigned long old_addr, unsigned long old_end,
> > struct vm_area_struct *new_vma, pmd_t *new_pmd,
> > unsigned long new_addr, bool need_rmap_locks)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > pte_t *old_pte, *new_pte, pte;
> > ...
> >
> > /*
> > * We don't have to worry about the ordering of src and dst
> > * pte locks because exclusive mmap_lock prevents deadlock.
> > */
> > old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> > if (!old_pte) {
> > err = -EAGAIN;
> > goto out;
> > }
> > new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
> > if (!new_pte) {
> > pte_unmap_unlock(old_pte, old_ptl);
> > err = -EAGAIN;
> > goto out;
> > }
> > if (new_ptl != old_ptl)
> > spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > flush_tlb_batched_pending(vma->vm_mm);
> > arch_enter_lazy_mmu_mode();
> >
> > for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> > new_pte++, new_addr += PAGE_SIZE) {
> > if (pte_none(ptep_get(old_pte)))
> > continue;
> >
> > pte = ptep_get_and_clear(mm, old_addr, old_pte);
> > ....
> > }
> >
> > This has two possibilities
> > 1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE;
> > 2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE
> > while copying.
> >
> > does your code also handle this properly?
>
> Yes; same mechanism - the arm64 arch code does the CONT_PTE bit management and
> folds/unfolds as necessary.
>
> Admittedly this may be non-optimal because we are iterating a single PTE at a
> time. When we clear the first pte of a contpte block in the source, the block
> will be unfolded. When we set the last pte of the contpte block in the dest, the
> block will be folded. If we had a batching mechanism, we could just clear the
> whole source contpte block in one hit (no need to unfold first) and we could
> just set the dest contpte block in one hit (no need to fold at the end).
>
> I haven't personally seen this as a hotspot though; I don't know if you have any
> data to the contrary? I've followed this type of batching technique for the fork
> case (patch 13). We could do a similar thing in theory, but it's a bit more

In my previous testing, I didn't see mremap very often, so no worries.
As long as it is bug-free, it is fine to me, though an mremap microbenchmark
will definitely lose :-)

> complex because of the ptep_get_and_clear() return value; you would need to
> return all ptes for the cleared range, or somehow collapse the actual info that
> the caller requires (presumably access/dirty info).
>

Thanks
Barry

2023-11-28 20:24:18

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 08:17, Barry Song wrote:
> >> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> >> + unsigned long addr, pte_t *ptep)
> >> +{
> >> + /*
> >> + * When doing a full address space teardown, we can avoid unfolding the
> >> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> >> + * just get and clear the pte. The caller is promising to call us for
> >> + * every pte, so every pte in the range will be cleared by the time the
> >> + * tlbi is issued.
> >> + *
> >> + * This approach is not perfect though, as for the duration between
> >> + * returning from the first call to ptep_get_and_clear_full() and making
> >> + * the final call, the contpte block is in an intermediate state, where
> >> + * some ptes are cleared and others are still set with the PTE_CONT bit.
> >> + * If any other APIs are called for the ptes in the contpte block during
> >> + * that time, we have to be very careful. The core code currently
> >> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> >> + * ptep_get() must be careful to ignore the cleared entries when
> >> + * accumulating the access and dirty bits - the same goes for
> >> + * ptep_get_lockless(). The only other calls we might reasonably expect
> >> + * are to set markers in the previously cleared ptes. (We shouldn't see
> >> + * valid entries being set until after the tlbi, at which point we are
> >> + * no longer in the intermediate state). Since markers are not valid,
> >> + * this is safe; set_ptes() will see the old, invalid entry and will not
> >> + * attempt to unfold. And the new pte is also invalid so it won't
> >> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
> >> + *
> >> + * The last remaining issue is returning the access/dirty bits. That
> >> + * info could be present in any of the ptes in the contpte block.
> >> + * ptep_get() will gather those bits from across the contpte block. We
> >> + * don't bother doing that here, because we know that the information is
> >> + * used by the core-mm to mark the underlying folio as accessed/dirty.
> >> + * And since the same folio must be underpinning the whole block (that
> >> + * was a requirement for folding in the first place), that information
> >> + * will make it to the folio eventually once all the ptes have been
> >> + * cleared. This approach means we don't have to play games with
> >> + * accumulating and storing the bits. It does mean that any interleaved
> >> + * calls to ptep_get() may lack correct access/dirty information if we
> >> + * have already cleared the pte that happened to store it. The core code
> >> + * does not rely on this though.
> >
> > even without any other threads running and touching those PTEs, this won't survive
> > on some hardware. we expose inconsistent CONTPTEs to hardware, this might result
>
> No that's not the case; if you read the Arm ARM, the page table is only
> considered "misgrogrammed" when *valid* entries within the same contpte block
> have different values for the contiguous bit. We are clearing the ptes to zero
> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
> (either due to explicit tlbi as you point out below, or due to a concurrent TLB
> miss which selects our entry for removal to make space for the new incoming
> entry), then when an access request arrives for an address in our partially cleared
> contpte block, the address will either be:
>
> A) an address for a pte entry we have already cleared, so its invalid and it
> will fault (and get serialized behind the PTL).
>
> or
>
> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
> entry for the contpte block. But that's ok because the memory still exists
> because we haven't yet finished clearing the page table and have not yet issued
> the final tlbi.
>
>
> > in crashed firmware, even in trustzone; we have seen strange & unknown faults to
> > trustzone on Qualcomm, but for MTK it seems fine. When you do a tlbi on a part of
> > the PTEs with dropped CONT while some other PTEs still have CONT, we make the
> > hardware totally confused.
>
> I suspect this is because in your case you are "misprogramming" the contpte
> block; there are *valid* pte entries within the block that disagree about the
> contiguous bit or about various other fields. In this case some HW TLB designs
> can do weird things. I suspect in your case, that's resulting in accessing bad
> memory space and causing an SError, which is trapped by EL3, and the FW is
> probably just panicking at that point.

You are probably right. As we met the SError, we became very, very cautious,
so any time we flush the TLB for a CONTPTE block, we strictly do it by:
1. set all 16 ptes to zero
2. flush the whole 16 ptes

In your case, it can be:
1. set pte0 to zero
2. flush pte0

TBH, I have never tried this, but it might be safe according to your
description.
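
To spell out ordering (1)+(2) as code, a minimal sketch (illustrative only;
clear_flush_contpte_block() is a made-up wrapper, assuming 4K base pages where
CONT_PTES == 16; __pte_clear(), contpte_align_down() and flush_tlb_range() are
the existing helpers):

static void clear_flush_contpte_block(struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep)
{
	unsigned long start = ALIGN_DOWN(addr, CONT_PTES * PAGE_SIZE);
	pte_t *first = contpte_align_down(ptep);
	int i;

	/* 1. set all 16 ptes to zero ... */
	for (i = 0; i < CONT_PTES; i++)
		__pte_clear(vma->vm_mm, start + i * PAGE_SIZE, first + i);

	/* 2. ... then flush the whole 16-pte range in one go */
	flush_tlb_range(vma, start, start + CONT_PTES * PAGE_SIZE);
}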

>
> >
> > zap_pte_range() has a force_flush when tlbbatch is full:
> >
> > if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> > force_flush = 1;
> > addr += PAGE_SIZE;
> > break;
> > }
> >
> > this means you can expose partial tlbi/flush directly to hardware while some
> > other PTEs are still CONT.
>
> Yes, but that's also possible even if we have a tight loop that clears down the
> contpte block; there could still be another core that issues a tlbi while you're
> halfway through that loop, or the HW could happen to evict due to TLB pressure
> at any time. The point is, it's safe if you are clearing the pte to an *invalid*
> entry.
>
> >
> > on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> > on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> > those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> > than clearing one PTE.
> >
> > Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> > of the large folio:
> >
> > #ifdef CONFIG_CONT_PTE_HUGEPAGE
> > if (pte_cont(ptent)) {
> > unsigned long next = pte_cont_addr_end(addr, end);
> >
> > if (next - addr != HPAGE_CONT_PTE_SIZE) {
> > __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> > /*
> > * After splitting cont-pte
> > * we need to process pte again.
> > */
> > goto again_pte;
> > } else {
> > cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> >
> > tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> > if (unlikely(!page))
> > continue;
> >
> > if (is_huge_zero_page(page)) {
> > tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> > goto cont_next;
> > }
> >
> > rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> > page_remove_rmap(page, true);
> > if (unlikely(page_mapcount(page) < 0))
> > print_bad_pte(vma, addr, ptent, page);
> >
> > tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> > }
> > cont_next:
> > /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> > pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > addr = next - PAGE_SIZE;
> > continue;
> > }
> > #endif
> >
> > this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> > it never requires tlb->fullmm at all.
>
> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
> and it is special-casing them at this level. I'm trying to avoid that.

I am thinking we can even do this while we don't expose CONTPTE:
if zap_pte_range() meets a large folio and the zap range covers the whole
folio, we can flush all the ptes in this folio and jump to the end of the folio?
I mean:

if (folio head && range_end > folio_end) {
	nr = folio_nr_pages(folio);
	full_flush_nr_ptes(...);
	pte += nr - 1;
	addr += (nr - 1) * PAGE_SIZE;
}

zap_pte_range() is the most frequent behaviour from the userspace libc heap,
as I explained before. libc calls madvise(DONTNEED) most often, so it is crucial
to performance.

And this way can also help drop your 'full' version by moving to fully flushing
whole large folios? Then we don't need to depend on fullmm any more?

>
> I don't think there is any correctness issue here. But there is a problem with
> fragility, as raised by Alistair. I have some ideas on potentially how to solve
> that. I'm going to try to work on it this afternoon and will post if I get some
> confidence that it is a real solution.
>
> Thanks,
> Ryan
>
> >
> > static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> > unsigned long addr,
> > pte_t *ptep,
> > bool flush)
> > {
> > pte_t orig_pte = ptep_get(ptep);
> >
> > CHP_BUG_ON(!pte_cont(orig_pte));
> > CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> > CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >
> > return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> > }
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >
> >> + */
> >> +
> >> + return __ptep_get_and_clear(mm, addr, ptep);
> >> +}
> >> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >> +
> >

Thanks
Barry

>

2023-11-28 21:06:46

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 09:49, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 20:34, Barry Song wrote:
> >>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 10:28, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>>>>>> + if (anon) {
> >>>>>>>>>>> + /*
> >>>>>>>>>>> + * If this page may have been pinned by the
> >>>>>>>>>>> + * parent process, copy the page immediately for
> >>>>>>>>>>> + * the child so that we'll always guarantee the
> >>>>>>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>>>>>> + * future.
> >>>>>>>>>>> + */
> >>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>>>> + page, false, src_vma))) {
> >>>>>>>>>>> + if (i != 0)
> >>>>>>>>>>> + break;
> >>>>>>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>>>>>> + return copy_present_page(
> >>>>>>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>>>>>> + page);
> >>>>>>>>>>> + }
> >>>>>>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>>>> + } else {
> >>>>>>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>>>>>> + }
> >>>>>>>>>>> }
> >>>>>>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>>>>>> - } else if (page) {
> >>>>>>>>>>> - folio_get(folio);
> >>>>>>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>>>>>> +
> >>>>>>>>>>> + nr = i;
> >>>>>>>>>>> + folio_ref_add(folio, nr);
> >>>>>>>>>>
> >>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>>>
> >>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>>>> pages are the corner case.
> >>>>>>>>>>
> >>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I still don't understand why it is not an entire map+1, but an increment
> >>>>>>>>> in each basepage.
> >>>>>>>>
> >>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>>>> atomic, so we can account the entire thing.
> >>>>>>>
> >>>>>>> Hi Ryan,
> >>>>>>>
> >>>>>>> There is no problem. For example, a large folio is entirely mapped in
> >>>>>>> process A with CONTPTE,
> >>>>>>> and only page2 is mapped in process B.
> >>>>>>> then we will have
> >>>>>>>
> >>>>>>> entire_map = 0
> >>>>>>> page0.map = -1
> >>>>>>> page1.map = -1
> >>>>>>> page2.map = 0
> >>>>>>> page3.map = -1
> >>>>>>> ....
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>>>> split.
> >>>>>>>>>
> >>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>>>
> >>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>>>> large folio as HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>>>
> >>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>>>> =0? (start from -1).
> >>>>>>>>>
> >>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>>>> which makes that code easier.
> >>>>>>>>
> >>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>>>> unmap the whole folio atomically.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> My point is that CONTPTEs should either be all set in all 16 PTEs or all dropped
> >>>>>>> in the 16 PTEs. If all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>>>> it is partially
> >>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>>>> DoubleMapped.
> >>>>>>
> >>>>>> There are 2 problems with your proposal, as I see it;
> >>>>>>
> >>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>>>> the CONT_PTE bit.
> >>>>>>
> >>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>>>> be unmapped unatomically.
> >>>>>>
> >>>>>> For the PMD case there are actually 2 properties that allow using the
> >>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>>>
> >>>>> well. Thanks for clarification. based on the above description, i agree the
> >>>>> current code might make more sense by always using mapcount in subpage.
> >>>>>
> >>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
> >>>>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>>>> entirely, we only
> >>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>>>
> >>>> Well its always good to have the discussion - so thanks for the ideas. I think
> >>>> there is a bigger question lurking here; should we be exposing the concept of
> >>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >>>> I'm confident that would be a huge amount of effort and the end result would be
> >>>> similar performace to what this approach gives. One potential benefit of letting
> >>>> core-mm control it is that it would also give control to core-mm over the
> >>>> granularity of access/dirty reporting (my approach implicitly ties it to the
> >>>> folio). Having sub-folio access tracking _could_ potentially help with future
> >>>> work to make THP size selection automatic, but we are not there yet, and I think
> >>>> there are other (simpler) ways to achieve the same thing. So my view is that
> >>>> _not_ exposing it to core-mm is the right way for now.
> >>>
> >>> Hi Ryan,
> >>>
> >>> We(OPPO) started a similar project like you even before folio was imported to
> >>> mainline, we have deployed the dynamic hugepage(that is how we name it)
> >>> on millions of mobile phones on real products and kernels before 5.16, making
> >>> a huge success on performance improvement. for example, you may
> >>> find the out-of-tree 5.15 source code here
> >>
> >> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> >> embarrassed that I clearly didn't do enough research on the prior art because I
> >> wasn't aware of your work. So sorry about that.
> >>
> >> I sensed that you had a different model for how this should work vs what I've
> >> implemented and now I understand why :). I'll review your stuff and I'm sure
> >> I'll have questions. I'm sure each solution has pros and cons.
> >>
> >>
> >>>
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>
> >>> Our modification might not be so clean and has lots of workarounds
> >>> just for the stability of products
> >>>
> >>> We mainly have
> >>>
> >>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >>>
> >>> some CONTPTE helpers
> >>>
> >>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >>>
> >>> some Dynamic Hugepage APIs
> >>>
> >>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >>>
> >>> modified all page faults to support
> >>> (1). allocation of hugepage of 64KB in do_anon_page
> >>
> >> My Small-Sized THP patch set is handling the equivalent of this.
> >
> > right, the only difference is that we did a huge-zeropage for reading
> > in do_anon_page.
> > mapping all large folios to CONTPTE to zero page.
>
> FWIW, I took a slightly different approach in my original RFC for the zero page
> - although I ripped it all out to simplify for the initial series. I found that
> it was pretty rare for user space to read multiple consecutive pages without
> ever interleaving any writes, so I kept the zero page as a base page, but at CoW,
> I would expand the allocation to an appropriately sized THP. But for the couple
> of workloads that I've gone deep with, I found that it made barely any dent on
> the amount of memory that ended up contpte-mapped; the vast majority was from
> write allocation in do_anonymous_page().

The problem is that even if there is only one page read in the 16 ptes, you
will map that page to a zero basepage. Then, when you write another page in
these 16 ptes, you lose the chance of getting a large folio because
pte_range_none() becomes false.

If we map these 16 ptes to a contpte zero page, then in do_wp_page we have a
good chance to CoW and get a large anon folio.
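
The check I am referring to is along these lines - a sketch of the assumed
semantics only, since the real helper lives in the small-sized THP series and
may differ:

static bool pte_range_none(pte_t *ptep, unsigned int nr)
{
	unsigned int i;

	/*
	 * A large anon folio is only used when every pte in the candidate
	 * range is still none; one read fault that installed the 4K zero
	 * page makes this return false for the rest of the block.
	 */
	for (i = 0; i < nr; i++) {
		if (!pte_none(ptep_get(ptep + i)))
			return false;
	}
	return true;
}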

>
> >
> >>
> >>> (2). CoW hugepage in do_wp_page
> >>
> >> This isn't handled yet in my patch set; the original RFC implemented it but I
> >> removed it in order to strip back to the essential complexity for the initial
> >> submission. DavidH has been working on a precise shared vs exclusive map
> >> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> >> Out of interest, what workloads benefit most from this?
> >
> > as a phone, Android has a design almost all processes are forked from zygote.
> > thus, CoW happens quite often to all apps.
>
> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
> there are cases where that conclusion is wrong.

CoW is much less frequent than do_anon_page on my phone, which has been running
dynamic hugepage for a couple of hours:

OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
...
thp_cow 34669 ---- CoW a large folio
thp_do_anon_pages 1032362 ----- a large folio in do_anon_page
...

so it is around 34669/1032362 = 3.35%.

>
> >
> >>
> >>> (3). copy CONPTEs in copy_pte_range
> >>
> >> As discussed this is done as part of the contpte patch set, but its not just a
> >> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
> >
> > right, i have read all your unfold and fold stuff today, now i understand your
> > approach seems quite nice!
>
> Great - thanks!
>
> >
> >
> >>
> >>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
> >>
> >> This is going to be a problem but I haven't even looked at this properly yet.
> >> The advice so far has been to continue to swap-in small pages only, but improve
> >> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> >> understand how you did this.
> >
> > this is also crucial to android phone as swap is always happening
> > on an embedded device. if we don't support large folios in swapin,
> > our large folios will never come back after it is swapped-out.
> >
> > and i hated the collapse solution from the first beginning as there is
> > never a guarantee to succeed and its overhead is unacceptable to user UI,
> > so we supported hugepage allocation in do_swap_page from the first beginning.
>
> Understood. I agree it would be nice to preserve large folios across swap. I
> think this can be layered on top of the current work though.

This will be my first priority for using your large folio code on phones. We
need a patchset on top of yours :-)

Without it, we will likely fail. Typically, one phone can have 4~8GB of zRAM
to compress a lot of anon pages; with a compression ratio of 1:4, that
corresponds to 16~32GB of uncompressed anon pages, which is much, much more.
Thus, when a background app is switched back to the foreground, we need those
swapped-out large folios back rather than getting small basepages as
replacements. Swapping in basepages is definitely not going to work well on a
phone, and neither does THP collapse.

>
> >
> >>
> >>>
> >>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >>>
> >>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
> >>
> >> I think this is all naturally handled by the folio code that exists in modern
> >> kernels?
> >
> > We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
> > reclaim large folios to the pool. as phones are running lots of apps
> > and drivers, and the memory is very limited, after a couple of hours,
> > it will become very hard to allocate large folios in the original buddy. thus,
> > large folios totally disappeared after running the phone for some time
> > if we didn't have the pool.
> >
> >>
> >>>
> >>> So we are 100% interested in your patchset and hope it can find a way
> >>> to land on the
> >>> mainline, thus decreasing all the cost we have to maintain out-of-tree
> >>> code from a
> >>> kernel to another kernel version which we have done on a couple of
> >>> kernel versions
> >>> before 5.16. Firmly, we are 100% supportive of large anon folios
> >>> things you are leading.
> >>
> >> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> >> it closer :). If you had any ability to do any A/B performance testing, it would
> >> be very interesting to see how this stacks up against your solution - if there
> >> are gaps it would be good to know where and develop a plan to plug the gap.
> >>
> >
> > sure.
> >
> >>>
> >>> A big pain was we found lots of races especially on CONTPTE unfolding
> >>> and especially a part
> >>> of basepages ran away from the 16 CONPTEs group since userspace is
> >>> always working
> >>> on basepages, having no idea of small-THP. We ran our code on millions of
> >>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
> >>> no outstanding issue.
> >>
> >> I'm going to be brave and say that my solution shouldn't suffer from these
> >> problems; but of course the proof is only in the testing. I did a lot of work
> >> with our architecture group and micro architects to determine exactly what is
> >> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
> >> optimization in patch 13 (see the commit log for details). Of course this has
> >> all been checked with partners and we are confident that all existing
> >> implementations conform to the modified wording.
> >
> > cool. I like your try_unfold/fold code. it seems your code is setting/dropping
> > CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
> > our code is always stupidly checking some conditions before setting and dropping
> > CONT everywhere.
> >
> >>
> >>>
> >>> Particularly for the rmap issue we are discussing, our out-of-tree is
> >>> using the entire_map for
> >>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> >>> CONTPTE from mm-core.
> >>>
> >>> We are doing this in mm/memory.c
> >>>
> >>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> >>> vm_area_struct *src_vma,
> >>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> >>> struct page **prealloc)
> >>> {
> >>> struct mm_struct *src_mm = src_vma->vm_mm;
> >>> unsigned long vm_flags = src_vma->vm_flags;
> >>> pte_t pte = *src_pte;
> >>> struct page *page;
> >>>
> >>> page = vm_normal_page(src_vma, addr, pte);
> >>> ...
> >>>
> >>> get_page(page);
> >>> page_dup_rmap(page, true); // an entire dup_rmap as you can
> >>> see.............
> >>> rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> >>> }
> >>>
> >>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> >>>
> >>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> >>> unsigned long haddr, bool freeze)
> >>> {
> >>> ...
> >>> if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> >>> for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> >>> atomic_inc(&head[i]._mapcount);
> >>> atomic_long_inc(&cont_pte_double_map_count);
> >>> }
> >>>
> >>>
> >>> if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> >>> ...
> >>> }
> >>>
> >>> I am not selling our solution any more, but just showing you some differences we
> >>> have :-)
> >>
> >> OK, I understand what you were saying now. I'm currently struggling to see how
> >> this could fit into my model. Do you have any workloads and numbers on perf
> >> improvement of using entire_mapcount?
> >
> > TBH, I don't have any data on this as from the first beginning, we were using
> > entire_map. So I have no comparison at all.
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> BTW, I have concerns that a variable small-THP size will really work
> >>>>> as userspace
> >>>>> is probably friendly to only one fixed size. for example, userspace
> >>>>> heap management
> >>>>> might be optimized to a size for freeing memory to the kernel. it is
> >>>>> very difficult
> >>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
> >>>>> size not equal with, and particularly smaller than small-THP size will
> >>>>> defeat all
> >>>>> efforts to use small-THP.
> >>>>
> >>>> I'll admit to not knowing a huge amount about user space allocators. But I will
> >>>> say that as currently defined, the small-sized THP interface to user space
> >>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >>>> single size can be enabled. I'm diliberately punting that decision away from the
> >>>> kernel for now.
> >>>
> >>> Basically, userspace heap library has a PAGESIZE setting and allows users
> >>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> >>> The default size is for sure equal to the basepage SIZE. once some objects are
> >>> freed by free() and libc get a free "page", userspace heap libraries might free
> >>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> >>> it is quite similar with kernel slab.
> >>>
> >>> so imagine we have small-THP now, but userspace libraries have *NO*
> >>> idea at all, so it can frequently cause unfolding.
> >>>
> >>>>
> >>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> >>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>>>
> >>>> Functionally, it will not matter if the allocator is not enlightened for the THP
> >>>> size; it can continue to free, and if a partial folio is unmapped it is put on
> >>>> the deferred split list, then under memory pressure it is split and the unused
> >>>> pages are reclaimed. I guess this is the bit you are concerned about having a
> >>>> performance impact?
> >>>
> >>> right. If this is happening on the majority of small-THP folios, we
> >>> don't have performance
> >>> improvement, and probably regression instead. This is really true on
> >>> real workloads!!
> >>>
> >>> So that is why we really love a per-VMA hint to enable small-THP but
> >>> obviously you
> >>> have already supported it now by
> >>> mm: thp: Introduce per-size thp sysfs interface
> >>> https://lore.kernel.org/linux-mm/[email protected]/
> >>>
> >>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> >>> can set the VMA flag when it is quite sure this VMA is working with
> >>> the alignment
> >>> of 64KB?
> >>
> >> Yes, that all exists in the series today. We have also discussed the possibility
> >> of adding a new madvise_process() call that would take the set of THP sizes that
> >> should be considered. Then you can set different VMAs to use different sizes;
> >> the plan was to layer that on top if/when a workload was identified. Sounds like
> >> you might be able to help there?
> >
> > i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
> > for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
> > set a flag in this VMA and try to allocate 64KB.
>
> When you say "we set a flag" do you mean user space? Or is there some heuristic
> in the kernel?

We are using a field the Android kernel adds to the vma struct to mark that
this vma is all good to use CONTPTE. With the upstream solution you are
providing, we can remove this dirty code[1]:

static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
{
	return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
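
For comparison, a minimal sketch of what the upstream-style replacement could
look like from user space, assuming the per-size sysfs interface linked above
is set to "madvise" for the 64K size only: the existing MADV_HUGEPAGE hint
opts the VMA in, with no out-of-tree vma field needed (the buffer size below
is purely illustrative).

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16 * 64 * 1024;	/* illustrative: 16 x 64K */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * Hint that this VMA is a good THP candidate; with only the 64K size
	 * enabled in "madvise" mode, it becomes eligible for 64K folios.
	 */
	madvise(buf, len, MADV_HUGEPAGE);

	/* ... touch buf ... */

	munmap(buf, len);
	return 0;
}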

Thanks
Barry

2023-11-29 12:22:25

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 21:06, Barry Song wrote:
> On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 28/11/2023 09:49, Barry Song wrote:
>>> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 20:34, Barry Song wrote:
>>>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 27/11/2023 10:28, Barry Song wrote:
>>>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>>>>>>>> + if (anon) {
>>>>>>>>>>>>> + /*
>>>>>>>>>>>>> + * If this page may have been pinned by the
>>>>>>>>>>>>> + * parent process, copy the page immediately for
>>>>>>>>>>>>> + * the child so that we'll always guarantee the
>>>>>>>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>>>>>>>> + * future.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>>>>>> + page, false, src_vma))) {
>>>>>>>>>>>>> + if (i != 0)
>>>>>>>>>>>>> + break;
>>>>>>>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>>>>>>>> + return copy_present_page(
>>>>>>>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>>>>>>>> + page);
>>>>>>>>>>>>> + }
>>>>>>>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>>>>>> + } else {
>>>>>>>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>>>>>>>> + }
>>>>>>>>>>>>> }
>>>>>>>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>>>>>>>> - } else if (page) {
>>>>>>>>>>>>> - folio_get(folio);
>>>>>>>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + nr = i;
>>>>>>>>>>>>> + folio_ref_add(folio, nr);
>>>>>>>>>>>>
>>>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>>>>>
>>>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>>>>>> pages are the corner case.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>>>>>> in each basepage.
>>>>>>>>>>
>>>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>>>>>> atomic, so we can account the entire thing.
>>>>>>>>>
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>>>>>> process A with CONPTE,
>>>>>>>>> and only page2 is mapped in process B.
>>>>>>>>> then we will have
>>>>>>>>>
>>>>>>>>> entire_map = 0
>>>>>>>>> page0.map = -1
>>>>>>>>> page1.map = -1
>>>>>>>>> page2.map = 0
>>>>>>>>> page3.map = -1
>>>>>>>>> ....
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>>>>>> split.
>>>>>>>>>>>
>>>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>>>>>
>>>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>>>>>
>>>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>>>>>> =0? (start from -1).
>>>>>>>>>>>
>>>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>>>>>> which makes that code easier.
>>>>>>>>>>
>>>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>>>>>> unmap the whole folio atomically.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>>>>>> it is partially
>>>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>>>>>> DoubleMapped.
>>>>>>>>
>>>>>>>> There are 2 problems with your proposal, as I see it;
>>>>>>>>
>>>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>>>>>> the CONT_PTE bit.
>>>>>>>>
>>>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>>>>>> be unmapped unatomically.
>>>>>>>>
>>>>>>>> For the PMD case there are actually 2 properties that allow using the
>>>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>>>>>
>>>>>>> well. Thanks for clarification. based on the above description, i agree the
>>>>>>> current code might make more sense by always using mapcount in subpage.
>>>>>>>
>>>>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
>>>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>>>>>> entirely, we only
>>>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>>>>>
>>>>>> Well its always good to have the discussion - so thanks for the ideas. I think
>>>>>> there is a bigger question lurking here; should we be exposing the concept of
>>>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>>>>>> I'm confident that would be a huge amount of effort and the end result would be
>>>>>> similar performace to what this approach gives. One potential benefit of letting
>>>>>> core-mm control it is that it would also give control to core-mm over the
>>>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
>>>>>> folio). Having sub-folio access tracking _could_ potentially help with future
>>>>>> work to make THP size selection automatic, but we are not there yet, and I think
>>>>>> there are other (simpler) ways to achieve the same thing. So my view is that
>>>>>> _not_ exposing it to core-mm is the right way for now.
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> We(OPPO) started a similar project like you even before folio was imported to
>>>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
>>>>> on millions of mobile phones on real products and kernels before 5.16, making
>>>>> a huge success on performance improvement. for example, you may
>>>>> find the out-of-tree 5.15 source code here
>>>>
>>>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
>>>> embarrassed that I clearly didn't do enough research on the prior art because I
>>>> wasn't aware of your work. So sorry about that.
>>>>
>>>> I sensed that you had a different model for how this should work vs what I've
>>>> implemented and now I understand why :). I'll review your stuff and I'm sure
>>>> I'll have questions. I'm sure each solution has pros and cons.
>>>>
>>>>
>>>>>
>>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>>>
>>>>> Our modification might not be so clean and has lots of workarounds
>>>>> just for the stability of products
>>>>>
>>>>> We mainly have
>>>>>
>>>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>>>>>
>>>>> some CONTPTE helpers
>>>>>
>>>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>>>>>
>>>>> some Dynamic Hugepage APIs
>>>>>
>>>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>>>>>
>>>>> modified all page faults to support
>>>>> (1). allocation of hugepage of 64KB in do_anon_page
>>>>
>>>> My Small-Sized THP patch set is handling the equivalent of this.
>>>
>>> right, the only difference is that we did a huge-zeropage for reading
>>> in do_anon_page.
>>> mapping all large folios to CONTPTE to zero page.
>>
>> FWIW, I took a slightly different approach in my original RFC for the zero page
>> - although I ripped it all out to simplify for the initial series. I found that
>> it was pretty rare for user space to read multiple consecutive pages without
>> ever interleving any writes, so I kept the zero page as a base page, but at CoW,
>> I would expand the allocation to an approprately sized THP. But for the couple
>> of workloads that I've gone deep with, I found that it made barely any dent on
>> the amount of memory that ended up contpte-mapped; the vast majority was from
>> write allocation in do_anonymous_page().
>
> the problem is even if there is only one page read in 16 ptes, you
> will map the page to
> zero basepage. then while you write another page in these 16 ptes, you
> lose the chance
> to become large folio as pte_range_none() becomes false.
>
> if we map these 16ptes to contpte zero page, in do_wp_page, we have a
> good chance
> to CoW and get a large anon folio.

Yes understood. I think we are a bit off-topic for this patch set though.
Small-sized THP zero pages can be tackled as a separate series once these
initial series are in. I'd be happy to review a small-sized THP zero page post :)

>
>>
>>>
>>>>
>>>>> (2). CoW hugepage in do_wp_page
>>>>
>>>> This isn't handled yet in my patch set; the original RFC implemented it but I
>>>> removed it in order to strip back to the essential complexity for the initial
>>>> submission. DavidH has been working on a precise shared vs exclusive map
>>>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
>>>> Out of interest, what workloads benefit most from this?
>>>
>>> as a phone, Android has a design almost all processes are forked from zygote.
>>> thus, CoW happens quite often to all apps.
>>
>> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
>> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
>> there are cases where that conclusion is wrong.
>
> CoW is much less than do_anon_page on my phone which is running dynamic
> hugepage for a couple of hours:
>
> OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
> ...
> thp_cow 34669 ---- CoW a large folio
> thp_do_anon_pages 1032362 ----- a large folio in do_anon_page
> ...
>
> so it is around 34669/1032362 = 3.35%.

Well, it's actually 34669 / (34669 + 1032362) = 3.25%. But, yes, the point is
that very few large folios are lost due to CoW, so there is likely to be little
perf impact. Again, I'd happily review a series that enables this!
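
For reference, the two figures differ only in the choice of denominator (CoW
events relative to anon-fault allocations alone, vs. relative to all large
folios allocated):

\[
\frac{34669}{1032362} \approx 3.36\%,
\qquad
\frac{34669}{34669 + 1032362} \approx 3.25\%
\]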

>
>>
>>>
>>>>
>>>>> (3). copy CONPTEs in copy_pte_range
>>>>
>>>> As discussed this is done as part of the contpte patch set, but its not just a
>>>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
>>>
>>> right, i have read all your unfold and fold stuff today, now i understand your
>>> approach seems quite nice!
>>
>> Great - thanks!
>>
>>>
>>>
>>>>
>>>>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
>>>>
>>>> This is going to be a problem but I haven't even looked at this properly yet.
>>>> The advice so far has been to continue to swap-in small pages only, but improve
>>>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
>>>> understand how you did this.
>>>
>>> this is also crucial to android phone as swap is always happening
>>> on an embedded device. if we don't support large folios in swapin,
>>> our large folios will never come back after it is swapped-out.
>>>
>>> and i hated the collapse solution from the first beginning as there is
>>> never a guarantee to succeed and its overhead is unacceptable to user UI,
>>> so we supported hugepage allocation in do_swap_page from the first beginning.
>>
>> Understood. I agree it would be nice to preserve large folios across swap. I
>> think this can be layered on top of the current work though.
>
> This will be my first priority to use your large folio code on phones.
> We need a patchset
> on top of yours :-)
>
> without it, we will likely fail. Typically, one phone can have a 4~8GB
> zRAM to compress
> a lot of anon pages, if the compression ratio is 1:4, that means
> uncompressed anon
> pages are much much more. Thus, while the background app is switched back
> to foreground, we need those swapped-out large folios back rather than getting
> small basepages replacement. swap-in basepage is definitely not going to
> work well on a phone, neither does THP collapse.

Yep understood. From the other thread, it sounds like you are preparing a series
for large-folio swap-in - looking forward to seeing it!

>
>>
>>>
>>>>
>>>>>
>>>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
>>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>>>>>
>>>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>>>>
>>>> I think this is all naturally handled by the folio code that exists in modern
>>>> kernels?
>>>
>>> We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
>>> reclaim large folios to the pool. as phones are running lots of apps
>>> and drivers, and the memory is very limited, after a couple of hours,
>>> it will become very hard to allocate large folios in the original buddy. thus,
>>> large folios totally disappeared after running the phone for some time
>>> if we didn't have the pool.
>>>
>>>>
>>>>>
>>>>> So we are 100% interested in your patchset and hope it can find a way
>>>>> to land on the
>>>>> mainline, thus decreasing all the cost we have to maintain out-of-tree
>>>>> code from a
>>>>> kernel to another kernel version which we have done on a couple of
>>>>> kernel versions
>>>>> before 5.16. Firmly, we are 100% supportive of large anon folios
>>>>> things you are leading.
>>>>
>>>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
>>>> it closer :). If you had any ability to do any A/B performance testing, it would
>>>> be very interesting to see how this stacks up against your solution - if there
>>>> are gaps it would be good to know where and develop a plan to plug the gap.
>>>>
>>>
>>> sure.
>>>
>>>>>
>>>>> A big pain was we found lots of races especially on CONTPTE unfolding
>>>>> and especially a part
>>>>> of basepages ran away from the 16 CONPTEs group since userspace is
>>>>> always working
>>>>> on basepages, having no idea of small-THP. We ran our code on millions of
>>>>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
>>>>> no outstanding issue.
>>>>
>>>> I'm going to be brave and say that my solution shouldn't suffer from these
>>>> problems; but of course the proof is only in the testing. I did a lot of work
>>>> with our architecture group and micro architects to determine exactly what is
>>>> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
>>>> optimization in patch 13 (see the commit log for details). Of course this has
>>>> all been checked with partners and we are confident that all existing
>>>> implementations conform to the modified wording.
>>>
>>> cool. I like your try_unfold/fold code. it seems your code is setting/dropping
>>> CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
>>> our code is always stupidly checking some conditions before setting and dropping
>>> CONT everywhere.
>>>
>>>>
>>>>>
>>>>> Particularly for the rmap issue we are discussing, our out-of-tree is
>>>>> using the entire_map for
>>>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
>>>>> CONTPTE from mm-core.
>>>>>
>>>>> We are doing this in mm/memory.c
>>>>>
>>>>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
>>>>> vm_area_struct *src_vma,
>>>>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>> struct page **prealloc)
>>>>> {
>>>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>>>> unsigned long vm_flags = src_vma->vm_flags;
>>>>> pte_t pte = *src_pte;
>>>>> struct page *page;
>>>>>
>>>>> page = vm_normal_page(src_vma, addr, pte);
>>>>> ...
>>>>>
>>>>> get_page(page);
>>>>> page_dup_rmap(page, true); // an entire dup_rmap as you can
>>>>> see.............
>>>>> rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
>>>>> }
>>>>>
>>>>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>>>>>
>>>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>>>>> unsigned long haddr, bool freeze)
>>>>> {
>>>>> ...
>>>>> if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
>>>>> for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
>>>>> atomic_inc(&head[i]._mapcount);
>>>>> atomic_long_inc(&cont_pte_double_map_count);
>>>>> }
>>>>>
>>>>>
>>>>> if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
>>>>> ...
>>>>> }
>>>>>
>>>>> I am not selling our solution any more, but just showing you some differences we
>>>>> have :-)
>>>>
>>>> OK, I understand what you were saying now. I'm currently struggling to see how
>>>> this could fit into my model. Do you have any workloads and numbers on perf
>>>> improvement of using entire_mapcount?
>>>
>>> TBH, I don't have any data on this as from the first beginning, we were using
>>> entire_map. So I have no comparison at all.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> BTW, I have concerns that a variable small-THP size will really work
>>>>>>> as userspace
>>>>>>> is probably friendly to only one fixed size. for example, userspace
>>>>>>> heap management
>>>>>>> might be optimized to a size for freeing memory to the kernel. it is
>>>>>>> very difficult
>>>>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>>>>>> size not equal with, and particularly smaller than small-THP size will
>>>>>>> defeat all
>>>>>>> efforts to use small-THP.
>>>>>>
>>>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
>>>>>> say that as currently defined, the small-sized THP interface to user space
>>>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>>>>>> single size can be enabled. I'm diliberately punting that decision away from the
>>>>>> kernel for now.
>>>>>
>>>>> Basically, userspace heap library has a PAGESIZE setting and allows users
>>>>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
>>>>> The default size is for sure equal to the basepage SIZE. once some objects are
>>>>> freed by free() and libc get a free "page", userspace heap libraries might free
>>>>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
>>>>> it is quite similar with kernel slab.
>>>>>
>>>>> so imagine we have small-THP now, but userspace libraries have *NO*
>>>>> idea at all, so it can frequently cause unfolding.
>>>>>
>>>>>>
>>>>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>>>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>>>>>
>>>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
>>>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
>>>>>> the deferred split list, then under memory pressure it is split and the unused
>>>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
>>>>>> performance impact?
>>>>>
>>>>> right. If this is happening on the majority of small-THP folios, we
>>>>> don't have performance
>>>>> improvement, and probably regression instead. This is really true on
>>>>> real workloads!!
>>>>>
>>>>> So that is why we really love a per-VMA hint to enable small-THP but
>>>>> obviously you
>>>>> have already supported it now by
>>>>> mm: thp: Introduce per-size thp sysfs interface
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
>>>>> can set the VMA flag when it is quite sure this VMA is working with
>>>>> the alignment
>>>>> of 64KB?
>>>>
>>>> Yes, that all exists in the series today. We have also discussed the possibility
>>>> of adding a new madvise_process() call that would take the set of THP sizes that
>>>> should be considered. Then you can set different VMAs to use different sizes;
>>>> the plan was to layer that on top if/when a workload was identified. Sounds like
>>>> you might be able to help there?
>>>
>>> i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
>>> for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
>>> set a flag in this VMA and try to allocate 64KB.
>>
>> When you say "we set a flag" do you mean user space? Or is there some heuristic
>> in the kernel?
>
> we are using a field extended by the android kernel in vma struct to
> mark this vma
> is all good to use CONTPTE. With the upstream solution you are providing, we can
> remove this dirty code[1].
> static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
> {
> return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
> }

Sorry, I'm not sure I've understood; how does that flag get set in the first
place? Does user space tell the kernel (via e.g. madvise()) or does the kernel
set it based on some heuristic?

>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
>
> Thanks
> Barry

2023-11-29 12:29:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 19:00, Barry Song wrote:
> On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 28/11/2023 00:11, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 05:54, Barry Song wrote:
>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>>>> + pte_t *dst_pte, pte_t *src_pte,
>>>>>> + unsigned long addr, unsigned long end,
>>>>>> + int *rss, struct folio **prealloc)
>>>>>> {
>>>>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>> unsigned long vm_flags = src_vma->vm_flags;
>>>>>> pte_t pte = ptep_get(src_pte);
>>>>>> struct page *page;
>>>>>> struct folio *folio;
>>>>>> + int nr = 1;
>>>>>> + bool anon;
>>>>>> + bool any_dirty = pte_dirty(pte);
>>>>>> + int i;
>>>>>>
>>>>>> page = vm_normal_page(src_vma, addr, pte);
>>>>>> - if (page)
>>>>>> + if (page) {
>>>>>> folio = page_folio(page);
>>>>>> - if (page && folio_test_anon(folio)) {
>>>>>> - /*
>>>>>> - * If this page may have been pinned by the parent process,
>>>>>> - * copy the page immediately for the child so that we'll always
>>>>>> - * guarantee the pinned page won't be randomly replaced in the
>>>>>> - * future.
>>>>>> - */
>>>>>> - folio_get(folio);
>>>>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>> - /* Page may be pinned, we have to copy. */
>>>>>> - folio_put(folio);
>>>>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>> - addr, rss, prealloc, page);
>>>>>> + anon = folio_test_anon(folio);
>>>>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>> + end, pte, &any_dirty);
>>>>>
>>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
>>>>> do madvise(addr + 4KB * 5, DONTNEED);
>>>>
>>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
>>>> page0~page4 and page6~15?
>>>>
>>>>>
>>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
>>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
>>>>
>>>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
>>>> not how its intended to work. The function is scanning forwards from the current
>>>> pte until it finds the first pte that does not fit in the batch - either because
>>>> it maps a PFN that is not contiguous, or because the permissions are different
>>>> (although this is being relaxed a bit; see conversation with DavidH against this
>>>> same patch).
>>>>
>>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
>>>> (page0~page4) then the next time through the loop we will go through the
>>>> !present path and process the single swap marker. Then the 3rd time through the
>>>> loop folio_nr_pages_cont_mapped() will return 10.
>>>
>>> one case we have met by running hundreds of real phones is as below,
>>>
>>>
>>> static int
>>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>>> unsigned long end)
>>> {
>>> ...
>>> dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
>>> if (!dst_pte) {
>>> ret = -ENOMEM;
>>> goto out;
>>> }
>>> src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
>>> if (!src_pte) {
>>> pte_unmap_unlock(dst_pte, dst_ptl);
>>> /* ret == 0 */
>>> goto out;
>>> }
>>> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>> orig_src_pte = src_pte;
>>> orig_dst_pte = dst_pte;
>>> arch_enter_lazy_mmu_mode();
>>>
>>> do {
>>> /*
>>> * We are holding two locks at this point - either of them
>>> * could generate latencies in another task on another CPU.
>>> */
>>> if (progress >= 32) {
>>> progress = 0;
>>> if (need_resched() ||
>>> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>> break;
>>> }
>>> ptent = ptep_get(src_pte);
>>> if (pte_none(ptent)) {
>>> progress++;
>>> continue;
>>> }
>>>
>>> the above iteration can break when progress > =32. for example, at the
>>> beginning,
>>> if all PTEs are none, we break when progress >=32, and we break when we
>>> are in the 8th pte of 16PTEs which might become CONTPTE after we release
>>> PTL.
>>>
>>> since we are releasing PTLs, next time when we get PTL, those pte_none() might
>>> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
>>> thus, immediately
>>> break the consistent CONPTEs rule of hardware?
>>>
>>> pte0 - pte_none
>>> pte1 - pte_none
>>> ...
>>> pte7 - pte_none
>>>
>>> pte8 - pte_cont
>>> ...
>>> pte15 - pte_cont
>>>
>>> so we did some modification to avoid a break in the middle of PTEs
>>> which can potentially
>>> become CONTPE.
>>> do {
>>> /*
>>> * We are holding two locks at this point - either of them
>>> * could generate latencies in another task on another CPU.
>>> */
>>> if (progress >= 32) {
>>> progress = 0;
>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
>>> /*
>>> * XXX: don't release ptl at an unligned address as
>>> cont_pte might form while
>>> * ptl is released, this causes double-map
>>> */
>>> if (!vma_is_chp_anonymous(src_vma) ||
>>> (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
>>> HPAGE_CONT_PTE_SIZE)))
>>> #endif
>>> if (need_resched() ||
>>> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>> break;
>>> }
>>>
>>> We could only reproduce the above issue by running thousands of phones.
>>>
>>> Does your code survive from this problem?
>>
>> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
>> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
>> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
>> will see some pte_none() entries followed by some pte_present() entries. And
>> when calling set_ptes() on the child, the arch code will evaluate the current
>> state of the pgtable along with the new set_ptes() request and determine where
>> it should insert the CONT_PTE bit.
>
> yep, i have read very carefully and think your code is safe here. The
> only problem
> is that the code can randomly unfold parent processes' CONPTE while setting
> wrprotect in the middle of a large folio while it actually should keep CONT
> bit as all PTEs can be still consistent if we set protect from the 1st PTE.
>
> while A forks B, progress >= 32 might interrupt in the middle of a
> new CONTPTE folio which is forming, as we have to set wrprotect to parent A,
> this parent immediately loses CONT bit. this is sad. but i can't find a
> good way to resolve it unless CONT is exposed to mm-core. any idea on
> this?

No, this is not the case; copy_present_ptes() will copy as many ptes as are
physically contiguous and belong to the same folio (which usually means "the
whole folio" - the only time it doesn't is when we hit the end of the vma). We
will then return to the main loop and move forward by the number of ptes that
were serviced, including:

progress += 8 * ret;

That might go above 32, so we will briefly drop and re-take the lock. But we
will never have done that in the middle of a large folio, so the contpte-ness
should be preserved.
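
As an illustration of that flow, here is a simplified sketch (not the actual
patch code, and reusing the surrounding declarations from copy_pte_range())
of the loop shape being described: 'progress' is only re-checked after a whole
batch has been serviced, so the ptl can only be dropped on a batch boundary,
never in the middle of a large folio.

	int progress = 0, ret;

	do {
		if (progress >= 32) {
			progress = 0;
			/* Only ever reached on a batch boundary. */
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}

		ret = 1;	/* default: one pte serviced this iteration */
		if (pte_none(ptep_get(src_pte))) {
			progress++;
			continue;
		}

		/*
		 * Services the whole batch of ptes that are physically
		 * contiguous and belong to the same folio; returns the number
		 * of ptes handled.
		 */
		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
					addr, end, rss, &prealloc);
		if (ret < 0)
			break;

		progress += 8 * ret;
	} while (dst_pte += ret, src_pte += ret,
		 addr += PAGE_SIZE * ret, addr != end);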

>
> Our code[1] resolves this by only breaking at the aligned address
>
> if (progress >= 32) {
> progress = 0;
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> /*
> * XXX: don't release ptl at an unligned address as cont_pte
> might form while
> * ptl is released, this causes double-map
> */
> if (!vma_is_chp_anonymous(src_vma) ||
> (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> HPAGE_CONT_PTE_SIZE)))
> #endif
> if (need_resched() ||
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180
>
>
> Thanks
> Barry

2023-11-30 00:52:00

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Thu, Nov 30, 2023 at 1:21 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 21:06, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 28/11/2023 09:49, Barry Song wrote:
> >>> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 20:34, Barry Song wrote:
> >>>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 10:28, Barry Song wrote:
> >>>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>>>>>>>> + if (anon) {
> >>>>>>>>>>>>> + /*
> >>>>>>>>>>>>> + * If this page may have been pinned by the
> >>>>>>>>>>>>> + * parent process, copy the page immediately for
> >>>>>>>>>>>>> + * the child so that we'll always guarantee the
> >>>>>>>>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>>>>>>>> + * future.
> >>>>>>>>>>>>> + */
> >>>>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>>>>>> + page, false, src_vma))) {
> >>>>>>>>>>>>> + if (i != 0)
> >>>>>>>>>>>>> + break;
> >>>>>>>>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>>>>>>>> + return copy_present_page(
> >>>>>>>>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>>>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>>>>>>>> + page);
> >>>>>>>>>>>>> + }
> >>>>>>>>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>>>>>> + } else {
> >>>>>>>>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>>>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>>>>>>>> + }
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>>>>>>>> - } else if (page) {
> >>>>>>>>>>>>> - folio_get(folio);
> >>>>>>>>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>>>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> + nr = i;
> >>>>>>>>>>>>> + folio_ref_add(folio, nr);
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>>>>>
> >>>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>>>>>> pages are the corner case.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>>>>>>>> in each basepage.
> >>>>>>>>>>
> >>>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>>>>>> atomic, so we can account the entire thing.
> >>>>>>>>>
> >>>>>>>>> Hi Ryan,
> >>>>>>>>>
> >>>>>>>>> There is no problem. for example, a large folio is entirely mapped in
> >>>>>>>>> process A with CONPTE,
> >>>>>>>>> and only page2 is mapped in process B.
> >>>>>>>>> then we will have
> >>>>>>>>>
> >>>>>>>>> entire_map = 0
> >>>>>>>>> page0.map = -1
> >>>>>>>>> page1.map = -1
> >>>>>>>>> page2.map = 0
> >>>>>>>>> page3.map = -1
> >>>>>>>>> ....
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>>>>>> split.
> >>>>>>>>>>>
> >>>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>>>>>
> >>>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>>>>>
> >>>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>>>>>> =0? (start from -1).
> >>>>>>>>>>>
> >>>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>>>>>> which makes that code easier.
> >>>>>>>>>>
> >>>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>>>>>> unmap the whole folio atomically.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>>>>>> it is partially
> >>>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>>>>>> DoubleMapped.
> >>>>>>>>
> >>>>>>>> There are 2 problems with your proposal, as I see it;
> >>>>>>>>
> >>>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>>>>>> the CONT_PTE bit.
> >>>>>>>>
> >>>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>>>>>> be unmapped unatomically.
> >>>>>>>>
> >>>>>>>> For the PMD case there are actually 2 properties that allow using the
> >>>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>>>>>
> >>>>>>> well. Thanks for clarification. based on the above description, i agree the
> >>>>>>> current code might make more sense by always using mapcount in subpage.
> >>>>>>>
> >>>>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
> >>>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>>>>>> entirely, we only
> >>>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>>>>>
> >>>>>> Well its always good to have the discussion - so thanks for the ideas. I think
> >>>>>> there is a bigger question lurking here; should we be exposing the concept of
> >>>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >>>>>> I'm confident that would be a huge amount of effort and the end result would be
> >>>>>> similar performace to what this approach gives. One potential benefit of letting
> >>>>>> core-mm control it is that it would also give control to core-mm over the
> >>>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
> >>>>>> folio). Having sub-folio access tracking _could_ potentially help with future
> >>>>>> work to make THP size selection automatic, but we are not there yet, and I think
> >>>>>> there are other (simpler) ways to achieve the same thing. So my view is that
> >>>>>> _not_ exposing it to core-mm is the right way for now.
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> We(OPPO) started a similar project like you even before folio was imported to
> >>>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
> >>>>> on millions of mobile phones on real products and kernels before 5.16, making
> >>>>> a huge success on performance improvement. for example, you may
> >>>>> find the out-of-tree 5.15 source code here
> >>>>
> >>>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> >>>> embarrassed that I clearly didn't do enough research on the prior art because I
> >>>> wasn't aware of your work. So sorry about that.
> >>>>
> >>>> I sensed that you had a different model for how this should work vs what I've
> >>>> implemented and now I understand why :). I'll review your stuff and I'm sure
> >>>> I'll have questions. I'm sure each solution has pros and cons.
> >>>>
> >>>>
> >>>>>
> >>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>>>
> >>>>> Our modification might not be so clean and has lots of workarounds
> >>>>> just for the stability of products
> >>>>>
> >>>>> We mainly have
> >>>>>
> >>>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >>>>>
> >>>>> some CONTPTE helpers
> >>>>>
> >>>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >>>>>
> >>>>> some Dynamic Hugepage APIs
> >>>>>
> >>>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >>>>>
> >>>>> modified all page faults to support
> >>>>> (1). allocation of hugepage of 64KB in do_anon_page
> >>>>
> >>>> My Small-Sized THP patch set is handling the equivalent of this.
> >>>
> >>> right, the only difference is that we did a huge-zeropage for reading
> >>> in do_anon_page.
> >>> mapping all large folios to CONTPTE to zero page.
> >>
> >> FWIW, I took a slightly different approach in my original RFC for the zero page
> >> - although I ripped it all out to simplify for the initial series. I found that
> >> it was pretty rare for user space to read multiple consecutive pages without
> >> ever interleving any writes, so I kept the zero page as a base page, but at CoW,
> >> I would expand the allocation to an approprately sized THP. But for the couple
> >> of workloads that I've gone deep with, I found that it made barely any dent on
> >> the amount of memory that ended up contpte-mapped; the vast majority was from
> >> write allocation in do_anonymous_page().
> >
> > the problem is even if there is only one page read in 16 ptes, you
> > will map the page to
> > zero basepage. then while you write another page in these 16 ptes, you
> > lose the chance
> > to become large folio as pte_range_none() becomes false.
> >
> > if we map these 16ptes to contpte zero page, in do_wp_page, we have a
> > good chance
> > to CoW and get a large anon folio.
>
> Yes understood. I think we are a bit off-topic for this patch set though.
> small-sized THP zero pages can be tackled as a separate series once these
> initial series are in. I'd be happy to review a small-sized THP zero page post :)

I agree this can be deferred. Right now our first priority is the swap-in
series, so I can't give a timeline for when we can send a small-sized THP zero
page series.

>
> >
> >>
> >>>
> >>>>
> >>>>> (2). CoW hugepage in do_wp_page
> >>>>
> >>>> This isn't handled yet in my patch set; the original RFC implemented it but I
> >>>> removed it in order to strip back to the essential complexity for the initial
> >>>> submission. DavidH has been working on a precise shared vs exclusive map
> >>>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> >>>> Out of interest, what workloads benefit most from this?
> >>>
> >>> as a phone, Android has a design almost all processes are forked from zygote.
> >>> thus, CoW happens quite often to all apps.
> >>
> >> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
> >> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
> >> there are cases where that conclusion is wrong.
> >
> > CoW is much less than do_anon_page on my phone which is running dynamic
> > hugepage for a couple of hours:
> >
> > OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
> > ...
> > thp_cow 34669 ---- CoW a large folio
> > thp_do_anon_pages 1032362 ----- a large folio in do_anon_page
> > ...
> >
> > so it is around 34669/1032362 = 3.35%.
>
> well its actually 34669 / (34669 + 1032362) = 3.25%. But, yes, the point is that
> very few of large folios are lost due to CoW so there is likely to be little
> perf impact. Again, I'd happily review a series that enables this!

Right, same as above.

>
> >
> >>
> >>>
> >>>>
> >>>>> (3). copy CONPTEs in copy_pte_range
> >>>>
> >>>> As discussed this is done as part of the contpte patch set, but its not just a
> >>>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
> >>>
> >>> right, i have read all your unfold and fold stuff today, now i understand your
> >>> approach seems quite nice!
> >>
> >> Great - thanks!
> >>
> >>>
> >>>
> >>>>
> >>>>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
> >>>>
> >>>> This is going to be a problem but I haven't even looked at this properly yet.
> >>>> The advice so far has been to continue to swap-in small pages only, but improve
> >>>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> >>>> understand how you did this.
> >>>
> >>> this is also crucial to android phone as swap is always happening
> >>> on an embedded device. if we don't support large folios in swapin,
> >>> our large folios will never come back after it is swapped-out.
> >>>
> >>> and i hated the collapse solution from the first beginning as there is
> >>> never a guarantee to succeed and its overhead is unacceptable to user UI,
> >>> so we supported hugepage allocation in do_swap_page from the first beginning.
> >>
> >> Understood. I agree it would be nice to preserve large folios across swap. I
> >> think this can be layered on top of the current work though.
> >
> > This will be my first priority to use your large folio code on phones.
> > We need a patchset
> > on top of yours :-)
> >
> > without it, we will likely fail. Typically, one phone can have a 4~8GB
> > zRAM to compress
> > a lot of anon pages, if the compression ratio is 1:4, that means
> > uncompressed anon
> > pages are much much more. Thus, while the background app is switched back
> > to foreground, we need those swapped-out large folios back rather than getting
> > small basepages replacement. swap-in basepage is definitely not going to
> > work well on a phone, neither does THP collapse.
>
> Yep understood. From the other thread, it sounds like you are preparing a series
> for large swap-in - looking forward to seeing it!

Right. As said, this is the first priority.

>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> >>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >>>>>
> >>>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
> >>>>
> >>>> I think this is all naturally handled by the folio code that exists in modern
> >>>> kernels?
> >>>
> >>> We had a CONTPTE hugepage pool; if the pool runs very low, we let LRU
> >>> reclaim return large folios to the pool. As phones run lots of apps
> >>> and drivers and memory is very limited, after a couple of hours
> >>> it becomes very hard to allocate large folios from the original buddy; thus,
> >>> large folios would totally disappear after running the phone for some time
> >>> if we didn't have the pool.
> >>>
> >>>>
> >>>>>
> >>>>> So we are 100% interested in your patchset and hope it can find a way
> >>>>> to land in the mainline, removing the cost of maintaining out-of-tree
> >>>>> code from one kernel version to another, which we have done across a
> >>>>> couple of kernel versions before 5.16. We are firmly, 100% supportive
> >>>>> of the large anon folios work you are leading.
> >>>>
> >>>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> >>>> it closer :). If you had any ability to do any A/B performance testing, it would
> >>>> be very interesting to see how this stacks up against your solution - if there
> >>>> are gaps it would be good to know where and develop a plan to plug the gap.
> >>>>
> >>>
> >>> sure.
> >>>
> >>>>>
> >>>>> A big pain was that we found lots of races, especially around CONTPTE
> >>>>> unfolding, and especially where some basepages escaped from the group of
> >>>>> 16 CONTPTEs, since userspace always works on basepages and has no idea
> >>>>> of small-THP. We ran our code on millions of real phones, and by now we
> >>>>> have got them fixed (or maybe "can't reproduce"); there is no
> >>>>> outstanding issue.
> >>>>
> >>>> I'm going to be brave and say that my solution shouldn't suffer from these
> >>>> problems; but of course the proof is only in the testing. I did a lot of work
> >>>> with our architecture group and microarchitects to determine exactly what is
> >>>> and isn't safe; we even tightened the Arm ARM spec very subtly to allow the
> >>>> optimization in patch 13 (see the commit log for details). Of course this has
> >>>> all been checked with partners and we are confident that all existing
> >>>> implementations conform to the modified wording.
> >>>
> >>> Cool. I like your try_unfold/fold code. It seems your code sets/drops
> >>> CONT automatically based on alignment, page count etc. In contrast,
> >>> our code always naively checks some conditions before setting and dropping
> >>> CONT everywhere.
> >>>
> >>>>
> >>>>>
> >>>>> Particularly for the rmap issue we are discussing, our out-of-tree is
> >>>>> using the entire_map for
> >>>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> >>>>> CONTPTE from mm-core.
> >>>>>
> >>>>> We are doing this in mm/memory.c
> >>>>>
> >>>>> copy_present_cont_pte(struct vm_area_struct *dst_vma,
> >>>>>                       struct vm_area_struct *src_vma,
> >>>>>                       pte_t *dst_pte, pte_t *src_pte, unsigned long addr,
> >>>>>                       int *rss, struct page **prealloc)
> >>>>> {
> >>>>>         struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>>         unsigned long vm_flags = src_vma->vm_flags;
> >>>>>         pte_t pte = *src_pte;
> >>>>>         struct page *page;
> >>>>>
> >>>>>         page = vm_normal_page(src_vma, addr, pte);
> >>>>>         ...
> >>>>>
> >>>>>         get_page(page);
> >>>>>         page_dup_rmap(page, true); // an entire dup_rmap, as you can see
> >>>>>         rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> >>>>> }
> >>>>>
> >>>>> and we have a split helper in mm/cont_pte_hugepage.c to handle partial unmap:
> >>>>>
> >>>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> >>>>>                                          unsigned long haddr, bool freeze)
> >>>>> {
> >>>>>         ...
> >>>>>         if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> >>>>>                 for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> >>>>>                         atomic_inc(&head[i]._mapcount);
> >>>>>                 atomic_long_inc(&cont_pte_double_map_count);
> >>>>>         }
> >>>>>
> >>>>>         if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> >>>>>                 ...
> >>>>>         }
> >>>>>
> >>>>> I am not selling our solution any more, but just showing you some differences we
> >>>>> have :-)
> >>>>
> >>>> OK, I understand what you were saying now. I'm currently struggling to see how
> >>>> this could fit into my model. Do you have any workloads and numbers on perf
> >>>> improvement of using entire_mapcount?
> >>>
> >>> TBH, I don't have any data on this; we were using entire_map from the very
> >>> beginning, so I have no comparison at all.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> BTW, I have concerns about whether a variable small-THP size will really
> >>>>>>> work, as userspace is probably friendly to only one fixed size. For example,
> >>>>>>> userspace heap management might be optimized around one size for freeing
> >>>>>>> memory to the kernel; it is very difficult for the heap to adapt to various
> >>>>>>> sizes at the same time. Frequent unmaps/frees whose size differs from, and
> >>>>>>> particularly is smaller than, the small-THP size will defeat all efforts to
> >>>>>>> use small-THP.
> >>>>>>
> >>>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
> >>>>>> say that as currently defined, the small-sized THP interface to user space
> >>>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >>>>>> single size can be enabled. I'm deliberately punting that decision away from the
> >>>>>> kernel for now.
> >>>>>
> >>>>> Basically, a userspace heap library has a PAGESIZE setting and allows users
> >>>>> to allocate/free all kinds of small objects such as 16, 32, 64, 128, 256, 512
> >>>>> bytes etc. The default page size is of course equal to the basepage size. Once
> >>>>> some objects are freed by free() and libc gets a free "page", userspace heap
> >>>>> libraries might return the PAGESIZE page to the kernel via things like
> >>>>> MADV_DONTNEED, which ends up in zap_pte_range(). It is quite similar to the
> >>>>> kernel slab.
> >>>>>
> >>>>> So imagine we have small-THP now but userspace libraries have *no* idea at
> >>>>> all; this can frequently cause unfolding.
> >>>>>
> >>>>>>
> >>>>>> FWIW, my experience with the Speedometer/JavaScript use case is that performance
> >>>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>>>>>
> >>>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
> >>>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
> >>>>>> the deferred split list, then under memory pressure it is split and the unused
> >>>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
> >>>>>> performance impact?
> >>>>>
> >>>>> Right. If this is happening to the majority of small-THP folios, we get no
> >>>>> performance improvement, and probably a regression instead. This is really
> >>>>> true on real workloads!!
> >>>>>
> >>>>> So that is why we would really love a per-VMA hint to enable small-THP, but
> >>>>> obviously you have already supported that now with
> >>>>> "mm: thp: Introduce per-size thp sysfs interface":
> >>>>> https://lore.kernel.org/linux-mm/[email protected]/
> >>>>>
> >>>>> We can use MADVISE rather than ALWAYS and set a fixed size like 64KB, so
> >>>>> userspace can set the VMA flag when it is quite sure this VMA works with
> >>>>> 64KB alignment?
> >>>>
> >>>> Yes, that all exists in the series today. We have also discussed the possibility
> >>>> of adding a new madvise_process() call that would take the set of THP sizes that
> >>>> should be considered. Then you can set different VMAs to use different sizes;
> >>>> the plan was to layer that on top if/when a workload was identified. Sounds like
> >>>> you might be able to help there?
> >>>
> >>> I'm not quite sure, as on phones we are using fixed-size CONTPTE, so we ask
> >>> for either 64KB or 4KB. If we think one VMA is good to use CONTPTE, we
> >>> set a flag in this VMA and try to allocate 64KB.
> >>
> >> When you say "we set a flag" do you mean user space? Or is there some heuristic
> >> in the kernel?
> >
> > We are using a field added by the Android kernel to the vma struct to mark
> > that this vma is good to use CONTPTE. With the upstream solution you are
> > providing, we can remove this dirty code [1].
> > static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
> > {
> >         return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
> > }
>
> Sorry I'm not sure I've understood; how does that flag get set in the first
> place? Does user space tell the kernel (via e.g. madvise()) or does the kernel
> set it based on some defined heuristics?

Basically we did it in an ugly way. On Android, different vma types have
different names. For some types of vma, we have optimized userspace and tried
to decrease/avoid fragmentation and unaligned CONTPTE unfolds. So in the
kernel we compare the name of the vma, and if it is an optimized vma type, we
set the field in the vma. Note that in many cases we have to write dirty code
like this as we have to follow the Android kernel's KMI :-)

Based on your new sysfs interface, we can move to madvise(MADV_HUGEPAGE) and
set the 64KB size to "madvise".
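
Roughly, the flow we have in mind would look like the sketch below. This is
only an illustration: the sysfs path is an assumption based on the per-size
interface from the series referenced above and may differ, and the region
size and alignment are just example values.

/*
 * Sketch only. Assumes the per-size sysfs knob, e.g.:
 *
 *   echo madvise > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
 *
 * Userspace then opts a 64KB-aligned region in with MADV_HUGEPAGE.
 */
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define REGION_SZ	(64UL * 1024 * 1024)	/* example: 64MB arena */

int main(void)
{
	/* 64KB alignment so contpte-sized (64KB) folios can actually be used. */
	void *buf = aligned_alloc(64 * 1024, REGION_SZ);

	if (!buf)
		return 1;

	/* This is only a hint; the kernel may still fall back to 4KB pages. */
	madvise(buf, REGION_SZ, MADV_HUGEPAGE);

	/* ... heap/arena work on buf ... */

	free(buf);
	return 0;
}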

BTW, large anon folios can bring disaster to an unoptimized userspace,
especially on a memory-limited system, since the memory footprint can increase
terribly. So it is really nice to have your new sysfs interface and let
userspace decide whether it wants large folios.

>
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
> >

Thanks
Barry

2023-11-30 05:08:28

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>> have modified the bits during those loops. I think it would be possible to fix
>>>> the race, but intuition says it would be expensive.
>>>
>>> So the issue as I understand it is subsequent iterations would see a
>>> clean PTE after the first iteration returned a dirty PTE. In
>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>> all the PTEs?
>>
>> The raciness I'm describing is the race between reading access/dirty from one
>> pte and applying it to another. But yes I like your suggestion. if we do:
>>
>> pte = __ptep_get_and_clear_full(ptep)
>>
>> on the target pte, then we have grabbed access/dirty from it in a race-free
>> manner. we can then loop from current pte up towards the top of the block until
>> we find a valid entry (and I guess wrap at the top to make us robust against
>> future callers clearing in an arbitrary order). Then atomically accumulate the
>> access/dirty bits we have just saved into that new entry. I guess that's just a
>> cmpxchg loop - there are already examples of how to do that correctly when
>> racing the TLB.
>>
>> For most entries, we will just be copying up to the next pte. For the last pte,
>> we would end up reading all ptes and determine we are the last one.
>>
>> What do you think?
>
> OK here is an attempt at something which solves the fragility. I think this is
> now robust and will always return the correct access/dirty state from
> ptep_get_and_clear_full() and ptep_get().
>
> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
> bits from across the contpte block - which requires reading each pte in the
> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>
> Was this the type of thing you were thinking of, Alistair?

Yes, that is along the lines of what I was thinking. However I have
added a couple of comments inline.

> --8<--
> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
> arch/arm64/mm/fault.c | 38 +++++++++------
> 3 files changed, 125 insertions(+), 17 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 9bd2f57a9e11..6c295d277784 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
> }
>
> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep,
> pte_t entry, int dirty);
> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr);
> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
> __pte_clear(mm, addr, ptep);
> }
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, int full)
> +{
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return __ptep_get_and_clear(mm, addr, ptep);
> +
> + if (!full) {
> + contpte_try_unfold(mm, addr, ptep, orig_pte);
> + return __ptep_get_and_clear(mm, addr, ptep);
> + }
> +
> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +
> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> - return __ptep_get_and_clear(mm, addr, ptep);
> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
> }
>
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 2a57df16bf58..99b211118d93 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> + /*
> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
> + * where some of the ptes in the range may be cleared but others
> + * are still to do. See contpte_ptep_get_and_clear_full().
> + */
> + if (!pte_valid(pte))
> + continue;
> +
> if (pte_dirty(pte))
> orig_pte = pte_mkdirty(orig_pte);
>
> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL(contpte_set_ptes);
>
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * When doing a full address space teardown, we can avoid unfolding the
> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> + * just get and clear the pte. The caller is promising to call us for
> + * every pte, so every pte in the range will be cleared by the time the
> + * final tlbi is issued.
> + *
> + * This approach requires some complex hoop jumping though, as for the
> + * duration between returning from the first call to
> + * ptep_get_and_clear_full() and making the final call, the contpte
> + * block is in an intermediate state, where some ptes are cleared and
> + * others are still set with the PTE_CONT bit. If any other APIs are
> + * called for the ptes in the contpte block during that time, we have to
> + * be very careful. The core code currently interleaves calls to
> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
> + * careful to ignore the cleared entries when accumulating the access
> + * and dirty bits - the same goes for ptep_get_lockless(). The only
> + * other calls we might reasonably expect are to set markers in the
> + * previously cleared ptes. (We shouldn't see valid entries being set
> + * until after the tlbi, at which point we are no longer in the
> + * intermediate state). Since markers are not valid, this is safe;
> + * set_ptes() will see the old, invalid entry and will not attempt to
> + * unfold. And the new pte is also invalid so it won't attempt to fold.
> + * We shouldn't see pte markers being set for the 'full' case anyway
> + * since the address space is being torn down.
> + *
> + * The last remaining issue is returning the access/dirty bits. That
> + * info could be present in any of the ptes in the contpte block.
> + * ptep_get() will gather those bits from across the contpte block (for
> + * the remaining valid entries). So below, if the pte we are clearing
> + * has dirty or young set, we need to stash it into a pte that we are
> + * yet to clear. This allows future calls to return the correct state
> + * even when the info was stored in a different pte. Since the core-mm
> + * calls from low to high address, we prefer to stash in the last pte of
> + * the contpte block - this means we are not "dragging" the bits up
> + * through all ptes and increases the chances that we can exit early
> + * because a given pte will have neither dirty nor young set.
> + */
> +
> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
> + bool dirty = pte_dirty(orig_pte);
> + bool young = pte_young(orig_pte);
> + pte_t *start;
> +
> + if (!dirty && !young)
> + return contpte_ptep_get(ptep, orig_pte);

I don't think we need to do this. If the PTE is !dirty && !young we can
just return it. As you say we have to assume HW can set those flags at
any time anyway so it doesn't get us much. This means in the common case
we should only run through the loop setting the dirty/young flags once
which should allay the performance concerns.

However I am now wondering if we're doing the wrong thing trying to hide
this down in the arch layer anyway. Perhaps it would be better to deal
with this in the core-mm code after all.

So how about having ptep_get_and_clear_full() clearing the PTEs for the
entire cont block? We know by definition all PTEs should be pointing to
the same folio anyway, and it seems at least zap_pte_range() would cope
with this just fine because subsequent iterations would just see
pte_none() and continue the loop. I haven't checked the other call sites
though, but in principle I don't see why we couldn't define
ptep_get_and_clear_full() as being something that clears all PTEs
mapping a given folio (although it might need renaming).

This does assume you don't need to partially unmap a page in
zap_pte_range (ie. end >= folio), but we're already making that
assumption.
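
To illustrate what I mean, here is a very rough sketch of the shape of the
per-pte loop and why it tolerates a batched clear: entries already cleared by
an earlier call simply read back as none and are skipped. This is not the real
zap_pte_range(); the function name and signature are invented for illustration.

static void zap_range_sketch(struct mmu_gather *tlb, struct mm_struct *mm,
			     pte_t *pte, unsigned long addr, unsigned long end)
{
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);

		/* Already cleared as part of an earlier batched call. */
		if (pte_none(ptent))
			continue;

		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);

		/* ... tlb bookkeeping, rmap removal, rss accounting ... */
	}
}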

> +
> + start = contpte_align_down(ptep);
> + ptep = start + CONT_PTES - 1;
> +
> + for (; ptep >= start; ptep--) {
> + pte_t pte = __ptep_get(ptep);
> +
> + if (!pte_valid(pte))
> + continue;
> +
> + if (dirty)
> + pte = pte_mkdirty(pte);
> +
> + if (young)
> + pte = pte_mkyoung(pte);
> +
> + __ptep_set_access_flags_notlbi(ptep, pte);
> + return contpte_ptep_get(ptep, orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +
> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
> {
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index d63f3a0a7251..b22216a8153c 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
> pr_cont("\n");
> }
>
> -/*
> - * This function sets the access flags (dirty, accessed), as well as write
> - * permission, and only to a more permissive setting.
> - *
> - * It needs to cope with hardware update of the accessed/dirty state by other
> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> - *
> - * Returns whether or not the PTE actually changed.
> - */
> -int __ptep_set_access_flags(struct vm_area_struct *vma,
> - unsigned long address, pte_t *ptep,
> - pte_t entry, int dirty)
> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
> {
> pteval_t old_pteval, pteval;
> pte_t pte = __ptep_get(ptep);
> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
> pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
> } while (pteval != old_pteval);
>
> + return 1;
> +}
> +
> +/*
> + * This function sets the access flags (dirty, accessed), as well as write
> + * permission, and only to a more permissive setting.
> + *
> + * It needs to cope with hardware update of the accessed/dirty state by other
> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> + *
> + * Returns whether or not the PTE actually changed.
> + */
> +int __ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long address, pte_t *ptep,
> + pte_t entry, int dirty)
> +{
> + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
> +
> /* Invalidate a stale read-only entry */
> - if (dirty)
> + if (changed && dirty)
> flush_tlb_page(vma, address);
> - return 1;
> +
> + return changed;
> }
>
> static bool is_el1_instruction_abort(unsigned long esr)
> --8<--

2023-11-30 05:58:26

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On Thu, Nov 30, 2023 at 1:08 PM Alistair Popple <[email protected]> wrote:
>
>
> Ryan Roberts <[email protected]> writes:
>
> >>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
> >>>> buggy because it iterated through the PTEs, getting and accumulating. Then
> >>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
> >>>> have modified the bits during those loops. I think it would be possible to fix
> >>>> the race, but intuition says it would be expensive.
> >>>
> >>> So the issue as I understand it is subsequent iterations would see a
> >>> clean PTE after the first iteration returned a dirty PTE. In
> >>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
> >>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
> >>> all the PTEs?
> >>
> >> The raciness I'm describing is the race between reading access/dirty from one
> >> pte and applying it to another. But yes I like your suggestion. if we do:
> >>
> >> pte = __ptep_get_and_clear_full(ptep)
> >>
> >> on the target pte, then we have grabbed access/dirty from it in a race-free
> >> manner. we can then loop from current pte up towards the top of the block until
> >> we find a valid entry (and I guess wrap at the top to make us robust against
> >> future callers clearing in an arbitrary order). Then atomically accumulate the
> >> access/dirty bits we have just saved into that new entry. I guess that's just a
> >> cmpxchg loop - there are already examples of how to do that correctly when
> >> racing the TLB.
> >>
> >> For most entries, we will just be copying up to the next pte. For the last pte,
> >> we would end up reading all ptes and determine we are the last one.
> >>
> >> What do you think?
> >
> > OK here is an attempt at something which solves the fragility. I think this is
> > now robust and will always return the correct access/dirty state from
> > ptep_get_and_clear_full() and ptep_get().
> >
> > But I'm not sure about performance; each call to ptep_get_and_clear_full() for
> > each pte in a contpte block will cause a ptep_get() to gather the access/dirty
> > bits from across the contpte block - which requires reading each pte in the
> > contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
> >
> > Was this the type of thing you were thinking of, Alistair?
>
> Yes, that is along the lines of what I was thinking. However I have
> added a couple of comments inline.
>
> > --8<--
> > arch/arm64/include/asm/pgtable.h | 23 ++++++++-
> > arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
> > arch/arm64/mm/fault.c | 38 +++++++++------
> > 3 files changed, 125 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 9bd2f57a9e11..6c295d277784 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> > return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
> > }
> >
> > +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
> > extern int __ptep_set_access_flags(struct vm_area_struct *vma,
> > unsigned long address, pte_t *ptep,
> > pte_t entry, int dirty);
> > @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> > extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> > extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> > pte_t *ptep, pte_t pte, unsigned int nr);
> > +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> > + unsigned long addr, pte_t *ptep);
> > extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long addr, pte_t *ptep);
> > extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> > @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
> > __pte_clear(mm, addr, ptep);
> > }
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> > +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > + unsigned long addr, pte_t *ptep, int full)
> > +{
> > + pte_t orig_pte = __ptep_get(ptep);
> > +
> > + if (!pte_valid_cont(orig_pte))
> > + return __ptep_get_and_clear(mm, addr, ptep);
> > +
> > + if (!full) {
> > + contpte_try_unfold(mm, addr, ptep, orig_pte);
> > + return __ptep_get_and_clear(mm, addr, ptep);
> > + }
> > +
> > + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> > +}
> > +
> > #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> > static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> > unsigned long addr, pte_t *ptep)
> > {
> > - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> > - return __ptep_get_and_clear(mm, addr, ptep);
> > + return ptep_get_and_clear_full(mm, addr, ptep, 0);
> > }
> >
> > #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index 2a57df16bf58..99b211118d93 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> > for (i = 0; i < CONT_PTES; i++, ptep++) {
> > pte = __ptep_get(ptep);
> >
> > + /*
> > + * Deal with the partial contpte_ptep_get_and_clear_full() case,
> > + * where some of the ptes in the range may be cleared but others
> > + * are still to do. See contpte_ptep_get_and_clear_full().
> > + */
> > + if (!pte_valid(pte))
> > + continue;
> > +
> > if (pte_dirty(pte))
> > orig_pte = pte_mkdirty(orig_pte);
> >
> > @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> > }
> > EXPORT_SYMBOL(contpte_set_ptes);
> >
> > +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> > + unsigned long addr, pte_t *ptep)
> > +{
> > + /*
> > + * When doing a full address space teardown, we can avoid unfolding the
> > + * contiguous range, and therefore avoid the associated tlbi. Instead,
> > + * just get and clear the pte. The caller is promising to call us for
> > + * every pte, so every pte in the range will be cleared by the time the
> > + * final tlbi is issued.
> > + *
> > + * This approach requires some complex hoop jumping though, as for the
> > + * duration between returning from the first call to
> > + * ptep_get_and_clear_full() and making the final call, the contpte
> > + * block is in an intermediate state, where some ptes are cleared and
> > + * others are still set with the PTE_CONT bit. If any other APIs are
> > + * called for the ptes in the contpte block during that time, we have to
> > + * be very careful. The core code currently interleaves calls to
> > + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
> > + * careful to ignore the cleared entries when accumulating the access
> > + * and dirty bits - the same goes for ptep_get_lockless(). The only
> > + * other calls we might reasonably expect are to set markers in the
> > + * previously cleared ptes. (We shouldn't see valid entries being set
> > + * until after the tlbi, at which point we are no longer in the
> > + * intermediate state). Since markers are not valid, this is safe;
> > + * set_ptes() will see the old, invalid entry and will not attempt to
> > + * unfold. And the new pte is also invalid so it won't attempt to fold.
> > + * We shouldn't see pte markers being set for the 'full' case anyway
> > + * since the address space is being torn down.
> > + *
> > + * The last remaining issue is returning the access/dirty bits. That
> > + * info could be present in any of the ptes in the contpte block.
> > + * ptep_get() will gather those bits from across the contpte block (for
> > + * the remaining valid entries). So below, if the pte we are clearing
> > + * has dirty or young set, we need to stash it into a pte that we are
> > + * yet to clear. This allows future calls to return the correct state
> > + * even when the info was stored in a different pte. Since the core-mm
> > + * calls from low to high address, we prefer to stash in the last pte of
> > + * the contpte block - this means we are not "dragging" the bits up
> > + * through all ptes and increases the chances that we can exit early
> > + * because a given pte will have neither dirty nor young set.
> > + */
> > +
> > + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
> > + bool dirty = pte_dirty(orig_pte);
> > + bool young = pte_young(orig_pte);
> > + pte_t *start;
> > +
> > + if (!dirty && !young)
> > + return contpte_ptep_get(ptep, orig_pte);
>
> I don't think we need to do this. If the PTE is !dirty && !young we can
> just return it. As you say we have to assume HW can set those flags at
> any time anyway so it doesn't get us much. This means in the common case
> we should only run through the loop setting the dirty/young flags once
> which should allay the performance concerns.
>
> However I am now wondering if we're doing the wrong thing trying to hide
> this down in the arch layer anyway. Perhaps it would be better to deal
> with this in the core-mm code after all.
>
> So how about having ptep_get_and_clear_full() clearing the PTEs for the
> entire cont block? We know by definition all PTEs should be pointing to

I truly believe we should clear all PTEs for the entire folio block. However,
if the existing API ptep_get_and_clear_full() is meant to always handle a single
PTE, we might keep its behaviour as is. On the other hand, clearing the
whole block isn't only required in the fullmm case; it is also a requirement for
normal zap_pte_range() cases coming from madvise(DONTNEED) etc.

I do think we need a folio-level variant. As we are now supporting pte-mapped
large folios, we need some new API to handle a folio's PTEs as a whole, since
we always need to drop the whole folio rather than one PTE at a time when the
pages are compound.

> the same folio anyway, and it seems at least zap_pte_range() would cope
> with this just fine because subsequent iterations would just see
> pte_none() and continue the loop. I haven't checked the other call sites
> > though, but in principle I don't see why we couldn't define
> ptep_get_and_clear_full() as being something that clears all PTEs
> mapping a given folio (although it might need renaming).
>
> This does assume you don't need to partially unmap a page in
> zap_pte_range (ie. end >= folio), but we're already making that
> assumption.
>
> > +
> > + start = contpte_align_down(ptep);
> > + ptep = start + CONT_PTES - 1;
> > +
> > + for (; ptep >= start; ptep--) {
> > + pte_t pte = __ptep_get(ptep);
> > +
> > + if (!pte_valid(pte))
> > + continue;
> > +
> > + if (dirty)
> > + pte = pte_mkdirty(pte);
> > +
> > + if (young)
> > + pte = pte_mkyoung(pte);
> > +
> > + __ptep_set_access_flags_notlbi(ptep, pte);
> > + return contpte_ptep_get(ptep, orig_pte);
> > + }
> > +
> > + return orig_pte;
> > +}
> > +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> > +
> > int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long addr, pte_t *ptep)
> > {
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index d63f3a0a7251..b22216a8153c 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
> > pr_cont("\n");
> > }
> >
> > -/*
> > - * This function sets the access flags (dirty, accessed), as well as write
> > - * permission, and only to a more permissive setting.
> > - *
> > - * It needs to cope with hardware update of the accessed/dirty state by other
> > - * agents in the system and can safely skip the __sync_icache_dcache() call as,
> > - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> > - *
> > - * Returns whether or not the PTE actually changed.
> > - */
> > -int __ptep_set_access_flags(struct vm_area_struct *vma,
> > - unsigned long address, pte_t *ptep,
> > - pte_t entry, int dirty)
> > +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
> > {
> > pteval_t old_pteval, pteval;
> > pte_t pte = __ptep_get(ptep);
> > @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
> > pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
> > } while (pteval != old_pteval);
> >
> > + return 1;
> > +}
> > +
> > +/*
> > + * This function sets the access flags (dirty, accessed), as well as write
> > + * permission, and only to a more permissive setting.
> > + *
> > + * It needs to cope with hardware update of the accessed/dirty state by other
> > + * agents in the system and can safely skip the __sync_icache_dcache() call as,
> > + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> > + *
> > + * Returns whether or not the PTE actually changed.
> > + */
> > +int __ptep_set_access_flags(struct vm_area_struct *vma,
> > + unsigned long address, pte_t *ptep,
> > + pte_t entry, int dirty)
> > +{
> > + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
> > +
> > /* Invalidate a stale read-only entry */
> > - if (dirty)
> > + if (changed && dirty)
> > flush_tlb_page(vma, address);
> > - return 1;
> > +
> > + return changed;
> > }
> >
> > static bool is_el1_instruction_abort(unsigned long esr)
> > --8<--
>

Thanks
Barry

2023-11-30 11:47:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 30/11/2023 05:07, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>> the race, but intuition says it would be expensive.
>>>>
>>>> So the issue as I understand it is subsequent iterations would see a
>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>> all the PTEs?
>>>
>>> The raciness I'm describing is the race between reading access/dirty from one
>>> pte and applying it to another. But yes I like your suggestion. if we do:
>>>
>>> pte = __ptep_get_and_clear_full(ptep)
>>>
>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>> manner. we can then loop from current pte up towards the top of the block until
>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>> future callers clearing in an arbitrary order). Then atomically accumulate the
>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>> cmpxchg loop - there are already examples of how to do that correctly when
>>> racing the TLB.
>>>
>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>> we would end up reading all ptes and determine we are the last one.
>>>
>>> What do you think?
>>
>> OK here is an attempt at something which solves the fragility. I think this is
>> now robust and will always return the correct access/dirty state from
>> ptep_get_and_clear_full() and ptep_get().
>>
>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>> bits from across the contpte block - which requires reading each pte in the
>> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>>
>> Was this the type of thing you were thinking of, Alistair?
>
> Yes, that is along the lines of what I was thinking. However I have
> added a couple of comments inline.
>
>> --8<--
>> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
>> arch/arm64/mm/fault.c | 38 +++++++++------
>> 3 files changed, 125 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 9bd2f57a9e11..6c295d277784 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>> }
>>
>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>> unsigned long address, pte_t *ptep,
>> pte_t entry, int dirty);
>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep);
>> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep);
>> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>> __pte_clear(mm, addr, ptep);
>> }
>>
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep, int full)
>> +{
>> + pte_t orig_pte = __ptep_get(ptep);
>> +
>> + if (!pte_valid_cont(orig_pte))
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> +
>> + if (!full) {
>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> + }
>> +
>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>> +}
>> +
>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep)
>> {
>> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> - return __ptep_get_and_clear(mm, addr, ptep);
>> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
>> }
>>
>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index 2a57df16bf58..99b211118d93 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>>
>> + /*
>> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> + * where some of the ptes in the range may be cleared but others
>> + * are still to do. See contpte_ptep_get_and_clear_full().
>> + */
>> + if (!pte_valid(pte))
>> + continue;
>> +
>> if (pte_dirty(pte))
>> orig_pte = pte_mkdirty(orig_pte);
>>
>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL(contpte_set_ptes);
>>
>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep)
>> +{
>> + /*
>> + * When doing a full address space teardown, we can avoid unfolding the
>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>> + * just get and clear the pte. The caller is promising to call us for
>> + * every pte, so every pte in the range will be cleared by the time the
>> + * final tlbi is issued.
>> + *
>> + * This approach requires some complex hoop jumping though, as for the
>> + * duration between returning from the first call to
>> + * ptep_get_and_clear_full() and making the final call, the contpte
>> + * block is in an intermediate state, where some ptes are cleared and
>> + * others are still set with the PTE_CONT bit. If any other APIs are
>> + * called for the ptes in the contpte block during that time, we have to
>> + * be very careful. The core code currently interleaves calls to
>> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>> + * careful to ignore the cleared entries when accumulating the access
>> + * and dirty bits - the same goes for ptep_get_lockless(). The only
>> + * other calls we might reasonably expect are to set markers in the
>> + * previously cleared ptes. (We shouldn't see valid entries being set
>> + * until after the tlbi, at which point we are no longer in the
>> + * intermediate state). Since markers are not valid, this is safe;
>> + * set_ptes() will see the old, invalid entry and will not attempt to
>> + * unfold. And the new pte is also invalid so it won't attempt to fold.
>> + * We shouldn't see pte markers being set for the 'full' case anyway
>> + * since the address space is being torn down.
>> + *
>> + * The last remaining issue is returning the access/dirty bits. That
>> + * info could be present in any of the ptes in the contpte block.
>> + * ptep_get() will gather those bits from across the contpte block (for
>> + * the remaining valid entries). So below, if the pte we are clearing
>> + * has dirty or young set, we need to stash it into a pte that we are
>> + * yet to clear. This allows future calls to return the correct state
>> + * even when the info was stored in a different pte. Since the core-mm
>> + * calls from low to high address, we prefer to stash in the last pte of
>> + * the contpte block - this means we are not "dragging" the bits up
>> + * through all ptes and increases the chances that we can exit early
>> + * because a given pte will have neither dirty nor young set.
>> + */
>> +
>> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>> + bool dirty = pte_dirty(orig_pte);
>> + bool young = pte_young(orig_pte);
>> + pte_t *start;
>> +
>> + if (!dirty && !young)
>> + return contpte_ptep_get(ptep, orig_pte);
>
> I don't think we need to do this. If the PTE is !dirty && !young we can
> just return it. As you say we have to assume HW can set those flags at
> any time anyway so it doesn't get us much. This means in the common case
> we should only run through the loop setting the dirty/young flags once
> which should allay the performance concerns.

I don't follow your logic. This is precisely the problem I was trying to solve
vs my original (simple) attempt - we want to always report the correct
access/dirty info. If we read one of the PTEs and neither access nor dirty are
set, that doesn't mean it's old and clean; it just means that that info is
definitely not stored in this PTE - we need to check the others. (when the
contiguous bit is set, the HW will only update the access/dirty bits for 1 of
the PTEs in the contpte block).

Also, IIRC, the core-mm sets access when initially setting up the
mapping, so it's not guaranteed that all but one of the PTEs in the contpte
block have (!dirty && !young).

>
> However I am now wondering if we're doing the wrong thing trying to hide
> this down in the arch layer anyway. Perhaps it would be better to deal
> with this in the core-mm code after all.
>
> So how about having ptep_get_and_clear_full() clearing the PTEs for the
> entire cont block? We know by definition all PTEs should be pointing to
> the same folio anyway, and it seems at least zap_pte_range() would cope
> with this just fine because subsequent iterations would just see
> pte_none() and continue the loop. I haven't checked the other call sites
> though, but in principle I don't see why we couldn't define
> ptep_get_and_clear_full() as being something that clears all PTEs
> mapping a given folio (although it might need renaming).

Ahha! Yes, I've been working on a solution like this since Barry raised it
yesterday. I have a working version that seems to perform well. I wouldn't want
to just clear all the PTEs in the block inside ptep_get_and_clear_full() because,
although it might work today, it's fragile in the same way that my v2 version is.

Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
a number of ptes to clear (like set_ptes()). It returns the PTE read from the
*first* slot, but with the access/dirty bits being accumulated from all of the
ptes in the requested batch. Then zap_pte_range() is reworked to find
appropriate batches (similar to how I've reworked it for ptep_set_wrprotects()).
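
Roughly, the shape is something like the sketch below. This is just a sketch to
show the idea, not the code I'll post: the name, signature and the use of the
arm64 __ptep_get_and_clear() helper are assumptions based on the description
above, and the fold/unfold and TLB details are omitted.

static inline pte_t clear_ptes_sketch(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, unsigned int nr)
{
	/* Clear the first slot; this provides the pfn/prot that we return. */
	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t pte;

		addr += PAGE_SIZE;
		ptep++;
		pte = __ptep_get_and_clear(mm, addr, ptep);

		/* Accumulate access/dirty from every slot into the result. */
		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}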

I was trying to avoid introducing new helpers, but I think this is the most
robust approach, and it looks slightly more performant too, at first sight. It also
addresses cases where full=0, which Barry says are important for madvise(DONTNEED).

>
> This does assume you don't need to partially unmap a page in
> zap_pte_range (ie. end >= folio), but we're already making that
> assumption.

That's fine for full=1. But we can't make that assumption for full=0. If a VMA
gets split for a reason that doesn't require re-setting the PTEs then a contpte
block could straddle 2 VMAs. But the solution I describe above is robust to that.

I'll finish gathering perf data and then post results for all 3 approaches: v2 as
originally posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully
later today.

>
>> +
>> + start = contpte_align_down(ptep);
>> + ptep = start + CONT_PTES - 1;
>> +
>> + for (; ptep >= start; ptep--) {
>> + pte_t pte = __ptep_get(ptep);
>> +
>> + if (!pte_valid(pte))
>> + continue;
>> +
>> + if (dirty)
>> + pte = pte_mkdirty(pte);
>> +
>> + if (young)
>> + pte = pte_mkyoung(pte);
>> +
>> + __ptep_set_access_flags_notlbi(ptep, pte);
>> + return contpte_ptep_get(ptep, orig_pte);
>> + }
>> +
>> + return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>> +
>> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep)
>> {
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index d63f3a0a7251..b22216a8153c 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
>> pr_cont("\n");
>> }
>>
>> -/*
>> - * This function sets the access flags (dirty, accessed), as well as write
>> - * permission, and only to a more permissive setting.
>> - *
>> - * It needs to cope with hardware update of the accessed/dirty state by other
>> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
>> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>> - *
>> - * Returns whether or not the PTE actually changed.
>> - */
>> -int __ptep_set_access_flags(struct vm_area_struct *vma,
>> - unsigned long address, pte_t *ptep,
>> - pte_t entry, int dirty)
>> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
>> {
>> pteval_t old_pteval, pteval;
>> pte_t pte = __ptep_get(ptep);
>> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>> pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
>> } while (pteval != old_pteval);
>>
>> + return 1;
>> +}
>> +
>> +/*
>> + * This function sets the access flags (dirty, accessed), as well as write
>> + * permission, and only to a more permissive setting.
>> + *
>> + * It needs to cope with hardware update of the accessed/dirty state by other
>> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
>> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>> + *
>> + * Returns whether or not the PTE actually changed.
>> + */
>> +int __ptep_set_access_flags(struct vm_area_struct *vma,
>> + unsigned long address, pte_t *ptep,
>> + pte_t entry, int dirty)
>> +{
>> + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
>> +
>> /* Invalidate a stale read-only entry */
>> - if (dirty)
>> + if (changed && dirty)
>> flush_tlb_page(vma, address);
>> - return 1;
>> +
>> + return changed;
>> }
>>
>> static bool is_el1_instruction_abort(unsigned long esr)
>> --8<--
>

2023-12-03 23:20:51

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 30/11/2023 05:07, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>>> the race, but intuition says it would be expensive.
>>>>>
>>>>> So the issue as I understand it is subsequent iterations would see a
>>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>>> all the PTEs?
>>>>
>>>> The raciness I'm describing is the race between reading access/dirty from one
>>>> pte and applying it to another. But yes I like your suggestion. if we do:
>>>>
>>>> pte = __ptep_get_and_clear_full(ptep)
>>>>
>>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>>> manner. we can then loop from current pte up towards the top of the block until
>>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>>> future callers clearing in an arbitrary order). Then atomically accumulate the
>>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>>> cmpxchg loop - there are already examples of how to do that correctly when
>>>> racing the TLB.
>>>>
>>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>>> we would end up reading all ptes and determine we are the last one.
>>>>
>>>> What do you think?
>>>
>>> OK here is an attempt at something which solves the fragility. I think this is
>>> now robust and will always return the correct access/dirty state from
>>> ptep_get_and_clear_full() and ptep_get().
>>>
>>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>>> bits from across the contpte block - which requires reading each pte in the
>>> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>>>
>>> Was this the type of thing you were thinking of, Alistair?
>>
>> Yes, that is along the lines of what I was thinking. However I have
>> added a couple of comments inline.
>>
>>> --8<--
>>> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>>> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
>>> arch/arm64/mm/fault.c | 38 +++++++++------
>>> 3 files changed, 125 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 9bd2f57a9e11..6c295d277784 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>>> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>>> }
>>>
>>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>>> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> unsigned long address, pte_t *ptep,
>>> pte_t entry, int dirty);
>>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep, pte_t pte, unsigned int nr);
>>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>> + unsigned long addr, pte_t *ptep);
>>> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> unsigned long addr, pte_t *ptep);
>>> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>>> __pte_clear(mm, addr, ptep);
>>> }
>>>
>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>> + unsigned long addr, pte_t *ptep, int full)
>>> +{
>>> + pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> + if (!pte_valid_cont(orig_pte))
>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> + if (!full) {
>>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>> + }
>>> +
>>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>>> +}
>>> +
>>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> - return __ptep_get_and_clear(mm, addr, ptep);
>>> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
>>> }
>>>
>>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index 2a57df16bf58..99b211118d93 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>>> pte = __ptep_get(ptep);
>>>
>>> + /*
>>> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
>>> + * where some of the ptes in the range may be cleared but others
>>> + * are still to do. See contpte_ptep_get_and_clear_full().
>>> + */
>>> + if (!pte_valid(pte))
>>> + continue;
>>> +
>>> if (pte_dirty(pte))
>>> orig_pte = pte_mkdirty(orig_pte);
>>>
>>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> }
>>> EXPORT_SYMBOL(contpte_set_ptes);
>>>
>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>> + unsigned long addr, pte_t *ptep)
>>> +{
>>> + /*
>>> + * When doing a full address space teardown, we can avoid unfolding the
>>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>>> + * just get and clear the pte. The caller is promising to call us for
>>> + * every pte, so every pte in the range will be cleared by the time the
>>> + * final tlbi is issued.
>>> + *
>>> + * This approach requires some complex hoop jumping though, as for the
>>> + * duration between returning from the first call to
>>> + * ptep_get_and_clear_full() and making the final call, the contpte
>>> + * block is in an intermediate state, where some ptes are cleared and
>>> + * others are still set with the PTE_CONT bit. If any other APIs are
>>> + * called for the ptes in the contpte block during that time, we have to
>>> + * be very careful. The core code currently interleaves calls to
>>> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>>> + * careful to ignore the cleared entries when accumulating the access
>>> + * and dirty bits - the same goes for ptep_get_lockless(). The only
>>> + * other calls we might reasonably expect are to set markers in the
>>> + * previously cleared ptes. (We shouldn't see valid entries being set
>>> + * until after the tlbi, at which point we are no longer in the
>>> + * intermediate state). Since markers are not valid, this is safe;
>>> + * set_ptes() will see the old, invalid entry and will not attempt to
>>> + * unfold. And the new pte is also invalid so it won't attempt to fold.
>>> + * We shouldn't see pte markers being set for the 'full' case anyway
>>> + * since the address space is being torn down.
>>> + *
>>> + * The last remaining issue is returning the access/dirty bits. That
>>> + * info could be present in any of the ptes in the contpte block.
>>> + * ptep_get() will gather those bits from across the contpte block (for
>>> + * the remaining valid entries). So below, if the pte we are clearing
>>> + * has dirty or young set, we need to stash it into a pte that we are
>>> + * yet to clear. This allows future calls to return the correct state
>>> + * even when the info was stored in a different pte. Since the core-mm
>>> + * calls from low to high address, we prefer to stash in the last pte of
>>> + * the contpte block - this means we are not "dragging" the bits up
>>> + * through all ptes and increases the chances that we can exit early
>>> + * because a given pte will have neither dirty nor young set.
>>> + */
>>> +
>>> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>>> + bool dirty = pte_dirty(orig_pte);
>>> + bool young = pte_young(orig_pte);
>>> + pte_t *start;
>>> +
>>> + if (!dirty && !young)
>>> + return contpte_ptep_get(ptep, orig_pte);
>>
>> I don't think we need to do this. If the PTE is !dirty && !young we can
>> just return it. As you say we have to assume HW can set those flags at
>> any time anyway so it doesn't get us much. This means in the common case
>> we should only run through the loop setting the dirty/young flags once
>> which should allay the performance concerns.
>
> I don't follow your logic. This is precisely the problem I was trying to solve
> vs my original (simple) attempt - we want to always report the correct
> access/dirty info. If we read one of the PTEs and neither access nor dirty are
> set, that doesn't mean it's old and clean, it just means that that info is
> definitely not stored in this PTE - we need to check the others. (when the
> contiguous bit is set, the HW will only update the access/dirty bits for 1 of
> the PTEs in the contpte block).

So my concern wasn't about incorrectly returning a !young && !dirty PTE
when the CONT_PTE block was *previously* clean/old (ie. the first
ptep_get/ptep_get_and_clear_full returned clean/old) because we have to
tolerate that anyway due to HW being able to set those bits. Rather my
concern was ptep_get_and_clear_full() could implicitly clear dirty/young
bits - ie. ptep_get_and_clear_full() could return a dirty/young PTE but
the next call would not.

That's because, regardless of what we do here, it is just a matter of
timing if we have to assume other HW threads can set these bits at any
time. There is nothing stopping HW from doing that just after we read
them in that loop, so a block can always become dirty/young at any time.
However it shouldn't become !dirty/!young without explicit SW
intervention.

But this is all a bit of a moot point due to the discussion below.

> Also, IIRC, the core-mm sets the access bit when initially setting up the
> mapping, so it's not guaranteed that all but one of the PTEs in the contpte block
> have (!dirty && !young).
>
>>
>> However I am now wondering if we're doing the wrong thing trying to hide
>> this down in the arch layer anyway. Perhaps it would be better to deal
>> with this in the core-mm code after all.
>>
>> So how about having ptep_get_and_clear_full() clearing the PTEs for the
>> entire cont block? We know by definition all PTEs should be pointing to
>> the same folio anyway, and it seems at least zap_pte_range() would cope
>> with this just fine because subsequent iterations would just see
>> pte_none() and continue the loop. I haven't checked the other call sites
>> though, but in principle I don't see why we couldn't define
>> ptep_get_and_clear_full() as being something that clears all PTEs
>> mapping a given folio (although it might need renaming).
>
> Ahha! Yes, I've been working on a solution like this since Barry raised it
> yesterday. I have a working version that seems to perform well. I wouldn't want
> to just clear all the PTEs in the block inside ptep_get_and_clear_full() because,
> although it might work today, it's fragile in the same way that my v2 version is.

Yes, agree a new helper would be needed.

> Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
> a number of ptes to clear (like set_ptes()). It returns the PTE read from the
> *first* slot, but with the access/dirty bits being accumulated from all of the
> ptes in the requested batch. Then zap_pte_range() is reworked to find
> appropriate batches (similar to how I've reworked for ptep_set_wrprotects()).
>
> I was trying to avoid introducing new helpers, but I think this is the most
> robust approach, and looks slightly more performant too, at first sight. It also
> addresses cases where full=0, which Barry says are important for madvise(DONTNEED).

I strongly agree with this approach now, especially if it is equally (or
more!) performant. I get why you didn't want to introduce new helpers,
but I think doing so was making things too subtle, so I would like to see
this.

>>
>> This does assume you don't need to partially unmap a page in
>> zap_pte_range (ie. end >= folio), but we're already making that
>> assumption.
>
> That's fine for full=1. But we can't make that assumption for full=0. If a VMA
> gets split for a reason that doesn't require re-setting the PTEs then a contpte
> block could straddle 2 VMAs. But the solution I describe above is robust to that.
>
> I'll finish gathering perf data then post for all 3 approaches; v2 as originally
> posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully later today.

Thanks!

>>
>>> +
>>> + start = contpte_align_down(ptep);
>>> + ptep = start + CONT_PTES - 1;
>>> +
>>> + for (; ptep >= start; ptep--) {
>>> + pte_t pte = __ptep_get(ptep);
>>> +
>>> + if (!pte_valid(pte))
>>> + continue;
>>> +
>>> + if (dirty)
>>> + pte = pte_mkdirty(pte);
>>> +
>>> + if (young)
>>> + pte = pte_mkyoung(pte);
>>> +
>>> + __ptep_set_access_flags_notlbi(ptep, pte);
>>> + return contpte_ptep_get(ptep, orig_pte);
>>> + }
>>> +
>>> + return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>>> +
>>> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>> index d63f3a0a7251..b22216a8153c 100644
>>> --- a/arch/arm64/mm/fault.c
>>> +++ b/arch/arm64/mm/fault.c
>>> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
>>> pr_cont("\n");
>>> }
>>>
>>> -/*
>>> - * This function sets the access flags (dirty, accessed), as well as write
>>> - * permission, and only to a more permissive setting.
>>> - *
>>> - * It needs to cope with hardware update of the accessed/dirty state by other
>>> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
>>> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>>> - *
>>> - * Returns whether or not the PTE actually changed.
>>> - */
>>> -int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> - unsigned long address, pte_t *ptep,
>>> - pte_t entry, int dirty)
>>> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
>>> {
>>> pteval_t old_pteval, pteval;
>>> pte_t pte = __ptep_get(ptep);
>>> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
>>> } while (pteval != old_pteval);
>>>
>>> + return 1;
>>> +}
>>> +
>>> +/*
>>> + * This function sets the access flags (dirty, accessed), as well as write
>>> + * permission, and only to a more permissive setting.
>>> + *
>>> + * It needs to cope with hardware update of the accessed/dirty state by other
>>> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
>>> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>>> + *
>>> + * Returns whether or not the PTE actually changed.
>>> + */
>>> +int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> + unsigned long address, pte_t *ptep,
>>> + pte_t entry, int dirty)
>>> +{
>>> + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
>>> +
>>> /* Invalidate a stale read-only entry */
>>> - if (dirty)
>>> + if (changed && dirty)
>>> flush_tlb_page(vma, address);
>>> - return 1;
>>> +
>>> + return changed;
>>> }
>>>
>>> static bool is_el1_instruction_abort(unsigned long esr)
>>> --8<--
>>

2023-12-04 09:39:38

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 03/12/2023 23:20, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 30/11/2023 05:07, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>>>> the race, but intuition says it would be expensive.
>>>>>>
>>>>>> So the issue as I understand it is subsequent iterations would see a
>>>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>>>> all the PTEs?
>>>>>
>>>>> The raciness I'm describing is the race between reading access/dirty from one
>>>>> pte and applying it to another. But yes, I like your suggestion. If we do:
>>>>>
>>>>> pte = __ptep_get_and_clear_full(ptep)
>>>>>
>>>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>>>> manner. We can then loop from the current pte up towards the top of the block until
>>>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>>>> future callers clearing in an arbitrary order). Then atomically accumulate the
>>>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>>>> cmpxchg loop - there are already examples of how to do that correctly when
>>>>> racing the TLB.
>>>>>
>>>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>>>> we would end up reading all ptes and determine we are the last one.
>>>>>
>>>>> What do you think?
>>>>
>>>> OK here is an attempt at something which solves the fragility. I think this is
>>>> now robust and will always return the correct access/dirty state from
>>>> ptep_get_and_clear_full() and ptep_get().
>>>>
>>>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>>>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>>>> bits from across the contpte block - which requires reading each pte in the
>>>> contpte block. So it's O(n^2) in that sense. I'll benchmark it and report back.
>>>>
>>>> Was this the type of thing you were thinking of, Alistair?
>>>
>>> Yes, that is along the lines of what I was thinking. However I have
>>> added a couple of comments inline.
>>>
>>>> --8<--
>>>> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>>>> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
>>>> arch/arm64/mm/fault.c | 38 +++++++++------
>>>> 3 files changed, 125 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 9bd2f57a9e11..6c295d277784 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>>>> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>>>> }
>>>>
>>>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>>>> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>>>> unsigned long address, pte_t *ptep,
>>>> pte_t entry, int dirty);
>>>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>>> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>>> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> pte_t *ptep, pte_t pte, unsigned int nr);
>>>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>>> + unsigned long addr, pte_t *ptep);
>>>> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> unsigned long addr, pte_t *ptep);
>>>> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>>>> __pte_clear(mm, addr, ptep);
>>>> }
>>>>
>>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>>>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>>> + unsigned long addr, pte_t *ptep, int full)
>>>> +{
>>>> + pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> + if (!pte_valid_cont(orig_pte))
>>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> + if (!full) {
>>>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>>> + }
>>>> +
>>>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>>>> +}
>>>> +
>>>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>> unsigned long addr, pte_t *ptep)
>>>> {
>>>> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> - return __ptep_get_and_clear(mm, addr, ptep);
>>>> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
>>>> }
>>>>
>>>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index 2a57df16bf58..99b211118d93 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>> pte = __ptep_get(ptep);
>>>>
>>>> + /*
>>>> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
>>>> + * where some of the ptes in the range may be cleared but others
>>>> + * are still to do. See contpte_ptep_get_and_clear_full().
>>>> + */
>>>> + if (!pte_valid(pte))
>>>> + continue;
>>>> +
>>>> if (pte_dirty(pte))
>>>> orig_pte = pte_mkdirty(orig_pte);
>>>>
>>>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> }
>>>> EXPORT_SYMBOL(contpte_set_ptes);
>>>>
>>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>>> + unsigned long addr, pte_t *ptep)
>>>> +{
>>>> + /*
>>>> + * When doing a full address space teardown, we can avoid unfolding the
>>>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>>>> + * just get and clear the pte. The caller is promising to call us for
>>>> + * every pte, so every pte in the range will be cleared by the time the
>>>> + * final tlbi is issued.
>>>> + *
>>>> + * This approach requires some complex hoop jumping though, as for the
>>>> + * duration between returning from the first call to
>>>> + * ptep_get_and_clear_full() and making the final call, the contpte
>>>> + * block is in an intermediate state, where some ptes are cleared and
>>>> + * others are still set with the PTE_CONT bit. If any other APIs are
>>>> + * called for the ptes in the contpte block during that time, we have to
>>>> + * be very careful. The core code currently interleaves calls to
>>>> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>>>> + * careful to ignore the cleared entries when accumulating the access
>>>> + * and dirty bits - the same goes for ptep_get_lockless(). The only
>>>> + * other calls we might reasonably expect are to set markers in the
>>>> + * previously cleared ptes. (We shouldn't see valid entries being set
>>>> + * until after the tlbi, at which point we are no longer in the
>>>> + * intermediate state). Since markers are not valid, this is safe;
>>>> + * set_ptes() will see the old, invalid entry and will not attempt to
>>>> + * unfold. And the new pte is also invalid so it won't attempt to fold.
>>>> + * We shouldn't see pte markers being set for the 'full' case anyway
>>>> + * since the address space is being torn down.
>>>> + *
>>>> + * The last remaining issue is returning the access/dirty bits. That
>>>> + * info could be present in any of the ptes in the contpte block.
>>>> + * ptep_get() will gather those bits from across the contpte block (for
>>>> + * the remaining valid entries). So below, if the pte we are clearing
>>>> + * has dirty or young set, we need to stash it into a pte that we are
>>>> + * yet to clear. This allows future calls to return the correct state
>>>> + * even when the info was stored in a different pte. Since the core-mm
>>>> + * calls from low to high address, we prefer to stash in the last pte of
>>>> + * the contpte block - this means we are not "dragging" the bits up
>>>> + * through all ptes and increases the chances that we can exit early
>>>> + * because a given pte will have neither dirty nor young set.
>>>> + */
>>>> +
>>>> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>>>> + bool dirty = pte_dirty(orig_pte);
>>>> + bool young = pte_young(orig_pte);
>>>> + pte_t *start;
>>>> +
>>>> + if (!dirty && !young)
>>>> + return contpte_ptep_get(ptep, orig_pte);
>>>
>>> I don't think we need to do this. If the PTE is !dirty && !young we can
>>> just return it. As you say we have to assume HW can set those flags at
>>> any time anyway so it doesn't get us much. This means in the common case
>>> we should only run through the loop setting the dirty/young flags once
>>> which should allay the performance concerns.
>>
>> I don't follow your logic. This is precisely the problem I was trying to solve
>> vs my original (simple) attempt - we want to always report the correct
>> access/dirty info. If we read one of the PTEs and neither access nor dirty are
>> set, that doesn't mean it's old and clean, it just means that that info is
>> definitely not stored in this PTE - we need to check the others. (when the
>> contiguous bit is set, the HW will only update the access/dirty bits for 1 of
>> the PTEs in the contpte block).
>
> So my concern wasn't about incorrectly returning a !young && !dirty PTE
> when the CONT_PTE block was *previously* clean/old (ie. the first
> ptep_get/ptep_get_and_clear_full returned clean/old) because we have to
> tolerate that anyway due to HW being able to set those bits. Rather my
> concern was ptep_get_and_clear_full() could implicitly clear dirty/young
> bits - ie. ptep_get_and_clear_full() could return a dirty/young PTE but
> the next call would not.
>
> That's because, regardless of what we do here, it is just a matter of
> timing if we have to assume other HW threads can set these bits at any
> time. There is nothing stopping HW from doing that just after we read
> them in that loop, so a block can always become dirty/young at any time.
> However it shouldn't become !dirty/!young without explicit SW
> intervention.
>
> But this is all a bit of a moot point due to the discussion below.
>
>> Also, IIRC, the core-mm sets the access bit when initially setting up the
>> mapping, so it's not guaranteed that all but one of the PTEs in the contpte block
>> have (!dirty && !young).
>>
>>>
>>> However I am now wondering if we're doing the wrong thing trying to hide
>>> this down in the arch layer anyway. Perhaps it would be better to deal
>>> with this in the core-mm code after all.
>>>
>>> So how about having ptep_get_and_clear_full() clearing the PTEs for the
>>> entire cont block? We know by definition all PTEs should be pointing to
>>> the same folio anyway, and it seems at least zap_pte_range() would cope
>>> with this just fine because subsequent iterations would just see
>>> pte_none() and continue the loop. I haven't checked the other call sites
>>> though, but in principle I don't see why we couldn't define
>>> ptep_get_and_clear_full() as being something that clears all PTEs
>>> mapping a given folio (although it might need renaming).
>>
>> Ahha! Yes, I've been working on a solution like this since Barry raised it
>> yesterday. I have a working version that seems to perform well. I wouldn't want
>> to just clear all the PTEs in the block inside ptep_get_and_clear_full() because,
>> although it might work today, it's fragile in the same way that my v2 version is.
>
> Yes, agree a new helper would be needed.
>
>> Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
>> a number of ptes to clear (like set_ptes()). It returns the PTE read from the
>> *first* slot, but with the access/dirty bits being accumulated from all of the
>> ptes in the requested batch. Then zap_pte_range() is reworked to find
>> appropriate batches (similar to how I've reworked for ptep_set_wrprotects()).
>>
>> I was trying to avoid introducing new helpers, but I think this is the most
>> robust approach, and looks slightly more performant too, at first sight. It also
>> addresses cases where full=0, which Barry says are important for madvise(DONTNEED).
>
> I strongly agree with this approach now, especially if it is equally (or
> more!) performant. I get why you didn't want to introduce new helpers,
> but I think doing so was making things too subtle, so I would like to see
> this.
>
>>>
>>> This does assume you don't need to partially unmap a page in
>>> zap_pte_range (ie. end >= folio), but we're already making that
>>> assumption.
>>
>> That's fine for full=1. But we can't make that assumption for full=0. If a VMA
>> gets split for a reason that doesn't require re-setting the PTEs then a contpte
>> block could straddle 2 VMAs. But the solution I describe above is robust to that.
>>
>> I'll finish gathering perf data then post for all 3 approaches; v2 as originally
>> posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully later today.
>
> Thanks!

From the commit log of the new version, which I'll hopefully post later today:

The following shows the results of running a kernel compilation workload
and measuring the cost of arm64_sys_exit_group() (which at ~1.5% is a
very small part of the overall workload).

Benchmarks were run on Ampere Altra in 2 configs; single numa node and 2
numa nodes (tlbis are more expensive in 2 node config).

- baseline: v6.7-rc1 + anonfolio-v7
- no-opt: contpte series without any attempt to optimize exit()
- simple-ptep_get_clear_full: simple optimization to exploit full=1.
  ptep_get_clear_full() does not fully conform to its intended semantics
- robust-ptep_get_clear_full: similar to the previous, but
  ptep_get_clear_full() fully conforms to its intended semantics
- clear_ptes: optimization implemented by this patch

| config | numa=1 | numa=2 |
|----------------------------|--------|--------|
| baseline | 0% | 0% |
| no-opt | 190% | 768% |
| simple-ptep_get_clear_full | 8% | 29% |
| robust-ptep_get_clear_full | 21% | 19% |
| clear_ptes | 13% | 9% |

In all cases, the cost of arm64_sys_exit_group() increases; this is
anticipated because there is more work to do to tear down the page
tables. But clear_ptes() gives the smallest increase overall.

Note that "simple-ptep_get_clear_full" is the version I posted with v2.
"robust-ptep_get_clear_full" is the version I tried as part of this
conversation. And "clear_ptes" is the batched version that I think we all now
prefer (and plan to post as part of v3).

Thanks,
Ryan