2023-11-15 16:31:28

by Ryan Roberts

Subject: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

Hi All,

This is v2 of a series to opportunistically and transparently use contpte
mappings (set the contiguous bit in ptes) for user memory when those mappings
meet the requirements. It is part of a wider effort to improve performance by
allocating and mapping variable-sized blocks of memory (folios). One aim is for
the 4K kernel to approach the performance of the 16K kernel, but without
breaking compatibility and without the associated increase in memory use. Another
aim is to benefit the 16K and 64K kernels by enabling 2M THP, since this is the
contpte size for those kernels. We have good performance data that demonstrates
both aims are being met (see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And the other half of my work, to enable "small-sized THP" (large folios) for
anonymous memory, makes contpte-sized folios prevalent for anonymous memory too
[2].
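
For reference, the ARMv8 contiguous-bit geometry behind those numbers (16/128/32
contiguous level-3 entries for the 4K/16K/64K granules respectively):

     4K granule:  CONT_PTES = 16   ->  16 * 4K  = 64K contpte block
    16K granule:  CONT_PTES = 128  -> 128 * 16K =  2M contpte block
    64K granule:  CONT_PTES = 32   ->  32 * 64K =  2M contpte block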

Optimistically, I would really like to get this series merged for v6.8; there is
a chance that the small-sized THP series will also get merged for that version.
But even if it doesn't, this series still benefits file-backed memory for the
filesystems that support large folios, so it shouldn't be held up. Additionally,
I have data showing that this series introduces no regression when the system
has no appropriate large folios.

All dependencies listed against v1 are now resolved; this series applies cleanly
against v6.7-rc1.

Note that the first patch is for core-mm and provides the refactoring to make a
crucial optimization possible - which is then implemented in patch 13. The
remaining patches are arm64-specific.

Testing
=======

I've tested this series together with small-sized THP [2] on both Ampere Altra
(bare metal) and Apple M2 (VM):
- mm selftests (inc new tests written for small-sized THP); no regressions
- Speedometer JavaScript benchmark in the Chromium web browser; no issues
- Kernel compilation; no issues
- Various tests under high memory pressure with swap enabled; no issues


Performance
===========

John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
some workloads at [3], when using a 64K base page kernel.

You can also see the original performance results I posted against v1 [1] which
are still valid.

I've additionally run the kernel compilation and Speedometer benchmarks on a
system with small-sized THP disabled and large folio support for file-backed
memory intentionally disabled; I see no change in performance in this case (i.e.
no regression when this change is "present but not useful").


Changes since v1
================

- Export contpte_* symbols so that modules can continue to call inline
functions (e.g. ptep_get) which may now call the contpte_* functions (thanks
to JohnH)
- Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
- Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
(thanks to Catalin)
- Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed (thanks
to Catalin)
- Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
- Simplified contpte_ptep_get_and_clear_full()
- Improved various code comments


[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/


Thanks,
Ryan


Ryan Roberts (14):
mm: Batch-copy PTE ranges during fork()
arm64/mm: set_pte(): New layer to manage contig bit
arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
arm64/mm: pte_clear(): New layer to manage contig bit
arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
arm64/mm: ptep_get(): New layer to manage contig bit
arm64/mm: Split __flush_tlb_range() to elide trailing DSB
arm64/mm: Wire up PTE_CONT for user mappings
arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

arch/arm64/Kconfig | 10 +-
arch/arm64/include/asm/pgtable.h | 325 +++++++++++++++++++---
arch/arm64/include/asm/tlbflush.h | 13 +-
arch/arm64/kernel/efi.c | 4 +-
arch/arm64/kernel/mte.c | 2 +-
arch/arm64/kvm/guest.c | 2 +-
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/contpte.c | 447 ++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 12 +-
arch/arm64/mm/fixmap.c | 4 +-
arch/arm64/mm/hugetlbpage.c | 40 +--
arch/arm64/mm/kasan_init.c | 6 +-
arch/arm64/mm/mmu.c | 16 +-
arch/arm64/mm/pageattr.c | 6 +-
arch/arm64/mm/trans_pgd.c | 6 +-
include/linux/pgtable.h | 13 +
mm/memory.c | 175 +++++++++---
17 files changed, 956 insertions(+), 126 deletions(-)
create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1


2023-11-15 16:32:00

by Ryan Roberts

Subject: [PATCH v2 06/14] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
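
For context, once the contpte handling is wired up later in the series, the
public wrapper ends up taking roughly the following shape. This is an
illustrative sketch only (the exact body lands with the later contpte patch);
pte_valid_cont(), __ptep_get() and contpte_ptep_test_and_clear_young() are
helpers this series adds elsewhere:

	#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
	static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					unsigned long addr, pte_t *ptep)
	{
		pte_t orig_pte = __ptep_get(ptep);

		/* Non-contpte mappings keep using the arch-private helper. */
		if (!pte_valid_cont(orig_pte))
			return __ptep_test_and_clear_young(vma, addr, ptep);

		/* Contpte mappings defer to the contpte-aware implementation. */
		return contpte_ptep_test_and_clear_young(vma, addr, ptep);
	}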

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 994597a0bb0f..9b4a9909fd5b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -887,8 +887,9 @@ static inline bool pud_user_accessible_page(pud_t pud)
/*
* Atomic pte/pmd modifications.
*/
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int __ptep_test_and_clear_young(pte_t *ptep)
+static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *ptep)
{
pte_t old_pte, pte;

@@ -903,18 +904,11 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
return pte_young(pte);
}

-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long address,
- pte_t *ptep)
-{
- return __ptep_test_and_clear_young(ptep);
-}
-
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
- int young = ptep_test_and_clear_young(vma, address, ptep);
+ int young = __ptep_test_and_clear_young(vma, address, ptep);

if (young) {
/*
@@ -937,7 +931,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
{
- return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+ return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -1123,6 +1117,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define pte_clear __pte_clear
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear __ptep_get_and_clear
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+#define ptep_test_and_clear_young __ptep_test_and_clear_young

#endif /* !__ASSEMBLY__ */

--
2.25.1

2023-11-15 16:32:10

by Ryan Roberts

Subject: [PATCH v2 07/14] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9b4a9909fd5b..fc1005222ee4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -138,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
* so that we don't erroneously return false for pages that have been
* remapped as PROT_NONE but are yet to be flushed from the TLB.
* Note that we can't make any assumptions based on the state of the access
- * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
+ * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
* TLB.
*/
#define pte_accessible(mm, pte) \
@@ -904,8 +904,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
return pte_young(pte);
}

-#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
int young = __ptep_test_and_clear_young(vma, address, ptep);
@@ -1119,6 +1118,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_get_and_clear __ptep_get_and_clear
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define ptep_test_and_clear_young __ptep_test_and_clear_young
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+#define ptep_clear_flush_young __ptep_clear_flush_young

#endif /* !__ASSEMBLY__ */

--
2.25.1

2023-11-15 16:32:10

by Ryan Roberts

Subject: [PATCH v2 08/14] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++----
arch/arm64/mm/hugetlbpage.c | 2 +-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index fc1005222ee4..423cc32b2777 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -958,11 +958,11 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
- * ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
* dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
*/
-#define __HAVE_ARCH_PTEP_SET_WRPROTECT
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep)
{
pte_t old_pte, pte;

@@ -980,7 +980,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long address, pmd_t *pmdp)
{
- ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
+ __ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
}

#define pmdp_establish pmdp_establish
@@ -1120,6 +1120,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_test_and_clear_young __ptep_test_and_clear_young
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young __ptep_clear_flush_young
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+#define ptep_set_wrprotect __ptep_set_wrprotect

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index c2a753541d13..952462820d9d 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -493,7 +493,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
pte_t pte;

if (!pte_cont(READ_ONCE(*ptep))) {
- ptep_set_wrprotect(mm, addr, ptep);
+ __ptep_set_wrprotect(mm, addr, ptep);
return;
}

--
2.25.1

2023-11-15 16:32:20

by Ryan Roberts

Subject: [PATCH v2 03/14] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

set_pte_at() is a core macro that forwards to set_ptes() (with nr=1).
Instead of creating a __set_pte_at() internal macro, convert all arch
users to use set_ptes()/__set_ptes() directly, as appropriate. Callers
in hugetlb may benefit from calling __set_ptes() once for their whole
range rather than managing their own loop. This is left for future
improvement.
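
For reference, the core macro being relied on here is roughly the following
(include/linux/pgtable.h); set_pte_at() is simply set_ptes() with nr=1:

	#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)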

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 +++++-----
arch/arm64/kernel/mte.c | 2 +-
arch/arm64/kvm/guest.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/arm64/mm/hugetlbpage.c | 10 +++++-----
5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 650d4f4bb6dc..323ec91add60 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -342,9 +342,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
mte_sync_tags(pte, nr_pages);
}

-static inline void set_ptes(struct mm_struct *mm,
- unsigned long __always_unused addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void __set_ptes(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
{
page_table_check_ptes_set(mm, ptep, pte, nr);
__sync_cache_and_tags(pte, nr);
@@ -358,7 +358,6 @@ static inline void set_ptes(struct mm_struct *mm,
pte_val(pte) += PAGE_SIZE;
}
}
-#define set_ptes set_ptes

/*
* Huge pte definitions.
@@ -1067,7 +1066,7 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#endif /* CONFIG_ARM64_MTE */

/*
- * On AArch64, the cache coherency is handled via the set_pte_at() function.
+ * On AArch64, the cache coherency is handled via the __set_ptes() function.
*/
static inline void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
@@ -1121,6 +1120,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
pte_t old_pte, pte_t new_pte);

#define set_pte __set_pte
+#define set_ptes __set_ptes

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..dcdcccd40891 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
/*
* If the page content is identical but at least one of the pages is
* tagged, return non-zero to avoid KSM merging. If only one of the
- * pages is tagged, set_pte_at() may zero or change the tags of the
+ * pages is tagged, __set_ptes() may zero or change the tags of the
* other page via mte_sync_tags().
*/
if (page_mte_tagged(page1) || page_mte_tagged(page2))
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index aaf1d4939739..629145fd3161 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
} else {
/*
* Only locking to serialise with a concurrent
- * set_pte_at() in the VMM but still overriding the
+ * __set_ptes() in the VMM but still overriding the
* tags, hence ignoring the return value.
*/
try_page_mte_tagging(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 460d799e1296..a287c1dea871 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
*
* It needs to cope with hardware update of the accessed/dirty state by other
* agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like set_pte_at(), the PTE is never changed from no-exec to exec here.
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
*
* Returns whether or not the PTE actually changed.
*/
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f5aae342632c..741cb53672fd 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -254,12 +254,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,

if (!pte_present(pte)) {
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
- set_pte_at(mm, addr, ptep, pte);
+ __set_ptes(mm, addr, ptep, pte, 1);
return;
}

if (!pte_cont(pte)) {
- set_pte_at(mm, addr, ptep, pte);
+ __set_ptes(mm, addr, ptep, pte, 1);
return;
}

@@ -270,7 +270,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
clear_flush(mm, addr, ptep, pgsize, ncontig);

for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
}

pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -478,7 +478,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,

hugeprot = pte_pgprot(pte);
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);

return 1;
}
@@ -507,7 +507,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
pfn = pte_pfn(pte);

for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
}

pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
--
2.25.1

2023-11-15 16:32:37

by Ryan Roberts

Subject: [PATCH v2 09/14] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++----
arch/arm64/mm/fault.c | 6 +++---
arch/arm64/mm/hugetlbpage.c | 2 +-
3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 423cc32b2777..85010c2d4dfa 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,

/*
* Check for potential race with hardware updates of the pte
- * (ptep_set_access_flags safely changes valid ptes without going
+ * (__ptep_set_access_flags safely changes valid ptes without going
* through an invalid entry).
*/
VM_WARN_ONCE(!pte_young(pte),
@@ -842,8 +842,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
}

-#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
-extern int ptep_set_access_flags(struct vm_area_struct *vma,
+extern int __ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty);

@@ -853,7 +852,8 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
{
- return ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty);
+ return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
+ pmd_pte(entry), dirty);
}

static inline int pud_devmap(pud_t pud)
@@ -1122,6 +1122,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_clear_flush_young __ptep_clear_flush_young
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define ptep_set_wrprotect __ptep_set_wrprotect
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags __ptep_set_access_flags

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a287c1dea871..7cebd9847aae 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -209,9 +209,9 @@ static void show_pte(unsigned long addr)
*
* Returns whether or not the PTE actually changed.
*/
-int ptep_set_access_flags(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep,
- pte_t entry, int dirty)
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep,
+ pte_t entry, int dirty)
{
pteval_t old_pteval, pteval;
pte_t pte = READ_ONCE(*ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 952462820d9d..627a9717e98c 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -459,7 +459,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pte_t orig_pte;

if (!pte_cont(pte))
- return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
+ return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);
dpfn = pgsize >> PAGE_SHIFT;
--
2.25.1

2023-11-15 16:32:51

by Ryan Roberts

Subject: [PATCH v2 11/14] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.
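
For context, a rough sketch of how the forthcoming contpte code can use this
when clearing the young bit across a block (illustrative only; the real version
arrives with the contpte patches):

	/* young accumulated over the block via __ptep_test_and_clear_young() */
	if (young) {
		/*
		 * As for the single-pte case, the trailing DSB can be elided
		 * for access-flag invalidation; racing updates are tolerable.
		 */
		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
					 PAGE_SIZE, true, 3);
	}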

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bb2c2833a987..925ef3bdf9ed 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -399,7 +399,7 @@ do { \
#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false)

-static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
@@ -431,10 +431,19 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
else
__flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true);

- dsb(ish);
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
}

+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long stride, bool last_level,
+ int tlb_level)
+{
+ __flush_tlb_range_nosync(vma, start, end, stride,
+ last_level, tlb_level);
+ dsb(ish);
+}
+
static inline void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
--
2.25.1

2023-11-15 16:33:09

by Ryan Roberts

Subject: [PATCH v2 02/14] arm64/mm: set_pte(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 12 ++++++++----
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/kasan_init.c | 4 ++--
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/pageattr.c | 2 +-
arch/arm64/mm/trans_pgd.c | 4 ++--
7 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..650d4f4bb6dc 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

#define pte_none(pte) (!pte_val(pte))
-#define pte_clear(mm,addr,ptep) set_pte(ptep, __pte(0))
+#define pte_clear(mm, addr, ptep) \
+ __set_pte(ptep, __pte(0))
#define pte_page(pte) (pfn_to_page(pte_pfn(pte)))

/*
@@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
}

-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);

@@ -350,7 +351,7 @@ static inline void set_ptes(struct mm_struct *mm,

for (;;) {
__check_safe_pte_update(mm, ptep, pte);
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
if (--nr == 0)
break;
ptep++;
@@ -534,7 +535,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
{
__sync_cache_and_tags(pte, nr);
__check_safe_pte_update(mm, ptep, pte);
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
}

static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
@@ -1118,6 +1119,9 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);
+
+#define set_pte __set_pte
+
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..44288a12fc6c 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -111,7 +111,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
pte = set_pte_bit(pte, __pgprot(PTE_PXN));
else if (system_supports_bti_kernel() && spd->has_bti)
pte = set_pte_bit(pte, __pgprot(PTE_GP));
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
return 0;
}

diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index c0a3301203bd..51cd4501816d 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -121,7 +121,7 @@ void __set_fixmap(enum fixed_addresses idx,
ptep = fixmap_pte(addr);

if (pgprot_val(flags)) {
- set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
+ __set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 555285ebd5af..5eade712e9e5 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -112,7 +112,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
if (!early)
memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
next = addr + PAGE_SIZE;
- set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
+ __set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
}

@@ -266,7 +266,7 @@ static void __init kasan_init_shadow(void)
* so we should make sure that it maps the zero page read-only.
*/
for (i = 0; i < PTRS_PER_PTE; i++)
- set_pte(&kasan_early_shadow_pte[i],
+ __set_pte(&kasan_early_shadow_pte[i],
pfn_pte(sym_to_pfn(kasan_early_shadow_page),
PAGE_KERNEL_RO));

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 15f6347d23b6..e884279b268e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -178,7 +178,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
do {
pte_t old_pte = READ_ONCE(*ptep);

- set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+ __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

/*
* After the PTE entry has been populated once, we
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 8e2017ba5f1b..057097acf9e0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -41,7 +41,7 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
pte = clear_pte_bit(pte, cdata->clear_mask);
pte = set_pte_bit(pte, cdata->set_mask);

- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
return 0;
}

diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..230b607cf881 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -41,7 +41,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
* read only (code, rodata). Clear the RDONLY bit from
* the temporary mappings we use during restore.
*/
- set_pte(dst_ptep, pte_mkwrite_novma(pte));
+ __set_pte(dst_ptep, pte_mkwrite_novma(pte));
} else if ((debug_pagealloc_enabled() ||
is_kfence_address((void *)addr)) && !pte_none(pte)) {
/*
@@ -55,7 +55,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
*/
BUG_ON(!pfn_valid(pte_pfn(pte)));

- set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
+ __set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
}
}

--
2.25.1

2023-11-15 16:33:11

by Ryan Roberts

Subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
maps a physically contiguous block of memory, all belonging to the same
folio, with the same permissions, and for shared mappings, the same
dirty state. This will likely improve performance by a tiny amount due
to batching the folio reference count management and calling set_ptes()
rather than making individual calls to set_pte_at().

However, the primary motivation for this change is to reduce the number
of tlb maintenance operations that the arm64 backend has to perform
during fork, as it is about to add transparent support for the
"contiguous bit" in its ptes. By write-protecting the parent using the
new ptep_set_wrprotects() (note the 's' at the end) function, the
backend can avoid having to unfold contig ranges of PTEs, which is
expensive, when all ptes in the range are being write-protected.
Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
in the child, the backend does not need to fold a contiguous range once
they are all populated - they can be initially populated as a contiguous
range in the first place.

This change addresses the core-mm refactoring only, and introduces
ptep_set_wrprotects() with a default implementation that calls
ptep_set_wrprotect() for each pte in the range. A separate change will
implement ptep_set_wrprotects() in the arm64 backend to realize the
performance improvement as part of the work to enable contpte mappings.
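
As a concrete example of the resulting call pattern (assuming a 64K folio mapped
by 16 contiguous ptes on a 4K kernel; names as in copy_present_ptes() below),
fork's copy path goes from 16 individual calls per side to one batched call:

	/* parent: write-protect the whole batch in one call */
	ptep_set_wrprotects(src_mm, addr, src_pte, 16);

	/* child: populate the whole batch in one call */
	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, 16);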

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 13 +++
mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
2 files changed, 150 insertions(+), 38 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..1c50f8a0fdde 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
}
#endif

+#ifndef ptep_set_wrprotects
+struct mm_struct;
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+ ptep_set_wrprotect(mm, address, ptep);
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..b7c8228883cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
/* Uffd-wp needs to be delivered to dest pte as well */
pte = pte_mkuffd_wp(pte);
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
- return 0;
+ return 1;
+}
+
+static inline unsigned long page_cont_mapped_vaddr(struct page *page,
+ struct page *anchor, unsigned long anchor_vaddr)
+{
+ unsigned long offset;
+ unsigned long vaddr;
+
+ offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+ vaddr = anchor_vaddr + offset;
+
+ if (anchor > page) {
+ if (vaddr > anchor_vaddr)
+ return 0;
+ } else {
+ if (vaddr < anchor_vaddr)
+ return ULONG_MAX;
+ }
+
+ return vaddr;
+}
+
+static int folio_nr_pages_cont_mapped(struct folio *folio,
+ struct page *page, pte_t *pte,
+ unsigned long addr, unsigned long end,
+ pte_t ptent, bool *any_dirty)
+{
+ int floops;
+ int i;
+ unsigned long pfn;
+ pgprot_t prot;
+ struct page *folio_end;
+
+ if (!folio_test_large(folio))
+ return 1;
+
+ folio_end = &folio->page + folio_nr_pages(folio);
+ end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
+ floops = (end - addr) >> PAGE_SHIFT;
+ pfn = page_to_pfn(page);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
+
+ *any_dirty = pte_dirty(ptent);
+
+ pfn++;
+ pte++;
+
+ for (i = 1; i < floops; i++) {
+ ptent = ptep_get(pte);
+ ptent = pte_mkold(pte_mkclean(ptent));
+
+ if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
+ pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
+ break;
+
+ if (pte_dirty(ptent))
+ *any_dirty = true;
+
+ pfn++;
+ pte++;
+ }
+
+ return i;
}

/*
- * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
+ * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
+ * first pte.
*/
static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
- pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
- struct folio **prealloc)
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+ pte_t *dst_pte, pte_t *src_pte,
+ unsigned long addr, unsigned long end,
+ int *rss, struct folio **prealloc)
{
struct mm_struct *src_mm = src_vma->vm_mm;
unsigned long vm_flags = src_vma->vm_flags;
pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
+ int nr = 1;
+ bool anon;
+ bool any_dirty = pte_dirty(pte);
+ int i;

page = vm_normal_page(src_vma, addr, pte);
- if (page)
+ if (page) {
folio = page_folio(page);
- if (page && folio_test_anon(folio)) {
- /*
- * If this page may have been pinned by the parent process,
- * copy the page immediately for the child so that we'll always
- * guarantee the pinned page won't be randomly replaced in the
- * future.
- */
- folio_get(folio);
- if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
- /* Page may be pinned, we have to copy. */
- folio_put(folio);
- return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, prealloc, page);
+ anon = folio_test_anon(folio);
+ nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
+ end, pte, &any_dirty);
+
+ for (i = 0; i < nr; i++, page++) {
+ if (anon) {
+ /*
+ * If this page may have been pinned by the
+ * parent process, copy the page immediately for
+ * the child so that we'll always guarantee the
+ * pinned page won't be randomly replaced in the
+ * future.
+ */
+ if (unlikely(page_try_dup_anon_rmap(
+ page, false, src_vma))) {
+ if (i != 0)
+ break;
+ /* Page may be pinned, we have to copy. */
+ return copy_present_page(
+ dst_vma, src_vma, dst_pte,
+ src_pte, addr, rss, prealloc,
+ page);
+ }
+ rss[MM_ANONPAGES]++;
+ VM_BUG_ON(PageAnonExclusive(page));
+ } else {
+ page_dup_file_rmap(page, false);
+ rss[mm_counter_file(page)]++;
+ }
}
- rss[MM_ANONPAGES]++;
- } else if (page) {
- folio_get(folio);
- page_dup_file_rmap(page, false);
- rss[mm_counter_file(page)]++;
+
+ nr = i;
+ folio_ref_add(folio, nr);
}

/*
@@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
* in the parent and the child
*/
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
+ ptep_set_wrprotects(src_mm, addr, src_pte, nr);
pte = pte_wrprotect(pte);
}
- VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));

/*
- * If it's a shared mapping, mark it clean in
- * the child
+ * If it's a shared mapping, mark it clean in the child. If its a
+ * private mapping, mark it dirty in the child if _any_ of the parent
+ * mappings in the block were marked dirty. The contiguous block of
+ * mappings are all backed by the same folio, so if any are dirty then
+ * the whole folio is dirty. This allows us to determine the batch size
+ * without having to ever consider the dirty bit. See
+ * folio_nr_pages_cont_mapped().
*/
- if (vm_flags & VM_SHARED)
- pte = pte_mkclean(pte);
- pte = pte_mkold(pte);
+ pte = pte_mkold(pte_mkclean(pte));
+ if (!(vm_flags & VM_SHARED) && any_dirty)
+ pte = pte_mkdirty(pte);

if (!userfaultfd_wp(dst_vma))
pte = pte_clear_uffd_wp(pte);

- set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
- return 0;
+ set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+ return nr;
}

static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
@@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
*/
WARN_ON_ONCE(ret != -ENOENT);
}
- /* copy_present_pte() will clear `*prealloc' if consumed */
- ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, &prealloc);
+ /* copy_present_ptes() will clear `*prealloc' if consumed */
+ ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+ addr, end, rss, &prealloc);
+
/*
* If we need a pre-allocated page for this pte, drop the
* locks, allocate, and try again.
*/
if (unlikely(ret == -EAGAIN))
break;
+
+ /*
+ * Positive return value is the number of ptes copied.
+ */
+ VM_WARN_ON_ONCE(ret < 1);
+ progress += 8 * ret;
+ ret--;
+ dst_pte += ret;
+ src_pte += ret;
+ addr += ret << PAGE_SHIFT;
+ ret = 0;
+
if (unlikely(prealloc)) {
/*
* pre-alloc page cannot be reused by next time so as
@@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
folio_put(prealloc);
prealloc = NULL;
}
- progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

arch_leave_lazy_mmu_mode();
--
2.25.1

2023-11-15 16:33:16

by Ryan Roberts

Subject: [PATCH v2 04/14] arm64/mm: pte_clear(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 3 ++-
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/arm64/mm/mmu.c | 2 +-
4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 323ec91add60..1464e990580a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

#define pte_none(pte) (!pte_val(pte))
-#define pte_clear(mm, addr, ptep) \
+#define __pte_clear(mm, addr, ptep) \
__set_pte(ptep, __pte(0))
#define pte_page(pte) (pfn_to_page(pte_pfn(pte)))

@@ -1121,6 +1121,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,

#define set_pte __set_pte
#define set_ptes __set_ptes
+#define pte_clear __pte_clear

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index 51cd4501816d..bfc02568805a 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -123,7 +123,7 @@ void __set_fixmap(enum fixed_addresses idx,
if (pgprot_val(flags)) {
__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
- pte_clear(&init_mm, addr, ptep);
+ __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
}
}
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 741cb53672fd..510b2d4b89a9 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -400,7 +400,7 @@ void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
ncontig = num_contig_ptes(sz, &pgsize);

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- pte_clear(mm, addr, ptep);
+ __pte_clear(mm, addr, ptep);
}

pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e884279b268e..080e9b50f595 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -859,7 +859,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
continue;

WARN_ON(!pte_present(pte));
- pte_clear(&init_mm, addr, ptep);
+ __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
if (free_mapped)
free_hotplug_page_range(pte_page(pte),
--
2.25.1

2023-11-15 16:33:17

by Ryan Roberts

Subject: [PATCH v2 05/14] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 5 +++--
arch/arm64/mm/hugetlbpage.c | 6 +++---
2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1464e990580a..994597a0bb0f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -941,8 +941,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
unsigned long address, pte_t *ptep)
{
pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
@@ -1122,6 +1121,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define set_pte __set_pte
#define set_ptes __set_ptes
#define pte_clear __pte_clear
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define ptep_get_and_clear __ptep_get_and_clear

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 510b2d4b89a9..c2a753541d13 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -188,7 +188,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long i;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
- pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+ pte_t pte = __ptep_get_and_clear(mm, addr, ptep);

/*
* If HW_AFDBM is enabled, then the HW could turn on
@@ -236,7 +236,7 @@ static void clear_flush(struct mm_struct *mm,
unsigned long i, saddr = addr;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- ptep_clear(mm, addr, ptep);
+ __ptep_get_and_clear(mm, addr, ptep);

flush_tlb_range(&vma, saddr, addr);
}
@@ -411,7 +411,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
pte_t orig_pte = ptep_get(ptep);

if (!pte_cont(orig_pte))
- return ptep_get_and_clear(mm, addr, ptep);
+ return __ptep_get_and_clear(mm, addr, ptep);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);

--
2.25.1

2023-11-15 16:33:19

by Ryan Roberts

Subject: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

ptep_get_and_clear_full() adds a 'full' parameter which is not present
for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
a full address space teardown is in progress. We use this information to
optimize arm64_sys_exit_group() by avoiding the need to unfold (and therefore
tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
contiguous neighbours to keep their contig bit set, because we know we
are about to clear the rest too.

Before this optimization, the cost of arm64_sys_exit_group() exploded to
32x what it was before PTE_CONT support was wired up, when compiling the
kernel. With this optimization in place, we are back down to the
original cost.

This approach is not perfect though, as for the duration between
returning from the first call to ptep_get_and_clear_full() and making
the final call, the contpte block is in an intermediate state, where some
ptes are cleared and others are still set with the PTE_CONT bit. If any
other APIs are called for the ptes in the contpte block during that
time, we have to be very careful. The core code currently interleaves
calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
must be careful to ignore the cleared entries when accumulating the
access and dirty bits - the same goes for ptep_get_lockless(). The only
other calls we might reasonably expect are to set markers in the
previously cleared ptes. (We shouldn't see valid entries being set until
after the tlbi, at which point we are no longer in the intermediate
state). Since markers are not valid, this is safe; set_ptes() will see
the old, invalid entry and will not attempt to unfold. And the new pte
is also invalid so it won't attempt to fold. We shouldn't see this for
the 'full' case anyway.

The last remaining issue is returning the access/dirty bits. That info
could be present in any of the ptes in the contpte block. ptep_get()
will gather those bits from across the contpte block. We don't bother
doing that here, because we know that the information is used by the
core-mm to mark the underlying folio as accessed/dirty. And since the
same folio must be underpinning the whole block (that was a requirement
for folding in the first place), that information will make it to the
folio eventually once all the ptes have been cleared. This approach
means we don't have to play games with accumulating and storing the
bits. It does mean that any interleaved calls to ptep_get() may lack
correct access/dirty information if we have already cleared the pte that
happened to store it. The core code does not rely on this though.
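
For context, the intended caller is the core-mm teardown path, which passes
tlb->fullmm as 'full'; roughly (mm/memory.c, zap_pte_range()):

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);

During a full address space teardown tlb->fullmm is set, so full == 1 and the
no-unfold path described above is taken for every pte in the contpte block.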

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 18 +++++++++--
arch/arm64/mm/contpte.c | 54 ++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9bd2f57a9e11..ea58a9f4e700 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1145,6 +1145,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep);
extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1270,12 +1272,24 @@ static inline void pte_clear(struct mm_struct *mm,
__pte_clear(mm, addr, ptep);
}

+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, int full)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte) || !full) {
+ contpte_try_unfold(mm, addr, ptep, orig_pte);
+ return __ptep_get_and_clear(mm, addr, ptep);
+ } else
+ return contpte_ptep_get_and_clear_full(mm, addr, ptep);
+}
+
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
- return __ptep_get_and_clear(mm, addr, ptep);
+ return ptep_get_and_clear_full(mm, addr, ptep, 0);
}

#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 426be9cd4dea..5d1aaed82d32 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -144,6 +144,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);

+ /*
+ * Deal with the partial contpte_ptep_get_and_clear_full() case,
+ * where some of the ptes in the range may be cleared but others
+ * are still to do. See contpte_ptep_get_and_clear_full().
+ */
+ if (pte_val(pte) == 0)
+ continue;
+
if (pte_dirty(pte))
orig_pte = pte_mkdirty(orig_pte);

@@ -256,6 +264,52 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL(contpte_set_ptes);

+pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ /*
+ * When doing a full address space teardown, we can avoid unfolding the
+ * contiguous range, and therefore avoid the associated tlbi. Instead,
+ * just get and clear the pte. The caller is promising to call us for
+ * every pte, so every pte in the range will be cleared by the time the
+ * tlbi is issued.
+ *
+ * This approach is not perfect though, as for the duration between
+ * returning from the first call to ptep_get_and_clear_full() and making
+ * the final call, the contpte block is in an intermediate state, where
+ * some ptes are cleared and others are still set with the PTE_CONT bit.
+ * If any other APIs are called for the ptes in the contpte block during
+ * that time, we have to be very careful. The core code currently
+ * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
+ * ptep_get() must be careful to ignore the cleared entries when
+ * accumulating the access and dirty bits - the same goes for
+ * ptep_get_lockless(). The only other calls we might reasonably expect
+ * are to set markers in the previously cleared ptes. (We shouldn't see
+ * valid entries being set until after the tlbi, at which point we are
+ * no longer in the intermediate state). Since markers are not valid,
+ * this is safe; set_ptes() will see the old, invalid entry and will not
+ * attempt to unfold. And the new pte is also invalid so it won't
+ * attempt to fold. We shouldn't see this for the 'full' case anyway.
+ *
+ * The last remaining issue is returning the access/dirty bits. That
+ * info could be present in any of the ptes in the contpte block.
+ * ptep_get() will gather those bits from across the contpte block. We
+ * don't bother doing that here, because we know that the information is
+ * used by the core-mm to mark the underlying folio as accessed/dirty.
+ * And since the same folio must be underpinning the whole block (that
+ * was a requirement for folding in the first place), that information
+ * will make it to the folio eventually once all the ptes have been
+ * cleared. This approach means we don't have to play games with
+ * accumulating and storing the bits. It does mean that any interleaved
+ * calls to ptep_get() may lack correct access/dirty information if we
+ * have already cleared the pte that happened to store it. The core code
+ * does not rely on this though.
+ */
+
+ return __ptep_get_and_clear(mm, addr, ptep);
+}
+EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
+
int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
--
2.25.1

2023-11-15 16:33:31

by Ryan Roberts

Subject: [PATCH v2 10/14] arm64/mm: ptep_get(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

arm64 did not previously define an arch-specific ptep_get(), so override
the default version in the arch code, and also define the private
__ptep_get() version. Currently they both do the same thing that the
default version does (READ_ONCE()). Some arch users (hugetlb) were
already using ptep_get(), so convert those to the private API. Other
callsites were doing a direct READ_ONCE(), so convert those to use
the appropriate (public/private) API too.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 12 +++++++++---
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/mm/fault.c | 4 ++--
arch/arm64/mm/hugetlbpage.c | 18 +++++++++---------
arch/arm64/mm/kasan_init.c | 2 +-
arch/arm64/mm/mmu.c | 12 ++++++------
arch/arm64/mm/pageattr.c | 4 ++--
arch/arm64/mm/trans_pgd.c | 2 +-
8 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 85010c2d4dfa..6930c14f062f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -276,6 +276,11 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
}
}

+static inline pte_t __ptep_get(pte_t *ptep)
+{
+ return READ_ONCE(*ptep);
+}
+
extern void __sync_icache_dcache(pte_t pteval);
bool pgattr_change_is_safe(u64 old, u64 new);

@@ -303,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
if (!IS_ENABLED(CONFIG_DEBUG_VM))
return;

- old_pte = READ_ONCE(*ptep);
+ old_pte = __ptep_get(ptep);

if (!pte_valid(old_pte) || !pte_valid(pte))
return;
@@ -893,7 +898,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
{
pte_t old_pte, pte;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_mkold(pte);
@@ -966,7 +971,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
{
pte_t old_pte, pte;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
@@ -1111,6 +1116,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);

+#define ptep_get __ptep_get
#define set_pte __set_pte
#define set_ptes __set_ptes
#define pte_clear __pte_clear
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 44288a12fc6c..9afcc690fe73 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
{
struct set_perm_data *spd = data;
const efi_memory_desc_t *md = spd->md;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

if (md->attribute & EFI_MEMORY_RO)
pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 7cebd9847aae..d63f3a0a7251 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
if (!ptep)
break;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
pr_cont(", pte=%016llx", pte_val(pte));
pte_unmap(ptep);
} while(0);
@@ -214,7 +214,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
pte_t entry, int dirty)
{
pteval_t old_pteval, pteval;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

if (pte_same(pte, entry))
return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 627a9717e98c..52fb767607e0 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -152,14 +152,14 @@ pte_t huge_ptep_get(pte_t *ptep)
{
int ncontig, i;
size_t pgsize;
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);

if (!pte_present(orig_pte) || !pte_cont(orig_pte))
return orig_pte;

ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
for (i = 0; i < ncontig; i++, ptep++) {
- pte_t pte = ptep_get(ptep);
+ pte_t pte = __ptep_get(ptep);

if (pte_dirty(pte))
orig_pte = pte_mkdirty(orig_pte);
@@ -184,7 +184,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long pgsize,
unsigned long ncontig)
{
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);
unsigned long i;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
@@ -408,7 +408,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
{
int ncontig;
size_t pgsize;
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);

if (!pte_cont(orig_pte))
return __ptep_get_and_clear(mm, addr, ptep);
@@ -431,11 +431,11 @@ static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
{
int i;

- if (pte_write(pte) != pte_write(ptep_get(ptep)))
+ if (pte_write(pte) != pte_write(__ptep_get(ptep)))
return 1;

for (i = 0; i < ncontig; i++) {
- pte_t orig_pte = ptep_get(ptep + i);
+ pte_t orig_pte = __ptep_get(ptep + i);

if (pte_dirty(pte) != pte_dirty(orig_pte))
return 1;
@@ -492,7 +492,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
size_t pgsize;
pte_t pte;

- if (!pte_cont(READ_ONCE(*ptep))) {
+ if (!pte_cont(__ptep_get(ptep))) {
__ptep_set_wrprotect(mm, addr, ptep);
return;
}
@@ -517,7 +517,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
size_t pgsize;
int ncontig;

- if (!pte_cont(READ_ONCE(*ptep)))
+ if (!pte_cont(__ptep_get(ptep)))
return ptep_clear_flush(vma, addr, ptep);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);
@@ -550,7 +550,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr
* when the permission changes from executable to non-executable
* in cases where cpu is affected with errata #2645198.
*/
- if (pte_user_exec(READ_ONCE(*ptep)))
+ if (pte_user_exec(__ptep_get(ptep)))
return huge_ptep_clear_flush(vma, addr, ptep);
}
return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 5eade712e9e5..5274c317d775 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -113,7 +113,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
next = addr + PAGE_SIZE;
__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
- } while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
+ } while (ptep++, addr = next, addr != end && pte_none(__ptep_get(ptep)));
}

static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 080e9b50f595..784f1e312447 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -176,7 +176,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,

ptep = pte_set_fixmap_offset(pmdp, addr);
do {
- pte_t old_pte = READ_ONCE(*ptep);
+ pte_t old_pte = __ptep_get(ptep);

__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

@@ -185,7 +185,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
* only allow updates to the permission attributes.
*/
BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
- READ_ONCE(pte_val(*ptep))));
+ pte_val(__ptep_get(ptep))));

phys += PAGE_SIZE;
} while (ptep++, addr += PAGE_SIZE, addr != end);
@@ -854,7 +854,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,

do {
ptep = pte_offset_kernel(pmdp, addr);
- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
if (pte_none(pte))
continue;

@@ -987,7 +987,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,

do {
ptep = pte_offset_kernel(pmdp, addr);
- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);

/*
* This is just a sanity check here which verifies that
@@ -1006,7 +1006,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
*/
ptep = pte_offset_kernel(pmdp, 0UL);
for (i = 0; i < PTRS_PER_PTE; i++) {
- if (!pte_none(READ_ONCE(ptep[i])))
+ if (!pte_none(__ptep_get(&ptep[i])))
return;
}

@@ -1475,7 +1475,7 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
* when the permission changes from executable to non-executable
* in cases where cpu is affected with errata #2645198.
*/
- if (pte_user_exec(READ_ONCE(*ptep)))
+ if (pte_user_exec(ptep_get(ptep)))
return ptep_clear_flush(vma, addr, ptep);
}
return ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 057097acf9e0..624b0b0982e3 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -36,7 +36,7 @@ bool can_set_direct_map(void)
static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
{
struct page_change_data *cdata = data;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

pte = clear_pte_bit(pte, cdata->clear_mask);
pte = set_pte_bit(pte, cdata->set_mask);
@@ -246,5 +246,5 @@ bool kernel_page_present(struct page *page)
return true;

ptep = pte_offset_kernel(pmdp, addr);
- return pte_valid(READ_ONCE(*ptep));
+ return pte_valid(__ptep_get(ptep));
}
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 230b607cf881..5139a28130c0 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -33,7 +33,7 @@ static void *trans_alloc(struct trans_pgd_info *info)

static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
{
- pte_t pte = READ_ONCE(*src_ptep);
+ pte_t pte = __ptep_get(src_ptep);

if (pte_valid(pte)) {
/*
--
2.25.1

2023-11-15 16:33:38

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v2 13/14] arm64/mm: Implement ptep_set_wrprotects() to optimize fork()

With the core-mm changes in place to batch-copy ptes during fork, we can
take advantage of this in arm64 to greatly reduce the number of tlbis we
have to issue, and recover the fork performance that was lost when adding
support for transparent contiguous ptes.

If we are write-protecting a whole contig range, we can apply the
write-protection to the whole range and know that it won't change
whether the range should have the contiguous bit set or not. For ranges
smaller than the contig range, we will still have to unfold, apply the
write-protection, then fold if the change now means the range is
foldable.
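
Condensed, the per-block logic implemented below in contpte_set_wrprotects()
is roughly the following sketch:

	next = pte_cont_addr_end(addr, end);
	nr = (next - addr) >> PAGE_SHIFT;

	/* Partial block: must unfold before wrprotecting a subset. */
	if (nr != CONT_PTES)
		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));

	__ptep_set_wrprotects(mm, addr, ptep, nr);

	/* Partial block: the change may now have made the range foldable. */
	if (nr != CONT_PTES)
		contpte_try_fold(mm, next - PAGE_SIZE, ptep + nr - 1,
				 __ptep_get(ptep + nr - 1));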

This optimization is possible thanks to the tightening of the Arm ARM
with respect to the definition and behaviour when 'Misprogramming the
Contiguous bit'. See section D21194 at
https://developer.arm.com/documentation/102105/latest/

Performance tested with the following test written for the will-it-scale
framework:

-------

char *testcase_description = "fork and exit";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	int pid;
	char *mem;

	mem = malloc(SZ_128M);
	assert(mem);
	memset(mem, 1, SZ_128M);

	while (1) {
		pid = fork();
		assert(pid >= 0);

		if (!pid)
			exit(0);

		waitpid(pid, NULL, 0);

		(*iterations)++;
	}
}

-------

I saw a huge performance regression when PTE_CONT support was added; the
regression is mostly fixed with the addition of this change. The
following shows the regression relative to before PTE_CONT was enabled
(a bigger negative value is a bigger regression):

| cpus | before opt | after opt |
|-------:|-------------:|------------:|
| 1 | -10.4% | -5.2% |
| 8 | -15.4% | -3.5% |
| 16 | -38.7% | -3.7% |
| 24 | -57.0% | -4.4% |
| 32 | -65.8% | -5.4% |

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 30 ++++++++++++++++++++---
arch/arm64/mm/contpte.c | 42 ++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 15bc9cf1eef4..9bd2f57a9e11 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -984,6 +984,16 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
} while (pte_val(pte) != pte_val(old_pte));
}

+static inline void __ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+ __ptep_set_wrprotect(mm, address, ptep);
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_PMDP_SET_WRPROTECT
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1139,6 +1149,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
+extern void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr);
extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
@@ -1290,13 +1302,25 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
return contpte_ptep_clear_flush_young(vma, addr, ptep);
}

+#define ptep_set_wrprotects ptep_set_wrprotects
+static inline void ptep_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr)
+{
+ if (!contpte_is_enabled(mm))
+ __ptep_set_wrprotects(mm, addr, ptep, nr);
+ else if (nr == 1) {
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __ptep_set_wrprotects(mm, addr, ptep, 1);
+ contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+ } else
+ contpte_set_wrprotects(mm, addr, ptep, nr);
+}
+
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
- __ptep_set_wrprotect(mm, addr, ptep);
- contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+ ptep_set_wrprotects(mm, addr, ptep, 1);
}

#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 667bcf7c3260..426be9cd4dea 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -302,6 +302,48 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
}
EXPORT_SYMBOL(contpte_ptep_clear_flush_young);

+void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr)
+{
+ unsigned long next;
+ unsigned long end = addr + (nr << PAGE_SHIFT);
+
+ do {
+ next = pte_cont_addr_end(addr, end);
+ nr = (next - addr) >> PAGE_SHIFT;
+
+ /*
+ * If wrprotecting an entire contig range, we can avoid
+ * unfolding. Just set wrprotect and wait for the later
+ * mmu_gather flush to invalidate the tlb. Until the flush, the
+ * page may or may not be wrprotected. After the flush, it is
+ * guaranteed wrprotected. If it's a partial range though, we
+ * must unfold, because we can't have a case where CONT_PTE is
+ * set but wrprotect applies to a subset of the PTEs; this would
+ * cause it to continue to be unpredictable after the flush.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+ __ptep_set_wrprotects(mm, addr, ptep, nr);
+
+ addr = next;
+ ptep += nr;
+
+ /*
+ * If applying to a partial contig range, the change could have
+ * made the range foldable. Use the last pte in the range we
+ * just set for comparison, since contpte_try_fold() only
+ * triggers when acting on the last pte in the contig range.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_fold(mm, addr - PAGE_SIZE, ptep - 1,
+ __ptep_get(ptep - 1));
+
+ } while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_wrprotects);
+
int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty)
--
2.25.1

2023-11-15 16:33:38

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings. Whenever it detects a set of PTEs that meet the
requirements for a contiguous range, the PTEs are re-painted with the
PTE_CONT bit. Use of contpte mappings is intended to be transparent to
the core-mm, which continues to interact with individual ptes.

Since a contpte block only has a single access and dirty bit, the
semantic here changes slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information is
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.
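
In other words, for a pte within a contpte block, ptep_get() effectively does
the following (a simplified view of contpte_ptep_get(), added below):

	pte_t orig_pte = __ptep_get(ptep);
	pte_t *first = contpte_align_down(ptep);
	int i;

	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = __ptep_get(first + i);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;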

This initial change provides a baseline that can be optimized in future
commits. That said, fold/unfold operations (which imply tlb
invalidation) are avoided where possible with a few tricks for
access/dirty bit management. Write-protect modifications for contpte
mappings are currently non-optimal, and incur a regression in fork()
performance. This will be addressed in follow-up changes.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/Kconfig | 10 +-
arch/arm64/include/asm/pgtable.h | 202 ++++++++++++++++++
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/contpte.c | 351 +++++++++++++++++++++++++++++++
4 files changed, 563 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b071a00425d..de76e484ff3a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
select UNWIND_TABLES
select DYNAMIC_SCS

+config ARM64_CONTPTE
+ bool "Contiguous PTE mappings for user memory" if EXPERT
+ depends on TRANSPARENT_HUGEPAGE
+ default y
+ help
+ When enabled, user mappings are configured using the PTE contiguous
+ bit, for any mappings that meet the size and alignment requirements.
+ This reduces TLB pressure and improves performance.
+
endmenu # "Kernel Features"

menu "Boot options"
@@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
source "drivers/acpi/Kconfig"

source "arch/arm64/kvm/Kconfig"
-
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6930c14f062f..15bc9cf1eef4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
*/
#define pte_valid_not_user(pte) \
((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
+/*
+ * Returns true if the pte is valid and has the contiguous bit set.
+ */
+#define pte_valid_cont(pte) (pte_valid(pte) && pte_cont(pte))
/*
* Could the pte be present in the TLB? We must check mm_tlb_flush_pending
* so that we don't erroneously return false for pages that have been
@@ -1116,6 +1120,202 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);

+#ifdef CONFIG_ARM64_CONTPTE
+
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte);
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty);
+
+static inline pte_t *contpte_align_down(pte_t *ptep)
+{
+ return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
+}
+
+static inline bool contpte_is_enabled(struct mm_struct *mm)
+{
+ /*
+ * Don't attempt to apply the contig bit to kernel mappings, because
+ * dynamically adding/removing the contig bit can cause page faults.
+ * These racing faults are ok for user space, since they get serialized
+ * on the PTL. But kernel mappings can't tolerate faults.
+ */
+
+ return mm != &init_mm;
+}
+
+static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * Only bother trying if both the virtual and physical addresses are
+ * aligned and correspond to the last entry in a contig range. The core
+ * code mostly modifies ranges from low to high, so this is likely the
+ * last modification in the contig range, and hence a good time to fold.
+ * We can't fold special mappings, because there is no associated folio.
+ */
+
+ bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
+ bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
+
+ if (contpte_is_enabled(mm) && valign && palign &&
+ pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
+ __contpte_try_fold(mm, addr, ptep, pte);
+}
+
+static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ if (contpte_is_enabled(mm) && pte_valid_cont(pte))
+ __contpte_try_unfold(mm, addr, ptep, pte);
+}
+
+/*
+ * The below functions constitute the public API that arm64 presents to the
+ * core-mm to manipulate PTE entries within their page tables (or at least
+ * this is the subset of the API that arm64 needs to implement). These public
+ * versions will automatically and transparently apply the contiguous bit where
+ * it makes sense to do so. Therefore any users that are contig-aware (e.g.
+ * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
+ * private versions, which are prefixed with double underscore. All of these
+ * APIs except for ptep_get_lockless() are expected to be called with the PTL
+ * held.
+ */
+
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+ pte_t pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(pte))
+ return pte;
+
+ return contpte_ptep_get(ptep, pte);
+}
+
+#define ptep_get_lockless ptep_get_lockless
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+ pte_t pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(pte))
+ return pte;
+
+ return contpte_ptep_get_lockless(ptep);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+ /*
+ * We don't have the mm or vaddr so cannot unfold or fold contig entries
+ * (since it requires tlb maintenance). set_pte() is not used in core
+ * code, so this should never even be called. Regardless, do our best to
+ * service any call and emit a warning if there is any attempt to set a
+ * pte on top of an existing contig range.
+ */
+ pte_t orig_pte = __ptep_get(ptep);
+
+ WARN_ON_ONCE(pte_valid_cont(orig_pte));
+ __set_pte(ptep, pte_mknoncont(pte));
+}
+
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ pte = pte_mknoncont(pte);
+
+ if (!contpte_is_enabled(mm))
+ __set_ptes(mm, addr, ptep, pte, nr);
+ else if (nr == 1) {
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __set_ptes(mm, addr, ptep, pte, nr);
+ contpte_try_fold(mm, addr, ptep, pte);
+ } else
+ contpte_set_ptes(mm, addr, ptep, pte, nr);
+}
+
+static inline void pte_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __pte_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ return __ptep_get_and_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_test_and_clear_young(vma, addr, ptep);
+
+ return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_clear_flush_young(vma, addr, ptep);
+
+ return contpte_ptep_clear_flush_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __ptep_set_wrprotect(mm, addr, ptep);
+ contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+}
+
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ entry = pte_mknoncont(entry);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+
+ return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+}
+
+#else /* CONFIG_ARM64_CONTPTE */
+
#define ptep_get __ptep_get
#define set_pte __set_pte
#define set_ptes __set_ptes
@@ -1131,6 +1331,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags __ptep_set_access_flags

+#endif /* CONFIG_ARM64_CONTPTE */
+
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..60454256945b 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
cache.o copypage.o flush.o \
ioremap.o mmap.o pgd.o mmu.o \
context.o proc.o pageattr.o fixmap.o
+obj-$(CONFIG_ARM64_CONTPTE) += contpte.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
new file mode 100644
index 000000000000..667bcf7c3260
--- /dev/null
+++ b/arch/arm64/mm/contpte.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/mm.h>
+#include <linux/export.h>
+#include <asm/tlbflush.h>
+
+static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, int nr)
+{
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long start_addr = addr;
+ int i;
+
+ for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
+ __pte_clear(mm, addr, ptep);
+
+ __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+}
+
+static bool ptep_any_valid(pte_t *ptep, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++, ptep++) {
+ if (pte_valid(__ptep_get(ptep)))
+ return true;
+ }
+
+ return false;
+}
+
+static void contpte_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, bool fold)
+{
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long start_addr;
+ pte_t *start_ptep;
+ int i;
+
+ start_ptep = ptep = contpte_align_down(ptep);
+ start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
+ pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
+ pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
+
+ if (pte_dirty(ptent))
+ pte = pte_mkdirty(pte);
+
+ if (pte_young(ptent))
+ pte = pte_mkyoung(pte);
+ }
+
+ __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+
+ __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
+}
+
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * We have already checked that the virtual and physical addresses are
+ * correctly aligned for a contpte mapping in contpte_try_fold() so the
+ * remaining checks are to ensure that the contpte range is fully
+ * covered by a single folio, and ensure that all the ptes are valid
+ * with contiguous PFNs and matching prots. We ignore the state of the
+ * access and dirty bits for the purpose of deciding if it's a contiguous
+ * range; the folding process will generate a single contpte entry which
+ * has a single access and dirty bit. Those 2 bits are the logical OR of
+ * their respective bits in the constituent pte entries. In order to
+ * ensure the contpte range is covered by a single folio, we must
+ * recover the folio from the pfn, but special mappings don't have a
+ * folio backing them. Fortunately contpte_try_fold() already checked
+ * that the pte is not special - we never try to fold special mappings.
+ * Note we can't use vm_normal_page() for this since we don't have the
+ * vma.
+ */
+
+ struct page *page = pte_page(pte);
+ struct folio *folio = page_folio(page);
+ unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
+ unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
+ unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
+ unsigned long pfn;
+ pgprot_t prot;
+ pte_t subpte;
+ pte_t *orig_ptep;
+ int i;
+
+ if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
+ return;
+
+ pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+ orig_ptep = ptep;
+ ptep = contpte_align_down(ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+ subpte = __ptep_get(ptep);
+ subpte = pte_mkold(pte_mkclean(subpte));
+
+ if (!pte_valid(subpte) ||
+ pte_pfn(subpte) != pfn ||
+ pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
+ return;
+ }
+
+ contpte_fold(mm, addr, orig_ptep, pte, true);
+}
+EXPORT_SYMBOL(__contpte_try_fold);
+
+void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * We have already checked that the ptes are contiguous in
+ * contpte_try_unfold(), so we can unfold unconditionally here.
+ */
+
+ contpte_fold(mm, addr, ptep, pte, false);
+}
+EXPORT_SYMBOL(__contpte_try_unfold);
+
+pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
+{
+ /*
+ * Gather access/dirty bits, which may be populated in any of the ptes
+ * of the contig range. We are guaranteed to be holding the PTL, so any
+ * contiguous range cannot be unfolded or otherwise modified under our
+ * feet.
+ */
+
+ pte_t pte;
+ int i;
+
+ ptep = contpte_align_down(ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++) {
+ pte = __ptep_get(ptep);
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get);
+
+pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
+{
+ /*
+ * Gather access/dirty bits, which may be populated in any of the ptes
+ * of the contig range. We may not be holding the PTL, so any contiguous
+ * range may be unfolded/modified/refolded under our feet. Therefore we
+ * ensure we read a _consistent_ contpte range by checking that all ptes
+ * in the range are valid and have CONT_PTE set, that all pfns are
+ * contiguous and that all pgprots are the same (ignoring access/dirty).
+ * If we find a pte that is not consistent, then we must be racing with
+ * an update so start again. If the target pte does not have CONT_PTE
+ * set then that is considered consistent on its own because it is not
+ * part of a contpte range.
+ */
+
+ pte_t orig_pte;
+ pgprot_t orig_prot;
+ pte_t *ptep;
+ unsigned long pfn;
+ pte_t pte;
+ pgprot_t prot;
+ int i;
+
+retry:
+ orig_pte = __ptep_get(orig_ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return orig_pte;
+
+ orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
+ ptep = contpte_align_down(orig_ptep);
+ pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+ pte = __ptep_get(ptep);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+ if (!pte_valid_cont(pte) ||
+ pte_pfn(pte) != pfn ||
+ pgprot_val(prot) != pgprot_val(orig_prot))
+ goto retry;
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_lockless);
+
+void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ unsigned long next;
+ unsigned long end = addr + (nr << PAGE_SHIFT);
+ unsigned long pfn = pte_pfn(pte);
+ pgprot_t prot = pte_pgprot(pte);
+ pte_t orig_pte;
+
+ do {
+ next = pte_cont_addr_end(addr, end);
+ nr = (next - addr) >> PAGE_SHIFT;
+ pte = pfn_pte(pfn, prot);
+
+ if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
+ pte = pte_mkcont(pte);
+ else
+ pte = pte_mknoncont(pte);
+
+ /*
+ * If operating on a partial contiguous range then we must first
+ * unfold the contiguous range if it was previously folded.
+ * Otherwise we could end up with overlapping tlb entries.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+ /*
+ * If we are replacing ptes that were contiguous or if the new
+ * ptes are contiguous and any of the ptes being replaced are
+ * valid, we need to clear and flush the range to prevent
+ * overlapping tlb entries.
+ */
+ orig_pte = __ptep_get(ptep);
+ if (pte_valid_cont(orig_pte) ||
+ (pte_cont(pte) && ptep_any_valid(ptep, nr)))
+ ptep_clear_flush_range(mm, addr, ptep, nr);
+
+ __set_ptes(mm, addr, ptep, pte, nr);
+
+ addr = next;
+ ptep += nr;
+ pfn += nr;
+
+ } while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_ptes);
+
+int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ /*
+ * ptep_clear_flush_young() technically requires us to clear the access
+ * flag for a _single_ pte. However, the core-mm code actually tracks
+ * access/dirty per folio, not per page. And since we only create a
+ * contig range when the range is covered by a single folio, we can get
+ * away with clearing young for the whole contig range here, so we avoid
+ * having to unfold.
+ */
+
+ int i;
+ int young = 0;
+
+ ptep = contpte_align_down(ptep);
+ addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+ young |= __ptep_test_and_clear_young(vma, addr, ptep);
+
+ return young;
+}
+EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
+
+int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ int young;
+
+ young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+
+ if (young) {
+ /*
+ * See comment in __ptep_clear_flush_young(); same rationale for
+ * eliding the trailing DSB applies here.
+ */
+ addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+ PAGE_SIZE, true, 3);
+ }
+
+ return young;
+}
+EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
+
+int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty)
+{
+ pte_t orig_pte;
+ int i;
+ unsigned long start_addr;
+
+ /*
+ * Gather the access/dirty bits for the contiguous range. If nothing has
+ * changed, it's a no-op.
+ */
+ orig_pte = ptep_get(ptep);
+ if (pte_val(orig_pte) == pte_val(entry))
+ return 0;
+
+ /*
+ * We can fix up access/dirty bits without having to unfold/fold the
+ * contig range. But if the write bit is changing, we need to go through
+ * the full unfold/fold cycle.
+ */
+ if (pte_write(orig_pte) == pte_write(entry)) {
+ /*
+ * For HW access management, we technically only need to update
+ * the flag on a single pte in the range. But for SW access
+ * management, we need to update all the ptes to prevent extra
+ * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
+ * and instead flush the whole range at the end.
+ */
+ ptep = contpte_align_down(ptep);
+ start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+ __ptep_set_access_flags(vma, addr, ptep, entry, 0);
+
+ if (dirty)
+ __flush_tlb_range(vma, start_addr, addr,
+ PAGE_SIZE, true, 3);
+ } else {
+ __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
+ __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+ contpte_try_fold(vma->vm_mm, addr, ptep, entry);
+ }
+
+ return 1;
+}
+EXPORT_SYMBOL(contpte_ptep_set_access_flags);
--
2.25.1

2023-11-15 21:28:30

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc1 next-20231115]
[cannot apply to arm64/for-next/core efi/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
| ^~~~~~~~~~
| ptep_get
cc1: some warnings being treated as errors


vim +969 mm/memory.c

950
951 static int folio_nr_pages_cont_mapped(struct folio *folio,
952 struct page *page, pte_t *pte,
953 unsigned long addr, unsigned long end,
954 pte_t ptent, bool *any_dirty)
955 {
956 int floops;
957 int i;
958 unsigned long pfn;
959 pgprot_t prot;
960 struct page *folio_end;
961
962 if (!folio_test_large(folio))
963 return 1;
964
965 folio_end = &folio->page + folio_nr_pages(folio);
966 end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
967 floops = (end - addr) >> PAGE_SHIFT;
968 pfn = page_to_pfn(page);
> 969 prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
970
971 *any_dirty = pte_dirty(ptent);
972
973 pfn++;
974 pte++;
975
976 for (i = 1; i < floops; i++) {
977 ptent = ptep_get(pte);
978 ptent = pte_mkold(pte_mkclean(ptent));
979
980 if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
981 pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
982 break;
983
984 if (pte_dirty(ptent))
985 *any_dirty = true;
986
987 pfn++;
988 pte++;
989 }
990
991 return i;
992 }
993

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-11-15 21:37:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <[email protected]> wrote:

> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork

Do you have a feeling for how much performance improved due to this?

Are there other architectures which might similarly benefit? By
implementing ptep_set_wrprotects(), it appears. If so, what sort of
gains might they see?

2023-11-15 22:41:40

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc1 next-20231115]
[cannot apply to arm64/for-next/core efi/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
config: alpha-defconfig (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/memory.c: In function 'folio_nr_pages_cont_mapped':
mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
| ^~~~~~~~~~
| ptep_get
>> mm/memory.c:969:16: error: incompatible types when assigning to type 'pgprot_t' from type 'int'
In file included from include/linux/shm.h:6,
from include/linux/sched.h:16,
from include/linux/hardirq.h:9,
from include/linux/interrupt.h:11,
from include/linux/kernel_stat.h:9,
from mm/memory.c:43:
>> arch/alpha/include/asm/page.h:38:29: error: request for member 'pgprot' in something not a structure or union
38 | #define pgprot_val(x) ((x).pgprot)
| ^
mm/memory.c:981:21: note: in expansion of macro 'pgprot_val'
981 | pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
| ^~~~~~~~~~
cc1: some warnings being treated as errors


vim +969 mm/memory.c

950
951 static int folio_nr_pages_cont_mapped(struct folio *folio,
952 struct page *page, pte_t *pte,
953 unsigned long addr, unsigned long end,
954 pte_t ptent, bool *any_dirty)
955 {
956 int floops;
957 int i;
958 unsigned long pfn;
959 pgprot_t prot;
960 struct page *folio_end;
961
962 if (!folio_test_large(folio))
963 return 1;
964
965 folio_end = &folio->page + folio_nr_pages(folio);
966 end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
967 floops = (end - addr) >> PAGE_SHIFT;
968 pfn = page_to_pfn(page);
> 969 prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
970
971 *any_dirty = pte_dirty(ptent);
972
973 pfn++;
974 pte++;
975
976 for (i = 1; i < floops; i++) {
977 ptent = ptep_get(pte);
978 ptent = pte_mkold(pte_mkclean(ptent));
979
980 if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
981 pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
982 break;
983
984 if (pte_dirty(ptent))
985 *any_dirty = true;
986
987 pfn++;
988 pte++;
989 }
990
991 return i;
992 }
993

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-11-16 09:36:09

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 15/11/2023 21:37, Andrew Morton wrote:
> On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <[email protected]> wrote:
>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork
>
> Do you have a feeling for how much performance improved due to this?

The commit log for patch 13 (the one which implements ptep_set_wrprotects() for
arm64) has performance numbers for a fork() microbenchmark with/without the
optimization:

---8<---

I saw a huge performance regression when PTE_CONT support was added; the
regression is mostly fixed with the addition of this change. The
following shows the regression relative to before PTE_CONT was enabled
(a bigger negative value is a bigger regression):

| cpus | before opt | after opt |
|-------:|-------------:|------------:|
| 1 | -10.4% | -5.2% |
| 8 | -15.4% | -3.5% |
| 16 | -38.7% | -3.7% |
| 24 | -57.0% | -4.4% |
| 32 | -65.8% | -5.4% |

---8<---

Note that's running on Ampere Altra, where TLBI tends to have high cost.

>
> Are there other architectures which might similarly benefit? By
> implementing ptep_set_wrprotects(), it appears. If so, what sort of
> gains might they see?

The rationale for this is to reduce the expense for arm64 of managing
contpte mappings. If other architectures support contpte-style mappings then they
could benefit from this API for the same reasons that arm64 benefits. I have a
vague understanding that riscv has a similar concept to arm64's contiguous
bit, so perhaps they are a future candidate. But I'm not familiar with the
details of the riscv feature so couldn't say whether they would be likely to see
the same level of perf improvement as arm64.

Thanks,
Ryan


2023-11-16 10:03:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 15.11.23 17:30, Ryan Roberts wrote:
> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
>
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 +++
> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
> 2 files changed, 150 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
> }
> #endif
>
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> /* Uffd-wp needs to be delivered to dest pte as well */
> pte = pte_mkuffd_wp(pte);
> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + return 1;
> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> + struct page *anchor, unsigned long anchor_vaddr)
> +{
> + unsigned long offset;
> + unsigned long vaddr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> + vaddr = anchor_vaddr + offset;
> +
> + if (anchor > page) {
> + if (vaddr > anchor_vaddr)
> + return 0;
> + } else {
> + if (vaddr < anchor_vaddr)
> + return ULONG_MAX;
> + }
> +
> + return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + pte_t ptent, bool *any_dirty)
> +{
> + int floops;
> + int i;
> + unsigned long pfn;
> + pgprot_t prot;
> + struct page *folio_end;
> +
> + if (!folio_test_large(folio))
> + return 1;
> +
> + folio_end = &folio->page + folio_nr_pages(folio);
> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> + *any_dirty = pte_dirty(ptent);
> +
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> + ptent = pte_mkold(pte_mkclean(ptent));
> +
> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> + break;
> +
> + if (pte_dirty(ptent))
> + *any_dirty = true;
> +
> + pfn++;
> + pte++;
> + }
> +
> + return i;
> }
>
> /*
> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
> */
> static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);
> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }
> }
> - rss[MM_ANONPAGES]++;
> - } else if (page) {
> - folio_get(folio);
> - page_dup_file_rmap(page, false);
> - rss[mm_counter_file(page)]++;
> +
> + nr = i;
> + folio_ref_add(folio, nr);

You're changing the order of mapcount vs. refcount increment. Don't.
Make sure your refcount >= mapcount.

You can do that easily by doing the folio_ref_add(folio, nr) first and
then decrementing in case of error accordingly. Errors due to pinned
pages are the corner case.
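
For the anon path that would look roughly like this (sketch only; the existing
i == 0 fallback to copy_present_page() and the file-backed path are elided):

	folio_ref_add(folio, nr);	/* refcount first, so refcount >= mapcount */

	for (i = 0; i < nr; i++, page++) {
		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
			/* Maybe pinned: return the references we won't use. */
			folio_ref_sub(folio, nr - i);
			break;
		}
		rss[MM_ANONPAGES]++;
		VM_BUG_ON(PageAnonExclusive(page));
	}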

I'll note that it will make a lot of sense to have batch variants of
page_try_dup_anon_rmap() and page_dup_file_rmap().

Especially, the batch variant of page_try_dup_anon_rmap() would only
check once if the folio maybe pinned, and in that case, you can simply
drop all references again. So you either have all or no ptes to process,
which makes that code easier.

But that can be added on top, and I'll happily do that.

--
Cheers,

David / dhildenb

2023-11-16 10:08:18

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Hi All,

Hoping for some guidance below!


On 15/11/2023 21:26, kernel test robot wrote:
> Hi Ryan,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
> [cannot apply to arm64/for-next/core efi/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
> config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
> 969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> | ^~~~~~~~~~
> | ptep_get
> cc1: some warnings being treated as errors

It turns out that pte_pgprot() is not universal; it's only implemented by
architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
loongarch, mips, powerpc, s390, sh, x86).

I'm using it in core-mm to help calculate the number of "contiguously mapped"
pages within a folio (note that's not the same as arm64's notion of
contpte-mapped; I just want to know that there are N physically contiguous pages
mapped virtually contiguously with the same permissions). And I'm using
pte_pgprot() to extract the permissions for each pte to compare. It's important
that we compare the permissions, because the fact that pages belong to the
same folio doesn't imply they are mapped with the same permissions; think
mprotect()ing a sub-range.

I don't have a great idea for how to fix this - does anyone have any thoughts?

Some ideas:

- Implement folio_nr_pages_cont_mapped() conditionally on
CONFIG_HAVE_IOREMAP_PROT being set; otherwise it just returns 1 and for those
arches we always get the old, non-batching behavior (see the sketch after this
list). There is some precedent; mm/memory.c is already using pte_pgprot()
behind this ifdef.

- Implement a generic helper the same way arm64 does it. This will return all
the pte bits that are not part of the PFN. But I'm not sure this is definitely a
valid thing to do for all architectures:

static inline pgprot_t pte_pgprot(pte_t pte)
{
unsigned long pfn = pte_pfn(pte);

return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
}

- Explicitly implement pte_pgprot() for all arches that don't currently have it
(sigh).
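
For illustration, the fallback half of the first option could be as simple as
the sketch below (the batched version from the patch would sit under the
matching #ifdef):

#ifndef CONFIG_HAVE_IOREMAP_PROT
static int folio_nr_pages_cont_mapped(struct folio *folio,
				       struct page *page, pte_t *pte,
				       unsigned long addr, unsigned long end,
				       pte_t ptent, bool *any_dirty)
{
	/* Without pte_pgprot() we can't compare prots; copy one pte at a time. */
	*any_dirty = pte_dirty(ptent);
	return 1;
}
#endif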

Thanks,
Ryan


>
>
> vim +969 mm/memory.c
>
> 950
> 951 static int folio_nr_pages_cont_mapped(struct folio *folio,
> 952 struct page *page, pte_t *pte,
> 953 unsigned long addr, unsigned long end,
> 954 pte_t ptent, bool *any_dirty)
> 955 {
> 956 int floops;
> 957 int i;
> 958 unsigned long pfn;
> 959 pgprot_t prot;
> 960 struct page *folio_end;
> 961
> 962 if (!folio_test_large(folio))
> 963 return 1;
> 964
> 965 folio_end = &folio->page + folio_nr_pages(folio);
> 966 end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> 967 floops = (end - addr) >> PAGE_SHIFT;
> 968 pfn = page_to_pfn(page);
> > 969 prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> 970
> 971 *any_dirty = pte_dirty(ptent);
> 972
> 973 pfn++;
> 974 pte++;
> 975
> 976 for (i = 1; i < floops; i++) {
> 977 ptent = ptep_get(pte);
> 978 ptent = pte_mkold(pte_mkclean(ptent));
> 979
> 980 if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> 981 pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> 982 break;
> 983
> 984 if (pte_dirty(ptent))
> 985 *any_dirty = true;
> 986
> 987 pfn++;
> 988 pte++;
> 989 }
> 990
> 991 return i;
> 992 }
> 993
>

2023-11-16 10:13:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 11:07, Ryan Roberts wrote:
> Hi All,
>
> Hoping for some guidance below!
>
>
> On 15/11/2023 21:26, kernel test robot wrote:
>> Hi Ryan,
>>
>> kernel test robot noticed the following build errors:
>>
>> [auto build test ERROR on akpm-mm/mm-everything]
>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>> [cannot apply to arm64/for-next/core efi/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use '--base' as documented in
>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>
>> url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link: https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>> config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <[email protected]>
>> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>
>> All errors (new ones prefixed by >>):
>>
>> mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>> 969 | prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> | ^~~~~~~~~~
>> | ptep_get
>> cc1: some warnings being treated as errors
>
> It turns out that pte_pgprot() is not universal; its only implemented by
> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
> loongarch, mips, powerpc, s390, sh, x86).
>
> I'm using it in core-mm to help calculate the number of "contiguously mapped"
> pages within a folio (note that's not the same as arm64's notion of
> contpte-mapped. I just want to know that there are N physically contiguous pages
> mapped virtually contiguously with the same permissions). And I'm using
> pte_pgprot() to extract the permissions for each pte to compare. It's important
> that we compare the permissions because just because the pages belongs to the
> same folio doesn't imply they are mapped with the same permissions; think
> mprotect()ing a sub-range.
>
> I don't have a great idea for how to fix this - does anyone have any thoughts?

KIS :) fork() operates on individual VMAs if I am not daydreaming.

Just check for the obvious pte_write()/dirty bits and you'll be fine.

If your code tries to optimize "between VMAs", you really shouldn't be
doing that at this point.

If someone did an mprotect(), there are separate VMAs, and you shouldn't
be looking at the PTEs belonging to a different VMA.

--
Cheers,

David / dhildenb

2023-11-16 10:27:03

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 10:03, David Hildenbrand wrote:
> On 15.11.23 17:30, Ryan Roberts wrote:
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   include/linux/pgtable.h |  13 +++
>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>> *mm, unsigned long addres
>>   }
>>   #endif
>>   +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> +                unsigned long address, pte_t *ptep,
>> +                unsigned int nr)
>> +{
>> +    unsigned int i;
>> +
>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +        ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>> struct vm_area_struct *src_vma
>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>           pte = pte_mkuffd_wp(pte);
>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -    return 0;
>> +    return 1;
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +                struct page *anchor, unsigned long anchor_vaddr)
>> +{
>> +    unsigned long offset;
>> +    unsigned long vaddr;
>> +
>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> +    vaddr = anchor_vaddr + offset;
>> +
>> +    if (anchor > page) {
>> +        if (vaddr > anchor_vaddr)
>> +            return 0;
>> +    } else {
>> +        if (vaddr < anchor_vaddr)
>> +            return ULONG_MAX;
>> +    }
>> +
>> +    return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +                      struct page *page, pte_t *pte,
>> +                      unsigned long addr, unsigned long end,
>> +                      pte_t ptent, bool *any_dirty)
>> +{
>> +    int floops;
>> +    int i;
>> +    unsigned long pfn;
>> +    pgprot_t prot;
>> +    struct page *folio_end;
>> +
>> +    if (!folio_test_large(folio))
>> +        return 1;
>> +
>> +    folio_end = &folio->page + folio_nr_pages(folio);
>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +    floops = (end - addr) >> PAGE_SHIFT;
>> +    pfn = page_to_pfn(page);
>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> +    *any_dirty = pte_dirty(ptent);
>> +
>> +    pfn++;
>> +    pte++;
>> +
>> +    for (i = 1; i < floops; i++) {
>> +        ptent = ptep_get(pte);
>> +        ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> +            break;
>> +
>> +        if (pte_dirty(ptent))
>> +            *any_dirty = true;
>> +
>> +        pfn++;
>> +        pte++;
>> +    }
>> +
>> +    return i;
>>   }
>>     /*
>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>>    */
>>   static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> -         struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>> *src_vma,
>> +          pte_t *dst_pte, pte_t *src_pte,
>> +          unsigned long addr, unsigned long end,
>> +          int *rss, struct folio **prealloc)
>>   {
>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>       unsigned long vm_flags = src_vma->vm_flags;
>>       pte_t pte = ptep_get(src_pte);
>>       struct page *page;
>>       struct folio *folio;
>> +    int nr = 1;
>> +    bool anon;
>> +    bool any_dirty = pte_dirty(pte);
>> +    int i;
>>         page = vm_normal_page(src_vma, addr, pte);
>> -    if (page)
>> +    if (page) {
>>           folio = page_folio(page);
>> -    if (page && folio_test_anon(folio)) {
>> -        /*
>> -         * If this page may have been pinned by the parent process,
>> -         * copy the page immediately for the child so that we'll always
>> -         * guarantee the pinned page won't be randomly replaced in the
>> -         * future.
>> -         */
>> -        folio_get(folio);
>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -            /* Page may be pinned, we have to copy. */
>> -            folio_put(folio);
>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -                         addr, rss, prealloc, page);
>> +        anon = folio_test_anon(folio);
>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +                        end, pte, &any_dirty);
>> +
>> +        for (i = 0; i < nr; i++, page++) {
>> +            if (anon) {
>> +                /*
>> +                 * If this page may have been pinned by the
>> +                 * parent process, copy the page immediately for
>> +                 * the child so that we'll always guarantee the
>> +                 * pinned page won't be randomly replaced in the
>> +                 * future.
>> +                 */
>> +                if (unlikely(page_try_dup_anon_rmap(
>> +                        page, false, src_vma))) {
>> +                    if (i != 0)
>> +                        break;
>> +                    /* Page may be pinned, we have to copy. */
>> +                    return copy_present_page(
>> +                        dst_vma, src_vma, dst_pte,
>> +                        src_pte, addr, rss, prealloc,
>> +                        page);
>> +                }
>> +                rss[MM_ANONPAGES]++;
>> +                VM_BUG_ON(PageAnonExclusive(page));
>> +            } else {
>> +                page_dup_file_rmap(page, false);
>> +                rss[mm_counter_file(page)]++;
>> +            }
>>           }
>> -        rss[MM_ANONPAGES]++;
>> -    } else if (page) {
>> -        folio_get(folio);
>> -        page_dup_file_rmap(page, false);
>> -        rss[mm_counter_file(page)]++;
>> +
>> +        nr = i;
>> +        folio_ref_add(folio, nr);
>
> You're changing the order of mapcount vs. refcount increment. Don't. Make sure
> your refcount >= mapcount.

Ouch - good spot.

>
> You can do that easily by doing the folio_ref_add(folio, nr) first and then
> decrementing in case of error accordingly. Errors due to pinned pages are the
> corner case.

Yep, I propose this for v3:

diff --git a/mm/memory.c b/mm/memory.c
index b7c8228883cf..98373349806e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1014,6 +1014,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
         anon = folio_test_anon(folio);
         nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
                         end, pte, &any_dirty);
+        folio_ref_add(folio, nr);

         for (i = 0; i < nr; i++, page++) {
             if (anon) {
@@ -1029,6 +1030,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
                     if (i != 0)
                         break;
                     /* Page may be pinned, we have to copy. */
+                    folio_ref_sub(folio, nr);
                     return copy_present_page(
                         dst_vma, src_vma, dst_pte,
                         src_pte, addr, rss, prealloc,
@@ -1042,8 +1044,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
             }
         }

-        nr = i;
-        folio_ref_add(folio, nr);
+        if (i < nr) {
+            folio_ref_sub(folio, nr - i);
+            nr = i;
+        }
     }

>
> I'll note that it will make a lot of sense to have batch variants of
> page_try_dup_anon_rmap() and page_dup_file_rmap().
>
> Especially, the batch variant of page_try_dup_anon_rmap() would only check once
> if the folio maybe pinned, and in that case, you can simply drop all references
> again. So you either have all or no ptes to process, which makes that code easier.
>
> But that can be added on top, and I'll happily do that.

That's very kind - thanks for the offer! I'll leave it to you then.

2023-11-16 10:37:44

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 10:12, David Hildenbrand wrote:
> On 16.11.23 11:07, Ryan Roberts wrote:
>> Hi All,
>>
>> Hoping for some guidance below!
>>
>>
>> On 15/11/2023 21:26, kernel test robot wrote:
>>> Hi Ryan,
>>>
>>> kernel test robot noticed the following build errors:
>>>
>>> [auto build test ERROR on akpm-mm/mm-everything]
>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>> [cannot apply to arm64/for-next/core efi/next]
>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>> And when submitting patch, we suggest to use '--base' as documented in
>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>
>>> url:   
>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>> mm-everything
>>> patch link:   
>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>> config: arm-randconfig-002-20231116
>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>> reproduce (this is a W=1 build):
>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <[email protected]>
>>> | Closes:
>>> https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>>
>>> All errors (new ones prefixed by >>):
>>>
>>>     mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>       969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>           |                ^~~~~~~~~~
>>>           |                ptep_get
>>>     cc1: some warnings being treated as errors
>>
>> It turns out that pte_pgprot() is not universal; its only implemented by
>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>> loongarch, mips, powerpc, s390, sh, x86).
>>
>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>> pages within a folio (note that's not the same as arm64's notion of
>> contpte-mapped. I just want to know that there are N physically contiguous pages
>> mapped virtually contiguously with the same permissions). And I'm using
>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>> that we compare the permissions because just because the pages belongs to the
>> same folio doesn't imply they are mapped with the same permissions; think
>> mprotect()ing a sub-range.
>>
>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>
> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>
> Just check for the obvious pte_write()/dirty/ and you'll be fine.

Yes, that seems much simpler! I think we might have to be careful about the
uffd-wp bit too? I think that's it - are there any other exotic bits that might
need to be considered?

>
> If your code tries to optimize "between VMAs", you really shouldn't be doing
> that at this point.

No, I'm not doing that; it's one VMA at a time.

>
> If someone did an mprotect(), there are separate VMAs, and you shouldn't be
> looking at the PTEs belonging to a different VMA.
>

Yep understood, thanks.

2023-11-16 11:02:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 11:36, Ryan Roberts wrote:
> On 16/11/2023 10:12, David Hildenbrand wrote:
>> On 16.11.23 11:07, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Hoping for some guidance below!
>>>
>>>
>>> On 15/11/2023 21:26, kernel test robot wrote:
>>>> Hi Ryan,
>>>>
>>>> kernel test robot noticed the following build errors:
>>>>
>>>> [auto build test ERROR on akpm-mm/mm-everything]
>>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>>> [cannot apply to arm64/for-next/core efi/next]
>>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>>> And when submitting patch, we suggest to use '--base' as documented in
>>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>>
>>>> url:
>>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>>> mm-everything
>>>> patch link:
>>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>>> config: arm-randconfig-002-20231116
>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>>> reproduce (this is a W=1 build):
>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>>>
>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>> the same patch/commit), kindly add following tags
>>>> | Reported-by: kernel test robot <[email protected]>
>>>> | Closes:
>>>> https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>>>
>>>> All errors (new ones prefixed by >>):
>>>>
>>>>     mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>>       969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>           |                ^~~~~~~~~~
>>>>           |                ptep_get
>>>>     cc1: some warnings being treated as errors
>>>
>>> It turns out that pte_pgprot() is not universal; its only implemented by
>>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>>> loongarch, mips, powerpc, s390, sh, x86).
>>>
>>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>>> pages within a folio (note that's not the same as arm64's notion of
>>> contpte-mapped. I just want to know that there are N physically contiguous pages
>>> mapped virtually contiguously with the same permissions). And I'm using
>>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>>> that we compare the permissions because just because the pages belongs to the
>>> same folio doesn't imply they are mapped with the same permissions; think
>>> mprotect()ing a sub-range.
>>>
>>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>>
>> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>>
>> Just check for the obvious pte_write()/dirty/ and you'll be fine.
>
> Yes, that seems much simpler! I think we might have to be careful about the uffd
> wp bit too? I think that's it - are there any other exotic bits that might need
> to be considered?

Good question. Mimicking what the current code already does should be
sufficient. uffd-wp should have the PTE R/O. You can set the contpte bit
independently of any SW bit (uffd-wp, softdirty, ...) I guess, so no need to
worry about that.

--
Cheers,

David / dhildenb

2023-11-16 11:04:01

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 15.11.23 17:30, Ryan Roberts wrote:
> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
>
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 +++
> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
> 2 files changed, 150 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
> }
> #endif
>
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> /* Uffd-wp needs to be delivered to dest pte as well */
> pte = pte_mkuffd_wp(pte);
> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + return 1;
> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> + struct page *anchor, unsigned long anchor_vaddr)
> +{
> + unsigned long offset;
> + unsigned long vaddr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> + vaddr = anchor_vaddr + offset;
> +
> + if (anchor > page) {
> + if (vaddr > anchor_vaddr)
> + return 0;
> + } else {
> + if (vaddr < anchor_vaddr)
> + return ULONG_MAX;
> + }
> +
> + return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + pte_t ptent, bool *any_dirty)
> +{
> + int floops;
> + int i;
> + unsigned long pfn;
> + pgprot_t prot;
> + struct page *folio_end;
> +
> + if (!folio_test_large(folio))
> + return 1;
> +
> + folio_end = &folio->page + folio_nr_pages(folio);
> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> + *any_dirty = pte_dirty(ptent);
> +
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> + ptent = pte_mkold(pte_mkclean(ptent));
> +
> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> + break;
> +
> + if (pte_dirty(ptent))
> + *any_dirty = true;
> +
> + pfn++;
> + pte++;
> + }
> +
> + return i;
> }
>
> /*
> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
> */
> static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);
> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }
> }
> - rss[MM_ANONPAGES]++;
> - } else if (page) {
> - folio_get(folio);
> - page_dup_file_rmap(page, false);
> - rss[mm_counter_file(page)]++;
> +
> + nr = i;
> + folio_ref_add(folio, nr);
> }
>
> /*
> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> * in the parent and the child
> */
> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
> pte = pte_wrprotect(pte);

You likely want an "any_pte_writable" check here instead, no?

Any operations that target a single individual PTE while multiple PTEs
are adjusted are suspicious :)

--
Cheers,

David / dhildenb

2023-11-16 11:13:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 11:01, David Hildenbrand wrote:
> On 16.11.23 11:36, Ryan Roberts wrote:
>> On 16/11/2023 10:12, David Hildenbrand wrote:
>>> On 16.11.23 11:07, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> Hoping for some guidance below!
>>>>
>>>>
>>>> On 15/11/2023 21:26, kernel test robot wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> kernel test robot noticed the following build errors:
>>>>>
>>>>> [auto build test ERROR on akpm-mm/mm-everything]
>>>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>>>> [cannot apply to arm64/for-next/core efi/next]
>>>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>>>> And when submitting patch, we suggest to use '--base' as documented in
>>>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>>>
>>>>> url:
>>>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>>>> mm-everything
>>>>> patch link:
>>>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>>>> config: arm-randconfig-002-20231116
>>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
>>>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>>>> reproduce (this is a W=1 build):
>>>>> (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new
>>>>> version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <[email protected]>
>>>>> | Closes:
>>>>> https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>>>>
>>>>> All errors (new ones prefixed by >>):
>>>>>
>>>>>      mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>>>        969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>            |                ^~~~~~~~~~
>>>>>            |                ptep_get
>>>>>      cc1: some warnings being treated as errors
>>>>
>>>> It turns out that pte_pgprot() is not universal; its only implemented by
>>>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>>>> loongarch, mips, powerpc, s390, sh, x86).
>>>>
>>>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>>>> pages within a folio (note that's not the same as arm64's notion of
>>>> contpte-mapped. I just want to know that there are N physically contiguous
>>>> pages
>>>> mapped virtually contiguously with the same permissions). And I'm using
>>>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>>>> that we compare the permissions because just because the pages belongs to the
>>>> same folio doesn't imply they are mapped with the same permissions; think
>>>> mprotect()ing a sub-range.
>>>>
>>>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>>>
>>> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>>>
>>> Just check for the obvious pte_write()/dirty/ and you'll be fine.
>>
>> Yes, that seems much simpler! I think we might have to be careful about the uffd
>> wp bit too? I think that's it - are there any other exotic bits that might need
>> to be considered?
>
> Good question. Mimicing what the current code already does should be sufficient.
> uffd-wp should have the PTE R/O. You can set the contpte bit independent of any
> SW bit (uffd-wp, softdirty, ...) I guess, no need to worry about that.
>

OK, thanks. I'll rework to use this approach in v3.

2023-11-16 11:21:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 11:03, David Hildenbrand wrote:
> On 15.11.23 17:30, Ryan Roberts wrote:
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   include/linux/pgtable.h |  13 +++
>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>> *mm, unsigned long addres
>>   }
>>   #endif
>>   +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> +                unsigned long address, pte_t *ptep,
>> +                unsigned int nr)
>> +{
>> +    unsigned int i;
>> +
>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +        ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>> struct vm_area_struct *src_vma
>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>           pte = pte_mkuffd_wp(pte);
>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -    return 0;
>> +    return 1;
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +                struct page *anchor, unsigned long anchor_vaddr)
>> +{
>> +    unsigned long offset;
>> +    unsigned long vaddr;
>> +
>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> +    vaddr = anchor_vaddr + offset;
>> +
>> +    if (anchor > page) {
>> +        if (vaddr > anchor_vaddr)
>> +            return 0;
>> +    } else {
>> +        if (vaddr < anchor_vaddr)
>> +            return ULONG_MAX;
>> +    }
>> +
>> +    return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +                      struct page *page, pte_t *pte,
>> +                      unsigned long addr, unsigned long end,
>> +                      pte_t ptent, bool *any_dirty)
>> +{
>> +    int floops;
>> +    int i;
>> +    unsigned long pfn;
>> +    pgprot_t prot;
>> +    struct page *folio_end;
>> +
>> +    if (!folio_test_large(folio))
>> +        return 1;
>> +
>> +    folio_end = &folio->page + folio_nr_pages(folio);
>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +    floops = (end - addr) >> PAGE_SHIFT;
>> +    pfn = page_to_pfn(page);
>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> +    *any_dirty = pte_dirty(ptent);
>> +
>> +    pfn++;
>> +    pte++;
>> +
>> +    for (i = 1; i < floops; i++) {
>> +        ptent = ptep_get(pte);
>> +        ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> +            break;
>> +
>> +        if (pte_dirty(ptent))
>> +            *any_dirty = true;
>> +
>> +        pfn++;
>> +        pte++;
>> +    }
>> +
>> +    return i;
>>   }
>>     /*
>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>>    */
>>   static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> -         struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>> *src_vma,
>> +          pte_t *dst_pte, pte_t *src_pte,
>> +          unsigned long addr, unsigned long end,
>> +          int *rss, struct folio **prealloc)
>>   {
>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>       unsigned long vm_flags = src_vma->vm_flags;
>>       pte_t pte = ptep_get(src_pte);
>>       struct page *page;
>>       struct folio *folio;
>> +    int nr = 1;
>> +    bool anon;
>> +    bool any_dirty = pte_dirty(pte);
>> +    int i;
>>         page = vm_normal_page(src_vma, addr, pte);
>> -    if (page)
>> +    if (page) {
>>           folio = page_folio(page);
>> -    if (page && folio_test_anon(folio)) {
>> -        /*
>> -         * If this page may have been pinned by the parent process,
>> -         * copy the page immediately for the child so that we'll always
>> -         * guarantee the pinned page won't be randomly replaced in the
>> -         * future.
>> -         */
>> -        folio_get(folio);
>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -            /* Page may be pinned, we have to copy. */
>> -            folio_put(folio);
>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -                         addr, rss, prealloc, page);
>> +        anon = folio_test_anon(folio);
>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +                        end, pte, &any_dirty);
>> +
>> +        for (i = 0; i < nr; i++, page++) {
>> +            if (anon) {
>> +                /*
>> +                 * If this page may have been pinned by the
>> +                 * parent process, copy the page immediately for
>> +                 * the child so that we'll always guarantee the
>> +                 * pinned page won't be randomly replaced in the
>> +                 * future.
>> +                 */
>> +                if (unlikely(page_try_dup_anon_rmap(
>> +                        page, false, src_vma))) {
>> +                    if (i != 0)
>> +                        break;
>> +                    /* Page may be pinned, we have to copy. */
>> +                    return copy_present_page(
>> +                        dst_vma, src_vma, dst_pte,
>> +                        src_pte, addr, rss, prealloc,
>> +                        page);
>> +                }
>> +                rss[MM_ANONPAGES]++;
>> +                VM_BUG_ON(PageAnonExclusive(page));
>> +            } else {
>> +                page_dup_file_rmap(page, false);
>> +                rss[mm_counter_file(page)]++;
>> +            }
>>           }
>> -        rss[MM_ANONPAGES]++;
>> -    } else if (page) {
>> -        folio_get(folio);
>> -        page_dup_file_rmap(page, false);
>> -        rss[mm_counter_file(page)]++;
>> +
>> +        nr = i;
>> +        folio_ref_add(folio, nr);
>>       }
>>         /*
>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>> vm_area_struct *src_vma,
>>        * in the parent and the child
>>        */
>>       if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>           pte = pte_wrprotect(pte);
>
> You likely want an "any_pte_writable" check here instead, no?
>
> Any operations that target a single indiividual PTE while multiple PTEs are
> adjusted are suspicious :)

The idea is that I've already constrained the batch of pages such that the
permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
pte is writable, then they all are - something has gone badly wrong if some are
writable and others are not.

The dirty bit has an any_dirty special case because we (deliberately) don't
consider access/dirty when determining the batch. Given the batch is all covered
by the same folio, and the kernel maintains the access/dirty info per-folio, we
don't want to unnecessarily reduce the batch size just because one of the pages
in the folio has been written to.

>

2023-11-16 13:21:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 12:20, Ryan Roberts wrote:
> On 16/11/2023 11:03, David Hildenbrand wrote:
>> On 15.11.23 17:30, Ryan Roberts wrote:
>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>> maps a physically contiguous block of memory, all belonging to the same
>>> folio, with the same permissions, and for shared mappings, the same
>>> dirty state. This will likely improve performance by a tiny amount due
>>> to batching the folio reference count management and calling set_ptes()
>>> rather than making individual calls to set_pte_at().
>>>
>>> However, the primary motivation for this change is to reduce the number
>>> of tlb maintenance operations that the arm64 backend has to perform
>>> during fork, as it is about to add transparent support for the
>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>> expensive, when all ptes in the range are being write-protected.
>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>> in the child, the backend does not need to fold a contiguous range once
>>> they are all populated - they can be initially populated as a contiguous
>>> range in the first place.
>>>
>>> This change addresses the core-mm refactoring only, and introduces
>>> ptep_set_wrprotects() with a default implementation that calls
>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>> performance improvement as part of the work to enable contpte mappings.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   include/linux/pgtable.h |  13 +++
>>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>> *mm, unsigned long addres
>>>   }
>>>   #endif
>>>   +#ifndef ptep_set_wrprotects
>>> +struct mm_struct;
>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>> +                unsigned long address, pte_t *ptep,
>>> +                unsigned int nr)
>>> +{
>>> +    unsigned int i;
>>> +
>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>> +        ptep_set_wrprotect(mm, address, ptep);
>>> +}
>>> +#endif
>>> +
>>>   /*
>>>    * On some architectures hardware does not set page access bit when accessing
>>>    * memory page, it is responsibility of software setting this bit. It brings
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1f18ed4a5497..b7c8228883cf 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>> struct vm_area_struct *src_vma
>>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>>           pte = pte_mkuffd_wp(pte);
>>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> -    return 0;
>>> +    return 1;
>>> +}
>>> +
>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>> +{
>>> +    unsigned long offset;
>>> +    unsigned long vaddr;
>>> +
>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>> +    vaddr = anchor_vaddr + offset;
>>> +
>>> +    if (anchor > page) {
>>> +        if (vaddr > anchor_vaddr)
>>> +            return 0;
>>> +    } else {
>>> +        if (vaddr < anchor_vaddr)
>>> +            return ULONG_MAX;
>>> +    }
>>> +
>>> +    return vaddr;
>>> +}
>>> +
>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>> +                      struct page *page, pte_t *pte,
>>> +                      unsigned long addr, unsigned long end,
>>> +                      pte_t ptent, bool *any_dirty)
>>> +{
>>> +    int floops;
>>> +    int i;
>>> +    unsigned long pfn;
>>> +    pgprot_t prot;
>>> +    struct page *folio_end;
>>> +
>>> +    if (!folio_test_large(folio))
>>> +        return 1;
>>> +
>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>> +    pfn = page_to_pfn(page);
>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>> +
>>> +    *any_dirty = pte_dirty(ptent);
>>> +
>>> +    pfn++;
>>> +    pte++;
>>> +
>>> +    for (i = 1; i < floops; i++) {
>>> +        ptent = ptep_get(pte);
>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>> +
>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>> +            break;
>>> +
>>> +        if (pte_dirty(ptent))
>>> +            *any_dirty = true;
>>> +
>>> +        pfn++;
>>> +        pte++;
>>> +    }
>>> +
>>> +    return i;
>>>   }
>>>     /*
>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>> - * is required to copy this pte.
>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>> + * first pte.
>>>    */
>>>   static inline int
>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> -         struct folio **prealloc)
>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>> *src_vma,
>>> +          pte_t *dst_pte, pte_t *src_pte,
>>> +          unsigned long addr, unsigned long end,
>>> +          int *rss, struct folio **prealloc)
>>>   {
>>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>>       unsigned long vm_flags = src_vma->vm_flags;
>>>       pte_t pte = ptep_get(src_pte);
>>>       struct page *page;
>>>       struct folio *folio;
>>> +    int nr = 1;
>>> +    bool anon;
>>> +    bool any_dirty = pte_dirty(pte);
>>> +    int i;
>>>         page = vm_normal_page(src_vma, addr, pte);
>>> -    if (page)
>>> +    if (page) {
>>>           folio = page_folio(page);
>>> -    if (page && folio_test_anon(folio)) {
>>> -        /*
>>> -         * If this page may have been pinned by the parent process,
>>> -         * copy the page immediately for the child so that we'll always
>>> -         * guarantee the pinned page won't be randomly replaced in the
>>> -         * future.
>>> -         */
>>> -        folio_get(folio);
>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>> -            /* Page may be pinned, we have to copy. */
>>> -            folio_put(folio);
>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>> -                         addr, rss, prealloc, page);
>>> +        anon = folio_test_anon(folio);
>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>> +                        end, pte, &any_dirty);
>>> +
>>> +        for (i = 0; i < nr; i++, page++) {
>>> +            if (anon) {
>>> +                /*
>>> +                 * If this page may have been pinned by the
>>> +                 * parent process, copy the page immediately for
>>> +                 * the child so that we'll always guarantee the
>>> +                 * pinned page won't be randomly replaced in the
>>> +                 * future.
>>> +                 */
>>> +                if (unlikely(page_try_dup_anon_rmap(
>>> +                        page, false, src_vma))) {
>>> +                    if (i != 0)
>>> +                        break;
>>> +                    /* Page may be pinned, we have to copy. */
>>> +                    return copy_present_page(
>>> +                        dst_vma, src_vma, dst_pte,
>>> +                        src_pte, addr, rss, prealloc,
>>> +                        page);
>>> +                }
>>> +                rss[MM_ANONPAGES]++;
>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>> +            } else {
>>> +                page_dup_file_rmap(page, false);
>>> +                rss[mm_counter_file(page)]++;
>>> +            }
>>>           }
>>> -        rss[MM_ANONPAGES]++;
>>> -    } else if (page) {
>>> -        folio_get(folio);
>>> -        page_dup_file_rmap(page, false);
>>> -        rss[mm_counter_file(page)]++;
>>> +
>>> +        nr = i;
>>> +        folio_ref_add(folio, nr);
>>>       }
>>>         /*
>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>> vm_area_struct *src_vma,
>>>        * in the parent and the child
>>>        */
>>>       if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>           pte = pte_wrprotect(pte);
>>
>> You likely want an "any_pte_writable" check here instead, no?
>>
>> Any operations that target a single indiividual PTE while multiple PTEs are
>> adjusted are suspicious :)
>
> The idea is that I've already constrained the batch of pages such that the
> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
> pte is writable, then they all are - something has gone badly wrong if some are
> writable and others are not.

I wonder if it would be cleaner and easier to not do that, though.

Simply record if any pte is writable. Afterwards they will *all* be R/O
and you can set the cont bit, correct?
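
Something like the following, purely as a sketch (any_writable here is
hypothetical, accumulated in folio_nr_pages_cont_mapped() the same way as
any_dirty, i.e. set whenever pte_write() is true for any pte in the batch):

        /* sketch only: wrprotect the whole batch if any pte was writable */
        if (is_cow_mapping(vm_flags) && any_writable) {
                ptep_set_wrprotects(src_mm, addr, src_pte, nr);
                pte = pte_wrprotect(pte);
        }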

--
Cheers,

David / dhildenb

2023-11-16 13:49:28

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 13:20, David Hildenbrand wrote:
> On 16.11.23 12:20, Ryan Roberts wrote:
>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>> maps a physically contiguous block of memory, all belonging to the same
>>>> folio, with the same permissions, and for shared mappings, the same
>>>> dirty state. This will likely improve performance by a tiny amount due
>>>> to batching the folio reference count management and calling set_ptes()
>>>> rather than making individual calls to set_pte_at().
>>>>
>>>> However, the primary motivation for this change is to reduce the number
>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>> during fork, as it is about to add transparent support for the
>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>> expensive, when all ptes in the range are being write-protected.
>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>> in the child, the backend does not need to fold a contiguous range once
>>>> they are all populated - they can be initially populated as a contiguous
>>>> range in the first place.
>>>>
>>>> This change addresses the core-mm refactoring only, and introduces
>>>> ptep_set_wrprotects() with a default implementation that calls
>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>> performance improvement as part of the work to enable contpte mappings.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>>    include/linux/pgtable.h |  13 +++
>>>>    mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>    2 files changed, 150 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>> *mm, unsigned long addres
>>>>    }
>>>>    #endif
>>>>    +#ifndef ptep_set_wrprotects
>>>> +struct mm_struct;
>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>> +                unsigned long address, pte_t *ptep,
>>>> +                unsigned int nr)
>>>> +{
>>>> +    unsigned int i;
>>>> +
>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>> +}
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * On some architectures hardware does not set page access bit when
>>>> accessing
>>>>     * memory page, it is responsibility of software setting this bit. It brings
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>> struct vm_area_struct *src_vma
>>>>            /* Uffd-wp needs to be delivered to dest pte as well */
>>>>            pte = pte_mkuffd_wp(pte);
>>>>        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>> -    return 0;
>>>> +    return 1;
>>>> +}
>>>> +
>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>> +{
>>>> +    unsigned long offset;
>>>> +    unsigned long vaddr;
>>>> +
>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>> +    vaddr = anchor_vaddr + offset;
>>>> +
>>>> +    if (anchor > page) {
>>>> +        if (vaddr > anchor_vaddr)
>>>> +            return 0;
>>>> +    } else {
>>>> +        if (vaddr < anchor_vaddr)
>>>> +            return ULONG_MAX;
>>>> +    }
>>>> +
>>>> +    return vaddr;
>>>> +}
>>>> +
>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>> +                      struct page *page, pte_t *pte,
>>>> +                      unsigned long addr, unsigned long end,
>>>> +                      pte_t ptent, bool *any_dirty)
>>>> +{
>>>> +    int floops;
>>>> +    int i;
>>>> +    unsigned long pfn;
>>>> +    pgprot_t prot;
>>>> +    struct page *folio_end;
>>>> +
>>>> +    if (!folio_test_large(folio))
>>>> +        return 1;
>>>> +
>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>> +    pfn = page_to_pfn(page);
>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>> +
>>>> +    *any_dirty = pte_dirty(ptent);
>>>> +
>>>> +    pfn++;
>>>> +    pte++;
>>>> +
>>>> +    for (i = 1; i < floops; i++) {
>>>> +        ptent = ptep_get(pte);
>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>> +
>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>> +            break;
>>>> +
>>>> +        if (pte_dirty(ptent))
>>>> +            *any_dirty = true;
>>>> +
>>>> +        pfn++;
>>>> +        pte++;
>>>> +    }
>>>> +
>>>> +    return i;
>>>>    }
>>>>      /*
>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>>> - * is required to copy this pte.
>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>>> + * first pte.
>>>>     */
>>>>    static inline int
>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>> *src_vma,
>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>> -         struct folio **prealloc)
>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>> *src_vma,
>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>> +          unsigned long addr, unsigned long end,
>>>> +          int *rss, struct folio **prealloc)
>>>>    {
>>>>        struct mm_struct *src_mm = src_vma->vm_mm;
>>>>        unsigned long vm_flags = src_vma->vm_flags;
>>>>        pte_t pte = ptep_get(src_pte);
>>>>        struct page *page;
>>>>        struct folio *folio;
>>>> +    int nr = 1;
>>>> +    bool anon;
>>>> +    bool any_dirty = pte_dirty(pte);
>>>> +    int i;
>>>>          page = vm_normal_page(src_vma, addr, pte);
>>>> -    if (page)
>>>> +    if (page) {
>>>>            folio = page_folio(page);
>>>> -    if (page && folio_test_anon(folio)) {
>>>> -        /*
>>>> -         * If this page may have been pinned by the parent process,
>>>> -         * copy the page immediately for the child so that we'll always
>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>> -         * future.
>>>> -         */
>>>> -        folio_get(folio);
>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>> -            /* Page may be pinned, we have to copy. */
>>>> -            folio_put(folio);
>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>> -                         addr, rss, prealloc, page);
>>>> +        anon = folio_test_anon(folio);
>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>> +                        end, pte, &any_dirty);
>>>> +
>>>> +        for (i = 0; i < nr; i++, page++) {
>>>> +            if (anon) {
>>>> +                /*
>>>> +                 * If this page may have been pinned by the
>>>> +                 * parent process, copy the page immediately for
>>>> +                 * the child so that we'll always guarantee the
>>>> +                 * pinned page won't be randomly replaced in the
>>>> +                 * future.
>>>> +                 */
>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>> +                        page, false, src_vma))) {
>>>> +                    if (i != 0)
>>>> +                        break;
>>>> +                    /* Page may be pinned, we have to copy. */
>>>> +                    return copy_present_page(
>>>> +                        dst_vma, src_vma, dst_pte,
>>>> +                        src_pte, addr, rss, prealloc,
>>>> +                        page);
>>>> +                }
>>>> +                rss[MM_ANONPAGES]++;
>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>> +            } else {
>>>> +                page_dup_file_rmap(page, false);
>>>> +                rss[mm_counter_file(page)]++;
>>>> +            }
>>>>            }
>>>> -        rss[MM_ANONPAGES]++;
>>>> -    } else if (page) {
>>>> -        folio_get(folio);
>>>> -        page_dup_file_rmap(page, false);
>>>> -        rss[mm_counter_file(page)]++;
>>>> +
>>>> +        nr = i;
>>>> +        folio_ref_add(folio, nr);
>>>>        }
>>>>          /*
>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>> vm_area_struct *src_vma,
>>>>         * in the parent and the child
>>>>         */
>>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>            pte = pte_wrprotect(pte);
>>>
>>> You likely want an "any_pte_writable" check here instead, no?
>>>
>>> Any operations that target a single individual PTE while multiple PTEs are
>>> adjusted are suspicious :)
>>
>> The idea is that I've already constrained the batch of pages such that the
>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>> pte is writable, then they all are - something has gone badly wrong if some are
>> writable and others are not.
>
> I wonder if it would be cleaner and easier to not do that, though.
>
> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
> can set the cont bit, correct?

Oh I see what you mean - that only works for cow mappings though. If you have a
shared mapping, you won't be making it read-only at fork. So if we ignore
pte_write() state when demarking the batches, we will end up with a batch of
pages with a mix of RO and RW in the parent, but then we set_ptes() for the
child and those pages will all have the permissions of the first page of the batch.

I guess we could special case and do it the way you suggested for cow mappings;
it might be faster, but certainly not cleaner and easier IMHO.
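
For reference, that special case would look something like this inside the
pte-matching loop - just a sketch, assuming the vma flags get passed down and
using a hypothetical any_writable out-parameter:

	if (is_cow_mapping(vm_flags)) {
		if (pte_write(ptent))
			*any_writable = true;
		/* ignore the write bit when comparing pgprots */
		ptent = pte_wrprotect(ptent);
	}

with the reference prot derived the same way from the first pte. For shared
mappings the write bit would still take part in the pgprot comparison, so we
would never batch a mix of RO and RW pages there.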

>

2023-11-16 14:13:56

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 14:49, Ryan Roberts wrote:
> On 16/11/2023 13:20, David Hildenbrand wrote:
>> On 16.11.23 12:20, Ryan Roberts wrote:
>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>> to batching the folio reference count management and calling set_ptes()
>>>>> rather than making individual calls to set_pte_at().
>>>>>
>>>>> However, the primary motivation for this change is to reduce the number
>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>> during fork, as it is about to add transparent support for the
>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>> expensive, when all ptes in the range are being write-protected.
>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>> they are all populated - they can be initially populated as a contiguous
>>>>> range in the first place.
>>>>>
>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>>    include/linux/pgtable.h |  13 +++
>>>>>    mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>    2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>> *mm, unsigned long addres
>>>>>    }
>>>>>    #endif
>>>>>    +#ifndef ptep_set_wrprotects
>>>>> +struct mm_struct;
>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>> +                unsigned long address, pte_t *ptep,
>>>>> +                unsigned int nr)
>>>>> +{
>>>>> +    unsigned int i;
>>>>> +
>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>>    /*
>>>>>     * On some architectures hardware does not set page access bit when
>>>>> accessing
>>>>>     * memory page, it is responsibility of software setting this bit. It brings
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>> struct vm_area_struct *src_vma
>>>>>            /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>            pte = pte_mkuffd_wp(pte);
>>>>>        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>> -    return 0;
>>>>> +    return 1;
>>>>> +}
>>>>> +
>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>> +{
>>>>> +    unsigned long offset;
>>>>> +    unsigned long vaddr;
>>>>> +
>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>> +    vaddr = anchor_vaddr + offset;
>>>>> +
>>>>> +    if (anchor > page) {
>>>>> +        if (vaddr > anchor_vaddr)
>>>>> +            return 0;
>>>>> +    } else {
>>>>> +        if (vaddr < anchor_vaddr)
>>>>> +            return ULONG_MAX;
>>>>> +    }
>>>>> +
>>>>> +    return vaddr;
>>>>> +}
>>>>> +
>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>> +                      struct page *page, pte_t *pte,
>>>>> +                      unsigned long addr, unsigned long end,
>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>> +{
>>>>> +    int floops;
>>>>> +    int i;
>>>>> +    unsigned long pfn;
>>>>> +    pgprot_t prot;
>>>>> +    struct page *folio_end;
>>>>> +
>>>>> +    if (!folio_test_large(folio))
>>>>> +        return 1;
>>>>> +
>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>> +    pfn = page_to_pfn(page);
>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>> +
>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>> +
>>>>> +    pfn++;
>>>>> +    pte++;
>>>>> +
>>>>> +    for (i = 1; i < floops; i++) {
>>>>> +        ptent = ptep_get(pte);
>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>> +
>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>> +            break;
>>>>> +
>>>>> +        if (pte_dirty(ptent))
>>>>> +            *any_dirty = true;
>>>>> +
>>>>> +        pfn++;
>>>>> +        pte++;
>>>>> +    }
>>>>> +
>>>>> +    return i;
>>>>>    }
>>>>>      /*
>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>>>> - * is required to copy this pte.
>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>>>> + * first pte.
>>>>>     */
>>>>>    static inline int
>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>> *src_vma,
>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>> -         struct folio **prealloc)
>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>> *src_vma,
>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>> +          unsigned long addr, unsigned long end,
>>>>> +          int *rss, struct folio **prealloc)
>>>>>    {
>>>>>        struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>        unsigned long vm_flags = src_vma->vm_flags;
>>>>>        pte_t pte = ptep_get(src_pte);
>>>>>        struct page *page;
>>>>>        struct folio *folio;
>>>>> +    int nr = 1;
>>>>> +    bool anon;
>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>> +    int i;
>>>>>          page = vm_normal_page(src_vma, addr, pte);
>>>>> -    if (page)
>>>>> +    if (page) {
>>>>>            folio = page_folio(page);
>>>>> -    if (page && folio_test_anon(folio)) {
>>>>> -        /*
>>>>> -         * If this page may have been pinned by the parent process,
>>>>> -         * copy the page immediately for the child so that we'll always
>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>> -         * future.
>>>>> -         */
>>>>> -        folio_get(folio);
>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>> -            /* Page may be pinned, we have to copy. */
>>>>> -            folio_put(folio);
>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>> -                         addr, rss, prealloc, page);
>>>>> +        anon = folio_test_anon(folio);
>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>> +                        end, pte, &any_dirty);
>>>>> +
>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>> +            if (anon) {
>>>>> +                /*
>>>>> +                 * If this page may have been pinned by the
>>>>> +                 * parent process, copy the page immediately for
>>>>> +                 * the child so that we'll always guarantee the
>>>>> +                 * pinned page won't be randomly replaced in the
>>>>> +                 * future.
>>>>> +                 */
>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>> +                        page, false, src_vma))) {
>>>>> +                    if (i != 0)
>>>>> +                        break;
>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>> +                    return copy_present_page(
>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>> +                        src_pte, addr, rss, prealloc,
>>>>> +                        page);
>>>>> +                }
>>>>> +                rss[MM_ANONPAGES]++;
>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>> +            } else {
>>>>> +                page_dup_file_rmap(page, false);
>>>>> +                rss[mm_counter_file(page)]++;
>>>>> +            }
>>>>>            }
>>>>> -        rss[MM_ANONPAGES]++;
>>>>> -    } else if (page) {
>>>>> -        folio_get(folio);
>>>>> -        page_dup_file_rmap(page, false);
>>>>> -        rss[mm_counter_file(page)]++;
>>>>> +
>>>>> +        nr = i;
>>>>> +        folio_ref_add(folio, nr);
>>>>>        }
>>>>>          /*
>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>>> vm_area_struct *src_vma,
>>>>>         * in the parent and the child
>>>>>         */
>>>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>            pte = pte_wrprotect(pte);
>>>>
>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>
>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>> adjusted are suspicious :)
>>>
>>> The idea is that I've already constrained the batch of pages such that the
>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>>> pte is writable, then they all are - something has gone badly wrong if some are
>>> writable and others are not.
>>
>> I wonder if it would be cleaner and easier to not do that, though.
>>
>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>> can set the cont bit, correct?
>
> Oh I see what you mean - that only works for cow mappings though. If you have a
> shared mapping, you won't be making it read-only at fork. So if we ignore
> pte_write() state when demarking the batches, we will end up with a batch of
> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
> child and those pages will all have the permissions of the first page of the batch.

I see what you mean.

After fork(), all anon pages will be R/O in the parent and the child.
Easy. If any PTE is writable, wrprotect all in the parent and the child.

After fork(), all shared pages can be R/O or R/W in the parent. For
simplicity, I think you can simply set them all R/O in the child. So if
any PTE is writable, wrprotect all in the child.

Why? In the default case, fork() does not even care about MAP_SHARED
mappings; it does not copy the page tables/ptes. See vma_needs_copy().

Only in corner cases (e.g., uffd-wp, VM_PFNMAP, VM_MIXEDMAP), or in
MAP_PRIVATE mappings, do you even end up in that code.

In MAP_PRIVATE mappings, only anon pages can be R/W, other pages can
never be writable, so it does not matter. In VM_PFNMAP/VM_MIXEDMAP
likely all permissions match either way.

So you might just wrprotect the !anon pages for the child and nobody
should really notice it; write faults will resolve it.
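
In code, roughly - sketch only, with any_writable being a hypothetical flag
collected while building the batch:

	if (any_writable) {
		if (is_cow_mapping(vm_flags))
			/* make the whole batch R/O in the parent */
			ptep_set_wrprotects(src_mm, addr, src_pte, nr);
		/* and hand the whole batch to the child R/O */
		pte = pte_wrprotect(pte);
	}

The only cost for the shared corner cases is a few extra write faults in the
child.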

Famous last words :)

--
Cheers,

David / dhildenb

2023-11-16 14:16:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16.11.23 15:13, David Hildenbrand wrote:
> On 16.11.23 14:49, Ryan Roberts wrote:
>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>> rather than making individual calls to set_pte_at().
>>>>>>
>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>> during fork, as it is about to add transparent support for the
>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>> range in the first place.
>>>>>>
>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>> ---
>>>>>>    include/linux/pgtable.h |  13 +++
>>>>>>    mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>    2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>> *mm, unsigned long addres
>>>>>>    }
>>>>>>    #endif
>>>>>>    +#ifndef ptep_set_wrprotects
>>>>>> +struct mm_struct;
>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>> +                unsigned int nr)
>>>>>> +{
>>>>>> +    unsigned int i;
>>>>>> +
>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>>    /*
>>>>>>     * On some architectures hardware does not set page access bit when
>>>>>> accessing
>>>>>>     * memory page, it is responsibility of software setting this bit. It brings
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>> struct vm_area_struct *src_vma
>>>>>>            /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>            pte = pte_mkuffd_wp(pte);
>>>>>>        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>> -    return 0;
>>>>>> +    return 1;
>>>>>> +}
>>>>>> +
>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>> +{
>>>>>> +    unsigned long offset;
>>>>>> +    unsigned long vaddr;
>>>>>> +
>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>> +
>>>>>> +    if (anchor > page) {
>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>> +            return 0;
>>>>>> +    } else {
>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>> +            return ULONG_MAX;
>>>>>> +    }
>>>>>> +
>>>>>> +    return vaddr;
>>>>>> +}
>>>>>> +
>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>> +                      struct page *page, pte_t *pte,
>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>> +{
>>>>>> +    int floops;
>>>>>> +    int i;
>>>>>> +    unsigned long pfn;
>>>>>> +    pgprot_t prot;
>>>>>> +    struct page *folio_end;
>>>>>> +
>>>>>> +    if (!folio_test_large(folio))
>>>>>> +        return 1;
>>>>>> +
>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>> +    pfn = page_to_pfn(page);
>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>> +
>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>> +
>>>>>> +    pfn++;
>>>>>> +    pte++;
>>>>>> +
>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>> +        ptent = ptep_get(pte);
>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>> +
>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>> +            break;
>>>>>> +
>>>>>> +        if (pte_dirty(ptent))
>>>>>> +            *any_dirty = true;
>>>>>> +
>>>>>> +        pfn++;
>>>>>> +        pte++;
>>>>>> +    }
>>>>>> +
>>>>>> +    return i;
>>>>>>    }
>>>>>>      /*
>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>>>>> - * is required to copy this pte.
>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>>>>> + * first pte.
>>>>>>     */
>>>>>>    static inline int
>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>> *src_vma,
>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>> -         struct folio **prealloc)
>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>> *src_vma,
>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>> +          unsigned long addr, unsigned long end,
>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>    {
>>>>>>        struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>        unsigned long vm_flags = src_vma->vm_flags;
>>>>>>        pte_t pte = ptep_get(src_pte);
>>>>>>        struct page *page;
>>>>>>        struct folio *folio;
>>>>>> +    int nr = 1;
>>>>>> +    bool anon;
>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>> +    int i;
>>>>>>          page = vm_normal_page(src_vma, addr, pte);
>>>>>> -    if (page)
>>>>>> +    if (page) {
>>>>>>            folio = page_folio(page);
>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>> -        /*
>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>> -         * future.
>>>>>> -         */
>>>>>> -        folio_get(folio);
>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>> -            folio_put(folio);
>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>> -                         addr, rss, prealloc, page);
>>>>>> +        anon = folio_test_anon(folio);
>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>> +                        end, pte, &any_dirty);
>>>>>> +
>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>> +            if (anon) {
>>>>>> +                /*
>>>>>> +                 * If this page may have been pinned by the
>>>>>> +                 * parent process, copy the page immediately for
>>>>>> +                 * the child so that we'll always guarantee the
>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>> +                 * future.
>>>>>> +                 */
>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>> +                        page, false, src_vma))) {
>>>>>> +                    if (i != 0)
>>>>>> +                        break;
>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>> +                    return copy_present_page(
>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>> +                        page);
>>>>>> +                }
>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>> +            } else {
>>>>>> +                page_dup_file_rmap(page, false);
>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>> +            }
>>>>>>            }
>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>> -    } else if (page) {
>>>>>> -        folio_get(folio);
>>>>>> -        page_dup_file_rmap(page, false);
>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>> +
>>>>>> +        nr = i;
>>>>>> +        folio_ref_add(folio, nr);
>>>>>>        }
>>>>>>          /*
>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>>>> vm_area_struct *src_vma,
>>>>>>         * in the parent and the child
>>>>>>         */
>>>>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>            pte = pte_wrprotect(pte);
>>>>>
>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>
>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>> adjusted are suspicious :)
>>>>
>>>> The idea is that I've already constrained the batch of pages such that the
>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>>>> pte is writable, then they all are - something has gone badly wrong if some are
>>>> writable and others are not.
>>>
>>> I wonder if it would be cleaner and easier to not do that, though.
>>>
>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>> can set the cont bit, correct?
>>
>> Oh I see what you mean - that only works for cow mappings though. If you have a
>> shared mapping, you won't be making it read-only at fork. So if we ignore
>> pte_write() state when demarking the batches, we will end up with a batch of
>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>> child and those pages will all have the permissions of the first page of the batch.
>
> I see what you mean.
>
> After fork(), all anon pages will be R/O in the parent and the child.
> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>
> After fork(), all shared pages can be R/O or R/W in the parent. For
> simplicity, I think you can simply set them all R/O in the child. So if
> any PTE is writable, wrprotect all in the child.

Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.

But the devil is in the detail.

--
Cheers,

David / dhildenb

2023-11-16 17:59:19

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 14:15, David Hildenbrand wrote:
> On 16.11.23 15:13, David Hildenbrand wrote:
>> On 16.11.23 14:49, Ryan Roberts wrote:
>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>
>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>> range in the first place.
>>>>>>>
>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>>      include/linux/pgtable.h |  13 +++
>>>>>>>      mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>      2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>> *mm, unsigned long addres
>>>>>>>      }
>>>>>>>      #endif
>>>>>>>      +#ifndef ptep_set_wrprotects
>>>>>>> +struct mm_struct;
>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>> +                unsigned int nr)
>>>>>>> +{
>>>>>>> +    unsigned int i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>> +}
>>>>>>> +#endif
>>>>>>> +
>>>>>>>      /*
>>>>>>>       * On some architectures hardware does not set page access bit when
>>>>>>> accessing
>>>>>>>       * memory page, it is responsibility of software setting this bit.
>>>>>>> It brings
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>> struct vm_area_struct *src_vma
>>>>>>>              /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>              pte = pte_mkuffd_wp(pte);
>>>>>>>          set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>> -    return 0;
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>> +{
>>>>>>> +    unsigned long offset;
>>>>>>> +    unsigned long vaddr;
>>>>>>> +
>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>> +
>>>>>>> +    if (anchor > page) {
>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>> +            return 0;
>>>>>>> +    } else {
>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>> +            return ULONG_MAX;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return vaddr;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>> +{
>>>>>>> +    int floops;
>>>>>>> +    int i;
>>>>>>> +    unsigned long pfn;
>>>>>>> +    pgprot_t prot;
>>>>>>> +    struct page *folio_end;
>>>>>>> +
>>>>>>> +    if (!folio_test_large(folio))
>>>>>>> +        return 1;
>>>>>>> +
>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>> +
>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>> +
>>>>>>> +    pfn++;
>>>>>>> +    pte++;
>>>>>>> +
>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>> +        ptent = ptep_get(pte);
>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>> +
>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>> +            break;
>>>>>>> +
>>>>>>> +        if (pte_dirty(ptent))
>>>>>>> +            *any_dirty = true;
>>>>>>> +
>>>>>>> +        pfn++;
>>>>>>> +        pte++;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return i;
>>>>>>>      }
>>>>>>>        /*
>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>> page
>>>>>>> - * is required to copy this pte.
>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>> copy the
>>>>>>> + * first pte.
>>>>>>>       */
>>>>>>>      static inline int
>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>> -         struct folio **prealloc)
>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>      {
>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>          struct page *page;
>>>>>>>          struct folio *folio;
>>>>>>> +    int nr = 1;
>>>>>>> +    bool anon;
>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>> +    int i;
>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>> -    if (page)
>>>>>>> +    if (page) {
>>>>>>>              folio = page_folio(page);
>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>> -        /*
>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>> -         * future.
>>>>>>> -         */
>>>>>>> -        folio_get(folio);
>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>> -            folio_put(folio);
>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>> +                        end, pte, &any_dirty);
>>>>>>> +
>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>> +            if (anon) {
>>>>>>> +                /*
>>>>>>> +                 * If this page may have been pinned by the
>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>> +                 * future.
>>>>>>> +                 */
>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +                        page, false, src_vma))) {
>>>>>>> +                    if (i != 0)
>>>>>>> +                        break;
>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>> +                    return copy_present_page(
>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>> +                        page);
>>>>>>> +                }
>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +            } else {
>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>> +            }
>>>>>>>              }
>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>> -    } else if (page) {
>>>>>>> -        folio_get(folio);
>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +        nr = i;
>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>          }
>>>>>>>            /*
>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>> struct
>>>>>>> vm_area_struct *src_vma,
>>>>>>>           * in the parent and the child
>>>>>>>           */
>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>
>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>
>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>> adjusted are suspicious :)
>>>>>
>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>> first
>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>> are
>>>>> writable and others are not.
>>>>
>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>
>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>> can set the cont bit, correct?
>>>
>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>> pte_write() state when demarking the batches, we will end up with a batch of
>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>> child and those pages will all have the permissions of the first page of the
>>> batch.
>>
>> I see what you mean.
>>
>> After fork(), all anon pages will be R/O in the parent and the child.
>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>
>> After fork(), all shared pages can be R/O or R/W in the parent. For
>> simplicity, I think you can simply set them all R/O in the child. So if
>> any PTE is writable, wrprotect all in the child.
>
> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
>
> But the devil is in the detail.

OK I think I follow. I'll implement this for v3. Thanks!
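
Concretely, the rough shape I have in mind for the batch scan (sketch only;
naming and plumbing will probably change): mask the write bit out of the
pgprot comparison and report it separately via a hypothetical any_writable
out-parameter:

	if (pte_write(ptent))
		*any_writable = true;

	/* compare prots with access, dirty and write bits masked off */
	ptent = pte_mkold(pte_mkclean(pte_wrprotect(ptent)));

	if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
	    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
		break;

with the reference prot derived from the first pte in the same way.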

2023-11-21 11:23:06

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings


Ryan Roberts <[email protected]> writes:

[...]

> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, bool fold)
> +{
> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> + unsigned long start_addr;
> + pte_t *start_ptep;
> + int i;
> +
> + start_ptep = ptep = contpte_align_down(ptep);
> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> + if (pte_dirty(ptent))
> + pte = pte_mkdirty(pte);
> +
> + if (pte_young(ptent))
> + pte = pte_mkyoung(pte);
> + }
> +
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the virtual and physical addresses are
> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
> + * remaining checks are to ensure that the contpte range is fully
> + * covered by a single folio, and ensure that all the ptes are valid
> + * with contiguous PFNs and matching prots. We ignore the state of the
> + * access and dirty bits for the purpose of deciding if it's a contiguous
> + * range; the folding process will generate a single contpte entry which
> + * has a single access and dirty bit. Those 2 bits are the logical OR of
> + * their respective bits in the constituent pte entries. In order to
> + * ensure the contpte range is covered by a single folio, we must
> + * recover the folio from the pfn, but special mappings don't have a
> + * folio backing them. Fortunately contpte_try_fold() already checked
> + * that the pte is not special - we never try to fold special mappings.
> + * Note we can't use vm_normal_page() for this since we don't have the
> + * vma.
> + */
> +
> + struct page *page = pte_page(pte);
> + struct folio *folio = page_folio(page);
> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> + unsigned long pfn;
> + pgprot_t prot;
> + pte_t subpte;
> + pte_t *orig_ptep;
> + int i;
> +
> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> + return;
> +
> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> + orig_ptep = ptep;
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + subpte = __ptep_get(ptep);
> + subpte = pte_mkold(pte_mkclean(subpte));
> +
> + if (!pte_valid(subpte) ||
> + pte_pfn(subpte) != pfn ||
> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
> + return;
> + }
> +
> + contpte_fold(mm, addr, orig_ptep, pte, true);
> +}
> +EXPORT_SYMBOL(__contpte_try_fold);
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the ptes are contiguous in
> + * contpte_try_unfold(), so we can unfold unconditionally here.
> + */
> +
> + contpte_fold(mm, addr, ptep, pte, false);

I'm still working my way through the series but calling a fold during an
unfold stood out as it seemed wrong. Obviously further reading revealed
the boolean flag that changes the function's meaning but I think it would
be better to refactor that.

We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
the pte calculation loop into a separate helper
(eg. calculate_contpte_dirty_young() or some hopefully better name)
called further up the stack. That has an added benefit of providing a
spot to add the nice comment for young/dirty rules you provided in the
patch description ;-)

In other words we'd have something like:

void __contpte_try_unfold() {
	pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
	pte = pte_mknoncont(pte);
	set_cont_ptes(mm, addr, ptep, pte);
}

Which IMHO is more immediately understandable.
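
With the helper being more or less the existing loop body from contpte_fold()
above, something like this (the name is just a suggestion, and it assumes the
caller has already aligned ptep/addr down to the contpte boundary):

static pte_t calculate_contpte_dirty_young(struct mm_struct *mm,
			unsigned long addr, pte_t *ptep, pte_t pte)
{
	int i;

	/* tear the range down, folding access/dirty into the new pte */
	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);

		if (pte_dirty(ptent))
			pte = pte_mkdirty(pte);

		if (pte_young(ptent))
			pte = pte_mkyoung(pte);
	}

	return pte;
}

leaving the tlb flush and __set_ptes() to set_cont_ptes().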

- Alistair

> +}
> +EXPORT_SYMBOL(__contpte_try_unfold);
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We are guaranteed to be holding the PTL, so any
> + * contiguous range cannot be unfolded or otherwise modified under our
> + * feet.
> + */
> +
> + pte_t pte;
> + int i;
> +
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get);
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We may not be holding the PTL, so any contiguous
> + * range may be unfolded/modified/refolded under our feet. Therefore we
> + * ensure we read a _consistent_ contpte range by checking that all ptes
> + * in the range are valid and have CONT_PTE set, that all pfns are
> + * contiguous and that all pgprots are the same (ignoring access/dirty).
> + * If we find a pte that is not consistent, then we must be racing with
> + * an update so start again. If the target pte does not have CONT_PTE
> + * set then that is considered consistent on its own because it is not
> + * part of a contpte range.
> + */
> +
> + pte_t orig_pte;
> + pgprot_t orig_prot;
> + pte_t *ptep;
> + unsigned long pfn;
> + pte_t pte;
> + pgprot_t prot;
> + int i;
> +
> +retry:
> + orig_pte = __ptep_get(orig_ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return orig_pte;
> +
> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> + ptep = contpte_align_down(orig_ptep);
> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> + if (!pte_valid_cont(pte) ||
> + pte_pfn(pte) != pfn ||
> + pgprot_val(prot) != pgprot_val(orig_prot))
> + goto retry;
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> +
> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> + unsigned long next;
> + unsigned long end = addr + (nr << PAGE_SHIFT);
> + unsigned long pfn = pte_pfn(pte);
> + pgprot_t prot = pte_pgprot(pte);
> + pte_t orig_pte;
> +
> + do {
> + next = pte_cont_addr_end(addr, end);
> + nr = (next - addr) >> PAGE_SHIFT;
> + pte = pfn_pte(pfn, prot);
> +
> + if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> + pte = pte_mkcont(pte);
> + else
> + pte = pte_mknoncont(pte);
> +
> + /*
> + * If operating on a partial contiguous range then we must first
> + * unfold the contiguous range if it was previously folded.
> + * Otherwise we could end up with overlapping tlb entries.
> + */
> + if (nr != CONT_PTES)
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> + /*
> + * If we are replacing ptes that were contiguous or if the new
> + * ptes are contiguous and any of the ptes being replaced are
> + * valid, we need to clear and flush the range to prevent
> + * overlapping tlb entries.
> + */
> + orig_pte = __ptep_get(ptep);
> + if (pte_valid_cont(orig_pte) ||
> + (pte_cont(pte) && ptep_any_valid(ptep, nr)))
> + ptep_clear_flush_range(mm, addr, ptep, nr);
> +
> + __set_ptes(mm, addr, ptep, pte, nr);
> +
> + addr = next;
> + ptep += nr;
> + pfn += nr;
> +
> + } while (addr != end);
> +}
> +EXPORT_SYMBOL(contpte_set_ptes);
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * ptep_clear_flush_young() technically requires us to clear the access
> + * flag for a _single_ pte. However, the core-mm code actually tracks
> + * access/dirty per folio, not per page. And since we only create a
> + * contig range when the range is covered by a single folio, we can get
> + * away with clearing young for the whole contig range here, so we avoid
> + * having to unfold.
> + */
> +
> + int i;
> + int young = 0;
> +
> + ptep = contpte_align_down(ptep);
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> + return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + int young;
> +
> + young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +
> + if (young) {
> + /*
> + * See comment in __ptep_clear_flush_young(); same rationale for
> + * eliding the trailing DSB applies here.
> + */
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> + PAGE_SIZE, true, 3);
> + }
> +
> + return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t entry, int dirty)
> +{
> + pte_t orig_pte;
> + int i;
> + unsigned long start_addr;
> +
> + /*
> + * Gather the access/dirty bits for the contiguous range. If nothing has
> + * changed, it's a noop.
> + */
> + orig_pte = ptep_get(ptep);
> + if (pte_val(orig_pte) == pte_val(entry))
> + return 0;
> +
> + /*
> + * We can fix up access/dirty bits without having to unfold/fold the
> + * contig range. But if the write bit is changing, we need to go through
> + * the full unfold/fold cycle.
> + */
> + if (pte_write(orig_pte) == pte_write(entry)) {
> + /*
> + * For HW access management, we technically only need to update
> + * the flag on a single pte in the range. But for SW access
> + * management, we need to update all the ptes to prevent extra
> + * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
> + * and instead flush the whole range at the end.
> + */
> + ptep = contpte_align_down(ptep);
> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + __ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +
> + if (dirty)
> + __flush_tlb_range(vma, start_addr, addr,
> + PAGE_SIZE, true, 3);
> + } else {
> + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> + __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> + contpte_try_fold(vma->vm_mm, addr, ptep, entry);
> + }
> +
> + return 1;
> +}
> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);

2023-11-21 15:16:16

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings

On 21/11/2023 11:22, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
> [...]
>
>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, pte_t pte, bool fold)
>> +{
>> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> + unsigned long start_addr;
>> + pte_t *start_ptep;
>> + int i;
>> +
>> + start_ptep = ptep = contpte_align_down(ptep);
>> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>> +
>> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> + if (pte_dirty(ptent))
>> + pte = pte_mkdirty(pte);
>> +
>> + if (pte_young(ptent))
>> + pte = pte_mkyoung(pte);
>> + }
>> +
>> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, pte_t pte)
>> +{
>> + /*
>> + * We have already checked that the virtual and physical addresses are
>> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
>> + * remaining checks are to ensure that the contpte range is fully
>> + * covered by a single folio, and ensure that all the ptes are valid
>> + * with contiguous PFNs and matching prots. We ignore the state of the
>> + * access and dirty bits for the purpose of deciding if it's a contiguous
>> + * range; the folding process will generate a single contpte entry which
>> + * has a single access and dirty bit. Those 2 bits are the logical OR of
>> + * their respective bits in the constituent pte entries. In order to
>> + * ensure the contpte range is covered by a single folio, we must
>> + * recover the folio from the pfn, but special mappings don't have a
>> + * folio backing them. Fortunately contpte_try_fold() already checked
>> + * that the pte is not special - we never try to fold special mappings.
>> + * Note we can't use vm_normal_page() for this since we don't have the
>> + * vma.
>> + */
>> +
>> + struct page *page = pte_page(pte);
>> + struct folio *folio = page_folio(page);
>> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>> + unsigned long pfn;
>> + pgprot_t prot;
>> + pte_t subpte;
>> + pte_t *orig_ptep;
>> + int i;
>> +
>> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>> + return;
>> +
>> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> + orig_ptep = ptep;
>> + ptep = contpte_align_down(ptep);
>> +
>> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> + subpte = __ptep_get(ptep);
>> + subpte = pte_mkold(pte_mkclean(subpte));
>> +
>> + if (!pte_valid(subpte) ||
>> + pte_pfn(subpte) != pfn ||
>> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>> + return;
>> + }
>> +
>> + contpte_fold(mm, addr, orig_ptep, pte, true);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_fold);
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, pte_t pte)
>> +{
>> + /*
>> + * We have already checked that the ptes are contiguous in
>> + * contpte_try_unfold(), so we can unfold unconditionally here.
>> + */
>> +
>> + contpte_fold(mm, addr, ptep, pte, false);
>
> I'm still working my way through the series but

Thanks for taking the time to review!

> calling a fold during an
> unfold stood out as it seemed wrong. Obviously further reading revealed
> the boolean flag that changes the function's meaning but I think it would
> be better to refactor that.

Yes that sounds reasonable.

>
> We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
> the pte calculation loop into a separate helper
> (eg. calculate_contpte_dirty_young() or some hopefully better name)
> called further up the stack. That has an added benefit of providing a
> spot to add the nice comment for young/dirty rules you provided in the
> patch description ;-)
>
> In other words we'd have something like:
>
> void __contpte_try_unfold() {
> 	pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
> 	pte = pte_mknoncont(pte);
> 	set_cont_ptes(mm, addr, ptep, pte);
> }

My concern with this approach is that calculate_contpte_dirty_young() has side
effects; it has to clear each PTE as it loops through, to prevent a race between
our reading access/dirty and another thread causing access/dirty to be set. So
it's not just a "calculation", it's the teardown portion of the process too. I
guess it's a matter of taste, so I'm happy for it to be argued the other way,
but I would prefer to keep it all together in one function.

How about renaming contpte_fold() to contpte_convert() or contpte_repaint()
(other suggestions welcome), and extracting the pte_mkcont()/pte_mknoncont()
part (so we can remove the bool param):

void __contpte_try_unfold() {
pte = pte_mknoncont(pte);
contpte_convert(mm, addr, ptep, pte);
}
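
For completeness, the fold side would then be symmetric; a minimal sketch,
assuming the proposed contpte_convert() name, with the existing
folio/pfn/prot checks in __contpte_try_fold() left where they are:

void __contpte_try_fold() {
/* ... existing folio/pfn/prot checks, then: */
pte = pte_mkcont(pte);
contpte_convert(mm, addr, ptep, pte);
}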

Thanks,
Ryan

>
> Which IMHO is more immediately understandable.
>
> - Alistair
>

2023-11-22 06:03:40

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings


Ryan Roberts <[email protected]> writes:

> On 21/11/2023 11:22, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>> [...]
>>
>>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>>> + pte_t *ptep, pte_t pte, bool fold)
>>> +{
>>> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>> + unsigned long start_addr;
>>> + pte_t *start_ptep;
>>> + int i;
>>> +
>>> + start_ptep = ptep = contpte_align_down(ptep);
>>> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>>> +
>>> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> + if (pte_dirty(ptent))
>>> + pte = pte_mkdirty(pte);
>>> +
>>> + if (pte_young(ptent))
>>> + pte = pte_mkyoung(pte);
>>> + }
>>> +
>>> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>> +
>>> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>> +}
>>> +
>>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>> + pte_t *ptep, pte_t pte)
>>> +{
>>> + /*
>>> + * We have already checked that the virtual and physical addresses are
>>> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
>>> + * remaining checks are to ensure that the contpte range is fully
>>> + * covered by a single folio, and ensure that all the ptes are valid
>>> + * with contiguous PFNs and matching prots. We ignore the state of the
>>> + * access and dirty bits for the purpose of deciding if it's a contiguous
>>> + * range; the folding process will generate a single contpte entry which
>>> + * has a single access and dirty bit. Those 2 bits are the logical OR of
>>> + * their respective bits in the constituent pte entries. In order to
>>> + * ensure the contpte range is covered by a single folio, we must
>>> + * recover the folio from the pfn, but special mappings don't have a
>>> + * folio backing them. Fortunately contpte_try_fold() already checked
>>> + * that the pte is not special - we never try to fold special mappings.
>>> + * Note we can't use vm_normal_page() for this since we don't have the
>>> + * vma.
>>> + */
>>> +
>>> + struct page *page = pte_page(pte);
>>> + struct folio *folio = page_folio(page);
>>> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>>> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>>> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>>> + unsigned long pfn;
>>> + pgprot_t prot;
>>> + pte_t subpte;
>>> + pte_t *orig_ptep;
>>> + int i;
>>> +
>>> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>>> + return;
>>> +
>>> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>>> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>> + orig_ptep = ptep;
>>> + ptep = contpte_align_down(ptep);
>>> +
>>> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>> + subpte = __ptep_get(ptep);
>>> + subpte = pte_mkold(pte_mkclean(subpte));
>>> +
>>> + if (!pte_valid(subpte) ||
>>> + pte_pfn(subpte) != pfn ||
>>> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>>> + return;
>>> + }
>>> +
>>> + contpte_fold(mm, addr, orig_ptep, pte, true);
>>> +}
>>> +EXPORT_SYMBOL(__contpte_try_fold);
>>> +
>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> + pte_t *ptep, pte_t pte)
>>> +{
>>> + /*
>>> + * We have already checked that the ptes are contiguous in
>>> + * contpte_try_unfold(), so we can unfold unconditionally here.
>>> + */
>>> +
>>> + contpte_fold(mm, addr, ptep, pte, false);
>>
>> I'm still working my way through the series but
>
> Thanks for taking the time to review!
>
>> calling a fold during an
>> unfold stood out as it seemed wrong. Obviously further reading revealed
>> the boolean flag that changes the function's meaning but I think it would
>> be better to refactor that.
>
> Yes that sounds reasonable.
>
>>
>> We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
>> the pte calculation loop into a separate helper
>> (eg. calculate_contpte_dirty_young() or some hopefully better name)
>> called further up the stack. That has an added benefit of providing a
>> spot to add the nice comment for young/dirty rules you provided in the
>> patch description ;-)
>>
>> In other words we'd have something like:
>>
>> void __contpte_try_unfold() {
>> pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
>> pte = pte_mknoncont(pte);
>> set_cont_ptes(mm, addr, ptep, pte);
>> }
>
> My concern with this approach is that calculate_contpte_dirty_young() has side
> effects; it has to clear each PTE as it loops through, to prevent a race between
> our reading access/dirty and another thread causing access/dirty to be set. So
> it's not just a "calculation", it's the teardown portion of the process too. I
> guess it's a taste thing, so happy for it to be argued the other way, but I would
> prefer to keep it all together in one function.
>
> How about renaming contpte_fold() to contpte_convert() or contpte_repaint()
> (other suggestions welcome), and extracting the pte_mkcont()/pte_mknoncont()
> part (so we can remove the bool param):
>
> void __contpte_try_unfold() {
> pte = pte_mknoncont(pte);
> contpte_convert(mm, addr, ptep, pte);
> }

Thanks. That works for me, although sadly I don't have any better ideas
for names atm.

- Alistair

> Thanks,
> Ryan
>
>>
>> Which IMHO is more immediately understandable.
>>
>> - Alistair
>>

2023-11-22 08:39:20

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] arm64/mm: Wire up PTE_CONT for user mappings

On 22/11/2023 06:01, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 21/11/2023 11:22, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>> [...]
>>>
>>>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>>>> + pte_t *ptep, pte_t pte, bool fold)
>>>> +{
>>>> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> + unsigned long start_addr;
>>>> + pte_t *start_ptep;
>>>> + int i;
>>>> +
>>>> + start_ptep = ptep = contpte_align_down(ptep);
>>>> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>>>> +
>>>> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> + if (pte_dirty(ptent))
>>>> + pte = pte_mkdirty(pte);
>>>> +
>>>> + if (pte_young(ptent))
>>>> + pte = pte_mkyoung(pte);
>>>> + }
>>>> +
>>>> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>> +
>>>> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>>> +}
>>>> +
>>>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>> + pte_t *ptep, pte_t pte)
>>>> +{
>>>> + /*
>>>> + * We have already checked that the virtual and physical addresses are
>>>> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
>>>> + * remaining checks are to ensure that the contpte range is fully
>>>> + * covered by a single folio, and ensure that all the ptes are valid
>>>> + * with contiguous PFNs and matching prots. We ignore the state of the
>>>> + * access and dirty bits for the purpose of deciding if it's a contiguous
>>>> + * range; the folding process will generate a single contpte entry which
>>>> + * has a single access and dirty bit. Those 2 bits are the logical OR of
>>>> + * their respective bits in the constituent pte entries. In order to
>>>> + * ensure the contpte range is covered by a single folio, we must
>>>> + * recover the folio from the pfn, but special mappings don't have a
>>>> + * folio backing them. Fortunately contpte_try_fold() already checked
>>>> + * that the pte is not special - we never try to fold special mappings.
>>>> + * Note we can't use vm_normal_page() for this since we don't have the
>>>> + * vma.
>>>> + */
>>>> +
>>>> + struct page *page = pte_page(pte);
>>>> + struct folio *folio = page_folio(page);
>>>> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>>>> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>>>> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>>>> + unsigned long pfn;
>>>> + pgprot_t prot;
>>>> + pte_t subpte;
>>>> + pte_t *orig_ptep;
>>>> + int i;
>>>> +
>>>> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>>>> + return;
>>>> +
>>>> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>>>> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> + orig_ptep = ptep;
>>>> + ptep = contpte_align_down(ptep);
>>>> +
>>>> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> + subpte = __ptep_get(ptep);
>>>> + subpte = pte_mkold(pte_mkclean(subpte));
>>>> +
>>>> + if (!pte_valid(subpte) ||
>>>> + pte_pfn(subpte) != pfn ||
>>>> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>>>> + return;
>>>> + }
>>>> +
>>>> + contpte_fold(mm, addr, orig_ptep, pte, true);
>>>> +}
>>>> +EXPORT_SYMBOL(__contpte_try_fold);
>>>> +
>>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>> + pte_t *ptep, pte_t pte)
>>>> +{
>>>> + /*
>>>> + * We have already checked that the ptes are contiguous in
>>>> + * contpte_try_unfold(), so we can unfold unconditionally here.
>>>> + */
>>>> +
>>>> + contpte_fold(mm, addr, ptep, pte, false);
>>>
>>> I'm still working my way through the series but
>>
>> Thanks for taking the time to review!
>>
>>> calling a fold during an
>>> unfold stood out as it seemed wrong. Obviously further reading revealed
>>> the boolean flag that changes the function's meaning but I think it would
>>> be better to refactor that.
>>
>> Yes that sounds reasonable.
>>
>>>
>>> We could easily rename contpte_fold() to eg. set_cont_ptes() and factor
>>> the pte calculation loop into a separate helper
>>> (eg. calculate_contpte_dirty_young() or some hopefully better name)
>>> called further up the stack. That has an added benefit of providing a
>>> spot to add the nice comment for young/dirty rules you provided in the
>>> patch description ;-)
>>>
>>> In other words we'd have something like:
>>>
>>> void __contpte_try_unfold() {
>>> pte = calculate_contpte_dirty_young(mm, addr, ptep, pte);
>>> pte = pte_mknoncont(pte);
>>> set_cont_ptes(mm, addr, ptep, pte);
>>> }
>>
>> My concern with this approach is that calculate_contpte_dirty_young() has side
>> effects; it has to clear each PTE as it loops through, to prevent a race between
>> our reading access/dirty and another thread causing access/dirty to be set. So
>> it's not just a "calculation", it's the teardown portion of the process too. I
>> guess it's a taste thing, so happy for it to be argued the other way, but I would
>> prefer to keep it all together in one function.
>>
>> How about renaming contpte_fold() to contpte_convert() or contpte_repaint()
>> (other suggestions welcome), and extracting the pte_mkcont()/pte_mknoncont()
>> part (so we can remove the bool param):
>>
>> void __contpte_try_unfold() {
>> pte = pte_mknoncont(pte);
>> contpte_convert(mm, addr, ptep, pte);
>> }
>
> Thanks. That works for me, although sadly I don't have any better ideas
> for names atm.

Thanks - I'll make this change for v3 and go with contpte_convert().

>
> - Alistair
>
>> Thanks,
>> Ryan
>>
>>>
>>> Which IMHO is more immediately understandable.
>>>
>>> - Alistair
>>>
>

2023-11-23 04:28:48

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()


Ryan Roberts <[email protected]> writes:

> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
>
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 +++
> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
> 2 files changed, 150 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
> }
> #endif
>
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> /* Uffd-wp needs to be delivered to dest pte as well */
> pte = pte_mkuffd_wp(pte);
> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + return 1;

We should update the function comment to indicate why we return 1 here
because it will become non-obvious in future. But perhaps it's better to
leave this as is and do the error check/return code calculation in
copy_present_ptes().

> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> + struct page *anchor, unsigned long anchor_vaddr)

It's likely I'm easily confused but the arguments here don't make much
sense to me. Something like this (noting that I've switched the argument
order) makes more sense to me at least:

static inline unsigned long page_cont_mapped_vaddr(struct page *page,
unsigned long page_vaddr, struct page *next_folio_page)

> +{
> + unsigned long offset;
> + unsigned long vaddr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;

Which IMHO makes this much more readable:

offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;

> + vaddr = anchor_vaddr + offset;
> +
> + if (anchor > page) {

And also highlights that I think this condition (page > folio_page_end)
is impossible to hit. Which is good ...

> + if (vaddr > anchor_vaddr)
> + return 0;

... because I'm not sure returning 0 is valid as we would end up setting
floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
particularly well :-)

> + } else {
> + if (vaddr < anchor_vaddr)

Same here - isn't the vaddr of the next folio always going to be larger
than the vaddr for the current page? It seems this function is really
just calculating the virtual address of the next folio, or am I deeply
confused?

> + return ULONG_MAX;
> + }
> +
> + return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + pte_t ptent, bool *any_dirty)
> +{
> + int floops;
> + int i;
> + unsigned long pfn;
> + pgprot_t prot;
> + struct page *folio_end;
> +
> + if (!folio_test_large(folio))
> + return 1;
> +
> + folio_end = &folio->page + folio_nr_pages(folio);

I think you can replace this with:

folio_end = folio_next(folio)

Although given this is only passed to page_cont_mapped_vaddr() perhaps
it's better to just pass the folio in and do the calculation there.
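
A minimal sketch of that first option, keeping the struct page * type that
page_cont_mapped_vaddr() currently takes:

	folio_end = &folio_next(folio)->page;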

> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> + *any_dirty = pte_dirty(ptent);
> +
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> + ptent = pte_mkold(pte_mkclean(ptent));
> +
> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> + break;
> +
> + if (pte_dirty(ptent))
> + *any_dirty = true;
> +
> + pfn++;
> + pte++;
> + }
> +
> + return i;
> }
>
> /*
> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
> */
> static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);
> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }
> }
> - rss[MM_ANONPAGES]++;
> - } else if (page) {
> - folio_get(folio);
> - page_dup_file_rmap(page, false);
> - rss[mm_counter_file(page)]++;
> +
> + nr = i;
> + folio_ref_add(folio, nr);
> }
>
> /*
> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> * in the parent and the child
> */
> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
> pte = pte_wrprotect(pte);
> }
> - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>
> /*
> - * If it's a shared mapping, mark it clean in
> - * the child
> + * If it's a shared mapping, mark it clean in the child. If its a
> + * private mapping, mark it dirty in the child if _any_ of the parent
> + * mappings in the block were marked dirty. The contiguous block of
> + * mappings are all backed by the same folio, so if any are dirty then
> + * the whole folio is dirty. This allows us to determine the batch size
> + * without having to ever consider the dirty bit. See
> + * folio_nr_pages_cont_mapped().
> */
> - if (vm_flags & VM_SHARED)
> - pte = pte_mkclean(pte);
> - pte = pte_mkold(pte);
> + pte = pte_mkold(pte_mkclean(pte));
> + if (!(vm_flags & VM_SHARED) && any_dirty)
> + pte = pte_mkdirty(pte);
>
> if (!userfaultfd_wp(dst_vma))
> pte = pte_clear_uffd_wp(pte);
>
> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> - return 0;
> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> + return nr;
> }
>
> static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> */
> WARN_ON_ONCE(ret != -ENOENT);
> }
> - /* copy_present_pte() will clear `*prealloc' if consumed */
> - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, &prealloc);
> + /* copy_present_ptes() will clear `*prealloc' if consumed */
> + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
> + addr, end, rss, &prealloc);
> +
> /*
> * If we need a pre-allocated page for this pte, drop the
> * locks, allocate, and try again.
> */
> if (unlikely(ret == -EAGAIN))
> break;
> +
> + /*
> + * Positive return value is the number of ptes copied.
> + */
> + VM_WARN_ON_ONCE(ret < 1);
> + progress += 8 * ret;
> + ret--;

Took me a second to figure out what was going on here. I think it would
be clearer to rename ret to nr_ptes ...

> + dst_pte += ret;
> + src_pte += ret;
> + addr += ret << PAGE_SHIFT;
> + ret = 0;
> +
> if (unlikely(prealloc)) {
> /*
> * pre-alloc page cannot be reused by next time so as
> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> folio_put(prealloc);
> prealloc = NULL;
> }
> - progress += 8;
> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

... and do dst_pte += nr_ptes, etc. here instead (noting of course that
the continue clauses will need nr_ptes == 1, but perhaps reset that at
the start of the loop).

> arch_leave_lazy_mmu_mode();
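
To make that concrete, a rough sketch only (nr_ptes is the suggested rename;
as noted, the earlier continue paths would need nr_ptes set to 1):

		/* copy_present_ptes() will clear `*prealloc' if consumed */
		nr_ptes = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
					    addr, end, rss, &prealloc);
		if (unlikely(nr_ptes == -EAGAIN))
			break;
		progress += 8 * nr_ptes;
		...
	} while (dst_pte += nr_ptes, src_pte += nr_ptes,
		 addr += nr_ptes << PAGE_SHIFT, addr != end);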

2023-11-23 05:17:46

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> ptep_get_and_clear_full() adds a 'full' parameter which is not present
> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
> a full address space teardown is in progress. We use this information to
> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
> contiguous neighbours to keep their contig bit set, because we know we
> are about to clear the rest too.
>
> Before this optimization, the cost of arm64_sys_exit_group() exploded to
> 32x what it was before PTE_CONT support was wired up, when compiling the
> kernel. With this optimization in place, we are back down to the
> original cost.
>
> This approach is not perfect though, as for the duration between
> returning from the first call to ptep_get_and_clear_full() and making
> the final call, the contpte block is in an intermediate state, where some
> ptes are cleared and others are still set with the PTE_CONT bit. If any
> other APIs are called for the ptes in the contpte block during that
> time, we have to be very careful. The core code currently interleaves
> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
> must be careful to ignore the cleared entries when accumulating the
> access and dirty bits - the same goes for ptep_get_lockless(). The only
> other calls we might reasonably expect are to set markers in the
> previously cleared ptes. (We shouldn't see valid entries being set until
> after the tlbi, at which point we are no longer in the intermediate
> state). Since markers are not valid, this is safe; set_ptes() will see
> the old, invalid entry and will not attempt to unfold. And the new pte
> is also invalid so it won't attempt to fold. We shouldn't see this for
> the 'full' case anyway.
>
> The last remaining issue is returning the access/dirty bits. That info
> could be present in any of the ptes in the contpte block. ptep_get()
> will gather those bits from across the contpte block. We don't bother
> doing that here, because we know that the information is used by the
> core-mm to mark the underlying folio as accessed/dirty. And since the
> same folio must be underpinning the whole block (that was a requirement
> for folding in the first place), that information will make it to the
> folio eventually once all the ptes have been cleared. This approach
> means we don't have to play games with accumulating and storing the
> bits. It does mean that any interleaved calls to ptep_get() may lack
> correct access/dirty information if we have already cleared the pte that
> happened to store it. The core code does not rely on this though.

Does not *currently* rely on this. I can't help but think it is
potentially something that could change in the future though which would
lead to some subtle bugs.

Would there be any way of avoiding this? Half-baked thought, but could
you, for example, copy the access/dirty information to the last (or
perhaps first, most likely invalid) PTE?

- Alistair

> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 18 +++++++++--
> arch/arm64/mm/contpte.c | 54 ++++++++++++++++++++++++++++++++
> 2 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 9bd2f57a9e11..ea58a9f4e700 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1145,6 +1145,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr);
> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1270,12 +1272,24 @@ static inline void pte_clear(struct mm_struct *mm,
> __pte_clear(mm, addr, ptep);
> }
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, int full)
> +{
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(orig_pte) || !full) {
> + contpte_try_unfold(mm, addr, ptep, orig_pte);
> + return __ptep_get_and_clear(mm, addr, ptep);
> + } else
> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +
> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> - return __ptep_get_and_clear(mm, addr, ptep);
> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
> }
>
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 426be9cd4dea..5d1aaed82d32 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -144,6 +144,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> + /*
> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
> + * where some of the ptes in the range may be cleared but others
> + * are still to do. See contpte_ptep_get_and_clear_full().
> + */
> + if (pte_val(pte) == 0)
> + continue;
> +
> if (pte_dirty(pte))
> orig_pte = pte_mkdirty(orig_pte);
>
> @@ -256,6 +264,52 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL(contpte_set_ptes);
>
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * When doing a full address space teardown, we can avoid unfolding the
> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> + * just get and clear the pte. The caller is promising to call us for
> + * every pte, so every pte in the range will be cleared by the time the
> + * tlbi is issued.
> + *
> + * This approach is not perfect though, as for the duration between
> + * returning from the first call to ptep_get_and_clear_full() and making
> + * the final call, the contpte block is in an intermediate state, where
> + * some ptes are cleared and others are still set with the PTE_CONT bit.
> + * If any other APIs are called for the ptes in the contpte block during
> + * that time, we have to be very careful. The core code currently
> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> + * ptep_get() must be careful to ignore the cleared entries when
> + * accumulating the access and dirty bits - the same goes for
> + * ptep_get_lockless(). The only other calls we might reasonably expect
> + * are to set markers in the previously cleared ptes. (We shouldn't see
> + * valid entries being set until after the tlbi, at which point we are
> + * no longer in the intermediate state). Since markers are not valid,
> + * this is safe; set_ptes() will see the old, invalid entry and will not
> + * attempt to unfold. And the new pte is also invalid so it won't
> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
> + *
> + * The last remaining issue is returning the access/dirty bits. That
> + * info could be present in any of the ptes in the contpte block.
> + * ptep_get() will gather those bits from across the contpte block. We
> + * don't bother doing that here, because we know that the information is
> + * used by the core-mm to mark the underlying folio as accessed/dirty.
> + * And since the same folio must be underpinning the whole block (that
> + * was a requirement for folding in the first place), that information
> + * will make it to the folio eventually once all the ptes have been
> + * cleared. This approach means we don't have to play games with
> + * accumulating and storing the bits. It does mean that any interleaved
> + * calls to ptep_get() may lack correct access/dirty information if we
> + * have already cleared the pte that happened to store it. The core code
> + * does not rely on this though.
> + */
> +
> + return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +
> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
> {

2023-11-23 10:27:02

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 16/11/2023 14:15, David Hildenbrand wrote:
> On 16.11.23 15:13, David Hildenbrand wrote:
>> On 16.11.23 14:49, Ryan Roberts wrote:
>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>
>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>> range in the first place.
>>>>>>>
>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>>      include/linux/pgtable.h |  13 +++
>>>>>>>      mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>      2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>> *mm, unsigned long addres
>>>>>>>      }
>>>>>>>      #endif
>>>>>>>      +#ifndef ptep_set_wrprotects
>>>>>>> +struct mm_struct;
>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>> +                unsigned int nr)
>>>>>>> +{
>>>>>>> +    unsigned int i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>> +}
>>>>>>> +#endif
>>>>>>> +
>>>>>>>      /*
>>>>>>>       * On some architectures hardware does not set page access bit when
>>>>>>> accessing
>>>>>>>       * memory page, it is responsibility of software setting this bit.
>>>>>>> It brings
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>> struct vm_area_struct *src_vma
>>>>>>>              /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>              pte = pte_mkuffd_wp(pte);
>>>>>>>          set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>> -    return 0;
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>> +{
>>>>>>> +    unsigned long offset;
>>>>>>> +    unsigned long vaddr;
>>>>>>> +
>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>> +
>>>>>>> +    if (anchor > page) {
>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>> +            return 0;
>>>>>>> +    } else {
>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>> +            return ULONG_MAX;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return vaddr;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>> +{
>>>>>>> +    int floops;
>>>>>>> +    int i;
>>>>>>> +    unsigned long pfn;
>>>>>>> +    pgprot_t prot;
>>>>>>> +    struct page *folio_end;
>>>>>>> +
>>>>>>> +    if (!folio_test_large(folio))
>>>>>>> +        return 1;
>>>>>>> +
>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>> +
>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>> +
>>>>>>> +    pfn++;
>>>>>>> +    pte++;
>>>>>>> +
>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>> +        ptent = ptep_get(pte);
>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>> +
>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>> +            break;
>>>>>>> +
>>>>>>> +        if (pte_dirty(ptent))
>>>>>>> +            *any_dirty = true;
>>>>>>> +
>>>>>>> +        pfn++;
>>>>>>> +        pte++;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return i;
>>>>>>>      }
>>>>>>>        /*
>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>> page
>>>>>>> - * is required to copy this pte.
>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>> copy the
>>>>>>> + * first pte.
>>>>>>>       */
>>>>>>>      static inline int
>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>> -         struct folio **prealloc)
>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>      {
>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>          struct page *page;
>>>>>>>          struct folio *folio;
>>>>>>> +    int nr = 1;
>>>>>>> +    bool anon;
>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>> +    int i;
>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>> -    if (page)
>>>>>>> +    if (page) {
>>>>>>>              folio = page_folio(page);
>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>> -        /*
>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>> -         * future.
>>>>>>> -         */
>>>>>>> -        folio_get(folio);
>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>> -            folio_put(folio);
>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>> +                        end, pte, &any_dirty);
>>>>>>> +
>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>> +            if (anon) {
>>>>>>> +                /*
>>>>>>> +                 * If this page may have been pinned by the
>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>> +                 * future.
>>>>>>> +                 */
>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +                        page, false, src_vma))) {
>>>>>>> +                    if (i != 0)
>>>>>>> +                        break;
>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>> +                    return copy_present_page(
>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>> +                        page);
>>>>>>> +                }
>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +            } else {
>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>> +            }
>>>>>>>              }
>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>> -    } else if (page) {
>>>>>>> -        folio_get(folio);
>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +        nr = i;
>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>          }
>>>>>>>            /*
>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>> struct
>>>>>>> vm_area_struct *src_vma,
>>>>>>>           * in the parent and the child
>>>>>>>           */
>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>
>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>
>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>> adjusted are suspicious :)
>>>>>
>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>> first
>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>> are
>>>>> writable and others are not.
>>>>
>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>
>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>> can set the cont bit, correct?
>>>
>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>> pte_write() state when demarking the batches, we will end up with a batch of
>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>> child and those pages will all have the permissions of the first page of the
>>> batch.
>>
>> I see what you mean.
>>
>> After fork(), all anon pages will be R/O in the parent and the child.
>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>
>> After fork(), all shared pages can be R/O or R/W in the parent. For
>> simplicity, I think you can simply set them all R/O in the child. So if
>> any PTE is writable, wrprotect all in the child.
>
> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.

I've just come back to this to code it up, and want to clarify this last
comment; I'm already going to have to collect any_writable for the anon case, so
I will already have that info for the shared case too. I think you are
suggesting I *additionally* collect any_readonly, then in the shared case, I
only apply wrprotect if (any_writable && any_readonly). i.e. only apply
wrprotect if there is a mix of permissions for the batch, otherwise all the
permissions are the same (either all RW or all RO) and I can elide the wrprotect.
Is that what you meant?
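
In code, I think that rule would look something like this (just a sketch to
check my understanding; any_writable/any_readonly are the names from this
discussion, not anything in the posted patch):

	if (is_cow_mapping(vm_flags)) {
		/* anon/CoW: if any pte in the batch was writable,
		 * write-protect the whole batch in parent and child */
		if (any_writable) {
			ptep_set_wrprotects(src_mm, addr, src_pte, nr);
			pte = pte_wrprotect(pte);
		}
	} else if (any_writable && any_readonly) {
		/* shared mapping with a mix of RO and RW ptes: wrprotect
		 * so the child's batch ends up uniformly RO */
		pte = pte_wrprotect(pte);
	}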


>
> But devil is in the detail.
>

2023-11-23 12:13:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 23.11.23 11:26, Ryan Roberts wrote:
> On 16/11/2023 14:15, David Hildenbrand wrote:
>> On 16.11.23 15:13, David Hildenbrand wrote:
>>> On 16.11.23 14:49, Ryan Roberts wrote:
>>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>>
>>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>>> range in the first place.
>>>>>>>>
>>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>> ---
>>>>>>>>      include/linux/pgtable.h |  13 +++
>>>>>>>>      mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>>      2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>>> *mm, unsigned long addres
>>>>>>>>      }
>>>>>>>>      #endif
>>>>>>>>      +#ifndef ptep_set_wrprotects
>>>>>>>> +struct mm_struct;
>>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>>> +                unsigned int nr)
>>>>>>>> +{
>>>>>>>> +    unsigned int i;
>>>>>>>> +
>>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>      /*
>>>>>>>>       * On some architectures hardware does not set page access bit when
>>>>>>>> accessing
>>>>>>>>       * memory page, it is responsibility of software setting this bit.
>>>>>>>> It brings
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>>> struct vm_area_struct *src_vma
>>>>>>>>              /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>>              pte = pte_mkuffd_wp(pte);
>>>>>>>>          set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>>> -    return 0;
>>>>>>>> +    return 1;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>>> +{
>>>>>>>> +    unsigned long offset;
>>>>>>>> +    unsigned long vaddr;
>>>>>>>> +
>>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>>> +
>>>>>>>> +    if (anchor > page) {
>>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>>> +            return 0;
>>>>>>>> +    } else {
>>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>>> +            return ULONG_MAX;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return vaddr;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>>> +{
>>>>>>>> +    int floops;
>>>>>>>> +    int i;
>>>>>>>> +    unsigned long pfn;
>>>>>>>> +    pgprot_t prot;
>>>>>>>> +    struct page *folio_end;
>>>>>>>> +
>>>>>>>> +    if (!folio_test_large(folio))
>>>>>>>> +        return 1;
>>>>>>>> +
>>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>>> +
>>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>>> +
>>>>>>>> +    pfn++;
>>>>>>>> +    pte++;
>>>>>>>> +
>>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>>> +        ptent = ptep_get(pte);
>>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>>> +
>>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>>> +            break;
>>>>>>>> +
>>>>>>>> +        if (pte_dirty(ptent))
>>>>>>>> +            *any_dirty = true;
>>>>>>>> +
>>>>>>>> +        pfn++;
>>>>>>>> +        pte++;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return i;
>>>>>>>>      }
>>>>>>>>        /*
>>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>>> page
>>>>>>>> - * is required to copy this pte.
>>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>>> copy the
>>>>>>>> + * first pte.
>>>>>>>>       */
>>>>>>>>      static inline int
>>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>> *src_vma,
>>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>>> -         struct folio **prealloc)
>>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>> *src_vma,
>>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>>      {
>>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>>          struct page *page;
>>>>>>>>          struct folio *folio;
>>>>>>>> +    int nr = 1;
>>>>>>>> +    bool anon;
>>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>>> +    int i;
>>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>>> -    if (page)
>>>>>>>> +    if (page) {
>>>>>>>>              folio = page_folio(page);
>>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>>> -        /*
>>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>>> -         * future.
>>>>>>>> -         */
>>>>>>>> -        folio_get(folio);
>>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>>> -            folio_put(folio);
>>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>>> +                        end, pte, &any_dirty);
>>>>>>>> +
>>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>>> +            if (anon) {
>>>>>>>> +                /*
>>>>>>>> +                 * If this page may have been pinned by the
>>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>>> +                 * future.
>>>>>>>> +                 */
>>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>>> +                        page, false, src_vma))) {
>>>>>>>> +                    if (i != 0)
>>>>>>>> +                        break;
>>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>>> +                    return copy_present_page(
>>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>>> +                        page);
>>>>>>>> +                }
>>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>> +            } else {
>>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>>> +            }
>>>>>>>>              }
>>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>>> -    } else if (page) {
>>>>>>>> -        folio_get(folio);
>>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>>> +
>>>>>>>> +        nr = i;
>>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>>          }
>>>>>>>>            /*
>>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>>> struct
>>>>>>>> vm_area_struct *src_vma,
>>>>>>>>           * in the parent and the child
>>>>>>>>           */
>>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>>
>>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>>
>>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>>> adjusted are suspicious :)
>>>>>>
>>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>>> first
>>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>>> are
>>>>>> writable and others are not.
>>>>>
>>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>>
>>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>>> can set the cont bit, correct?
>>>>
>>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>>> pte_write() state when demarking the batches, we will end up with a batch of
>>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>>> child and those pages will all have the permissions of the first page of the
>>>> batch.
>>>
>>> I see what you mean.
>>>
>>> After fork(), all anon pages will be R/O in the parent and the child.
>>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>>
>>> After fork(), all shared pages can be R/O or R/W in the parent. For
>>> simplicity, I think you can simply set them all R/O in the child. So if
>>> any PTE is writable, wrprotect all in the child.
>>
>> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
>
> I've just come back to this to code it up, and want to clarify this last
> comment; I'm already going to have to collect any_writable for the anon case, so
> I will already have that info for the shared case too. I think you are
> suggesting I *additionally* collect any_readonly, then in the shared case, I
> only apply wrprotect if (any_writable && any_readonly). i.e. only apply
> wrprotect if there is a mix of permissions for the batch, otherwise all the
> permissions are the same (either all RW or all RO) and I can elide the wrprotect.
> Is that what you meant?

Yes. I suspect you might somehow be able to derive "any_readonly = nr -
!any_writable".

Within a VMA, we really should only see:
* writable VMA: some might be R/O, some might be R/W
* VMA applicable to NUMA hinting: some might be PROT_NONE, others R/O or
R/W

One could simply skip batching for now on pte_protnone() and focus on
the "writable" vs. "not-writable".

--
Cheers,

David / dhildenb

2023-11-23 12:28:57

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 23/11/2023 12:12, David Hildenbrand wrote:
> On 23.11.23 11:26, Ryan Roberts wrote:
>> On 16/11/2023 14:15, David Hildenbrand wrote:
>>> On 16.11.23 15:13, David Hildenbrand wrote:
>>>> On 16.11.23 14:49, Ryan Roberts wrote:
>>>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>>>
>>>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>>>> range in the first place.
>>>>>>>>>
>>>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>> ---
>>>>>>>>>       include/linux/pgtable.h |  13 +++
>>>>>>>>>       mm/memory.c             | 175
>>>>>>>>> +++++++++++++++++++++++++++++++---------
>>>>>>>>>       2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct
>>>>>>>>> mm_struct
>>>>>>>>> *mm, unsigned long addres
>>>>>>>>>       }
>>>>>>>>>       #endif
>>>>>>>>>       +#ifndef ptep_set_wrprotects
>>>>>>>>> +struct mm_struct;
>>>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>>>> +                unsigned int nr)
>>>>>>>>> +{
>>>>>>>>> +    unsigned int i;
>>>>>>>>> +
>>>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>>       /*
>>>>>>>>>        * On some architectures hardware does not set page access bit when
>>>>>>>>> accessing
>>>>>>>>>        * memory page, it is responsibility of software setting this bit.
>>>>>>>>> It brings
>>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>>>> --- a/mm/memory.c
>>>>>>>>> +++ b/mm/memory.c
>>>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>>>> struct vm_area_struct *src_vma
>>>>>>>>>               /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>>>               pte = pte_mkuffd_wp(pte);
>>>>>>>>>           set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>>>> -    return 0;
>>>>>>>>> +    return 1;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>>>> +{
>>>>>>>>> +    unsigned long offset;
>>>>>>>>> +    unsigned long vaddr;
>>>>>>>>> +
>>>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>>>> +
>>>>>>>>> +    if (anchor > page) {
>>>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>>>> +            return 0;
>>>>>>>>> +    } else {
>>>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>>>> +            return ULONG_MAX;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return vaddr;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>>>> +{
>>>>>>>>> +    int floops;
>>>>>>>>> +    int i;
>>>>>>>>> +    unsigned long pfn;
>>>>>>>>> +    pgprot_t prot;
>>>>>>>>> +    struct page *folio_end;
>>>>>>>>> +
>>>>>>>>> +    if (!folio_test_large(folio))
>>>>>>>>> +        return 1;
>>>>>>>>> +
>>>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>>>> +
>>>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>>>> +
>>>>>>>>> +    pfn++;
>>>>>>>>> +    pte++;
>>>>>>>>> +
>>>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>>>> +        ptent = ptep_get(pte);
>>>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>>>> +
>>>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>>>> +            break;
>>>>>>>>> +
>>>>>>>>> +        if (pte_dirty(ptent))
>>>>>>>>> +            *any_dirty = true;
>>>>>>>>> +
>>>>>>>>> +        pfn++;
>>>>>>>>> +        pte++;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return i;
>>>>>>>>>       }
>>>>>>>>>         /*
>>>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>>>> page
>>>>>>>>> - * is required to copy this pte.
>>>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if
>>>>>>>>> succeeded
>>>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>>>> copy the
>>>>>>>>> + * first pte.
>>>>>>>>>        */
>>>>>>>>>       static inline int
>>>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>>> *src_vma,
>>>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>>>> -         struct folio **prealloc)
>>>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>>> *src_vma,
>>>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>>>       {
>>>>>>>>>           struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>>>           unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>>>           pte_t pte = ptep_get(src_pte);
>>>>>>>>>           struct page *page;
>>>>>>>>>           struct folio *folio;
>>>>>>>>> +    int nr = 1;
>>>>>>>>> +    bool anon;
>>>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>>>> +    int i;
>>>>>>>>>             page = vm_normal_page(src_vma, addr, pte);
>>>>>>>>> -    if (page)
>>>>>>>>> +    if (page) {
>>>>>>>>>               folio = page_folio(page);
>>>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>>>> -        /*
>>>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>>>> -         * future.
>>>>>>>>> -         */
>>>>>>>>> -        folio_get(folio);
>>>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>>>> -            folio_put(folio);
>>>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>>>> +                        end, pte, &any_dirty);
>>>>>>>>> +
>>>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>>>> +            if (anon) {
>>>>>>>>> +                /*
>>>>>>>>> +                 * If this page may have been pinned by the
>>>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>>>> +                 * future.
>>>>>>>>> +                 */
>>>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>> +                        page, false, src_vma))) {
>>>>>>>>> +                    if (i != 0)
>>>>>>>>> +                        break;
>>>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>>>> +                    return copy_present_page(
>>>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>>>> +                        page);
>>>>>>>>> +                }
>>>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>> +            } else {
>>>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>>>> +            }
>>>>>>>>>               }
>>>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>>>> -    } else if (page) {
>>>>>>>>> -        folio_get(folio);
>>>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>>>> +
>>>>>>>>> +        nr = i;
>>>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>>>           }
>>>>>>>>>             /*
>>>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>>>> struct
>>>>>>>>> vm_area_struct *src_vma,
>>>>>>>>>            * in the parent and the child
>>>>>>>>>            */
>>>>>>>>>           if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>>>               pte = pte_wrprotect(pte);
>>>>>>>>
>>>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>>>
>>>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>>>> adjusted are suspicious :)
>>>>>>>
>>>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>>>> first pte is writable, then they all are - something has gone badly wrong
>>>>>>> if some are writable and others are not.
>>>>>>
>>>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>>>
>>>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O
>>>>>> and you
>>>>>> can set the cont bit, correct?
>>>>>
>>>>> Oh I see what you mean - that only works for cow mappings though. If you
>>>>> have a
>>>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>>>> pte_write() state when demarcating the batches, we will end up with a batch of
>>>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>>>> child and those pages will all have the permissions of the first page of the
>>>>> batch.
>>>>
>>>> I see what you mean.
>>>>
>>>> After fork(), all anon pages will be R/O in the parent and the child.
>>>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>>>
>>>> After fork(), all shared pages can be R/O or R/W in the parent. For
>>>> simplicity, I think you can simply set them all R/O in the child. So if
>>>> any PTE is writable, wrprotect all in the child.
>>>
>>> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
>>
>> I've just come back to this to code it up, and want to clarify this last
>> comment; I'm already going to have to collect any_writable for the anon case, so
>> I will already have that info for the shared case too. I think you are
>> suggesting I *additionally* collect any_readonly, then in the shared case, I
>> only apply wrprotect if (any_writable && any_readonly). i.e. only apply
>> wrprotect if there is a mix of permissions for the batch, otherwise all the
>> permissions are the same (either all RW or all RO) and I can elide the wrprotect.
>> Is that what you meant?
>
> Yes. I suspect you might somehow be able to derive "any_readonly = nr -
> !any_writable".

Yep, nice.

>
> Within a VMA, we really should only see:
> * writable VMA: some might be R/O, some might be R/W
> * VMA applicable to NUMA hinting: some might be PROT_NONE, others R/O or
>   R/W
>
> One could simply skip batching for now on pte_protnone() and focus on the
> "writable" vs. "not-writable".

I'm not sure we can simply "skip" batching on pte_protnone() since we will need
to terminate the batch if we spot it. But if we have to look for it anyway, we
might as well just terminate the batch when the value of pte_protnone()
*changes*. I'm also proposing to take this approach for pte_uffd_wp() which also
needs to be carefully preserved per-pte.
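
Roughly, I'm thinking of a hypothetical helper along these lines, called from
the scan loop in folio_nr_pages_cont_mapped() alongside the existing pfn and
pgprot checks (sketch only; the name is made up):

    static inline bool pte_batch_same_class(pte_t expect, pte_t ptent)
    {
        /*
         * End the batch when either property changes value, so protnone
         * and uffd-wp state stay uniform within a batch and are therefore
         * preserved per-pte when set_ptes() replays the template pte.
         */
        return pte_protnone(expect) == pte_protnone(ptent) &&
               pte_uffd_wp(expect) == pte_uffd_wp(ptent);
    }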


>

2023-11-23 14:43:54

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 23/11/2023 04:26, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/pgtable.h | 13 +++
>> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
>> 2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>> }
>> #endif
>>
>> +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> + unsigned long address, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> + ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>> /* Uffd-wp needs to be delivered to dest pte as well */
>> pte = pte_mkuffd_wp(pte);
>> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> - return 0;
>> + return 1;
>
> We should update the function comment to indicate why we return 1 here
> because it will become non-obvious in future. But perhaps it's better to
> leave this as is and do the error check/return code calculation in
> copy_present_ptes().

OK, I'll return 0 for success and fix it up to 1 in copy_present_ptes().
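
i.e. leave copy_present_page() returning 0 on success and translate that at
the call site, something like this (sketch only; the exact call site may end
up looking different):

    int ret = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
                                addr, rss, prealloc, page);

    return ret == 0 ? 1 : ret;  /* 1 pte copied, or -EAGAIN */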

>
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> + struct page *anchor, unsigned long anchor_vaddr)
>
> It's likely I'm easily confused but the arguments here don't make much
> sense to me. Something like this (noting that I've switched the argument
> order) makes more sense to me at least:
>
> static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> unsigned long page_vaddr, struct page *next_folio_page)

I was originally using page_cont_mapped_vaddr() in more places than here and
needed a more generic helper than just "what is the virtual address of the end
of the folio, given a random page within the folio and its virtual address"; (I
needed "what is the virtual address of a page given a different page and its
virtual address and assuming the distance between the 2 pages is the same in
physical and virtual space"). But given I don't need that generality anymore,
yes, I agree I can simplify this significantly.

I think I can remove the function entirely and replace with this in
folio_nr_pages_cont_mapped():

    /*
     * Loop either to `end` or to end of folio if it's contiguously mapped,
     * whichever is smaller.
     */
    floops = (end - addr) >> PAGE_SHIFT;
    floops = min_t(int, floops,
                   folio_pfn(folio_next(folio)) - page_to_pfn(page));

where `end` and `addr` are the parameters as passed into the function. What do
you think?

>
>> +{
>> + unsigned long offset;
>> + unsigned long vaddr;
>> +
>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>
> Which IMHO makes this much more readable:
>
> offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;
>
>> + vaddr = anchor_vaddr + offset;
>> +
>> + if (anchor > page) {
>
> And also highlights that I think this condition (page > folio_page_end)
> is impossible to hit. Which is good ...
>
>> + if (vaddr > anchor_vaddr)
>> + return 0;
>
> ... because I'm not sure returning 0 is valid as we would end up setting
> floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
> particularly well :-)

This was covering the more general case that I no longer need.

>
>> + } else {
>> + if (vaddr < anchor_vaddr)
>
> Same here - isn't the vaddr of the next folio always going to be larger
> than the vaddr for the current page? It seems this function is really
> just calculating the virtual address of the next folio, or am I deeply
> confused?

This aims to protect against the corner case, where a page from a folio is
mremap()ed very high in address space such that the extra pages from the anchor
page to the end of the folio would actually wrap back to zero. But with the
approach proposed above, this problem goes away, I think.

>
>> + return ULONG_MAX;
>> + }
>> +
>> + return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> + struct page *page, pte_t *pte,
>> + unsigned long addr, unsigned long end,
>> + pte_t ptent, bool *any_dirty)
>> +{
>> + int floops;
>> + int i;
>> + unsigned long pfn;
>> + pgprot_t prot;
>> + struct page *folio_end;
>> +
>> + if (!folio_test_large(folio))
>> + return 1;
>> +
>> + folio_end = &folio->page + folio_nr_pages(folio);
>
> I think you can replace this with:
>
> folio_end = folio_next(folio)

yep, done - thanks.

>
> Although given this is only passed to page_cont_mapped_vaddr() perhaps
> it's better to just pass the folio in and do the calculation there.
>
>> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> + floops = (end - addr) >> PAGE_SHIFT;
>> + pfn = page_to_pfn(page);
>> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> + *any_dirty = pte_dirty(ptent);
>> +
>> + pfn++;
>> + pte++;
>> +
>> + for (i = 1; i < floops; i++) {
>> + ptent = ptep_get(pte);
>> + ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> + break;
>> +
>> + if (pte_dirty(ptent))
>> + *any_dirty = true;
>> +
>> + pfn++;
>> + pte++;
>> + }
>> +
>> + return i;
>> }
>>
>> /*
>> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>> */
>> static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> - struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> + pte_t *dst_pte, pte_t *src_pte,
>> + unsigned long addr, unsigned long end,
>> + int *rss, struct folio **prealloc)
>> {
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> unsigned long vm_flags = src_vma->vm_flags;
>> pte_t pte = ptep_get(src_pte);
>> struct page *page;
>> struct folio *folio;
>> + int nr = 1;
>> + bool anon;
>> + bool any_dirty = pte_dirty(pte);
>> + int i;
>>
>> page = vm_normal_page(src_vma, addr, pte);
>> - if (page)
>> + if (page) {
>> folio = page_folio(page);
>> - if (page && folio_test_anon(folio)) {
>> - /*
>> - * If this page may have been pinned by the parent process,
>> - * copy the page immediately for the child so that we'll always
>> - * guarantee the pinned page won't be randomly replaced in the
>> - * future.
>> - */
>> - folio_get(folio);
>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> - /* Page may be pinned, we have to copy. */
>> - folio_put(folio);
>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, prealloc, page);
>> + anon = folio_test_anon(folio);
>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> + end, pte, &any_dirty);
>> +
>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>> }
>> - rss[MM_ANONPAGES]++;
>> - } else if (page) {
>> - folio_get(folio);
>> - page_dup_file_rmap(page, false);
>> - rss[mm_counter_file(page)]++;
>> +
>> + nr = i;
>> + folio_ref_add(folio, nr);
>> }
>>
>> /*
>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> * in the parent and the child
>> */
>> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>> - ptep_set_wrprotect(src_mm, addr, src_pte);
>> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>> pte = pte_wrprotect(pte);
>> }
>> - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>>
>> /*
>> - * If it's a shared mapping, mark it clean in
>> - * the child
>> + * If it's a shared mapping, mark it clean in the child. If its a
>> + * private mapping, mark it dirty in the child if _any_ of the parent
>> + * mappings in the block were marked dirty. The contiguous block of
>> + * mappings are all backed by the same folio, so if any are dirty then
>> + * the whole folio is dirty. This allows us to determine the batch size
>> + * without having to ever consider the dirty bit. See
>> + * folio_nr_pages_cont_mapped().
>> */
>> - if (vm_flags & VM_SHARED)
>> - pte = pte_mkclean(pte);
>> - pte = pte_mkold(pte);
>> + pte = pte_mkold(pte_mkclean(pte));
>> + if (!(vm_flags & VM_SHARED) && any_dirty)
>> + pte = pte_mkdirty(pte);
>>
>> if (!userfaultfd_wp(dst_vma))
>> pte = pte_clear_uffd_wp(pte);
>>
>> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> - return 0;
>> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>> + return nr;
>> }
>>
>> static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
>> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> */
>> WARN_ON_ONCE(ret != -ENOENT);
>> }
>> - /* copy_present_pte() will clear `*prealloc' if consumed */
>> - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, &prealloc);
>> + /* copy_present_ptes() will clear `*prealloc' if consumed */
>> + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
>> + addr, end, rss, &prealloc);
>> +
>> /*
>> * If we need a pre-allocated page for this pte, drop the
>> * locks, allocate, and try again.
>> */
>> if (unlikely(ret == -EAGAIN))
>> break;
>> +
>> + /*
>> + * Positive return value is the number of ptes copied.
>> + */
>> + VM_WARN_ON_ONCE(ret < 1);
>> + progress += 8 * ret;
>> + ret--;
>
> Took me a second to figure out what was going on here. I think it would
> be clearer to rename ret to nr_ptes ...
>
>> + dst_pte += ret;
>> + src_pte += ret;
>> + addr += ret << PAGE_SHIFT;
>> + ret = 0;
>> +
>> if (unlikely(prealloc)) {
>> /*
>> * pre-alloc page cannot be reused by next time so as
>> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> folio_put(prealloc);
>> prealloc = NULL;
>> }
>> - progress += 8;
>> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>
> ... and do dst_pte += nr_ptes, etc. here instead (noting of course that
> the continue clauses will need nr_ptes == 1, but perhaps reset that at
> the start of the loop).

Yes, much cleaner! Implementing for v3...
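
For the record, I'm picturing roughly this shape (sketch only; the pte_none()
and non-present handling, plus the prealloc retry logic, are elided):

    int nr_ptes;

    do {
        nr_ptes = 1;

        /* ... non-present pte handling, prealloc retry, etc ... */

        nr_ptes = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
                                    addr, end, rss, &prealloc);
        if (unlikely(nr_ptes == -EAGAIN))
            break;

        progress += 8 * nr_ptes;
    } while (dst_pte += nr_ptes, src_pte += nr_ptes,
             addr += nr_ptes * PAGE_SIZE, addr != end);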

Thanks for the review!

Thanks,
Ryan

>
>> arch_leave_lazy_mmu_mode();
>

2023-11-23 16:01:45

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 23/11/2023 05:13, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>> a full address space teardown is in progress. We use this information to
>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>> contiguous neighbours to keep their contig bit set, because we know we
>> are about to clear the rest too.
>>
>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>> 32x what it was before PTE_CONT support was wired up, when compiling the
>> kernel. With this optimization in place, we are back down to the
>> original cost.
>>
>> This approach is not perfect though, as for the duration between
>> returning from the first call to ptep_get_and_clear_full() and making
>> the final call, the contpte block is in an intermediate state, where some
>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>> other APIs are called for the ptes in the contpte block during that
>> time, we have to be very careful. The core code currently interleaves
>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>> must be careful to ignore the cleared entries when accumulating the
>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>> other calls we might reasonably expect are to set markers in the
>> previously cleared ptes. (We shouldn't see valid entries being set until
>> after the tlbi, at which point we are no longer in the intermediate
>> state). Since markers are not valid, this is safe; set_ptes() will see
>> the old, invalid entry and will not attempt to unfold. And the new pte
>> is also invalid so it won't attempt to fold. We shouldn't see this for
>> the 'full' case anyway.
>>
>> The last remaining issue is returning the access/dirty bits. That info
>> could be present in any of the ptes in the contpte block. ptep_get()
>> will gather those bits from across the contpte block. We don't bother
>> doing that here, because we know that the information is used by the
>> core-mm to mark the underlying folio as accessed/dirty. And since the
>> same folio must be underpinning the whole block (that was a requirement
>> for folding in the first place), that information will make it to the
>> folio eventually once all the ptes have been cleared. This approach
>> means we don't have to play games with accumulating and storing the
>> bits. It does mean that any interleaved calls to ptep_get() may lack
>> correct access/dirty information if we have already cleared the pte that
>> happened to store it. The core code does not rely on this though.
>
> Does not *currently* rely on this. I can't help but think it is
> potentially something that could change in the future though which would
> lead to some subtle bugs.

Yes, there is a risk, although IMHO, it's very small.

>
> Would there be any way of avoiding this? Half-baked thought but could
> you for example copy the access/dirty information to the last (or
> perhaps first, most likely invalid) PTE?

I spent a long time thinking about this and came up with a number of
possibilities, none of them ideal. In the end, I went for the simplest one
(which works but suffers from the problem that it depends on the way it is
called not changing).

1) copy the access/dirty flags into all the remaining uncleared ptes within the
contpte block. This is how I did it in v1; although it was racy. I think this
could be implemented correctly but it's extremely complex.

2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
can clear 1 or more full contpte blocks in a single call - the ptes are never in
an intermediate state. This is difficult because ptep_get_and_clear_full()
returns the pte that was cleared so it's difficult to scale that up to multiple ptes.

3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
the other functions that ptep_get_and_clear_full() is not on-going when they are
called. So we would get a clear sign that usage patterns have changed. But there
is no easy place to store that state (other than scanning a contpte block
looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.

4) The simple approach I ended up taking; I thought it would be best to keep it
simple and see if anyone was concerned before doing something more drastic.

What do you think? If we really need to solve this, then option 1 is my
preferred route, but it would take some time to figure out and reason about a
race-free scheme.

Thanks,
Ryan


2023-11-23 23:55:30

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()


Ryan Roberts <[email protected]> writes:

> On 23/11/2023 04:26, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>> maps a physically contiguous block of memory, all belonging to the same
>>> folio, with the same permissions, and for shared mappings, the same
>>> dirty state. This will likely improve performance by a tiny amount due
>>> to batching the folio reference count management and calling set_ptes()
>>> rather than making individual calls to set_pte_at().
>>>
>>> However, the primary motivation for this change is to reduce the number
>>> of tlb maintenance operations that the arm64 backend has to perform
>>> during fork, as it is about to add transparent support for the
>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>> expensive, when all ptes in the range are being write-protected.
>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>> in the child, the backend does not need to fold a contiguous range once
>>> they are all populated - they can be initially populated as a contiguous
>>> range in the first place.
>>>
>>> This change addresses the core-mm refactoring only, and introduces
>>> ptep_set_wrprotects() with a default implementation that calls
>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>> performance improvement as part of the work to enable contpte mappings.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/pgtable.h | 13 +++
>>> mm/memory.c | 175 +++++++++++++++++++++++++++++++---------
>>> 2 files changed, 150 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>>> }
>>> #endif
>>>
>>> +#ifndef ptep_set_wrprotects
>>> +struct mm_struct;
>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>> + unsigned long address, pte_t *ptep,
>>> + unsigned int nr)
>>> +{
>>> + unsigned int i;
>>> +
>>> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>> + ptep_set_wrprotect(mm, address, ptep);
>>> +}
>>> +#endif
>>> +
>>> /*
>>> * On some architectures hardware does not set page access bit when accessing
>>> * memory page, it is responsibility of software setting this bit. It brings
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1f18ed4a5497..b7c8228883cf 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>> /* Uffd-wp needs to be delivered to dest pte as well */
>>> pte = pte_mkuffd_wp(pte);
>>> set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> - return 0;
>>> + return 1;
>>
>> We should update the function comment to indicate why we return 1 here
>> because it will become non-obvious in future. But perhaps it's better to
>> leave this as is and do the error check/return code calculation in
>> copy_present_ptes().
>
> OK, I'll return 0 for success and fix it up to 1 in copy_present_ptes().
>
>>
>>> +}
>>> +
>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>> + struct page *anchor, unsigned long anchor_vaddr)
>>
>> It's likely I'm easily confused but the arguments here don't make much
>> sense to me. Something like this (noting that I've switched the argument
>> order) makes more sense to me at least:
>>
>> static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> unsigned long page_vaddr, struct page *next_folio_page)
>
> I was originally using page_cont_mapped_vaddr() in more places than here and
> needed a more generic helper than just "what is the virtual address of the end
> of the folio, given a random page within the folio and its virtual address"; (I
> needed "what is the virtual address of a page given a different page and its
> virtual address and assuming the distance between the 2 pages is the same in
> physical and virtual space"). But given I don't need that generality anymore,
> yes, I agree I can simplify this significantly.

Thanks for the explanation, that explains my head-scratching.

> I think I can remove the function entirely and replace with this in
> folio_nr_pages_cont_mapped():
>
> /*
> * Loop either to `end` or to end of folio if it's contiguously mapped,
> * whichever is smaller.
> */
> floops = (end - addr) >> PAGE_SHIFT;
> floops = min_t(int, floops,
> folio_pfn(folio_next(folio)) - page_to_pfn(page));
>
> where `end` and `addr` are the parameters as passed into the function. What do
> you think?

I'll admit that by the end of the review I was wondering why we even needed
the extra function, so this looks good to me (the comment helps too!)

>>
>>> +{
>>> + unsigned long offset;
>>> + unsigned long vaddr;
>>> +
>>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>
>> Which IMHO makes this much more readable:
>>
>> offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;
>>
>>> + vaddr = anchor_vaddr + offset;
>>> +
>>> + if (anchor > page) {
>>
>> And also highlights that I think this condition (page > folio_page_end)
>> is impossible to hit. Which is good ...
>>
>>> + if (vaddr > anchor_vaddr)
>>> + return 0;
>>
>> ... because I'm not sure returning 0 is valid as we would end up setting
>> floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
>> particularly well :-)
>
> This was covering the more general case that I no longer need.
>
>>
>>> + } else {
>>> + if (vaddr < anchor_vaddr)
>>
>> Same here - isn't the vaddr of the next folio always going to be larger
>> than the vaddr for the current page? It seems this function is really
>> just calculating the virtual address of the next folio, or am I deeply
>> confused?
>
> This aims to protect against the corner case, where a page from a folio is
> mremap()ed very high in address space such that the extra pages from the anchor
> page to the end of the folio would actually wrap back to zero. But with the
> approach proposed above, this problem goes away, I think.
>
>>
>>> + return ULONG_MAX;
>>> + }
>>> +
>>> + return vaddr;
>>> +}
>>> +
>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>> + struct page *page, pte_t *pte,
>>> + unsigned long addr, unsigned long end,
>>> + pte_t ptent, bool *any_dirty)
>>> +{
>>> + int floops;
>>> + int i;
>>> + unsigned long pfn;
>>> + pgprot_t prot;
>>> + struct page *folio_end;
>>> +
>>> + if (!folio_test_large(folio))
>>> + return 1;
>>> +
>>> + folio_end = &folio->page + folio_nr_pages(folio);
>>
>> I think you can replace this with:
>>
>> folio_end = folio_next(folio)
>
> yep, done - thanks.
>
>>
>> Although given this is only passed to page_cont_mapped_vaddr() perhaps
>> it's better to just pass the folio in and do the calculation there.
>>
>>> + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>> + floops = (end - addr) >> PAGE_SHIFT;
>>> + pfn = page_to_pfn(page);
>>> + prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>> +
>>> + *any_dirty = pte_dirty(ptent);
>>> +
>>> + pfn++;
>>> + pte++;
>>> +
>>> + for (i = 1; i < floops; i++) {
>>> + ptent = ptep_get(pte);
>>> + ptent = pte_mkold(pte_mkclean(ptent));
>>> +
>>> + if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>> + pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>> + break;
>>> +
>>> + if (pte_dirty(ptent))
>>> + *any_dirty = true;
>>> +
>>> + pfn++;
>>> + pte++;
>>> + }
>>> +
>>> + return i;
>>> }
>>>
>>> /*
>>> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>> - * is required to copy this pte.
>>> + * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>> + * first pte.
>>> */
>>> static inline int
>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> - struct folio **prealloc)
>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> + pte_t *dst_pte, pte_t *src_pte,
>>> + unsigned long addr, unsigned long end,
>>> + int *rss, struct folio **prealloc)
>>> {
>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>> unsigned long vm_flags = src_vma->vm_flags;
>>> pte_t pte = ptep_get(src_pte);
>>> struct page *page;
>>> struct folio *folio;
>>> + int nr = 1;
>>> + bool anon;
>>> + bool any_dirty = pte_dirty(pte);
>>> + int i;
>>>
>>> page = vm_normal_page(src_vma, addr, pte);
>>> - if (page)
>>> + if (page) {
>>> folio = page_folio(page);
>>> - if (page && folio_test_anon(folio)) {
>>> - /*
>>> - * If this page may have been pinned by the parent process,
>>> - * copy the page immediately for the child so that we'll always
>>> - * guarantee the pinned page won't be randomly replaced in the
>>> - * future.
>>> - */
>>> - folio_get(folio);
>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>> - /* Page may be pinned, we have to copy. */
>>> - folio_put(folio);
>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>> - addr, rss, prealloc, page);
>>> + anon = folio_test_anon(folio);
>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>> + end, pte, &any_dirty);
>>> +
>>> + for (i = 0; i < nr; i++, page++) {
>>> + if (anon) {
>>> + /*
>>> + * If this page may have been pinned by the
>>> + * parent process, copy the page immediately for
>>> + * the child so that we'll always guarantee the
>>> + * pinned page won't be randomly replaced in the
>>> + * future.
>>> + */
>>> + if (unlikely(page_try_dup_anon_rmap(
>>> + page, false, src_vma))) {
>>> + if (i != 0)
>>> + break;
>>> + /* Page may be pinned, we have to copy. */
>>> + return copy_present_page(
>>> + dst_vma, src_vma, dst_pte,
>>> + src_pte, addr, rss, prealloc,
>>> + page);
>>> + }
>>> + rss[MM_ANONPAGES]++;
>>> + VM_BUG_ON(PageAnonExclusive(page));
>>> + } else {
>>> + page_dup_file_rmap(page, false);
>>> + rss[mm_counter_file(page)]++;
>>> + }
>>> }
>>> - rss[MM_ANONPAGES]++;
>>> - } else if (page) {
>>> - folio_get(folio);
>>> - page_dup_file_rmap(page, false);
>>> - rss[mm_counter_file(page)]++;
>>> +
>>> + nr = i;
>>> + folio_ref_add(folio, nr);
>>> }
>>>
>>> /*
>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> * in the parent and the child
>>> */
>>> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>> - ptep_set_wrprotect(src_mm, addr, src_pte);
>>> + ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>> pte = pte_wrprotect(pte);
>>> }
>>> - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>>>
>>> /*
>>> - * If it's a shared mapping, mark it clean in
>>> - * the child
>>> + * If it's a shared mapping, mark it clean in the child. If its a
>>> + * private mapping, mark it dirty in the child if _any_ of the parent
>>> + * mappings in the block were marked dirty. The contiguous block of
>>> + * mappings are all backed by the same folio, so if any are dirty then
>>> + * the whole folio is dirty. This allows us to determine the batch size
>>> + * without having to ever consider the dirty bit. See
>>> + * folio_nr_pages_cont_mapped().
>>> */
>>> - if (vm_flags & VM_SHARED)
>>> - pte = pte_mkclean(pte);
>>> - pte = pte_mkold(pte);
>>> + pte = pte_mkold(pte_mkclean(pte));
>>> + if (!(vm_flags & VM_SHARED) && any_dirty)
>>> + pte = pte_mkdirty(pte);
>>>
>>> if (!userfaultfd_wp(dst_vma))
>>> pte = pte_clear_uffd_wp(pte);
>>>
>>> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> - return 0;
>>> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>> + return nr;
>>> }
>>>
>>> static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
>>> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> */
>>> WARN_ON_ONCE(ret != -ENOENT);
>>> }
>>> - /* copy_present_pte() will clear `*prealloc' if consumed */
>>> - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
>>> - addr, rss, &prealloc);
>>> + /* copy_present_ptes() will clear `*prealloc' if consumed */
>>> + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
>>> + addr, end, rss, &prealloc);
>>> +
>>> /*
>>> * If we need a pre-allocated page for this pte, drop the
>>> * locks, allocate, and try again.
>>> */
>>> if (unlikely(ret == -EAGAIN))
>>> break;
>>> +
>>> + /*
>>> + * Positive return value is the number of ptes copied.
>>> + */
>>> + VM_WARN_ON_ONCE(ret < 1);
>>> + progress += 8 * ret;
>>> + ret--;
>>
>> Took me a second to figure out what was going on here. I think it would
>> be clearer to rename ret to nr_ptes ...
>>
>>> + dst_pte += ret;
>>> + src_pte += ret;
>>> + addr += ret << PAGE_SHIFT;
>>> + ret = 0;
>>> +
>>> if (unlikely(prealloc)) {
>>> /*
>>> * pre-alloc page cannot be reused by next time so as
>>> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> folio_put(prealloc);
>>> prealloc = NULL;
>>> }
>>> - progress += 8;
>>> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>>
>> ... and do dst_pte += nr_ptes, etc. here instead (noting of course that
>> the continue clauses will need nr_ptes == 1, but perhaps reset that at
>> the start of the loop).
>
> Yes, much cleaner! Implementing for v3...
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
>>
>>> arch_leave_lazy_mmu_mode();
>>

2023-11-24 01:37:29

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 23/11/2023 05:13, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>> a full address space teardown is in progress. We use this information to
>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>> contiguous neighbours to keep their contig bit set, because we know we
>>> are about to clear the rest too.
>>>
>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>> kernel. With this optimization in place, we are back down to the
>>> original cost.
>>>
>>> This approach is not perfect though, as for the duration between
>>> returning from the first call to ptep_get_and_clear_full() and making
>>> the final call, the contpte block is in an intermediate state, where some
>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>> other APIs are called for the ptes in the contpte block during that
>>> time, we have to be very careful. The core code currently interleaves
>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>> must be careful to ignore the cleared entries when accumulating the
>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>> other calls we might reasonably expect are to set markers in the
>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>> after the tlbi, at which point we are no longer in the intermediate
>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>> the 'full' case anyway.
>>>
>>> The last remaining issue is returning the access/dirty bits. That info
>>> could be present in any of the ptes in the contpte block. ptep_get()
>>> will gather those bits from across the contpte block. We don't bother
>>> doing that here, because we know that the information is used by the
>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>> same folio must be underpinning the whole block (that was a requirement
>>> for folding in the first place), that information will make it to the
>>> folio eventually once all the ptes have been cleared. This approach
>>> means we don't have to play games with accumulating and storing the
>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>> correct access/dirty information if we have already cleared the pte that
>>> happened to store it. The core code does not rely on this though.
>>
>> Does not *currently* rely on this. I can't help but think it is
>> potentially something that could change in the future though which would
>> lead to some subtle bugs.
>
> Yes, there is a risk, although IMHO, it's very small.
>
>>
>> Would there be any way of avoiding this? Half-baked thought but could
>> you for example copy the access/dirty information to the last (or
>> perhaps first, most likely invalid) PTE?
>
> I spent a long time thinking about this and came up with a number of
> possibilities, none of them ideal. In the end, I went for the simplest one
> (which works but suffers from the problem that it depends on the way it is
> called not changing).

Ok, that answers my underlying question of "has someone thought about
this and are there any easy solutions". I suspected that was the case
given the excellent write up though!

> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
> contpte block. This is how I did it in v1; although it was racy. I think this
> could be implemented correctly but it's extremely complex.
>
> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
> can clear 1 or more full contpte blocks in a single call - the ptes are never in
> an intermediate state. This is difficult because ptep_get_and_clear_full()
> returns the pte that was cleared so it's difficult to scale that up to multiple ptes.
>
> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
> the other functions that ptep_get_and_clear_full() is not on-going when they are
> called. So we would get a clear sign that usage patterns have changed. But there
> is no easy place to store that state (other than scanning a contpte block
> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>
> 4) The simple approach I ended up taking; I thought it would be best to keep it
> simple and see if anyone was concerned before doing something more drastic.
>
> What do you think? If we really need to solve this, then option 1 is my
> preferred route, but it would take some time to figure out and reason about a
> race-free scheme.

Well I like simple, and I agree the risk is small. But I can't help feeling
the current situation is too subtle, mainly because it is
architecture-specific and the assumptions are not communicated in core-mm
code anywhere. But also none of the alternatives seem much better.

However there are only three callers of ptep_get_and_clear_full(), and
all of these hold the PTL. So if I'm not mistaken that should exclude
just about all users of ptep_get*(), which will take the PTL beforehand.

So really that only leaves ptep_get_lockless() that could/should
interleave, right? From a quick glance at those users, none look at the
young/dirty information anyway, so I wonder if we can just assert in the
core-mm that ptep_get_lockless() does not return young/dirty information
and clear it in the helpers? That would make things explicit and
consistent, which would address my concern (although I haven't looked too
closely at the details there).
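
Something along these lines maybe (sketch only; the helper name is made up
and I haven't audited all the callers):

    static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
    {
        pte_t pte = ptep_get_lockless(ptep);

        /* Lockless readers must not rely on young/dirty. */
        return pte_mkold(pte_mkclean(pte));
    }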

> Thanks,
> Ryan

2023-11-24 08:53:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

>> One could simply skip batching for now on pte_protnone() and focus on the
>> "writable" vs. "not-writable".
>
> I'm not sure we can simply "skip" batching on pte_protnone() since we will need
> to terminate the batch if we spot it. But if we have to look for it anyway, we
> might as well just terminate the batch when the value of pte_protnone()
> *changes*. I'm also proposing to take this approach for pte_uffd_wp() which also
> needs to be carefully preserved per-pte.

Yes, that's what I meant.

--
Cheers,

David / dhildenb

2023-11-24 08:56:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 24/11/2023 01:35, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>> a full address space teardown is in progress. We use this information to
>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>> are about to clear the rest too.
>>>>
>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>> kernel. With this optimization in place, we are back down to the
>>>> original cost.
>>>>
>>>> This approach is not perfect though, as for the duration between
>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>> the final call, the contpte block is in an intermediate state, where some
>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>> other APIs are called for the ptes in the contpte block during that
>>>> time, we have to be very careful. The core code currently interleaves
>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>> must be careful to ignore the cleared entries when accumulating the
>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>> other calls we might reasonably expect are to set markers in the
>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>> after the tlbi, at which point we are no longer in the intermediate
>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>> the 'full' case anyway.
>>>>
>>>> The last remaining issue is returning the access/dirty bits. That info
>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>> will gather those bits from across the contpte block. We don't bother
>>>> doing that here, because we know that the information is used by the
>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>> same folio must be underpinning the whole block (that was a requirement
>>>> for folding in the first place), that information will make it to the
>>>> folio eventually once all the ptes have been cleared. This approach
>>>> means we don't have to play games with accumulating and storing the
>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>> correct access/dirty information if we have already cleared the pte that
>>>> happened to store it. The core code does not rely on this though.
>>>
>>> Does not *currently* rely on this. I can't help but think it is
>>> potentially something that could change in the future though which would
>>> lead to some subtle bugs.
>>
>> Yes, there is a risk, although IMHO, its very small.
>>
>>>
>>> Would there be any may of avoiding this? Half baked thought but could
>>> you for example copy the access/dirty information to the last (or
>>> perhaps first, most likely invalid) PTE?
>>
>> I spent a long time thinking about this and came up with a number of
>> possibilities, none of them ideal. In the end, I went for the simplest one
>> (which works but suffers from the problem that it depends on the way it is
>> called not changing).
>
> Ok, that answers my underlying question of "has someone thought about
> this and are there any easy solutions". I suspected that was the case
> given the excellent write up though!
>
>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>> contpte block. This is how I did it in v1; although it was racy. I think this
>> could be implemented correctly but its extremely complex.
>>
>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>
>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>> called. So we would get a clear sign that usage patterns have changed. But there
>> is no easy place to store that state (other than scanning a contpte block
>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>
>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>> simple and see if anyone was concerned before doing something more drastic.
>>
>> What do you think? If we really need to solve this, then option 1 is my
>> preferred route, but it would take some time to figure out and reason about a
>> race-free scheme.
>
> Well I like simple, and I agree the risk is small. But I can't help feel
> the current situation is too subtle, mainly because it is architecture
> specific and the assumptions are not communicated in core-mm code
> anywhere. But also none of the aternatives seem much better.
>
> However there are only three callers of ptep_get_and_clear_full(), and
> all of these hold the PTL. So if I'm not mistaken that should exclude
> just about all users of ptep_get*() which will take the ptl before hand.

The problem isn't racing threads because as you say, the PTL is already
serializing all calls except ptep_get_lockless(). And although there are 3
callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
calls it with full=1, as I recall.

The problem is that the caller in zap_pte_range() does this:

ptl = lock_page_table()
for each pte {
        ptent = ptep_get(pte);
        if (pte_present(ptent)) {
                ptent = ptep_get_and_clear_full(mm, addr, pte, 1);
                if (pte_dirty(ptent))
                        ...
                if (pte_young(ptent))
                        ...
        }
}
unlock_page_table(ptl)

It deliberately interleaves calls to ptep_get() and ptep_get_and_clear_full()
under the ptl. So if the loop is iterating over a contpte block and the HW
happens to be storing the access/dirty info in the first pte entry, then the
first time through the loop, ptep_get() will return the correct access/dirty
info, as will ptep_get_and_clear_full(). The next time through the loop though,
the access/dirty info which was in the previous pte is now cleared so ptep_get()
and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
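
To make that concrete, here is a minimal sketch (in the spirit of what is
described above, but not the actual arm64 code; contpte_align_down() and
NR_CONT_PTES are assumed names) of how a contpte-aware ptep_get() can gather
the access/dirty bits while skipping entries that have already been cleared:

static pte_t sketch_contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
        pte_t *first = contpte_align_down(ptep);        /* assumed helper */
        int i;

        for (i = 0; i < NR_CONT_PTES; i++, first++) {
                pte_t pte = READ_ONCE(*first);          /* raw read of each entry */

                if (!pte_valid(pte))                    /* skip already-cleared ptes */
                        continue;

                if (pte_dirty(pte))
                        orig_pte = pte_mkdirty(orig_pte);
                if (pte_young(pte))
                        orig_pte = pte_mkyoung(orig_pte);
        }

        return orig_pte;
}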


>
> So really that only leaves ptep_get_lockless() that could/should
> interleave right?

Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
is very careful to ensure that the contpte block is in a consistent state and it
keeps trying until it is. So this will always return the correct consistent
information.
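
As a rough illustration only (reusing the hypothetical helpers from the sketch
above; this is not the real contpte_ptep_get_lockless()), the retry shape looks
something like:

static pte_t sketch_ptep_get_lockless(pte_t *ptep)
{
        pte_t orig_pte, pte;

        do {
                orig_pte = READ_ONCE(*ptep);
                if (!pte_valid_cont(orig_pte))
                        return orig_pte;

                pte = sketch_contpte_ptep_get(ptep, orig_pte);

                /*
                 * Retry if the entry changed underneath us while walking the
                 * block (ignoring access/dirty, which HW may set at any time).
                 */
        } while (!pte_same(pte_mkold(pte_mkclean(READ_ONCE(*ptep))),
                           pte_mkold(pte_mkclean(orig_pte))));

        return pte;
}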

> From a quick glance of those users none look at the
> young/dirty information anyway, so I wonder if we can just assert in the
> core-mm that ptep_get_lockless() does not return young/dirty information
> and clear it in the helpers? That would make things explicit and
> consistent which would address my concern (although I haven't looked too
> closely at the details there).

As per the explanation above, it's not ptep_get_lockless() that is the problem, so I
don't think this helps.

Thanks,
Ryan

>
>> Thanks,
>> Ryan
>

2023-11-27 03:18:48

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

> Ryan Roberts (14):
> mm: Batch-copy PTE ranges during fork()
> arm64/mm: set_pte(): New layer to manage contig bit
> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> arm64/mm: pte_clear(): New layer to manage contig bit
> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> arm64/mm: ptep_get(): New layer to manage contig bit
> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> arm64/mm: Wire up PTE_CONT for user mappings
> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

Hi Ryan,
Not quite sure if I missed something: are we splitting/unfolding CONTPTEs
in the cases below?

1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio

2. vma split in a large folio due to various reasons such as mprotect,
munmap, mlock etc.

3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
rather than being as a whole.

In hardware, we need to make sure CONTPTEs follow the rule - always 16
PTEs covering contiguous physical addresses with CONTPTE set. If one of them
runs away from the group of 16 PTEs and the PTEs become inconsistent, some
terrible errors/faults can happen in HW. For example:

case 0:
addr0 PTE - has no CONTPTE
addr0+4kb PTE - has CONTPTE
....
addr0+60kb PTE - has CONTPTE

case 1:
addr0 PTE - has no CONTPTE
addr0+4kb PTE - has CONTPTE
....
addr0+60kb PTE - has swap

Inconsistent 16 PTEs will lead to a crash even in the firmware, based on
our observation.

Thanks
Barry


2023-11-27 05:54:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> + pte_t *dst_pte, pte_t *src_pte,
> + unsigned long addr, unsigned long end,
> + int *rss, struct folio **prealloc)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
> unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
> + int nr = 1;
> + bool anon;
> + bool any_dirty = pte_dirty(pte);
> + int i;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> + if (page) {
> folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> - /*
> - * If this page may have been pinned by the parent process,
> - * copy the page immediately for the child so that we'll always
> - * guarantee the pinned page won't be randomly replaced in the
> - * future.
> - */
> - folio_get(folio);
> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> - /* Page may be pinned, we have to copy. */
> - folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + anon = folio_test_anon(folio);
> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> + end, pte, &any_dirty);

in case we have a large folio with 16 CONTPTE basepages, and userspace
do madvise(addr + 4KB * 5, DONTNEED);

thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
will return 15. in this case, we should copy page0~page3 and page5~page15.

but the current code is copying page0~page14, right? unless we immediately
split_folio() to basepages in zap_pte_range(), we will have problems?

> +
> + for (i = 0; i < nr; i++, page++) {
> + if (anon) {
> + /*
> + * If this page may have been pinned by the
> + * parent process, copy the page immediately for
> + * the child so that we'll always guarantee the
> + * pinned page won't be randomly replaced in the
> + * future.
> + */
> + if (unlikely(page_try_dup_anon_rmap(
> + page, false, src_vma))) {
> + if (i != 0)
> + break;
> + /* Page may be pinned, we have to copy. */
> + return copy_present_page(
> + dst_vma, src_vma, dst_pte,
> + src_pte, addr, rss, prealloc,
> + page);
> + }
> + rss[MM_ANONPAGES]++;
> + VM_BUG_ON(PageAnonExclusive(page));
> + } else {
> + page_dup_file_rmap(page, false);
> + rss[mm_counter_file(page)]++;
> + }

Thanks
Barry

2023-11-27 07:43:55

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 24/11/2023 01:35, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>> a full address space teardown is in progress. We use this information to
>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>> are about to clear the rest too.
>>>>>
>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>> kernel. With this optimization in place, we are back down to the
>>>>> original cost.
>>>>>
>>>>> This approach is not perfect though, as for the duration between
>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>> the final call, the contpte block in an intermediate state, where some
>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>> other APIs are called for the ptes in the contpte block during that
>>>>> time, we have to be very careful. The core code currently interleaves
>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>> other calls we might resonably expect are to set markers in the
>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>> the 'full' case anyway.
>>>>>
>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>> will gather those bits from across the contpte block. We don't bother
>>>>> doing that here, because we know that the information is used by the
>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>> for folding in the first place), that information will make it to the
>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>> means we don't have to play games with accumulating and storing the
>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>> correct access/dirty information if we have already cleared the pte that
>>>>> happened to store it. The core code does not rely on this though.
>>>>
>>>> Does not *currently* rely on this. I can't help but think it is
>>>> potentially something that could change in the future though which would
>>>> lead to some subtle bugs.
>>>
>>> Yes, there is a risk, although IMHO, its very small.
>>>
>>>>
>>>> Would there be any may of avoiding this? Half baked thought but could
>>>> you for example copy the access/dirty information to the last (or
>>>> perhaps first, most likely invalid) PTE?
>>>
>>> I spent a long time thinking about this and came up with a number of
>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>> (which works but suffers from the problem that it depends on the way it is
>>> called not changing).
>>
>> Ok, that answers my underlying question of "has someone thought about
>> this and are there any easy solutions". I suspected that was the case
>> given the excellent write up though!
>>
>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>> could be implemented correctly but its extremely complex.
>>>
>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>
>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>> called. So we would get a clear sign that usage patterns have changed. But there
>>> is no easy place to store that state (other than scanning a contpte block
>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>
>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>> simple and see if anyone was concerned before doing something more drastic.
>>>
>>> What do you think? If we really need to solve this, then option 1 is my
>>> preferred route, but it would take some time to figure out and reason about a
>>> race-free scheme.
>>
>> Well I like simple, and I agree the risk is small. But I can't help feel
>> the current situation is too subtle, mainly because it is architecture
>> specific and the assumptions are not communicated in core-mm code
>> anywhere. But also none of the aternatives seem much better.
>>
>> However there are only three callers of ptep_get_and_clear_full(), and
>> all of these hold the PTL. So if I'm not mistaken that should exclude
>> just about all users of ptep_get*() which will take the ptl before hand.
>
> The problem isn't racing threads because as you say, the PTL is already
> serializing all calls except ptep_get_lockless(). And although there are 3
> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
> calls it with full=1, as I recall.
>
> The problem is that the caller in zap_pte_range() does this:
>
> ptl = lock_page_table()
> for each pte {
> ptent = ptep_get(pte);
> if (pte_present(ptent) {
> ptent = ptep_get_and_clear_full(ptent);
> if (pte_dirty(ptent))
> ...
> if (pte_young(ptent))
> ...
> }
> }
> unlock_page_table(ptl)
>
> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
> under the ptl. So if the loop is iterating over a contpte block and the HW
> happens to be storing the access/dirty info in the first pte entry, then the
> first time through the loop, ptep_get() will return the correct access/dirty
> info, as will ptep_get_and_clear_full(). The next time through the loop though,
> the access/dirty info which was in the previous pte is now cleared so ptep_get()
> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.

So if ptep_get_lockless() isn't a concern what made the option posted in
v1 racy (your option 1 above)? Is there something else reading PTEs or
clearing PTE bits without holding the PTL that I'm missing?

>>
>> So really that only leaves ptep_get_lockless() that could/should
>> interleave right?
>
> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
> is very careful to ensure that the contpte block is in a consistent state and it
> keeps trying until it is. So this will always return the correct consistent
> information.
>
>> From a quick glance of those users none look at the
>> young/dirty information anyway, so I wonder if we can just assert in the
>> core-mm that ptep_get_lockless() does not return young/dirty information
>> and clear it in the helpers? That would make things explicit and
>> consistent which would address my concern (although I haven't looked too
>> closely at the details there).
>
> As per explanation above, its not ptep_get_lockless() that is the problem so I
> don't think this helps.
>
> Thanks,
> Ryan
>
>>
>>> Thanks,
>>> Ryan
>>

2023-11-27 08:42:56

by Barry Song

[permalink] [raw]
Subject: Re: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>> }
>> - rss[MM_ANONPAGES]++;
>> - } else if (page) {
>> - folio_get(folio);
>> - page_dup_file_rmap(page, false);
>> - rss[mm_counter_file(page)]++;
>> +
>> + nr = i;
>> + folio_ref_add(folio, nr);
>
> You're changing the order of mapcount vs. refcount increment. Don't.
> Make sure your refcount >= mapcount.
>
> You can do that easily by doing the folio_ref_add(folio, nr) first and
> then decrementing in case of error accordingly. Errors due to pinned
> pages are the corner case.
>
> I'll note that it will make a lot of sense to have batch variants of
> page_try_dup_anon_rmap() and page_dup_file_rmap().
>

I still don't understand why it is not an entire_map +1, but an increment
in each basepage.

As long as it is a CONTPTE large folio, there is not much difference from a
PMD-mapped large folio. It has every chance of becoming DoubleMapped and
needing a split.

When A and B share a CONTPTE large folio and we do madvise(DONTNEED) or
something similar on a part of the large folio in process A,

this large folio will have partially mapped subpages in A (all CONTPTE bits
in all subpages need to be removed even though we only unmap a part of the
large folio, as HW requires consistent CONTPTEs); and it is entirely mapped in
process B (all PTEs are still CONTPTEs in process B).

Isn't it more sensible for this large folio to have entire_map = 0 (for
process B), and for subpages which are still mapped in process A to have
map_count = 0? (starting from -1).

> Especially, the batch variant of page_try_dup_anon_rmap() would only
> check once if the folio maybe pinned, and in that case, you can simply
> drop all references again. So you either have all or no ptes to process,
> which makes that code easier.
>
> But that can be added on top, and I'll happily do that.
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

2023-11-27 08:53:31

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 27/11/2023 07:34, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>
>>>>> Ryan Roberts <[email protected]> writes:
>>>>>
>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>> a full address space teardown is in progress. We use this information to
>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>> are about to clear the rest too.
>>>>>>
>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>> original cost.
>>>>>>
>>>>>> This approach is not perfect though, as for the duration between
>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>> other calls we might resonably expect are to set markers in the
>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>> the 'full' case anyway.
>>>>>>
>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>> doing that here, because we know that the information is used by the
>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>> for folding in the first place), that information will make it to the
>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>> means we don't have to play games with accumulating and storing the
>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>> happened to store it. The core code does not rely on this though.
>>>>>
>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>> potentially something that could change in the future though which would
>>>>> lead to some subtle bugs.
>>>>
>>>> Yes, there is a risk, although IMHO, its very small.
>>>>
>>>>>
>>>>> Would there be any may of avoiding this? Half baked thought but could
>>>>> you for example copy the access/dirty information to the last (or
>>>>> perhaps first, most likely invalid) PTE?
>>>>
>>>> I spent a long time thinking about this and came up with a number of
>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>> (which works but suffers from the problem that it depends on the way it is
>>>> called not changing).
>>>
>>> Ok, that answers my underlying question of "has someone thought about
>>> this and are there any easy solutions". I suspected that was the case
>>> given the excellent write up though!
>>>
>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>>> could be implemented correctly but its extremely complex.
>>>>
>>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>>
>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>> is no easy place to store that state (other than scanning a contpte block
>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>
>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>
>>>> What do you think? If we really need to solve this, then option 1 is my
>>>> preferred route, but it would take some time to figure out and reason about a
>>>> race-free scheme.
>>>
>>> Well I like simple, and I agree the risk is small. But I can't help feel
>>> the current situation is too subtle, mainly because it is architecture
>>> specific and the assumptions are not communicated in core-mm code
>>> anywhere. But also none of the aternatives seem much better.
>>>
>>> However there are only three callers of ptep_get_and_clear_full(), and
>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>> just about all users of ptep_get*() which will take the ptl before hand.
>>
>> The problem isn't racing threads because as you say, the PTL is already
>> serializing all calls except ptep_get_lockless(). And although there are 3
>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>> calls it with full=1, as I recall.
>>
>> The problem is that the caller in zap_pte_range() does this:
>>
>> ptl = lock_page_table()
>> for each pte {
>> ptent = ptep_get(pte);
>> if (pte_present(ptent) {
>> ptent = ptep_get_and_clear_full(ptent);
>> if (pte_dirty(ptent))
>> ...
>> if (pte_young(ptent))
>> ...
>> }
>> }
>> unlock_page_table(ptl)
>>
>> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
>> under the ptl. So if the loop is iterating over a contpte block and the HW
>> happens to be storing the access/dirty info in the first pte entry, then the
>> first time through the loop, ptep_get() will return the correct access/dirty
>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>
> So if ptep_get_lockless() isn't a concern what made the option posted in
> v1 racy (your option 1 above)? Is there something else reading PTEs or
> clearing PTE bits without holding the PTL that I'm missing?

The HW could be racing to set access and dirty bits. Well actually, I'm not
completely sure if that's the case here; if full=1 then presumably no other
threads in the process should be running at this point, so perhaps it can be
guaranteed that nothing is causing a concurrent memory access and the HW is
therefore definitely not going to try to write the access/dirty bits
concurrently. But I didn't manage to convince myself that's definitely the case.

So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
buggy because it iterated through the PTEs, getting and accumulating. Then
iterated again, writing that final set of bits to all the PTEs. And the HW could
have modified the bits during those loops. I think it would be possible to fix
the race, but intuition says it would be expensive.
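
To illustrate the shape of that race (purely a sketch with assumed names such
as NR_CONT_PTES; this is not the v1 code), a gather-then-write scheme looks
roughly like this, and the window between the two passes is where a HW update
can be lost:

static void sketch_racy_gather_then_write(pte_t *ptep)
{
        pte_t snap[NR_CONT_PTES];
        bool dirty = false, young = false;
        int i;

        /* Pass 1: snapshot the block and accumulate access/dirty. */
        for (i = 0; i < NR_CONT_PTES; i++) {
                snap[i] = READ_ONCE(ptep[i]);
                dirty |= pte_dirty(snap[i]);
                young |= pte_young(snap[i]);
        }

        /*
         * Window: HW (via AF/DBM) may set the access or dirty bit in one of
         * the entries right here, after it was snapshotted.
         */

        /* Pass 2: rewrite every entry from the stale snapshot. */
        for (i = 0; i < NR_CONT_PTES; i++) {
                pte_t pte = snap[i];

                if (dirty)
                        pte = pte_mkdirty(pte);
                if (young)
                        pte = pte_mkyoung(pte);

                /* Any HW update since pass 1 is silently overwritten here. */
                WRITE_ONCE(ptep[i], pte);
        }
}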

>
>>>
>>> So really that only leaves ptep_get_lockless() that could/should
>>> interleave right?
>>
>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>> is very careful to ensure that the contpte block is in a consistent state and it
>> keeps trying until it is. So this will always return the correct consistent
>> information.
>>
>>> From a quick glance of those users none look at the
>>> young/dirty information anyway, so I wonder if we can just assert in the
>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>> and clear it in the helpers? That would make things explicit and
>>> consistent which would address my concern (although I haven't looked too
>>> closely at the details there).
>>
>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>> don't think this helps.
>>
>> Thanks,
>> Ryan
>>
>>>
>>>> Thanks,
>>>> Ryan
>>>
>

2023-11-27 09:20:03

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 27/11/2023 03:18, Barry Song wrote:
>> Ryan Roberts (14):
>> mm: Batch-copy PTE ranges during fork()
>> arm64/mm: set_pte(): New layer to manage contig bit
>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>> arm64/mm: pte_clear(): New layer to manage contig bit
>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>> arm64/mm: ptep_get(): New layer to manage contig bit
>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>> arm64/mm: Wire up PTE_CONT for user mappings
>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>
> Hi Ryan,
> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> in the below cases

The general idea is that the core-mm sets the individual ptes (one at a time if
it likes with set_pte_at(), or in a block with set_ptes()), modifies their
permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
(ptep_clear(), etc); this is exactly the same interface as previously.

BUT, the arm64 implementation of those interfaces will now detect when a set of
adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
base pages) are all appropriate for having the CONT_PTE bit set; in this case
the block is "folded". And it will detect when the first PTE in the block
changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
requirements for folding a contpte block is that all the pages must belong to
the *same* folio (that means it's safe to only track access/dirty for the contpte
block as a whole rather than for each individual pte).

(there are a couple of optimizations that make the reality slightly more
complicated than what I've just explained, but you get the idea).
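
As a very rough sketch of that check (contpte_try_fold() is the real entry
point; contpte_align_down() and NR_CONT_PTES are assumed names here), the core
of the fold decision is that all 16 entries must be valid and map physically
contiguous pages:

static bool sketch_can_fold(pte_t *ptep, pte_t pte)
{
        pte_t *first = contpte_align_down(ptep);        /* assumed helper */
        unsigned long pfn = pte_pfn(pte) - (ptep - first);
        int i;

        for (i = 0; i < NR_CONT_PTES; i++, first++, pfn++) {
                pte_t entry = READ_ONCE(*first);

                /* Every entry must be valid and physically contiguous. */
                if (!pte_valid(entry) || pte_pfn(entry) != pfn)
                        return false;

                /*
                 * The real check also requires matching attributes and that
                 * all pages belong to the same folio.
                 */
        }

        return true;
}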

On that basis, I believe all the specific cases you describe below are all
covered and safe - please let me know if you think there is a hole here!

>
> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio

The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
whatever). The implementation of that will cause an unfold and the CONT_PTE bit
is removed from the whole contpte block. If there is then a subsequent
set_pte_at() to set a swap entry, the implementation will see that it's not
appropriate to re-fold, so the range will remain unfolded.

>
> 2. vma split in a large folio due to various reasons such as mprotect,
> munmap, mlock etc.

I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
suspect not, so if the VMA is split in the middle of a currently folded contpte
block, it will remain folded. But this is safe and continues to work correctly.
The VMA arrangement is not important; it is just important that a single folio
is mapped contiguously across the whole block.

>
> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> rather than being as a whole.

Yes, as per 1; the arm64 implementation will notice when the first entry is
cleared and unfold the contpte block.

>
> In hardware, we need to make sure CONTPTE follow the rule - always 16
> contiguous physical address with CONTPTE set. if one of them run away
> from the 16 ptes group and PTEs become unconsistent, some terrible
> errors/faults can happen in HW. for example

Yes, the implementation obeys all these rules; see contpte_try_fold() and
contpte_try_unfold(). The fold/unfold operation is only done when all
requirements are met, and we perform it in a manner that is conformant to the
architecture requirements (see contpte_fold() - being renamed to
contpte_convert() in the next version).
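
For illustration, the conformant sequence is essentially a break-before-make
style update over the whole block - roughly the following sketch, where the
helper names are assumed and 'pte' is taken to be the value intended for the
first entry of the block (this is not the actual contpte_fold()/contpte_convert()
code):

static void sketch_contpte_convert(struct vm_area_struct *vma,
                                   unsigned long addr, pte_t *ptep, pte_t pte)
{
        struct mm_struct *mm = vma->vm_mm;
        unsigned long start = ALIGN_DOWN(addr, NR_CONT_PTES * PAGE_SIZE);
        pte_t *first = contpte_align_down(ptep);        /* assumed helper */
        int i;

        /* Break: invalidate every entry in the block... */
        for (i = 0; i < NR_CONT_PTES; i++)
                ptep_get_and_clear(mm, start + i * PAGE_SIZE, first + i);

        /* ...and flush any cached translations for the whole range. */
        flush_tlb_range(vma, start, start + NR_CONT_PTES * PAGE_SIZE);

        /* Make: rewrite the block, here with the CONT bit set on each entry. */
        for (i = 0; i < NR_CONT_PTES; i++) {
                pte_t entry = pfn_pte(pte_pfn(pte) + i, pte_pgprot(pte));

                set_pte(first + i, pte_mkcont(entry));
        }
}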

Thanks for the review!

Thanks,
Ryan

>
> case0:
> addr0 PTE - has no CONTPE
> addr0+4kb PTE - has CONTPTE
> ....
> addr0+60kb PTE - has CONTPTE
>
> case 1:
> addr0 PTE - has no CONTPE
> addr0+4kb PTE - has CONTPTE
> ....
> addr0+60kb PTE - has swap
>
> Unconsistent 16 PTEs will lead to crash even in the firmware based on
> our observation.
>
> Thanks
> Barry
>
>

2023-11-27 09:24:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 05:54, Barry Song wrote:
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> + pte_t *dst_pte, pte_t *src_pte,
>> + unsigned long addr, unsigned long end,
>> + int *rss, struct folio **prealloc)
>> {
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> unsigned long vm_flags = src_vma->vm_flags;
>> pte_t pte = ptep_get(src_pte);
>> struct page *page;
>> struct folio *folio;
>> + int nr = 1;
>> + bool anon;
>> + bool any_dirty = pte_dirty(pte);
>> + int i;
>>
>> page = vm_normal_page(src_vma, addr, pte);
>> - if (page)
>> + if (page) {
>> folio = page_folio(page);
>> - if (page && folio_test_anon(folio)) {
>> - /*
>> - * If this page may have been pinned by the parent process,
>> - * copy the page immediately for the child so that we'll always
>> - * guarantee the pinned page won't be randomly replaced in the
>> - * future.
>> - */
>> - folio_get(folio);
>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> - /* Page may be pinned, we have to copy. */
>> - folio_put(folio);
>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, prealloc, page);
>> + anon = folio_test_anon(folio);
>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> + end, pte, &any_dirty);
>
> in case we have a large folio with 16 CONTPTE basepages, and userspace
> do madvise(addr + 4KB * 5, DONTNEED);

nit: if you are offsetting by 5 pages from addr, then below I think you mean
page0~page4 and page6~15?

>
> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> will return 15. in this case, we should copy page0~page3 and page5~page15.

No, I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
not how it's intended to work. The function is scanning forwards from the current
pte until it finds the first pte that does not fit in the batch - either because
it maps a PFN that is not contiguous, or because the permissions are different
(although this is being relaxed a bit; see conversation with DavidH against this
same patch).

So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
(page0~page4) then the next time through the loop we will go through the
!present path and process the single swap marker. Then the 3rd time through the
loop folio_nr_pages_cont_mapped() will return 10.
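
For reference, the scanning behaviour is roughly the following (a simplified
sketch only, not the real folio_nr_pages_cont_mapped(); it ignores the
any_dirty accumulation and the permission checks):

static int sketch_nr_cont_mapped(struct folio *folio, struct page *page,
                                 pte_t *ptep, unsigned long addr,
                                 unsigned long end, pte_t pte)
{
        unsigned long pfn = pte_pfn(pte);
        int nr = 0;

        while (addr < end && page_folio(page) == folio) {
                pte_t entry = ptep_get(ptep);

                /* Stop at the first entry that doesn't fit the batch. */
                if (!pte_present(entry) || pte_pfn(entry) != pfn)
                        break;

                nr++;
                page++;
                ptep++;
                pfn++;
                addr += PAGE_SIZE;
        }

        return nr;
}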

Thanks,
Ryan

>
> but the current code is copying page0~page14, right? unless we are immediatly
> split_folio to basepages in zap_pte_range(), we will have problems?
>
>> +
>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>
> Thanks
> Barry
>

2023-11-27 09:36:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 08:42, Barry Song wrote:
>>> + for (i = 0; i < nr; i++, page++) {
>>> + if (anon) {
>>> + /*
>>> + * If this page may have been pinned by the
>>> + * parent process, copy the page immediately for
>>> + * the child so that we'll always guarantee the
>>> + * pinned page won't be randomly replaced in the
>>> + * future.
>>> + */
>>> + if (unlikely(page_try_dup_anon_rmap(
>>> + page, false, src_vma))) {
>>> + if (i != 0)
>>> + break;
>>> + /* Page may be pinned, we have to copy. */
>>> + return copy_present_page(
>>> + dst_vma, src_vma, dst_pte,
>>> + src_pte, addr, rss, prealloc,
>>> + page);
>>> + }
>>> + rss[MM_ANONPAGES]++;
>>> + VM_BUG_ON(PageAnonExclusive(page));
>>> + } else {
>>> + page_dup_file_rmap(page, false);
>>> + rss[mm_counter_file(page)]++;
>>> + }
>>> }
>>> - rss[MM_ANONPAGES]++;
>>> - } else if (page) {
>>> - folio_get(folio);
>>> - page_dup_file_rmap(page, false);
>>> - rss[mm_counter_file(page)]++;
>>> +
>>> + nr = i;
>>> + folio_ref_add(folio, nr);
>>
>> You're changing the order of mapcount vs. refcount increment. Don't.
>> Make sure your refcount >= mapcount.
>>
>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>> then decrementing in case of error accordingly. Errors due to pinned
>> pages are the corner case.
>>
>> I'll note that it will make a lot of sense to have batch variants of
>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>
>
> i still don't understand why it is not a entire map+1, but an increment
> in each basepage.

Because we are PTE-mapping the folio, we have to account each individual page.
If we accounted the entire folio, where would we unaccount it? Each page can be
unmapped individually (e.g. munmap() part of the folio) so we need to account each
page. When PMD mapping, the whole thing is either mapped or unmapped, and it's
atomic, so we can account the entire thing.

>
> as long as it is a CONTPTE large folio, there is no much difference with
> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> split.
>
> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> similar things on a part of the large folio in process A,
>
> this large folio will have partially mapped subpage in A (all CONTPE bits
> in all subpages need to be removed though we only unmap a part of the
> large folioas HW requires consistent CONTPTEs); and it has entire map in
> process B(all PTEs are still CONPTES in process B).
>
> isn't it more sensible for this large folios to have entire_map = 0(for
> process B), and subpages which are still mapped in process A has map_count
> =0? (start from -1).
>
>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>> check once if the folio maybe pinned, and in that case, you can simply
>> drop all references again. So you either have all or no ptes to process,
>> which makes that code easier.

I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
fundamentally you can only use entire_mapcount if it's only possible to map and
unmap the whole folio atomically.

>>
>> But that can be added on top, and I'll happily do that.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>
> Thanks
> Barry
>

2023-11-27 10:00:54

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 08:42, Barry Song wrote:
> >>> + for (i = 0; i < nr; i++, page++) {
> >>> + if (anon) {
> >>> + /*
> >>> + * If this page may have been pinned by the
> >>> + * parent process, copy the page immediately for
> >>> + * the child so that we'll always guarantee the
> >>> + * pinned page won't be randomly replaced in the
> >>> + * future.
> >>> + */
> >>> + if (unlikely(page_try_dup_anon_rmap(
> >>> + page, false, src_vma))) {
> >>> + if (i != 0)
> >>> + break;
> >>> + /* Page may be pinned, we have to copy. */
> >>> + return copy_present_page(
> >>> + dst_vma, src_vma, dst_pte,
> >>> + src_pte, addr, rss, prealloc,
> >>> + page);
> >>> + }
> >>> + rss[MM_ANONPAGES]++;
> >>> + VM_BUG_ON(PageAnonExclusive(page));
> >>> + } else {
> >>> + page_dup_file_rmap(page, false);
> >>> + rss[mm_counter_file(page)]++;
> >>> + }
> >>> }
> >>> - rss[MM_ANONPAGES]++;
> >>> - } else if (page) {
> >>> - folio_get(folio);
> >>> - page_dup_file_rmap(page, false);
> >>> - rss[mm_counter_file(page)]++;
> >>> +
> >>> + nr = i;
> >>> + folio_ref_add(folio, nr);
> >>
> >> You're changing the order of mapcount vs. refcount increment. Don't.
> >> Make sure your refcount >= mapcount.
> >>
> >> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >> then decrementing in case of error accordingly. Errors due to pinned
> >> pages are the corner case.
> >>
> >> I'll note that it will make a lot of sense to have batch variants of
> >> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>
> >
> > i still don't understand why it is not a entire map+1, but an increment
> > in each basepage.
>
> Because we are PTE-mapping the folio, we have to account each individual page.
> If we accounted the entire folio, where would we unaccount it? Each page can be
> unmapped individually (e.g. munmap() part of the folio) so need to account each
> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> atomic, so we can account the entire thing.

Hi Ryan,

There is no problem. For example, a large folio is entirely mapped in
process A with CONTPTE,
and only page2 is mapped in process B.
Then we will have

entire_map = 0
page0.map = -1
page1.map = -1
page2.map = 0
page3.map = -1
....

>
> >
> > as long as it is a CONTPTE large folio, there is no much difference with
> > PMD-mapped large folio. it has all the chance to be DoubleMap and need
> > split.
> >
> > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> > similar things on a part of the large folio in process A,
> >
> > this large folio will have partially mapped subpage in A (all CONTPE bits
> > in all subpages need to be removed though we only unmap a part of the
> > large folioas HW requires consistent CONTPTEs); and it has entire map in
> > process B(all PTEs are still CONPTES in process B).
> >
> > isn't it more sensible for this large folios to have entire_map = 0(for
> > process B), and subpages which are still mapped in process A has map_count
> > =0? (start from -1).
> >
> >> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >> check once if the folio maybe pinned, and in that case, you can simply
> >> drop all references again. So you either have all or no ptes to process,
> >> which makes that code easier.
>
> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> fundamentally you can only use entire_mapcount if its only possible to map and
> unmap the whole folio atomically.



My point is that CONTPTEs should either all be set in all 16 PTEs or all be
dropped in the 16 PTEs. If all PTEs have CONT, it is entirely mapped;
otherwise, it is partially mapped. If a large folio is mapped in one process
with all CONTPTEs and meanwhile in another process with a partial mapping
(w/o CONTPTE), it is DoubleMapped.

Since we always hold the ptl to set or drop CONTPTE bits, set/drop is still
atomic within the spinlock area.

>
> >>
> >> But that can be added on top, and I'll happily do that.
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >

Thanks
Barry

2023-11-27 10:11:24

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 09:59, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 08:42, Barry Song wrote:
>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>> + if (anon) {
>>>>> + /*
>>>>> + * If this page may have been pinned by the
>>>>> + * parent process, copy the page immediately for
>>>>> + * the child so that we'll always guarantee the
>>>>> + * pinned page won't be randomly replaced in the
>>>>> + * future.
>>>>> + */
>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>> + page, false, src_vma))) {
>>>>> + if (i != 0)
>>>>> + break;
>>>>> + /* Page may be pinned, we have to copy. */
>>>>> + return copy_present_page(
>>>>> + dst_vma, src_vma, dst_pte,
>>>>> + src_pte, addr, rss, prealloc,
>>>>> + page);
>>>>> + }
>>>>> + rss[MM_ANONPAGES]++;
>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>> + } else {
>>>>> + page_dup_file_rmap(page, false);
>>>>> + rss[mm_counter_file(page)]++;
>>>>> + }
>>>>> }
>>>>> - rss[MM_ANONPAGES]++;
>>>>> - } else if (page) {
>>>>> - folio_get(folio);
>>>>> - page_dup_file_rmap(page, false);
>>>>> - rss[mm_counter_file(page)]++;
>>>>> +
>>>>> + nr = i;
>>>>> + folio_ref_add(folio, nr);
>>>>
>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>> Make sure your refcount >= mapcount.
>>>>
>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>> pages are the corner case.
>>>>
>>>> I'll note that it will make a lot of sense to have batch variants of
>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>
>>>
>>> i still don't understand why it is not a entire map+1, but an increment
>>> in each basepage.
>>
>> Because we are PTE-mapping the folio, we have to account each individual page.
>> If we accounted the entire folio, where would we unaccount it? Each page can be
>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>> atomic, so we can account the entire thing.
>
> Hi Ryan,
>
> There is no problem. for example, a large folio is entirely mapped in
> process A with CONPTE,
> and only page2 is mapped in process B.
> then we will have
>
> entire_map = 0
> page0.map = -1
> page1.map = -1
> page2.map = 0
> page3.map = -1
> ....
>
>>
>>>
>>> as long as it is a CONTPTE large folio, there is no much difference with
>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>> split.
>>>
>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>> similar things on a part of the large folio in process A,
>>>
>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>> in all subpages need to be removed though we only unmap a part of the
>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>> process B(all PTEs are still CONPTES in process B).
>>>
>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>> process B), and subpages which are still mapped in process A has map_count
>>> =0? (start from -1).
>>>
>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>> drop all references again. So you either have all or no ptes to process,
>>>> which makes that code easier.
>>
>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>> fundamentally you can only use entire_mapcount if its only possible to map and
>> unmap the whole folio atomically.
>
>
>
> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> it is partially
> mapped. if a large folio is mapped in one processes with all CONTPTEs
> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> DoubleMapped.

There are 2 problems with your proposal, as I see it:

1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
concerned, it's just mapping a bunch of PTEs. So it has no hook to inc/dec
entire_mapcount. The arch code is opportunistically and *transparently* managing
the CONT_PTE bit.

2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
be mapped with 32 contpte blocks. So you can't say it is entirely mapped
unless/until ALL of those blocks are set up. And then of course each block could
be unmapped non-atomically.

For the PMD case there are actually 2 properties that allow using the
entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
and we know that the folio is exactly PMD sized (since it must be at least PMD
sized to be able to map it with the PMD, and we don't allocate THPs any bigger
than PMD size). So one PMD map or unmap operation corresponds to exactly one
*entire* map or unmap. That is not true when we are PTE mapping.

>
> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> still atomic in a
> spinlock area.
>
>>
>>>>
>>>> But that can be added on top, and I'll happily do that.
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>
>
> Thanks
> Barry

2023-11-27 10:29:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 09:59, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 08:42, Barry Song wrote:
> >>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>> + if (anon) {
> >>>>> + /*
> >>>>> + * If this page may have been pinned by the
> >>>>> + * parent process, copy the page immediately for
> >>>>> + * the child so that we'll always guarantee the
> >>>>> + * pinned page won't be randomly replaced in the
> >>>>> + * future.
> >>>>> + */
> >>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>> + page, false, src_vma))) {
> >>>>> + if (i != 0)
> >>>>> + break;
> >>>>> + /* Page may be pinned, we have to copy. */
> >>>>> + return copy_present_page(
> >>>>> + dst_vma, src_vma, dst_pte,
> >>>>> + src_pte, addr, rss, prealloc,
> >>>>> + page);
> >>>>> + }
> >>>>> + rss[MM_ANONPAGES]++;
> >>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>> + } else {
> >>>>> + page_dup_file_rmap(page, false);
> >>>>> + rss[mm_counter_file(page)]++;
> >>>>> + }
> >>>>> }
> >>>>> - rss[MM_ANONPAGES]++;
> >>>>> - } else if (page) {
> >>>>> - folio_get(folio);
> >>>>> - page_dup_file_rmap(page, false);
> >>>>> - rss[mm_counter_file(page)]++;
> >>>>> +
> >>>>> + nr = i;
> >>>>> + folio_ref_add(folio, nr);
> >>>>
> >>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>> Make sure your refcount >= mapcount.
> >>>>
> >>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>> pages are the corner case.
> >>>>
> >>>> I'll note that it will make a lot of sense to have batch variants of
> >>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>
> >>>
> >>> i still don't understand why it is not a entire map+1, but an increment
> >>> in each basepage.
> >>
> >> Because we are PTE-mapping the folio, we have to account each individual page.
> >> If we accounted the entire folio, where would we unaccount it? Each page can be
> >> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >> atomic, so we can account the entire thing.
> >
> > Hi Ryan,
> >
> > There is no problem. for example, a large folio is entirely mapped in
> > process A with CONPTE,
> > and only page2 is mapped in process B.
> > then we will have
> >
> > entire_map = 0
> > page0.map = -1
> > page1.map = -1
> > page2.map = 0
> > page3.map = -1
> > ....
> >
> >>
> >>>
> >>> as long as it is a CONTPTE large folio, there is no much difference with
> >>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>> split.
> >>>
> >>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>> similar things on a part of the large folio in process A,
> >>>
> >>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>> in all subpages need to be removed though we only unmap a part of the
> >>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>> process B(all PTEs are still CONPTES in process B).
> >>>
> >>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>> process B), and subpages which are still mapped in process A has map_count
> >>> =0? (start from -1).
> >>>
> >>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>> drop all references again. So you either have all or no ptes to process,
> >>>> which makes that code easier.
> >>
> >> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >> fundamentally you can only use entire_mapcount if its only possible to map and
> >> unmap the whole folio atomically.
> >
> >
> >
> > My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> > in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> > it is partially
> > mapped. if a large folio is mapped in one processes with all CONTPTEs
> > and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> > DoubleMapped.
>
> There are 2 problems with your proposal, as I see it;
>
> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> entire_mapcount. The arch code is opportunistically and *transparently* managing
> the CONT_PTE bit.
>
> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> unless/until ALL of those blocks are set up. And then of course each block could
> be unmapped unatomically.
>
> For the PMD case there are actually 2 properties that allow using the
> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> and we know that the folio is exactly PMD sized (since it must be at least PMD
> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> *entire* map or unmap. That is not true when we are PTE mapping.

well. Thanks for clarification. based on the above description, i agree the
current code might make more sense by always using mapcount in subpage.

I gave my proposals as I thought we were always CONTPTE size for small-THP
then we could drop the loop to iterate 16 times rmap. if we do it
entirely, we only
need to do dup rmap once for all 16 PTEs by increasing entire_map.

BTW, I have concerns about whether a variable small-THP size will really work,
as userspace is probably friendly to only one fixed size. For example, userspace
heap management might be tuned to one particular size for freeing memory back to
the kernel; it is very difficult for the heap to adapt to several sizes at the
same time. Frequent unmaps/frees whose size differs from, and in particular is
smaller than, the small-THP size will defeat all efforts to use small-THP.

>
> >
> > Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> > still atomic in a
> > spinlock area.
> >
> >>
> >>>>
> >>>> But that can be added on top, and I'll happily do that.
> >>>>
> >>>> --
> >>>> Cheers,
> >>>>
> >>>> David / dhildenb
> >>>
> >

Thanks
Barry

2023-11-27 10:36:11

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >> mm: Batch-copy PTE ranges during fork()
> >> arm64/mm: set_pte(): New layer to manage contig bit
> >> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >> arm64/mm: pte_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >> arm64/mm: ptep_get(): New layer to manage contig bit
> >> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >> arm64/mm: Wire up PTE_CONT for user mappings
> >> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means it's safe to only track access/dirty for the contpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.

I don't think it is safe to keep the CONTPTE block folded in a split_vma case,
as otherwise copy_ptes in your other patch might only copy part of the CONTPTEs.
For example, if the folio is split into page0-page4 and page5-page15 by
split_vma, then in fork, while copying the ptes for the first VMA, we only copy
page0-page4; this immediately leaves the CONTPTE block inconsistent, since we
have to make sure all CONTPTEs are mapped atomically under the PTL.

>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >

Thanks
Barry

2023-11-27 11:08:07

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 10:28, Barry Song wrote:
> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 09:59, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>> + if (anon) {
>>>>>>> + /*
>>>>>>> + * If this page may have been pinned by the
>>>>>>> + * parent process, copy the page immediately for
>>>>>>> + * the child so that we'll always guarantee the
>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>> + * future.
>>>>>>> + */
>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>> + page, false, src_vma))) {
>>>>>>> + if (i != 0)
>>>>>>> + break;
>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>> + return copy_present_page(
>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>> + page);
>>>>>>> + }
>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> + } else {
>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>> + }
>>>>>>> }
>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>> - } else if (page) {
>>>>>>> - folio_get(folio);
>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> + nr = i;
>>>>>>> + folio_ref_add(folio, nr);
>>>>>>
>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>> Make sure your refcount >= mapcount.
>>>>>>
>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>> pages are the corner case.
>>>>>>
>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>
>>>>>
>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>> in each basepage.
>>>>
>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>> atomic, so we can account the entire thing.
>>>
>>> Hi Ryan,
>>>
>>> There is no problem. for example, a large folio is entirely mapped in
>>> process A with CONPTE,
>>> and only page2 is mapped in process B.
>>> then we will have
>>>
>>> entire_map = 0
>>> page0.map = -1
>>> page1.map = -1
>>> page2.map = 0
>>> page3.map = -1
>>> ....
>>>
>>>>
>>>>>
>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>> split.
>>>>>
>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>> similar things on a part of the large folio in process A,
>>>>>
>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>
>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>> =0? (start from -1).
>>>>>
>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>> which makes that code easier.
>>>>
>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>> unmap the whole folio atomically.
>>>
>>>
>>>
>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>> it is partially
>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>> DoubleMapped.
>>
>> There are 2 problems with your proposal, as I see it;
>>
>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>> the CONT_PTE bit.
>>
>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>> unless/until ALL of those blocks are set up. And then of course each block could
>> be unmapped unatomically.
>>
>> For the PMD case there are actually 2 properties that allow using the
>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>> *entire* map or unmap. That is not true when we are PTE mapping.
>
> well. Thanks for clarification. based on the above description, i agree the
> current code might make more sense by always using mapcount in subpage.
>
> I gave my proposals as I thought we were always CONTPTE size for small-THP
> then we could drop the loop to iterate 16 times rmap. if we do it
> entirely, we only
> need to do dup rmap once for all 16 PTEs by increasing entire_map.

Well, it's always good to have the discussion - so thanks for the ideas. I think
there is a bigger question lurking here: should we be exposing the concept of
contpte mappings to the core-mm rather than burying it in the arm64 arch code?
I'm confident that would be a huge amount of effort and the end result would be
similar performance to what this approach gives. One potential benefit of letting
core-mm control it is that it would also give control to core-mm over the
granularity of access/dirty reporting (my approach implicitly ties it to the
folio). Having sub-folio access tracking _could_ potentially help with future
work to make THP size selection automatic, but we are not there yet, and I think
there are other (simpler) ways to achieve the same thing. So my view is that
_not_ exposing it to core-mm is the right way for now.

>
> BTW, I have concerns that a variable small-THP size will really work
> as userspace
> is probably friendly to only one fixed size. for example, userspace
> heap management
> might be optimized to a size for freeing memory to the kernel. it is
> very difficult
> for the heap to adapt to various sizes at the same time. frequent unmap/free
> size not equal with, and particularly smaller than small-THP size will
> defeat all
> efforts to use small-THP.

I'll admit to not knowing a huge amount about user space allocators. But I will
say that, as currently defined, the small-sized THP interface to user space
allows a sysadmin to specifically enable the set of sizes that they want; so a
single size can be enabled. I'm deliberately punting that decision away from the
kernel for now.

FWIW, my experience with the Speedometer/JavaScript use case is that performance
is a little bit better when enabling 64+32+16K vs just 64K THP.

Functionally, it will not matter if the allocator is not enlightened for the THP
size; it can continue to free, and if a partial folio is unmapped it is put on
the deferred split list, then under memory pressure it is split and the unused
pages are reclaimed. I guess this is the bit you are concerned about having a
performance impact?

Regardless, it would be good to move this conversation to the small-sized THP
patch series since this is all independent of contpte mappings.

>
>>
>>>
>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>> still atomic in a
>>> spinlock area.
>>>
>>>>
>>>>>>
>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>>
>>>>>> David / dhildenb
>>>>>
>>>
>
> Thanks
> Barry

2023-11-27 11:11:56

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 27/11/2023 10:35, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>> mm: Batch-copy PTE ranges during fork()
>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means it's safe to only track access/dirty for the contpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>
> I don't think it is safe to keep CONTPTE folded in a split_vma case. as
> otherwise, copy_ptes in your other patch might only copy a part
> of CONTPES.
> For example, if page0-page4 and page5-page15 are splitted in split_vma,
> in fork, while copying pte for the first VMA, we are copying page0-page4,
> this will immediately cause inconsistent CONTPTE. as we have to
> make sure all CONTPTEs are atomically mapped in a PTL.

No, that's not how it works. The CONT_PTE bit is not blindly copied from parent
to child; it is explicitly managed by the arch code and set when appropriate. In
the case above, we will end up calling set_ptes() for page0-page4 in the child.
set_ptes() will notice that there are only 5 contiguous pages, so it will map
them without the CONT_PTE bit.
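
To make that concrete, the kind of check involved is roughly as follows (a
simplified sketch, not the actual code from the series; the helper name is made
up, and the pfn-contiguity and same-folio checks across the block are omitted):

	/*
	 * Sketch: a run of ptes written by set_ptes() is only a candidate for
	 * folding if it covers exactly one naturally aligned block of
	 * CONT_PTES entries, the first pfn is suitably aligned, and the
	 * entries are valid.
	 */
	static bool contpte_block_is_foldable(unsigned long addr, pte_t pte,
					      unsigned int nr)
	{
		if (nr != CONT_PTES)
			return false;
		if (!IS_ALIGNED(addr, CONT_PTES * PAGE_SIZE))
			return false;
		if (!IS_ALIGNED(pte_pfn(pte), CONT_PTES))
			return false;

		return pte_valid(pte);
	}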

>
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>
> Thanks
> Barry

2023-11-27 20:35:36

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 10:28, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 09:59, Barry Song wrote:
> >>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>> + if (anon) {
> >>>>>>> + /*
> >>>>>>> + * If this page may have been pinned by the
> >>>>>>> + * parent process, copy the page immediately for
> >>>>>>> + * the child so that we'll always guarantee the
> >>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>> + * future.
> >>>>>>> + */
> >>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>> + page, false, src_vma))) {
> >>>>>>> + if (i != 0)
> >>>>>>> + break;
> >>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>> + return copy_present_page(
> >>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>> + page);
> >>>>>>> + }
> >>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>> + } else {
> >>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>> + }
> >>>>>>> }
> >>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>> - } else if (page) {
> >>>>>>> - folio_get(folio);
> >>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>> +
> >>>>>>> + nr = i;
> >>>>>>> + folio_ref_add(folio, nr);
> >>>>>>
> >>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>> Make sure your refcount >= mapcount.
> >>>>>>
> >>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>> pages are the corner case.
> >>>>>>
> >>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>
> >>>>>
> >>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>> in each basepage.
> >>>>
> >>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>> atomic, so we can account the entire thing.
> >>>
> >>> Hi Ryan,
> >>>
> >>> There is no problem. for example, a large folio is entirely mapped in
> >>> process A with CONPTE,
> >>> and only page2 is mapped in process B.
> >>> then we will have
> >>>
> >>> entire_map = 0
> >>> page0.map = -1
> >>> page1.map = -1
> >>> page2.map = 0
> >>> page3.map = -1
> >>> ....
> >>>
> >>>>
> >>>>>
> >>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>> split.
> >>>>>
> >>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>> similar things on a part of the large folio in process A,
> >>>>>
> >>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>
> >>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>> =0? (start from -1).
> >>>>>
> >>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>> which makes that code easier.
> >>>>
> >>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>> unmap the whole folio atomically.
> >>>
> >>>
> >>>
> >>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>> it is partially
> >>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>> DoubleMapped.
> >>
> >> There are 2 problems with your proposal, as I see it;
> >>
> >> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >> the CONT_PTE bit.
> >>
> >> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >> unless/until ALL of those blocks are set up. And then of course each block could
> >> be unmapped unatomically.
> >>
> >> For the PMD case there are actually 2 properties that allow using the
> >> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >> *entire* map or unmap. That is not true when we are PTE mapping.
> >
> > well. Thanks for clarification. based on the above description, i agree the
> > current code might make more sense by always using mapcount in subpage.
> >
> > I gave my proposals as I thought we were always CONTPTE size for small-THP
> > then we could drop the loop to iterate 16 times rmap. if we do it
> > entirely, we only
> > need to do dup rmap once for all 16 PTEs by increasing entire_map.
>
> Well its always good to have the discussion - so thanks for the ideas. I think
> there is a bigger question lurking here; should we be exposing the concept of
> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> I'm confident that would be a huge amount of effort and the end result would be
> similar performace to what this approach gives. One potential benefit of letting
> core-mm control it is that it would also give control to core-mm over the
> granularity of access/dirty reporting (my approach implicitly ties it to the
> folio). Having sub-folio access tracking _could_ potentially help with future
> work to make THP size selection automatic, but we are not there yet, and I think
> there are other (simpler) ways to achieve the same thing. So my view is that
> _not_ exposing it to core-mm is the right way for now.

Hi Ryan,

We (OPPO) started a similar project to yours even before folios were merged into
mainline. We have deployed "dynamic hugepage" (that is how we name it) on
millions of mobile phones, on real products and on kernels before 5.16, with
huge success in performance improvement. For example, you may find the
out-of-tree 5.15 source code here:

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

Our modification might not be so clean and has lots of workarounds added just
for the stability of the products.

We mainly have

1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c

some CONTPTE helpers

2. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h

some Dynamic Hugepage APIs

3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c

modified all page faults to support
(1). allocation of hugepage of 64KB in do_anon_page
(2). CoW hugepage in do_wp_page
(3). copy CONTPTEs in copy_pte_range
(4). allocate and swap-in Hugepage as a whole in do_swap_page

4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c

reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.

So we are 100% interested in your patchset and hope it can find a way to land in
mainline, removing the cost of maintaining out-of-tree code from one kernel
version to the next, which is what we have done for a couple of kernel versions
before 5.16. We are firmly 100% supportive of the large anon folios work you are
leading.

A big pain point was that we found lots of races, especially around CONTPTE
unfolding, and especially when a subset of basepages ran away from the group of
16 CONTPTEs, since userspace always works on basepages and has no idea of
small-THP. We ran our code on millions of real phones, and we have now got those
issues fixed (or maybe "can't reproduce"); there is no outstanding issue.

Particularly for the rmap issue we are discussing, our out-of-tree code uses
entire_map for CONTPTE in the way I described to you. But I guess we can learn
from you and decouple CONTPTE from core-mm.

We are doing this in mm/memory.c

copy_present_cont_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
		pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
		struct page **prealloc)
{
	struct mm_struct *src_mm = src_vma->vm_mm;
	unsigned long vm_flags = src_vma->vm_flags;
	pte_t pte = *src_pte;
	struct page *page;

	page = vm_normal_page(src_vma, addr, pte);
	...

	get_page(page);
	page_dup_rmap(page, true); /* an entire dup_rmap, as you can see */
	rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
}

and we have a split in mm/cont_pte_hugepage.c to handle partial unmap:

static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
		unsigned long haddr, bool freeze)
{
	...
	if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
		for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
			atomic_inc(&head[i]._mapcount);
		atomic_long_inc(&cont_pte_double_map_count);
	}

	if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
		...
	}

I am not selling our solution any more, just showing you some of the differences
we have :-)

>
> >
> > BTW, I have concerns that a variable small-THP size will really work
> > as userspace
> > is probably friendly to only one fixed size. for example, userspace
> > heap management
> > might be optimized to a size for freeing memory to the kernel. it is
> > very difficult
> > for the heap to adapt to various sizes at the same time. frequent unmap/free
> > size not equal with, and particularly smaller than small-THP size will
> > defeat all
> > efforts to use small-THP.
>
> I'll admit to not knowing a huge amount about user space allocators. But I will
> say that as currently defined, the small-sized THP interface to user space
> allows a sysadmin to specifically enable the set of sizes that they want; so a
> single size can be enabled. I'm diliberately punting that decision away from the
> kernel for now.

Basically, a userspace heap library has a PAGESIZE setting and allows users to
allocate/free all kinds of small objects such as 16, 32, 64, 128, 256, 512 bytes
etc. The default size is of course equal to the basepage size. Once some objects
are freed by free() and libc gets back a free "page", the heap library might
release that PAGESIZE page to the kernel via something like MADV_DONTNEED, which
then goes through zap_pte_range(). It is quite similar to a kernel slab.

So imagine we have small-THP now, but the userspace libraries have *NO* idea of
it at all; this can frequently cause unfolding.
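
For example, something like the following (a user-space sketch only; the sizes
and the use of aligned_alloc() are arbitrary) is effectively what the heap ends
up doing, and that single 4KB MADV_DONTNEED lands in the middle of what could be
a folded 64KB block:

	#define _DEFAULT_SOURCE
	#include <stdlib.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 64 * 1024;			/* one small-THP sized region */
		char *buf = aligned_alloc(len, len);	/* 64KB, 64KB aligned */

		if (!buf)
			return 1;

		/* touch everything so the kernel can back it with one 64KB folio */
		for (size_t i = 0; i < len; i += 4096)
			buf[i] = 1;

		/*
		 * the heap frees one 4KB "page" back to the kernel; zapping a
		 * single pte in the middle of the block forces the arch code
		 * to unfold the whole contpte block first
		 */
		madvise(buf + 4 * 4096, 4096, MADV_DONTNEED);

		free(buf);
		return 0;
	}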

>
> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> is a little bit better when enabling 64+32+16K vs just 64K THP.
>
> Functionally, it will not matter if the allocator is not enlightened for the THP
> size; it can continue to free, and if a partial folio is unmapped it is put on
> the deferred split list, then under memory pressure it is split and the unused
> pages are reclaimed. I guess this is the bit you are concerned about having a
> performance impact?

Right. If this happens on the majority of small-THP folios, we get no
performance improvement, and probably a regression instead. This is really true
on real workloads!!

So that is why we would really love a per-VMA hint to enable small-THP, but
obviously you have already supported that now with
mm: thp: Introduce per-size thp sysfs interface
https://lore.kernel.org/linux-mm/[email protected]/

We can use MADVISE rather than ALWAYS and set a fixed size like 64KB, so
userspace can set the VMA flag when it is quite sure this VMA is working with
64KB alignment?
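
i.e. something like this from the allocator side (a sketch only; I'm assuming
the per-size knob from your series, e.g.
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled, is set to "madvise"
- the exact path may differ in the final version):

	#include <sys/mman.h>

	/* sketch: opt an anonymous arena in for small-THP under the "madvise" policy */
	static void *alloc_small_thp_arena(size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p != MAP_FAILED)
			madvise(p, len, MADV_HUGEPAGE);

		return p;
	}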

>
> Regardless, it would be good to move this conversation to the small-sized THP
> patch series since this is all independent of contpte mappings.
>
> >
> >>
> >>>
> >>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> >>> still atomic in a
> >>> spinlock area.
> >>>
> >>>>
> >>>>>>
> >>>>>> But that can be added on top, and I'll happily do that.
> >>>>>>
> >>>>>> --
> >>>>>> Cheers,
> >>>>>>
> >>>>>> David / dhildenb
> >>>>>
> >>>
> >

Thanks
Barry

2023-11-27 22:54:19

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Tue, Nov 28, 2023 at 12:11 AM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 10:35, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 03:18, Barry Song wrote:
> >>>> Ryan Roberts (14):
> >>>> mm: Batch-copy PTE ranges during fork()
> >>>> arm64/mm: set_pte(): New layer to manage contig bit
> >>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>>> arm64/mm: pte_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get(): New layer to manage contig bit
> >>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>>> arm64/mm: Wire up PTE_CONT for user mappings
> >>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >>>
> >>> Hi Ryan,
> >>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> >>> in the below cases
> >>
> >> The general idea is that the core-mm sets the individual ptes (one at a time if
> >> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> >> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> >> (ptep_clear(), etc); This is exactly the same interface as previously.
> >>
> >> BUT, the arm64 implementation of those interfaces will now detect when a set of
> >> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> >> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> >> the block is "folded". And it will detect when the first PTE in the block
> >> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> >> requirements for folding a contpte block is that all the pages must belong to
> >> the *same* folio (that means it's safe to only track access/dirty for the contpte
> >> block as a whole rather than for each individual pte).
> >>
> >> (there are a couple of optimizations that make the reality slightly more
> >> complicated than what I've just explained, but you get the idea).
> >>
> >> On that basis, I believe all the specific cases you describe below are all
> >> covered and safe - please let me know if you think there is a hole here!
> >>
> >>>
> >>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
> >>
> >> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> >> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> >> is removed from the whole contpte block. If there is then a subsequent
> >> set_pte_at() to set a swap entry, the implementation will see that its not
> >> appropriate to re-fold, so the range will remain unfolded.
> >>
> >>>
> >>> 2. vma split in a large folio due to various reasons such as mprotect,
> >>> munmap, mlock etc.
> >>
> >> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> >> suspect not, so if the VMA is split in the middle of a currently folded contpte
> >> block, it will remain folded. But this is safe and continues to work correctly.
> >> The VMA arrangement is not important; it is just important that a single folio
> >> is mapped contiguously across the whole block.
> >
> > I don't think it is safe to keep CONTPTE folded in a split_vma case. as
> > otherwise, copy_ptes in your other patch might only copy a part
> > of CONTPES.
> > For example, if page0-page4 and page5-page15 are splitted in split_vma,
> > in fork, while copying pte for the first VMA, we are copying page0-page4,
> > this will immediately cause inconsistent CONTPTE. as we have to
> > make sure all CONTPTEs are atomically mapped in a PTL.
>
> No that's not how it works. The CONT_PTE bit is not blindly copied from parent
> to child. It is explicitly managed by the arch code and set when appropriate. In
> the case above, we will end up calling set_ptes() for page0-page4 in the child.
> set_ptes() will notice that there are only 5 contiguous pages so it will map
> without the CONT_PTE bit.

OK, cool. Alternatively, in the code I shared with you, we do an unfold
immediately when split_vma happens within a large anon folio, so we disallow a
CONTPTE block from crossing two VMAs to avoid all kinds of complexity afterwards.

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/huge_memory.c

#ifdef CONFIG_CONT_PTE_HUGEPAGE
void vma_adjust_cont_pte_trans_huge(struct vm_area_struct *vma,
				    unsigned long start,
				    unsigned long end,
				    long adjust_next)
{
	/*
	 * If the new start address isn't hpage aligned and it could
	 * previously contain an hugepage: check if we need to split
	 * an huge pmd.
	 */
	if (start & ~HPAGE_CONT_PTE_MASK &&
	    (start & HPAGE_CONT_PTE_MASK) >= vma->vm_start &&
	    (start & HPAGE_CONT_PTE_MASK) + HPAGE_CONT_PTE_SIZE <= vma->vm_end)
		split_huge_cont_pte_address(vma, start, false, NULL);

	....
}
#endif

In your approach you are still keeping a CONTPTE block that crosses two VMAs,
but it seems OK. I can't think of a case that might fail right now; only running
the code on a large amount of real hardware will tell :-)

>
> >
> >>
> >>>
> >>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> >>> rather than being as a whole.
> >>
> >> Yes, as per 1; the arm64 implementation will notice when the first entry is
> >> cleared and unfold the contpte block.
> >>
> >>>
> >>> In hardware, we need to make sure CONTPTE follow the rule - always 16
> >>> contiguous physical address with CONTPTE set. if one of them run away
> >>> from the 16 ptes group and PTEs become unconsistent, some terrible
> >>> errors/faults can happen in HW. for example
> >>
> >> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> >> contpte_try_unfold(). the fold/unfold operation is only done when all
> >> requirements are met, and we perform it in a manner that is conformant to the
> >> architecture requirements (see contpte_fold() - being renamed to
> >> contpte_convert() in the next version).
> >>
> >> Thanks for the review!
> >>
> >> Thanks,
> >> Ryan
> >>
> >>>
> >>> case0:
> >>> addr0 PTE - has no CONTPE
> >>> addr0+4kb PTE - has CONTPTE
> >>> ....
> >>> addr0+60kb PTE - has CONTPTE
> >>>
> >>> case 1:
> >>> addr0 PTE - has no CONTPE
> >>> addr0+4kb PTE - has CONTPTE
> >>> ....
> >>> addr0+60kb PTE - has swap
> >>>
> >>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
> >>> our observation.
> >>>
> >

Thanks
Barry

2023-11-28 00:11:51

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 05:54, Barry Song wrote:
> >> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >> + pte_t *dst_pte, pte_t *src_pte,
> >> + unsigned long addr, unsigned long end,
> >> + int *rss, struct folio **prealloc)
> >> {
> >> struct mm_struct *src_mm = src_vma->vm_mm;
> >> unsigned long vm_flags = src_vma->vm_flags;
> >> pte_t pte = ptep_get(src_pte);
> >> struct page *page;
> >> struct folio *folio;
> >> + int nr = 1;
> >> + bool anon;
> >> + bool any_dirty = pte_dirty(pte);
> >> + int i;
> >>
> >> page = vm_normal_page(src_vma, addr, pte);
> >> - if (page)
> >> + if (page) {
> >> folio = page_folio(page);
> >> - if (page && folio_test_anon(folio)) {
> >> - /*
> >> - * If this page may have been pinned by the parent process,
> >> - * copy the page immediately for the child so that we'll always
> >> - * guarantee the pinned page won't be randomly replaced in the
> >> - * future.
> >> - */
> >> - folio_get(folio);
> >> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >> - /* Page may be pinned, we have to copy. */
> >> - folio_put(folio);
> >> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >> - addr, rss, prealloc, page);
> >> + anon = folio_test_anon(folio);
> >> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >> + end, pte, &any_dirty);
> >
> > in case we have a large folio with 16 CONTPTE basepages, and userspace
> > do madvise(addr + 4KB * 5, DONTNEED);
>
> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> page0~page4 and page6~15?
>
> >
> > thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> > will return 15. in this case, we should copy page0~page3 and page5~page15.
>
> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> not how its intended to work. The function is scanning forwards from the current
> pte until it finds the first pte that does not fit in the batch - either because
> it maps a PFN that is not contiguous, or because the permissions are different
> (although this is being relaxed a bit; see conversation with DavidH against this
> same patch).
>
> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> (page0~page4) then the next time through the loop we will go through the
> !present path and process the single swap marker. Then the 3rd time through the
> loop folio_nr_pages_cont_mapped() will return 10.

One case we have met by running hundreds of real phones is as below:


static int
copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
	       unsigned long end)
{
	...
	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
	if (!dst_pte) {
		ret = -ENOMEM;
		goto out;
	}
	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
	if (!src_pte) {
		pte_unmap_unlock(dst_pte, dst_ptl);
		/* ret == 0 */
		goto out;
	}
	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
	orig_src_pte = src_pte;
	orig_dst_pte = dst_pte;
	arch_enter_lazy_mmu_mode();

	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}
		ptent = ptep_get(src_pte);
		if (pte_none(ptent)) {
			progress++;
			continue;
		}

The above iteration can break out when progress >= 32. For example, at the
beginning, if all PTEs are none, we break once progress >= 32, and we might be
at the 8th pte of the 16 PTEs which could become CONTPTE after we release the
PTL.

Since we are releasing the PTLs, the next time we take the PTL those pte_none()
entries might have become pte_cont(). Are you then going to copy CONTPTEs
starting from the 8th pte, immediately breaking the hardware rule that the 16
CONTPTEs must be consistent?

pte0 - pte_none
pte1 - pte_none
...
pte7 - pte_none

pte8 - pte_cont
...
pte15 - pte_cont

So we made a modification to avoid breaking in the middle of PTEs which can
potentially become CONTPTE:

	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
			/*
			 * XXX: don't release ptl at an unaligned address as
			 * cont_pte might form while ptl is released; this
			 * causes double-map
			 */
			if (!vma_is_chp_anonymous(src_vma) ||
			    (vma_is_chp_anonymous(src_vma) &&
			     IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}

We could only reproduce the above issue by running on thousands of phones.

Does your code survive this problem?

>
> Thanks,
> Ryan
>
> >
> > but the current code is copying page0~page14, right? unless we are immediatly
> > split_folio to basepages in zap_pte_range(), we will have problems?
> >
> >> +
> >> + for (i = 0; i < nr; i++, page++) {
> >> + if (anon) {
> >> + /*
> >> + * If this page may have been pinned by the
> >> + * parent process, copy the page immediately for
> >> + * the child so that we'll always guarantee the
> >> + * pinned page won't be randomly replaced in the
> >> + * future.
> >> + */
> >> + if (unlikely(page_try_dup_anon_rmap(
> >> + page, false, src_vma))) {
> >> + if (i != 0)
> >> + break;
> >> + /* Page may be pinned, we have to copy. */
> >> + return copy_present_page(
> >> + dst_vma, src_vma, dst_pte,
> >> + src_pte, addr, rss, prealloc,
> >> + page);
> >> + }
> >> + rss[MM_ANONPAGES]++;
> >> + VM_BUG_ON(PageAnonExclusive(page));
> >> + } else {
> >> + page_dup_file_rmap(page, false);
> >> + rss[mm_counter_file(page)]++;
> >> + }
> >

Thanks
Barry

2023-11-28 03:14:21

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Mon, Nov 27, 2023 at 1:15 AM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >> mm: Batch-copy PTE ranges during fork()
> >> arm64/mm: set_pte(): New layer to manage contig bit
> >> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >> arm64/mm: pte_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >> arm64/mm: ptep_get(): New layer to manage contig bit
> >> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >> arm64/mm: Wire up PTE_CONT for user mappings
> >> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means it's safe to only track access/dirty for the contpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.

Even with different permissions, for example read-only vs read-write?
mprotect() may change the permissions; that should be misprogramming
per the ARM ARM.

>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >
> > Thanks
> > Barry
> >
> >
>
>

2023-11-28 05:49:45

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >> mm: Batch-copy PTE ranges during fork()
> >> arm64/mm: set_pte(): New layer to manage contig bit
> >> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >> arm64/mm: pte_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >> arm64/mm: ptep_get(): New layer to manage contig bit
> >> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >> arm64/mm: Wire up PTE_CONT for user mappings
> >> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means it's safe to only track access/dirty for the contpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.
>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).

Hi Ryan,

Sorry for so many comments; I remembered another case:

4. mremap

A CONTPTE block might be remapped to another address which might not be aligned
to 16*basepage. Thus, in move_ptes(), we are copying CONTPTEs from src to dst.
static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
		unsigned long old_addr, unsigned long old_end,
		struct vm_area_struct *new_vma, pmd_t *new_pmd,
		unsigned long new_addr, bool need_rmap_locks)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t *old_pte, *new_pte, pte;
	...

	/*
	 * We don't have to worry about the ordering of src and dst
	 * pte locks because exclusive mmap_lock prevents deadlock.
	 */
	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
	if (!old_pte) {
		err = -EAGAIN;
		goto out;
	}
	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
	if (!new_pte) {
		pte_unmap_unlock(old_pte, old_ptl);
		err = -EAGAIN;
		goto out;
	}
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
	flush_tlb_batched_pending(vma->vm_mm);
	arch_enter_lazy_mmu_mode();

	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(ptep_get(old_pte)))
			continue;

		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		....
	}

There are two possibilities:
1. new_pte is aligned with CONT_PTES, so we can still keep CONTPTE;
2. new_pte is not aligned with CONT_PTES, so we should drop CONTPTE
while copying.

Does your code also handle this properly?
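
To make case 2 concrete, here is a rough sketch of the kind of check I mean;
the helper is made up purely for illustration and is not from your series,
which I understand does this inside its own fold/unfold helpers:

/*
 * Illustration only - a made-up helper, not code from the series. A dst
 * pte can only keep the CONT bit if its virtual address lands at the
 * same offset within a 16-entry block as its pfn does within a 16-page
 * block (and the rest of the block is moved along with it).
 */
static inline bool dst_can_keep_cont(unsigned long new_addr, pte_t pte)
{
        unsigned long va_off = (new_addr >> PAGE_SHIFT) & (CONT_PTES - 1);
        unsigned long pa_off = pte_pfn(pte) & (CONT_PTES - 1);

        /* case 1 above: offsets match; case 2: they don't, so drop CONT */
        return va_off == pa_off;
}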

>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >
> > Thanks
> > Barry

Thanks
Barry

2023-11-28 06:54:49

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 27/11/2023 07:34, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>>
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>>
>>>>>> Ryan Roberts <[email protected]> writes:
>>>>>>
>>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>>> a full address space teardown is in progress. We use this information to
>>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>>> are about to clear the rest too.
>>>>>>>
>>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>>> original cost.
>>>>>>>
>>>>>>> This approach is not perfect though, as for the duration between
>>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>>> other calls we might resonably expect are to set markers in the
>>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>>> the 'full' case anyway.
>>>>>>>
>>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>>> doing that here, because we know that the information is used by the
>>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>>> for folding in the first place), that information will make it to the
>>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>>> means we don't have to play games with accumulating and storing the
>>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>>> happened to store it. The core code does not rely on this though.
>>>>>>
>>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>>> potentially something that could change in the future though which would
>>>>>> lead to some subtle bugs.
>>>>>
>>>>> Yes, there is a risk, although IMHO, its very small.
>>>>>
>>>>>>
>>>>>> Would there be any may of avoiding this? Half baked thought but could
>>>>>> you for example copy the access/dirty information to the last (or
>>>>>> perhaps first, most likely invalid) PTE?
>>>>>
>>>>> I spent a long time thinking about this and came up with a number of
>>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>>> (which works but suffers from the problem that it depends on the way it is
>>>>> called not changing).
>>>>
>>>> Ok, that answers my underlying question of "has someone thought about
>>>> this and are there any easy solutions". I suspected that was the case
>>>> given the excellent write up though!
>>>>
>>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>>>> could be implemented correctly but its extremely complex.
>>>>>
>>>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>>>
>>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>>> is no easy place to store that state (other than scanning a contpte block
>>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>>
>>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>>
>>>>> What do you think? If we really need to solve this, then option 1 is my
>>>>> preferred route, but it would take some time to figure out and reason about a
>>>>> race-free scheme.
>>>>
>>>> Well I like simple, and I agree the risk is small. But I can't help feel
>>>> the current situation is too subtle, mainly because it is architecture
>>>> specific and the assumptions are not communicated in core-mm code
>>>> anywhere. But also none of the aternatives seem much better.
>>>>
>>>> However there are only three callers of ptep_get_and_clear_full(), and
>>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>>> just about all users of ptep_get*() which will take the ptl before hand.
>>>
>>> The problem isn't racing threads because as you say, the PTL is already
>>> serializing all calls except ptep_get_lockless(). And although there are 3
>>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>>> calls it with full=1, as I recall.
>>>
>>> The problem is that the caller in zap_pte_range() does this:
>>>
>>> ptl = lock_page_table()
>>> for each pte {
>>>         ptent = ptep_get(pte);
>>>         if (pte_present(ptent)) {
>>>                 ptent = ptep_get_and_clear_full(ptent);
>>>                 if (pte_dirty(ptent))
>>>                         ...
>>>                 if (pte_young(ptent))
>>>                         ...
>>>         }
>>> }
>>> unlock_page_table(ptl)
>>>
>>> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
>>> under the ptl. So if the loop is iterating over a contpte block and the HW
>>> happens to be storing the access/dirty info in the first pte entry, then the
>>> first time through the loop, ptep_get() will return the correct access/dirty
>>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>>
>> So if ptep_get_lockless() isn't a concern what made the option posted in
>> v1 racy (your option 1 above)? Is there something else reading PTEs or
>> clearing PTE bits without holding the PTL that I'm missing?
>
> The HW could be racing to set access and dirty bits. Well actually, I'm not
> completely sure if that's the case here; if full=1 then presumably no other
> threads in the process should be running at this point, so perhaps it can be
> guarranteed that nothing is causing a concurrent memory access and the HW is
> therefore definitely not going to try to write the access/dirty bits
> concurrently. But I didn't manage to convince myself that's definitely the case.

I suppose it's possible something attached to an SMMU or some such could
still be running and causing accesses, so I agree it probably can't be
guaranteed (although it would be an odd corner case).

> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
> buggy because it iterated through the PTEs, getting and accumulating. Then
> iterated again, writing that final set of bits to all the PTEs. And the HW could
> have modified the bits during those loops. I think it would be possible to fix
> the race, but intuition says it would be expensive.

So the issue, as I understand it, is that subsequent iterations would see a
clean PTE after the first iteration returned a dirty PTE. In
ptep_get_and_clear_full(), why couldn't you just copy the dirty/accessed
bit (if set) from the PTE being cleared to an adjacent PTE rather than to
all the PTEs?

That would fix the inconsistency as far as making subsequent iterations of
ptep_get_and_clear_full() return dirty/accessed if a previous iteration
did. Obviously HW could still race and cause a previously clean iteration
to return dirty, but that seems ok.
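
Something along these lines is what I had in mind - a rough sketch only, with
made-up names (not from the series), to show the idea rather than propose a
concrete implementation:

/*
 * Sketch only: when clearing one pte of a contpte block, forward its
 * access/dirty bits to a neighbouring pte that is still valid, so that
 * later iterations keep reporting them. Names are made up.
 */
static pte_t clear_and_forward_ad(struct mm_struct *mm, unsigned long addr,
                                  pte_t *ptep, pte_t *neighbour)
{
        pte_t orig = __ptep_get_and_clear(mm, addr, ptep);
        pte_t n = __ptep_get(neighbour);

        if (pte_valid(n)) {
                if (pte_dirty(orig))
                        n = pte_mkdirty(n);
                if (pte_young(orig))
                        n = pte_mkyoung(n);
                set_pte(neighbour, n);
        }
        return orig;
}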

However all this has just left me with more questions :-)

Won't clearing bits like this result in inconsistent programming of the
PTE_CONT bit? What happens if HW accesses a page in the contiguous region
while some of the PTEs are invalid? And the same question for programming
them, really - I don't think we can atomically set PTE_CONT on multiple
PTEs all at once, so if we assume something can be accessing them
concurrently, how do we do that without some HW observing an intermediate
state where PTE_CONT is misprogrammed?

Thanks.

>>
>>>>
>>>> So really that only leaves ptep_get_lockless() that could/should
>>>> interleave right?
>>>
>>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>>> is very careful to ensure that the contpte block is in a consistent state and it
>>> keeps trying until it is. So this will always return the correct consistent
>>> information.
>>>
>>>> From a quick glance of those users none look at the
>>>> young/dirty information anyway, so I wonder if we can just assert in the
>>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>>> and clear it in the helpers? That would make things explicit and
>>>> consistent which would address my concern (although I haven't looked too
>>>> closely at the details there).
>>>
>>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>>> don't think this helps.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>
>>

2023-11-28 07:33:46

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, int full)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_valid_cont(orig_pte) || !full) {
> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
> +		return __ptep_get_and_clear(mm, addr, ptep);
> +	} else
> +		return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +

Hi Ryan,

I find the code quite hard to understand. When !pte_valid_cont(orig_pte),
we will still call contpte_try_unfold(mm, addr, ptep, orig_pte);

but in contpte_try_unfold(), we only unfold if pte_valid_cont()
is true:
static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
                                        pte_t *ptep, pte_t pte)
{
        if (contpte_is_enabled(mm) && pte_valid_cont(pte))
                __contpte_try_unfold(mm, addr, ptep, pte);
}

So do you mean the below?

if (!pte_valid_cont(orig_pte))
        return __ptep_get_and_clear(mm, addr, ptep);

if (!full) {
        contpte_try_unfold(mm, addr, ptep, orig_pte);
        return __ptep_get_and_clear(mm, addr, ptep);
} else {
        return contpte_ptep_get_and_clear_full(mm, addr, ptep);
}

Thanks
Barry


2023-11-28 08:19:15

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * When doing a full address space teardown, we can avoid unfolding the
> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> + * just get and clear the pte. The caller is promising to call us for
> + * every pte, so every pte in the range will be cleared by the time the
> + * tlbi is issued.
> + *
> + * This approach is not perfect though, as for the duration between
> + * returning from the first call to ptep_get_and_clear_full() and making
> + * the final call, the contpte block in an intermediate state, where
> + * some ptes are cleared and others are still set with the PTE_CONT bit.
> + * If any other APIs are called for the ptes in the contpte block during
> + * that time, we have to be very careful. The core code currently
> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> + * ptep_get() must be careful to ignore the cleared entries when
> + * accumulating the access and dirty bits - the same goes for
> + * ptep_get_lockless(). The only other calls we might resonably expect
> + * are to set markers in the previously cleared ptes. (We shouldn't see
> + * valid entries being set until after the tlbi, at which point we are
> + * no longer in the intermediate state). Since markers are not valid,
> + * this is safe; set_ptes() will see the old, invalid entry and will not
> + * attempt to unfold. And the new pte is also invalid so it won't
> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
> + *
> + * The last remaining issue is returning the access/dirty bits. That
> + * info could be present in any of the ptes in the contpte block.
> + * ptep_get() will gather those bits from across the contpte block. We
> + * don't bother doing that here, because we know that the information is
> + * used by the core-mm to mark the underlying folio as accessed/dirty.
> + * And since the same folio must be underpinning the whole block (that
> + * was a requirement for folding in the first place), that information
> + * will make it to the folio eventually once all the ptes have been
> + * cleared. This approach means we don't have to play games with
> + * accumulating and storing the bits. It does mean that any interleaved
> + * calls to ptep_get() may lack correct access/dirty information if we
> + * have already cleared the pte that happened to store it. The core code
> + * does not rely on this though.

Even without any other threads running and touching those PTEs, this won't survive
on some hardware. We expose inconsistent CONTPTEs to hardware, and this can result
in crashed firmware, even in TrustZone; we have seen strange and unknown faults
reported to TrustZone on Qualcomm, though on MTK it seems fine. When you do a tlbi
on part of the PTEs with CONT dropped while some other PTEs still have CONT set,
we leave the hardware totally confused.

zap_pte_range() has a force_flush when the tlb batch is full:

if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
        force_flush = 1;
        addr += PAGE_SIZE;
        break;
}

This means a partial tlbi/flush can be exposed directly to hardware while some
other PTEs are still CONT.

On the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
on fullmm; as long as the zap range covers a large folio, we can do the tlbi for
all of those CONTPTEs together in your contpte_ptep_get_and_clear_full() rather
than clearing one PTE at a time.

Our approach in [1] is to do a flush for all CONTPTEs and go directly to the end
of the large folio:

#ifdef CONFIG_CONT_PTE_HUGEPAGE
        if (pte_cont(ptent)) {
                unsigned long next = pte_cont_addr_end(addr, end);

                if (next - addr != HPAGE_CONT_PTE_SIZE) {
                        __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
                        /*
                         * After splitting cont-pte
                         * we need to process pte again.
                         */
                        goto again_pte;
                } else {
                        cont_pte_huge_ptep_get_and_clear(mm, addr, pte);

                        tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
                        if (unlikely(!page))
                                continue;

                        if (is_huge_zero_page(page)) {
                                tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
                                goto cont_next;
                        }

                        rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
                        page_remove_rmap(page, true);
                        if (unlikely(page_mapcount(page) < 0))
                                print_bad_pte(vma, addr, ptent, page);

                        tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
                }
cont_next:
                /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
                pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
                addr = next - PAGE_SIZE;
                continue;
        }
#endif

this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
it never requires tlb->fullmm at all.

static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
                                                             unsigned long addr,
                                                             pte_t *ptep,
                                                             bool flush)
{
        pte_t orig_pte = ptep_get(ptep);

        CHP_BUG_ON(!pte_cont(orig_pte));
        CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
        CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));

        return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539

> + */
> +
> + return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +

Thanks
Barry


2023-11-28 09:14:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 27/11/2023 20:34, Barry Song wrote:
> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 10:28, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>>>> + if (anon) {
>>>>>>>>> + /*
>>>>>>>>> + * If this page may have been pinned by the
>>>>>>>>> + * parent process, copy the page immediately for
>>>>>>>>> + * the child so that we'll always guarantee the
>>>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>>>> + * future.
>>>>>>>>> + */
>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>> + page, false, src_vma))) {
>>>>>>>>> + if (i != 0)
>>>>>>>>> + break;
>>>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>>>> + return copy_present_page(
>>>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>>>> + page);
>>>>>>>>> + }
>>>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>> + } else {
>>>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>>>> + }
>>>>>>>>> }
>>>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>>>> - } else if (page) {
>>>>>>>>> - folio_get(folio);
>>>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>>>> +
>>>>>>>>> + nr = i;
>>>>>>>>> + folio_ref_add(folio, nr);
>>>>>>>>
>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>
>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>> pages are the corner case.
>>>>>>>>
>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>
>>>>>>>
>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>> in each basepage.
>>>>>>
>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>> atomic, so we can account the entire thing.
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>> process A with CONPTE,
>>>>> and only page2 is mapped in process B.
>>>>> then we will have
>>>>>
>>>>> entire_map = 0
>>>>> page0.map = -1
>>>>> page1.map = -1
>>>>> page2.map = 0
>>>>> page3.map = -1
>>>>> ....
>>>>>
>>>>>>
>>>>>>>
>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>> split.
>>>>>>>
>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>
>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>
>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>> =0? (start from -1).
>>>>>>>
>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>> which makes that code easier.
>>>>>>
>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>> unmap the whole folio atomically.
>>>>>
>>>>>
>>>>>
>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>> it is partially
>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>> DoubleMapped.
>>>>
>>>> There are 2 problems with your proposal, as I see it;
>>>>
>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>> the CONT_PTE bit.
>>>>
>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>> be unmapped unatomically.
>>>>
>>>> For the PMD case there are actually 2 properties that allow using the
>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>
>>> well. Thanks for clarification. based on the above description, i agree the
>>> current code might make more sense by always using mapcount in subpage.
>>>
>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>> entirely, we only
>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>
>> Well its always good to have the discussion - so thanks for the ideas. I think
>> there is a bigger question lurking here; should we be exposing the concept of
>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>> I'm confident that would be a huge amount of effort and the end result would be
>> similar performace to what this approach gives. One potential benefit of letting
>> core-mm control it is that it would also give control to core-mm over the
>> granularity of access/dirty reporting (my approach implicitly ties it to the
>> folio). Having sub-folio access tracking _could_ potentially help with future
>> work to make THP size selection automatic, but we are not there yet, and I think
>> there are other (simpler) ways to achieve the same thing. So my view is that
>> _not_ exposing it to core-mm is the right way for now.
>
> Hi Ryan,
>
> We(OPPO) started a similar project like you even before folio was imported to
> mainline, we have deployed the dynamic hugepage(that is how we name it)
> on millions of mobile phones on real products and kernels before 5.16, making
> a huge success on performance improvement. for example, you may
> find the out-of-tree 5.15 source code here

Oh wow, thanks for reaching out and explaining this - I have to admit I feel
embarrassed that I clearly didn't do enough research on the prior art because I
wasn't aware of your work. So sorry about that.

I sensed that you had a different model for how this should work vs what I've
implemented and now I understand why :). I'll review your stuff and I'm sure
I'll have questions. I'm sure each solution has pros and cons.


>
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
> Our modification might not be so clean and has lots of workarounds
> just for the stability of products
>
> We mainly have
>
> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>
> some CONTPTE helpers
>
> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>
> some Dynamic Hugepage APIs
>
> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>
> modified all page faults to support
> (1). allocation of hugepage of 64KB in do_anon_page

My Small-Sized THP patch set is handling the equivalent of this.

> (2). CoW hugepage in do_wp_page

This isn't handled yet in my patch set; the original RFC implemented it but I
removed it in order to strip back to the essential complexity for the initial
submission. DavidH has been working on a precise shared vs exclusive map
tracking mechanism - if that goes in, it will make CoWing large folios simpler.
Out of interest, what workloads benefit most from this?

> (3). copy CONPTEs in copy_pte_range

As discussed, this is done as part of the contpte patch set, but it's not just a
simple copy; the arch code will notice and set the CONT_PTE bit as needed.
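
Roughly, the shape is as below - a simplified sketch for illustration only, not
the literal patch code:

/*
 * Simplified sketch (illustration only, assumes nr >= 1): the exported
 * set_ptes() writes the entries as usual, then asks the contpte layer to
 * fold the surrounding block if it now qualifies (same folio, suitable
 * VA/PA alignment, compatible attributes).
 */
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                            pte_t *ptep, pte_t pte, unsigned int nr)
{
        for (;;) {
                __set_ptes(mm, addr, ptep, pte, 1);
                if (--nr == 0)
                        break;
                ptep++;
                addr += PAGE_SIZE;
                pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
        }
        /* ptep/addr/pte now refer to the last entry that was written */
        contpte_try_fold(mm, addr, ptep, pte);
}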

> (4). allocate and swap-in Hugepage as a whole in do_swap_page

This is going to be a problem but I haven't even looked at this properly yet.
The advice so far has been to continue to swap-in small pages only, but improve
khugepaged to collapse to small-sized THP. I'll take a look at your code to
understand how you did this.

>
> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>
> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.

I think this is all naturally handled by the folio code that exists in modern
kernels?

>
> So we are 100% interested in your patchset and hope it can find a way
> to land on the
> mainline, thus decreasing all the cost we have to maintain out-of-tree
> code from a
> kernel to another kernel version which we have done on a couple of
> kernel versions
> before 5.16. Firmly, we are 100% supportive of large anon folios
> things you are leading.

That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
it closer :). If you had any ability to do any A/B performance testing, it would
be very interesting to see how this stacks up against your solution - if there
are gaps it would be good to know where and develop a plan to plug the gap.

>
> A big pain was we found lots of races especially on CONTPTE unfolding
> and especially a part
> of basepages ran away from the 16 CONPTEs group since userspace is
> always working
> on basepages, having no idea of small-THP. We ran our code on millions of
> real phones, and now we have got them fixed (or maybe "can't reproduce"),
> no outstanding issue.

I'm going to be brave and say that my solution shouldn't suffer from these
problems; but of course the proof is only in the testing. I did a lot of work
with our architecture group and micro-architects to determine exactly what is
and isn't safe; we even tightened the Arm ARM spec very subtly to allow the
optimization in patch 13 (see the commit log for details). Of course this has
all been checked with partners and we are confident that all existing
implementations conform to the modified wording.

>
> Particularly for the rmap issue we are discussing, our out-of-tree is
> using the entire_map for
> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> CONTPTE from mm-core.
>
> We are doing this in mm/memory.c
>
> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> 		pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> 		struct page **prealloc)
> {
> 	struct mm_struct *src_mm = src_vma->vm_mm;
> 	unsigned long vm_flags = src_vma->vm_flags;
> 	pte_t pte = *src_pte;
> 	struct page *page;
>
> 	page = vm_normal_page(src_vma, addr, pte);
> 	...
>
> 	get_page(page);
> 	page_dup_rmap(page, true); // an entire dup_rmap as you can see.............
> 	rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> }
>
> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>
> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> 		unsigned long haddr, bool freeze)
> {
> 	...
> 	if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> 		for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> 			atomic_inc(&head[i]._mapcount);
> 		atomic_long_inc(&cont_pte_double_map_count);
> 	}
>
>
> 	if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> 		...
> 	}
>
> I am not selling our solution any more, but just showing you some differences we
> have :-)

OK, I understand what you were saying now. I'm currently struggling to see how
this could fit into my model. Do you have any workloads and numbers on perf
improvement of using entire_mapcount?

>
>>
>>>
>>> BTW, I have concerns that a variable small-THP size will really work
>>> as userspace
>>> is probably friendly to only one fixed size. for example, userspace
>>> heap management
>>> might be optimized to a size for freeing memory to the kernel. it is
>>> very difficult
>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>> size not equal with, and particularly smaller than small-THP size will
>>> defeat all
>>> efforts to use small-THP.
>>
>> I'll admit to not knowing a huge amount about user space allocators. But I will
>> say that as currently defined, the small-sized THP interface to user space
>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>> single size can be enabled. I'm diliberately punting that decision away from the
>> kernel for now.
>
> Basically, userspace heap library has a PAGESIZE setting and allows users
> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> The default size is for sure equal to the basepage SIZE. once some objects are
> freed by free() and libc get a free "page", userspace heap libraries might free
> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> it is quite similar with kernel slab.
>
> so imagine we have small-THP now, but userspace libraries have *NO*
> idea at all, so it can frequently cause unfolding.
>
>>
>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>
>> Functionally, it will not matter if the allocator is not enlightened for the THP
>> size; it can continue to free, and if a partial folio is unmapped it is put on
>> the deferred split list, then under memory pressure it is split and the unused
>> pages are reclaimed. I guess this is the bit you are concerned about having a
>> performance impact?
>
> right. If this is happening on the majority of small-THP folios, we
> don't have performance
> improvement, and probably regression instead. This is really true on
> real workloads!!
>
> So that is why we really love a per-VMA hint to enable small-THP but
> obviously you
> have already supported it now by
> mm: thp: Introduce per-size thp sysfs interface
> https://lore.kernel.org/linux-mm/[email protected]/
>
> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> can set the VMA flag when it is quite sure this VMA is working with
> the alignment
> of 64KB?

Yes, that all exists in the series today. We have also discussed the possibility
of adding a new madvise_process() call that would take the set of THP sizes that
should be considered. Then you can set different VMAs to use different sizes;
the plan was to layer that on top if/when a workload was identified. Sounds like
you might be able to help there?

>
>>
>> Regardless, it would be good to move this conversation to the small-sized THP
>> patch series since this is all independent of contpte mappings.
>>
>>>
>>>>
>>>>>
>>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>>>> still atomic in a
>>>>> spinlock area.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> David / dhildenb
>>>>>>>
>>>>>
>>>
>
> Thanks
> Barry

2023-11-28 09:50:30

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2023 20:34, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 10:28, Barry Song wrote:
> >>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>>>> + if (anon) {
> >>>>>>>>> + /*
> >>>>>>>>> + * If this page may have been pinned by the
> >>>>>>>>> + * parent process, copy the page immediately for
> >>>>>>>>> + * the child so that we'll always guarantee the
> >>>>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>>>> + * future.
> >>>>>>>>> + */
> >>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>> + page, false, src_vma))) {
> >>>>>>>>> + if (i != 0)
> >>>>>>>>> + break;
> >>>>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>>>> + return copy_present_page(
> >>>>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>>>> + page);
> >>>>>>>>> + }
> >>>>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>> + } else {
> >>>>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>>>> + }
> >>>>>>>>> }
> >>>>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>>>> - } else if (page) {
> >>>>>>>>> - folio_get(folio);
> >>>>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>>>> +
> >>>>>>>>> + nr = i;
> >>>>>>>>> + folio_ref_add(folio, nr);
> >>>>>>>>
> >>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>
> >>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>> pages are the corner case.
> >>>>>>>>
> >>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>
> >>>>>>>
> >>>>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>>>> in each basepage.
> >>>>>>
> >>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>> atomic, so we can account the entire thing.
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> There is no problem. for example, a large folio is entirely mapped in
> >>>>> process A with CONPTE,
> >>>>> and only page2 is mapped in process B.
> >>>>> then we will have
> >>>>>
> >>>>> entire_map = 0
> >>>>> page0.map = -1
> >>>>> page1.map = -1
> >>>>> page2.map = 0
> >>>>> page3.map = -1
> >>>>> ....
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>> split.
> >>>>>>>
> >>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>
> >>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>
> >>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>> =0? (start from -1).
> >>>>>>>
> >>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>> which makes that code easier.
> >>>>>>
> >>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>> unmap the whole folio atomically.
> >>>>>
> >>>>>
> >>>>>
> >>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>> it is partially
> >>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>> DoubleMapped.
> >>>>
> >>>> There are 2 problems with your proposal, as I see it;
> >>>>
> >>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>> the CONT_PTE bit.
> >>>>
> >>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>> be unmapped unatomically.
> >>>>
> >>>> For the PMD case there are actually 2 properties that allow using the
> >>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>
> >>> well. Thanks for clarification. based on the above description, i agree the
> >>> current code might make more sense by always using mapcount in subpage.
> >>>
> >>> I gave my proposals as I thought we were always CONTPTE size for small-THP
> >>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>> entirely, we only
> >>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>
> >> Well its always good to have the discussion - so thanks for the ideas. I think
> >> there is a bigger question lurking here; should we be exposing the concept of
> >> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >> I'm confident that would be a huge amount of effort and the end result would be
> >> similar performace to what this approach gives. One potential benefit of letting
> >> core-mm control it is that it would also give control to core-mm over the
> >> granularity of access/dirty reporting (my approach implicitly ties it to the
> >> folio). Having sub-folio access tracking _could_ potentially help with future
> >> work to make THP size selection automatic, but we are not there yet, and I think
> >> there are other (simpler) ways to achieve the same thing. So my view is that
> >> _not_ exposing it to core-mm is the right way for now.
> >
> > Hi Ryan,
> >
> > We(OPPO) started a similar project like you even before folio was imported to
> > mainline, we have deployed the dynamic hugepage(that is how we name it)
> > on millions of mobile phones on real products and kernels before 5.16, making
> > a huge success on performance improvement. for example, you may
> > find the out-of-tree 5.15 source code here
>
> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> embarrassed that I clearly didn't do enough research on the prior art because I
> wasn't aware of your work. So sorry about that.
>
> I sensed that you had a different model for how this should work vs what I've
> implemented and now I understand why :). I'll review your stuff and I'm sure
> I'll have questions. I'm sure each solution has pros and cons.
>
>
> >
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
> > Our modification might not be so clean and has lots of workarounds
> > just for the stability of products
> >
> > We mainly have
> >
> > 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >
> > some CONTPTE helpers
> >
> > 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >
> > some Dynamic Hugepage APIs
> >
> > 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >
> > modified all page faults to support
> > (1). allocation of hugepage of 64KB in do_anon_page
>
> My Small-Sized THP patch set is handling the equivalent of this.

Right, the only difference is that we used a huge zero page for read faults
in do_anon_page, mapping the whole range with CONTPTEs to the zero page.

>
> > (2). CoW hugepage in do_wp_page
>
> This isn't handled yet in my patch set; the original RFC implemented it but I
> removed it in order to strip back to the essential complexity for the initial
> submission. DavidH has been working on a precise shared vs exclusive map
> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> Out of interest, what workloads benefit most from this?

On a phone, Android is designed so that almost all processes are forked from
zygote; thus, CoW happens quite often for all apps.

>
> > (3). copy CONPTEs in copy_pte_range
>
> As discussed this is done as part of the contpte patch set, but its not just a
> simple copy; the arch code will notice and set the CONT_PTE bit as needed.

Right, I have read all your unfold and fold code today, and now that I understand
it, your approach seems quite nice!


>
> > (4). allocate and swap-in Hugepage as a whole in do_swap_page
>
> This is going to be a problem but I haven't even looked at this properly yet.
> The advice so far has been to continue to swap-in small pages only, but improve
> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> understand how you did this.

This is also crucial for Android phones, as swap is always happening on an
embedded device. If we don't support large folios in swap-in, our large folios
will never come back after they are swapped out.

And I disliked the collapse solution from the very beginning, as there is never
a guarantee it will succeed and its overhead is unacceptable for the user UI,
so we supported hugepage allocation in do_swap_page from the start.

>
> >
> > 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >
> > reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>
> I think this is all naturally handled by the folio code that exists in modern
> kernels?

We had a CONTPTE hugepage pool; when the pool runs low, we let LRU reclaim
large folios back into the pool. As phones run lots of apps and drivers and
memory is very limited, after a couple of hours it becomes very hard to
allocate large folios from the normal buddy allocator. Thus, without the pool,
large folios disappear entirely after running the phone for some time.

>
> >
> > So we are 100% interested in your patchset and hope it can find a way
> > to land on the
> > mainline, thus decreasing all the cost we have to maintain out-of-tree
> > code from a
> > kernel to another kernel version which we have done on a couple of
> > kernel versions
> > before 5.16. Firmly, we are 100% supportive of large anon folios
> > things you are leading.
>
> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> it closer :). If you had any ability to do any A/B performance testing, it would
> be very interesting to see how this stacks up against your solution - if there
> are gaps it would be good to know where and develop a plan to plug the gap.
>

sure.

> >
> > A big pain was we found lots of races especially on CONTPTE unfolding
> > and especially a part
> > of basepages ran away from the 16 CONPTEs group since userspace is
> > always working
> > on basepages, having no idea of small-THP. We ran our code on millions of
> > real phones, and now we have got them fixed (or maybe "can't reproduce"),
> > no outstanding issue.
>
> I'm going to be brave and say that my solution shouldn't suffer from these
> problems; but of course the proof is only in the testing. I did a lot of work
> with our architecture group and micro architects to determine exactly what is
> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
> optimization in patch 13 (see the commit log for details). Of course this has
> all been checked with partners and we are confident that all existing
> implementations conform to the modified wording.

Cool. I like your try_unfold/fold code. It seems your code sets/drops CONT
automatically based on alignment, page count, etc. In contrast, our code always
has to explicitly check a set of conditions before setting and dropping CONT
everywhere.
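
For what it's worth, the kind of condition I mean looks roughly like this
(illustration only, with a made-up helper that is not from either code base):

/*
 * Illustration only: the sort of check that decides whether a 16-pte
 * range can carry the CONT bit - block-aligned VA, block-aligned PA,
 * and enough pages remaining to cover the whole block.
 */
static inline bool can_use_cont(unsigned long addr, unsigned long pfn,
                                unsigned long nr_pages)
{
        return IS_ALIGNED(addr, CONT_PTES * PAGE_SIZE) &&
               IS_ALIGNED(pfn, CONT_PTES) &&
               nr_pages >= CONT_PTES;
}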

>
> >
> > Particularly for the rmap issue we are discussing, our out-of-tree is
> > using the entire_map for
> > CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> > CONTPTE from mm-core.
> >
> > We are doing this in mm/memory.c
> >
> > copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> > vm_area_struct *src_vma,
> > pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> > struct page **prealloc)
> > {
> > struct mm_struct *src_mm = src_vma->vm_mm;
> > unsigned long vm_flags = src_vma->vm_flags;
> > pte_t pte = *src_pte;
> > struct page *page;
> >
> > page = vm_normal_page(src_vma, addr, pte);
> > ...
> >
> > get_page(page);
> > page_dup_rmap(page, true); // an entire dup_rmap as you can
> > see.............
> > rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> > }
> >
> > and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> >
> > static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> > unsigned long haddr, bool freeze)
> > {
> > ...
> > if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> > for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> > atomic_inc(&head[i]._mapcount);
> > atomic_long_inc(&cont_pte_double_map_count);
> > }
> >
> >
> > if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> > ...
> > }
> >
> > I am not selling our solution any more, but just showing you some differences we
> > have :-)
>
> OK, I understand what you were saying now. I'm currently struggling to see how
> this could fit into my model. Do you have any workloads and numbers on perf
> improvement of using entire_mapcount?

TBH, I don't have any data on this, as we were using entire_map from the very
beginning, so I have no comparison at all.

>
> >
> >>
> >>>
> >>> BTW, I have concerns that a variable small-THP size will really work
> >>> as userspace
> >>> is probably friendly to only one fixed size. for example, userspace
> >>> heap management
> >>> might be optimized to a size for freeing memory to the kernel. it is
> >>> very difficult
> >>> for the heap to adapt to various sizes at the same time. frequent unmap/free
> >>> size not equal with, and particularly smaller than small-THP size will
> >>> defeat all
> >>> efforts to use small-THP.
> >>
> >> I'll admit to not knowing a huge amount about user space allocators. But I will
> >> say that as currently defined, the small-sized THP interface to user space
> >> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >> single size can be enabled. I'm diliberately punting that decision away from the
> >> kernel for now.
> >
> > Basically, userspace heap library has a PAGESIZE setting and allows users
> > to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> > The default size is for sure equal to the basepage SIZE. once some objects are
> > freed by free() and libc get a free "page", userspace heap libraries might free
> > the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> > it is quite similar with kernel slab.
> >
> > so imagine we have small-THP now, but userspace libraries have *NO*
> > idea at all, so it can frequently cause unfolding.
> >
> >>
> >> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> >> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>
> >> Functionally, it will not matter if the allocator is not enlightened for the THP
> >> size; it can continue to free, and if a partial folio is unmapped it is put on
> >> the deferred split list, then under memory pressure it is split and the unused
> >> pages are reclaimed. I guess this is the bit you are concerned about having a
> >> performance impact?
> >
> > right. If this is happening on the majority of small-THP folios, we
> > don't have performance
> > improvement, and probably regression instead. This is really true on
> > real workloads!!
> >
> > So that is why we really love a per-VMA hint to enable small-THP but
> > obviously you
> > have already supported it now by
> > mm: thp: Introduce per-size thp sysfs interface
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> > can set the VMA flag when it is quite sure this VMA is working with
> > the alignment
> > of 64KB?
>
> Yes, that all exists in the series today. We have also discussed the possibility
> of adding a new madvise_process() call that would take the set of THP sizes that
> should be considered. Then you can set different VMAs to use different sizes;
> the plan was to layer that on top if/when a workload was identified. Sounds like
> you might be able to help there?

I'm not quite sure, as on phones we use a fixed-size CONTPTE, so we ask
for either 64KB or 4KB. If we think a VMA is a good fit for CONTPTE, we
set a flag in that VMA and try to allocate 64KB.

But I will try to understand this requirement to madvise THP sizes on a specific
VMA.

>
> >
> >>
> >> Regardless, it would be good to move this conversation to the small-sized THP
> >> patch series since this is all independent of contpte mappings.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> >>>>> still atomic in a
> >>>>> spinlock area.
> >>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> But that can be added on top, and I'll happily do that.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> David / dhildenb
> >>>>>>>
> >>>>>
> >>>

Thanks
Barry

2023-11-28 10:49:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 09:49, Barry Song wrote:
> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 20:34, Barry Song wrote:
>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 10:28, Barry Song wrote:
>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>>>>>> + if (anon) {
>>>>>>>>>>> + /*
>>>>>>>>>>> + * If this page may have been pinned by the
>>>>>>>>>>> + * parent process, copy the page immediately for
>>>>>>>>>>> + * the child so that we'll always guarantee the
>>>>>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>>>>>> + * future.
>>>>>>>>>>> + */
>>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>>>> + page, false, src_vma))) {
>>>>>>>>>>> + if (i != 0)
>>>>>>>>>>> + break;
>>>>>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>>>>>> + return copy_present_page(
>>>>>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>>>>>> + page);
>>>>>>>>>>> + }
>>>>>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>>>> + } else {
>>>>>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>>>>>> + }
>>>>>>>>>>> }
>>>>>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>>>>>> - } else if (page) {
>>>>>>>>>>> - folio_get(folio);
>>>>>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>>>>>> +
>>>>>>>>>>> + nr = i;
>>>>>>>>>>> + folio_ref_add(folio, nr);
>>>>>>>>>>
>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>>>
>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>>>> pages are the corner case.
>>>>>>>>>>
>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>>>> in each basepage.
>>>>>>>>
>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>>>> atomic, so we can account the entire thing.
>>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>>>> process A with CONPTE,
>>>>>>> and only page2 is mapped in process B.
>>>>>>> then we will have
>>>>>>>
>>>>>>> entire_map = 0
>>>>>>> page0.map = -1
>>>>>>> page1.map = -1
>>>>>>> page2.map = 0
>>>>>>> page3.map = -1
>>>>>>> ....
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>>>> split.
>>>>>>>>>
>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>>>
>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>>>
>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>>>> =0? (start from -1).
>>>>>>>>>
>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>>>> which makes that code easier.
>>>>>>>>
>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>>>> unmap the whole folio atomically.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>>>> it is partially
>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>>>> DoubleMapped.
>>>>>>
>>>>>> There are 2 problems with your proposal, as I see it;
>>>>>>
>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>>>> the CONT_PTE bit.
>>>>>>
>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>>>> be unmapped unatomically.
>>>>>>
>>>>>> For the PMD case there are actually 2 properties that allow using the
>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>>>
>>>>> well. Thanks for clarification. based on the above description, i agree the
>>>>> current code might make more sense by always using mapcount in subpage.
>>>>>
>>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>>>> entirely, we only
>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>>>
>>>> Well its always good to have the discussion - so thanks for the ideas. I think
>>>> there is a bigger question lurking here; should we be exposing the concept of
>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>>>> I'm confident that would be a huge amount of effort and the end result would be
>>>> similar performace to what this approach gives. One potential benefit of letting
>>>> core-mm control it is that it would also give control to core-mm over the
>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
>>>> folio). Having sub-folio access tracking _could_ potentially help with future
>>>> work to make THP size selection automatic, but we are not there yet, and I think
>>>> there are other (simpler) ways to achieve the same thing. So my view is that
>>>> _not_ exposing it to core-mm is the right way for now.
>>>
>>> Hi Ryan,
>>>
>>> We(OPPO) started a similar project like you even before folio was imported to
>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
>>> on millions of mobile phones on real products and kernels before 5.16, making
>>> a huge success on performance improvement. for example, you may
>>> find the out-of-tree 5.15 source code here
>>
>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
>> embarrassed that I clearly didn't do enough research on the prior art because I
>> wasn't aware of your work. So sorry about that.
>>
>> I sensed that you had a different model for how this should work vs what I've
>> implemented and now I understand why :). I'll review your stuff and I'm sure
>> I'll have questions. I'm sure each solution has pros and cons.
>>
>>
>>>
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>
>>> Our modification might not be so clean and has lots of workarounds
>>> just for the stability of products
>>>
>>> We mainly have
>>>
>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>>>
>>> some CONTPTE helpers
>>>
>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>>>
>>> some Dynamic Hugepage APIs
>>>
>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>>>
>>> modified all page faults to support
>>> (1). allocation of hugepage of 64KB in do_anon_page
>>
>> My Small-Sized THP patch set is handling the equivalent of this.
>
> right, the only difference is that we did a huge-zeropage for reading
> in do_anon_page.
> mapping all large folios to CONTPTE to zero page.

FWIW, I took a slightly different approach in my original RFC for the zero page
- although I ripped it all out to simplify for the initial series. I found that
it was pretty rare for user space to read multiple consecutive pages without
ever interleaving any writes, so I kept the zero page as a base page, but at CoW,
I would expand the allocation to an appropriately sized THP. But for the couple
of workloads that I've gone deep with, I found that it made barely any dent in
the amount of memory that ended up contpte-mapped; the vast majority was from
write allocation in do_anonymous_page().

>
>>
>>> (2). CoW hugepage in do_wp_page
>>
>> This isn't handled yet in my patch set; the original RFC implemented it but I
>> removed it in order to strip back to the essential complexity for the initial
>> submission. DavidH has been working on a precise shared vs exclusive map
>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
>> Out of interest, what workloads benefit most from this?
>
> as a phone, Android has a design almost all processes are forked from zygote.
> thus, CoW happens quite often to all apps.

Sure. But in my analysis I concluded that most of the memory mapped in zygote is
file-backed and mostly RO, so doing THP CoW doesn't help much. Perhaps there are
cases where that conclusion is wrong.

>
>>
>>> (3). copy CONPTEs in copy_pte_range
>>
>> As discussed this is done as part of the contpte patch set, but its not just a
>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
>
> right, i have read all your unfold and fold stuff today, now i understand your
> approach seems quite nice!

Great - thanks!

>
>
>>
>>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
>>
>> This is going to be a problem but I haven't even looked at this properly yet.
>> The advice so far has been to continue to swap-in small pages only, but improve
>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
>> understand how you did this.
>
> this is also crucial to android phone as swap is always happening
> on an embedded device. if we don't support large folios in swapin,
> our large folios will never come back after it is swapped-out.
>
> and i hated the collapse solution from the first beginning as there is
> never a guarantee to succeed and its overhead is unacceptable to user UI,
> so we supported hugepage allocation in do_swap_page from the first beginning.

Understood. I agree it would be nice to preserve large folios across swap. I
think this can be layered on top of the current work though.

>
>>
>>>
>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>>>
>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>>
>> I think this is all naturally handled by the folio code that exists in modern
>> kernels?
>
> We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
> reclaim large folios to the pool. as phones are running lots of apps
> and drivers, and the memory is very limited, after a couple of hours,
> it will become very hard to allocate large folios in the original buddy. thus,
> large folios totally disappeared after running the phone for some time
> if we didn't have the pool.
>
>>
>>>
>>> So we are 100% interested in your patchset and hope it can find a way
>>> to land on the
>>> mainline, thus decreasing all the cost we have to maintain out-of-tree
>>> code from a
>>> kernel to another kernel version which we have done on a couple of
>>> kernel versions
>>> before 5.16. Firmly, we are 100% supportive of large anon folios
>>> things you are leading.
>>
>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
>> it closer :). If you had any ability to do any A/B performance testing, it would
>> be very interesting to see how this stacks up against your solution - if there
>> are gaps it would be good to know where and develop a plan to plug the gap.
>>
>
> sure.
>
>>>
>>> A big pain was we found lots of races especially on CONTPTE unfolding
>>> and especially a part
>>> of basepages ran away from the 16 CONPTEs group since userspace is
>>> always working
>>> on basepages, having no idea of small-THP. We ran our code on millions of
>>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
>>> no outstanding issue.
>>
>> I'm going to be brave and say that my solution shouldn't suffer from these
>> problems; but of course the proof is only in the testing. I did a lot of work
>> with our architecture group and micro architects to determine exactly what is
>> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
>> optimization in patch 13 (see the commit log for details). Of course this has
>> all been checked with partners and we are confident that all existing
>> implementations conform to the modified wording.
>
> cool. I like your try_unfold/fold code. it seems your code is setting/dropping
> CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
> our code is always stupidly checking some conditions before setting and dropping
> CONT everywhere.
>
>>
>>>
>>> Particularly for the rmap issue we are discussing, our out-of-tree is
>>> using the entire_map for
>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
>>> CONTPTE from mm-core.
>>>
>>> We are doing this in mm/memory.c
>>>
>>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
>>> vm_area_struct *src_vma,
>>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> struct page **prealloc)
>>> {
>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>> unsigned long vm_flags = src_vma->vm_flags;
>>> pte_t pte = *src_pte;
>>> struct page *page;
>>>
>>> page = vm_normal_page(src_vma, addr, pte);
>>> ...
>>>
>>> get_page(page);
>>> page_dup_rmap(page, true); // an entire dup_rmap as you can
>>> see.............
>>> rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
>>> }
>>>
>>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>>>
>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>>> unsigned long haddr, bool freeze)
>>> {
>>> ...
>>> if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
>>> for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
>>> atomic_inc(&head[i]._mapcount);
>>> atomic_long_inc(&cont_pte_double_map_count);
>>> }
>>>
>>>
>>> if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
>>> ...
>>> }
>>>
>>> I am not selling our solution any more, but just showing you some differences we
>>> have :-)
>>
>> OK, I understand what you were saying now. I'm currently struggling to see how
>> this could fit into my model. Do you have any workloads and numbers on perf
>> improvement of using entire_mapcount?
>
> TBH, I don't have any data on this as from the first beginning, we were using
> entire_map. So I have no comparison at all.
>
>>
>>>
>>>>
>>>>>
>>>>> BTW, I have concerns that a variable small-THP size will really work
>>>>> as userspace
>>>>> is probably friendly to only one fixed size. for example, userspace
>>>>> heap management
>>>>> might be optimized to a size for freeing memory to the kernel. it is
>>>>> very difficult
>>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>>>> size not equal with, and particularly smaller than small-THP size will
>>>>> defeat all
>>>>> efforts to use small-THP.
>>>>
>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
>>>> say that as currently defined, the small-sized THP interface to user space
>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>>>> single size can be enabled. I'm diliberately punting that decision away from the
>>>> kernel for now.
>>>
>>> Basically, userspace heap library has a PAGESIZE setting and allows users
>>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
>>> The default size is for sure equal to the basepage SIZE. once some objects are
>>> freed by free() and libc get a free "page", userspace heap libraries might free
>>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
>>> it is quite similar with kernel slab.
>>>
>>> so imagine we have small-THP now, but userspace libraries have *NO*
>>> idea at all, so it can frequently cause unfolding.
>>>
>>>>
>>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>>>
>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
>>>> the deferred split list, then under memory pressure it is split and the unused
>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
>>>> performance impact?
>>>
>>> right. If this is happening on the majority of small-THP folios, we
>>> don't have performance
>>> improvement, and probably regression instead. This is really true on
>>> real workloads!!
>>>
>>> So that is why we really love a per-VMA hint to enable small-THP but
>>> obviously you
>>> have already supported it now by
>>> mm: thp: Introduce per-size thp sysfs interface
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
>>> can set the VMA flag when it is quite sure this VMA is working with
>>> the alignment
>>> of 64KB?
>>
>> Yes, that all exists in the series today. We have also discussed the possibility
>> of adding a new madvise_process() call that would take the set of THP sizes that
>> should be considered. Then you can set different VMAs to use different sizes;
>> the plan was to layer that on top if/when a workload was identified. Sounds like
>> you might be able to help there?
>
> i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
> for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
> set a flag in this VMA and try to allocate 64KB.

When you say "we set a flag" do you mean user space? Or is there some heuristic
in the kernel?

>
> But I will try to understand this requirement to madvise THP sizes on a specific
> VMA.
>
>>
>>>
>>>>
>>>> Regardless, it would be good to move this conversation to the small-sized THP
>>>> patch series since this is all independent of contpte mappings.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>>>>>> still atomic in a
>>>>>>> spinlock area.
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> David / dhildenb
>>>>>>>>>
>>>>>>>
>>>>>
>
> Thanks
> Barry

2023-11-28 11:01:45

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 00:11, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 05:54, Barry Song wrote:
>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>> + pte_t *dst_pte, pte_t *src_pte,
>>>> + unsigned long addr, unsigned long end,
>>>> + int *rss, struct folio **prealloc)
>>>> {
>>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>>> unsigned long vm_flags = src_vma->vm_flags;
>>>> pte_t pte = ptep_get(src_pte);
>>>> struct page *page;
>>>> struct folio *folio;
>>>> + int nr = 1;
>>>> + bool anon;
>>>> + bool any_dirty = pte_dirty(pte);
>>>> + int i;
>>>>
>>>> page = vm_normal_page(src_vma, addr, pte);
>>>> - if (page)
>>>> + if (page) {
>>>> folio = page_folio(page);
>>>> - if (page && folio_test_anon(folio)) {
>>>> - /*
>>>> - * If this page may have been pinned by the parent process,
>>>> - * copy the page immediately for the child so that we'll always
>>>> - * guarantee the pinned page won't be randomly replaced in the
>>>> - * future.
>>>> - */
>>>> - folio_get(folio);
>>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>> - /* Page may be pinned, we have to copy. */
>>>> - folio_put(folio);
>>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>> - addr, rss, prealloc, page);
>>>> + anon = folio_test_anon(folio);
>>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>> + end, pte, &any_dirty);
>>>
>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
>>> do madvise(addr + 4KB * 5, DONTNEED);
>>
>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
>> page0~page4 and page6~15?
>>
>>>
>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
>>
>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
>> not how its intended to work. The function is scanning forwards from the current
>> pte until it finds the first pte that does not fit in the batch - either because
>> it maps a PFN that is not contiguous, or because the permissions are different
>> (although this is being relaxed a bit; see conversation with DavidH against this
>> same patch).
>>
>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
>> (page0~page4) then the next time through the loop we will go through the
>> !present path and process the single swap marker. Then the 3rd time through the
>> loop folio_nr_pages_cont_mapped() will return 10.
>
> one case we have met by running hundreds of real phones is as below,
>
>
> static int
> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> unsigned long end)
> {
> ...
> dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> if (!dst_pte) {
> ret = -ENOMEM;
> goto out;
> }
> src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> if (!src_pte) {
> pte_unmap_unlock(dst_pte, dst_ptl);
> /* ret == 0 */
> goto out;
> }
> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> orig_src_pte = src_pte;
> orig_dst_pte = dst_pte;
> arch_enter_lazy_mmu_mode();
>
> do {
> /*
> * We are holding two locks at this point - either of them
> * could generate latencies in another task on another CPU.
> */
> if (progress >= 32) {
> progress = 0;
> if (need_resched() ||
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
> ptent = ptep_get(src_pte);
> if (pte_none(ptent)) {
> progress++;
> continue;
> }
>
> the above iteration can break when progress > =32. for example, at the
> beginning,
> if all PTEs are none, we break when progress >=32, and we break when we
> are in the 8th pte of 16PTEs which might become CONTPTE after we release
> PTL.
>
> since we are releasing PTLs, next time when we get PTL, those pte_none() might
> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> thus, immediately
> break the consistent CONPTEs rule of hardware?
>
> pte0 - pte_none
> pte1 - pte_none
> ...
> pte7 - pte_none
>
> pte8 - pte_cont
> ...
> pte15 - pte_cont
>
> so we did some modification to avoid a break in the middle of PTEs
> which can potentially
> become CONTPE.
> do {
> /*
> * We are holding two locks at this point - either of them
> * could generate latencies in another task on another CPU.
> */
> if (progress >= 32) {
> progress = 0;
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> /*
> * XXX: don't release ptl at an unligned address as
> cont_pte might form while
> * ptl is released, this causes double-map
> */
> if (!vma_is_chp_anonymous(src_vma) ||
> (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> HPAGE_CONT_PTE_SIZE)))
> #endif
> if (need_resched() ||
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
>
> We could only reproduce the above issue by running thousands of phones.
>
> Does your code survive from this problem?

Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
bit is not blindly "copied" from parent to child pte. As far as the core-mm is
concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
will see some pte_none() entries followed by some pte_present() entries. And
when calling set_ptes() on the child, the arch code will evaluate the current
state of the pgtable along with the new set_ptes() request and determine where
it should insert the CONT_PTE bit.
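
To give a rough flavour of what that evaluation looks like (this is an
illustrative sketch, not the code from the series: CONT_PTES, CONT_PTE_SIZE,
pte_valid() and pte_pfn() are existing arm64 definitions, __ptep_get() follows
the naming introduced by the series, and the helper name is mine):

/*
 * Illustrative only: the sort of check the arch layer can make before
 * deciding to set PTE_CONT on a naturally aligned block of entries.
 */
static bool contpte_block_is_foldable(unsigned long addr, pte_t *ptep, pte_t pte)
{
	unsigned long start = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	pte_t *first = ptep - ((addr - start) >> PAGE_SHIFT);
	unsigned long pfn = pte_pfn(pte) - ((addr - start) >> PAGE_SHIFT);
	int i;

	for (i = 0; i < CONT_PTES; i++, pfn++) {
		pte_t entry = __ptep_get(first + i);

		/* Every entry must be valid and map the expected pfn. */
		if (!pte_valid(entry) || pte_pfn(entry) != pfn)
			return false;
	}

	/* Plus (checked elsewhere) all pages must belong to the same folio. */
	return true;
}

In the scenario above, the child's block still contains pte_none() entries when
pte8-pte15 are copied, so a check like this fails and the CONT_PTE bit is simply
not set at that point.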

>
>>
>> Thanks,
>> Ryan
>>
>>>
>>> but the current code is copying page0~page14, right? unless we are immediatly
>>> split_folio to basepages in zap_pte_range(), we will have problems?
>>>
>>>> +
>>>> + for (i = 0; i < nr; i++, page++) {
>>>> + if (anon) {
>>>> + /*
>>>> + * If this page may have been pinned by the
>>>> + * parent process, copy the page immediately for
>>>> + * the child so that we'll always guarantee the
>>>> + * pinned page won't be randomly replaced in the
>>>> + * future.
>>>> + */
>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>> + page, false, src_vma))) {
>>>> + if (i != 0)
>>>> + break;
>>>> + /* Page may be pinned, we have to copy. */
>>>> + return copy_present_page(
>>>> + dst_vma, src_vma, dst_pte,
>>>> + src_pte, addr, rss, prealloc,
>>>> + page);
>>>> + }
>>>> + rss[MM_ANONPAGES]++;
>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>> + } else {
>>>> + page_dup_file_rmap(page, false);
>>>> + rss[mm_counter_file(page)]++;
>>>> + }
>>>
>
> Thanks
> Barry

2023-11-28 11:16:31

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 28/11/2023 07:32, Barry Song wrote:
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep, int full)
>> +{
>> + pte_t orig_pte = __ptep_get(ptep);
>> +
>> + if (!pte_valid_cont(orig_pte) || !full) {
>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> + } else
>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>> +}
>> +
>
> Hi Ryan,
>
> I feel quite hard to understand the code. when !pte_valid_cont(orig_pte),
> we will call contpte_try_unfold(mm, addr, ptep, orig_pte);
>
> but in contpte_try_unfold(), we call unfold only if pte_valid_cont()
> is true:
> static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte)
> {
> if (contpte_is_enabled(mm) && pte_valid_cont(pte))
> __contpte_try_unfold(mm, addr, ptep, pte);
> }
>
> so do you mean the below?
>
> if (!pte_valid_cont(orig_pte))
> return __ptep_get_and_clear(mm, addr, ptep);
>
> if (!full) {
> contpte_try_unfold(mm, addr, ptep, orig_pte);
> return __ptep_get_and_clear(mm, addr, ptep);
> } else {
> return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> }

Yes, this is equivalent. In general, I was trying not to spray `if
(pte_valid_cont(orig_pte))` checks everywhere to guard contpte_try_unfold() and
instead put the checks into contpte_try_unfold() (hence the 'try'). I figured
just calling it unconditionally and letting the compiler optimize as it sees fit
was the cleanest approach.

But in this instance I can see this is confusing. I'll modify as you suggest.
Thanks!
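
For reference, the reworked helper would look roughly like this (same
__ptep_*/contpte_* helpers as in the patch, just restructured along the lines
you suggest):

#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
				unsigned long addr, pte_t *ptep, int full)
{
	pte_t orig_pte = __ptep_get(ptep);

	/* Not part of a folded contpte block: nothing to unfold. */
	if (!pte_valid_cont(orig_pte))
		return __ptep_get_and_clear(mm, addr, ptep);

	/* Partial teardown: unfold the block before clearing this entry. */
	if (!full) {
		contpte_try_unfold(mm, addr, ptep, orig_pte);
		return __ptep_get_and_clear(mm, addr, ptep);
	}

	/* Full address space teardown: skip the unfold (and its tlbi). */
	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
}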

>
> Thanks
> Barry
>
>

2023-11-28 11:49:57

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 28/11/2023 08:17, Barry Song wrote:
>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep)
>> +{
>> + /*
>> + * When doing a full address space teardown, we can avoid unfolding the
>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>> + * just get and clear the pte. The caller is promising to call us for
>> + * every pte, so every pte in the range will be cleared by the time the
>> + * tlbi is issued.
>> + *
>> + * This approach is not perfect though, as for the duration between
>> + * returning from the first call to ptep_get_and_clear_full() and making
>> + * the final call, the contpte block in an intermediate state, where
>> + * some ptes are cleared and others are still set with the PTE_CONT bit.
>> + * If any other APIs are called for the ptes in the contpte block during
>> + * that time, we have to be very careful. The core code currently
>> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
>> + * ptep_get() must be careful to ignore the cleared entries when
>> + * accumulating the access and dirty bits - the same goes for
>> + * ptep_get_lockless(). The only other calls we might resonably expect
>> + * are to set markers in the previously cleared ptes. (We shouldn't see
>> + * valid entries being set until after the tlbi, at which point we are
>> + * no longer in the intermediate state). Since markers are not valid,
>> + * this is safe; set_ptes() will see the old, invalid entry and will not
>> + * attempt to unfold. And the new pte is also invalid so it won't
>> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
>> + *
>> + * The last remaining issue is returning the access/dirty bits. That
>> + * info could be present in any of the ptes in the contpte block.
>> + * ptep_get() will gather those bits from across the contpte block. We
>> + * don't bother doing that here, because we know that the information is
>> + * used by the core-mm to mark the underlying folio as accessed/dirty.
>> + * And since the same folio must be underpinning the whole block (that
>> + * was a requirement for folding in the first place), that information
>> + * will make it to the folio eventually once all the ptes have been
>> + * cleared. This approach means we don't have to play games with
>> + * accumulating and storing the bits. It does mean that any interleaved
>> + * calls to ptep_get() may lack correct access/dirty information if we
>> + * have already cleared the pte that happened to store it. The core code
>> + * does not rely on this though.
>
> even without any other threads running and touching those PTEs, this won't survive
> on some hardware. we expose inconsistent CONTPTEs to hardware, this might result

No, that's not the case; if you read the Arm ARM, the page table is only
considered "misprogrammed" when *valid* entries within the same contpte block
have different values for the contiguous bit. We are clearing the ptes to zero
here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
(either due to an explicit tlbi, as you point out below, or due to a concurrent
TLB miss which selects our entry for eviction to make space for the new incoming
entry), then when an access request arrives for an address in our partially
cleared contpte block, the address will either be:

A) an address for a pte entry we have already cleared, so it's invalid and it
will fault (and get serialized behind the PTL).

or

B) an address for a pte entry we haven't yet cleared, so it will re-form a TLB
entry for the contpte block. But that's OK, because the memory still exists;
we haven't yet finished clearing the page table and have not yet issued the
final tlbi.


> in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
> seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
> with dropped CONT but still some other PTEs have CONT, we make hardware totally
> confused.

I suspect this is because in your case you are "misprogramming" the contpte
block; there are *valid* pte entries within the block that disagree about the
contiguous bit or about various other fields. In that situation some HW TLB
designs can do weird things. That is likely resulting in accesses to bad memory
space, causing an SError, which is trapped by EL3, and the FW is probably just
panicking at that point.

>
> zap_pte_range() has a force_flush when tlbbatch is full:
>
> if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> force_flush = 1;
> addr += PAGE_SIZE;
> break;
> }
>
> this means you can expose partial tlbi/flush directly to hardware while some
> other PTEs are still CONT.

Yes, but that's also possible even if we have a tight loop that clears down the
contpte block; there could still be another core that issues a tlbi while you're
halfway through that loop, or the HW could happen to evict the entry due to TLB
pressure at any time. The point is, it's safe if you are clearing the pte to an *invalid*
entry.

>
> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> than clearing one PTE.
>
> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> of the large folio:
>
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> if (pte_cont(ptent)) {
> unsigned long next = pte_cont_addr_end(addr, end);
>
> if (next - addr != HPAGE_CONT_PTE_SIZE) {
> __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> /*
> * After splitting cont-pte
> * we need to process pte again.
> */
> goto again_pte;
> } else {
> cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
>
> tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> if (unlikely(!page))
> continue;
>
> if (is_huge_zero_page(page)) {
> tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> goto cont_next;
> }
>
> rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> page_remove_rmap(page, true);
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
>
> tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> }
> cont_next:
> /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> addr = next - PAGE_SIZE;
> continue;
> }
> #endif
>
> this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> it never requires tlb->fullmm at all.

Yes, but you are benefitting from the fact that contpte is exposed to core-mm
and that core-mm special-cases it at that level. I'm trying to avoid that.

I don't think there is any correctness issue here. But there is a problem with
fragility, as raised by Alistair. I have some ideas on how that could
potentially be solved. I'm going to try to work on it this afternoon and will
post if I gain some confidence that it is a real solution.

Thanks,
Ryan

>
> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> unsigned long addr,
> pte_t *ptep,
> bool flush)
> {
> pte_t orig_pte = ptep_get(ptep);
>
> CHP_BUG_ON(!pte_cont(orig_pte));
> CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
>
> return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> }
>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
>
>> + */
>> +
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>> +
>
> Thanks
> Barry
>
>

2023-11-28 11:53:12

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 27/11/2023 22:53, Barry Song wrote:
> On Tue, Nov 28, 2023 at 12:11 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 10:35, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 03:18, Barry Song wrote:
>>>>>> Ryan Roberts (14):
>>>>>> mm: Batch-copy PTE ranges during fork()
>>>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>>>
>>>>> Hi Ryan,
>>>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>>>> in the below cases
>>>>
>>>> The general idea is that the core-mm sets the individual ptes (one at a time if
>>>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>>>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>>>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>>>
>>>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>>>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>>>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>>>> the block is "folded". And it will detect when the first PTE in the block
>>>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>>>> requirements for folding a contpte block is that all the pages must belong to
>>>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>>>> block as a whole rather than for each individual pte).
>>>>
>>>> (there are a couple of optimizations that make the reality slightly more
>>>> complicated than what I've just explained, but you get the idea).
>>>>
>>>> On that basis, I believe all the specific cases you describe below are all
>>>> covered and safe - please let me know if you think there is a hole here!
>>>>
>>>>>
>>>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>>>
>>>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>>>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>>>> is removed from the whole contpte block. If there is then a subsequent
>>>> set_pte_at() to set a swap entry, the implementation will see that its not
>>>> appropriate to re-fold, so the range will remain unfolded.
>>>>
>>>>>
>>>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>>>> munmap, mlock etc.
>>>>
>>>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>>>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>>>> block, it will remain folded. But this is safe and continues to work correctly.
>>>> The VMA arrangement is not important; it is just important that a single folio
>>>> is mapped contiguously across the whole block.
>>>
>>> I don't think it is safe to keep CONTPTE folded in a split_vma case. as
>>> otherwise, copy_ptes in your other patch might only copy a part
>>> of CONTPES.
>>> For example, if page0-page4 and page5-page15 are splitted in split_vma,
>>> in fork, while copying pte for the first VMA, we are copying page0-page4,
>>> this will immediately cause inconsistent CONTPTE. as we have to
>>> make sure all CONTPTEs are atomically mapped in a PTL.
>>
>> No that's not how it works. The CONT_PTE bit is not blindly copied from parent
>> to child. It is explicitly managed by the arch code and set when appropriate. In
>> the case above, we will end up calling set_ptes() for page0-page4 in the child.
>> set_ptes() will notice that there are only 5 contiguous pages so it will map
>> without the CONT_PTE bit.
>
> Ok. cool. alternatively, in the code I shared to you, we are doing an unfold
> immediately when split_vma happens within a large anon folio, so we disallow
> CONTPTE to cross two VMAs to avoid all kinds of complexity afterwards.
>
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/huge_memory.c
>
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> void vma_adjust_cont_pte_trans_huge(struct vm_area_struct *vma,
> unsigned long start,
> unsigned long end,
> long adjust_next)
> {
> /*
> * If the new start address isn't hpage aligned and it could
> * previously contain an hugepage: check if we need to split
> * an huge pmd.
> */
> if (start & ~HPAGE_CONT_PTE_MASK &&
> (start & HPAGE_CONT_PTE_MASK) >= vma->vm_start &&
> (start & HPAGE_CONT_PTE_MASK) + HPAGE_CONT_PTE_SIZE <= vma->vm_end)
> split_huge_cont_pte_address(vma, start, false, NULL);
>
> ....
> }
> #endif
>
> In your approach, you are still holding CONTPTE crossing two VMAs. but it seems
> ok. I can't have a case which might fail in my brain right now. only

Yes, I'm dealing with the CONT_PTE bit at the pgtable level, not at the VMA level.


> running the code on
> a large amount of real hardware will tell :-)

Indeed - is this something you might be able to help with? :)

>
>>
>>>
>>>>
>>>>>
>>>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>>>> rather than being as a whole.
>>>>
>>>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>>>> cleared and unfold the contpte block.
>>>>
>>>>>
>>>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>>>> contiguous physical address with CONTPTE set. if one of them run away
>>>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>>>> errors/faults can happen in HW. for example
>>>>
>>>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>>>> contpte_try_unfold(). the fold/unfold operation is only done when all
>>>> requirements are met, and we perform it in a manner that is conformant to the
>>>> architecture requirements (see contpte_fold() - being renamed to
>>>> contpte_convert() in the next version).
>>>>
>>>> Thanks for the review!
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>
>>>>> case0:
>>>>> addr0 PTE - has no CONTPE
>>>>> addr0+4kb PTE - has CONTPTE
>>>>> ....
>>>>> addr0+60kb PTE - has CONTPTE
>>>>>
>>>>> case 1:
>>>>> addr0 PTE - has no CONTPE
>>>>> addr0+4kb PTE - has CONTPTE
>>>>> ....
>>>>> addr0+60kb PTE - has swap
>>>>>
>>>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>>>> our observation.
>>>>>
>>>
>
> Thanks
> Barry

2023-11-28 11:58:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 28/11/2023 03:13, Yang Shi wrote:
> On Mon, Nov 27, 2023 at 1:15 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>> mm: Batch-copy PTE ranges during fork()
>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>
> Even with different permissions, for example, read-only vs read-write?
> The mprotect() may change the permission. It should be misprogramming
> per ARM ARM.

If the permissions are changed, then mprotect() must have called the pgtable
helpers to modify the page table (e.g. ptep_set_wrprotect(),
ptep_set_access_flags() or whatever). These functions will notice that the
contpte block is currently folded and unfold it before applying the permission
change. The unfolding process is done in a way that intentionally avoids
misprogramming as defined by the Arm ARM. See contpte_fold() in contpte.c.
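
As a concrete (simplified) illustration of that layering, the wrprotect wrapper
in the series boils down to roughly the following sketch; contpte_try_unfold()
is a no-op unless the existing entry is a valid entry with PTE_CONT set:

#define __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm,
				      unsigned long addr, pte_t *ptep)
{
	/*
	 * Unfold first if needed (roughly: clear the block, tlbi, then
	 * rewrite the entries without PTE_CONT), then modify the single
	 * entry as before.
	 */
	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
	__ptep_set_wrprotect(mm, addr, ptep);
}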

>
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>>> Thanks
>>> Barry
>>>
>>>
>>
>>

2023-11-28 12:08:50

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On 28/11/2023 05:49, Barry Song wrote:
> On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>> mm: Batch-copy PTE ranges during fork()
>>>> arm64/mm: set_pte(): New layer to manage contig bit
>>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>> arm64/mm: pte_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>> arm64/mm: ptep_get(): New layer to manage contig bit
>>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>> arm64/mm: Wire up PTE_CONT for user mappings
>>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>
> Hi Ryan,
>
> sorry for too many comments, I remembered another case
>
> 4. mremap
>
> a CONTPTE might be remapped to another address which might not be
> aligned with 16*basepage. thus, in move_ptes(), we are copying CONPTEs
> from src to dst.
> static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> unsigned long old_addr, unsigned long old_end,
> struct vm_area_struct *new_vma, pmd_t *new_pmd,
> unsigned long new_addr, bool need_rmap_locks)
> {
> struct mm_struct *mm = vma->vm_mm;
> pte_t *old_pte, *new_pte, pte;
> ...
>
> /*
> * We don't have to worry about the ordering of src and dst
> * pte locks because exclusive mmap_lock prevents deadlock.
> */
> old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> if (!old_pte) {
> err = -EAGAIN;
> goto out;
> }
> new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
> if (!new_pte) {
> pte_unmap_unlock(old_pte, old_ptl);
> err = -EAGAIN;
> goto out;
> }
> if (new_ptl != old_ptl)
> spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> flush_tlb_batched_pending(vma->vm_mm);
> arch_enter_lazy_mmu_mode();
>
> for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> new_pte++, new_addr += PAGE_SIZE) {
> if (pte_none(ptep_get(old_pte)))
> continue;
>
> pte = ptep_get_and_clear(mm, old_addr, old_pte);
> ....
> }
>
> This has two possibilities
> 1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE;
> 2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE
> while copying.
>
> does your code also handle this properly?

Yes; same mechanism - the arm64 arch code does the CONT_PTE bit management and
folds/unfolds as necessary.

Admittedly this may be non-optimal because we are iterating a single PTE at a
time. When we clear the first pte of a contpte block in the source, the block
will be unfolded. When we set the last pte of the contpte block in the dest, the
block will be folded. If we had a batching mechanism, we could just clear the
whole source contpte block in one hit (no need to unfold first) and we could
just set the dest contpte block in one hit (no need to fold at the end).

I haven't personally seen this as a hotspot though; I don't know if you have any
data to the contrary? I've followed this type of batching technique for the fork
case (patch 13). We could do a similar thing in theory, but it's a bit more
complex because of the ptep_get_and_clear() return value; you would need to
return all ptes for the cleared range, or somehow collapse the actual info that
the caller requires (presumably access/dirty info).
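
For example, a hypothetical batched variant (not something in this series; the
name and signature are made up, only the __ptep_* layering follows the patches)
could collapse the per-pte information by accumulating access/dirty into a
single returned pte:

static inline pte_t ptep_get_and_clear_nr(struct mm_struct *mm,
					  unsigned long addr, pte_t *ptep,
					  unsigned int nr)
{
	/* Clear the first entry and use it to carry the accumulated bits. */
	pte_t orig = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t pte = __ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
						 ptep + i);

		if (pte_dirty(pte))
			orig = pte_mkdirty(orig);
		if (pte_young(pte))
			orig = pte_mkyoung(orig);
	}

	return orig;
}

Whether a single accumulated pte is actually sufficient for the mremap caller
is exactly the open question above.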

>
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>>> Thanks
>>> Barry
>
> Thanks
> Barry

2023-11-28 12:46:39

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 28/11/2023 06:54, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 27/11/2023 07:34, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>>>
>>>>> Ryan Roberts <[email protected]> writes:
>>>>>
>>>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>>>
>>>>>>> Ryan Roberts <[email protected]> writes:
>>>>>>>
>>>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>>>> a full address space teardown is in progress. We use this information to
>>>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>>>> are about to clear the rest too.
>>>>>>>>
>>>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>>>> original cost.
>>>>>>>>
>>>>>>>> This approach is not perfect though, as for the duration between
>>>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>>>> other calls we might resonably expect are to set markers in the
>>>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>>>> the 'full' case anyway.
>>>>>>>>
>>>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>>>> doing that here, because we know that the information is used by the
>>>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>>>> for folding in the first place), that information will make it to the
>>>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>>>> means we don't have to play games with accumulating and storing the
>>>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>>>> happened to store it. The core code does not rely on this though.
>>>>>>>
>>>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>>>> potentially something that could change in the future though which would
>>>>>>> lead to some subtle bugs.
>>>>>>
>>>>>> Yes, there is a risk, although IMHO, its very small.
>>>>>>
>>>>>>>
>>>>>>> Would there be any way of avoiding this? Half-baked thought, but could
>>>>>>> you for example copy the access/dirty information to the last (or
>>>>>>> perhaps first, most likely invalid) PTE?
>>>>>>
>>>>>> I spent a long time thinking about this and came up with a number of
>>>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>>>> (which works but suffers from the problem that it depends on the way it is
>>>>>> called not changing).
>>>>>
>>>>> Ok, that answers my underlying question of "has someone thought about
>>>>> this and are there any easy solutions". I suspected that was the case
>>>>> given the excellent write up though!
>>>>>
>>>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>>>> contpte block. This is how I did it in v1, although it was racy. I think this
>>>>>> could be implemented correctly but it's extremely complex.
>>>>>>
>>>>>> 2) batch calls from the core-mm (like I did for ptep_set_wrprotects()) so that we
>>>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>>>> returns the pte that was cleared, so it's difficult to scale that up to multiple ptes.
>>>>>>
>>>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>>>> is no easy place to store that state (other than scanning a contpte block
>>>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>>>
>>>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>>>
>>>>>> What do you think? If we really need to solve this, then option 1 is my
>>>>>> preferred route, but it would take some time to figure out and reason about a
>>>>>> race-free scheme.
>>>>>
>>>>> Well I like simple, and I agree the risk is small. But I can't help feeling
>>>>> the current situation is too subtle, mainly because it is architecture
>>>>> specific and the assumptions are not communicated in the core-mm code
>>>>> anywhere. But also none of the alternatives seem much better.
>>>>>
>>>>> However there are only three callers of ptep_get_and_clear_full(), and
>>>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>>>> just about all users of ptep_get*(), which will take the PTL beforehand.
>>>>
>>>> The problem isn't racing threads because as you say, the PTL is already
>>>> serializing all calls except ptep_get_lockless(). And although there are 3
>>>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>>>> calls it with full=1, as I recall.
>>>>
>>>> The problem is that the caller in zap_pte_range() does this:
>>>>
>>>> ptl = lock_page_table()
>>>> for each pte {
>>>>     ptent = ptep_get(pte);
>>>>     if (pte_present(ptent)) {
>>>>         ptent = ptep_get_and_clear_full(ptent);
>>>>         if (pte_dirty(ptent))
>>>>             ...
>>>>         if (pte_young(ptent))
>>>>             ...
>>>>     }
>>>> }
>>>> unlock_page_table(ptl)
>>>>
>>>> It deliberately interleaves calls to ptep_get() and ptep_get_and_clear_full()
>>>> under the ptl. So if the loop is iterating over a contpte block and the HW
>>>> happens to be storing the access/dirty info in the first pte entry, then the
>>>> first time through the loop, ptep_get() will return the correct access/dirty
>>>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>>>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>>>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>>>
>>> So if ptep_get_lockless() isn't a concern what made the option posted in
>>> v1 racy (your option 1 above)? Is there something else reading PTEs or
>>> clearing PTE bits without holding the PTL that I'm missing?
>>
>> The HW could be racing to set access and dirty bits. Well actually, I'm not
>> completely sure if that's the case here; if full=1 then presumably no other
>> threads in the process should be running at this point, so perhaps it can be
>> guaranteed that nothing is causing a concurrent memory access and the HW is
>> therefore definitely not going to try to write the access/dirty bits
>> concurrently. But I didn't manage to convince myself that's definitely the case.
>
> I suppose it's possible something attached to an SMMU or some such could
> still be running and causing accesses, so I agree it's probably not the
> case (although it would be an odd corner case).

Indeed.

>
>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>> buggy because it iterated through the PTEs, getting and accumulating. Then
>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>> have modified the bits during those loops. I think it would be possible to fix
>> the race, but intuition says it would be expensive.
>
> So the issue as I understand it is subsequent iterations would see a
> clean PTE after the first iteration returned a dirty PTE. In
> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
> bit (if set) from the PTE being cleared to an adjacent PTE rather than
> all the PTEs?

The raciness I'm describing is the race between reading access/dirty from one
pte and applying it to another. But yes, I like your suggestion. If we do:

pte = __ptep_get_and_clear_full(ptep)

on the target pte, then we have grabbed access/dirty from it in a race-free
manner. we can then loop from current pte up towards the top of the block until
we find a valid entry (and I guess wrap at the top to make us robust against
future callers clearing in an arbitrary order). Then atomically accumulate the
access/dirty bits we have just saved into that new entry. I guess that's just a
cmpxchg loop - there are already examples of how to do that correctly when
racing the TLB.

For most entries, we will just be copying up to the next pte. For the last pte,
we would end up reading all the ptes and determining we are the last one.
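
Roughly, the sort of thing I have in mind is the sketch below (illustrative
only - stash_access_dirty() is a made-up name and the real thing would need
auditing against the races described above; __ptep_get(), contpte_align_down(),
CONT_PTES and the pte accessors are the existing arm64 helpers):

/*
 * Sketch only (not a tested patch): after clearing the target pte, scan
 * upwards through the contpte block, wrapping at the top, for the first
 * still-valid entry, and atomically accumulate the saved access/dirty
 * bits into it with a cmpxchg loop so we don't lose bits the HW may be
 * setting concurrently.
 */
static void stash_access_dirty(pte_t *ptep, pte_t cleared)
{
	pte_t *start = contpte_align_down(ptep);
	int idx = ptep - start;
	int i;

	if (!pte_dirty(cleared) && !pte_young(cleared))
		return;

	for (i = 1; i < CONT_PTES; i++) {
		pte_t *p = start + ((idx + i) % CONT_PTES);
		pteval_t old = pte_val(__ptep_get(p));

		if (!pte_valid(__pte(old)))
			continue;	/* already cleared; try the next pte */

		for (;;) {
			pteval_t new = old;
			pteval_t prev;

			if (pte_dirty(cleared))
				new = pte_val(pte_mkdirty(__pte(new)));
			if (pte_young(cleared))
				new = pte_val(pte_mkyoung(__pte(new)));

			prev = cmpxchg_relaxed(&pte_val(*p), old, new);
			if (prev == old)
				return;	/* stashed in a yet-to-be-cleared pte */
			old = prev;	/* HW updated access/dirty; retry */
		}
	}
}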

What do you think?

>
> That would fix the inconsistency as far as subsequent iterations of
> ptep_get_and_clear_full() returning the dirty/accessed if a previous
> iteration did. Obviously HW could still race and cause a previously
> clean iteration to return dirty, but that seems ok.
>
> However all this has just left me with more questions :-)
>
> Won't clearing bits like this result in inconsistent programming of the
> PTE_CONT bit? What happens if HW access a page in the contiguous region
> while some of the PTEs are invalid? And same question for programming
> them really - I don't think we can atomically set PTE_CONT on multiple
> PTEs all at once so if we assume something can be accessing them
> concurrently how do we do that without having some HW observing an
> intermediate state where PTE_CONT is misprogrammed?

I'm fairly confident at this point that it is safe and conformant to the
architecture. See my explanation at [1]. Of course having people look for holes
is very welcome - so thanks!

There is also a clarification we made to the Arm ARM, primarily for patch 13 (see
the commit log), which also clarifies that invalid ptes are not taken into account
when considering whether a contpte block is misprogrammed. See section
D21194 at [2].


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://developer.arm.com/documentation/102105/latest/

>
> Thanks.
>
>>>
>>>>>
>>>>> So really that only leaves ptep_get_lockless() that could/should
>>>>> interleave right?
>>>>
>>>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>>>> is very careful to ensure that the contpte block is in a consistent state and it
>>>> keeps trying until it is. So this will always return the correct consistent
>>>> information.
>>>>
>>>>> From a quick glance of those users none look at the
>>>>> young/dirty information anyway, so I wonder if we can just assert in the
>>>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>>>> and clear it in the helpers? That would make things explicit and
>>>>> consistent which would address my concern (although I haven't looked too
>>>>> closely at the details there).
>>>>
>>>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>>>> don't think this helps.
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>
>>>
>

2023-11-28 16:55:26

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>> have modified the bits during those loops. I think it would be possible to fix
>>> the race, but intuition says it would be expensive.
>>
>> So the issue as I understand it is subsequent iterations would see a
>> clean PTE after the first iteration returned a dirty PTE. In
>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>> all the PTEs?
>
> The raciness I'm describing is the race between reading access/dirty from one
> pte and applying it to another. But yes, I like your suggestion. If we do:
>
> pte = __ptep_get_and_clear_full(ptep)
>
> on the target pte, then we have grabbed access/dirty from it in a race-free
> manner. We can then loop from the current pte up towards the top of the block until
> we find a valid entry (and I guess wrap at the top to make us robust against
> future callers clearing in an arbitrary order). Then atomically accumulate the
> access/dirty bits we have just saved into that new entry. I guess that's just a
> cmpxchg loop - there are already examples of how to do that correctly when
> racing the TLB.
>
> For most entries, we will just be copying up to the next pte. For the last pte,
> we would end up reading all the ptes and determining we are the last one.
>
> What do you think?

OK, here is an attempt at something which solves the fragility. I think this is
now robust and will always return the correct access/dirty state from
ptep_get_and_clear_full() and ptep_get().

But I'm not sure about performance; each call to ptep_get_and_clear_full() for
each pte in a contpte block will cause a ptep_get() to gather the access/dirty
bits from across the contpte block - which requires reading each pte in the
contpte block. So it's O(n^2) in that sense. I'll benchmark it and report back.

Was this the type of thing you were thinking of, Alistair?


--8<--
arch/arm64/include/asm/pgtable.h | 23 ++++++++-
arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 38 +++++++++------
3 files changed, 125 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9bd2f57a9e11..6c295d277784 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
 }
 
+extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
 extern int __ptep_set_access_flags(struct vm_area_struct *vma,
				    unsigned long address, pte_t *ptep,
				    pte_t entry, int dirty);
@@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, int full)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_valid_cont(orig_pte))
+		return __ptep_get_and_clear(mm, addr, ptep);
+
+	if (!full) {
+		contpte_try_unfold(mm, addr, ptep, orig_pte);
+		return __ptep_get_and_clear(mm, addr, ptep);
+	}
+
+	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	return __ptep_get_and_clear(mm, addr, ptep);
+	return ptep_get_and_clear_full(mm, addr, ptep, 0);
 }
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 2a57df16bf58..99b211118d93 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
+		/*
+		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
+		 * where some of the ptes in the range may be cleared but others
+		 * are still to do. See contpte_ptep_get_and_clear_full().
+		 */
+		if (!pte_valid(pte))
+			continue;
+
 		if (pte_dirty(pte))
 			orig_pte = pte_mkdirty(orig_pte);
 
@@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(contpte_set_ptes);
 
+pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * When doing a full address space teardown, we can avoid unfolding the
+	 * contiguous range, and therefore avoid the associated tlbi. Instead,
+	 * just get and clear the pte. The caller is promising to call us for
+	 * every pte, so every pte in the range will be cleared by the time the
+	 * final tlbi is issued.
+	 *
+	 * This approach requires some complex hoop jumping though, as for the
+	 * duration between returning from the first call to
+	 * ptep_get_and_clear_full() and making the final call, the contpte
+	 * block is in an intermediate state, where some ptes are cleared and
+	 * others are still set with the PTE_CONT bit. If any other APIs are
+	 * called for the ptes in the contpte block during that time, we have to
+	 * be very careful. The core code currently interleaves calls to
+	 * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
+	 * careful to ignore the cleared entries when accumulating the access
+	 * and dirty bits - the same goes for ptep_get_lockless(). The only
+	 * other calls we might reasonably expect are to set markers in the
+	 * previously cleared ptes. (We shouldn't see valid entries being set
+	 * until after the tlbi, at which point we are no longer in the
+	 * intermediate state). Since markers are not valid, this is safe;
+	 * set_ptes() will see the old, invalid entry and will not attempt to
+	 * unfold. And the new pte is also invalid so it won't attempt to fold.
+	 * We shouldn't see pte markers being set for the 'full' case anyway
+	 * since the address space is being torn down.
+	 *
+	 * The last remaining issue is returning the access/dirty bits. That
+	 * info could be present in any of the ptes in the contpte block.
+	 * ptep_get() will gather those bits from across the contpte block (for
+	 * the remaining valid entries). So below, if the pte we are clearing
+	 * has dirty or young set, we need to stash it into a pte that we are
+	 * yet to clear. This allows future calls to return the correct state
+	 * even when the info was stored in a different pte. Since the core-mm
+	 * calls from low to high address, we prefer to stash in the last pte of
+	 * the contpte block - this means we are not "dragging" the bits up
+	 * through all ptes and increases the chances that we can exit early
+	 * because a given pte will have neither dirty or young set.
+	 */
+
+	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
+	bool dirty = pte_dirty(orig_pte);
+	bool young = pte_young(orig_pte);
+	pte_t *start;
+
+	if (!dirty && !young)
+		return contpte_ptep_get(ptep, orig_pte);
+
+	start = contpte_align_down(ptep);
+	ptep = start + CONT_PTES - 1;
+
+	for (; ptep >= start; ptep--) {
+		pte_t pte = __ptep_get(ptep);
+
+		if (!pte_valid(pte))
+			continue;
+
+		if (dirty)
+			pte = pte_mkdirty(pte);
+
+		if (young)
+			pte = pte_mkyoung(pte);
+
+		__ptep_set_access_flags_notlbi(ptep, pte);
+		return contpte_ptep_get(ptep, orig_pte);
+	}
+
+	return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
+
 int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
				unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index d63f3a0a7251..b22216a8153c 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
 	pr_cont("\n");
 }
 
-/*
- * This function sets the access flags (dirty, accessed), as well as write
- * permission, and only to a more permissive setting.
- *
- * It needs to cope with hardware update of the accessed/dirty state by other
- * agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like __set_ptes(), the PTE is never changed from no-exec to exec here.
- *
- * Returns whether or not the PTE actually changed.
- */
-int __ptep_set_access_flags(struct vm_area_struct *vma,
-			    unsigned long address, pte_t *ptep,
-			    pte_t entry, int dirty)
+int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
 {
 	pteval_t old_pteval, pteval;
 	pte_t pte = __ptep_get(ptep);
@@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
 	} while (pteval != old_pteval);
 
+	return 1;
+}
+
+/*
+ * This function sets the access flags (dirty, accessed), as well as write
+ * permission, and only to a more permissive setting.
+ *
+ * It needs to cope with hardware update of the accessed/dirty state by other
+ * agents in the system and can safely skip the __sync_icache_dcache() call as,
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
+ *
+ * Returns whether or not the PTE actually changed.
+ */
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+			    unsigned long address, pte_t *ptep,
+			    pte_t entry, int dirty)
+{
+	int changed = __ptep_set_access_flags_notlbi(ptep, entry);
+
 	/* Invalidate a stale read-only entry */
-	if (dirty)
+	if (changed && dirty)
 		flush_tlb_page(vma, address);
-	return 1;
+
+	return changed;
 }
 
 static bool is_el1_instruction_abort(unsigned long esr)
--8<--

2023-11-28 19:01:29

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 00:11, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 05:54, Barry Song wrote:
> >>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>> + pte_t *dst_pte, pte_t *src_pte,
> >>>> + unsigned long addr, unsigned long end,
> >>>> + int *rss, struct folio **prealloc)
> >>>> {
> >>>> struct mm_struct *src_mm = src_vma->vm_mm;
> >>>> unsigned long vm_flags = src_vma->vm_flags;
> >>>> pte_t pte = ptep_get(src_pte);
> >>>> struct page *page;
> >>>> struct folio *folio;
> >>>> + int nr = 1;
> >>>> + bool anon;
> >>>> + bool any_dirty = pte_dirty(pte);
> >>>> + int i;
> >>>>
> >>>> page = vm_normal_page(src_vma, addr, pte);
> >>>> - if (page)
> >>>> + if (page) {
> >>>> folio = page_folio(page);
> >>>> - if (page && folio_test_anon(folio)) {
> >>>> - /*
> >>>> - * If this page may have been pinned by the parent process,
> >>>> - * copy the page immediately for the child so that we'll always
> >>>> - * guarantee the pinned page won't be randomly replaced in the
> >>>> - * future.
> >>>> - */
> >>>> - folio_get(folio);
> >>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >>>> - /* Page may be pinned, we have to copy. */
> >>>> - folio_put(folio);
> >>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >>>> - addr, rss, prealloc, page);
> >>>> + anon = folio_test_anon(folio);
> >>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >>>> + end, pte, &any_dirty);
> >>>
> >>> in case we have a large folio with 16 CONTPTE basepages, and userspace
> >>> do madvise(addr + 4KB * 5, DONTNEED);
> >>
> >> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> >> page0~page4 and page6~15?
> >>
> >>>
> >>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> >>> will return 15. in this case, we should copy page0~page3 and page5~page15.
> >>
> >> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> >> not how its intended to work. The function is scanning forwards from the current
> >> pte until it finds the first pte that does not fit in the batch - either because
> >> it maps a PFN that is not contiguous, or because the permissions are different
> >> (although this is being relaxed a bit; see conversation with DavidH against this
> >> same patch).
> >>
> >> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> >> (page0~page4) then the next time through the loop we will go through the
> >> !present path and process the single swap marker. Then the 3rd time through the
> >> loop folio_nr_pages_cont_mapped() will return 10.
> >
> > one case we have met by running hundreds of real phones is as below,
> >
> >
> > static int
> > copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> > pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> > unsigned long end)
> > {
> > ...
> > dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> > if (!dst_pte) {
> > ret = -ENOMEM;
> > goto out;
> > }
> > src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> > if (!src_pte) {
> > pte_unmap_unlock(dst_pte, dst_ptl);
> > /* ret == 0 */
> > goto out;
> > }
> > spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > orig_src_pte = src_pte;
> > orig_dst_pte = dst_pte;
> > arch_enter_lazy_mmu_mode();
> >
> > do {
> > /*
> > * We are holding two locks at this point - either of them
> > * could generate latencies in another task on another CPU.
> > */
> > if (progress >= 32) {
> > progress = 0;
> > if (need_resched() ||
> > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > break;
> > }
> > ptent = ptep_get(src_pte);
> > if (pte_none(ptent)) {
> > progress++;
> > continue;
> > }
> >
> > the above iteration can break when progress >= 32. For example, at the
> > beginning,
> > if all PTEs are none, we break when progress >= 32, and we may break when we
> > are in the 8th pte of 16 PTEs which might become CONTPTE after we release
> > the PTL.
> >
> > since we are releasing PTLs, next time when we get PTL, those pte_none() might
> > become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> > thus, immediately
> > break the consistent CONTPTEs rule of hardware?
> >
> > pte0 - pte_none
> > pte1 - pte_none
> > ...
> > pte7 - pte_none
> >
> > pte8 - pte_cont
> > ...
> > pte15 - pte_cont
> >
> > so we did some modification to avoid a break in the middle of PTEs
> > which can potentially
> > become CONTPTE.
> > do {
> > /*
> > * We are holding two locks at this point - either of them
> > * could generate latencies in another task on another CPU.
> > */
> > if (progress >= 32) {
> > progress = 0;
> > #ifdef CONFIG_CONT_PTE_HUGEPAGE
> > /*
> > * XXX: don't release ptl at an unaligned address as
> > cont_pte might form while
> > * ptl is released, this causes double-map
> > */
> > if (!vma_is_chp_anonymous(src_vma) ||
> > (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> > HPAGE_CONT_PTE_SIZE)))
> > #endif
> > if (need_resched() ||
> > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > break;
> > }
> >
> > We could only reproduce the above issue by running thousands of phones.
> >
> > Does your code survive from this problem?
>
> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
> will see some pte_none() entries followed by some pte_present() entries. And
> when calling set_ptes() on the child, the arch code will evaluate the current
> state of the pgtable along with the new set_ptes() request and determine where
> it should insert the CONT_PTE bit.

Yep, I have read very carefully and think your code is safe here. The
only problem is that the code can randomly unfold the parent process's CONTPTE
while setting wrprotect in the middle of a large folio, while it actually should
keep the CONT bit, as all PTEs can still be consistent if we set the protection
from the 1st PTE.

While A forks B, progress >= 32 might interrupt in the middle of a
new CONTPTE folio which is forming; as we have to set wrprotect on parent A,
this parent immediately loses the CONT bit. This is sad, but I can't find a
good way to resolve it unless CONT is exposed to mm-core. Any idea on
this?

Our code [1] resolves this by only breaking at an aligned address:

		if (progress >= 32) {
			progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
			/*
			 * XXX: don't release ptl at an unaligned address as
			 * cont_pte might form while ptl is released; this
			 * causes double-map
			 */
			if (!vma_is_chp_anonymous(src_vma) ||
			    (vma_is_chp_anonymous(src_vma) &&
			     IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180


Thanks
Barry

2023-11-28 19:38:26

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings

On Wed, Nov 29, 2023 at 1:08 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 05:49, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 03:18, Barry Song wrote:
> >>>> Ryan Roberts (14):
> >>>> mm: Batch-copy PTE ranges during fork()
> >>>> arm64/mm: set_pte(): New layer to manage contig bit
> >>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>>> arm64/mm: pte_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>>> arm64/mm: ptep_get(): New layer to manage contig bit
> >>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>>> arm64/mm: Wire up PTE_CONT for user mappings
> >>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >>>
> >>> Hi Ryan,
> >>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> >>> in the below cases
> >>
> >> The general idea is that the core-mm sets the individual ptes (one at a time if
> >> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> >> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> >> (ptep_clear(), etc); This is exactly the same interface as previously.
> >>
> >> BUT, the arm64 implementation of those interfaces will now detect when a set of
> >> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> >> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> >> the block is "folded". And it will detect when the first PTE in the block
> >> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> >> requirements for folding a contpte block is that all the pages must belong to
> >> the *same* folio (that means its safe to only track access/dirty for thecontpte
> >> block as a whole rather than for each individual pte).
> >>
> >> (there are a couple of optimizations that make the reality slightly more
> >> complicated than what I've just explained, but you get the idea).
> >>
> >> On that basis, I believe all the specific cases you describe below are all
> >> covered and safe - please let me know if you think there is a hole here!
> >>
> >>>
> >>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
> >>
> >> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> >> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> >> is removed from the whole contpte block. If there is then a subsequent
> >> set_pte_at() to set a swap entry, the implementation will see that its not
> >> appropriate to re-fold, so the range will remain unfolded.
> >>
> >>>
> >>> 2. vma split in a large folio due to various reasons such as mprotect,
> >>> munmap, mlock etc.
> >>
> >> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> >> suspect not, so if the VMA is split in the middle of a currently folded contpte
> >> block, it will remain folded. But this is safe and continues to work correctly.
> >> The VMA arrangement is not important; it is just important that a single folio
> >> is mapped contiguously across the whole block.
> >>
> >>>
> >>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> >>> rather than being as a whole.
> >>
> >> Yes, as per 1; the arm64 implementation will notice when the first entry is
> >> cleared and unfold the contpte block.
> >>
> >>>
> >>> In hardware, we need to make sure CONTPTEs follow the rule - always 16
> >>> contiguous physical addresses with CONTPTE set. If one of them runs away
> >>> from the 16 ptes group and the PTEs become inconsistent, some terrible
> >>> errors/faults can happen in HW. For example
> >>
> >> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> >> contpte_try_unfold(). the fold/unfold operation is only done when all
> >> requirements are met, and we perform it in a manner that is conformant to the
> >> architecture requirements (see contpte_fold() - being renamed to
> >> contpte_convert() in the next version).
> >
> > Hi Ryan,
> >
> > sorry for too many comments, I remembered another case
> >
> > 4. mremap
> >
> > a CONTPTE might be remapped to another address which might not be
> > aligned with 16*basepage. Thus, in move_ptes(), we are copying CONTPTEs
> > from src to dst.
> > static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > unsigned long old_addr, unsigned long old_end,
> > struct vm_area_struct *new_vma, pmd_t *new_pmd,
> > unsigned long new_addr, bool need_rmap_locks)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > pte_t *old_pte, *new_pte, pte;
> > ...
> >
> > /*
> > * We don't have to worry about the ordering of src and dst
> > * pte locks because exclusive mmap_lock prevents deadlock.
> > */
> > old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> > if (!old_pte) {
> > err = -EAGAIN;
> > goto out;
> > }
> > new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
> > if (!new_pte) {
> > pte_unmap_unlock(old_pte, old_ptl);
> > err = -EAGAIN;
> > goto out;
> > }
> > if (new_ptl != old_ptl)
> > spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > flush_tlb_batched_pending(vma->vm_mm);
> > arch_enter_lazy_mmu_mode();
> >
> > for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> > new_pte++, new_addr += PAGE_SIZE) {
> > if (pte_none(ptep_get(old_pte)))
> > continue;
> >
> > pte = ptep_get_and_clear(mm, old_addr, old_pte);
> > ....
> > }
> >
> > This has two possibilities
> > 1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE;
> > 2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE
> > while copying.
> >
> > does your code also handle this properly?
>
> Yes; same mechanism - the arm64 arch code does the CONT_PTE bit management and
> folds/unfolds as necessary.
>
> Admittedly this may be non-optimal because we are iterating a single PTE at a
> time. When we clear the first pte of a contpte block in the source, the block
> will be unfolded. When we set the last pte of the contpte block in the dest, the
> block will be folded. If we had a batching mechanism, we could just clear the
> whole source contpte block in one hit (no need to unfold first) and we could
> just set the dest contpte block in one hit (no need to fold at the end).
>
> I haven't personally seen this as a hotspot though; I don't know if you have any
> data to the contrary? I've followed this type of batching technique for the fork
> case (patch 13). We could do a similar thing in theory, but it's a bit more

In my previous testing, I didn't see mremap very often, so no worries.
As long as it is bug-free, it is fine to me, though an mremap microbenchmark
will definitely lose :-)

> complex because of the ptep_get_and_clear() return value; you would need to
> return all ptes for the cleared range, or somehow collapse the actual info that
> the caller requires (presumably access/dirty info).
>

Thanks
Barry

2023-11-28 20:24:18

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 08:17, Barry Song wrote:
> >> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> >> + unsigned long addr, pte_t *ptep)
> >> +{
> >> + /*
> >> + * When doing a full address space teardown, we can avoid unfolding the
> >> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> >> + * just get and clear the pte. The caller is promising to call us for
> >> + * every pte, so every pte in the range will be cleared by the time the
> >> + * tlbi is issued.
> >> + *
> >> + * This approach is not perfect though, as for the duration between
> >> + * returning from the first call to ptep_get_and_clear_full() and making
> >> + * the final call, the contpte block is in an intermediate state, where
> >> + * some ptes are cleared and others are still set with the PTE_CONT bit.
> >> + * If any other APIs are called for the ptes in the contpte block during
> >> + * that time, we have to be very careful. The core code currently
> >> + * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> >> + * ptep_get() must be careful to ignore the cleared entries when
> >> + * accumulating the access and dirty bits - the same goes for
> >> + * ptep_get_lockless(). The only other calls we might reasonably expect
> >> + * are to set markers in the previously cleared ptes. (We shouldn't see
> >> + * valid entries being set until after the tlbi, at which point we are
> >> + * no longer in the intermediate state). Since markers are not valid,
> >> + * this is safe; set_ptes() will see the old, invalid entry and will not
> >> + * attempt to unfold. And the new pte is also invalid so it won't
> >> + * attempt to fold. We shouldn't see this for the 'full' case anyway.
> >> + *
> >> + * The last remaining issue is returning the access/dirty bits. That
> >> + * info could be present in any of the ptes in the contpte block.
> >> + * ptep_get() will gather those bits from across the contpte block. We
> >> + * don't bother doing that here, because we know that the information is
> >> + * used by the core-mm to mark the underlying folio as accessed/dirty.
> >> + * And since the same folio must be underpinning the whole block (that
> >> + * was a requirement for folding in the first place), that information
> >> + * will make it to the folio eventually once all the ptes have been
> >> + * cleared. This approach means we don't have to play games with
> >> + * accumulating and storing the bits. It does mean that any interleaved
> >> + * calls to ptep_get() may lack correct access/dirty information if we
> >> + * have already cleared the pte that happened to store it. The core code
> >> + * does not rely on this though.
> >
> > even without any other threads running and touching those PTEs, this won't survive
> > on some hardware. we expose inconsistent CONTPTEs to hardware, this might result
>
> No that's not the case; if you read the Arm ARM, the page table is only
> considered "misgrogrammed" when *valid* entries within the same contpte block
> have different values for the contiguous bit. We are clearing the ptes to zero
> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
> (either due to explicit tlbi as you point out below, or due to a concurrent TLB
> miss which selects our entry for removal to make space for the new incoming
> entry), then when an access request arrives for an address in our partially cleared
> contpte block, the address will either be:
>
> A) an address for a pte entry we have already cleared, so its invalid and it
> will fault (and get serialized behind the PTL).
>
> or
>
> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
> entry for the contpte block. But that's ok because the memory still exists
> because we haven't yet finished clearing the page table and have not yet issued
> the final tlbi.
>
>
> > in crashed firmware, even in trustzone; we have seen strange & unknown faults to
> > trustzone on Qualcomm, but for MTK it seems fine. When you do a tlbi on a part of
> > the PTEs with dropped CONT while some other PTEs still have CONT, we make the
> > hardware totally confused.
>
> I suspect this is because in your case you are "misprogramming" the contpte
> block; there are *valid* pte entries within the block that disagree about the
> contiguous bit or about various other fields. In this case some HW TLB designs
> can do weird things. I suspect in your case, that's resulting in accessing bad
> memory space and causing an SError, which is trapped by EL3, and the FW is
> probably just panicking at that point.

You are probably right. As we met the SError, we became very, very cautious,
so any time we flush the TLB for a CONTPTE block, we strictly do it by:
1. set all 16 ptes to zero
2. flush the whole 16 ptes

In your case, it can be:
1. set pte0 to zero
2. flush pte0

TBH, I have never tried this, but it might be safe according to your
description.
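
To spell out ordering (1)+(2) as code, a minimal sketch (illustrative only;
clear_flush_contpte_block() is a made-up wrapper, assuming 4K base pages where
CONT_PTES == 16; __pte_clear(), contpte_align_down() and flush_tlb_range() are
the existing helpers):

static void clear_flush_contpte_block(struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep)
{
	unsigned long start = ALIGN_DOWN(addr, CONT_PTES * PAGE_SIZE);
	pte_t *first = contpte_align_down(ptep);
	int i;

	/* 1. set all 16 ptes to zero ... */
	for (i = 0; i < CONT_PTES; i++)
		__pte_clear(vma->vm_mm, start + i * PAGE_SIZE, first + i);

	/* 2. ... then flush the whole 16-pte range in one go */
	flush_tlb_range(vma, start, start + CONT_PTES * PAGE_SIZE);
}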

>
> >
> > zap_pte_range() has a force_flush when tlbbatch is full:
> >
> > if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> > force_flush = 1;
> > addr += PAGE_SIZE;
> > break;
> > }
> >
> > this means you can expose partial tlbi/flush directly to hardware while some
> > other PTEs are still CONT.
>
> Yes, but that's also possible even if we have a tight loop that clears down the
> contpte block; there could still be another core that issues a tlbi while you're
> halfway through that loop, or the HW could happen to evict due to TLB pressure
> at any time. The point is, it's safe if you are clearing the pte to an *invalid*
> entry.
>
> >
> > on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> > on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> > those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> > than clearing one PTE.
> >
> > Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> > of the large folio:
> >
> > #ifdef CONFIG_CONT_PTE_HUGEPAGE
> > if (pte_cont(ptent)) {
> > unsigned long next = pte_cont_addr_end(addr, end);
> >
> > if (next - addr != HPAGE_CONT_PTE_SIZE) {
> > __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> > /*
> > * After splitting cont-pte
> > * we need to process pte again.
> > */
> > goto again_pte;
> > } else {
> > cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> >
> > tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> > if (unlikely(!page))
> > continue;
> >
> > if (is_huge_zero_page(page)) {
> > tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> > goto cont_next;
> > }
> >
> > rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> > page_remove_rmap(page, true);
> > if (unlikely(page_mapcount(page) < 0))
> > print_bad_pte(vma, addr, ptent, page);
> >
> > tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> > }
> > cont_next:
> > /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> > pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > addr = next - PAGE_SIZE;
> > continue;
> > }
> > #endif
> >
> > this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> > it never requires tlb->fullmm at all.
>
> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
> and it is special-casing them at this level. I'm trying to avoid that.

I am thinking we can even do this while we don't expose CONTPTE:
if zap_pte_range() meets a large folio and the zap range covers the whole
folio, we can flush all the ptes in this folio and jump to the end of the folio?
I mean:

if (folio head && range_end > folio_end) {
	nr = folio_nr_pages(folio);
	full_flush_nr_ptes(...);
	pte += nr - 1;
	addr += (nr - 1) * PAGE_SIZE;
}

zap_pte_range() is the most frequent behaviour from the userspace libc heap,
as I explained before. libc calls madvise(DONTNEED) most often, so it is crucial
to performance.

And this way can also help drop your 'full' version by moving to fully flushing
whole large folios? Then we don't need to depend on fullmm any more?

>
> I don't think there is any correctness issue here. But there is a problem with
> fragility, as raised by Alistair. I have some ideas on potentially how to solve
> that. I'm going to try to work on it this afternoon and will post if I get some
> confidence that it is a real solution.
>
> Thanks,
> Ryan
>
> >
> > static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> > unsigned long addr,
> > pte_t *ptep,
> > bool flush)
> > {
> > pte_t orig_pte = ptep_get(ptep);
> >
> > CHP_BUG_ON(!pte_cont(orig_pte));
> > CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> > CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >
> > return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> > }
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >
> >> + */
> >> +
> >> + return __ptep_get_and_clear(mm, addr, ptep);
> >> +}
> >> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >> +
> >

Thanks
Barry

>

2023-11-28 21:06:46

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 09:49, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/11/2023 20:34, Barry Song wrote:
> >>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 10:28, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>>>>>> + if (anon) {
> >>>>>>>>>>> + /*
> >>>>>>>>>>> + * If this page may have been pinned by the
> >>>>>>>>>>> + * parent process, copy the page immediately for
> >>>>>>>>>>> + * the child so that we'll always guarantee the
> >>>>>>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>>>>>> + * future.
> >>>>>>>>>>> + */
> >>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>>>> + page, false, src_vma))) {
> >>>>>>>>>>> + if (i != 0)
> >>>>>>>>>>> + break;
> >>>>>>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>>>>>> + return copy_present_page(
> >>>>>>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>>>>>> + page);
> >>>>>>>>>>> + }
> >>>>>>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>>>> + } else {
> >>>>>>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>>>>>> + }
> >>>>>>>>>>> }
> >>>>>>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>>>>>> - } else if (page) {
> >>>>>>>>>>> - folio_get(folio);
> >>>>>>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>>>>>> +
> >>>>>>>>>>> + nr = i;
> >>>>>>>>>>> + folio_ref_add(folio, nr);
> >>>>>>>>>>
> >>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>>>
> >>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>>>> pages are the corner case.
> >>>>>>>>>>
> >>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I still don't understand why it is not an entire map+1, but an increment
> >>>>>>>>> in each basepage.
> >>>>>>>>
> >>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>>>> atomic, so we can account the entire thing.
> >>>>>>>
> >>>>>>> Hi Ryan,
> >>>>>>>
> >>>>>>> There is no problem. For example, a large folio is entirely mapped in
> >>>>>>> process A with CONTPTE,
> >>>>>>> and only page2 is mapped in process B.
> >>>>>>> then we will have
> >>>>>>>
> >>>>>>> entire_map = 0
> >>>>>>> page0.map = -1
> >>>>>>> page1.map = -1
> >>>>>>> page2.map = 0
> >>>>>>> page3.map = -1
> >>>>>>> ....
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>>>> split.
> >>>>>>>>>
> >>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>>>
> >>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>>>> large folio as HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>>>
> >>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>>>> =0? (start from -1).
> >>>>>>>>>
> >>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>>>> which makes that code easier.
> >>>>>>>>
> >>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>>>> unmap the whole folio atomically.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> My point is that CONTPTEs should either be all set in all 16 PTEs or all dropped
> >>>>>>> in the 16 PTEs. If all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>>>> it is partially
> >>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>>>> DoubleMapped.
> >>>>>>
> >>>>>> There are 2 problems with your proposal, as I see it;
> >>>>>>
> >>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>>>> the CONT_PTE bit.
> >>>>>>
> >>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>>>> be unmapped unatomically.
> >>>>>>
> >>>>>> For the PMD case there are actually 2 properties that allow using the
> >>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>>>
> >>>>> well. Thanks for clarification. based on the above description, i agree the
> >>>>> current code might make more sense by always using mapcount in subpage.
> >>>>>
> >>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
> >>>>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>>>> entirely, we only
> >>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>>>
> >>>> Well its always good to have the discussion - so thanks for the ideas. I think
> >>>> there is a bigger question lurking here; should we be exposing the concept of
> >>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >>>> I'm confident that would be a huge amount of effort and the end result would be
> >>>> similar performace to what this approach gives. One potential benefit of letting
> >>>> core-mm control it is that it would also give control to core-mm over the
> >>>> granularity of access/dirty reporting (my approach implicitly ties it to the
> >>>> folio). Having sub-folio access tracking _could_ potentially help with future
> >>>> work to make THP size selection automatic, but we are not there yet, and I think
> >>>> there are other (simpler) ways to achieve the same thing. So my view is that
> >>>> _not_ exposing it to core-mm is the right way for now.
> >>>
> >>> Hi Ryan,
> >>>
> >>> We(OPPO) started a similar project like you even before folio was imported to
> >>> mainline, we have deployed the dynamic hugepage(that is how we name it)
> >>> on millions of mobile phones on real products and kernels before 5.16, making
> >>> a huge success on performance improvement. for example, you may
> >>> find the out-of-tree 5.15 source code here
> >>
> >> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> >> embarrassed that I clearly didn't do enough research on the prior art because I
> >> wasn't aware of your work. So sorry about that.
> >>
> >> I sensed that you had a different model for how this should work vs what I've
> >> implemented and now I understand why :). I'll review your stuff and I'm sure
> >> I'll have questions. I'm sure each solution has pros and cons.
> >>
> >>
> >>>
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>
> >>> Our modification might not be so clean and has lots of workarounds
> >>> just for the stability of products
> >>>
> >>> We mainly have
> >>>
> >>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >>>
> >>> some CONTPTE helpers
> >>>
> >>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >>>
> >>> some Dynamic Hugepage APIs
> >>>
> >>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >>>
> >>> modified all page faults to support
> >>> (1). allocation of hugepage of 64KB in do_anon_page
> >>
> >> My Small-Sized THP patch set is handling the equivalent of this.
> >
> > right, the only difference is that we did a huge-zeropage for reading
> > in do_anon_page.
> > mapping all large folios to CONTPTE to zero page.
>
> FWIW, I took a slightly different approach in my original RFC for the zero page
> - although I ripped it all out to simplify for the initial series. I found that
> it was pretty rare for user space to read multiple consecutive pages without
> ever interleaving any writes, so I kept the zero page as a base page, but at CoW,
> I would expand the allocation to an appropriately sized THP. But for the couple
> of workloads that I've gone deep with, I found that it made barely any dent on
> the amount of memory that ended up contpte-mapped; the vast majority was from
> write allocation in do_anonymous_page().

The problem is that even if there is only one page read in the 16 ptes, you
will map that page to a zero basepage. Then, when you write another page in
these 16 ptes, you lose the chance of getting a large folio because
pte_range_none() becomes false.

If we map these 16 ptes to a contpte zero page, then in do_wp_page we have a
good chance to CoW and get a large anon folio.
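
The check I am referring to is along these lines - a sketch of the assumed
semantics only, since the real helper lives in the small-sized THP series and
may differ:

static bool pte_range_none(pte_t *ptep, unsigned int nr)
{
	unsigned int i;

	/*
	 * A large anon folio is only used when every pte in the candidate
	 * range is still none; one read fault that installed the 4K zero
	 * page makes this return false for the rest of the block.
	 */
	for (i = 0; i < nr; i++) {
		if (!pte_none(ptep_get(ptep + i)))
			return false;
	}
	return true;
}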

>
> >
> >>
> >>> (2). CoW hugepage in do_wp_page
> >>
> >> This isn't handled yet in my patch set; the original RFC implemented it but I
> >> removed it in order to strip back to the essential complexity for the initial
> >> submission. DavidH has been working on a precise shared vs exclusive map
> >> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> >> Out of interest, what workloads benefit most from this?
> >
> > as a phone, Android has a design almost all processes are forked from zygote.
> > thus, CoW happens quite often to all apps.
>
> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
> there are cases where that conclusion is wrong.

CoW is much less frequent than do_anon_page on my phone, which has been running
dynamic hugepage for a couple of hours:

OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
...
thp_cow 34669 ---- CoW a large folio
thp_do_anon_pages 1032362 ----- a large folio in do_anon_page
...

so it is around 34669/1032362 = 3.35%.

>
> >
> >>
> >>> (3). copy CONPTEs in copy_pte_range
> >>
> >> As discussed this is done as part of the contpte patch set, but its not just a
> >> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
> >
> > right, i have read all your unfold and fold stuff today, now i understand your
> > approach seems quite nice!
>
> Great - thanks!
>
> >
> >
> >>
> >>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
> >>
> >> This is going to be a problem but I haven't even looked at this properly yet.
> >> The advice so far has been to continue to swap-in small pages only, but improve
> >> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> >> understand how you did this.
> >
> > this is also crucial to android phone as swap is always happening
> > on an embedded device. if we don't support large folios in swapin,
> > our large folios will never come back after it is swapped-out.
> >
> > and i hated the collapse solution from the first beginning as there is
> > never a guarantee to succeed and its overhead is unacceptable to user UI,
> > so we supported hugepage allocation in do_swap_page from the first beginning.
>
> Understood. I agree it would be nice to preserve large folios across swap. I
> think this can be layered on top of the current work though.

This will be my first priority for using your large folio code on phones. We
need a patchset on top of yours :-)

Without it, we will likely fail. Typically, one phone can have 4~8GB of zRAM
to compress a lot of anon pages; with a compression ratio of 1:4, that
corresponds to 16~32GB of uncompressed anon pages, which is much, much more.
Thus, when a background app is switched back to the foreground, we need those
swapped-out large folios back rather than getting small basepages as
replacements. Swapping in basepages is definitely not going to work well on a
phone, and neither does THP collapse.

>
> >
> >>
> >>>
> >>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >>>
> >>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
> >>
> >> I think this is all naturally handled by the folio code that exists in modern
> >> kernels?
> >
> > We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
> > reclaim large folios to the pool. as phones are running lots of apps
> > and drivers, and the memory is very limited, after a couple of hours,
> > it will become very hard to allocate large folios in the original buddy. thus,
> > large folios totally disappeared after running the phone for some time
> > if we didn't have the pool.
> >
> >>
> >>>
> >>> So we are 100% interested in your patchset and hope it can find a way
> >>> to land on the
> >>> mainline, thus decreasing all the cost we have to maintain out-of-tree
> >>> code from a
> >>> kernel to another kernel version which we have done on a couple of
> >>> kernel versions
> >>> before 5.16. Firmly, we are 100% supportive of large anon folios
> >>> things you are leading.
> >>
> >> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> >> it closer :). If you had any ability to do any A/B performance testing, it would
> >> be very interesting to see how this stacks up against your solution - if there
> >> are gaps it would be good to know where and develop a plan to plug the gap.
> >>
> >
> > sure.
> >
> >>>
> >>> A big pain was we found lots of races especially on CONTPTE unfolding
> >>> and especially a part
> >>> of basepages ran away from the 16 CONPTEs group since userspace is
> >>> always working
> >>> on basepages, having no idea of small-THP. We ran our code on millions of
> >>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
> >>> no outstanding issue.
> >>
> >> I'm going to be brave and say that my solution shouldn't suffer from these
> >> problems; but of course the proof is only in the testing. I did a lot of work
> >> with our architecture group and micro architects to determine exactly what is
> >> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
> >> optimization in patch 13 (see the commit log for details). Of course this has
> >> all been checked with partners and we are confident that all existing
> >> implementations conform to the modified wording.
> >
> > cool. I like your try_unfold/fold code. it seems your code is setting/dropping
> > CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
> > our code is always stupidly checking some conditions before setting and dropping
> > CONT everywhere.
> >
> >>
> >>>
> >>> Particularly for the rmap issue we are discussing, our out-of-tree is
> >>> using the entire_map for
> >>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> >>> CONTPTE from mm-core.
> >>>
> >>> We are doing this in mm/memory.c
> >>>
> >>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> >>> vm_area_struct *src_vma,
> >>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> >>> struct page **prealloc)
> >>> {
> >>> struct mm_struct *src_mm = src_vma->vm_mm;
> >>> unsigned long vm_flags = src_vma->vm_flags;
> >>> pte_t pte = *src_pte;
> >>> struct page *page;
> >>>
> >>> page = vm_normal_page(src_vma, addr, pte);
> >>> ...
> >>>
> >>> get_page(page);
> >>> page_dup_rmap(page, true); // an entire dup_rmap as you can
> >>> see.............
> >>> rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> >>> }
> >>>
> >>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> >>>
> >>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> >>> unsigned long haddr, bool freeze)
> >>> {
> >>> ...
> >>> if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> >>> for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> >>> atomic_inc(&head[i]._mapcount);
> >>> atomic_long_inc(&cont_pte_double_map_count);
> >>> }
> >>>
> >>>
> >>> if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> >>> ...
> >>> }
> >>>
> >>> I am not selling our solution any more, but just showing you some differences we
> >>> have :-)
> >>
> >> OK, I understand what you were saying now. I'm currently struggling to see how
> >> this could fit into my model. Do you have any workloads and numbers on perf
> >> improvement of using entire_mapcount?
> >
> > TBH, I don't have any data on this as from the first beginning, we were using
> > entire_map. So I have no comparison at all.
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> BTW, I have concerns that a variable small-THP size will really work
> >>>>> as userspace
> >>>>> is probably friendly to only one fixed size. for example, userspace
> >>>>> heap management
> >>>>> might be optimized to a size for freeing memory to the kernel. it is
> >>>>> very difficult
> >>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
> >>>>> size not equal with, and particularly smaller than small-THP size will
> >>>>> defeat all
> >>>>> efforts to use small-THP.
> >>>>
> >>>> I'll admit to not knowing a huge amount about user space allocators. But I will
> >>>> say that as currently defined, the small-sized THP interface to user space
> >>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >>>> single size can be enabled. I'm diliberately punting that decision away from the
> >>>> kernel for now.
> >>>
> >>> Basically, userspace heap library has a PAGESIZE setting and allows users
> >>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> >>> The default size is for sure equal to the basepage SIZE. once some objects are
> >>> freed by free() and libc get a free "page", userspace heap libraries might free
> >>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> >>> it is quite similar with kernel slab.
> >>>
> >>> so imagine we have small-THP now, but userspace libraries have *NO*
> >>> idea at all, so it can frequently cause unfolding.
> >>>
> >>>>
> >>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> >>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>>>
> >>>> Functionally, it will not matter if the allocator is not enlightened for the THP
> >>>> size; it can continue to free, and if a partial folio is unmapped it is put on
> >>>> the deferred split list, then under memory pressure it is split and the unused
> >>>> pages are reclaimed. I guess this is the bit you are concerned about having a
> >>>> performance impact?
> >>>
> >>> right. If this is happening on the majority of small-THP folios, we
> >>> don't have performance
> >>> improvement, and probably regression instead. This is really true on
> >>> real workloads!!
> >>>
> >>> So that is why we really love a per-VMA hint to enable small-THP but
> >>> obviously you
> >>> have already supported it now by
> >>> mm: thp: Introduce per-size thp sysfs interface
> >>> https://lore.kernel.org/linux-mm/[email protected]/
> >>>
> >>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> >>> can set the VMA flag when it is quite sure this VMA is working with
> >>> the alignment
> >>> of 64KB?
> >>
> >> Yes, that all exists in the series today. We have also discussed the possibility
> >> of adding a new madvise_process() call that would take the set of THP sizes that
> >> should be considered. Then you can set different VMAs to use different sizes;
> >> the plan was to layer that on top if/when a workload was identified. Sounds like
> >> you might be able to help there?
> >
> > i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
> > for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
> > set a flag in this VMA and try to allocate 64KB.
>
> When you say "we set a flag" do you mean user space? Or is there some heuristic
> in the kernel?

We are using a field the Android kernel adds to the vma struct to mark that
this vma is all good to use CONTPTE. With the upstream solution you are
providing, we can remove this dirty code[1]:

static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
{
	return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
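
For comparison, a minimal sketch of what the upstream-style replacement could
look like from user space, assuming the per-size sysfs interface linked above
is set to "madvise" for the 64K size only: the existing MADV_HUGEPAGE hint
opts the VMA in, with no out-of-tree vma field needed (the buffer size below
is purely illustrative).

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16 * 64 * 1024;	/* illustrative: 16 x 64K */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * Hint that this VMA is a good THP candidate; with only the 64K size
	 * enabled in "madvise" mode, it becomes eligible for 64K folios.
	 */
	madvise(buf, len, MADV_HUGEPAGE);

	/* ... touch buf ... */

	munmap(buf, len);
	return 0;
}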

Thanks
Barry

2023-11-29 12:22:25

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 21:06, Barry Song wrote:
> On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 28/11/2023 09:49, Barry Song wrote:
>>> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 20:34, Barry Song wrote:
>>>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 27/11/2023 10:28, Barry Song wrote:
>>>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
>>>>>>>>>>>>> + if (anon) {
>>>>>>>>>>>>> + /*
>>>>>>>>>>>>> + * If this page may have been pinned by the
>>>>>>>>>>>>> + * parent process, copy the page immediately for
>>>>>>>>>>>>> + * the child so that we'll always guarantee the
>>>>>>>>>>>>> + * pinned page won't be randomly replaced in the
>>>>>>>>>>>>> + * future.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>>>>>> + page, false, src_vma))) {
>>>>>>>>>>>>> + if (i != 0)
>>>>>>>>>>>>> + break;
>>>>>>>>>>>>> + /* Page may be pinned, we have to copy. */
>>>>>>>>>>>>> + return copy_present_page(
>>>>>>>>>>>>> + dst_vma, src_vma, dst_pte,
>>>>>>>>>>>>> + src_pte, addr, rss, prealloc,
>>>>>>>>>>>>> + page);
>>>>>>>>>>>>> + }
>>>>>>>>>>>>> + rss[MM_ANONPAGES]++;
>>>>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>>>>>> + } else {
>>>>>>>>>>>>> + page_dup_file_rmap(page, false);
>>>>>>>>>>>>> + rss[mm_counter_file(page)]++;
>>>>>>>>>>>>> + }
>>>>>>>>>>>>> }
>>>>>>>>>>>>> - rss[MM_ANONPAGES]++;
>>>>>>>>>>>>> - } else if (page) {
>>>>>>>>>>>>> - folio_get(folio);
>>>>>>>>>>>>> - page_dup_file_rmap(page, false);
>>>>>>>>>>>>> - rss[mm_counter_file(page)]++;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + nr = i;
>>>>>>>>>>>>> + folio_ref_add(folio, nr);
>>>>>>>>>>>>
>>>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>>>>>
>>>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>>>>>> pages are the corner case.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>>>>>> in each basepage.
>>>>>>>>>>
>>>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>>>>>> atomic, so we can account the entire thing.
>>>>>>>>>
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>>>>>> process A with CONPTE,
>>>>>>>>> and only page2 is mapped in process B.
>>>>>>>>> then we will have
>>>>>>>>>
>>>>>>>>> entire_map = 0
>>>>>>>>> page0.map = -1
>>>>>>>>> page1.map = -1
>>>>>>>>> page2.map = 0
>>>>>>>>> page3.map = -1
>>>>>>>>> ....
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>>>>>> split.
>>>>>>>>>>>
>>>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>>>>>
>>>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>>>>>
>>>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>>>>>> =0? (start from -1).
>>>>>>>>>>>
>>>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>>>>>> which makes that code easier.
>>>>>>>>>>
>>>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>>>>>> unmap the whole folio atomically.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>>>>>> it is partially
>>>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>>>>>> DoubleMapped.
>>>>>>>>
>>>>>>>> There are 2 problems with your proposal, as I see it;
>>>>>>>>
>>>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>>>>>> the CONT_PTE bit.
>>>>>>>>
>>>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>>>>>> be unmapped unatomically.
>>>>>>>>
>>>>>>>> For the PMD case there are actually 2 properties that allow using the
>>>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>>>>>
>>>>>>> well. Thanks for clarification. based on the above description, i agree the
>>>>>>> current code might make more sense by always using mapcount in subpage.
>>>>>>>
>>>>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
>>>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>>>>>> entirely, we only
>>>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>>>>>
>>>>>> Well its always good to have the discussion - so thanks for the ideas. I think
>>>>>> there is a bigger question lurking here; should we be exposing the concept of
>>>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>>>>>> I'm confident that would be a huge amount of effort and the end result would be
>>>>>> similar performace to what this approach gives. One potential benefit of letting
>>>>>> core-mm control it is that it would also give control to core-mm over the
>>>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
>>>>>> folio). Having sub-folio access tracking _could_ potentially help with future
>>>>>> work to make THP size selection automatic, but we are not there yet, and I think
>>>>>> there are other (simpler) ways to achieve the same thing. So my view is that
>>>>>> _not_ exposing it to core-mm is the right way for now.
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> We(OPPO) started a similar project like you even before folio was imported to
>>>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
>>>>> on millions of mobile phones on real products and kernels before 5.16, making
>>>>> a huge success on performance improvement. for example, you may
>>>>> find the out-of-tree 5.15 source code here
>>>>
>>>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
>>>> embarrassed that I clearly didn't do enough research on the prior art because I
>>>> wasn't aware of your work. So sorry about that.
>>>>
>>>> I sensed that you had a different model for how this should work vs what I've
>>>> implemented and now I understand why :). I'll review your stuff and I'm sure
>>>> I'll have questions. I'm sure each solution has pros and cons.
>>>>
>>>>
>>>>>
>>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>>>
>>>>> Our modification might not be so clean and has lots of workarounds
>>>>> just for the stability of products
>>>>>
>>>>> We mainly have
>>>>>
>>>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>>>>>
>>>>> some CONTPTE helpers
>>>>>
>>>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>>>>>
>>>>> some Dynamic Hugepage APIs
>>>>>
>>>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>>>>>
>>>>> modified all page faults to support
>>>>> (1). allocation of hugepage of 64KB in do_anon_page
>>>>
>>>> My Small-Sized THP patch set is handling the equivalent of this.
>>>
>>> right, the only difference is that we did a huge-zeropage for reading
>>> in do_anon_page.
>>> mapping all large folios to CONTPTE to zero page.
>>
>> FWIW, I took a slightly different approach in my original RFC for the zero page
>> - although I ripped it all out to simplify for the initial series. I found that
>> it was pretty rare for user space to read multiple consecutive pages without
>> ever interleving any writes, so I kept the zero page as a base page, but at CoW,
>> I would expand the allocation to an approprately sized THP. But for the couple
>> of workloads that I've gone deep with, I found that it made barely any dent on
>> the amount of memory that ended up contpte-mapped; the vast majority was from
>> write allocation in do_anonymous_page().
>
> the problem is even if there is only one page read in 16 ptes, you
> will map the page to
> zero basepage. then while you write another page in these 16 ptes, you
> lose the chance
> to become large folio as pte_range_none() becomes false.
>
> if we map these 16ptes to contpte zero page, in do_wp_page, we have a
> good chance
> to CoW and get a large anon folio.

Yes understood. I think we are a bit off-topic for this patch set though.
Small-sized THP zero pages can be tackled as a separate series once these
initial series are in. I'd be happy to review a small-sized THP zero page post :)

>
>>
>>>
>>>>
>>>>> (2). CoW hugepage in do_wp_page
>>>>
>>>> This isn't handled yet in my patch set; the original RFC implemented it but I
>>>> removed it in order to strip back to the essential complexity for the initial
>>>> submission. DavidH has been working on a precise shared vs exclusive map
>>>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
>>>> Out of interest, what workloads benefit most from this?
>>>
>>> as a phone, Android has a design almost all processes are forked from zygote.
>>> thus, CoW happens quite often to all apps.
>>
>> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
>> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
>> there are cases where that conclusion is wrong.
>
> CoW is much less than do_anon_page on my phone which is running dynamic
> hugepage for a couple of hours:
>
> OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
> ...
> thp_cow 34669 ---- CoW a large folio
> thp_do_anon_pages 1032362 ----- a large folio in do_anon_page
> ...
>
> so it is around 34669/1032362 = 3.35%.

Well, it's actually 34669 / (34669 + 1032362) = 3.25%. But, yes, the point is
that very few large folios are lost due to CoW, so there is likely to be little
perf impact. Again, I'd happily review a series that enables this!
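
For reference, the two figures differ only in the choice of denominator (CoW
events relative to anon-fault allocations alone, vs. relative to all large
folios allocated):

\[
\frac{34669}{1032362} \approx 3.36\%,
\qquad
\frac{34669}{34669 + 1032362} \approx 3.25\%
\]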

>
>>
>>>
>>>>
>>>>> (3). copy CONPTEs in copy_pte_range
>>>>
>>>> As discussed this is done as part of the contpte patch set, but its not just a
>>>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
>>>
>>> right, i have read all your unfold and fold stuff today, now i understand your
>>> approach seems quite nice!
>>
>> Great - thanks!
>>
>>>
>>>
>>>>
>>>>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
>>>>
>>>> This is going to be a problem but I haven't even looked at this properly yet.
>>>> The advice so far has been to continue to swap-in small pages only, but improve
>>>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
>>>> understand how you did this.
>>>
>>> this is also crucial to android phone as swap is always happening
>>> on an embedded device. if we don't support large folios in swapin,
>>> our large folios will never come back after it is swapped-out.
>>>
>>> and i hated the collapse solution from the first beginning as there is
>>> never a guarantee to succeed and its overhead is unacceptable to user UI,
>>> so we supported hugepage allocation in do_swap_page from the first beginning.
>>
>> Understood. I agree it would be nice to preserve large folios across swap. I
>> think this can be layered on top of the current work though.
>
> This will be my first priority to use your large folio code on phones.
> We need a patchset
> on top of yours :-)
>
> without it, we will likely fail. Typically, one phone can have a 4~8GB
> zRAM to compress
> a lot of anon pages, if the compression ratio is 1:4, that means
> uncompressed anon
> pages are much much more. Thus, while the background app is switched back
> to foreground, we need those swapped-out large folios back rather than getting
> small basepages replacement. swap-in basepage is definitely not going to
> work well on a phone, neither does THP collapse.

Yep understood. From the other thread, it sounds like you are preparing a series
for large-folio swap-in - looking forward to seeing it!

>
>>
>>>
>>>>
>>>>>
>>>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
>>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>>>>>
>>>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>>>>
>>>> I think this is all naturally handled by the folio code that exists in modern
>>>> kernels?
>>>
>>> We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
>>> reclaim large folios to the pool. as phones are running lots of apps
>>> and drivers, and the memory is very limited, after a couple of hours,
>>> it will become very hard to allocate large folios in the original buddy. thus,
>>> large folios totally disappeared after running the phone for some time
>>> if we didn't have the pool.
>>>
>>>>
>>>>>
>>>>> So we are 100% interested in your patchset and hope it can find a way
>>>>> to land on the
>>>>> mainline, thus decreasing all the cost we have to maintain out-of-tree
>>>>> code from a
>>>>> kernel to another kernel version which we have done on a couple of
>>>>> kernel versions
>>>>> before 5.16. Firmly, we are 100% supportive of large anon folios
>>>>> things you are leading.
>>>>
>>>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
>>>> it closer :). If you had any ability to do any A/B performance testing, it would
>>>> be very interesting to see how this stacks up against your solution - if there
>>>> are gaps it would be good to know where and develop a plan to plug the gap.
>>>>
>>>
>>> sure.
>>>
>>>>>
>>>>> A big pain was we found lots of races especially on CONTPTE unfolding
>>>>> and especially a part
>>>>> of basepages ran away from the 16 CONPTEs group since userspace is
>>>>> always working
>>>>> on basepages, having no idea of small-THP. We ran our code on millions of
>>>>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
>>>>> no outstanding issue.
>>>>
>>>> I'm going to be brave and say that my solution shouldn't suffer from these
>>>> problems; but of course the proof is only in the testing. I did a lot of work
>>>> with our architecture group and micro architects to determine exactly what is
>>>> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
>>>> optimization in patch 13 (see the commit log for details). Of course this has
>>>> all been checked with partners and we are confident that all existing
>>>> implementations conform to the modified wording.
>>>
>>> cool. I like your try_unfold/fold code. it seems your code is setting/dropping
>>> CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
>>> our code is always stupidly checking some conditions before setting and dropping
>>> CONT everywhere.
>>>
>>>>
>>>>>
>>>>> Particularly for the rmap issue we are discussing, our out-of-tree is
>>>>> using the entire_map for
>>>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
>>>>> CONTPTE from mm-core.
>>>>>
>>>>> We are doing this in mm/memory.c
>>>>>
>>>>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
>>>>> vm_area_struct *src_vma,
>>>>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>> struct page **prealloc)
>>>>> {
>>>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>>>> unsigned long vm_flags = src_vma->vm_flags;
>>>>> pte_t pte = *src_pte;
>>>>> struct page *page;
>>>>>
>>>>> page = vm_normal_page(src_vma, addr, pte);
>>>>> ...
>>>>>
>>>>> get_page(page);
>>>>> page_dup_rmap(page, true); // an entire dup_rmap as you can
>>>>> see.............
>>>>> rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
>>>>> }
>>>>>
>>>>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>>>>>
>>>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>>>>> unsigned long haddr, bool freeze)
>>>>> {
>>>>> ...
>>>>> if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
>>>>> for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
>>>>> atomic_inc(&head[i]._mapcount);
>>>>> atomic_long_inc(&cont_pte_double_map_count);
>>>>> }
>>>>>
>>>>>
>>>>> if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
>>>>> ...
>>>>> }
>>>>>
>>>>> I am not selling our solution any more, but just showing you some differences we
>>>>> have :-)
>>>>
>>>> OK, I understand what you were saying now. I'm currently struggling to see how
>>>> this could fit into my model. Do you have any workloads and numbers on perf
>>>> improvement of using entire_mapcount?
>>>
>>> TBH, I don't have any data on this as from the first beginning, we were using
>>> entire_map. So I have no comparison at all.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> BTW, I have concerns that a variable small-THP size will really work
>>>>>>> as userspace
>>>>>>> is probably friendly to only one fixed size. for example, userspace
>>>>>>> heap management
>>>>>>> might be optimized to a size for freeing memory to the kernel. it is
>>>>>>> very difficult
>>>>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>>>>>> size not equal with, and particularly smaller than small-THP size will
>>>>>>> defeat all
>>>>>>> efforts to use small-THP.
>>>>>>
>>>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
>>>>>> say that as currently defined, the small-sized THP interface to user space
>>>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>>>>>> single size can be enabled. I'm diliberately punting that decision away from the
>>>>>> kernel for now.
>>>>>
>>>>> Basically, userspace heap library has a PAGESIZE setting and allows users
>>>>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
>>>>> The default size is for sure equal to the basepage SIZE. once some objects are
>>>>> freed by free() and libc get a free "page", userspace heap libraries might free
>>>>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
>>>>> it is quite similar with kernel slab.
>>>>>
>>>>> so imagine we have small-THP now, but userspace libraries have *NO*
>>>>> idea at all, so it can frequently cause unfolding.
>>>>>
>>>>>>
>>>>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>>>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>>>>>
>>>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
>>>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
>>>>>> the deferred split list, then under memory pressure it is split and the unused
>>>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
>>>>>> performance impact?
>>>>>
>>>>> right. If this is happening on the majority of small-THP folios, we
>>>>> don't have performance
>>>>> improvement, and probably regression instead. This is really true on
>>>>> real workloads!!
>>>>>
>>>>> So that is why we really love a per-VMA hint to enable small-THP but
>>>>> obviously you
>>>>> have already supported it now by
>>>>> mm: thp: Introduce per-size thp sysfs interface
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
>>>>> can set the VMA flag when it is quite sure this VMA is working with
>>>>> the alignment
>>>>> of 64KB?
>>>>
>>>> Yes, that all exists in the series today. We have also discussed the possibility
>>>> of adding a new madvise_process() call that would take the set of THP sizes that
>>>> should be considered. Then you can set different VMAs to use different sizes;
>>>> the plan was to layer that on top if/when a workload was identified. Sounds like
>>>> you might be able to help there?
>>>
>>> i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
>>> for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
>>> set a flag in this VMA and try to allocate 64KB.
>>
>> When you say "we set a flag" do you mean user space? Or is there some heuristic
>> in the kernel?
>
> we are using a field extended by the android kernel in vma struct to
> mark this vma
> is all good to use CONTPTE. With the upstream solution you are providing, we can
> remove this dirty code[1].
> static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
> {
> return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
> }

Sorry, I'm not sure I've understood; how does that flag get set in the first
place? Does user space tell the kernel (via e.g. madvise()) or does the kernel
set it based on some heuristic?

>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
>
> Thanks
> Barry

2023-11-29 12:29:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On 28/11/2023 19:00, Barry Song wrote:
> On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 28/11/2023 00:11, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2023 05:54, Barry Song wrote:
>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>>>> + pte_t *dst_pte, pte_t *src_pte,
>>>>>> + unsigned long addr, unsigned long end,
>>>>>> + int *rss, struct folio **prealloc)
>>>>>> {
>>>>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>> unsigned long vm_flags = src_vma->vm_flags;
>>>>>> pte_t pte = ptep_get(src_pte);
>>>>>> struct page *page;
>>>>>> struct folio *folio;
>>>>>> + int nr = 1;
>>>>>> + bool anon;
>>>>>> + bool any_dirty = pte_dirty(pte);
>>>>>> + int i;
>>>>>>
>>>>>> page = vm_normal_page(src_vma, addr, pte);
>>>>>> - if (page)
>>>>>> + if (page) {
>>>>>> folio = page_folio(page);
>>>>>> - if (page && folio_test_anon(folio)) {
>>>>>> - /*
>>>>>> - * If this page may have been pinned by the parent process,
>>>>>> - * copy the page immediately for the child so that we'll always
>>>>>> - * guarantee the pinned page won't be randomly replaced in the
>>>>>> - * future.
>>>>>> - */
>>>>>> - folio_get(folio);
>>>>>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>> - /* Page may be pinned, we have to copy. */
>>>>>> - folio_put(folio);
>>>>>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>> - addr, rss, prealloc, page);
>>>>>> + anon = folio_test_anon(folio);
>>>>>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>> + end, pte, &any_dirty);
>>>>>
>>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
>>>>> do madvise(addr + 4KB * 5, DONTNEED);
>>>>
>>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
>>>> page0~page4 and page6~15?
>>>>
>>>>>
>>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
>>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
>>>>
>>>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
>>>> not how its intended to work. The function is scanning forwards from the current
>>>> pte until it finds the first pte that does not fit in the batch - either because
>>>> it maps a PFN that is not contiguous, or because the permissions are different
>>>> (although this is being relaxed a bit; see conversation with DavidH against this
>>>> same patch).
>>>>
>>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
>>>> (page0~page4) then the next time through the loop we will go through the
>>>> !present path and process the single swap marker. Then the 3rd time through the
>>>> loop folio_nr_pages_cont_mapped() will return 10.
>>>
>>> one case we have met by running hundreds of real phones is as below,
>>>
>>>
>>> static int
>>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>>> unsigned long end)
>>> {
>>> ...
>>> dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
>>> if (!dst_pte) {
>>> ret = -ENOMEM;
>>> goto out;
>>> }
>>> src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
>>> if (!src_pte) {
>>> pte_unmap_unlock(dst_pte, dst_ptl);
>>> /* ret == 0 */
>>> goto out;
>>> }
>>> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>> orig_src_pte = src_pte;
>>> orig_dst_pte = dst_pte;
>>> arch_enter_lazy_mmu_mode();
>>>
>>> do {
>>> /*
>>> * We are holding two locks at this point - either of them
>>> * could generate latencies in another task on another CPU.
>>> */
>>> if (progress >= 32) {
>>> progress = 0;
>>> if (need_resched() ||
>>> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>> break;
>>> }
>>> ptent = ptep_get(src_pte);
>>> if (pte_none(ptent)) {
>>> progress++;
>>> continue;
>>> }
>>>
>>> the above iteration can break when progress > =32. for example, at the
>>> beginning,
>>> if all PTEs are none, we break when progress >=32, and we break when we
>>> are in the 8th pte of 16PTEs which might become CONTPTE after we release
>>> PTL.
>>>
>>> since we are releasing PTLs, next time when we get PTL, those pte_none() might
>>> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
>>> thus, immediately
>>> break the consistent CONPTEs rule of hardware?
>>>
>>> pte0 - pte_none
>>> pte1 - pte_none
>>> ...
>>> pte7 - pte_none
>>>
>>> pte8 - pte_cont
>>> ...
>>> pte15 - pte_cont
>>>
>>> so we did some modification to avoid a break in the middle of PTEs
>>> which can potentially
>>> become CONTPE.
>>> do {
>>> /*
>>> * We are holding two locks at this point - either of them
>>> * could generate latencies in another task on another CPU.
>>> */
>>> if (progress >= 32) {
>>> progress = 0;
>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
>>> /*
>>> * XXX: don't release ptl at an unligned address as
>>> cont_pte might form while
>>> * ptl is released, this causes double-map
>>> */
>>> if (!vma_is_chp_anonymous(src_vma) ||
>>> (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
>>> HPAGE_CONT_PTE_SIZE)))
>>> #endif
>>> if (need_resched() ||
>>> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>> break;
>>> }
>>>
>>> We could only reproduce the above issue by running thousands of phones.
>>>
>>> Does your code survive from this problem?
>>
>> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
>> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
>> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
>> will see some pte_none() entries followed by some pte_present() entries. And
>> when calling set_ptes() on the child, the arch code will evaluate the current
>> state of the pgtable along with the new set_ptes() request and determine where
>> it should insert the CONT_PTE bit.
>
> yep, i have read very carefully and think your code is safe here. The
> only problem
> is that the code can randomly unfold parent processes' CONPTE while setting
> wrprotect in the middle of a large folio while it actually should keep CONT
> bit as all PTEs can be still consistent if we set protect from the 1st PTE.
>
> while A forks B, progress >= 32 might interrupt in the middle of a
> new CONTPTE folio which is forming, as we have to set wrprotect to parent A,
> this parent immediately loses CONT bit. this is sad. but i can't find a
> good way to resolve it unless CONT is exposed to mm-core. any idea on
> this?

No, this is not the case; copy_present_ptes() will copy as many ptes as are
physically contiguous and belong to the same folio (which usually means "the
whole folio" - the only time it doesn't is when we hit the end of the vma). We
will then return to the main loop and move forward by the number of ptes that
were serviced, including:

progress += 8 * ret;

That might go above 32, so we will briefly drop and re-take the lock. But we
will never have done that in the middle of a large folio, so the contpte-ness
should be preserved.
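
As an illustration of that flow, here is a simplified sketch (not the actual
patch code, and reusing the surrounding declarations from copy_pte_range())
of the loop shape being described: 'progress' is only re-checked after a whole
batch has been serviced, so the ptl can only be dropped on a batch boundary,
never in the middle of a large folio.

	int progress = 0, ret;

	do {
		if (progress >= 32) {
			progress = 0;
			/* Only ever reached on a batch boundary. */
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}

		ret = 1;	/* default: one pte serviced this iteration */
		if (pte_none(ptep_get(src_pte))) {
			progress++;
			continue;
		}

		/*
		 * Services the whole batch of ptes that are physically
		 * contiguous and belong to the same folio; returns the number
		 * of ptes handled.
		 */
		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
					addr, end, rss, &prealloc);
		if (ret < 0)
			break;

		progress += 8 * ret;
	} while (dst_pte += ret, src_pte += ret,
		 addr += PAGE_SIZE * ret, addr != end);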

>
> Our code[1] resolves this by only breaking at the aligned address
>
> if (progress >= 32) {
> progress = 0;
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> /*
> * XXX: don't release ptl at an unligned address as cont_pte
> might form while
> * ptl is released, this causes double-map
> */
> if (!vma_is_chp_anonymous(src_vma) ||
> (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> HPAGE_CONT_PTE_SIZE)))
> #endif
> if (need_resched() ||
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180
>
>
> Thanks
> Barry

2023-11-30 00:52:00

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

On Thu, Nov 30, 2023 at 1:21 AM Ryan Roberts <[email protected]> wrote:
>
> On 28/11/2023 21:06, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 28/11/2023 09:49, Barry Song wrote:
> >>> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 27/11/2023 20:34, Barry Song wrote:
> >>>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <[email protected]> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 10:28, Barry Song wrote:
> >>>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>>>>>> + for (i = 0; i < nr; i++, page++) {
> >>>>>>>>>>>>> + if (anon) {
> >>>>>>>>>>>>> + /*
> >>>>>>>>>>>>> + * If this page may have been pinned by the
> >>>>>>>>>>>>> + * parent process, copy the page immediately for
> >>>>>>>>>>>>> + * the child so that we'll always guarantee the
> >>>>>>>>>>>>> + * pinned page won't be randomly replaced in the
> >>>>>>>>>>>>> + * future.
> >>>>>>>>>>>>> + */
> >>>>>>>>>>>>> + if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>>>>>> + page, false, src_vma))) {
> >>>>>>>>>>>>> + if (i != 0)
> >>>>>>>>>>>>> + break;
> >>>>>>>>>>>>> + /* Page may be pinned, we have to copy. */
> >>>>>>>>>>>>> + return copy_present_page(
> >>>>>>>>>>>>> + dst_vma, src_vma, dst_pte,
> >>>>>>>>>>>>> + src_pte, addr, rss, prealloc,
> >>>>>>>>>>>>> + page);
> >>>>>>>>>>>>> + }
> >>>>>>>>>>>>> + rss[MM_ANONPAGES]++;
> >>>>>>>>>>>>> + VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>>>>>> + } else {
> >>>>>>>>>>>>> + page_dup_file_rmap(page, false);
> >>>>>>>>>>>>> + rss[mm_counter_file(page)]++;
> >>>>>>>>>>>>> + }
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> - rss[MM_ANONPAGES]++;
> >>>>>>>>>>>>> - } else if (page) {
> >>>>>>>>>>>>> - folio_get(folio);
> >>>>>>>>>>>>> - page_dup_file_rmap(page, false);
> >>>>>>>>>>>>> - rss[mm_counter_file(page)]++;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> + nr = i;
> >>>>>>>>>>>>> + folio_ref_add(folio, nr);
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>>>>>
> >>>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>>>>>> pages are the corner case.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>>>>>>>> in each basepage.
> >>>>>>>>>>
> >>>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>>>>>> atomic, so we can account the entire thing.
> >>>>>>>>>
> >>>>>>>>> Hi Ryan,
> >>>>>>>>>
> >>>>>>>>> There is no problem. for example, a large folio is entirely mapped in
> >>>>>>>>> process A with CONPTE,
> >>>>>>>>> and only page2 is mapped in process B.
> >>>>>>>>> then we will have
> >>>>>>>>>
> >>>>>>>>> entire_map = 0
> >>>>>>>>> page0.map = -1
> >>>>>>>>> page1.map = -1
> >>>>>>>>> page2.map = 0
> >>>>>>>>> page3.map = -1
> >>>>>>>>> ....
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>>>>>> split.
> >>>>>>>>>>>
> >>>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>>>>>
> >>>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>>>>>
> >>>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>>>>>> =0? (start from -1).
> >>>>>>>>>>>
> >>>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>>>>>> which makes that code easier.
> >>>>>>>>>>
> >>>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>>>>>> unmap the whole folio atomically.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>>>>>> it is partially
> >>>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>>>>>> DoubleMapped.
> >>>>>>>>
> >>>>>>>> There are 2 problems with your proposal, as I see it;
> >>>>>>>>
> >>>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>>>>>> the CONT_PTE bit.
> >>>>>>>>
> >>>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>>>>>> be unmapped unatomically.
> >>>>>>>>
> >>>>>>>> For the PMD case there are actually 2 properties that allow using the
> >>>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>>>>>
> >>>>>>> well. Thanks for clarification. based on the above description, i agree the
> >>>>>>> current code might make more sense by always using mapcount in subpage.
> >>>>>>>
> >>>>>>> I gave my proposals as I thought we were always CONTPTE size for small-THP
> >>>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>>>>>> entirely, we only
> >>>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>>>>>
> >>>>>> Well its always good to have the discussion - so thanks for the ideas. I think
> >>>>>> there is a bigger question lurking here; should we be exposing the concept of
> >>>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >>>>>> I'm confident that would be a huge amount of effort and the end result would be
> >>>>>> similar performace to what this approach gives. One potential benefit of letting
> >>>>>> core-mm control it is that it would also give control to core-mm over the
> >>>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
> >>>>>> folio). Having sub-folio access tracking _could_ potentially help with future
> >>>>>> work to make THP size selection automatic, but we are not there yet, and I think
> >>>>>> there are other (simpler) ways to achieve the same thing. So my view is that
> >>>>>> _not_ exposing it to core-mm is the right way for now.
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> We(OPPO) started a similar project like you even before folio was imported to
> >>>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
> >>>>> on millions of mobile phones on real products and kernels before 5.16, making
> >>>>> a huge success on performance improvement. for example, you may
> >>>>> find the out-of-tree 5.15 source code here
> >>>>
> >>>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> >>>> embarrassed that I clearly didn't do enough research on the prior art because I
> >>>> wasn't aware of your work. So sorry about that.
> >>>>
> >>>> I sensed that you had a different model for how this should work vs what I've
> >>>> implemented and now I understand why :). I'll review your stuff and I'm sure
> >>>> I'll have questions. I'm sure each solution has pros and cons.
> >>>>
> >>>>
> >>>>>
> >>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>>>
> >>>>> Our modification might not be so clean and has lots of workarounds
> >>>>> just for the stability of products
> >>>>>
> >>>>> We mainly have
> >>>>>
> >>>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >>>>>
> >>>>> some CONTPTE helpers
> >>>>>
> >>>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >>>>>
> >>>>> some Dynamic Hugepage APIs
> >>>>>
> >>>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >>>>>
> >>>>> modified all page faults to support
> >>>>> (1). allocation of hugepage of 64KB in do_anon_page
> >>>>
> >>>> My Small-Sized THP patch set is handling the equivalent of this.
> >>>
> >>> right, the only difference is that we did a huge-zeropage for reading
> >>> in do_anon_page.
> >>> mapping all large folios to CONTPTE to zero page.
> >>
> >> FWIW, I took a slightly different approach in my original RFC for the zero page
> >> - although I ripped it all out to simplify for the initial series. I found that
> >> it was pretty rare for user space to read multiple consecutive pages without
> >> ever interleving any writes, so I kept the zero page as a base page, but at CoW,
> >> I would expand the allocation to an approprately sized THP. But for the couple
> >> of workloads that I've gone deep with, I found that it made barely any dent on
> >> the amount of memory that ended up contpte-mapped; the vast majority was from
> >> write allocation in do_anonymous_page().
> >
> > the problem is even if there is only one page read in 16 ptes, you
> > will map the page to
> > zero basepage. then while you write another page in these 16 ptes, you
> > lose the chance
> > to become large folio as pte_range_none() becomes false.
> >
> > if we map these 16ptes to contpte zero page, in do_wp_page, we have a
> > good chance
> > to CoW and get a large anon folio.
>
> Yes understood. I think we are a bit off-topic for this patch set though.
> small-sized THP zero pages can be tackled as a separate series once these
> initial series are in. I'd be happy to review a small-sized THP zero page post :)

I agree this can be deferred. Right now our first priority is the swap-in
series, so I can't give a timeline for when we can send a small-sized THP zero
page series.

>
> >
> >>
> >>>
> >>>>
> >>>>> (2). CoW hugepage in do_wp_page
> >>>>
> >>>> This isn't handled yet in my patch set; the original RFC implemented it but I
> >>>> removed it in order to strip back to the essential complexity for the initial
> >>>> submission. DavidH has been working on a precise shared vs exclusive map
> >>>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> >>>> Out of interest, what workloads benefit most from this?
> >>>
> >>> as a phone, Android has a design almost all processes are forked from zygote.
> >>> thus, CoW happens quite often to all apps.
> >>
> >> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
> >> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
> >> there are cases where that conclusion is wrong.
> >
> > CoW is much less than do_anon_page on my phone which is running dynamic
> > hugepage for a couple of hours:
> >
> > OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
> > ...
> > thp_cow 34669 ---- CoW a large folio
> > thp_do_anon_pages 1032362 ----- a large folio in do_anon_page
> > ...
> >
> > so it is around 34669/1032362 = 3.35%.
>
> well its actually 34669 / (34669 + 1032362) = 3.25%. But, yes, the point is that
> very few of large folios are lost due to CoW so there is likely to be little
> perf impact. Again, I'd happily review a series that enables this!

Right, same as above.

>
> >
> >>
> >>>
> >>>>
> >>>>> (3). copy CONPTEs in copy_pte_range
> >>>>
> >>>> As discussed this is done as part of the contpte patch set, but its not just a
> >>>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
> >>>
> >>> right, i have read all your unfold and fold stuff today, now i understand your
> >>> approach seems quite nice!
> >>
> >> Great - thanks!
> >>
> >>>
> >>>
> >>>>
> >>>>> (4). allocate and swap-in Hugepage as a whole in do_swap_page
> >>>>
> >>>> This is going to be a problem but I haven't even looked at this properly yet.
> >>>> The advice so far has been to continue to swap-in small pages only, but improve
> >>>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> >>>> understand how you did this.
> >>>
> >>> this is also crucial to android phone as swap is always happening
> >>> on an embedded device. if we don't support large folios in swapin,
> >>> our large folios will never come back after it is swapped-out.
> >>>
> >>> and i hated the collapse solution from the first beginning as there is
> >>> never a guarantee to succeed and its overhead is unacceptable to user UI,
> >>> so we supported hugepage allocation in do_swap_page from the first beginning.
> >>
> >> Understood. I agree it would be nice to preserve large folios across swap. I
> >> think this can be layered on top of the current work though.
> >
> > This will be my first priority to use your large folio code on phones.
> > We need a patchset
> > on top of yours :-)
> >
> > without it, we will likely fail. Typically, one phone can have a 4~8GB
> > zRAM to compress
> > a lot of anon pages, if the compression ratio is 1:4, that means
> > uncompressed anon
> > pages are much much more. Thus, while the background app is switched back
> > to foreground, we need those swapped-out large folios back rather than getting
> > small basepages replacement. swap-in basepage is definitely not going to
> > work well on a phone, neither does THP collapse.
>
> Yep understood. From the other thread, it sounds like you are preparing a series
> for large swap-in - looking forward to seeing it!

Right. As said, this is the first priority.

>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> >>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >>>>>
> >>>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
> >>>>
> >>>> I think this is all naturally handled by the folio code that exists in modern
> >>>> kernels?
> >>>
> >>> We had a CONTPTE hugepage pool; if the pool runs very low, we let LRU
> >>> reclaim return large folios to the pool. As phones run lots of apps
> >>> and drivers and memory is very limited, after a couple of hours
> >>> it becomes very hard to allocate large folios from the original buddy; thus,
> >>> large folios would totally disappear after running the phone for some time
> >>> if we didn't have the pool.
> >>>
> >>>>
> >>>>>
> >>>>> So we are 100% interested in your patchset and hope it can find a way
> >>>>> to land in the mainline, removing the cost of maintaining out-of-tree
> >>>>> code from one kernel version to another, which we have done across a
> >>>>> couple of kernel versions before 5.16. We are firmly, 100% supportive
> >>>>> of the large anon folios work you are leading.
> >>>>
> >>>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> >>>> it closer :). If you had any ability to do any A/B performance testing, it would
> >>>> be very interesting to see how this stacks up against your solution - if there
> >>>> are gaps it would be good to know where and develop a plan to plug the gap.
> >>>>
> >>>
> >>> sure.
> >>>
> >>>>>
> >>>>> A big pain was that we found lots of races, especially around CONTPTE
> >>>>> unfolding, and especially where some basepages escaped from the group of
> >>>>> 16 CONTPTEs, since userspace always works on basepages and has no idea
> >>>>> of small-THP. We ran our code on millions of real phones, and by now we
> >>>>> have got them fixed (or maybe "can't reproduce"); there is no
> >>>>> outstanding issue.
> >>>>
> >>>> I'm going to be brave and say that my solution shouldn't suffer from these
> >>>> problems; but of course the proof is only in the testing. I did a lot of work
> >>>> with our architecture group and microarchitects to determine exactly what is
> >>>> and isn't safe; we even tightened the Arm ARM spec very subtly to allow the
> >>>> optimization in patch 13 (see the commit log for details). Of course this has
> >>>> all been checked with partners and we are confident that all existing
> >>>> implementations conform to the modified wording.
> >>>
> >>> Cool. I like your try_unfold/fold code. It seems your code sets/drops
> >>> CONT automatically based on alignment, page count etc. In contrast,
> >>> our code always naively checks some conditions before setting and dropping
> >>> CONT everywhere.
> >>>
> >>>>
> >>>>>
> >>>>> Particularly for the rmap issue we are discussing, our out-of-tree is
> >>>>> using the entire_map for
> >>>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> >>>>> CONTPTE from mm-core.
> >>>>>
> >>>>> We are doing this in mm/memory.c
> >>>>>
> >>>>> copy_present_cont_pte(struct vm_area_struct *dst_vma,
> >>>>>                       struct vm_area_struct *src_vma,
> >>>>>                       pte_t *dst_pte, pte_t *src_pte, unsigned long addr,
> >>>>>                       int *rss, struct page **prealloc)
> >>>>> {
> >>>>>         struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>>         unsigned long vm_flags = src_vma->vm_flags;
> >>>>>         pte_t pte = *src_pte;
> >>>>>         struct page *page;
> >>>>>
> >>>>>         page = vm_normal_page(src_vma, addr, pte);
> >>>>>         ...
> >>>>>
> >>>>>         get_page(page);
> >>>>>         page_dup_rmap(page, true); // an entire dup_rmap, as you can see
> >>>>>         rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> >>>>> }
> >>>>>
> >>>>> and we have a split helper in mm/cont_pte_hugepage.c to handle partial unmap:
> >>>>>
> >>>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> >>>>>                                          unsigned long haddr, bool freeze)
> >>>>> {
> >>>>>         ...
> >>>>>         if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> >>>>>                 for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> >>>>>                         atomic_inc(&head[i]._mapcount);
> >>>>>                 atomic_long_inc(&cont_pte_double_map_count);
> >>>>>         }
> >>>>>
> >>>>>         if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> >>>>>                 ...
> >>>>>         }
> >>>>>
> >>>>> I am not selling our solution any more, but just showing you some differences we
> >>>>> have :-)
> >>>>
> >>>> OK, I understand what you were saying now. I'm currently struggling to see how
> >>>> this could fit into my model. Do you have any workloads and numbers on perf
> >>>> improvement of using entire_mapcount?
> >>>
> >>> TBH, I don't have any data on this; we were using entire_map from the very
> >>> beginning, so I have no comparison at all.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> BTW, I have concerns about whether a variable small-THP size will really
> >>>>>>> work, as userspace is probably friendly to only one fixed size. For example,
> >>>>>>> userspace heap management might be optimized around one size for freeing
> >>>>>>> memory to the kernel; it is very difficult for the heap to adapt to various
> >>>>>>> sizes at the same time. Frequent unmaps/frees whose size differs from, and
> >>>>>>> particularly is smaller than, the small-THP size will defeat all efforts to
> >>>>>>> use small-THP.
> >>>>>>
> >>>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
> >>>>>> say that as currently defined, the small-sized THP interface to user space
> >>>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >>>>>> single size can be enabled. I'm deliberately punting that decision away from the
> >>>>>> kernel for now.
> >>>>>
> >>>>> Basically, a userspace heap library has a PAGESIZE setting and allows users
> >>>>> to allocate/free all kinds of small objects such as 16, 32, 64, 128, 256, 512
> >>>>> bytes etc. The default page size is of course equal to the basepage size. Once
> >>>>> some objects are freed by free() and libc gets a free "page", userspace heap
> >>>>> libraries might return the PAGESIZE page to the kernel via things like
> >>>>> MADV_DONTNEED, which ends up in zap_pte_range(). It is quite similar to the
> >>>>> kernel slab.
> >>>>>
> >>>>> So imagine we have small-THP now but userspace libraries have *no* idea at
> >>>>> all; this can frequently cause unfolding.
> >>>>>
> >>>>>>
> >>>>>> FWIW, my experience with the Speedometer/JavaScript use case is that performance
> >>>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>>>>>
> >>>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
> >>>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
> >>>>>> the deferred split list, then under memory pressure it is split and the unused
> >>>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
> >>>>>> performance impact?
> >>>>>
> >>>>> Right. If this is happening to the majority of small-THP folios, we get no
> >>>>> performance improvement, and probably a regression instead. This is really
> >>>>> true on real workloads!!
> >>>>>
> >>>>> So that is why we would really love a per-VMA hint to enable small-THP, but
> >>>>> obviously you have already supported that now with
> >>>>> "mm: thp: Introduce per-size thp sysfs interface":
> >>>>> https://lore.kernel.org/linux-mm/[email protected]/
> >>>>>
> >>>>> We can use MADVISE rather than ALWAYS and set a fixed size like 64KB, so
> >>>>> userspace can set the VMA flag when it is quite sure this VMA works with
> >>>>> 64KB alignment?
> >>>>
> >>>> Yes, that all exists in the series today. We have also discussed the possibility
> >>>> of adding a new madvise_process() call that would take the set of THP sizes that
> >>>> should be considered. Then you can set different VMAs to use different sizes;
> >>>> the plan was to layer that on top if/when a workload was identified. Sounds like
> >>>> you might be able to help there?
> >>>
> >>> I'm not quite sure, as on phones we are using fixed-size CONTPTE, so we ask
> >>> for either 64KB or 4KB. If we think one VMA is good to use CONTPTE, we
> >>> set a flag in this VMA and try to allocate 64KB.
> >>
> >> When you say "we set a flag" do you mean user space? Or is there some heuristic
> >> in the kernel?
> >
> > We are using a field added by the Android kernel to the vma struct to mark
> > that this vma is good to use CONTPTE. With the upstream solution you are
> > providing, we can remove this dirty code [1].
> > static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
> > {
> >         return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
> > }
>
> Sorry I'm not sure I've understood; how does that flag get set in the first
> place? Does user space tell the kernel (via e.g. madvise()) or does the kernel
> set it based on some defined heuristics?

Basically we did it in an ugly way. On Android, different vma types have
different names. For some types of vma, we have optimized userspace and tried
to decrease/avoid fragmentation and unaligned CONTPTE unfolds. So in the
kernel we compare the name of the vma, and if it is an optimized vma type, we
set the field in the vma. Note that in many cases we have to write dirty code
like this as we have to follow the Android kernel's KMI :-)

Based on your new sysfs interface, we can move to madvise(MADV_HUGEPAGE) and
set the 64KB size to "madvise".
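
Roughly, the flow we have in mind would look like the sketch below. This is
only an illustration: the sysfs path is an assumption based on the per-size
interface from the series referenced above and may differ, and the region
size and alignment are just example values.

/*
 * Sketch only. Assumes the per-size sysfs knob, e.g.:
 *
 *   echo madvise > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
 *
 * Userspace then opts a 64KB-aligned region in with MADV_HUGEPAGE.
 */
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define REGION_SZ	(64UL * 1024 * 1024)	/* example: 64MB arena */

int main(void)
{
	/* 64KB alignment so contpte-sized (64KB) folios can actually be used. */
	void *buf = aligned_alloc(64 * 1024, REGION_SZ);

	if (!buf)
		return 1;

	/* This is only a hint; the kernel may still fall back to 4KB pages. */
	madvise(buf, REGION_SZ, MADV_HUGEPAGE);

	/* ... heap/arena work on buf ... */

	free(buf);
	return 0;
}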

BTW, large anon folios can bring disaster to an unoptimized userspace,
especially on a memory-limited system, since the memory footprint can increase
terribly. So it is really nice to have your new sysfs interface and let
userspace decide whether it wants large folios.

>
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
> >

Thanks
Barry

2023-11-30 05:08:28

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>> have modified the bits during those loops. I think it would be possible to fix
>>>> the race, but intuition says it would be expensive.
>>>
>>> So the issue as I understand it is subsequent iterations would see a
>>> clean PTE after the first iteration returned a dirty PTE. In
>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>> all the PTEs?
>>
>> The raciness I'm describing is the race between reading access/dirty from one
>> pte and applying it to another. But yes I like your suggestion. if we do:
>>
>> pte = __ptep_get_and_clear_full(ptep)
>>
>> on the target pte, then we have grabbed access/dirty from it in a race-free
>> manner. we can then loop from current pte up towards the top of the block until
>> we find a valid entry (and I guess wrap at the top to make us robust against
>> future callers clearing in an arbitrary order). Then atomically accumulate the
>> access/dirty bits we have just saved into that new entry. I guess that's just a
>> cmpxchg loop - there are already examples of how to do that correctly when
>> racing the TLB.
>>
>> For most entries, we will just be copying up to the next pte. For the last pte,
>> we would end up reading all ptes and determine we are the last one.
>>
>> What do you think?
>
> OK here is an attempt at something which solves the fragility. I think this is
> now robust and will always return the correct access/dirty state from
> ptep_get_and_clear_full() and ptep_get().
>
> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
> bits from across the contpte block - which requires reading each pte in the
> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>
> Was this the type of thing you were thinking of, Alistair?

Yes, that is along the lines of what I was thinking. However I have
added a couple of comments inline.

> --8<--
> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
> arch/arm64/mm/fault.c | 38 +++++++++------
> 3 files changed, 125 insertions(+), 17 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 9bd2f57a9e11..6c295d277784 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
> }
>
> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep,
> pte_t entry, int dirty);
> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr);
> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
> __pte_clear(mm, addr, ptep);
> }
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, int full)
> +{
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return __ptep_get_and_clear(mm, addr, ptep);
> +
> + if (!full) {
> + contpte_try_unfold(mm, addr, ptep, orig_pte);
> + return __ptep_get_and_clear(mm, addr, ptep);
> + }
> +
> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +
> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> - return __ptep_get_and_clear(mm, addr, ptep);
> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
> }
>
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 2a57df16bf58..99b211118d93 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> + /*
> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
> + * where some of the ptes in the range may be cleared but others
> + * are still to do. See contpte_ptep_get_and_clear_full().
> + */
> + if (!pte_valid(pte))
> + continue;
> +
> if (pte_dirty(pte))
> orig_pte = pte_mkdirty(orig_pte);
>
> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL(contpte_set_ptes);
>
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * When doing a full address space teardown, we can avoid unfolding the
> + * contiguous range, and therefore avoid the associated tlbi. Instead,
> + * just get and clear the pte. The caller is promising to call us for
> + * every pte, so every pte in the range will be cleared by the time the
> + * final tlbi is issued.
> + *
> + * This approach requires some complex hoop jumping though, as for the
> + * duration between returning from the first call to
> + * ptep_get_and_clear_full() and making the final call, the contpte
> + * block is in an intermediate state, where some ptes are cleared and
> + * others are still set with the PTE_CONT bit. If any other APIs are
> + * called for the ptes in the contpte block during that time, we have to
> + * be very careful. The core code currently interleaves calls to
> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
> + * careful to ignore the cleared entries when accumulating the access
> + * and dirty bits - the same goes for ptep_get_lockless(). The only
> + * other calls we might reasonably expect are to set markers in the
> + * previously cleared ptes. (We shouldn't see valid entries being set
> + * until after the tlbi, at which point we are no longer in the
> + * intermediate state). Since markers are not valid, this is safe;
> + * set_ptes() will see the old, invalid entry and will not attempt to
> + * unfold. And the new pte is also invalid so it won't attempt to fold.
> + * We shouldn't see pte markers being set for the 'full' case anyway
> + * since the address space is being torn down.
> + *
> + * The last remaining issue is returning the access/dirty bits. That
> + * info could be present in any of the ptes in the contpte block.
> + * ptep_get() will gather those bits from across the contpte block (for
> + * the remaining valid entries). So below, if the pte we are clearing
> + * has dirty or young set, we need to stash it into a pte that we are
> + * yet to clear. This allows future calls to return the correct state
> + * even when the info was stored in a different pte. Since the core-mm
> + * calls from low to high address, we prefer to stash in the last pte of
> + * the contpte block - this means we are not "dragging" the bits up
> + * through all ptes and increases the chances that we can exit early
> + * because a given pte will have neither dirty nor young set.
> + */
> +
> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
> + bool dirty = pte_dirty(orig_pte);
> + bool young = pte_young(orig_pte);
> + pte_t *start;
> +
> + if (!dirty && !young)
> + return contpte_ptep_get(ptep, orig_pte);

I don't think we need to do this. If the PTE is !dirty && !young we can
just return it. As you say we have to assume HW can set those flags at
any time anyway so it doesn't get us much. This means in the common case
we should only run through the loop setting the dirty/young flags once
which should allay the performance concerns.

However I am now wondering if we're doing the wrong thing trying to hide
this down in the arch layer anyway. Perhaps it would be better to deal
with this in the core-mm code after all.

So how about having ptep_get_and_clear_full() clearing the PTEs for the
entire cont block? We know by definition all PTEs should be pointing to
the same folio anyway, and it seems at least zap_pte_range() would cope
with this just fine because subsequent iterations would just see
pte_none() and continue the loop. I haven't checked the other call sites
though, but in principle I don't see why we couldn't define
ptep_get_and_clear_full() as being something that clears all PTEs
mapping a given folio (although it might need renaming).

This does assume you don't need to partially unmap a page in
zap_pte_range (ie. end >= folio), but we're already making that
assumption.
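
To illustrate what I mean, here is a very rough sketch of the shape of the
per-pte loop and why it tolerates a batched clear: entries already cleared by
an earlier call simply read back as none and are skipped. This is not the real
zap_pte_range(); the function name and signature are invented for illustration.

static void zap_range_sketch(struct mmu_gather *tlb, struct mm_struct *mm,
			     pte_t *pte, unsigned long addr, unsigned long end)
{
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);

		/* Already cleared as part of an earlier batched call. */
		if (pte_none(ptent))
			continue;

		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);

		/* ... tlb bookkeeping, rmap removal, rss accounting ... */
	}
}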

> +
> + start = contpte_align_down(ptep);
> + ptep = start + CONT_PTES - 1;
> +
> + for (; ptep >= start; ptep--) {
> + pte_t pte = __ptep_get(ptep);
> +
> + if (!pte_valid(pte))
> + continue;
> +
> + if (dirty)
> + pte = pte_mkdirty(pte);
> +
> + if (young)
> + pte = pte_mkyoung(pte);
> +
> + __ptep_set_access_flags_notlbi(ptep, pte);
> + return contpte_ptep_get(ptep, orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +
> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
> {
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index d63f3a0a7251..b22216a8153c 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
> pr_cont("\n");
> }
>
> -/*
> - * This function sets the access flags (dirty, accessed), as well as write
> - * permission, and only to a more permissive setting.
> - *
> - * It needs to cope with hardware update of the accessed/dirty state by other
> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> - *
> - * Returns whether or not the PTE actually changed.
> - */
> -int __ptep_set_access_flags(struct vm_area_struct *vma,
> - unsigned long address, pte_t *ptep,
> - pte_t entry, int dirty)
> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
> {
> pteval_t old_pteval, pteval;
> pte_t pte = __ptep_get(ptep);
> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
> pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
> } while (pteval != old_pteval);
>
> + return 1;
> +}
> +
> +/*
> + * This function sets the access flags (dirty, accessed), as well as write
> + * permission, and only to a more permissive setting.
> + *
> + * It needs to cope with hardware update of the accessed/dirty state by other
> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> + *
> + * Returns whether or not the PTE actually changed.
> + */
> +int __ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long address, pte_t *ptep,
> + pte_t entry, int dirty)
> +{
> + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
> +
> /* Invalidate a stale read-only entry */
> - if (dirty)
> + if (changed && dirty)
> flush_tlb_page(vma, address);
> - return 1;
> +
> + return changed;
> }
>
> static bool is_el1_instruction_abort(unsigned long esr)
> --8<--

2023-11-30 05:58:26

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On Thu, Nov 30, 2023 at 1:08 PM Alistair Popple <[email protected]> wrote:
>
>
> Ryan Roberts <[email protected]> writes:
>
> >>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
> >>>> buggy because it iterated through the PTEs, getting and accumulating. Then
> >>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
> >>>> have modified the bits during those loops. I think it would be possible to fix
> >>>> the race, but intuition says it would be expensive.
> >>>
> >>> So the issue as I understand it is subsequent iterations would see a
> >>> clean PTE after the first iteration returned a dirty PTE. In
> >>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
> >>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
> >>> all the PTEs?
> >>
> >> The raciness I'm describing is the race between reading access/dirty from one
> >> pte and applying it to another. But yes I like your suggestion. if we do:
> >>
> >> pte = __ptep_get_and_clear_full(ptep)
> >>
> >> on the target pte, then we have grabbed access/dirty from it in a race-free
> >> manner. we can then loop from current pte up towards the top of the block until
> >> we find a valid entry (and I guess wrap at the top to make us robust against
> >> future callers clearing in an arbitrary order). Then atomically accumulate the
> >> access/dirty bits we have just saved into that new entry. I guess that's just a
> >> cmpxchg loop - there are already examples of how to do that correctly when
> >> racing the TLB.
> >>
> >> For most entries, we will just be copying up to the next pte. For the last pte,
> >> we would end up reading all ptes and determine we are the last one.
> >>
> >> What do you think?
> >
> > OK here is an attempt at something which solves the fragility. I think this is
> > now robust and will always return the correct access/dirty state from
> > ptep_get_and_clear_full() and ptep_get().
> >
> > But I'm not sure about performance; each call to ptep_get_and_clear_full() for
> > each pte in a contpte block will cause a ptep_get() to gather the access/dirty
> > bits from across the contpte block - which requires reading each pte in the
> > contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
> >
> > Was this the type of thing you were thinking of, Alistair?
>
> Yes, that is along the lines of what I was thinking. However I have
> added a couple of comments inline.
>
> > --8<--
> > arch/arm64/include/asm/pgtable.h | 23 ++++++++-
> > arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
> > arch/arm64/mm/fault.c | 38 +++++++++------
> > 3 files changed, 125 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 9bd2f57a9e11..6c295d277784 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> > return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
> > }
> >
> > +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
> > extern int __ptep_set_access_flags(struct vm_area_struct *vma,
> > unsigned long address, pte_t *ptep,
> > pte_t entry, int dirty);
> > @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> > extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> > extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> > pte_t *ptep, pte_t pte, unsigned int nr);
> > +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> > + unsigned long addr, pte_t *ptep);
> > extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long addr, pte_t *ptep);
> > extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> > @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
> > __pte_clear(mm, addr, ptep);
> > }
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> > +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > + unsigned long addr, pte_t *ptep, int full)
> > +{
> > + pte_t orig_pte = __ptep_get(ptep);
> > +
> > + if (!pte_valid_cont(orig_pte))
> > + return __ptep_get_and_clear(mm, addr, ptep);
> > +
> > + if (!full) {
> > + contpte_try_unfold(mm, addr, ptep, orig_pte);
> > + return __ptep_get_and_clear(mm, addr, ptep);
> > + }
> > +
> > + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> > +}
> > +
> > #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> > static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> > unsigned long addr, pte_t *ptep)
> > {
> > - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> > - return __ptep_get_and_clear(mm, addr, ptep);
> > + return ptep_get_and_clear_full(mm, addr, ptep, 0);
> > }
> >
> > #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index 2a57df16bf58..99b211118d93 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> > for (i = 0; i < CONT_PTES; i++, ptep++) {
> > pte = __ptep_get(ptep);
> >
> > + /*
> > + * Deal with the partial contpte_ptep_get_and_clear_full() case,
> > + * where some of the ptes in the range may be cleared but others
> > + * are still to do. See contpte_ptep_get_and_clear_full().
> > + */
> > + if (!pte_valid(pte))
> > + continue;
> > +
> > if (pte_dirty(pte))
> > orig_pte = pte_mkdirty(orig_pte);
> >
> > @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> > }
> > EXPORT_SYMBOL(contpte_set_ptes);
> >
> > +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> > + unsigned long addr, pte_t *ptep)
> > +{
> > + /*
> > + * When doing a full address space teardown, we can avoid unfolding the
> > + * contiguous range, and therefore avoid the associated tlbi. Instead,
> > + * just get and clear the pte. The caller is promising to call us for
> > + * every pte, so every pte in the range will be cleared by the time the
> > + * final tlbi is issued.
> > + *
> > + * This approach requires some complex hoop jumping though, as for the
> > + * duration between returning from the first call to
> > + * ptep_get_and_clear_full() and making the final call, the contpte
> > + * block is in an intermediate state, where some ptes are cleared and
> > + * others are still set with the PTE_CONT bit. If any other APIs are
> > + * called for the ptes in the contpte block during that time, we have to
> > + * be very careful. The core code currently interleaves calls to
> > + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
> > + * careful to ignore the cleared entries when accumulating the access
> > + * and dirty bits - the same goes for ptep_get_lockless(). The only
> > + * other calls we might reasonably expect are to set markers in the
> > + * previously cleared ptes. (We shouldn't see valid entries being set
> > + * until after the tlbi, at which point we are no longer in the
> > + * intermediate state). Since markers are not valid, this is safe;
> > + * set_ptes() will see the old, invalid entry and will not attempt to
> > + * unfold. And the new pte is also invalid so it won't attempt to fold.
> > + * We shouldn't see pte markers being set for the 'full' case anyway
> > + * since the address space is being torn down.
> > + *
> > + * The last remaining issue is returning the access/dirty bits. That
> > + * info could be present in any of the ptes in the contpte block.
> > + * ptep_get() will gather those bits from across the contpte block (for
> > + * the remaining valid entries). So below, if the pte we are clearing
> > + * has dirty or young set, we need to stash it into a pte that we are
> > + * yet to clear. This allows future calls to return the correct state
> > + * even when the info was stored in a different pte. Since the core-mm
> > + * calls from low to high address, we prefer to stash in the last pte of
> > + * the contpte block - this means we are not "dragging" the bits up
> > + * through all ptes and increases the chances that we can exit early
> > + * because a given pte will have neither dirty nor young set.
> > + */
> > +
> > + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
> > + bool dirty = pte_dirty(orig_pte);
> > + bool young = pte_young(orig_pte);
> > + pte_t *start;
> > +
> > + if (!dirty && !young)
> > + return contpte_ptep_get(ptep, orig_pte);
>
> I don't think we need to do this. If the PTE is !dirty && !young we can
> just return it. As you say we have to assume HW can set those flags at
> any time anyway so it doesn't get us much. This means in the common case
> we should only run through the loop setting the dirty/young flags once
> which should allay the performance concerns.
>
> However I am now wondering if we're doing the wrong thing trying to hide
> this down in the arch layer anyway. Perhaps it would be better to deal
> with this in the core-mm code after all.
>
> So how about having ptep_get_and_clear_full() clearing the PTEs for the
> entire cont block? We know by definition all PTEs should be pointing to

I truly believe we should clear all PTEs for the entire folio block. However,
if the existing API ptep_get_and_clear_full() is meant to always handle a single
PTE, we might keep its behaviour as is. On the other hand, clearing the
whole block isn't only required in the fullmm case; it is also a requirement for
normal zap_pte_range() cases coming from madvise(DONTNEED) etc.

I do think we need a folio-level variant. As we are now supporting pte-mapped
large folios, we need some new API to handle a folio's PTEs as a whole, since
we always need to drop the whole folio rather than one PTE at a time when the
pages are compound.

> the same folio anyway, and it seems at least zap_pte_range() would cope
> with this just fine because subsequent iterations would just see
> pte_none() and continue the loop. I haven't checked the other call sites
> > though, but in principle I don't see why we couldn't define
> ptep_get_and_clear_full() as being something that clears all PTEs
> mapping a given folio (although it might need renaming).
>
> This does assume you don't need to partially unmap a page in
> zap_pte_range (ie. end >= folio), but we're already making that
> assumption.
>
> > +
> > + start = contpte_align_down(ptep);
> > + ptep = start + CONT_PTES - 1;
> > +
> > + for (; ptep >= start; ptep--) {
> > + pte_t pte = __ptep_get(ptep);
> > +
> > + if (!pte_valid(pte))
> > + continue;
> > +
> > + if (dirty)
> > + pte = pte_mkdirty(pte);
> > +
> > + if (young)
> > + pte = pte_mkyoung(pte);
> > +
> > + __ptep_set_access_flags_notlbi(ptep, pte);
> > + return contpte_ptep_get(ptep, orig_pte);
> > + }
> > +
> > + return orig_pte;
> > +}
> > +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> > +
> > int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long addr, pte_t *ptep)
> > {
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index d63f3a0a7251..b22216a8153c 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
> > pr_cont("\n");
> > }
> >
> > -/*
> > - * This function sets the access flags (dirty, accessed), as well as write
> > - * permission, and only to a more permissive setting.
> > - *
> > - * It needs to cope with hardware update of the accessed/dirty state by other
> > - * agents in the system and can safely skip the __sync_icache_dcache() call as,
> > - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> > - *
> > - * Returns whether or not the PTE actually changed.
> > - */
> > -int __ptep_set_access_flags(struct vm_area_struct *vma,
> > - unsigned long address, pte_t *ptep,
> > - pte_t entry, int dirty)
> > +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
> > {
> > pteval_t old_pteval, pteval;
> > pte_t pte = __ptep_get(ptep);
> > @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
> > pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
> > } while (pteval != old_pteval);
> >
> > + return 1;
> > +}
> > +
> > +/*
> > + * This function sets the access flags (dirty, accessed), as well as write
> > + * permission, and only to a more permissive setting.
> > + *
> > + * It needs to cope with hardware update of the accessed/dirty state by other
> > + * agents in the system and can safely skip the __sync_icache_dcache() call as,
> > + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> > + *
> > + * Returns whether or not the PTE actually changed.
> > + */
> > +int __ptep_set_access_flags(struct vm_area_struct *vma,
> > + unsigned long address, pte_t *ptep,
> > + pte_t entry, int dirty)
> > +{
> > + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
> > +
> > /* Invalidate a stale read-only entry */
> > - if (dirty)
> > + if (changed && dirty)
> > flush_tlb_page(vma, address);
> > - return 1;
> > +
> > + return changed;
> > }
> >
> > static bool is_el1_instruction_abort(unsigned long esr)
> > --8<--
>

Thanks
Barry

2023-11-30 11:47:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 30/11/2023 05:07, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>> the race, but intuition says it would be expensive.
>>>>
>>>> So the issue as I understand it is subsequent iterations would see a
>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>> all the PTEs?
>>>
>>> The raciness I'm describing is the race between reading access/dirty from one
>>> pte and applying it to another. But yes I like your suggestion. if we do:
>>>
>>> pte = __ptep_get_and_clear_full(ptep)
>>>
>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>> manner. we can then loop from current pte up towards the top of the block until
>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>> future callers clearing in an arbitrary order). Then atomically accumulate the
>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>> cmpxchg loop - there are already examples of how to do that correctly when
>>> racing the TLB.
>>>
>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>> we would end up reading all ptes and determine we are the last one.
>>>
>>> What do you think?
>>
>> OK here is an attempt at something which solves the fragility. I think this is
>> now robust and will always return the correct access/dirty state from
>> ptep_get_and_clear_full() and ptep_get().
>>
>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>> bits from across the contpte block - which requires reading each pte in the
>> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>>
>> Was this the type of thing you were thinking of, Alistair?
>
> Yes, that is along the lines of what I was thinking. However I have
> added a couple of comments inline.
>
>> --8<--
>> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
>> arch/arm64/mm/fault.c | 38 +++++++++------
>> 3 files changed, 125 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 9bd2f57a9e11..6c295d277784 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>> }
>>
>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>> unsigned long address, pte_t *ptep,
>> pte_t entry, int dirty);
>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep);
>> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep);
>> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>> __pte_clear(mm, addr, ptep);
>> }
>>
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep, int full)
>> +{
>> + pte_t orig_pte = __ptep_get(ptep);
>> +
>> + if (!pte_valid_cont(orig_pte))
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> +
>> + if (!full) {
>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>> + return __ptep_get_and_clear(mm, addr, ptep);
>> + }
>> +
>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>> +}
>> +
>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep)
>> {
>> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> - return __ptep_get_and_clear(mm, addr, ptep);
>> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
>> }
>>
>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index 2a57df16bf58..99b211118d93 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>>
>> + /*
>> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> + * where some of the ptes in the range may be cleared but others
>> + * are still to do. See contpte_ptep_get_and_clear_full().
>> + */
>> + if (!pte_valid(pte))
>> + continue;
>> +
>> if (pte_dirty(pte))
>> orig_pte = pte_mkdirty(orig_pte);
>>
>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL(contpte_set_ptes);
>>
>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep)
>> +{
>> + /*
>> + * When doing a full address space teardown, we can avoid unfolding the
>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>> + * just get and clear the pte. The caller is promising to call us for
>> + * every pte, so every pte in the range will be cleared by the time the
>> + * final tlbi is issued.
>> + *
>> + * This approach requires some complex hoop jumping though, as for the
>> + * duration between returning from the first call to
>> + * ptep_get_and_clear_full() and making the final call, the contpte
>> + * block is in an intermediate state, where some ptes are cleared and
>> + * others are still set with the PTE_CONT bit. If any other APIs are
>> + * called for the ptes in the contpte block during that time, we have to
>> + * be very careful. The core code currently interleaves calls to
>> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>> + * careful to ignore the cleared entries when accumulating the access
>> + * and dirty bits - the same goes for ptep_get_lockless(). The only
>> + * other calls we might reasonably expect are to set markers in the
>> + * previously cleared ptes. (We shouldn't see valid entries being set
>> + * until after the tlbi, at which point we are no longer in the
>> + * intermediate state). Since markers are not valid, this is safe;
>> + * set_ptes() will see the old, invalid entry and will not attempt to
>> + * unfold. And the new pte is also invalid so it won't attempt to fold.
>> + * We shouldn't see pte markers being set for the 'full' case anyway
>> + * since the address space is being torn down.
>> + *
>> + * The last remaining issue is returning the access/dirty bits. That
>> + * info could be present in any of the ptes in the contpte block.
>> + * ptep_get() will gather those bits from across the contpte block (for
>> + * the remaining valid entries). So below, if the pte we are clearing
>> + * has dirty or young set, we need to stash it into a pte that we are
>> + * yet to clear. This allows future calls to return the correct state
>> + * even when the info was stored in a different pte. Since the core-mm
>> + * calls from low to high address, we prefer to stash in the last pte of
>> + * the contpte block - this means we are not "dragging" the bits up
>> + * through all ptes and increases the chances that we can exit early
>> + * because a given pte will have neither dirty nor young set.
>> + */
>> +
>> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>> + bool dirty = pte_dirty(orig_pte);
>> + bool young = pte_young(orig_pte);
>> + pte_t *start;
>> +
>> + if (!dirty && !young)
>> + return contpte_ptep_get(ptep, orig_pte);
>
> I don't think we need to do this. If the PTE is !dirty && !young we can
> just return it. As you say we have to assume HW can set those flags at
> any time anyway so it doesn't get us much. This means in the common case
> we should only run through the loop setting the dirty/young flags once
> which should allay the performance concerns.

I don't follow your logic. This is precisely the problem I was trying to solve
vs my original (simple) attempt - we want to always report the correct
access/dirty info. If we read one of the PTEs and neither access nor dirty are
set, that doesn't mean it's old and clean; it just means that that info is
definitely not stored in this PTE - we need to check the others. (when the
contiguous bit is set, the HW will only update the access/dirty bits for 1 of
the PTEs in the contpte block).

Also, IIRC, the core-mm sets access when initially setting up the
mapping, so it's not guaranteed that all but one of the PTEs in the contpte
block have (!dirty && !young).

>
> However I am now wondering if we're doing the wrong thing trying to hide
> this down in the arch layer anyway. Perhaps it would be better to deal
> with this in the core-mm code after all.
>
> So how about having ptep_get_and_clear_full() clearing the PTEs for the
> entire cont block? We know by definition all PTEs should be pointing to
> the same folio anyway, and it seems at least zap_pte_range() would cope
> with this just fine because subsequent iterations would just see
> pte_none() and continue the loop. I haven't checked the other call sites
> though, but in principle I don't see why we couldn't define
> ptep_get_and_clear_full() as being something that clears all PTEs
> mapping a given folio (although it might need renaming).

Ahha! Yes, I've been working on a solution like this since Barry raised it
yesterday. I have a working version that seems to perform well. I wouldn't want
to just clear all the PTEs in the block inside ptep_get_and_clear_full() because,
although it might work today, it's fragile in the same way that my v2 version is.

Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
a number of ptes to clear (like set_ptes()). It returns the PTE read from the
*first* slot, but with the access/dirty bits being accumulated from all of the
ptes in the requested batch. Then zap_pte_range() is reworked to find
appropriate batches (similar to how I've reworked it for ptep_set_wrprotects()).
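
Roughly, the shape is something like the sketch below. This is just a sketch to
show the idea, not the code I'll post: the name, signature and the use of the
arm64 __ptep_get_and_clear() helper are assumptions based on the description
above, and the fold/unfold and TLB details are omitted.

static inline pte_t clear_ptes_sketch(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, unsigned int nr)
{
	/* Clear the first slot; this provides the pfn/prot that we return. */
	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t pte;

		addr += PAGE_SIZE;
		ptep++;
		pte = __ptep_get_and_clear(mm, addr, ptep);

		/* Accumulate access/dirty from every slot into the result. */
		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}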

I was trying to avoid introducing new helpers, but I think this is the most
robust approach, and it looks slightly more performant too, at first sight. It also
addresses cases where full=0, which Barry says are important for madvise(DONTNEED).

>
> This does assume you don't need to partially unmap a page in
> zap_pte_range (ie. end >= folio), but we're already making that
> assumption.

That's fine for full=1. But we can't make that assumption for full=0. If a VMA
gets split for a reason that doesn't require re-setting the PTEs then a contpte
block could straddle 2 VMAs. But the solution I describe above is robust to that.

I'll finish gathering perf data and then post results for all 3 approaches: v2 as
originally posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully
later today.

>
>> +
>> + start = contpte_align_down(ptep);
>> + ptep = start + CONT_PTES - 1;
>> +
>> + for (; ptep >= start; ptep--) {
>> + pte_t pte = __ptep_get(ptep);
>> +
>> + if (!pte_valid(pte))
>> + continue;
>> +
>> + if (dirty)
>> + pte = pte_mkdirty(pte);
>> +
>> + if (young)
>> + pte = pte_mkyoung(pte);
>> +
>> + __ptep_set_access_flags_notlbi(ptep, pte);
>> + return contpte_ptep_get(ptep, orig_pte);
>> + }
>> +
>> + return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>> +
>> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep)
>> {
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index d63f3a0a7251..b22216a8153c 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
>> pr_cont("\n");
>> }
>>
>> -/*
>> - * This function sets the access flags (dirty, accessed), as well as write
>> - * permission, and only to a more permissive setting.
>> - *
>> - * It needs to cope with hardware update of the accessed/dirty state by other
>> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
>> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>> - *
>> - * Returns whether or not the PTE actually changed.
>> - */
>> -int __ptep_set_access_flags(struct vm_area_struct *vma,
>> - unsigned long address, pte_t *ptep,
>> - pte_t entry, int dirty)
>> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
>> {
>> pteval_t old_pteval, pteval;
>> pte_t pte = __ptep_get(ptep);
>> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>> pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
>> } while (pteval != old_pteval);
>>
>> + return 1;
>> +}
>> +
>> +/*
>> + * This function sets the access flags (dirty, accessed), as well as write
>> + * permission, and only to a more permissive setting.
>> + *
>> + * It needs to cope with hardware update of the accessed/dirty state by other
>> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
>> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>> + *
>> + * Returns whether or not the PTE actually changed.
>> + */
>> +int __ptep_set_access_flags(struct vm_area_struct *vma,
>> + unsigned long address, pte_t *ptep,
>> + pte_t entry, int dirty)
>> +{
>> + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
>> +
>> /* Invalidate a stale read-only entry */
>> - if (dirty)
>> + if (changed && dirty)
>> flush_tlb_page(vma, address);
>> - return 1;
>> +
>> + return changed;
>> }
>>
>> static bool is_el1_instruction_abort(unsigned long esr)
>> --8<--
>

2023-12-03 23:20:51

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown


Ryan Roberts <[email protected]> writes:

> On 30/11/2023 05:07, Alistair Popple wrote:
>>
>> Ryan Roberts <[email protected]> writes:
>>
>>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>>> the race, but intuition says it would be expensive.
>>>>>
>>>>> So the issue as I understand it is subsequent iterations would see a
>>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>>> all the PTEs?
>>>>
>>>> The raciness I'm describing is the race between reading access/dirty from one
>>>> pte and applying it to another. But yes I like your suggestion. if we do:
>>>>
>>>> pte = __ptep_get_and_clear_full(ptep)
>>>>
>>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>>> manner. we can then loop from current pte up towards the top of the block until
>>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>>> future callers clearing in an arbitrary order). Then atomically accumulate the
>>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>>> cmpxchg loop - there are already examples of how to do that correctly when
>>>> racing the TLB.
>>>>
>>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>>> we would end up reading all ptes and determine we are the last one.
>>>>
>>>> What do you think?
>>>
>>> OK here is an attempt at something which solves the fragility. I think this is
>>> now robust and will always return the correct access/dirty state from
>>> ptep_get_and_clear_full() and ptep_get().
>>>
>>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>>> bits from across the contpte block - which requires reading each pte in the
>>> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>>>
>>> Was this the type of thing you were thinking of, Alistair?
>>
>> Yes, that is along the lines of what I was thinking. However I have
>> added a couple of comments inline.
>>
>>> --8<--
>>> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>>> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
>>> arch/arm64/mm/fault.c | 38 +++++++++------
>>> 3 files changed, 125 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 9bd2f57a9e11..6c295d277784 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>>> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>>> }
>>>
>>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>>> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> unsigned long address, pte_t *ptep,
>>> pte_t entry, int dirty);
>>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep, pte_t pte, unsigned int nr);
>>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>> + unsigned long addr, pte_t *ptep);
>>> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> unsigned long addr, pte_t *ptep);
>>> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>>> __pte_clear(mm, addr, ptep);
>>> }
>>>
>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>> + unsigned long addr, pte_t *ptep, int full)
>>> +{
>>> + pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> + if (!pte_valid_cont(orig_pte))
>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> + if (!full) {
>>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>> + }
>>> +
>>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>>> +}
>>> +
>>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> - return __ptep_get_and_clear(mm, addr, ptep);
>>> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
>>> }
>>>
>>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index 2a57df16bf58..99b211118d93 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>>> pte = __ptep_get(ptep);
>>>
>>> + /*
>>> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
>>> + * where some of the ptes in the range may be cleared but others
>>> + * are still to do. See contpte_ptep_get_and_clear_full().
>>> + */
>>> + if (!pte_valid(pte))
>>> + continue;
>>> +
>>> if (pte_dirty(pte))
>>> orig_pte = pte_mkdirty(orig_pte);
>>>
>>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> }
>>> EXPORT_SYMBOL(contpte_set_ptes);
>>>
>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>> + unsigned long addr, pte_t *ptep)
>>> +{
>>> + /*
>>> + * When doing a full address space teardown, we can avoid unfolding the
>>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>>> + * just get and clear the pte. The caller is promising to call us for
>>> + * every pte, so every pte in the range will be cleared by the time the
>>> + * final tlbi is issued.
>>> + *
>>> + * This approach requires some complex hoop jumping though, as for the
>>> + * duration between returning from the first call to
>>> + * ptep_get_and_clear_full() and making the final call, the contpte
>>> + * block is in an intermediate state, where some ptes are cleared and
>>> + * others are still set with the PTE_CONT bit. If any other APIs are
>>> + * called for the ptes in the contpte block during that time, we have to
>>> + * be very careful. The core code currently interleaves calls to
>>> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>>> + * careful to ignore the cleared entries when accumulating the access
>>> + * and dirty bits - the same goes for ptep_get_lockless(). The only
>>> + * other calls we might reasonably expect are to set markers in the
>>> + * previously cleared ptes. (We shouldn't see valid entries being set
>>> + * until after the tlbi, at which point we are no longer in the
>>> + * intermediate state). Since markers are not valid, this is safe;
>>> + * set_ptes() will see the old, invalid entry and will not attempt to
>>> + * unfold. And the new pte is also invalid so it won't attempt to fold.
>>> + * We shouldn't see pte markers being set for the 'full' case anyway
>>> + * since the address space is being torn down.
>>> + *
>>> + * The last remaining issue is returning the access/dirty bits. That
>>> + * info could be present in any of the ptes in the contpte block.
>>> + * ptep_get() will gather those bits from across the contpte block (for
>>> + * the remaining valid entries). So below, if the pte we are clearing
>>> + * has dirty or young set, we need to stash it into a pte that we are
>>> + * yet to clear. This allows future calls to return the correct state
>>> + * even when the info was stored in a different pte. Since the core-mm
>>> + * calls from low to high address, we prefer to stash in the last pte of
>>> + * the contpte block - this means we are not "dragging" the bits up
>>> + * through all ptes and increases the chances that we can exit early
>>> + * because a given pte will have neither dirty nor young set.
>>> + */
>>> +
>>> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>>> + bool dirty = pte_dirty(orig_pte);
>>> + bool young = pte_young(orig_pte);
>>> + pte_t *start;
>>> +
>>> + if (!dirty && !young)
>>> + return contpte_ptep_get(ptep, orig_pte);
>>
>> I don't think we need to do this. If the PTE is !dirty && !young we can
>> just return it. As you say we have to assume HW can set those flags at
>> any time anyway so it doesn't get us much. This means in the common case
>> we should only run through the loop setting the dirty/young flags once
>> which should allay the performance concerns.
>
> I don't follow your logic. This is precisely the problem I was trying to solve
> vs my original (simple) attempt - we want to always report the correct
> access/dirty info. If we read one of the PTEs and neither access nor dirty are
> set, that doesn't mean it's old and clean, it just means that that info is
> definitely not stored in this PTE - we need to check the others. (when the
> contiguous bit is set, the HW will only update the access/dirty bits for 1 of
> the PTEs in the contpte block).

So my concern wasn't about incorrectly returning a !young && !dirty PTE
when the CONT_PTE block was *previously* clean/old (ie. the first
ptep_get/ptep_get_and_clear_full returned clean/old) because we have to
tolerate that anyway due to HW being able to set those bits. Rather my
concern was ptep_get_and_clear_full() could implicitly clear dirty/young
bits - ie. ptep_get_and_clear_full() could return a dirty/young PTE but
the next call would not.

That's because, regardless of what we do here, it is just a matter of
timing if we have to assume other HW threads can set these bits at any
time. There is nothing stopping HW from doing that just after we read
them in that loop, so a block can always become dirty/young at any time.
However it shouldn't become !dirty/!young without explicit SW
intervention.

But this is all a bit of a moot point due to the discussion below.

> Also, IIRC, the core-mm sets the access bit when initially setting up the
> mapping, so it's not guaranteed that all but one of the PTEs in the contpte block
> have (!dirty && !young).
>
>>
>> However I am now wondering if we're doing the wrong thing trying to hide
>> this down in the arch layer anyway. Perhaps it would be better to deal
>> with this in the core-mm code after all.
>>
>> So how about having ptep_get_and_clear_full() clearing the PTEs for the
>> entire cont block? We know by definition all PTEs should be pointing to
>> the same folio anyway, and it seems at least zap_pte_range() would cope
>> with this just fine because subsequent iterations would just see
>> pte_none() and continue the loop. I haven't checked the other call sites
>> though, but in principle I don't see why we couldn't define
>> ptep_get_and_clear_full() as being something that clears all PTEs
>> mapping a given folio (although it might need renaming).
>
> Ahha! Yes, I've been working on a solution like this since Barry raised it
> yesterday. I have a working version that seems to perform well. I wouldn't want
> to just clear all the PTEs in the block inside ptep_get_and_clear_full() because,
> although it might work today, it's fragile in the same way that my v2 version is.

Yes, agree a new helper would be needed.

> Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
> a number of ptes to clear (like set_ptes()). It returns the PTE read from the
> *first* slot, but with the access/dirty bits being accumulated from all of the
> ptes in the requested batch. Then zap_pte_range() is reworked to find
> appropriate batches (similar to how I've reworked for ptep_set_wrprotects()).
>
> I was trying to avoid introducing new helpers, but I think this is the most
> robust approach, and looks slightly more performant too, at first sight. It also
> addresses cases where full=0, which Barry says are important for madvise(DONTNEED).

I strongly agree with this approach now, especially if it is equally (or
more!) performant. I get why you didn't want to introduce new helpers,
but I think doing so was making things too subtle, so I would like to see
this.

>>
>> This does assume you don't need to partially unmap a page in
>> zap_pte_range (ie. end >= folio), but we're already making that
>> assumption.
>
> That's fine for full=1. But we can't make that assumption for full=0. If a VMA
> gets split for a reason that doesn't require re-setting the PTEs then a contpte
> block could straddle 2 VMAs. But the solution I describe above is robust to that.
>
> I'll finish gathering perf data then post for all 3 approaches; v2 as originally
> posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully later today.

Thanks!

>>
>>> +
>>> + start = contpte_align_down(ptep);
>>> + ptep = start + CONT_PTES - 1;
>>> +
>>> + for (; ptep >= start; ptep--) {
>>> + pte_t pte = __ptep_get(ptep);
>>> +
>>> + if (!pte_valid(pte))
>>> + continue;
>>> +
>>> + if (dirty)
>>> + pte = pte_mkdirty(pte);
>>> +
>>> + if (young)
>>> + pte = pte_mkyoung(pte);
>>> +
>>> + __ptep_set_access_flags_notlbi(ptep, pte);
>>> + return contpte_ptep_get(ptep, orig_pte);
>>> + }
>>> +
>>> + return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>>> +
>>> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>> index d63f3a0a7251..b22216a8153c 100644
>>> --- a/arch/arm64/mm/fault.c
>>> +++ b/arch/arm64/mm/fault.c
>>> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
>>> pr_cont("\n");
>>> }
>>>
>>> -/*
>>> - * This function sets the access flags (dirty, accessed), as well as write
>>> - * permission, and only to a more permissive setting.
>>> - *
>>> - * It needs to cope with hardware update of the accessed/dirty state by other
>>> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
>>> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>>> - *
>>> - * Returns whether or not the PTE actually changed.
>>> - */
>>> -int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> - unsigned long address, pte_t *ptep,
>>> - pte_t entry, int dirty)
>>> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
>>> {
>>> pteval_t old_pteval, pteval;
>>> pte_t pte = __ptep_get(ptep);
>>> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
>>> } while (pteval != old_pteval);
>>>
>>> + return 1;
>>> +}
>>> +
>>> +/*
>>> + * This function sets the access flags (dirty, accessed), as well as write
>>> + * permission, and only to a more permissive setting.
>>> + *
>>> + * It needs to cope with hardware update of the accessed/dirty state by other
>>> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
>>> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>>> + *
>>> + * Returns whether or not the PTE actually changed.
>>> + */
>>> +int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> + unsigned long address, pte_t *ptep,
>>> + pte_t entry, int dirty)
>>> +{
>>> + int changed = __ptep_set_access_flags_notlbi(ptep, entry);
>>> +
>>> /* Invalidate a stale read-only entry */
>>> - if (dirty)
>>> + if (changed && dirty)
>>> flush_tlb_page(vma, address);
>>> - return 1;
>>> +
>>> + return changed;
>>> }
>>>
>>> static bool is_el1_instruction_abort(unsigned long esr)
>>> --8<--
>>

2023-12-04 09:39:38

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

On 03/12/2023 23:20, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> On 30/11/2023 05:07, Alistair Popple wrote:
>>>
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>>>> the race, but intuition says it would be expensive.
>>>>>>
>>>>>> So the issue as I understand it is subsequent iterations would see a
>>>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>>>> all the PTEs?
>>>>>
>>>>> The raciness I'm describing is the race between reading access/dirty from one
>>>>> pte and applying it to another. But yes, I like your suggestion. If we do:
>>>>>
>>>>> pte = __ptep_get_and_clear_full(ptep)
>>>>>
>>>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>>>> manner. We can then loop from the current pte up towards the top of the block until
>>>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>>>> future callers clearing in an arbitrary order). Then atomically accumulate the
>>>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>>>> cmpxchg loop - there are already examples of how to do that correctly when
>>>>> racing the TLB.
>>>>>
>>>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>>>> we would end up reading all ptes and determine we are the last one.
>>>>>
>>>>> What do you think?
>>>>
>>>> OK here is an attempt at something which solves the fragility. I think this is
>>>> now robust and will always return the correct access/dirty state from
>>>> ptep_get_and_clear_full() and ptep_get().
>>>>
>>>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>>>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>>>> bits from across the contpte block - which requires reading each pte in the
>>>> contpte block. So it's O(n^2) in that sense. I'll benchmark it and report back.
>>>>
>>>> Was this the type of thing you were thinking of, Alistair?
>>>
>>> Yes, that is along the lines of what I was thinking. However I have
>>> added a couple of comments inline.
>>>
>>>> --8<--
>>>> arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>>>> arch/arm64/mm/contpte.c | 81 ++++++++++++++++++++++++++++++++
>>>> arch/arm64/mm/fault.c | 38 +++++++++------
>>>> 3 files changed, 125 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 9bd2f57a9e11..6c295d277784 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>>>> return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>>>> }
>>>>
>>>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>>>> extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>>>> unsigned long address, pte_t *ptep,
>>>> pte_t entry, int dirty);
>>>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>>> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>>> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> pte_t *ptep, pte_t pte, unsigned int nr);
>>>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>>> + unsigned long addr, pte_t *ptep);
>>>> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> unsigned long addr, pte_t *ptep);
>>>> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>>>> __pte_clear(mm, addr, ptep);
>>>> }
>>>>
>>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>>>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>>> + unsigned long addr, pte_t *ptep, int full)
>>>> +{
>>>> + pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> + if (!pte_valid_cont(orig_pte))
>>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> + if (!full) {
>>>> + contpte_try_unfold(mm, addr, ptep, orig_pte);
>>>> + return __ptep_get_and_clear(mm, addr, ptep);
>>>> + }
>>>> +
>>>> + return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>>>> +}
>>>> +
>>>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>> unsigned long addr, pte_t *ptep)
>>>> {
>>>> - contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> - return __ptep_get_and_clear(mm, addr, ptep);
>>>> + return ptep_get_and_clear_full(mm, addr, ptep, 0);
>>>> }
>>>>
>>>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index 2a57df16bf58..99b211118d93 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>> pte = __ptep_get(ptep);
>>>>
>>>> + /*
>>>> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
>>>> + * where some of the ptes in the range may be cleared but others
>>>> + * are still to do. See contpte_ptep_get_and_clear_full().
>>>> + */
>>>> + if (!pte_valid(pte))
>>>> + continue;
>>>> +
>>>> if (pte_dirty(pte))
>>>> orig_pte = pte_mkdirty(orig_pte);
>>>>
>>>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> }
>>>> EXPORT_SYMBOL(contpte_set_ptes);
>>>>
>>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>>> + unsigned long addr, pte_t *ptep)
>>>> +{
>>>> + /*
>>>> + * When doing a full address space teardown, we can avoid unfolding the
>>>> + * contiguous range, and therefore avoid the associated tlbi. Instead,
>>>> + * just get and clear the pte. The caller is promising to call us for
>>>> + * every pte, so every pte in the range will be cleared by the time the
>>>> + * final tlbi is issued.
>>>> + *
>>>> + * This approach requires some complex hoop jumping though, as for the
>>>> + * duration between returning from the first call to
>>>> + * ptep_get_and_clear_full() and making the final call, the contpte
>>>> + * block is in an intermediate state, where some ptes are cleared and
>>>> + * others are still set with the PTE_CONT bit. If any other APIs are
>>>> + * called for the ptes in the contpte block during that time, we have to
>>>> + * be very careful. The core code currently interleaves calls to
>>>> + * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>>>> + * careful to ignore the cleared entries when accumulating the access
>>>> + * and dirty bits - the same goes for ptep_get_lockless(). The only
>>>> + * other calls we might reasonably expect are to set markers in the
>>>> + * previously cleared ptes. (We shouldn't see valid entries being set
>>>> + * until after the tlbi, at which point we are no longer in the
>>>> + * intermediate state). Since markers are not valid, this is safe;
>>>> + * set_ptes() will see the old, invalid entry and will not attempt to
>>>> + * unfold. And the new pte is also invalid so it won't attempt to fold.
>>>> + * We shouldn't see pte markers being set for the 'full' case anyway
>>>> + * since the address space is being torn down.
>>>> + *
>>>> + * The last remaining issue is returning the access/dirty bits. That
>>>> + * info could be present in any of the ptes in the contpte block.
>>>> + * ptep_get() will gather those bits from across the contpte block (for
>>>> + * the remaining valid entries). So below, if the pte we are clearing
>>>> + * has dirty or young set, we need to stash it into a pte that we are
>>>> + * yet to clear. This allows future calls to return the correct state
>>>> + * even when the info was stored in a different pte. Since the core-mm
>>>> + * calls from low to high address, we prefer to stash in the last pte of
>>>> + * the contpte block - this means we are not "dragging" the bits up
>>>> + * through all ptes and increases the chances that we can exit early
>>>> + * because a given pte will have neither dirty nor young set.
>>>> + */
>>>> +
>>>> + pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>>>> + bool dirty = pte_dirty(orig_pte);
>>>> + bool young = pte_young(orig_pte);
>>>> + pte_t *start;
>>>> +
>>>> + if (!dirty && !young)
>>>> + return contpte_ptep_get(ptep, orig_pte);
>>>
>>> I don't think we need to do this. If the PTE is !dirty && !young we can
>>> just return it. As you say we have to assume HW can set those flags at
>>> any time anyway so it doesn't get us much. This means in the common case
>>> we should only run through the loop setting the dirty/young flags once
>>> which should allay the performance concerns.
>>
>> I don't follow your logic. This is precisely the problem I was trying to solve
>> vs my original (simple) attempt - we want to always report the correct
>> access/dirty info. If we read one of the PTEs and neither access nor dirty are
>> set, that doesn't mean it's old and clean, it just means that that info is
>> definitely not stored in this PTE - we need to check the others. (when the
>> contiguous bit is set, the HW will only update the access/dirty bits for 1 of
>> the PTEs in the contpte block).
>
> So my concern wasn't about incorrectly returning a !young && !dirty PTE
> when the CONT_PTE block was *previously* clean/old (ie. the first
> ptep_get/ptep_get_and_clear_full returned clean/old) because we have to
> tolerate that anyway due to HW being able to set those bits. Rather my
> concern was ptep_get_and_clear_full() could implicitly clear dirty/young
> bits - ie. ptep_get_and_clear_full() could return a dirty/young PTE but
> the next call would not.
>
> That's because, regardless of what we do here, it is just a matter of
> timing if we have to assume other HW threads can set these bits at any
> time. There is nothing stopping HW from doing that just after we read
> them in that loop, so a block can always become dirty/young at any time.
> However it shouldn't become !dirty/!young without explicit SW
> intervention.
>
> But this is all a bit of a moot point due to the discussion below.
>
>> Also, IIRC, the core-mm sets the access bit when initially setting up the
>> mapping, so it's not guaranteed that all but one of the PTEs in the contpte block
>> have (!dirty && !young).
>>
>>>
>>> However I am now wondering if we're doing the wrong thing trying to hide
>>> this down in the arch layer anyway. Perhaps it would be better to deal
>>> with this in the core-mm code after all.
>>>
>>> So how about having ptep_get_and_clear_full() clearing the PTEs for the
>>> entire cont block? We know by definition all PTEs should be pointing to
>>> the same folio anyway, and it seems at least zap_pte_range() would cope
>>> with this just fine because subsequent iterations would just see
>>> pte_none() and continue the loop. I haven't checked the other call sites
>>> though, but in principle I don't see why we couldn't define
>>> ptep_get_and_clear_full() as being something that clears all PTEs
>>> mapping a given folio (although it might need renaming).
>>
>> Ahha! Yes, I've been working on a solution like this since Barry raised it
>> yesterday. I have a working version that seems to perform well. I wouldn't want
>> to just clear all the PTEs in the block inside ptep_get_and_clear_full() because,
>> although it might work today, it's fragile in the same way that my v2 version is.
>
> Yes, agree a new helper would be needed.
>
>> Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
>> a number of ptes to clear (like set_ptes()). It returns the PTE read from the
>> *first* slot, but with the access/dirty bits being accumulated from all of the
>> ptes in the requested batch. Then zap_pte_range() is reworked to find
>> appropriate batches (similar to how I've reworked for ptep_set_wrprotects()).
>>
>> I was trying to avoid introducing new helpers, but I think this is the most
>> robust approach, and looks slightly more performant too, at first sight. It also
>> addresses cases where full=0, which Barry says are important for madvise(DONTNEED).
>
> I strongly agree with this approach now, especially if it is equally (or
> more!) performant. I get why you didn't want to introduce new helpers,
> but I think doing so was making things too subtle, so I would like to see
> this.
>
>>>
>>> This does assume you don't need to partially unmap a page in
>>> zap_pte_range (ie. end >= folio), but we're already making that
>>> assumption.
>>
>> That's fine for full=1. But we can't make that assumption for full=0. If a VMA
>> gets split for a reason that doesn't require re-setting the PTEs then a contpte
>> block could straddle 2 VMAs. But the solution I describe above is robust to that.
>>
>> I'll finish gathering perf data then post for all 3 approaches; v2 as originally
>> posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully later today.
>
> Thanks!

From the commit log of the new version, which I'll hopefully post later today:

The following shows the results of running a kernel compilation workload
and measuring the cost of arm64_sys_exit_group() (which at ~1.5% is a
very small part of the overall workload).

Benchmarks were run on Ampere Altra in 2 configs; single numa node and 2
numa nodes (tlbis are more expensive in 2 node config).

- baseline: v6.7-rc1 + anonfolio-v7
- no-opt: contpte series without any attempt to optimize exit()
- simple-ptep_get_clear_full: simple optimization to exploit full=1.
  ptep_get_clear_full() does not fully conform to its intended semantics
- robust-ptep_get_clear_full: similar to the previous, but
  ptep_get_clear_full() fully conforms to its intended semantics
- clear_ptes: optimization implemented by this patch

| config | numa=1 | numa=2 |
|----------------------------|--------|--------|
| baseline | 0% | 0% |
| no-opt | 190% | 768% |
| simple-ptep_get_clear_full | 8% | 29% |
| robust-ptep_get_clear_full | 21% | 19% |
| clear_ptes | 13% | 9% |

In all cases, the cost of arm64_sys_exit_group() increases; this is
anticipated because there is more work to do to tear down the page
tables. But clear_ptes() gives the smallest increase overall.

Note that "simple-ptep_get_clear_full" is the version I posted with v2.
"robust-ptep_get_clear_full" is the version I tried as part of this
conversation. And "clear_ptes" is the batched version that I think we all now
prefer (and plan to post as part of v3).

Thanks,
Ryan